Open y-shida-tg opened 1 month ago
The only reason this would happen is if your plugin on the node isn't actually pointing to this config. Did you launch the plugin pointing to this config map and then update the label on the node to point to the particular time-slicing
config within that config map?
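One way to check both halves of that (a sketch; the node and namespace names are taken from this thread and may differ in your cluster):

```shell
# 1) Confirm the label that selects the time-slicing config on the node
kubectl get node onp1-4-r750 -o yaml | grep device-plugin.config

# 2) Confirm the plugin pods that helm deployed, and what they logged at startup
kubectl get pods -n nvidia-device-plugin
kubectl logs -n nvidia-device-plugin -l app.kubernetes.io/name=nvidia-device-plugin --all-containers | grep -i config
```

If the logs never mention config0, the plugin that is actually serving the node is not the helm-managed one that was pointed at the config map.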
Thank you for your reply.
I created dp-example-config0.yaml and dp-example-config1.yaml, applied the config with helm upgrade -i nvdp nvdp/nvidia-device-plugin, and then started the device plugin with kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.16.1/deployments/static/nvidia-device-plugin.yml. However, the node still reports only 1 GPU:
Capacity:
  nvidia.com/gpu: 1
Allocatable:
  nvidia.com/gpu: 1
The following config label is set on the node:
  nvidia.com/device-plugin.config=config0
Are there any other items to check in the device plugin, or settings that need to be configured on the node side? So far, the node has only been modified to make the device plugin work with cri-o.
Detailed file contents and commands are shown below.
## dp-example-config0.yaml
version: v1
flags:
  migStrategy: "none"
  failOnInitError: true
  nvidiaDriverRoot: "/"
  plugin:
    passDeviceSpecs: false
    deviceListStrategy: envvar
    deviceIDStrategy: uuid
sharing:
  timeSlicing:
    resources:
    - name: nvidia.com/gpu
      replicas: 4
## dp-example-config1.yaml
version: v1
flags:
  migStrategy: "mixed" # Only change from config0.yaml
  failOnInitError: true
  nvidiaDriverRoot: "/"
  plugin:
    passDeviceSpecs: false
    deviceListStrategy: envvar
    deviceIDStrategy: uuid
# helm upgrade -i nvdp nvdp/nvidia-device-plugin \
    --version=0.16.1 \
    --namespace nvidia-device-plugin \
    --create-namespace \
    --set config.default=config0 \
    --set-file config.map.config0=dp-example-config0.yaml \
    --set-file config.map.config1=dp-example-config1.yaml
# kubectl label nodes onp1-4-r750 --overwrite \
    nvidia.com/device-plugin.config=config0
I'm confused by this step that you reference:
and then started the device plugin with kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.16.1/deployments/static/nvidia-device-plugin.yml
The helm install/upgrade command already starts the device plugin configured to be aware of the configs you point it at. The static deployment from the URL you reference is not aware of these configs and would require a substantial amount of additional code to make it aware of them (which is why helm is the preferred installation method for the plugin).
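A concrete way to act on this, assuming the goal is to keep only the helm-managed plugin (a sketch; the node name and URL are taken from this thread):

```shell
# Remove the static deployment so it does not race with the helm-managed plugin
kubectl delete -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.16.1/deployments/static/nvidia-device-plugin.yml

# Then re-check what the node advertises once the helm-managed pod is running
kubectl describe node onp1-4-r750 | grep nvidia.com/gpu
```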
Following the documentation, I then tried to proceed with helm operations only, but the node information came out as follows, and the nvidia-device-plugin was not running:
Capacity:
  nvidia.com/gpu: 0
Allocatable:
  nvidia.com/gpu: 0
I believe the issue is that the nvidia-device-plugin does not start via the helm operations alone. Is there anything else I should check?
Below are the command and configuration details.
## Contents of the config files
# cat dp-example-config0.yaml
version: v1
flags:
  migStrategy: "none"
  failOnInitError: true
  nvidiaDriverRoot: "/"
  plugin:
    passDeviceSpecs: false
    deviceListStrategy: envvar
    deviceIDStrategy: uuid
sharing:
  timeSlicing:
    resources:
    - name: nvidia.com/gpu
      replicas: 4
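For reference, with this config the plugin advertises each physical GPU as four shareable replicas, so a node with one A100 should report a time-sliced capacity of physical GPUs times replicas:

```shell
# Expected advertised capacity under time-slicing: physical GPUs x replicas
PHYSICAL_GPUS=1   # one A100 in this node
REPLICAS=4        # replicas: 4 in the timeSlicing config above
echo "expected nvidia.com/gpu: $((PHYSICAL_GPUS * REPLICAS))"
```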
## Static DaemonSet manifest (nvidia-device-plugin.yml) applied when using dp-example-config0.yaml from the repository
-----------------------------------------------------------------------
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      # Mark this pod as a critical add-on; when enabled, the critical add-on
      # scheduler reserves resources for critical add-on pods so that they can
      # be rescheduled after a failure.
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      priorityClassName: "system-node-critical"
      containers:
      - image: nvcr.io/nvidia/k8s-device-plugin:v0.16.1
        name: nvidia-device-plugin-ctr
        env:
          - name: FAIL_ON_INIT_ERROR
            value: "false"
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins
# cat dp-example-config1.yaml
version: v1
flags:
  migStrategy: "mixed" # Only change from config0.yaml
  failOnInitError: true
  nvidiaDriverRoot: "/"
  plugin:
    passDeviceSpecs: false
    deviceListStrategy: envvar
    deviceIDStrategy: uuid
## Applying the config files
# helm search repo nvdp --devel
NAME CHART VERSION APP VERSION DESCRIPTION
nvdp/gpu-feature-discovery 0.16.2 0.16.2 A Helm chart for gpu-feature-discovery on Kuber...
nvdp/nvidia-device-plugin 0.16.2 0.16.2 A Helm chart for the nvidia-device-plugin on Ku...
# helm upgrade -i nvdp nvdp/nvidia-device-plugin \
--version=0.16.2 \
--namespace nvidia-device-plugin \
--create-namespace \
--set config.default=config0 \
--set-file config.map.config0=dp-example-config0.yaml \
--set-file config.map.config1=dp-example-config1.yaml
# kubectl label nodes onp1-4-r750 --overwrite \
nvidia.com/device-plugin.config=config0
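At this point the chart should already have created a plugin DaemonSet; a quick sanity check that its pods exist (a sketch, using the namespace from the commands above):

```shell
# The plugin pods should be listed here; if not, the DaemonSet failed to schedule
kubectl get all -n nvidia-device-plugin
```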
## Checking the contents of the config map
# kubectl describe configmaps -n nvidia-device-plugin
Name: kube-root-ca.crt
Namespace: nvidia-device-plugin
Labels: <none>
Annotations: kubernetes.io/description:
Contains a CA bundle that can be used to verify the kube-apiserver when using internal endpoints such as the internal service IP or kubern...
Data
====
ca.crt:
----
-----BEGIN CERTIFICATE-----
MIIC/jCCAeagAwIBAgIBADANBgkqhkiG9w0BAQsFADAVMRMwEQYDVQQDEwprdWJl
cm5ldGVzMB4XDTI0MDYxODA4NDQwNFoXDTM0MDYxNjA4NDQwNFowFTETMBEGA1UE
AxMKa3ViZXJuZXRlczCCASIwDQYJKoZIhvcNAQEBBQADggEPADCCAQoCggEBAN6a
txp2J29lwgQ7eEiQ+h2DOXYecFcnodeyXt0jTXy2YacPh7kvt3alZ7bm+NIuDhkt
2dAnx7qJQRSnnM5xEP6bliHjkqVRMDyQf5BqgfLyKf2+usuYyas3dAevtKqI0qFP
5MnoHhUI2z+T5xleCguWxdsl39kQErD8WjWmQ2tR2a1JQOvUE/8QBo4tP0peyBFE
BwurzgDwFuaVRjrzREBL1BCzdQbG3XtGCiEyMvcgm2yO1kNcjYibqK5kc5R/zQ31
p/yJRPs4tcQEcRlh62S9HgghhYpQQb1whVaK7mZP3BJ3a+ku7Dp1E8+rnNkVtRgO
icItv/Esv57OBX9MNwkCAwEAAaNZMFcwDgYDVR0PAQH/BAQDAgKkMA8GA1UdEwEB
/wQFMAMBAf8wHQYDVR0OBBYEFLol1Lsh1L1n76Nz1uay7TkdCYgnMBUGA1UdEQQO
MAyCCmt1YmVybmV0ZXMwDQYJKoZIhvcNAQELBQADggEBAJz3AlS8e8CoyFxoBp3j
b/sbgeL6DXNfOPafOPUvMJrOfTw4ZhXuHmB2kY/dws9hPxSuiVO1Z3woymeYGHrl
aIFy1f5d4XtTrsjKWkV9aqcw+UZ4Z4H2R73F8A5VrVAq9zUSre3J45H7QVdAYIdP
PUI+uvtg0o+IBKIYZo43uBjMsZm1h2zQe03+Bf8DOQd8WByb/VEWM4/blYLwiMs7
4pvImNdTJChSrL3tbelM/X2M78RYXYXNZqkGw0iIRS07Tv9B688Xx8dUhs5WxjZU
9Ge7VFxK+W8lMjo0V3EFHhbYnS0LwMhuMpAryBpd3tcnktOVBh2lPZO2g6WseOVB
RNI=
-----END CERTIFICATE-----
BinaryData
====
Events: <none>
Name: nvdp-nvidia-device-plugin-configs
Namespace: nvidia-device-plugin
Labels: app.kubernetes.io/instance=nvdp
app.kubernetes.io/managed-by=Helm
app.kubernetes.io/name=nvidia-device-plugin
app.kubernetes.io/version=0.16.1
helm.sh/chart=nvidia-device-plugin-0.16.1
Annotations: meta.helm.sh/release-name: nvdp
meta.helm.sh/release-namespace: nvidia-device-plugin
Data
====
config0:
----
version: v1
flags:
  migStrategy: "none"
  failOnInitError: true
  nvidiaDriverRoot: "/"
  plugin:
    passDeviceSpecs: false
    deviceListStrategy: envvar
    deviceIDStrategy: uuid
sharing:
  timeSlicing:
    resources:
    - name: nvidia.com/gpu
      replicas: 4
config1:
----
version: v1
flags:
  migStrategy: "mixed" # Only change from config0.yaml
  failOnInitError: true
  nvidiaDriverRoot: "/"
  plugin:
    passDeviceSpecs: false
    deviceListStrategy: envvar
    deviceIDStrategy: uuid
BinaryData
====
Events: <none>
## Node status (without nvidia-device-plugin-daemonset)
# kubectl describe nodes onp1-4-r750
Name: onp1-4-r750
Roles: <none>
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/os=linux
kubernetes.io/arch=amd64
kubernetes.io/hostname=onp1-4-r750
kubernetes.io/os=linux
nvidia.com/device-plugin.config=config0
Annotations: kubeadm.alpha.kubernetes.io/cri-socket: /var/run/crio/crio.sock
node.alpha.kubernetes.io/ttl: 0
projectcalico.org/IPv4IPIPTunnelAddr: 10.244.84.64
projectcalico.org/IPv6Address: fc00:a000::14/64
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Tue, 18 Jun 2024 18:00:26 +0900
Taints: <none>
Unschedulable: false
Lease:
HolderIdentity: onp1-4-r750
AcquireTime: <unset>
RenewTime: Tue, 15 Oct 2024 18:11:08 +0900
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
NetworkUnavailable False Wed, 09 Oct 2024 09:41:27 +0900 Wed, 09 Oct 2024 09:41:27 +0900 CalicoIsUp Calico is running on this node
MemoryPressure False Tue, 15 Oct 2024 18:09:32 +0900 Fri, 30 Aug 2024 07:40:34 +0900 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Tue, 15 Oct 2024 18:09:32 +0900 Fri, 30 Aug 2024 07:40:34 +0900 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Tue, 15 Oct 2024 18:09:32 +0900 Fri, 30 Aug 2024 07:40:34 +0900 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Tue, 15 Oct 2024 18:09:32 +0900 Fri, 30 Aug 2024 07:40:34 +0900 KubeletReady kubelet is posting ready status
Addresses:
InternalIP: fc00:a000::14
Hostname: onp1-4-r750
Capacity:
cpu: 112
ephemeral-storage: 2737838616Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 395422092Ki
nvidia.com/gpu: 0
pods: 110
Allocatable:
cpu: 112
ephemeral-storage: 2523192064328
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 395319692Ki
nvidia.com/gpu: 0
pods: 110
System Info:
Machine ID: d4e91833fac54bb0b9458e38819fdf2b
System UUID: 4c4c4544-0046-5110-8051-c3c04f395633
Boot ID: 6de46ba4-46ee-4413-8fde-74cf7ff5473d
Kernel Version: 5.10.57
OS Image: CentOS Linux 8
Operating System: linux
Architecture: amd64
Container Runtime Version: cri-o://1.23.5
Kubelet Version: v1.23.6
Kube-Proxy Version: v1.23.6
PodCIDR: 1100:0:0:1::/64
PodCIDRs: 1100:0:0:1::/64,10.244.1.0/24
Non-terminated Pods: (3 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits Age
--------- ---- ------------ ---------- --------------- ------------- ---
kube-system calico-node-m8xcl 250m (0%) 0 (0%) 0 (0%) 0 (0%) 118d
kube-system kube-multus-ds-cps4h 100m (0%) 100m (0%) 50Mi (0%) 50Mi (0%) 118d
kube-system kube-proxy-zhwt4 0 (0%) 0 (0%) 0 (0%) 0 (0%) 119d
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 350m (0%) 100m (0%)
memory 50Mi (0%) 50Mi (0%)
ephemeral-storage 0 (0%) 0 (0%)
hugepages-1Gi 0 (0%) 0 (0%)
hugepages-2Mi 0 (0%) 0 (0%)
nvidia.com/gpu 0 0
Events: <none>
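Since the capacity here stays at 0 and no plugin pod appears among the node's non-terminated pods, the next step would be to ask the DaemonSet why. A sketch of the checks; the DaemonSet name is an assumption based on the release name nvdp, matching the config map name above:

```shell
# Desired vs. ready pod counts for the helm-managed DaemonSet
kubectl get daemonset -n nvidia-device-plugin

# Scheduling and startup failures show up in the describe output and events
kubectl describe daemonset -n nvidia-device-plugin nvdp-nvidia-device-plugin
kubectl get events -n nvidia-device-plugin --sort-by=.lastTimestamp
```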
## Node status (after nvidia-device-plugin-daemonset launch)
# kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.16.1/deployments/static/nvidia-device-plugin.yml
# kubectl describe nodes onp1-4-r750
Name: onp1-4-r750
Roles: <none>
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/os=linux
kubernetes.io/arch=amd64
kubernetes.io/hostname=onp1-4-r750
kubernetes.io/os=linux
nvidia.com/device-plugin.config=config0
Annotations: kubeadm.alpha.kubernetes.io/cri-socket: /var/run/crio/crio.sock
node.alpha.kubernetes.io/ttl: 0
projectcalico.org/IPv4IPIPTunnelAddr: 10.244.84.64
projectcalico.org/IPv6Address: fc00:a000::14/64
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Tue, 18 Jun 2024 18:00:26 +0900
Taints: <none>
Unschedulable: false
Lease:
HolderIdentity: onp1-4-r750
AcquireTime: <unset>
RenewTime: Tue, 15 Oct 2024 18:12:09 +0900
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
NetworkUnavailable False Wed, 09 Oct 2024 09:41:27 +0900 Wed, 09 Oct 2024 09:41:27 +0900 CalicoIsUp Calico is running on this node
MemoryPressure False Tue, 15 Oct 2024 18:12:06 +0900 Fri, 30 Aug 2024 07:40:34 +0900 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Tue, 15 Oct 2024 18:12:06 +0900 Fri, 30 Aug 2024 07:40:34 +0900 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Tue, 15 Oct 2024 18:12:06 +0900 Fri, 30 Aug 2024 07:40:34 +0900 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Tue, 15 Oct 2024 18:12:06 +0900 Fri, 30 Aug 2024 07:40:34 +0900 KubeletReady kubelet is posting ready status
Addresses:
InternalIP: fc00:a000::14
Hostname: onp1-4-r750
Capacity:
cpu: 112
ephemeral-storage: 2737838616Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 395422092Ki
nvidia.com/gpu: 1
pods: 110
Allocatable:
cpu: 112
ephemeral-storage: 2523192064328
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 395319692Ki
nvidia.com/gpu: 1
pods: 110
System Info:
Machine ID: d4e91833fac54bb0b9458e38819fdf2b
System UUID: 4c4c4544-0046-5110-8051-c3c04f395633
Boot ID: 6de46ba4-46ee-4413-8fde-74cf7ff5473d
Kernel Version: 5.10.57
OS Image: CentOS Linux 8
Operating System: linux
Architecture: amd64
Container Runtime Version: cri-o://1.23.5
Kubelet Version: v1.23.6
Kube-Proxy Version: v1.23.6
PodCIDR: 1100:0:0:1::/64
PodCIDRs: 1100:0:0:1::/64,10.244.1.0/24
Non-terminated Pods: (4 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits Age
--------- ---- ------------ ---------- --------------- ------------- ---
kube-system calico-node-m8xcl 250m (0%) 0 (0%) 0 (0%) 0 (0%) 118d
kube-system kube-multus-ds-cps4h 100m (0%) 100m (0%) 50Mi (0%) 50Mi (0%) 118d
kube-system kube-proxy-zhwt4 0 (0%) 0 (0%) 0 (0%) 0 (0%) 119d
kube-system nvidia-device-plugin-daemonset-drdv2 0 (0%) 0 (0%) 0 (0%) 0 (0%) 18s
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 350m (0%) 100m (0%)
memory 50Mi (0%) 50Mi (0%)
ephemeral-storage 0 (0%) 0 (0%)
hugepages-1Gi 0 (0%) 0 (0%)
hugepages-2Mi 0 (0%) 0 (0%)
nvidia.com/gpu 0 0
Events: <none>
We have kept trying since then but have not been able to resolve the issue. If there is anything else we should try, I would appreciate a reply.
Referring to "GitHub - NVIDIA/k8s-device-plugin: NVIDIA device plugin for Kubernetes", we deployed the NVIDIA device plugin for Kubernetes and are trying out time slicing, but we are running into an issue. Specifically, the node reports a GPU capacity of only 1 instead of the 4 expected from replicas: 4 in the YAML. What could be the reason the advertised capacity is not increasing?
times.yaml
Hardware Information:
  Server: PowerEdge R750 (SKU=090E, ModelName=PowerEdge R750)
  CPU: Intel(R) Xeon(R) Gold 6330 CPU @ 2.00GHz
GPGPU Information:
  GPGPU: A100 80GB
  CUDA Version: 12.2
  Driver Version: 535.54.03
  nvidia-container-runtime: runc version 1.0.2, spec: 1.0.2-dev, go: go1.16.7, libseccomp: 2.5.1
Linux Information:
  OS: CentOS Linux release 8.5.2111
k8s environment:
  kubectl version:
    Client Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.6", GitCommit:"ad3338546da947756e8a88aa6822e9c11e7eac22", GitTreeState:"clean", BuildDate:"2022-04-14T08:49:13Z", GoVersion:"go1.17.9", Compiler:"gc", Platform:"linux/amd64"}
    Server Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.17", GitCommit:"953be8927218ec8067e1af2641e540238ffd7576", GitTreeState:"clean", BuildDate:"2023-02-22T13:27:46Z", GoVersion:"go1.19.6", Compiler:"gc", Platform:"linux/amd64"}
  crio version: 1.23.5
NVIDIA device plugin for Kubernetes version used: v0.16.1
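One environment-specific item worth checking, given the note above that the node was only modified for cri-o compatibility: the plugin pod can only enumerate the GPU if cri-o hands containers to nvidia-container-runtime. A minimal sketch of such a drop-in config; the file path, runtime name, and binary path are assumptions to adapt to this install:

```toml
# /etc/crio/crio.conf.d/99-nvidia.conf (illustrative; restart crio after adding)
[crio.runtime]
default_runtime = "nvidia"

[crio.runtime.runtimes.nvidia]
runtime_path = "/usr/bin/nvidia-container-runtime"
```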