The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.
Important Note: NVIDIA AI Enterprise customers can get support from NVIDIA Enterprise Support. Please open a case through the NVIDIA Enterprise Support portal.
If a pod/ds is in an error state or pending state, kubectl describe pod -n OPERATOR_NAMESPACE POD_NAME:

kubectl describe pod nvidia-device-plugin-daemonset-26gpb -n operators
Name: nvidia-device-plugin-daemonset-26gpb
Namespace: operators
Priority: 2000001000
Priority Class Name: system-node-critical
Runtime Class Name: nvidia
Service Account: nvidia-device-plugin
Node: 10.23.29.206/10.23.29.206
Start Time: Thu, 20 Jun 2024 13:00:51 +0300
Labels: app=nvidia-device-plugin-daemonset
app.kubernetes.io/managed-by=gpu-operator
controller-revision-hash=67b7bb9bb
helm.sh/chart=gpu-operator-v24.3.0
pod-template-generation=22
Annotations: <none>
Status: Running
IP: 172.31.4.172
IPs:
IP: 172.31.4.172
Controlled By: DaemonSet/nvidia-device-plugin-daemonset
Init Containers:
toolkit-validation:
Container ID: containerd://e740b47deae826518e8a175ac1cd6da46e357266157087b45dfee218afcb3809
Image: nvcr.io/nvidia/cloud-native/gpu-operator-validator:v24.3.0
Image ID: nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:2edc1d4ed555830e70010c82558936198f5faa86fc29ecf5698219145102cfcc
Port: <none>
Host Port: <none>
Command:
sh
-c
Args:
until [ -f /run/nvidia/validations/toolkit-ready ]; do echo waiting for nvidia container stack to be setup; sleep 5; done
State: Terminated
Reason: Completed
Exit Code: 0
Started: Thu, 20 Jun 2024 13:00:52 +0300
Finished: Thu, 20 Jun 2024 13:00:52 +0300
Ready: True
Restart Count: 0
Environment: <none>
Mounts:
/run/nvidia from run-nvidia (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-bn4jn (ro)
Containers:
nvidia-device-plugin:
Container ID: containerd://62c3b528a6ab14f20d329fe9341efcc675178248b1ed4eadd2ca2038d1a77ad7
Image: nvcr.io/nvidia/k8s-device-plugin:v0.15.0-ubuntu22.04
Image ID: nvcr.io/nvidia/k8s-device-plugin@sha256:1aff0e9f0759758f87cb158d78241472af3a76cdc631f01ab395f997fa80f707
Port: <none>
Host Port: <none>
Command:
/bin/bash
-c
Args:
/bin/entrypoint.sh
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: StartError
Message: failed to create containerd task: failed to create shim task: Others("other error: container_linux.go:340: starting container process caused \"process_linux.go:380: container init caused \\"rootfs_linux.go:61: mounting \\\\"/run/nvidia/driver/usr/lib/x86_64-linux-gnu/libnvidia-egl-gbm.so.1.1.1\\\\" to rootfs \\\\"/run/containerd/io.containerd.runtime.v2.task/k8s.io/62c3b528a6ab14f20d329fe9341efcc675178248b1ed4eadd2ca2038d1a77ad7/rootfs\\\\" at \\\\"/run/containerd/io.containerd.runtime.v2.task/k8s.io/62c3b528a6ab14f20d329fe9341efcc675178248b1ed4eadd2ca2038d1a77ad7/rootfs/usr/lib/x86_64-linux-gnu/libnvidia-egl-gbm.so.1.1.1\\\\" caused \\\\"not a directory\\\\"\\"\""): unknown
Exit Code: 128
Started: Thu, 01 Jan 1970 03:00:00 +0300
Finished: Fri, 21 Jun 2024 13:11:32 +0300
Ready: False
Restart Count: 288
Environment:
PASS_DEVICE_SPECS: true
FAIL_ON_INIT_ERROR: true
DEVICE_LIST_STRATEGY: envvar
DEVICE_ID_STRATEGY: uuid
NVIDIA_VISIBLE_DEVICES: all
NVIDIA_DRIVER_CAPABILITIES: all
MPS_ROOT: /run/nvidia/mps
MIG_STRATEGY: mixed
NVIDIA_MIG_MONITOR_DEVICES: all
Mounts:
/bin/entrypoint.sh from nvidia-device-plugin-entrypoint (ro,path="entrypoint.sh")
/dev/shm from mps-shm (rw)
/host from host-root (ro)
/mps from mps-root (rw)
/run/nvidia from run-nvidia (rw)
/var/lib/kubelet/device-plugins from device-plugin (rw)
/var/run/cdi from cdi-root (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-bn4jn (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
nvidia-device-plugin-entrypoint:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: nvidia-device-plugin-entrypoint
Optional: false
device-plugin:
Type: HostPath (bare host directory volume)
Path: /var/lib/kubelet/device-plugins
HostPathType:
run-nvidia:
Type: HostPath (bare host directory volume)
Path: /run/nvidia
HostPathType: Directory
host-root:
Type: HostPath (bare host directory volume)
Path: /
HostPathType:
cdi-root:
Type: HostPath (bare host directory volume)
Path: /var/run/cdi
HostPathType: DirectoryOrCreate
mps-root:
Type: HostPath (bare host directory volume)
Path: /run/nvidia/mps
HostPathType: DirectoryOrCreate
mps-shm:
Type: HostPath (bare host directory volume)
Path: /run/nvidia/mps/shm
HostPathType:
kube-api-access-bn4jn:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: nvidia.com/gpu.deploy.device-plugin=true
Tolerations: gpu=dedicated:NoSchedule
node.kubernetes.io/disk-pressure:NoSchedule op=Exists
node.kubernetes.io/memory-pressure:NoSchedule op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists
node.kubernetes.io/pid-pressure:NoSchedule op=Exists
node.kubernetes.io/unreachable:NoExecute op=Exists
node.kubernetes.io/unschedulable:NoSchedule op=Exists
nvidia.com/gpu:NoSchedule op=Exists
Events:
Type Reason Age From Message
Normal Pulled 21m (x285 over 24h) kubelet Container image "nvcr.io/nvidia/k8s-device-plugin:v0.15.0-ubuntu22.04" already present on machine
Warning BackOffStart 98s (x6670 over 24h) kubelet Back-off restarting failed container nvidia-device-plugin in pod nvidia-device-plugin-daemonset-26gpb_operators(13f0751a-487d-4d4a-bc69-fa89fef819cf)
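The StartError above shows runc failing to bind-mount a driver library from /run/nvidia/driver into the container rootfs with "not a directory". A minimal on-node check, assuming shell access to the affected node (10.23.29.206); the paths are taken verbatim from the error message, and nothing below is specific to this cluster beyond that:

# Confirm the containerized driver root is mounted and inspect the mount source:
# the source must be a regular file (or a symlink to one) for the bind mount to succeed.
mount | grep /run/nvidia/driver
ls -l /run/nvidia/driver/usr/lib/x86_64-linux-gnu/ | grep libnvidia-egl-gbm
stat /run/nvidia/driver/usr/lib/x86_64-linux-gnu/libnvidia-egl-gbm.so.1.1.1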
If a pod/ds is in an error state or pending state, kubectl logs -n OPERATOR_NAMESPACE POD_NAME --all-containers:

kubectl logs -n operators nvidia-device-plugin-daemonset-26gpb --all-containers
libcontainer: container init failed to execcontainer_linux.go:340: starting container process caused "process_linux.go:380: container init caused \"rootfs_linux.go:61: mounting \\"/run/nvidia/driver/usr/lib/x86_64-linux-gnu/libnvidia-egl-gbm.so.1.1.1\\" to rootfs \\"/run/containerd/io.containerd.runtime.v2.task/k8s.io/62c3b528a6ab14f20d329fe9341efcc675178248b1ed4eadd2ca2038d1a77ad7/rootfs\\" at \\"/run/containerd/io.containerd.runtime.v2.task/k8s.io/62c3b528a6ab14f20d329fe9341efcc675178248b1ed4eadd2ca2038d1a77ad7/rootfs/usr/lib/x86_64-linux-gnu/libnvidia-egl-gbm.so.1.1.1\\" caused \\"not a directory\\"\""
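The container log confirms the same mount failure. Since the mount source is resolved from the driver root configured in the NVIDIA container toolkit, it can be worth confirming that configuration on the node. A sketch only: the config path below is the toolkit's default location and an operator-managed toolkit may keep its config elsewhere.

# Check which driver root the toolkit resolves libraries from (expected: /run/nvidia/driver).
grep root /etc/nvidia-container-runtime/config.toml

# The pod runs with RuntimeClass 'nvidia'; confirm containerd has a matching runtime entry.
containerd config dump | grep -B2 -A6 'nvidia'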
Output from running nvidia-smi from the driver container (kubectl exec DRIVER_POD_NAME -n OPERATOR_NAMESPACE -c nvidia-driver-ctr -- nvidia-smi):

nvidia-smi
Fri Jun 21 10:18:27 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15 Driver Version: 550.54.15 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA A100 80GB PCIe On | 00000000:00:0D.0 Off | On |
| N/A 36C P0 70W / 300W | 1MiB / 81920MiB | N/A Default |
| | | Enabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| MIG devices: |
+------------------+----------------------------------+-----------+-----------------------+
| GPU GI CI MIG | Memory-Usage | Vol| Shared |
| ID ID Dev | BAR1-Usage | SM Unc| CE ENC DEC OFA JPG |
| | | ECC| |
|==================+==================================+===========+=======================|
| 0 0 0 0 | 1MiB / 81221MiB | 98 0 | 7 0 5 1 1 |
| | 1MiB / 131072MiB | | |
+------------------+----------------------------------+-----------+-----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
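nvidia-smi above shows the A100 in MIG mode with a single MIG device, and the plugin is configured with MIG_STRATEGY: mixed, so once the plugin container starts the node should advertise MIG-profile resources. A couple of standard queries to verify that once the mount issue is fixed (node name and namespace are taken from the describe output; DRIVER_POD_NAME is a placeholder):

# List physical GPUs and MIG devices as seen by the driver container.
kubectl exec DRIVER_POD_NAME -n operators -c nvidia-driver-ctr -- nvidia-smi -L

# Check which nvidia.com/* resources the node actually advertises.
kubectl get node 10.23.29.206 -o jsonpath='{.status.allocatable}'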
1. Quick Debug Information
2. Issue or feature description
Briefly explain the issue in terms of expected behavior and current behavior.
Current behavior: the nvidia-device-plugin-daemonset pod is stuck in CrashLoopBackOff. The nvidia-device-plugin container fails with StartError: runc cannot bind-mount /run/nvidia/driver/usr/lib/x86_64-linux-gnu/libnvidia-egl-gbm.so.1.1.1 into the container rootfs ("not a directory"); see the kubectl describe and kubectl logs output above.
Expected behavior: the device plugin container starts and the node advertises its GPU/MIG resources.
3. Steps to reproduce the issue
Detailed steps to reproduce the issue.
4. Information to attach (optional if deemed irrelevant)
kubectl get pods -n OPERATOR_NAMESPACE
kubectl get ds -n OPERATOR_NAMESPACE
kubectl describe pod -n OPERATOR_NAMESPACE POD_NAME
kubectl logs -n OPERATOR_NAMESPACE POD_NAME --all-containers
Output from running nvidia-smi from the driver container: kubectl exec DRIVER_POD_NAME -n OPERATOR_NAMESPACE -c nvidia-driver-ctr -- nvidia-smi
journalctl -u containerd > containerd.log
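Beyond the full unit log, it can help to narrow the containerd log to the task-creation failures for this specific container (the container ID is taken from the describe output above; the flags are standard journalctl options):

# Collect only the last day of containerd logs and filter for the failing container ID.
journalctl -u containerd --since "24 hours ago" | grep 62c3b528a6ab > containerd-device-plugin.log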
Collecting full debug bundle (optional):
NOTE: please refer to the must-gather script for debug data collected.
This bundle can be submitted to us via email: operator_feedback@nvidia.com
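For reference, a minimal sketch of running the must-gather script mentioned above; the download URL is a placeholder and should be taken from the gpu-operator repository (hack/must-gather.sh):

# Download the script from the gpu-operator repository (URL placeholder), then run it
# and attach the resulting archive to the support case or the email above.
curl -o must-gather.sh -L <raw URL of hack/must-gather.sh in the NVIDIA/gpu-operator repo>
chmod +x must-gather.sh
./must-gather.sh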