ATP-55 opened this issue 1 year ago
@Amrutayan can you describe nvidia-container-toolkit-daemonset
to see what image is being used?
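For example, something along these lines should show the image in use (the namespace is an assumption; use whichever namespace the operator components run in):

```sh
# Assumed namespace "gpu-operator"; adjust to where the operator is deployed.
kubectl describe daemonset nvidia-container-toolkit-daemonset -n gpu-operator | grep -i image
```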
I have used gpu-operator v22.9, so the toolkit image is:

```
repository: nvcr.io/nvidia/k8s
image: container-toolkit
version: v1.11.0-ubuntu20.04
```
@Amrutayan Can you use the v1.11.0-ubi8
toolkit image instead? Please see the discussion here.
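For reference, one way to switch the toolkit image is through the chart's toolkit.version value (a sketch only; the values file used in this deployment is not shown in the thread):

```sh
# Sketch: override only the toolkit image tag, keeping the rest of the release unchanged.
helm upgrade gpu-operator nvidia/gpu-operator -n gpu-operator \
  --reuse-values \
  --set toolkit.version=v1.11.0-ubi8
```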
I have tried that, but the error remains the same:

```
Containers:
  nvidia-container-toolkit-ctr:
    Container ID:  containerd://502006a8772e498c6ba4f874fdacee353208489ff114b790cf4d82cc4334b7c9
    Image:         nvcr.io/nvidia/k8s/container-toolkit:v1.11.0-ubi8
    Image ID:      nvcr.io/nvidia/k8s/container-toolkit@sha256:efb88937f73434994d1bbadc87b492a1df047aa9f8d6e9f5ec3b09536e6e7691
    Port:
```

Pod status:

```
NAME                                                 READY   STATUS                  RESTARTS   AGE
gpu-feature-discovery-mfvcd                          0/1     Init:CrashLoopBackOff   4          3m54s
gpu-node-feature-discovery-master-64864bd756-5skpq   1/1     Running                 0          20h
gpu-node-feature-discovery-worker-7w8pp              1/1     Running                 0          20h
gpu-node-feature-discovery-worker-c8tfr              1/1     Running                 0          20h
gpu-node-feature-discovery-worker-ms6q5              1/1     Running                 0          20h
gpu-node-feature-discovery-worker-qnkdr              1/1     Running                 0          20h
gpu-node-feature-discovery-worker-rngnf              1/1     Running                 0          20h
gpu-node-feature-discovery-worker-t6d6z              1/1     Running                 1          7h13m
gpu-node-feature-discovery-worker-ws25n              1/1     Running                 0          20h
gpu-node-feature-discovery-worker-zz8rp              1/1     Running                 0          20h
gpu-operator-7bdd8bf555-kcfhp                        1/1     Running                 0          20h
nvidia-container-toolkit-daemonset-wkssm             1/1     Running                 0          4m8s
nvidia-dcgm-exporter-fqpv7                           0/1     Init:CrashLoopBackOff   4          3m56s
nvidia-device-plugin-daemonset-4bvwr                 0/1     Init:CrashLoopBackOff   4          4m8s
nvidia-operator-validator-zfm5z                      0/1     Init:CrashLoopBackOff   4          3m54s
```
kubectl describe po nvidia-dcgm-exporter-fqpv7
```
... failed to get sandbox runtime: no runtime for "nvidia" is configured
Normal   Pulled   3m2s (x4 over 3m44s)  kubelet  Container image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v22.9.0" already present on machine
Normal   Created  3m2s (x4 over 3m44s)  kubelet  Created container toolkit-validation
Warning  Failed   3m1s (x4 over 3m44s)  kubelet  Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy' nvidia-container-cli.real: initialization error: nvml error: insufficient permissions: unknown
Warning  BackOff  30s (x17 over 3m43s)  kubelet  Back-off restarting failed container
```
kubectl describe po nvidia-device-plugin-daemonset-4bvwr
```
Warning  FailedCreatePodSandBox  4m55s                 kubelet  Failed to create pod sandbox: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial unix /run/containerd/containerd.sock: connect: connection refused"
Warning  Failed                  4m22s (x3 over 4m39s) kubelet  Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy' nvidia-container-cli.real: initialization error: nvml error: insufficient permissions: unknown
Normal   Pulled                  3m56s (x4 over 4m39s) kubelet  Container image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v22.9.0" already present on machine
Normal   Created                 3m56s (x4 over 4m39s) kubelet  Created container toolkit-validation
Warning  BackOff                 92s (x16 over 4m38s)  kubelet  Back-off restarting failed container
```
kubectl describe po gpu-feature-discovery-mfvcd
```
Normal   Created  4m48s (x4 over 5m30s)  kubelet  Created container driver-validation
Warning  Failed   4m47s (x4 over 5m30s)  kubelet  Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy' nvidia-container-cli.real: initialization error: nvml error: insufficient permissions: unknown
Warning  BackOff  2m10s (x17 over 5m29s) kubelet  Back-off restarting failed container
```
kubectl describe po nvidia-operator-validator-zfm5z
```
Normal   Created  5m23s (x4 over 6m5s)  kubelet  Created container toolkit-validation
Warning  Failed   5m22s (x4 over 6m5s)  kubelet  Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy' nvidia-container-cli.real: initialization error: nvml error: insufficient permissions: unknown
Warning  BackOff  2m50s (x17 over 6m3s) kubelet  Back-off restarting failed container
```
Can you please take a look and suggest a fix?
@cdesiniotis Can you please suggest?
Can you check whether the driver root is set correctly (to / in this case) in /usr/local/nvidia/toolkit/.config/nvidia-container-runtime/config.toml?
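For example (a sketch; the exact layout of the generated config may differ, but the root value under [nvidia-container-cli] should point at / when the driver is preinstalled on the host, rather than at /run/nvidia/driver):

```sh
# Show the driver root the NVIDIA container CLI is configured with.
grep -A5 'nvidia-container-cli' /usr/local/nvidia/toolkit/.config/nvidia-container-runtime/config.toml
# Expected for a host-installed driver: a line such as
#   root = "/"
```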
I don't know if that helps your case, but I had the same error, and increasing the pod's memory request and limits to at least 1G solved the issue.
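Purely for illustration (which workload needs the bump, and whether the operator reconciles manual changes back, depends on the setup), requests and limits can be raised with kubectl set resources:

```sh
# Hypothetical example: raise the memory request/limit to 1Gi on a deployment.
# Operator-managed workloads may be reverted by the operator, so prefer chart values where available.
kubectl set resources deployment gpu-operator -n gpu-operator \
  --requests=memory=1Gi --limits=memory=1Gi
```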
I've encountered "Auto-detected mode as 'legacy'" when accidentally specifying a device that did not exist.
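If it helps, a quick sanity check that the devices being referenced actually exist is to list what the driver reports:

```sh
# List the GPUs the driver exposes; any device index/UUID used elsewhere must match one of these.
nvidia-smi -L
```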
The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.
1. Quick Debug Checklist
[ ] Are i2c_core and ipmi_msghandler loaded on the nodes? Yes
[ ] Did you apply the CRD (kubectl describe clusterpolicies --all-namespaces)?

1. Issue or feature description
Trying to deploy GPU Operator 22.9.0 on SLES 15 SP4. The NVIDIA driver is already installed on the worker node, but the following pods are failing:

```
nvidia-dcgm-exporter-2xlz8             0/1   Init:CrashLoopBackOff   7   12m
nvidia-device-plugin-daemonset-nmb7r   0/1   Init:CrashLoopBackOff   7   12m
nvidia-operator-validator-v8xn9        0/1   Init:CrashLoopBackOff   7   12m
gpu-feature-discovery-9t6sp            0/1   Init:CrashLoopBackOff   7   12m
```
2. Steps to reproduce the issue
Install the GPU Operator Helm chart 22.9.0 on SLES 15 SP4 (a sketch of the install command follows).
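A sketch of that step, assuming the chart is pulled from NVIDIA's Helm repository (the session below installs from a local chart checkout, and the exact values used are not shown); driver.enabled=false reflects the preinstalled host driver noted further down:

```sh
# Sketch: install GPU Operator v22.9.0 with a host-installed NVIDIA driver.
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
helm install gpu-operator nvidia/gpu-operator \
  -n gpu-operator --create-namespace \
  --version v22.9.0 \
  --set driver.enabled=false
```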
3. Information to attach (optional if deemed irrelevant)
[ ] kubernetes pods status:
kubectl get pods --all-namespaces
```
SGH123VZMJ:/home/edison/atp/gpu-operator-22.9.0/deployments/gpu-operator # kubectl get po
NAME                                                 READY   STATUS                   RESTARTS   AGE
gpu-feature-discovery-9t6sp                          0/1     Init:RunContainerError   3          61s
gpu-node-feature-discovery-master-64864bd756-fthcl   1/1     Running                  0          84s
gpu-node-feature-discovery-worker-2l2l6              1/1     Running                  0          84s
gpu-node-feature-discovery-worker-4wnzw              1/1     Running                  0          84s
gpu-node-feature-discovery-worker-8bhmc              1/1     Running                  0          84s
gpu-node-feature-discovery-worker-8vp78              1/1     Running                  0          84s
gpu-node-feature-discovery-worker-k5jjd              1/1     Running                  0          84s
gpu-node-feature-discovery-worker-lfbgd              1/1     Running                  0          84s
gpu-node-feature-discovery-worker-sj9m9              1/1     Running                  0          84s
gpu-node-feature-discovery-worker-twbms              1/1     Running                  0          84s
gpu-node-feature-discovery-worker-zfkqt              1/1     Running                  0          84s
gpu-operator-7bdd8bf555-pvxz5                        1/1     Running                  0          84s
nvidia-container-toolkit-daemonset-2hdpv             1/1     Running                  0          63s
nvidia-dcgm-exporter-2xlz8                           0/1     Init:RunContainerError   3          62s
nvidia-device-plugin-daemonset-nmb7r                 0/1     Init:RunContainerError   3          63s
nvidia-operator-validator-v8xn9                      0/1     Init:RunContainerError   3          63s
```

[ ] kubernetes daemonset status:
kubectl get ds --all-namespaces
[ ] If a pod/ds is in an error state or pending state
kubectl describe pod -n NAMESPACE POD_NAME
POD Events: gpu-feature-discovery-9t6sp
```
Events:
  Type     Reason                  Age               From               Message
  Normal   Scheduled               98s               default-scheduler  Successfully assigned default/gpu-feature-discovery-9t6sp to jelly-oct13-2022-sample-gpu-pool2-jelly-oct13-2022-94b0-qh26p
  Warning  FailedCreatePodSandBox  98s               kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for "nvidia" is configured
  Normal   Pulled                  44s (x4 over 84s) kubelet            Container image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v22.9.0" already present on machine
  Normal   Created                 44s (x4 over 84s) kubelet            Created container toolkit-validation
  Warning  Failed                  44s (x4 over 84s) kubelet            Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy' nvidia-container-cli.real: initialization error: nvml error: insufficient permissions: unknown
  Warning  BackOff                 4s (x8 over 83s)  kubelet            Back-off restarting failed container
```
POD: nvidia-operator-validator-v8xn9
```
Events:
  Type     Reason                  Age                 From               Message
  Normal   Scheduled               3m8s                default-scheduler  Successfully assigned default/nvidia-operator-validator-v8xn9 to jelly-oct13-2022-sample-gpu-pool2-jelly-oct13-2022-94b0-qh26p
  Warning  FailedCreatePodSandBox  3m7s                kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for "nvidia" is configured
  Normal   Pulled                  85s (x5 over 2m54s) kubelet            Container image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v22.9.0" already present on machine
  Normal   Created                 85s (x5 over 2m54s) kubelet            Created container driver-validation
  Warning  Failed                  85s (x5 over 2m53s) kubelet            Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy' nvidia-container-cli.real: initialization error: nvml error: insufficient permissions: unknown
  Warning  BackOff                 73s (x9 over 2m52s) kubelet            Back-off restarting failed container
```
POD: nvidia-device-plugin-daemonset-nmb7r
```
Events:
  Type     Reason                  Age                 From               Message
  Normal   Scheduled               2m56s               default-scheduler  Successfully assigned default/nvidia-device-plugin-daemonset-nmb7r to jelly-oct13-2022-sample-gpu-pool2-jelly-oct13-2022-94b0-qh26p
  Warning  FailedCreatePodSandBox  2m55s               kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for "nvidia" is configured
  Normal   Pulled                  74s (x5 over 2m42s) kubelet            Container image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v22.9.0" already present on machine
  Normal   Created                 74s (x5 over 2m42s) kubelet            Created container toolkit-validation
  Warning  Failed                  74s (x5 over 2m42s) kubelet            Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy' nvidia-container-cli.real: initialization error: nvml error: insufficient permissions: unknown
  Warning  BackOff                 63s (x9 over 2m41s) kubelet            Back-off restarting failed container
```
POD: nvidia-dcgm-exporter-2xlz8
```
Events:
  Type     Reason                  Age                 From               Message
  Normal   Scheduled               2m47s               default-scheduler  Successfully assigned default/nvidia-dcgm-exporter-2xlz8 to jelly-oct13-2022-sample-gpu-pool2-jelly-oct13-2022-94b0-qh26p
  Warning  FailedCreatePodSandBox  2m46s               kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for "nvidia" is configured
  Normal   Pulled                  58s (x5 over 2m35s) kubelet            Container image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v22.9.0" already present on machine
  Normal   Created                 58s (x5 over 2m35s) kubelet            Created container toolkit-validation
  Warning  Failed                  58s (x5 over 2m35s) kubelet            Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy' nvidia-container-cli.real: initialization error: nvml error: insufficient permissions: unknown
  Warning  BackOff                 58s (x9 over 2m34s) kubelet            Back-off restarting failed container
```
[ ] If a pod/ds is in an error state or pending state
kubectl logs -n NAMESPACE POD_NAME
[ ] Output of running a container on the GPU machine:
docker run -it alpine echo foo
[ ] Docker configuration file:
cat /etc/docker/daemon.json
[ ] Docker runtime configuration:
docker info | grep runtime
Note: the node uses containerd rather than Docker; the relevant containerd runtime configuration is:

```toml
default_runtime_name = "nvidia"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
    runtime_type = "io.containerd.runc.v2"
    [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
      BinaryName = "/usr/local/nvidia/toolkit/nvidia-container-runtime"
  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia-experimental]
    runtime_type = "io.containerd.runc.v2"
    [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia-experimental.options]
      BinaryName = "/usr/local/nvidia/toolkit/nvidia-container-runtime-experimental"
  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
    runtime_type = "io.containerd.runc.v2"
    [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
```
ls -la /run/nvidia
```
total 4
drwxr-xr-x  4 root root 100 Nov 17 03:23 .
drwxr-xr-x 33 root root 880 Nov 17 03:17 ..
drwxr-xr-x  2 root root  40 Nov 17 03:17 driver
-rw-r--r--  1 root root   6 Nov 17 03:23 toolkit.pid
drwxr-xr-x  2 root root  60 Nov 17 03:22 validations
```
[ ] NVIDIA packages directory:
ls -la /usr/local/nvidia/toolkit
```
total 12912
drwxr-xr-x 3 root root    4096 Nov 17 03:23 .
drwxr-xr-x 3 root root      21 Nov 17 03:23 ..
drwxr-xr-x 3 root root      38 Nov 17 03:23 .config
lrwxrwxrwx 1 root root      32 Nov 17 03:23 libnvidia-container-go.so.1 -> libnvidia-container-go.so.1.11.0
-rw-r--r-- 1 root root 2959384 Nov 17 03:23 libnvidia-container-go.so.1.11.0
lrwxrwxrwx 1 root root      29 Nov 17 03:23 libnvidia-container.so.1 -> libnvidia-container.so.1.11.0
-rwxr-xr-x 1 root root  195856 Nov 17 03:23 libnvidia-container.so.1.11.0
-rwxr-xr-x 1 root root     154 Nov 17 03:23 nvidia-container-cli
-rwxr-xr-x 1 root root   47472 Nov 17 03:23 nvidia-container-cli.real
-rwxr-xr-x 1 root root     342 Nov 17 03:23 nvidia-container-runtime
-rwxr-xr-x 1 root root     350 Nov 17 03:23 nvidia-container-runtime-experimental
-rwxr-xr-x 1 root root     203 Nov 17 03:23 nvidia-container-runtime-hook
-rwxr-xr-x 1 root root 2142088 Nov 17 03:23 nvidia-container-runtime-hook.real
-rwxr-xr-x 1 root root 3771792 Nov 17 03:23 nvidia-container-runtime.experimental
-rwxr-xr-x 1 root root 4079040 Nov 17 03:23 nvidia-container-runtime.real
lrwxrwxrwx 1 root root      29 Nov 17 03:23 nvidia-container-toolkit -> nvidia-container-runtime-hook
```

[ ] NVIDIA driver directory:
ls -la /run/nvidia/driver
```
total 0
drwxr-xr-x 2 root root  40 Nov 17 03:17 .
drwxr-xr-x 4 root root 100 Nov 17 03:23 ..
```

Note: nvidia-computeG05-470.129.06-150400.54.1.x86_64 — the driver is installed as an RPM on the worker node (not via the driver container), which is why /run/nvidia/driver is empty.
Result of nvidia-smi:

```
Thu Nov 17 03:31:09 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.129.06    Driver Version: 470.129.06    CUDA Version: 11.4   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:08.0 Off |                    0 |
| N/A   40C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla T4            Off  | 00000000:00:09.0 Off |                    0 |
| N/A   40C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
```
journalctl -u kubelet > kubelet.logs