NVIDIA / gpu-operator

NVIDIA GPU Operator creates, configures, and manages GPUs in Kubernetes
https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/index.html
Apache License 2.0

Getting Error: "stderr: Auto-detected mode as 'legacy' nvidia-container-cli.real: initialization error: nvml error: insufficient permissions: unknown" while deploying gpu-operator v22.9.0 on SLES 15 SP4 #443

Open ATP-55 opened 1 year ago

ATP-55 commented 1 year ago

The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.

1. Quick Debug Checklist

1. Issue or feature description

Trying to deploy gpu-operator v22.9.0 on SLES 15 SP4. The worker node already has the NVIDIA driver installed, but the following pods are failing:

nvidia-dcgm-exporter-2xlz8             0/1   Init:CrashLoopBackOff   7   12m
nvidia-device-plugin-daemonset-nmb7r   0/1   Init:CrashLoopBackOff   7   12m
nvidia-operator-validator-v8xn9        0/1   Init:CrashLoopBackOff   7   12m
gpu-feature-discovery-9t6sp            0/1   Init:CrashLoopBackOff   7   12m

2. Steps to reproduce the issue

Install the GPU Operator Helm chart v22.9.0 on SLES 15 SP4, as sketched below.
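For reference, a minimal sketch of such an install, assuming the NVIDIA Helm repo is added and the host-installed driver should be reused (the release name, namespace, and the driver.enabled flag are assumptions, not taken from this report):

```bash
# Add the NVIDIA Helm repo and install the operator, keeping the
# host-installed driver by disabling the driver container
# (values assumed from the public chart, not from this issue):
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install gpu-operator nvidia/gpu-operator \
  --version v22.9.0 \
  -n gpu-operator --create-namespace \
  --set driver.enabled=false
```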

3. Information to attach (optional if deemed irrelevant)

POD Events: gpu-feature-discovery-9t6sp

Events:
  Type     Reason                  Age                From               Message
  Normal   Scheduled               98s                default-scheduler  Successfully assigned default/gpu-feature-discovery-9t6sp to jelly-oct13-2022-sample-gpu-pool2-jelly-oct13-2022-94b0-qh26p
  Warning  FailedCreatePodSandBox  98s                kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for "nvidia" is configured
  Normal   Pulled                  44s (x4 over 84s)  kubelet            Container image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v22.9.0" already present on machine
  Normal   Created                 44s (x4 over 84s)  kubelet            Created container toolkit-validation
  Warning  Failed                  44s (x4 over 84s)  kubelet            Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy' nvidia-container-cli.real: initialization error: nvml error: insufficient permissions: unknown
  Warning  BackOff                 4s (x8 over 83s)   kubelet            Back-off restarting failed container

POD: nvidia-operator-validator-v8xn9

Events:
  Type     Reason                  Age                 From               Message
  Normal   Scheduled               3m8s                default-scheduler  Successfully assigned default/nvidia-operator-validator-v8xn9 to jelly-oct13-2022-sample-gpu-pool2-jelly-oct13-2022-94b0-qh26p
  Warning  FailedCreatePodSandBox  3m7s                kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for "nvidia" is configured
  Normal   Pulled                  85s (x5 over 2m54s) kubelet            Container image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v22.9.0" already present on machine
  Normal   Created                 85s (x5 over 2m54s) kubelet            Created container driver-validation
  Warning  Failed                  85s (x5 over 2m53s) kubelet            Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy' nvidia-container-cli.real: initialization error: nvml error: insufficient permissions: unknown
  Warning  BackOff                 73s (x9 over 2m52s) kubelet            Back-off restarting failed container

POD: nvidia-device-plugin-daemonset-nmb7r

Events:
  Type     Reason                  Age                 From               Message
  Normal   Scheduled               2m56s               default-scheduler  Successfully assigned default/nvidia-device-plugin-daemonset-nmb7r to jelly-oct13-2022-sample-gpu-pool2-jelly-oct13-2022-94b0-qh26p
  Warning  FailedCreatePodSandBox  2m55s               kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for "nvidia" is configured
  Normal   Pulled                  74s (x5 over 2m42s) kubelet            Container image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v22.9.0" already present on machine
  Normal   Created                 74s (x5 over 2m42s) kubelet            Created container toolkit-validation
  Warning  Failed                  74s (x5 over 2m42s) kubelet            Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy' nvidia-container-cli.real: initialization error: nvml error: insufficient permissions: unknown
  Warning  BackOff                 63s (x9 over 2m41s) kubelet            Back-off restarting failed container

POD: nvidia-dcgm-exporter-2xlz8

Events:
  Type     Reason                  Age                 From               Message
  Normal   Scheduled               2m47s               default-scheduler  Successfully assigned default/nvidia-dcgm-exporter-2xlz8 to jelly-oct13-2022-sample-gpu-pool2-jelly-oct13-2022-94b0-qh26p
  Warning  FailedCreatePodSandBox  2m46s               kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for "nvidia" is configured
  Normal   Pulled                  58s (x5 over 2m35s) kubelet            Container image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v22.9.0" already present on machine
  Normal   Created                 58s (x5 over 2m35s) kubelet            Created container toolkit-validation
  Warning  Failed                  58s (x5 over 2m35s) kubelet            Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy' nvidia-container-cli.real: initialization error: nvml error: insufficient permissions: unknown
  Warning  BackOff                 58s (x9 over 2m34s) kubelet            Back-off restarting failed container

containerd configuration (excerpt):

default_runtime_name = "nvidia"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
    runtime_type = "io.containerd.runc.v2"
    [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
      BinaryName = "/usr/local/nvidia/toolkit/nvidia-container-runtime"
  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia-experimental]
    runtime_type = "io.containerd.runc.v2"
    [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia-experimental.options]
      BinaryName = "/usr/local/nvidia/toolkit/nvidia-container-runtime-experimental"
  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
    runtime_type = "io.containerd.runc.v2"
    [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
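Since the events above also show "no runtime for 'nvidia' is configured", one way to confirm that containerd has actually loaded this runtime entry (and not just that the file contains it) is to dump the merged config on the node. This is a hedged suggestion, not something from the original report:

```bash
# Dump the merged containerd configuration and look for the "nvidia"
# runtime; if it is missing here despite being in the config file,
# containerd has not picked up the change yet.
containerd config dump | grep -A 3 'runtimes.nvidia'
```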

total 4
drwxr-xr-x  4 root root 100 Nov 17 03:23 .
drwxr-xr-x 33 root root 880 Nov 17 03:17 ..
drwxr-xr-x  2 root root  40 Nov 17 03:17 driver
-rw-r--r--  1 root root   6 Nov 17 03:23 toolkit.pid
drwxr-xr-x  2 root root  60 Nov 17 03:22 validations

Note: the driver is installed as an RPM on the worker node (nvidia-computeG05-470.129.06-150400.54.1.x86_64).

Result of nvidia-smi:

Thu Nov 17 03:31:09 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.129.06   Driver Version: 470.129.06   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:08.0 Off |                    0 |
| N/A   40C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla T4            Off  | 00000000:00:09.0 Off |                    0 |
| N/A   40C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

cdesiniotis commented 1 year ago

@Amrutayan Can you describe the nvidia-container-toolkit-daemonset to see which image is being used?
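For example, something along these lines (a sketch; add -n <namespace> if the operator is not installed in the default namespace, which it appears to be in this report):

```bash
# Show which container-toolkit image the daemonset is running:
kubectl describe daemonset nvidia-container-toolkit-daemonset | grep 'Image:'
```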

ATP-55 commented 1 year ago

I have used gpu-operator v22.9.0, so the following image is in use:

repository: nvcr.io/nvidia/k8s
image: container-toolkit
version: v1.11.0-ubuntu20.04

shivamerla commented 1 year ago

@Amrutayan Can you use the v1.11.0-ubi8 toolkit image instead? Please see the discussion here.
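A minimal sketch of how that override could be passed through Helm, assuming the chart's toolkit.version value and an existing release named gpu-operator (both are assumptions, not taken from this thread):

```bash
# Switch the container toolkit to the ubi8 variant via chart values
# (release name and value name are assumptions):
helm upgrade gpu-operator nvidia/gpu-operator \
  --reuse-values \
  --set toolkit.version=v1.11.0-ubi8
```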

ATP-55 commented 1 year ago

I have tried, but the error remains the same:

Containers:
  nvidia-container-toolkit-ctr:
    Container ID:  containerd://502006a8772e498c6ba4f874fdacee353208489ff114b790cf4d82cc4334b7c9
    Image:         nvcr.io/nvidia/k8s/container-toolkit:v1.11.0-ubi8
    Image ID:      nvcr.io/nvidia/k8s/container-toolkit@sha256:efb88937f73434994d1bbadc87b492a1df047aa9f8d6e9f5ec3b09536e6e7691
    Port:
    Host Port:
    Command:
      bash
      -c
    Args:
      [[ -f /run/nvidia/validations/host-driver-ready ]] && driver_root=/ || driver_root=/run/nvidia/driver; export NVIDIA_DRIVER_ROOT=$driver_root; exec nvidia-toolkit /usr/local/nvidia
    State:          Running
      Started:      Thu, 24 Nov 2022 13:45:20 +0000

Pod status:

NAME                                                 READY   STATUS                  RESTARTS   AGE
gpu-feature-discovery-mfvcd                          0/1     Init:CrashLoopBackOff   4          3m54s
gpu-node-feature-discovery-master-64864bd756-5skpq   1/1     Running                 0          20h
gpu-node-feature-discovery-worker-7w8pp              1/1     Running                 0          20h
gpu-node-feature-discovery-worker-c8tfr              1/1     Running                 0          20h
gpu-node-feature-discovery-worker-ms6q5              1/1     Running                 0          20h
gpu-node-feature-discovery-worker-qnkdr              1/1     Running                 0          20h
gpu-node-feature-discovery-worker-rngnf              1/1     Running                 0          20h
gpu-node-feature-discovery-worker-t6d6z              1/1     Running                 1          7h13m
gpu-node-feature-discovery-worker-ws25n              1/1     Running                 0          20h
gpu-node-feature-discovery-worker-zz8rp              1/1     Running                 0          20h
gpu-operator-7bdd8bf555-kcfhp                        1/1     Running                 0          20h
nvidia-container-toolkit-daemonset-wkssm             1/1     Running                 0          4m8s
nvidia-dcgm-exporter-fqpv7                           0/1     Init:CrashLoopBackOff   4          3m56s
nvidia-device-plugin-daemonset-4bvwr                 0/1     Init:CrashLoopBackOff   4          4m8s
nvidia-operator-validator-zfm5z                      0/1     Init:CrashLoopBackOff   4          3m54s

kubectl describe po nvidia-dcgm-exporter-fqpv7

...failed to get sandbox runtime: no runtime for "nvidia" is configured
Normal   Pulled    3m2s (x4 over 3m44s)   kubelet   Container image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v22.9.0" already present on machine
Normal   Created   3m2s (x4 over 3m44s)   kubelet   Created container toolkit-validation
Warning  Failed    3m1s (x4 over 3m44s)   kubelet   Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy' nvidia-container-cli.real: initialization error: nvml error: insufficient permissions: unknown
Warning  BackOff   30s (x17 over 3m43s)   kubelet   Back-off restarting failed container

kubectl describe po nvidia-device-plugin-daemonset-4bvwr

Warning  FailedCreatePodSandBox  4m55s                   kubelet   Failed to create pod sandbox: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial unix /run/containerd/containerd.sock: connect: connection refused"
Warning  Failed                  4m22s (x3 over 4m39s)   kubelet   Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy' nvidia-container-cli.real: initialization error: nvml error: insufficient permissions: unknown
Normal   Pulled                  3m56s (x4 over 4m39s)   kubelet   Container image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v22.9.0" already present on machine
Normal   Created                 3m56s (x4 over 4m39s)   kubelet   Created container toolkit-validation
Warning  BackOff                 92s (x16 over 4m38s)    kubelet   Back-off restarting failed container

kubectl describe po gpu-feature-discovery-mfvcd

Normal   Created  4m48s (x4 over 5m30s)   kubelet   Created container driver-validation
Warning  Failed   4m47s (x4 over 5m30s)   kubelet   Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy' nvidia-container-cli.real: initialization error: nvml error: insufficient permissions: unknown
Warning  BackOff  2m10s (x17 over 5m29s)  kubelet   Back-off restarting failed container

kubectl describe po nvidia-operator-validator-zfm5z

Normal   Created  5m23s (x4 over 6m5s)    kubelet   Created container toolkit-validation
Warning  Failed   5m22s (x4 over 6m5s)    kubelet   Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy' nvidia-container-cli.real: initialization error: nvml error: insufficient permissions: unknown
Warning  BackOff  2m50s (x17 over 6m3s)   kubelet   Back-off restarting failed container

ATP-55 commented 1 year ago

Can you please take a look and suggest?

ATP-55 commented 1 year ago

@cdesiniotis Can you please suggest?

shivamerla commented 1 year ago

Can you check whether the driver root is set correctly (it should be / in this case) in /usr/local/nvidia/toolkit/.config/nvidia-container-runtime/config.toml?
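For example, a quick check on the worker node; this is only a sketch, assuming the root setting lives under the [nvidia-container-cli] section of that file:

```bash
# Inspect the driver root the toolkit wrote into its runtime config;
# for a driver installed directly on the host it should point at "/".
grep -n -A 5 'nvidia-container-cli' \
  /usr/local/nvidia/toolkit/.config/nvidia-container-runtime/config.toml
```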

cmisale commented 1 year ago

I don't know if this helps your case, but I had the same error, and increasing the pod's memory request and limits to at least 1G solved the issue.
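For illustration only, one way to apply that kind of bump with kubectl (the workload name is a placeholder, and 1Gi is assumed as the "at least 1G" mentioned above):

```bash
# Raise the memory request/limit on a workload that hits the error
# ("my-gpu-workload" is a hypothetical deployment name):
kubectl set resources deployment my-gpu-workload \
  --requests=memory=1Gi --limits=memory=1Gi
```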

figuernd commented 1 week ago

I've encountered Auto-detected mode as 'legacy' when accidentally specifying a device that did not exist.
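If that is the suspicion, listing the devices the node actually exposes is a quick sanity check (a hedged suggestion, not part of the original comment):

```bash
# List GPU indices and UUIDs known to the driver, to compare against
# whatever device reference (e.g. NVIDIA_VISIBLE_DEVICES) the workload uses:
nvidia-smi -L
```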