NVIDIA / gpu-operator

NVIDIA GPU Operator creates, configures, and manages GPUs in Kubernetes
https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/index.html
Apache License 2.0

Getting Error: "stderr: Auto-detected mode as 'legacy' nvidia-container-cli.real: initialization error: nvml error: insufficient permissions: unknown" while deploying gpu-operator v22.9.0 on SLES 15 SP4 #443

Open ATP-55 opened 1 year ago

ATP-55 commented 1 year ago

The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.

1. Quick Debug Checklist

1. Issue or feature description

Trying to deploy gpu-operator v22.9.0 on SLES 15 SP4. The worker node already has the NVIDIA driver installed, but the following pods are failing:

nvidia-dcgm-exporter-2xlz8             0/1   Init:CrashLoopBackOff   7   12m
nvidia-device-plugin-daemonset-nmb7r   0/1   Init:CrashLoopBackOff   7   12m
nvidia-operator-validator-v8xn9        0/1   Init:CrashLoopBackOff   7   12m
gpu-feature-discovery-9t6sp            0/1   Init:CrashLoopBackOff   7   12m

2. Steps to reproduce the issue

Install the GPU Operator Helm chart v22.9.0 on SLES 15 SP4, as sketched below.
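For reference, a minimal sketch of such an install, assuming the NVIDIA Helm repo is added and the host-installed driver should be reused (the release name, namespace, and the driver.enabled flag are assumptions, not taken from this report):

```bash
# Add the NVIDIA Helm repo and install the operator, keeping the
# host-installed driver by disabling the driver container
# (values assumed from the public chart, not from this issue):
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install gpu-operator nvidia/gpu-operator \
  --version v22.9.0 \
  -n gpu-operator --create-namespace \
  --set driver.enabled=false
```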

3. Information to attach (optional if deemed irrelevant)

POD Events: gpu-feature-discovery-9t6sp

Events:
  Type     Reason                  Age                From               Message
  Normal   Scheduled               98s                default-scheduler  Successfully assigned default/gpu-feature-discovery-9t6sp to jelly-oct13-2022-sample-gpu-pool2-jelly-oct13-2022-94b0-qh26p
  Warning  FailedCreatePodSandBox  98s                kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for "nvidia" is configured
  Normal   Pulled                  44s (x4 over 84s)  kubelet            Container image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v22.9.0" already present on machine
  Normal   Created                 44s (x4 over 84s)  kubelet            Created container toolkit-validation
  Warning  Failed                  44s (x4 over 84s)  kubelet            Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy' nvidia-container-cli.real: initialization error: nvml error: insufficient permissions: unknown
  Warning  BackOff                 4s (x8 over 83s)   kubelet            Back-off restarting failed container

POD: nvidia-operator-validator-v8xn9

Events:
  Type     Reason                  Age                 From               Message
  Normal   Scheduled               3m8s                default-scheduler  Successfully assigned default/nvidia-operator-validator-v8xn9 to jelly-oct13-2022-sample-gpu-pool2-jelly-oct13-2022-94b0-qh26p
  Warning  FailedCreatePodSandBox  3m7s                kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for "nvidia" is configured
  Normal   Pulled                  85s (x5 over 2m54s) kubelet            Container image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v22.9.0" already present on machine
  Normal   Created                 85s (x5 over 2m54s) kubelet            Created container driver-validation
  Warning  Failed                  85s (x5 over 2m53s) kubelet            Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy' nvidia-container-cli.real: initialization error: nvml error: insufficient permissions: unknown
  Warning  BackOff                 73s (x9 over 2m52s) kubelet            Back-off restarting failed container

POD: nvidia-device-plugin-daemonset-nmb7r

Events:
  Type     Reason                  Age                 From               Message
  Normal   Scheduled               2m56s               default-scheduler  Successfully assigned default/nvidia-device-plugin-daemonset-nmb7r to jelly-oct13-2022-sample-gpu-pool2-jelly-oct13-2022-94b0-qh26p
  Warning  FailedCreatePodSandBox  2m55s               kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for "nvidia" is configured
  Normal   Pulled                  74s (x5 over 2m42s) kubelet            Container image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v22.9.0" already present on machine
  Normal   Created                 74s (x5 over 2m42s) kubelet            Created container toolkit-validation
  Warning  Failed                  74s (x5 over 2m42s) kubelet            Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy' nvidia-container-cli.real: initialization error: nvml error: insufficient permissions: unknown
  Warning  BackOff                 63s (x9 over 2m41s) kubelet            Back-off restarting failed container

POD: nvidia-dcgm-exporter-2xlz8

Events:
  Type     Reason                  Age                 From               Message
  Normal   Scheduled               2m47s               default-scheduler  Successfully assigned default/nvidia-dcgm-exporter-2xlz8 to jelly-oct13-2022-sample-gpu-pool2-jelly-oct13-2022-94b0-qh26p
  Warning  FailedCreatePodSandBox  2m46s               kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for "nvidia" is configured
  Normal   Pulled                  58s (x5 over 2m35s) kubelet            Container image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v22.9.0" already present on machine
  Normal   Created                 58s (x5 over 2m35s) kubelet            Created container toolkit-validation
  Warning  Failed                  58s (x5 over 2m35s) kubelet            Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy' nvidia-container-cli.real: initialization error: nvml error: insufficient permissions: unknown
  Warning  BackOff                 58s (x9 over 2m34s) kubelet            Back-off restarting failed container

containerd configuration (excerpt):

default_runtime_name = "nvidia"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
    runtime_type = "io.containerd.runc.v2"
    [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
      BinaryName = "/usr/local/nvidia/toolkit/nvidia-container-runtime"
  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia-experimental]
    runtime_type = "io.containerd.runc.v2"
    [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia-experimental.options]
      BinaryName = "/usr/local/nvidia/toolkit/nvidia-container-runtime-experimental"
  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
    runtime_type = "io.containerd.runc.v2"
    [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
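Since the events above also show "no runtime for 'nvidia' is configured", one way to confirm that containerd has actually loaded this runtime entry (and not just that the file contains it) is to dump the merged config on the node. This is a hedged suggestion, not something from the original report:

```bash
# Dump the merged containerd configuration and look for the "nvidia"
# runtime; if it is missing here despite being in the config file,
# containerd has not picked up the change yet.
containerd config dump | grep -A 3 'runtimes.nvidia'
```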

total 4
drwxr-xr-x  4 root root 100 Nov 17 03:23 .
drwxr-xr-x 33 root root 880 Nov 17 03:17 ..
drwxr-xr-x  2 root root  40 Nov 17 03:17 driver
-rw-r--r--  1 root root   6 Nov 17 03:23 toolkit.pid
drwxr-xr-x  2 root root  60 Nov 17 03:22 validations

Note: the driver is installed as an RPM on the worker node (nvidia-computeG05-470.129.06-150400.54.1.x86_64).

Result of nvidia-smi:

Thu Nov 17 03:31:09 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.129.06   Driver Version: 470.129.06   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:08.0 Off |                    0 |
| N/A   40C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla T4            Off  | 00000000:00:09.0 Off |                    0 |
| N/A   40C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

cdesiniotis commented 1 year ago

@Amrutayan Can you describe the nvidia-container-toolkit-daemonset to see which image is being used?
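For example, something along these lines (a sketch; add -n <namespace> if the operator is not installed in the default namespace, which it appears to be in this report):

```bash
# Show which container-toolkit image the daemonset is running:
kubectl describe daemonset nvidia-container-toolkit-daemonset | grep 'Image:'
```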

ATP-55 commented 1 year ago

I have used gpu-operator v22.9.0, so the following image is in use:

repository: nvcr.io/nvidia/k8s
image: container-toolkit
version: v1.11.0-ubuntu20.04

shivamerla commented 1 year ago

@Amrutayan Can you use the v1.11.0-ubi8 toolkit image instead? Please see the discussion here.
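A minimal sketch of how that override could be passed through Helm, assuming the chart's toolkit.version value and an existing release named gpu-operator (both are assumptions, not taken from this thread):

```bash
# Switch the container toolkit to the ubi8 variant via chart values
# (release name and value name are assumptions):
helm upgrade gpu-operator nvidia/gpu-operator \
  --reuse-values \
  --set toolkit.version=v1.11.0-ubi8
```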

ATP-55 commented 1 year ago

I have tried, but the error remains the same:

Containers:
  nvidia-container-toolkit-ctr:
    Container ID:  containerd://502006a8772e498c6ba4f874fdacee353208489ff114b790cf4d82cc4334b7c9
    Image:         nvcr.io/nvidia/k8s/container-toolkit:v1.11.0-ubi8
    Image ID:      nvcr.io/nvidia/k8s/container-toolkit@sha256:efb88937f73434994d1bbadc87b492a1df047aa9f8d6e9f5ec3b09536e6e7691
    Port:
    Host Port:
    Command:
      bash
      -c
    Args:
      [[ -f /run/nvidia/validations/host-driver-ready ]] && driver_root=/ || driver_root=/run/nvidia/driver; export NVIDIA_DRIVER_ROOT=$driver_root; exec nvidia-toolkit /usr/local/nvidia
    State:          Running
      Started:      Thu, 24 Nov 2022 13:45:20 +0000

Pod status:

NAME                                                 READY   STATUS                  RESTARTS   AGE
gpu-feature-discovery-mfvcd                          0/1     Init:CrashLoopBackOff   4          3m54s
gpu-node-feature-discovery-master-64864bd756-5skpq   1/1     Running                 0          20h
gpu-node-feature-discovery-worker-7w8pp              1/1     Running                 0          20h
gpu-node-feature-discovery-worker-c8tfr              1/1     Running                 0          20h
gpu-node-feature-discovery-worker-ms6q5              1/1     Running                 0          20h
gpu-node-feature-discovery-worker-qnkdr              1/1     Running                 0          20h
gpu-node-feature-discovery-worker-rngnf              1/1     Running                 0          20h
gpu-node-feature-discovery-worker-t6d6z              1/1     Running                 1          7h13m
gpu-node-feature-discovery-worker-ws25n              1/1     Running                 0          20h
gpu-node-feature-discovery-worker-zz8rp              1/1     Running                 0          20h
gpu-operator-7bdd8bf555-kcfhp                        1/1     Running                 0          20h
nvidia-container-toolkit-daemonset-wkssm             1/1     Running                 0          4m8s
nvidia-dcgm-exporter-fqpv7                           0/1     Init:CrashLoopBackOff   4          3m56s
nvidia-device-plugin-daemonset-4bvwr                 0/1     Init:CrashLoopBackOff   4          4m8s
nvidia-operator-validator-zfm5z                      0/1     Init:CrashLoopBackOff   4          3m54s

kubectl describe po nvidia-dcgm-exporter-fqpv7

...failed to get sandbox runtime: no runtime for "nvidia" is configured
Normal   Pulled    3m2s (x4 over 3m44s)   kubelet   Container image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v22.9.0" already present on machine
Normal   Created   3m2s (x4 over 3m44s)   kubelet   Created container toolkit-validation
Warning  Failed    3m1s (x4 over 3m44s)   kubelet   Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy' nvidia-container-cli.real: initialization error: nvml error: insufficient permissions: unknown
Warning  BackOff   30s (x17 over 3m43s)   kubelet   Back-off restarting failed container

kubectl describe po nvidia-device-plugin-daemonset-4bvwr

Warning  FailedCreatePodSandBox  4m55s                   kubelet   Failed to create pod sandbox: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial unix /run/containerd/containerd.sock: connect: connection refused"
Warning  Failed                  4m22s (x3 over 4m39s)   kubelet   Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy' nvidia-container-cli.real: initialization error: nvml error: insufficient permissions: unknown
Normal   Pulled                  3m56s (x4 over 4m39s)   kubelet   Container image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v22.9.0" already present on machine
Normal   Created                 3m56s (x4 over 4m39s)   kubelet   Created container toolkit-validation
Warning  BackOff                 92s (x16 over 4m38s)    kubelet   Back-off restarting failed container

kubectl describe po gpu-feature-discovery-mfvcd

Normal   Created  4m48s (x4 over 5m30s)   kubelet   Created container driver-validation
Warning  Failed   4m47s (x4 over 5m30s)   kubelet   Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy' nvidia-container-cli.real: initialization error: nvml error: insufficient permissions: unknown
Warning  BackOff  2m10s (x17 over 5m29s)  kubelet   Back-off restarting failed container

kubectl describe po nvidia-operator-validator-zfm5z

Normal   Created  5m23s (x4 over 6m5s)    kubelet   Created container toolkit-validation
Warning  Failed   5m22s (x4 over 6m5s)    kubelet   Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy' nvidia-container-cli.real: initialization error: nvml error: insufficient permissions: unknown
Warning  BackOff  2m50s (x17 over 6m3s)   kubelet   Back-off restarting failed container

ATP-55 commented 1 year ago

Can you please take a look and suggest?

ATP-55 commented 1 year ago

@cdesiniotis Can you please suggest?

shivamerla commented 1 year ago

Can you check whether the driver root is set correctly (it should be / in this case) in /usr/local/nvidia/toolkit/.config/nvidia-container-runtime/config.toml?
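For example, a quick check on the worker node; this is only a sketch, assuming the root setting lives under the [nvidia-container-cli] section of that file:

```bash
# Inspect the driver root the toolkit wrote into its runtime config;
# for a driver installed directly on the host it should point at "/".
grep -n -A 5 'nvidia-container-cli' \
  /usr/local/nvidia/toolkit/.config/nvidia-container-runtime/config.toml
```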

cmisale commented 1 year ago

I don't know if this helps your case, but I had the same error, and increasing the pod's memory request and limits to at least 1G solved the issue.
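For illustration only, one way to apply that kind of bump with kubectl (the workload name is a placeholder, and 1Gi is assumed as the "at least 1G" mentioned above):

```bash
# Raise the memory request/limit on a workload that hits the error
# ("my-gpu-workload" is a hypothetical deployment name):
kubectl set resources deployment my-gpu-workload \
  --requests=memory=1Gi --limits=memory=1Gi
```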

figuernd commented 1 week ago

I've encountered Auto-detected mode as 'legacy' when accidentally specifying a device that did not exist.
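If that is the suspicion, listing the devices the node actually exposes is a quick sanity check (a hedged suggestion, not part of the original comment):

```bash
# List GPU indices and UUIDs known to the driver, to compare against
# whatever device reference (e.g. NVIDIA_VISIBLE_DEVICES) the workload uses:
nvidia-smi -L
```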