NVIDIA / gpu-operator

NVIDIA GPU Operator creates/configures/manages GPUs atop Kubernetes
Apache License 2.0

Permissions issues: `initialization error: nvml error: insufficient permissions` #679

Open leobenkel opened 6 months ago

leobenkel commented 6 months ago

Main issue: unable to use the GPU inside minikube due to permission issues.

### 1. Quick Debug Information

```
+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      3888      G   /usr/lib/xorg/Xorg                          416MiB |
|    0   N/A  N/A      4220      G   /usr/bin/gnome-shell                        113MiB |
|    0   N/A  N/A      7233      G   ...irefox/3941/usr/lib/firefox/firefox      476MiB |
|    0   N/A  N/A      8787      G   ...irefox/3941/usr/lib/firefox/firefox      151MiB |
|    0   N/A  N/A      9794      G   ...irefox/3941/usr/lib/firefox/firefox       41MiB |
|    0   N/A  N/A     31467      G   ...sion,SpareRendererForSitePerProcess       71MiB |
|    0   N/A  N/A    116653      G   ...,WinRetrieveSuggestionsOnlyOnDemand       31MiB |
|    0   N/A  N/A    132611    C+G   warp-terminal                                20MiB |
+---------------------------------------------------------------------------------------+
```


### 2. Issue or feature description

Using minikube, Kubernetes, Helm, and the GPU Operator, I am getting:

```
Error: failed to start container "toolkit-validation": Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy' nvidia-container-cli: initialization error: nvml error: insufficient permissions: unknown
```


for the `nvidia-operator-validator` pod.
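
For reference, the same message can be read straight from the failing init container's status instead of scrolling through `kubectl describe` (a small sketch using the pod name from this cluster; adjust as needed):

```sh
# Pull the last termination message of the toolkit-validation init container.
kubectl -n gpu-operator get pod nvidia-operator-validator-ptj47 \
  -o jsonpath='{.status.initContainerStatuses[?(@.name=="toolkit-validation")].lastState.terminated.message}'
```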

### 3. Steps to reproduce the issue

I think I have something broken in my permission/user setup, and I am running out of ideas on how to resolve it.

### 4. Information to [attach](https://help.github.com/articles/file-attachments-on-issues-and-pull-requests/) (optional if deemed irrelevant)

 - [x] kubernetes pods status: `kubectl get pods -n OPERATOR_NAMESPACE`

```
kubectl get pods -n gpu-operator
NAME                                                          READY   STATUS                  RESTARTS        AGE
gpu-feature-discovery-qmktd                                   0/1     Init:0/1                0               9m28s
gpu-operator-574c687b59-pcjwr                                 1/1     Running                 0               10m
gpu-operator-node-feature-discovery-gc-7cc7ccfff8-9vvk8       1/1     Running                 0               10m
gpu-operator-node-feature-discovery-master-d8597d549-qqkpv    1/1     Running                 0               10m
gpu-operator-node-feature-discovery-worker-xcwnx              1/1     Running                 0               10m
nvidia-container-toolkit-daemonset-r8ktc                      1/1     Running                 0               9m28s
nvidia-dcgm-exporter-mhxx4                                    0/1     Init:0/1                0               9m28s
nvidia-device-plugin-daemonset-v79cd                          0/1     Init:0/1                0               9m28s
nvidia-operator-validator-ptj47                               0/1     Init:CrashLoopBackOff   6 (3m28s ago)   9m28s
```


 - [x] kubernetes daemonset status: `kubectl get ds -n OPERATOR_NAMESPACE`

```
kubectl get ds -n gpu-operator
NAME                                         DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                      AGE
gpu-feature-discovery                        1         1         0       1            0           nvidia.com/gpu.deploy.gpu-feature-discovery=true   9m57s
gpu-operator-node-feature-discovery-worker   1         1         1       1            1                                                              10m
nvidia-container-toolkit-daemonset           1         1         1       1            1           nvidia.com/gpu.deploy.container-toolkit=true       9m57s
nvidia-dcgm-exporter                         1         1         0       1            0           nvidia.com/gpu.deploy.dcgm-exporter=true           9m57s
nvidia-device-plugin-daemonset               1         1         0       1            0           nvidia.com/gpu.deploy.device-plugin=true           9m57s
nvidia-driver-daemonset                      0         0         0       0            0           nvidia.com/gpu.deploy.driver=true                  9m57s
nvidia-mig-manager                           0         0         0       0            0           nvidia.com/gpu.deploy.mig-manager=true             9m57s
nvidia-operator-validator                    1         1         0       1            0           nvidia.com/gpu.deploy.operator-validator=true      9m57s
```


 - [x] If a pod/ds is in an error state or pending state `kubectl describe pod -n OPERATOR_NAMESPACE POD_NAME`

```
kubectl describe pod -n gpu-operator nvidia-operator-validator-ptj47
Name:                 nvidia-operator-validator-ptj47
Namespace:            gpu-operator
Priority:             2000001000
Priority Class Name:  system-node-critical
Service Account:      nvidia-operator-validator
Node:                 minikube/192.168.49.2
Start Time:           Mon, 11 Mar 2024 13:46:54 +0100
Labels:               app=nvidia-operator-validator
                      app.kubernetes.io/managed-by=gpu-operator
                      app.kubernetes.io/part-of=gpu-operator
                      controller-revision-hash=74c7484fb6
                      helm.sh/chart=gpu-operator-v23.9.2
                      pod-template-generation=1
Annotations:
Status:               Pending
IP:                   10.244.0.16
IPs:
  IP:  10.244.0.16
Controlled By:  DaemonSet/nvidia-operator-validator
Init Containers:
  driver-validation:
    Container ID:  docker://871f1cc1d632838d5e168db3bfe66f10cba3c84c070366cd36c654955e891f6f
    Image:         nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.2
    Image ID:      docker-pullable://nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:9aefef081c3ab1123556374d2b15d0429f3990af2fbaccc3c9827801e1042703
    Port:
    Host Port:
    Command:
      sh
      -c
    Args:
      nvidia-validator
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Mon, 11 Mar 2024 13:47:03 +0100
      Finished:     Mon, 11 Mar 2024 13:47:03 +0100
    Ready:          True
    Restart Count:  0
    Environment:
      WITH_WAIT:  true
      COMPONENT:  driver
    Mounts:
      /host from host-root (ro)
      /host-dev-char from host-dev-char (rw)
      /run/nvidia/driver from driver-install-path (rw)
      /run/nvidia/validations from run-nvidia-validations (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-cjskm (ro)
  toolkit-validation:
    Container ID:  docker://71762f7b569cd2ceba213aa845fe6c2598cec3889dfdf0902f9ef68f273cf622
    Image:         nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.2
    Image ID:      docker-pullable://nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:9aefef081c3ab1123556374d2b15d0429f3990af2fbaccc3c9827801e1042703
    Port:
    Host Port:
    Command:
      sh
      -c
    Args:
      nvidia-validator
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       ContainerCannotRun
      Message:      failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli.real: initialization error: nvml error: insufficient permissions: unknown
      Exit Code:    128
      Started:      Mon, 11 Mar 2024 13:52:54 +0100
      Finished:     Mon, 11 Mar 2024 13:52:54 +0100
    Ready:          False
    Restart Count:  6
    Environment:
      NVIDIA_VISIBLE_DEVICES:  all
      WITH_WAIT:               false
      COMPONENT:               toolkit
    Mounts:
      /run/nvidia/validations from run-nvidia-validations (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-cjskm (ro)
  cuda-validation:
    Container ID:
    Image:         nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.2
    Image ID:
    Port:
    Host Port:
    Command:
      sh
      -c
    Args:
      nvidia-validator
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Environment:
      WITH_WAIT:                    false
      COMPONENT:                    cuda
      NODE_NAME:                    (v1:spec.nodeName)
      OPERATOR_NAMESPACE:           gpu-operator (v1:metadata.namespace)
      VALIDATOR_IMAGE:              nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.2
      VALIDATOR_IMAGE_PULL_POLICY:  IfNotPresent
    Mounts:
      /run/nvidia/validations from run-nvidia-validations (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-cjskm (ro)
  plugin-validation:
    Container ID:
    Image:         nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.2
    Image ID:
    Port:
    Host Port:
    Command:
      sh
      -c
    Args:
      nvidia-validator
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Environment:
      COMPONENT:                    plugin
      WITH_WAIT:                    false
      WITH_WORKLOAD:                false
      MIG_STRATEGY:                 single
      NODE_NAME:                    (v1:spec.nodeName)
      OPERATOR_NAMESPACE:           gpu-operator (v1:metadata.namespace)
      VALIDATOR_IMAGE:              nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.2
      VALIDATOR_IMAGE_PULL_POLICY:  IfNotPresent
    Mounts:
      /run/nvidia/validations from run-nvidia-validations (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-cjskm (ro)
Containers:
  nvidia-operator-validator:
    Container ID:
    Image:         nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.2
    Image ID:
    Port:
    Host Port:
    Command:
      sh
      -c
    Args:
      echo all validations are successful; sleep infinity
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Environment:
    Mounts:
      /run/nvidia/validations from run-nvidia-validations (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-cjskm (ro)
Conditions:
  Type              Status
  Initialized       False
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  run-nvidia-validations:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia/validations
    HostPathType:  DirectoryOrCreate
  driver-install-path:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia/driver
    HostPathType:
  host-root:
    Type:          HostPath (bare host directory volume)
    Path:          /
    HostPathType:
  host-dev-char:
    Type:          HostPath (bare host directory volume)
    Path:          /dev/char
    HostPathType:
  kube-api-access-cjskm:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              nvidia.com/gpu.deploy.operator-validator=true
Tolerations:                 node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
                             nvidia.com/gpu:NoSchedule op=Exists
Events:
  Type     Reason     Age                    From               Message
  Normal   Scheduled  10m                    default-scheduler  Successfully assigned gpu-operator/nvidia-operator-validator-ptj47 to minikube
  Normal   Pulling    10m                    kubelet            Pulling image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.2"
  Normal   Pulled     10m                    kubelet            Successfully pulled image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.2" in 2.116s (7.99s including waiting)
  Normal   Created    10m                    kubelet            Created container driver-validation
  Normal   Started    10m                    kubelet            Started container driver-validation
  Warning  Failed     10m (x3 over 10m)      kubelet            Error: failed to start container "toolkit-validation": Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy' nvidia-container-cli: initialization error: nvml error: insufficient permissions: unknown
  Normal   Pulled     8m59s (x5 over 10m)    kubelet            Container image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.2" already present on machine
  Normal   Created    8m58s (x5 over 10m)    kubelet            Created container toolkit-validation
  Warning  Failed     8m58s (x2 over 9m49s)  kubelet            Error: failed to start container "toolkit-validation": Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli.real: initialization error: nvml error: insufficient permissions: unknown
  Warning  BackOff    32s (x46 over 10m)     kubelet            Back-off restarting failed container toolkit-validation in pod nvidia-operator-validator-ptj47_gpu-operator(7c2a5005-4339-4674-82c7-244051860212)
```


 - [x] If a pod/ds is in an error state or pending state `kubectl logs -n OPERATOR_NAMESPACE POD_NAME --all-containers`

```
kubectl logs -n gpu-operator nvidia-operator-validator-ptj47
Defaulted container "nvidia-operator-validator" out of: nvidia-operator-validator, driver-validation (init), toolkit-validation (init), cuda-validation (init), plugin-validation (init)
Error from server (BadRequest): container "nvidia-operator-validator" in pod "nvidia-operator-validator-ptj47" is waiting to start: PodInitializing
```


 - [x] Output from running `nvidia-smi` from the driver container: `kubectl exec DRIVER_POD_NAME -n OPERATOR_NAMESPACE -c nvidia-driver-ctr -- nvidia-smi`

Not able to.
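
There is no driver pod to exec into here (`nvidia-driver-daemonset` has DESIRED=0 above, since the driver is installed on the host). As a rough substitute, and purely as a sketch, the driver can be checked directly:

```sh
# Check the driver on the host (output already shown in section 1),
# and inside the minikube node if nvidia-smi is visible there.
nvidia-smi
minikube ssh "nvidia-smi"
```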

 - [x] containerd logs `journalctl -u containerd > containerd.log`

It is huge and does not seem to contain anything relevant. I can post it later if needed.
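
If it helps, I can filter the runtime logs for the relevant lines instead of posting everything. A sketch (note the pods above report `docker://` container IDs, so the `docker` unit may be the more relevant one here):

```sh
# Grep recent container-runtime logs for NVIDIA/NVML-related lines only.
journalctl -u containerd --since "1 hour ago" | grep -iE "nvidia|nvml"
journalctl -u docker --since "1 hour ago" | grep -iE "nvidia|nvml"
```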

## Extra

```
ls -l /dev/nvidia*
crw-rw---- 1 root vglusers 195,   0 Mar 11 11:24 /dev/nvidia0
crw-rw---- 1 root vglusers 195, 255 Mar 11 11:24 /dev/nvidiactl
crw-rw---- 1 root vglusers 195, 254 Mar 11 11:24 /dev/nvidia-modeset
crw-rw-rw- 1 root root     508,   0 Mar 11 11:24 /dev/nvidia-uvm
crw-rw-rw- 1 root root     508,   1 Mar 11 11:24 /dev/nvidia-uvm-tools

/dev/nvidia-caps:
total 0
cr-------- 1 root root 511, 1 Mar 11 11:30 nvidia-cap1
cr--r--r-- 1 root root 511, 2 Mar 11 11:30 nvidia-cap2
```

```
getent group vglusers
vglusers:x:1002:leo,root
```

```
minikube ssh
docker@minikube:~$ ls -l /dev/nvidia*
crw-rw---- 1 root 1002 195, 254 Mar 11 12:40 /dev/nvidia-modeset
crw-rw-rw- 1 root root 508,   0 Mar 11 10:24 /dev/nvidia-uvm
crw-rw-rw- 1 root root 508,   1 Mar 11 10:24 /dev/nvidia-uvm-tools
crw-rw---- 1 root 1002 195,   0 Mar 11 10:24 /dev/nvidia0
crw-rw---- 1 root 1002 195, 255 Mar 11 10:24 /dev/nvidiactl

/dev/nvidia-caps:
total 0
cr-------- 1 root root 511, 1 Mar 11 12:40 nvidia-cap1
cr--r--r-- 1 root root 511, 2 Mar 11 12:40 nvidia-cap2
docker@minikube:~$
```
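
What stands out is that `/dev/nvidia0`, `/dev/nvidiactl` and `/dev/nvidia-modeset` are group-owned by `vglusers` (gid 1002) rather than the usual `video` group, both on the host and inside the minikube node. If the container toolkit's `config.toml` keeps the common `user = "root:video"` setting, `nvidia-container-cli` would not be able to open those device nodes, which would match the `nvml error: insufficient permissions`. A rough way to check this (the config path and the temporary chmod are assumptions on my side, not a verified fix):

```sh
# Inside the minikube node: see which user/group the container CLI is configured to use
# (default toolkit config path shown; the operator-managed toolkit may place it elsewhere).
minikube ssh "grep -n 'user' /etc/nvidia-container-runtime/config.toml"

# Temporary test only: open up the device nodes, then let the validator pod restart.
# If the validator then passes, the group ownership is the culprit.
minikube ssh "sudo chmod 0666 /dev/nvidia0 /dev/nvidiactl /dev/nvidia-modeset"
kubectl -n gpu-operator delete pod nvidia-operator-validator-ptj47
```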

leobenkel commented 6 months ago

More information:

```
docker run -it --rm --privileged -e DISPLAY=$DISPLAY --runtime=nvidia --gpus all -v /tmp/.X11-unix:/tmp/.X11-unix nvidia/cuda:11.6.2-base-ubuntu20.04 bash
Unable to find image 'nvidia/cuda:11.6.2-base-ubuntu20.04' locally
11.6.2-base-ubuntu20.04: Pulling from nvidia/cuda
96d54c3075c9: Already exists
a3d20efe6db8: Pull complete
bfdf8ce43b67: Pull complete
ad14f66bfcf9: Pull complete
1056ff735c59: Pull complete
Digest: sha256:a0dd581afdbf82ea9887dd077aebf9723aba58b51ae89acb4c58b8705b74179b
Status: Downloaded newer image for nvidia/cuda:11.6.2-base-ubuntu20.04
root@3e5d5a4aa5f2:/# nvidia-smi
Fri Mar 15 08:40:49 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3070 ...    Off | 00000000:01:00.0 Off |                  N/A |
| N/A   55C    P0              33W /  80W |   2624MiB /  8192MiB |     55%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
+---------------------------------------------------------------------------------------+
root@3e5d5a4aa5f2:/#
```
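
Note that this run uses `--privileged`, so it may not go through the same permission checks as the unprivileged validator container. As a sanity check (a sketch, not a fix), the same image can be retried without `--privileged`; if it then fails with the same `nvml error: insufficient permissions`, that points back at the `vglusers` group ownership of `/dev/nvidia*` rather than at the GPU Operator itself:

```sh
# Same image, but without --privileged, to see whether plain device-node
# permissions are enough for NVML outside Kubernetes.
docker run --rm --runtime=nvidia --gpus all \
  nvidia/cuda:11.6.2-base-ubuntu20.04 nvidia-smi
```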