leobenkel opened this issue 6 months ago

More information:
docker run -it --rm --privileged -e DISPLAY=$DISPLAY --runtime=nvidia --gpus all -v /tmp/.X11-unix:/tmp/.X11-unix nvidia/cuda:11.6.2-base-ubuntu20.04 bash
Unable to find image 'nvidia/cuda:11.6.2-base-ubuntu20.04' locally
11.6.2-base-ubuntu20.04: Pulling from nvidia/cuda
96d54c3075c9: Already exists
a3d20efe6db8: Pull complete
bfdf8ce43b67: Pull complete
ad14f66bfcf9: Pull complete
1056ff735c59: Pull complete
Digest: sha256:a0dd581afdbf82ea9887dd077aebf9723aba58b51ae89acb4c58b8705b74179b
Status: Downloaded newer image for nvidia/cuda:11.6.2-base-ubuntu20.04
root@3e5d5a4aa5f2:/# nvidia-smi
Fri Mar 15 08:40:49 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3070 ...    Off | 00000000:01:00.0 Off |                  N/A |
| N/A   55C    P0             33W /  80W  |   2624MiB /  8192MiB |      55%     Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
+---------------------------------------------------------------------------------------+
root@3e5d5a4aa5f2:/#
Main issue: the GPU cannot be used inside minikube because of device-permission errors (nvml error: insufficient permissions).
1. Quick Debug Information
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      3888      G   /usr/lib/xorg/Xorg                          416MiB |
|    0   N/A  N/A      4220      G   /usr/bin/gnome-shell                        113MiB |
|    0   N/A  N/A      7233      G   ...irefox/3941/usr/lib/firefox/firefox      476MiB |
|    0   N/A  N/A      8787      G   ...irefox/3941/usr/lib/firefox/firefox      151MiB |
|    0   N/A  N/A      9794      G   ...irefox/3941/usr/lib/firefox/firefox       41MiB |
|    0   N/A  N/A     31467      G   ...sion,SpareRendererForSitePerProcess       71MiB |
|    0   N/A  N/A    116653      G   ...,WinRetrieveSuggestionsOnlyOnDemand       31MiB |
|    0   N/A  N/A    132611    C+G   warp-terminal                                20MiB |
+---------------------------------------------------------------------------------------+
Error: failed to start container "toolkit-validation": Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy' nvidia-container-cli: initialization error: nvml error: insufficient permissions: unknown
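The same toolkit failure can be checked without going through the operator by running the CUDA image directly against the minikube node's Docker daemon. This is a diagnostic sketch of mine, not part of any official tooling; the image tag matches the one used above, and the grep pattern is just a convenience:

```shell
# Diagnostic sketch: exercise the same nvidia-container-cli hook that the
# toolkit-validation init container crashes on, but on the minikube node's
# own Docker daemon. If the hook fails, the error surfaces before any
# nvidia-smi output is printed.
if minikube ssh -- docker run --rm --gpus all \
     nvidia/cuda:11.6.2-base-ubuntu20.04 nvidia-smi \
     | grep -q 'Driver Version'; then
  echo "GPU reachable from inside the minikube node"
else
  echo "same nvml/permission failure as the validator pod"
fi
```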
kubectl get pods -n gpu-operator
NAME                                                         READY   STATUS                  RESTARTS        AGE
gpu-feature-discovery-qmktd                                  0/1     Init:0/1                0               9m28s
gpu-operator-574c687b59-pcjwr                                1/1     Running                 0               10m
gpu-operator-node-feature-discovery-gc-7cc7ccfff8-9vvk8      1/1     Running                 0               10m
gpu-operator-node-feature-discovery-master-d8597d549-qqkpv   1/1     Running                 0               10m
gpu-operator-node-feature-discovery-worker-xcwnx             1/1     Running                 0               10m
nvidia-container-toolkit-daemonset-r8ktc                     1/1     Running                 0               9m28s
nvidia-dcgm-exporter-mhxx4                                   0/1     Init:0/1                0               9m28s
nvidia-device-plugin-daemonset-v79cd                         0/1     Init:0/1                0               9m28s
nvidia-operator-validator-ptj47                              0/1     Init:CrashLoopBackOff   6 (3m28s ago)   9m28s
kubectl get ds -n gpu-operator
NAME                                         DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                       AGE
gpu-feature-discovery                        1         1         0       1            0           nvidia.com/gpu.deploy.gpu-feature-discovery=true    9m57s
gpu-operator-node-feature-discovery-worker   1         1         1       1            1           <none>                                              10m
nvidia-container-toolkit-daemonset           1         1         1       1            1           nvidia.com/gpu.deploy.container-toolkit=true        9m57s
nvidia-dcgm-exporter                         1         1         0       1            0           nvidia.com/gpu.deploy.dcgm-exporter=true            9m57s
nvidia-device-plugin-daemonset               1         1         0       1            0           nvidia.com/gpu.deploy.device-plugin=true            9m57s
nvidia-driver-daemonset                      0         0         0       0            0           nvidia.com/gpu.deploy.driver=true                   9m57s
nvidia-mig-manager                           0         0         0       0            0           nvidia.com/gpu.deploy.mig-manager=true              9m57s
nvidia-operator-validator                    1         1         0       1            0           nvidia.com/gpu.deploy.operator-validator=true       9m57s
kubectl describe pod -n gpu-operator nvidia-operator-validator-ptj47
Name:                 nvidia-operator-validator-ptj47
Namespace:            gpu-operator
Priority:             2000001000
Priority Class Name:  system-node-critical
Service Account:      nvidia-operator-validator
Node:                 minikube/192.168.49.2
Start Time:           Mon, 11 Mar 2024 13:46:54 +0100
Labels:               app=nvidia-operator-validator
                      app.kubernetes.io/managed-by=gpu-operator
                      app.kubernetes.io/part-of=gpu-operator
                      controller-revision-hash=74c7484fb6
                      helm.sh/chart=gpu-operator-v23.9.2
                      pod-template-generation=1
Annotations:
Status: Pending
IP: 10.244.0.16
IPs:
IP: 10.244.0.16
Controlled By: DaemonSet/nvidia-operator-validator
Init Containers:
driver-validation:
Container ID: docker://871f1cc1d632838d5e168db3bfe66f10cba3c84c070366cd36c654955e891f6f
Image: nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.2
Image ID: docker-pullable://nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:9aefef081c3ab1123556374d2b15d0429f3990af2fbaccc3c9827801e1042703
Port:          <none>
Host Port:     <none>
Command:
sh
-c
Args:
nvidia-validator
State: Terminated
Reason: Completed
Exit Code: 0
Started: Mon, 11 Mar 2024 13:47:03 +0100
Finished: Mon, 11 Mar 2024 13:47:03 +0100
Ready: True
Restart Count: 0
Environment:
WITH_WAIT: true
COMPONENT: driver
Mounts:
/host from host-root (ro)
/host-dev-char from host-dev-char (rw)
/run/nvidia/driver from driver-install-path (rw)
/run/nvidia/validations from run-nvidia-validations (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-cjskm (ro)
toolkit-validation:
Container ID: docker://71762f7b569cd2ceba213aa845fe6c2598cec3889dfdf0902f9ef68f273cf622
Image: nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.2
Image ID: docker-pullable://nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:9aefef081c3ab1123556374d2b15d0429f3990af2fbaccc3c9827801e1042703
Port:          <none>
Host Port:     <none>
Command:
sh
-c
Args:
nvidia-validator
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: ContainerCannotRun
Message: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli.real: initialization error: nvml error: insufficient permissions: unknown
Exit Code: 128
Started: Mon, 11 Mar 2024 13:52:54 +0100
Finished: Mon, 11 Mar 2024 13:52:54 +0100
Ready: False
Restart Count: 6
Environment:
NVIDIA_VISIBLE_DEVICES: all
WITH_WAIT: false
COMPONENT: toolkit
Mounts:
/run/nvidia/validations from run-nvidia-validations (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-cjskm (ro)
cuda-validation:
Container ID:
Image: nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.2
Image ID:
Port:          <none>
Host Port:     <none>
Command:
sh
-c
Args:
nvidia-validator
State: Waiting
Reason: PodInitializing
Ready: False
Restart Count: 0
Environment:
WITH_WAIT: false
COMPONENT: cuda
NODE_NAME: (v1:spec.nodeName)
OPERATOR_NAMESPACE: gpu-operator (v1:metadata.namespace)
VALIDATOR_IMAGE: nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.2
VALIDATOR_IMAGE_PULL_POLICY: IfNotPresent
Mounts:
/run/nvidia/validations from run-nvidia-validations (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-cjskm (ro)
plugin-validation:
Container ID:
Image: nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.2
Image ID:
Port:          <none>
Host Port:     <none>
Command:
sh
-c
Args:
nvidia-validator
State: Waiting
Reason: PodInitializing
Ready: False
Restart Count: 0
Environment:
COMPONENT: plugin
WITH_WAIT: false
WITH_WORKLOAD: false
MIG_STRATEGY: single
NODE_NAME: (v1:spec.nodeName)
OPERATOR_NAMESPACE: gpu-operator (v1:metadata.namespace)
VALIDATOR_IMAGE: nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.2
VALIDATOR_IMAGE_PULL_POLICY: IfNotPresent
Mounts:
/run/nvidia/validations from run-nvidia-validations (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-cjskm (ro)
Containers:
nvidia-operator-validator:
Container ID:
Image: nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.2
Image ID:
Port:          <none>
Host Port:     <none>
Command:
sh
-c
Args:
echo all validations are successful; sleep infinity
State: Waiting
Reason: PodInitializing
Ready: False
Restart Count: 0
Environment:
Mounts:
/run/nvidia/validations from run-nvidia-validations (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-cjskm (ro)
Conditions:
Type Status
Initialized False
Ready False
ContainersReady False
PodScheduled True
Volumes:
run-nvidia-validations:
Type: HostPath (bare host directory volume)
Path: /run/nvidia/validations
HostPathType: DirectoryOrCreate
driver-install-path:
Type: HostPath (bare host directory volume)
Path: /run/nvidia/driver
HostPathType:
host-root:
Type: HostPath (bare host directory volume)
Path: /
HostPathType:
host-dev-char:
Type: HostPath (bare host directory volume)
Path: /dev/char
HostPathType:
kube-api-access-cjskm:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional:
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: nvidia.com/gpu.deploy.operator-validator=true
Tolerations: node.kubernetes.io/disk-pressure:NoSchedule op=Exists
node.kubernetes.io/memory-pressure:NoSchedule op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists
node.kubernetes.io/pid-pressure:NoSchedule op=Exists
node.kubernetes.io/unreachable:NoExecute op=Exists
node.kubernetes.io/unschedulable:NoSchedule op=Exists
nvidia.com/gpu:NoSchedule op=Exists
Events:
  Type     Reason     Age                    From               Message
  ----     ------     ----                   ----               -------
  Normal   Scheduled  10m                    default-scheduler  Successfully assigned gpu-operator/nvidia-operator-validator-ptj47 to minikube
  Normal   Pulling    10m                    kubelet            Pulling image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.2"
  Normal   Pulled     10m                    kubelet            Successfully pulled image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.2" in 2.116s (7.99s including waiting)
  Normal   Created    10m                    kubelet            Created container driver-validation
  Normal   Started    10m                    kubelet            Started container driver-validation
  Warning  Failed     10m (x3 over 10m)      kubelet            Error: failed to start container "toolkit-validation": Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy' nvidia-container-cli: initialization error: nvml error: insufficient permissions: unknown
  Normal   Pulled     8m59s (x5 over 10m)    kubelet            Container image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.2" already present on machine
  Normal   Created    8m58s (x5 over 10m)    kubelet            Created container toolkit-validation
  Warning  Failed     8m58s (x2 over 9m49s)  kubelet            Error: failed to start container "toolkit-validation": Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli.real: initialization error: nvml error: insufficient permissions: unknown
  Warning  BackOff    32s (x46 over 10m)     kubelet            Back-off restarting failed container toolkit-validation in pod nvidia-operator-validator-ptj47_gpu-operator(7c2a5005-4339-4674-82c7-244051860212)
kubectl logs -n gpu-operator nvidia-operator-validator-ptj47
Defaulted container "nvidia-operator-validator" out of: nvidia-operator-validator, driver-validation (init), toolkit-validation (init), cuda-validation (init), plugin-validation (init)
Error from server (BadRequest): container "nvidia-operator-validator" in pod "nvidia-operator-validator-ptj47" is waiting to start: PodInitializing
ls -l /dev/nvidia*
crw-rw---- 1 root vglusers 195,   0 Mar 11 11:24 /dev/nvidia0
crw-rw---- 1 root vglusers 195, 255 Mar 11 11:24 /dev/nvidiactl
crw-rw---- 1 root vglusers 195, 254 Mar 11 11:24 /dev/nvidia-modeset
crw-rw-rw- 1 root root     508,   0 Mar 11 11:24 /dev/nvidia-uvm
crw-rw-rw- 1 root root     508,   1 Mar 11 11:24 /dev/nvidia-uvm-tools
/dev/nvidia-caps:
total 0
cr-------- 1 root root 511, 1 Mar 11 11:30 nvidia-cap1
cr--r--r-- 1 root root 511, 2 Mar 11 11:30 nvidia-cap2
getent group vglusers
vglusers:x:1002:leo,root
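The getent output above shows the device nodes are gated by GID 1002 (vglusers) rather than being world-readable (NVIDIA's usual default is mode 0666, typically controlled by the NVreg_DeviceFileMode kernel module option). With crw-rw---- root:vglusers, a non-root process gets no access at all unless its supplementary groups include that GID, which is consistent with the insufficient-permissions failure. A small sketch of checking this (the group name comes from the listing above):

```shell
# Sketch: resolve the GID that gates the NVIDIA device nodes and check
# whether the current user holds it. "vglusers"/1002 come from the
# getent and ls output in this issue.
gid=$(getent group vglusers | cut -d: -f3)
echo "vglusers GID: ${gid}"   # 1002 on this host

# Group-level rw on crw-rw---- root:vglusers applies only to processes
# whose supplementary groups include that GID:
if id -nG | tr ' ' '\n' | grep -qx vglusers; then
  echo "current user is in vglusers"
else
  echo "current user is NOT in vglusers"
fi
```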
minikube ssh
docker@minikube:~$ ls -l /dev/nvidia*
crw-rw---- 1 root 1002 195, 254 Mar 11 12:40 /dev/nvidia-modeset
crw-rw-rw- 1 root root 508,   0 Mar 11 10:24 /dev/nvidia-uvm
crw-rw-rw- 1 root root 508,   1 Mar 11 10:24 /dev/nvidia-uvm-tools
crw-rw---- 1 root 1002 195,   0 Mar 11 10:24 /dev/nvidia0
crw-rw---- 1 root 1002 195, 255 Mar 11 10:24 /dev/nvidiactl
/dev/nvidia-caps:
total 0
cr-------- 1 root root 511, 1 Mar 11 12:40 nvidia-cap1
cr--r--r-- 1 root root 511, 2 Mar 11 12:40 nvidia-cap2
docker@minikube:~$
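Since the same root:1002, mode-660 nodes appear inside the minikube node, one thing worth trying (an assumption on my part, reasonable only on a single-user workstation, not a confirmed fix) is loosening the device-node permissions on the host and recreating the cluster so the relaxed modes propagate:

```shell
# Workaround sketch (assumption: single-user workstation where world-rw
# NVIDIA device nodes are acceptable). /dev/nvidia-uvm* are already 0666
# in the listings above; only the vglusers-gated nodes change.
stat -c '%a %G %n' /dev/nvidia0 /dev/nvidiactl /dev/nvidia-modeset
sudo chmod 0666 /dev/nvidia0 /dev/nvidiactl /dev/nvidia-modeset

# Recreate the minikube node so the container runtime sees the new modes
# (the --gpus flag is available in recent minikube releases):
minikube delete
minikube start --driver=docker --gpus=all
```

Note that a chmod does not survive a reboot or driver reload; if it helps, the permanent equivalent would be the driver's device-file-mode module option or a udev rule.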