ayuzzz opened this issue 1 year ago (status: Open)
@ayuzzz we currently don't support driver containers for Debian 11. You would need to pre-install the NVIDIA driver until we add support for it. The GPU Operator will detect the pre-installed driver and automatically disable the driver pod on the node. If that is not working, please paste the output of nvidia-smi from the node and the logs of the init containers from the toolkit pod.
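For reference, a minimal sketch of that workaround and of the requested diagnostics, assuming the chart is installed into the nvidia namespace seen in the logs below (the release and repo names are assumptions, not taken from this issue):

# With the driver pre-installed on the Debian 11 host, the driver container
# can also be disabled explicitly at install time:
helm install gpu-operator nvidia/gpu-operator -n nvidia --create-namespace --set driver.enabled=false

# Driver status on the node:
nvidia-smi

# Logs of the toolkit pod's init container (the init container name may differ by release):
kubectl logs -n nvidia nvidia-container-toolkit-daemonset-md8j7 -c driver-validation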
1. Quick Debug Checklist
2. Issue or feature description
On a Debian 11 based node, I have been trying to deploy the gpu-operator Helm chart, but it keeps expecting a pre-installed driver. My end goal is to deploy all the gpu-operator operands, along with the driver, using the gpu-operator Helm chart.
I have even tried pre-installing the drivers from the Debian packages, but even that is somehow not installing the legacy nvidia-driver correctly, as the gpu-operator operand pods keep complaining that no driver was detected. I am using an A10 GPU.
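For context, pre-installing the driver from the Debian 11 (bullseye) packages typically looks roughly like the sketch below; this assumes the non-free repository is used, and the exact driver version it provides may differ from what the driver container tries to install:

# Enable the contrib and non-free components (adjust to your sources.list layout)
sudo sed -i 's/ main$/ main contrib non-free/' /etc/apt/sources.list
sudo apt update

# Kernel headers matching the running kernel are needed for the DKMS build
sudo apt install -y linux-headers-$(uname -r)

# Install the NVIDIA driver meta-package from non-free
sudo apt install -y nvidia-driver firmware-misc-nonfree

# Reboot so nouveau is blacklisted and the NVIDIA module loads, then verify with nvidia-smi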
3. Steps to reproduce the issue
4. Information to attach (optional if deemed irrelevant)
[ ] kubernetes pods status:
kubectl get pods --all-namespaces
NAME                                                              READY   STATUS             RESTARTS       AGE
gpu-feature-discovery-rnkr9                                       0/1     Init:0/1           0              8m35s
gpu-operator-66bc844599-br48s                                     1/1     Running            0              9m32s
helm-install-nvidiagpuoperatorchart-wshsn                         0/1     Completed          0              9m39s
nvidia-container-toolkit-daemonset-md8j7                          0/1     Init:0/1           0              8m35s
nvidia-dcgm-exporter-d52gq                                        0/1     Init:0/1           0              8m35s
nvidia-device-plugin-daemonset-hp5rt                              0/1     Init:0/1           0              8m35s
nvidia-driver-daemonset-5xf74                                     0/1     CrashLoopBackOff   6 (104s ago)   9m5s
nvidia-operator-validator-rnst6                                   0/1     Init:0/4           0              8m35s
nvidiagpuoperatorchart-node-feature-discovery-master-56d86swqbx   1/1     Running            0              9m32s
nvidiagpuoperatorchart-node-feature-discovery-worker-gf96t        1/1     Running            0              9m33s
[ ] kubernetes daemonset status:
kubectl get ds --all-namespaces
NAME                                                   DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                      AGE
gpu-feature-discovery                                  1         1         0       1            0           nvidia.com/gpu.deploy.gpu-feature-discovery=true   10m
nvidia-container-toolkit-daemonset                     1         1         0       1            0           nvidia.com/gpu.deploy.container-toolkit=true       10m
nvidia-dcgm-exporter                                   1         1         0       1            0           nvidia.com/gpu.deploy.dcgm-exporter=true           10m
nvidia-device-plugin-daemonset                         1         1         0       1            0           nvidia.com/gpu.deploy.device-plugin=true           10m
nvidia-driver-daemonset                                1         1         0       1            0           nvidia.com/gpu.deploy.driver=true                  10m
nvidia-operator-validator                              1         1         0       1            0           nvidia.com/gpu.deploy.operator-validator=true      10m
nvidiagpuoperatorchart-node-feature-discovery-worker   1         1         1       1            1
[ ] If a pod/ds is in an error state or pending state
kubectl logs -n NAMESPACE POD_NAME
kubectl logs nvidia-driver-daemonset-5xf74 -n nvidia
Defaulted container "nvidia-driver-ctr" out of: nvidia-driver-ctr, k8s-driver-manager (init)
========== NVIDIA Software Installer ==========
Starting installation of NVIDIA driver branch 535 for Linux kernel version 5.10.0-23-amd64
Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
Installing NVIDIA driver kernel modules...
Reading package lists...
Building dependency tree...
Reading state information...
E: Unable to locate package linux-objects-nvidia-535-server-5.10.0-23-amd64
E: Couldn't find any package by glob 'linux-objects-nvidia-535-server-5.10.0-23-amd64'
E: Couldn't find any package by regex 'linux-objects-nvidia-535-server-5.10.0-23-amd64'
E: Unable to locate package linux-signatures-nvidia-5.10.0-23-amd64
E: Couldn't find any package by glob 'linux-signatures-nvidia-5.10.0-23-amd64'
E: Couldn't find any package by regex 'linux-signatures-nvidia-5.10.0-23-amd64'
E: Unable to locate package linux-modules-nvidia-535-server-5.10.0-23-amd64
E: Couldn't find any package by glob 'linux-modules-nvidia-535-server-5.10.0-23-amd64'
E: Couldn't find any package by regex 'linux-modules-nvidia-535-server-5.10.0-23-amd64'
Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
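The packages that fail above are the precompiled kernel-module packages the driver container looks for; as a quick check (a hypothetical verification step, run on the node or inside the driver container), they are not published for Debian's 5.10 kernel:

# No precompiled NVIDIA kernel-module packages exist for this kernel on Debian 11
apt-cache search linux-modules-nvidia
apt-cache policy linux-objects-nvidia-535-server-5.10.0-23-amd64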
kubectl logs nvidia-driver-daemonset-5xf74 -n nvidia -c k8s-driver-manager
Getting current value of the 'nvidia.com/gpu.deploy.operator-validator' node label
Current value of 'nvidia.com/gpu.deploy.operator-validator=true'
Getting current value of the 'nvidia.com/gpu.deploy.container-toolkit' node label
Current value of 'nvidia.com/gpu.deploy.container-toolkit=true'
Getting current value of the 'nvidia.com/gpu.deploy.device-plugin' node label
Current value of 'nvidia.com/gpu.deploy.device-plugin=true'
Getting current value of the 'nvidia.com/gpu.deploy.gpu-feature-discovery' node label
Current value of 'nvidia.com/gpu.deploy.gpu-feature-discovery=true'
Getting current value of the 'nvidia.com/gpu.deploy.dcgm-exporter' node label
Current value of 'nvidia.com/gpu.deploy.dcgm-exporter=true'
Getting current value of the 'nvidia.com/gpu.deploy.dcgm' node label
Current value of 'nvidia.com/gpu.deploy.dcgm=true'
Getting current value of the 'nvidia.com/gpu.deploy.mig-manager' node label
Current value of 'nvidia.com/gpu.deploy.mig-manager='
Getting current value of the 'nvidia.com/gpu.deploy.nvsm' node label
Current value of 'nvidia.com/gpu.deploy.nvsm='
Getting current value of the 'nvidia.com/gpu.deploy.sandbox-validator' node label
Current value of 'nvidia.com/gpu.deploy.sandbox-validator='
Getting current value of the 'nvidia.com/gpu.deploy.sandbox-device-plugin' node label
Current value of 'nvidia.com/gpu.deploy.sandbox-device-plugin='
Getting current value of the 'nvidia.com/gpu.deploy.vgpu-device-manager' node label
Current value of 'nvidia.com/gpu.deploy.vgpu-device-manager='
Getting current value of the 'nodeType' node label (used by NVIDIA Fleet Command)
Current value of 'nodeType='
Current value of AUTO_UPGRADE_POLICY_ENABLED='true'
Shutting down all GPU clients on the current node by disabling their component-specific nodeSelector labels
node/edgelinworker1 labeled
Waiting for the operator-validator to shutdown
pod/nvidia-operator-validator-85z6b condition met
Waiting for the container-toolkit to shutdown
pod/nvidia-container-toolkit-daemonset-h7bbj condition met
Waiting for the device-plugin to shutdown
Waiting for gpu-feature-discovery to shutdown
Waiting for dcgm-exporter to shutdown
Waiting for dcgm to shutdown
Auto eviction of GPU pods on node edgelinworker1 is disabled by the upgrade policy
unbinding device 6565:00:00.0
Auto eviction of GPU pods on node edgelinworker1 is disabled by the upgrade policy
Auto drain of the node edgelinworker1 is disabled by the upgrade policy
Rescheduling all GPU clients on the current node by enabling their component-specific nodeSelector labels
node/edgelinworker1 labeled
Unloading nouveau driver...
Successfully unloaded nouveau driver
[ ] NVIDIA shared directory:
ls -la /run/nvidia
total 0
drwxr-xr-x  4 root root  80 Jul 24 07:12 .
drwxr-xr-x 32 root root 820 Jul 24 07:13 ..
drwxr-xr-x  2 root root  40 Jul 24 07:09 driver
drwxr-xr-x  2 root root  40 Jul 24 07:09 validations
[ ] NVIDIA packages directory:
ls -la /usr/local/nvidia/toolkit
ls: cannot access '/usr/local/nvidia/toolkit': No such file or directory
[ ] NVIDIA driver directory:
ls -la /run/nvidia/driver
total 0
drwxr-xr-x 2 root root 40 Jul 24 07:09 .
drwxr-xr-x 4 root root 80 Jul 24 07:13 ..