NVIDIA / gpu-operator

NVIDIA GPU Operator creates, configures, and manages GPUs in Kubernetes
https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/index.html
Apache License 2.0

Need support to deploy the GPU Operator & driver chart on a Debian 11 based node #558

Open ayuzzz opened 1 year ago

ayuzzz commented 1 year ago

1. Quick Debug Checklist

1. Issue or feature description

On a Debian 11 based node, I have been trying to deploy the gpu-operator Helm chart, but it keeps requiring a pre-installed driver. My end goal is to deploy all of the gpu-operator operands, including the driver, via the gpu-operator Helm chart.
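
For reference, the deployment itself was roughly the standard chart install from the GPU Operator docs (a sketch; the repo alias, release name, and namespace below are illustrative rather than a verbatim record of what I ran):

```bash
# Add the NVIDIA Helm repository and install the GPU Operator with default
# values, which include the containerized driver (driver.enabled=true).
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install --wait gpu-operator \
    nvidia/gpu-operator \
    -n gpu-operator --create-namespace
```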

I have even tried pre-installing the drivers from the Debian packages, but even that somehow does not install the legacy nvidia-driver correctly, and the gpu-operator operand pods keep complaining that no driver was detected. I am using an A10 GPU.
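
For context, the manual pre-install attempt followed the usual Debian 11 route with the packaged driver (a sketch, assuming the contrib and non-free components are already enabled in /etc/apt/sources.list; not a verbatim log of what I ran):

```bash
# Install headers for the running kernel plus Debian's proprietary driver
# metapackage, then reboot and check that the driver comes up.
sudo apt-get update
sudo apt-get install -y linux-headers-$(uname -r) nvidia-driver
sudo reboot
# after the reboot:
nvidia-smi
```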

2. Steps to reproduce the issue

3. Information to attach (optional if deemed irrelevant)


```
NAME                                                              READY   STATUS             RESTARTS       AGE
gpu-feature-discovery-rnkr9                                       0/1     Init:0/1           0              8m35s
gpu-operator-66bc844599-br48s                                     1/1     Running            0              9m32s
helm-install-nvidiagpuoperatorchart-wshsn                         0/1     Completed          0              9m39s
nvidia-container-toolkit-daemonset-md8j7                          0/1     Init:0/1           0              8m35s
nvidia-dcgm-exporter-d52gq                                        0/1     Init:0/1           0              8m35s
nvidia-device-plugin-daemonset-hp5rt                              0/1     Init:0/1           0              8m35s
nvidia-driver-daemonset-5xf74                                     0/1     CrashLoopBackOff   6 (104s ago)   9m5s
nvidia-operator-validator-rnst6                                   0/1     Init:0/4           0              8m35s
nvidiagpuoperatorchart-node-feature-discovery-master-56d86swqbx   1/1     Running            0              9m32s
nvidiagpuoperatorchart-node-feature-discovery-worker-gf96t        1/1     Running            0              9m33s
```

```
kubectl logs nvidia-driver-daemonset-5xf74 -n nvidia
Defaulted container "nvidia-driver-ctr" out of: nvidia-driver-ctr, k8s-driver-manager (init)

========== NVIDIA Software Installer ==========

Starting installation of NVIDIA driver branch 535 for Linux kernel version 5.10.0-23-amd64

Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
Installing NVIDIA driver kernel modules...
Reading package lists...
Building dependency tree...
Reading state information...
E: Unable to locate package linux-objects-nvidia-535-server-5.10.0-23-amd64
E: Couldn't find any package by glob 'linux-objects-nvidia-535-server-5.10.0-23-amd64'
E: Couldn't find any package by regex 'linux-objects-nvidia-535-server-5.10.0-23-amd64'
E: Unable to locate package linux-signatures-nvidia-5.10.0-23-amd64
E: Couldn't find any package by glob 'linux-signatures-nvidia-5.10.0-23-amd64'
E: Couldn't find any package by regex 'linux-signatures-nvidia-5.10.0-23-amd64'
E: Unable to locate package linux-modules-nvidia-535-server-5.10.0-23-amd64
E: Couldn't find any package by glob 'linux-modules-nvidia-535-server-5.10.0-23-amd64'
E: Couldn't find any package by regex 'linux-modules-nvidia-535-server-5.10.0-23-amd64'
Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
```
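
The apt errors above show the driver container asking for Ubuntu-style precompiled module packages (linux-objects-nvidia-535-server-*, linux-modules-nvidia-535-server-*) for the Debian kernel 5.10.0-23-amd64; as far as I can tell those packages are only published for Ubuntu kernels, so the lookup fails and the DaemonSet crash-loops. This is easy to confirm from the node (a sketch; the package names are taken from the log above):

```bash
# These searches come back empty on Debian 11, since the precompiled NVIDIA
# kernel-module packages exist only for Ubuntu kernels.
apt-cache search linux-objects-nvidia-535-server
apt-cache search linux-modules-nvidia-535-server

# Kernel the driver container is trying to match (as in the log above).
uname -r    # 5.10.0-23-amd64
```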

```
kubectl logs nvidia-driver-daemonset-5xf74 -n nvidia -c k8s-driver-manager
Getting current value of the 'nvidia.com/gpu.deploy.operator-validator' node label
Current value of 'nvidia.com/gpu.deploy.operator-validator=true'
Getting current value of the 'nvidia.com/gpu.deploy.container-toolkit' node label
Current value of 'nvidia.com/gpu.deploy.container-toolkit=true'
Getting current value of the 'nvidia.com/gpu.deploy.device-plugin' node label
Current value of 'nvidia.com/gpu.deploy.device-plugin=true'
Getting current value of the 'nvidia.com/gpu.deploy.gpu-feature-discovery' node label
Current value of 'nvidia.com/gpu.deploy.gpu-feature-discovery=true'
Getting current value of the 'nvidia.com/gpu.deploy.dcgm-exporter' node label
Current value of 'nvidia.com/gpu.deploy.dcgm-exporter=true'
Getting current value of the 'nvidia.com/gpu.deploy.dcgm' node label
Current value of 'nvidia.com/gpu.deploy.dcgm=true'
Getting current value of the 'nvidia.com/gpu.deploy.mig-manager' node label
Current value of 'nvidia.com/gpu.deploy.mig-manager='
Getting current value of the 'nvidia.com/gpu.deploy.nvsm' node label
Current value of 'nvidia.com/gpu.deploy.nvsm='
Getting current value of the 'nvidia.com/gpu.deploy.sandbox-validator' node label
Current value of 'nvidia.com/gpu.deploy.sandbox-validator='
Getting current value of the 'nvidia.com/gpu.deploy.sandbox-device-plugin' node label
Current value of 'nvidia.com/gpu.deploy.sandbox-device-plugin='
Getting current value of the 'nvidia.com/gpu.deploy.vgpu-device-manager' node label
Current value of 'nvidia.com/gpu.deploy.vgpu-device-manager='
Getting current value of the 'nodeType' node label(used by NVIDIA Fleet Command)
Current value of 'nodeType='
Current value of AUTO_UPGRADE_POLICY_ENABLED=true'
Shutting down all GPU clients on the current node by disabling their component-specific nodeSelector labels
node/edgelinworker1 labeled
Waiting for the operator-validator to shutdown
pod/nvidia-operator-validator-85z6b condition met
Waiting for the container-toolkit to shutdown
pod/nvidia-container-toolkit-daemonset-h7bbj condition met
Waiting for the device-plugin to shutdown
Waiting for gpu-feature-discovery to shutdown
Waiting for dcgm-exporter to shutdown
Waiting for dcgm to shutdown
Auto eviction of GPU pods on node edgelinworker1 is disabled by the upgrade policy
unbinding device 6565:00:00.0
Auto eviction of GPU pods on node edgelinworker1 is disabled by the upgrade policy
Auto drain of the node edgelinworker1 is disabled by the upgrade policy
Rescheduling all GPU clients on the current node by enabling their component-specific nodeSelector labels
node/edgelinworker1 labeled
Unloading nouveau driver...
Successfully unloaded nouveau driver
```
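
In case it is useful, the nvidia.com/gpu.deploy.* selector labels that the driver manager toggles above can be inspected directly on the node (a sketch; edgelinworker1 is the node name from the log):

```bash
# Show only the GPU Operator component-selector labels on the worker node.
kubectl get node edgelinworker1 --show-labels | tr ',' '\n' | grep 'nvidia.com/gpu.deploy'
```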

shivamerla commented 1 year ago

@ayuzzz we currently don't support driver containers for Debian 11. You would need to pre-install the NVIDIA driver until we add support for it. The GPU Operator will detect the pre-installed driver and automatically disable the driver pod on that node. If that is not working, please paste the output of nvidia-smi from the node and the logs of the init containers from the toolkit pod.
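
For example, once the host driver is working you can also explicitly disable the driver DaemonSet through the chart (a sketch using the documented driver.enabled value; the release name and namespace are placeholders):

```bash
# Confirm the host driver is loaded first.
nvidia-smi

# Deploy (or upgrade) the GPU Operator without the containerized driver,
# relying on the driver pre-installed on the host.
helm upgrade --install --wait gpu-operator \
    nvidia/gpu-operator \
    -n gpu-operator --create-namespace \
    --set driver.enabled=false
```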