Need support to be able to deploy Gpu-Operator & driver chart on debian11 based node

1. Quick Debug Checklist

Node OS - Debian 11, Kernel 5.1
Kubernetes version - 1.26.0+rke2r2 (RKE2)
Container runtime - Containerd

1. Issue or feature description

On debian 11 based node, I have been trying to deploy gpu-operator helm chart but it keeps asking for a pre-installed driver. The end goal for me is to deploy all the gpu-operator operands along with the driver using the gpu-operator helm chart.

I have even tried pre-installing the drivers from the debian packages but even that is somehow not installing the legacy nvidi-driver correctly as the gpu-operator operand pods keep complaining that driver was not detected. I am using A10 GPU.

2. Steps to reproduce the issue

Setup a basic RKE2 kubernetes cluster with Debian 11 base nodes and container runtime as containerd
try to install gpu-operator chart (original)

3. Information to attach (optional if deemed irrelevant)

[ ] kubernetes pods status: kubectl get pods --all-namespaces

NAME READY STATUS RESTARTS AGE gpu-feature-discovery-rnkr9 0/1 Init:0/1 0 8m35s gpu-operator-66bc844599-br48s 1/1 Running 0 9m32s helm-install-nvidiagpuoperatorchart-wshsn 0/1 Completed 0 9m39s nvidia-container-toolkit-daemonset-md8j7 0/1 Init:0/1 0 8m35s nvidia-dcgm-exporter-d52gq 0/1 Init:0/1 0 8m35s nvidia-device-plugin-daemonset-hp5rt 0/1 Init:0/1 0 8m35s nvidia-driver-daemonset-5xf74 0/1 CrashLoopBackOff 6 (104s ago) 9m5s nvidia-operator-validator-rnst6 0/1 Init:0/4 0 8m35s nvidiagpuoperatorchart-node-feature-discovery-master-56d86swqbx 1/1 Running 0 9m32s nvidiagpuoperatorchart-node-feature-discovery-worker-gf96t 1/1 Running 0 9m33s

[ ] kubernetes daemonset status: kubectl get ds --all-namespaces NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE gpu-feature-discovery 1 1 0 1 0 nvidia.com/gpu.deploy.gpu-feature-discovery=true 10m nvidia-container-toolkit-daemonset 1 1 0 1 0 nvidia.com/gpu.deploy.container-toolkit=true 10m nvidia-dcgm-exporter 1 1 0 1 0 nvidia.com/gpu.deploy.dcgm-exporter=true 10m nvidia-device-plugin-daemonset 1 1 0 1 0 nvidia.com/gpu.deploy.device-plugin=true 10m nvidia-driver-daemonset 1 1 0 1 0 nvidia.com/gpu.deploy.driver=true 10m nvidia-operator-validator 1 1 0 1 0 nvidia.com/gpu.deploy.operator-validator=true 10m nvidiagpuoperatorchart-node-feature-discovery-worker 1 1 1 1 1 11m
[ ] If a pod/ds is in an error state or pending state kubectl logs -n NAMESPACE POD_NAME

kubectl logs nvidia-driver-daemonset-5xf74 -n nvidia Defaulted container "nvidia-driver-ctr" out of: nvidia-driver-ctr, k8s-driver-manager (init)

========== NVIDIA Software Installer ==========

Starting installation of NVIDIA driver branch 535 for Linux kernel version 5.10.0-23-amd64

Stopping NVIDIA persistence daemon... Unloading NVIDIA driver kernel modules... Unmounting NVIDIA driver rootfs... Installing NVIDIA driver kernel modules... Reading package lists... Building dependency tree... Reading state information... E: Unable to locate package linux-objects-nvidia-535-server-5.10.0-23-amd64 E: Couldn't find any package by glob 'linux-objects-nvidia-535-server-5.10.0-23-amd64' E: Couldn't find any package by regex 'linux-objects-nvidia-535-server-5.10.0-23-amd64' E: Unable to locate package linux-signatures-nvidia-5.10.0-23-amd64 E: Couldn't find any package by glob 'linux-signatures-nvidia-5.10.0-23-amd64' E: Couldn't find any package by regex 'linux-signatures-nvidia-5.10.0-23-amd64' E: Unable to locate package linux-modules-nvidia-535-server-5.10.0-23-amd64 E: Couldn't find any package by glob 'linux-modules-nvidia-535-server-5.10.0-23-amd64' E: Couldn't find any package by regex 'linux-modules-nvidia-535-server-5.10.0-23-amd64' Stopping NVIDIA persistence daemon... Unloading NVIDIA driver kernel modules... Unmounting NVIDIA driver rootfs...

kubectl logs nvidia-driver-daemonset-5xf74 -n nvidia -c k8s-driver-manager Getting current value of the 'nvidia.com/gpu.deploy.operator-validator' node label Current value of 'nvidia.com/gpu.deploy.operator-validator=true' Getting current value of the 'nvidia.com/gpu.deploy.container-toolkit' node label Current value of 'nvidia.com/gpu.deploy.container-toolkit=true' Getting current value of the 'nvidia.com/gpu.deploy.device-plugin' node label Current value of 'nvidia.com/gpu.deploy.device-plugin=true' Getting current value of the 'nvidia.com/gpu.deploy.gpu-feature-discovery' node label Current value of 'nvidia.com/gpu.deploy.gpu-feature-discovery=true' Getting current value of the 'nvidia.com/gpu.deploy.dcgm-exporter' node label Current value of 'nvidia.com/gpu.deploy.dcgm-exporter=true' Getting current value of the 'nvidia.com/gpu.deploy.dcgm' node label Current value of 'nvidia.com/gpu.deploy.dcgm=true' Getting current value of the 'nvidia.com/gpu.deploy.mig-manager' node label Current value of 'nvidia.com/gpu.deploy.mig-manager=' Getting current value of the 'nvidia.com/gpu.deploy.nvsm' node label Current value of 'nvidia.com/gpu.deploy.nvsm=' Getting current value of the 'nvidia.com/gpu.deploy.sandbox-validator' node label Current value of 'nvidia.com/gpu.deploy.sandbox-validator=' Getting current value of the 'nvidia.com/gpu.deploy.sandbox-device-plugin' node label Current value of 'nvidia.com/gpu.deploy.sandbox-device-plugin=' Getting current value of the 'nvidia.com/gpu.deploy.vgpu-device-manager' node label Current value of 'nvidia.com/gpu.deploy.vgpu-device-manager=' Getting current value of the 'nodeType' node label(used by NVIDIA Fleet Command) Current value of 'nodeType=' Current value of AUTO_UPGRADE_POLICY_ENABLED=true' Shutting down all GPU clients on the current node by disabling their component-specific nodeSelector labels node/edgelinworker1 labeled Waiting for the operator-validator to shutdown pod/nvidia-operator-validator-85z6b condition met Waiting for the container-toolkit to shutdown pod/nvidia-container-toolkit-daemonset-h7bbj condition met Waiting for the device-plugin to shutdown Waiting for gpu-feature-discovery to shutdown Waiting for dcgm-exporter to shutdown Waiting for dcgm to shutdown Auto eviction of GPU pods on node edgelinworker1 is disabled by the upgrade policy unbinding device 6565:00:00.0 Auto eviction of GPU pods on node edgelinworker1 is disabled by the upgrade policy Auto drain of the node edgelinworker1 is disabled by the upgrade policy Rescheduling all GPU clients on the current node by enabling their component-specific nodeSelector labels node/edgelinworker1 labeled Unloading nouveau driver... Successfully unloaded nouveau driver

[ ] NVIDIA shared directory: ls -la /run/nvidia total 0 drwxr-xr-x 4 root root 80 Jul 24 07:12 . drwxr-xr-x 32 root root 820 Jul 24 07:13 .. drwxr-xr-x 2 root root 40 Jul 24 07:09 driver drwxr-xr-x 2 root root 40 Jul 24 07:09 validations
[ ] NVIDIA packages directory: ls -la /usr/local/nvidia/toolkit ls: cannot access '/usr/local/nvidia/toolkit': No such file or directory
[ ] NVIDIA driver directory: ls -la /run/nvidia/driver total 0 drwxr-xr-x 2 root root 40 Jul 24 07:09 . drwxr-xr-x 4 root root 80 Jul 24 07:13 ..

NVIDIA / gpu-operator