OS/Version: Red Hat Enterprise Linux CoreOS release 4.12
Kernel Version: 4.18.0-372.82.1.el8_6.x86_64
Container Runtime Type/Version: CRI-O
K8s Flavor/Version: Openshift Server Version: 4.12.45, Kubernetes Version: v1.25.14+a52e8df
GPU Operator Version: v23.9.1, v23.6.1
GPU node: A100
2. Issue or feature description
GPU-operator's driver pod failed to get ready, cluster policy installed default with "use_ocp_driver_toolkit" as selected,
driver pod is not getting ready because its Startup probe failed: No devices were found
GPU node is of type A100
[core@master0 tmp]$ kubectl exec -ti nvidia-driver-daemonset-412.86.202311271639-0-qkdsz -n nvidia-gpu-operator -- nvidia-smi
No devices were found
command terminated with exit code 6
[core@master0 tmp]$
We have tried with use_ocp_driver_toolkit set to "false" and applied the entitlement but even with it driver pod failed to download packages "kernel-headers-4.18.0-372.82.1.el8_6.x86_64kernel-devel-4.18.0-372.82.1.el8_6.x86_64" and went into crash-loopback state, the following error is coming
Installing Linux kernel headers...
+ echo 'Installing Linux kernel headers...'
+ dnf -q -y --releasever=8.6 install kernel-headers-4.18.0-372.82.1.el8_6.x86_64 kernel-devel-4.18.0-372.82.1.el8_6.x86_64
Error: Unable to find a match: kernel-headers-4.18.0-372.82.1.el8_6.x86_64 kernel-devel-4.18.0-372.82.1.el8_6.x86_64
++ rm -rf /tmp/tmp.ffQrW5HEcg
3. Steps to reproduce the issue
Use following manifest YML to create cluster-policy
1. Quick Debug Information
2. Issue or feature description
GPU-operator's driver pod failed to get ready, cluster policy installed default with "use_ocp_driver_toolkit" as selected, driver pod is not getting ready because its Startup probe failed: No devices were found GPU node is of type A100
since the driver pod is not fully ready(container nvidia-driver-ctr is not getting ready) other pods are stuck in init state
We have tried with use_ocp_driver_toolkit set to "false" and applied the entitlement but even with it driver pod failed to download packages "kernel-headers-4.18.0-372.82.1.el8_6.x86_64 kernel-devel-4.18.0-372.82.1.el8_6.x86_64" and went into crash-loopback state, the following error is coming
3. Steps to reproduce the issue
Use following manifest YML to create cluster-policy
let us know if more information is required.