Last week I installed GPU operator 6.2 on Kubernetes v1.19.9 (CentOS 7.9, kernel 3.10.0-1160.15.2.el7.x86_64), and everything worked fine. After upgrading the CentOS 7 kernel from 3.10.0 to 5.4, however, the NVIDIA driver pod fails with the error shown below: the kernel version cannot be resolved, and the matching kernel packages cannot be found.
$ kubectl get pods -n gpu-operator-resources
NAME READY STATUS RESTARTS AGE
nvidia-container-toolkit-daemonset-4czx9 0/1 Init:0/1 0 37m
nvidia-driver-daemonset-cs2sr 0/1 CrashLoopBackOff 14 37m
$ kubectl logs nvidia-driver-daemonset-cs2sr -n gpu-operator-resources
========== NVIDIA Software Installer ==========
Starting installation of NVIDIA driver version 460.32.03 for Linux kernel version 5.4.124-1.el7.elrepo.x86_64
Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
Checking NVIDIA driver packages...
Unable to open the file '/lib/modules/5.4.124-1.el7.elrepo.x86_64/proc/version' (No such file or directory).
Updating the package cache...
Resolving Linux kernel version...
Could not resolve Linux kernel version
Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
I used ELRepo.org to upgrade the CentOS kernel. Does this mean the NVIDIA driver image does not support ELRepo kernels (or Linux kernel 5.x in general)?
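To illustrate why a "Could not resolve Linux kernel version" message could appear, here is a minimal sketch of a resolution step that matches the running kernel release against a list of available package versions. It is only an assumption about the general shape of such a check, not the driver image's actual script; the function name `resolve_kernel_version` and the input format (one release string per line, in the same form as `uname -r`) are hypothetical.

```bash
# Hypothetical helper, NOT the NVIDIA driver image's real logic.
# Reads candidate package releases (one per line) on stdin and prints the
# first one that starts with the kernel release given as $1.
# Returns 1 and prints an error on stderr when nothing matches.
resolve_kernel_version() {
    local release="$1"
    local candidate
    while IFS= read -r candidate; do
        case "$candidate" in
            "$release"*)
                printf '%s\n' "$candidate"
                return 0
                ;;
        esac
    done
    echo "Could not resolve Linux kernel version" >&2
    return 1
}

# Example (assumed candidate list containing only the stock CentOS kernel):
#   printf '%s\n' 3.10.0-1160.15.2.el7.x86_64 |
#       resolve_kernel_version "5.4.124-1.el7.elrepo.x86_64"
```

If the repositories visible to the container only carry packages for the stock 3.10.0 kernel, a check of this kind has nothing to match an `elrepo` release against, which would be consistent with the log above.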
2. Steps to reproduce the issue
Install Kubernetes v1.19.9 with CRI-O 1.19.1
Upgrade the CentOS 7 kernel from 3.10.0-1160.15.2.el7.x86_64 to 5.4.124-1.el7.elrepo.x86_64 (using ELRepo)
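After the upgrade step, it may help to confirm which kernel flavor a node actually booted. The sketch below classifies a release string by its vendor tag; the function name `kernel_flavor` and the labels it prints are made up for illustration.

```bash
# Hypothetical helper: classify a kernel release string by its vendor tag.
# The labels ("elrepo", "el7-stock", "unknown") are illustrative only.
kernel_flavor() {
    case "$1" in
        *.elrepo.*) echo "elrepo" ;;     # e.g. 5.4.124-1.el7.elrepo.x86_64
        *.el7.*)    echo "el7-stock" ;;  # e.g. 3.10.0-1160.15.2.el7.x86_64
        *)          echo "unknown" ;;
    esac
}

# On a node, pass the running kernel release:
#   kernel_flavor "$(uname -r)"
```

The `elrepo` pattern is tested first because ELRepo releases for CentOS 7 also contain `.el7.`.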
1. Quick Debug Checklist
Do you have `i2c_core` and `ipmi_msghandler` loaded on the nodes? => No
Did you apply the CRD (`kubectl describe clusterpolicies --all-namespaces`)? => Yes
1. Issue or feature description
Cluster Information:
Kernel: 5.4.124-1.el7.elrepo.x86_64
Driver image: nvcr.io/nvidia/driver:460.32.03-centos7
3. Information to attach (optional if deemed irrelevant)
NAMESPACE NAME READY STATUS RESTARTS AGE
default gpu-operator-1623131323-node-feature-discovery-master-6685shjp7 1/1 Running 0 28m
default gpu-operator-1623131323-node-feature-discovery-worker-cdvpj 1/1 Running 1 28m
default gpu-operator-1623131323-node-feature-discovery-worker-k9fpf 1/1 Running 1 28m
default gpu-operator-1623131323-node-feature-discovery-worker-kwdsb 1/1 Running 2 28m
default gpu-operator-1623131323-node-feature-discovery-worker-tldwn 1/1 Running 0 28m
default gpu-operator-65d474cc8-rtwdq 1/1 Running 0 28m
gpu-operator-resources nvidia-container-toolkit-daemonset-4czx9 0/1 Init:0/1 0 28m
gpu-operator-resources nvidia-driver-daemonset-cs2sr 0/1 CrashLoopBackOff 12 28m
kube-system cilium-42d9z 1/1 Running 0 29m
kube-system cilium-mhsdn 1/1 Running 0 29m
kube-system cilium-operator-694449c44b-n2pxm 1/1 Running 5 26h
kube-system cilium-r6fkq 1/1 Running 0 29m
kube-system cilium-sft2q 1/1 Running 0 29m
kube-system coredns-7677f9bb54-dx4st 1/1 Running 2 25h
kube-system coredns-7677f9bb54-r9h2p 1/1 Running 3 25h
kube-system dns-autoscaler-5b7b5c9b6f-99t9s 1/1 Running 2 25h
kube-system etcd-k8s-master1.k8s.lab 1/1 Running 2 26h
kube-system kube-apiserver-k8s-master1.k8s.lab 1/1 Running 2 26h
kube-system kube-controller-manager-k8s-master1.k8s.lab 1/1 Running 3 26h
kube-system kube-proxy-9k5cv 1/1 Running 2 26h
kube-system kube-proxy-hw9rx 1/1 Running 2 26h
kube-system kube-proxy-pz5sq 1/1 Running 2 26h
kube-system kube-proxy-xq8k2 1/1 Running 4 26h
kube-system kube-scheduler-k8s-master1.k8s.lab 1/1 Running 4 26h
kube-system metrics-server-747c56cf5f-qv5vv 2/2 Running 4 25h
kube-system nodelocaldns-gzrxj 1/1 Running 2 25h
kube-system nodelocaldns-hqg47 1/1 Running 2 25h
kube-system nodelocaldns-wc79d 1/1 Running 4 25h
kube-system nodelocaldns-zvl9t 1/1 Running 2 25h
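In a long pod listing like the one above, the two failing pods are easy to miss. A small filter such as the following sketch can narrow the output down; the function name `unhealthy_pods` is hypothetical, and it assumes the six-column layout that `kubectl get pods --all-namespaces` prints (NAMESPACE, NAME, READY, STATUS, RESTARTS, AGE).

```bash
# Hypothetical filter: reads pod rows on stdin and prints "NAMESPACE/NAME STATUS"
# for every pod that is not Running or whose READY counts differ (e.g. 0/1).
# Note: pods in a Completed state would also be reported by this simple check.
unhealthy_pods() {
    awk '
        $1 == "NAMESPACE" { next }   # skip the header row if present
        NF >= 6 {
            split($3, ready, "/")    # READY column, e.g. "0/1"
            if ($4 != "Running" || ready[1] != ready[2])
                print $1 "/" $2, $4
        }
    '
}

# On a cluster:
#   kubectl get pods --all-namespaces | unhealthy_pods
```

Applied to the listing above, this would leave only the `nvidia-container-toolkit-daemonset` and `nvidia-driver-daemonset` pods in `gpu-operator-resources`.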