NVIDIA / gpu-operator

NVIDIA GPU Operator creates, configures, and manages GPUs in Kubernetes
https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/index.html
Apache License 2.0

NVIDIA Driver Could not resolve Linux kernel version on CentOS 7.9 Kernel 5.4. #205

Open pohsien324 opened 3 years ago

pohsien324 commented 3 years ago

1. Quick Debug Checklist

1. Issue or feature description

Cluster Information:

Last week I installed GPU Operator 1.6.2 on Kubernetes v1.19.9 (CentOS 7.9, kernel 3.10.0-1160.15.2.el7.x86_64), and everything worked fine. After I upgraded the CentOS 7 kernel from 3.10.0 to 5.4, however, the NVIDIA driver Pod reports the error below: the kernel version cannot be resolved, and the related kernel package cannot be found.

$ kubectl get pods -n gpu-operator-resources

NAME                                       READY   STATUS             RESTARTS   AGE
nvidia-container-toolkit-daemonset-4czx9   0/1     Init:0/1           0          37m
nvidia-driver-daemonset-cs2sr              0/1     CrashLoopBackOff   14         37m
$ kubectl logs nvidia-driver-daemonset-cs2sr -n gpu-operator-resources

========== NVIDIA Software Installer ==========

Starting installation of NVIDIA driver version 460.32.03 for Linux kernel version 5.4.124-1.el7.elrepo.x86_64

Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
Checking NVIDIA driver packages...
Unable to open the file '/lib/modules/5.4.124-1.el7.elrepo.x86_64/proc/version' (No such file or directory).Updating the package cache...
Resolving Linux kernel version...
Could not resolve Linux kernel version
Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...

I used ELRepo (elrepo.org) to upgrade my CentOS kernel. Does the NVIDIA driver image not support ELRepo kernels, or Linux kernel 5.x in general?
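For anyone triaging the same failure, here is a minimal sketch for checking whether the kernel named in the driver log is an ELRepo build rather than a stock CentOS one. The helper names `kernel_from_log` and `kernel_origin` are hypothetical (they are not part of the GPU Operator or the driver image), and the classification assumes ELRepo kernels carry `.elrepo.` in their release string, as the one in this report does. It does not explain why the driver container fails; it only identifies where the kernel came from.

```bash
#!/usr/bin/env bash
# Hypothetical triage helpers; not part of the GPU Operator or the driver image.

# Read installer log text on stdin and print the kernel release taken from
# the "Starting installation ... for Linux kernel version <release>" line.
kernel_from_log() {
  sed -n 's/^Starting installation of NVIDIA driver version .* for Linux kernel version \(.*\)$/\1/p'
}

# Classify a kernel release string by its vendor tag.
# Assumption: ELRepo builds contain ".elrepo." and stock EL kernels
# contain ".el<digit>" without it, as in the two releases in this report.
kernel_origin() {
  case "$1" in
    *.elrepo.*) echo "elrepo" ;;
    *.el[0-9]*) echo "distro" ;;
    *)          echo "unknown" ;;
  esac
}

# Example using the log line quoted in this report.
release=$(printf '%s\n' \
  'Starting installation of NVIDIA driver version 460.32.03 for Linux kernel version 5.4.124-1.el7.elrepo.x86_64' \
  | kernel_from_log)
echo "${release} -> $(kernel_origin "${release}")"
```

Against a live cluster, the same function could read from `kubectl logs nvidia-driver-daemonset-cs2sr -n gpu-operator-resources | kernel_from_log` instead of the quoted line.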

2. Steps to reproduce the issue

  1. Install Kubernetes v1.19.9 with CRI-O 1.19.1
  2. Upgrade the CentOS 7 kernel from 3.10.0-1160.15.2.el7.x86_64 to 5.4.124-1.el7.elrepo.x86_64 (using ELRepo)
  3. Deploy GPU Operator 1.6.2
    $ helm install --wait --generate-name ./gpu-operator --set operator.defaultRuntime=crio --set toolkit.version=1.4.7-ubi8

3. Information to attach (optional if deemed irrelevant)

  1. Kubernetes pod status.
    
    $ kubectl get pods --all-namespaces

    default                  gpu-operator-1623131323-node-feature-discovery-master-6685shjp7   1/1     Running            0          28m
    default                  gpu-operator-1623131323-node-feature-discovery-worker-cdvpj       1/1     Running            1          28m
    default                  gpu-operator-1623131323-node-feature-discovery-worker-k9fpf       1/1     Running            1          28m
    default                  gpu-operator-1623131323-node-feature-discovery-worker-kwdsb       1/1     Running            2          28m
    default                  gpu-operator-1623131323-node-feature-discovery-worker-tldwn       1/1     Running            0          28m
    default                  gpu-operator-65d474cc8-rtwdq                                      1/1     Running            0          28m
    gpu-operator-resources   nvidia-container-toolkit-daemonset-4czx9                          0/1     Init:0/1           0          28m
    gpu-operator-resources   nvidia-driver-daemonset-cs2sr                                     0/1     CrashLoopBackOff   12         28m
    kube-system              cilium-42d9z                                                      1/1     Running            0          29m
    kube-system              cilium-mhsdn                                                      1/1     Running            0          29m
    kube-system              cilium-operator-694449c44b-n2pxm                                  1/1     Running            5          26h
    kube-system              cilium-r6fkq                                                      1/1     Running            0          29m
    kube-system              cilium-sft2q                                                      1/1     Running            0          29m
    kube-system              coredns-7677f9bb54-dx4st                                          1/1     Running            2          25h
    kube-system              coredns-7677f9bb54-r9h2p                                          1/1     Running            3          25h
    kube-system              dns-autoscaler-5b7b5c9b6f-99t9s                                   1/1     Running            2          25h
    kube-system              etcd-k8s-master1.k8s.lab                                          1/1     Running            2          26h
    kube-system              kube-apiserver-k8s-master1.k8s.lab                                1/1     Running            2          26h
    kube-system              kube-controller-manager-k8s-master1.k8s.lab                       1/1     Running            3          26h
    kube-system              kube-proxy-9k5cv                                                  1/1     Running            2          26h
    kube-system              kube-proxy-hw9rx                                                  1/1     Running            2          26h
    kube-system              kube-proxy-pz5sq                                                  1/1     Running            2          26h
    kube-system              kube-proxy-xq8k2                                                  1/1     Running            4          26h
    kube-system              kube-scheduler-k8s-master1.k8s.lab                                1/1     Running            4          26h
    kube-system              metrics-server-747c56cf5f-qv5vv                                   2/2     Running            4          25h
    kube-system              nodelocaldns-gzrxj                                                1/1     Running            2          25h
    kube-system              nodelocaldns-hqg47                                                1/1     Running            2          25h
    kube-system              nodelocaldns-wc79d                                                1/1     Running            4          25h
    kube-system              nodelocaldns-zvl9t                                                1/1     Running            2          25h


  2. NVIDIA driver DaemonSet log.
```bash
$ kubectl logs nvidia-driver-daemonset-cs2sr -n gpu-operator-resources

========== NVIDIA Software Installer ==========

Starting installation of NVIDIA driver version 460.32.03 for Linux kernel version 5.4.124-1.el7.elrepo.x86_64

Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
Checking NVIDIA driver packages...
Unable to open the file '/lib/modules/5.4.124-1.el7.elrepo.x86_64/proc/version' (No such file or directory).Updating the package cache...
Resolving Linux kernel version...
Could not resolve Linux kernel version
Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
```

daniel-hutao commented 3 years ago

+1

ldd91 commented 1 year ago

+1