NVIDIA / gpu-operator

NVIDIA GPU Operator creates, configures, and manages GPUs in Kubernetes
https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/index.html
Apache License 2.0

nvidia-driver-daemonset always fails on Ubuntu 20.04.2 #213

Open · aipredict opened this issue 3 years ago

aipredict commented 3 years ago


1. Quick Debug Checklist

I'm running

  1. Ubuntu 20.04.2
  2. Microk8s v1.21.1
  3. containerd

I haven't installed any GPU driver on the host machine; the GPU is a GeForce RTX 1650.
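
To double-check that no NVIDIA driver is present on the host before enabling the operator, a few standard checks can be run (a minimal sketch for a stock Ubuntu host):

# No NVIDIA kernel module should be loaded
lsmod | grep -i nvidia || echo "no nvidia kernel module loaded"

# The driver's proc interface should not exist
cat /proc/driver/nvidia/version 2>/dev/null || echo "no host driver installed"

# nvidia-smi should not be on the PATH
command -v nvidia-smi || echo "nvidia-smi not installed"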

2. Issue or feature description

Enable the gpu-operator in MicroK8s:

microk8s.enable gpu

The nvidia-driver-daemonset pod always fails:

kubectl get po -A
NAMESPACE                NAME                                                          READY   STATUS             RESTARTS   AGE
kube-system              coredns-7f9c69c78c-hphlc                                      1/1     Running            1          9h
kube-system              calico-node-x76zp                                             1/1     Running            1          9h
kube-system              calico-kube-controllers-f7868dd95-tqjnw                       1/1     Running            1          9h
default                  gpu-operator-node-feature-discovery-master-867c4f7bfb-5wpgk   1/1     Running            0          6m46s
default                  gpu-operator-node-feature-discovery-worker-msmv2              1/1     Running            0          6m46s
gpu-operator-resources   nvidia-operator-validator-rh7h7                               0/1     Init:0/4           0          6m8s
gpu-operator-resources   nvidia-device-plugin-daemonset-8kjxn                          0/1     Init:0/1           0          6m8s
gpu-operator-resources   nvidia-dcgm-exporter-kgbq5                                    0/1     Init:0/1           0          6m8s
gpu-operator-resources   gpu-feature-discovery-wvvr2                                   0/1     Init:0/1           0          6m8s
default                  gpu-operator-7db468cfdf-4sv48                                 1/1     Running            0          6m46s
gpu-operator-resources   nvidia-container-toolkit-daemonset-7s686                      0/1     Init:0/1           0          6m8s
gpu-operator-resources   nvidia-driver-daemonset-ck684                                 0/1     CrashLoopBackOff   5          6m9s
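
For a pod stuck in CrashLoopBackOff like this one, the pod events and the previous container's logs usually narrow down the failing step; a minimal sketch with standard kubectl commands, using the pod name from the listing above:

# Events show why the container keeps restarting (image pulls, probes, exit codes)
kubectl describe pod nvidia-driver-daemonset-ck684 -n gpu-operator-resources

# Logs of the previous, crashed container instance
kubectl logs nvidia-driver-daemonset-ck684 -n gpu-operator-resources --previous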

I checked its logs:

kubectl logs nvidia-driver-daemonset-ck684 -n gpu-operator-resources
Creating directory NVIDIA-Linux-x86_64-460.73.01
Verifying archive integrity... OK
Uncompressing NVIDIA Accelerated Graphics Driver for Linux-x86_64 460.73.01...............................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................

WARNING: Unable to determine the default X library path. The path /tmp/null/lib will be used, but this path was not detected in the ldconfig(8) cache, and no directory exists at this path, so it is likely that libraries installed there will not be found by the loader.

WARNING: You specified the '--no-kernel-module' command line option, nvidia-installer will not install a kernel module as part of this driver installation, and it will not remove existing NVIDIA kernel modules not part of an earlier NVIDIA driver installation.  Please ensure that an NVIDIA kernel module matching this driver version is installed separately.

========== NVIDIA Software Installer ==========

Starting installation of NVIDIA driver version 460.73.01 for Linux kernel version 5.8.0-55-generic

Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
Checking NVIDIA driver packages...
Updating the package cache...
E: Failed to fetch https://developer.download.nvidia.cn/compute/cuda/repos/ubuntu2004/x86_64/by-hash/SHA256/ce4d38aa740e318d2eae04cba08f1322017d162183c8f61f84391bf88020a534  404  Not Found [IP: 180.101.196.129 443]
E: Some index files failed to download. They have been ignored, or old ones used instead.
Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...

It seems an error occurred while fetching a package index:

E: Failed to fetch https://developer.download.nvidia.cn/compute/cuda/repos/ubuntu2004/x86_64/by-hash/SHA256/ce4d38aa740e318d2eae04cba08f1322017d162183c8f61f84391bf88020a534  404  Not Found [IP: 180.101.196.129 443]
E: Some index files failed to download. They have been ignored, or old ones used instead.
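
The 404 is returned by developer.download.nvidia.cn, which appears to be the China-region mirror of developer.download.nvidia.com, so one way to narrow this down is to check whether the same by-hash index exists on both hosts. A minimal sketch with curl, reusing the hash from the error above:

# Regional mirror (the host the container actually hit)
curl -sI https://developer.download.nvidia.cn/compute/cuda/repos/ubuntu2004/x86_64/by-hash/SHA256/ce4d38aa740e318d2eae04cba08f1322017d162183c8f61f84391bf88020a534 | head -n 1

# Primary repository
curl -sI https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/by-hash/SHA256/ce4d38aa740e318d2eae04cba08f1322017d162183c8f61f84391bf88020a534 | head -n 1

If only the .cn mirror returns 404, the index files are simply out of sync on that mirror and the failure lies outside the operator itself.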

I checked the Helm release; the latest chart, v1.7.0, is installed:

microk8s.helm3 ls
WARNING: Kubernetes configuration file is group-readable. This is insecure. Location: /var/snap/microk8s/2262/credentials/client.config
NAME            NAMESPACE    REVISION    UPDATED                                    STATUS      CHART                  APP VERSION
gpu-operator    default      1           2021-06-22 06:38:42.691611216 +0800 CST    deployed    gpu-operator-v1.7.0    v1.7.0
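
It can also help to confirm which values the release was actually deployed with, in particular the driver settings. A minimal sketch; helm get values --all includes the chart defaults as well as any overrides:

# Dump the computed values of the gpu-operator release
microk8s.helm3 get values gpu-operator -n default --all

# The keys of interest here are driver.enabled, driver.repository and driver.version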

3. Steps to reproduce the issue

On Ubuntu 20.04.2 with an NVIDIA GPU:

sudo snap install microk8s --channel 1.21/stable --classic
microk8s enable gpu
aipredict commented 3 years ago

An update: the nvidia-driver-daemonset pod is not created after I install the GPU driver on the host machine, and all the related pods work well:

kubectl get po -A
NAMESPACE                NAME                                                          READY   STATUS      RESTARTS   AGE
gpu-operator-resources   gpu-feature-discovery-8jxrm                                   1/1     Running     2          12h
gpu-operator-resources   nvidia-dcgm-exporter-886qt                                    1/1     Running     2          12h
gpu-operator-resources   nvidia-device-plugin-daemonset-qxk7z                          1/1     Running     2          12h
gpu-operator-resources   nvidia-container-toolkit-daemonset-9fxjl                      1/1     Running     2          12h
kube-system              coredns-7f9c69c78c-qxw9j                                      1/1     Running     11         24h
default                  gpu-operator-7db468cfdf-ghrfd                                 1/1     Running     2          12h
default                  gpu-operator-node-feature-discovery-master-867c4f7bfb-9mv2m   1/1     Running     2          12h
kube-system              calico-kube-controllers-8695b994-jlhfd                        1/1     Running     3          13h
gpu-operator-resources   nvidia-cuda-validator-lv2t7                                   0/1     Completed   0          80m
gpu-operator-resources   nvidia-device-plugin-validator-svl2c                          0/1     Completed   0          80m
kube-system              calico-node-h5rwh                                             1/1     Running     3          13h
gpu-operator-resources   nvidia-operator-validator-khdjk                               1/1     Running     2          12h
default                  gpu-operator-node-feature-discovery-worker-rhcgj              1/1     Running     3          12h
kube-system              hostpath-provisioner-5c65fbdb4f-66dpn                         1/1     Running     0          79m
kube-system              metrics-server-8bbfb4bdb-88j8s                                1/1     Running     0          79m

I checked the MicroK8s script that enables the gpu-operator: it sets the Helm value driver.enabled according to whether a GPU driver is already installed on the host machine, which is why the nvidia-driver-daemonset pod is no longer created once the host driver is present.
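
For reference, the same toggle can be set by hand when the chart is installed directly with Helm instead of through the add-on. A minimal sketch, assuming the NVIDIA Helm repository is added under the name nvidia:

# One-time: add the NVIDIA Helm repository
microk8s.helm3 repo add nvidia https://helm.ngc.nvidia.com/nvidia
microk8s.helm3 repo update

# Let the operator deploy the driver container (the failing case in this report)
microk8s.helm3 upgrade --install gpu-operator nvidia/gpu-operator -n default --set driver.enabled=true

# Or reuse a driver that is already installed on the host
microk8s.helm3 upgrade --install gpu-operator nvidia/gpu-operator -n default --set driver.enabled=false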

This issue focuses on the case where the GPU driver is installed by the gpu-operator itself: the nvidia-driver-daemonset pod always fails on Ubuntu 20.04.2.

shivamerla commented 3 years ago

Interesting, the error is not related to the kernel version. The base image used to build this container is nvidia/cuda:11.3.0-base-ubuntu20.04, and the first package-related command it runs is apt-get update. The same works fine with AWS Ubuntu 20.04 kernels (e.g. 5.4.0-1048-aws). I will check with the CUDA team here to verify whether this has been seen.
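
One way to take Kubernetes out of the picture is to run the same command in the same base image directly; a sketch, assuming Docker is available on the host. If the 404 reproduces here as well, the problem sits in the CUDA repository mirror rather than in the operator or in MicroK8s:

# apt-get update inside the driver container's base image
docker run --rm nvidia/cuda:11.3.0-base-ubuntu20.04 apt-get update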

rohanrehman commented 2 years ago

https://github.com/ubuntu/microk8s/issues/2763#issuecomment-999778587

gigony commented 2 years ago

Installing microk8s v1.22 worked. It looks like an issue with microk8s: https://github.com/canonical/microk8s/issues/2634
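
For anyone hitting this on the 1.21 channel, moving the snap to 1.22 as described above would look roughly like this (a sketch using the standard snap channel convention):

# Switch MicroK8s to the 1.22 track and re-enable the GPU add-on
sudo snap refresh microk8s --channel=1.22/stable
microk8s disable gpu
microk8s enable gpu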