NVIDIA / gpu-operator

NVIDIA GPU Operator creates/configures/manages GPUs atop Kubernetes
Apache License 2.0

nvidia-driver-daemonset stuck in Init:CrashLoopBackOff (again) #367

Open tmbdev opened 2 years ago

tmbdev commented 2 years ago

I'm on Ubuntu 22.04. I'm installing everything with the following Ansible script, including microk8s 1.24/edge.

---
- hosts: "{{ host | default('localhost')}}"
  become: yes
  become_method: sudo
  tasks:
  - name: installing CUDA from NVIDIA
    shell: |
      cd /tmp
      wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-ubuntu2204.pin
      mv cuda-ubuntu2204.pin /etc/apt/preferences.d/cuda-repository-pin-600
      apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/3bf863cc.pub
      add-apt-repository -y "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/ /"
      apt-get update
      apt-get -y install cuda
  - name: installing microk8s
    snap:
      name: microk8s
      channel: 1.24/edge
      classic: yes
      state: present
  - shell: /snap/bin/microk8s start
  - shell: /snap/bin/microk8s enable gpu

Note that channel 1.24/edge is needed due to unrelated bugs in microk8s.

I have installed this on three machines. It works fine on two of them (each having two 2080 cards). I can kubectl apply the vector-add example on both of them.

The third machine has a 3090 GPU and it is failing. It appears there is a problem with the k8s-driver-manager:

syslog:Jun 29 21:44:00 varuna microk8s.daemon-kubelite[77992]: E0629 21:44:00.085442   77992 pod_workers.go:951] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"k8s-driver-manager\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=k8s-driver-manager pod=nvidia-driver-daemonset-f482n_gpu-operator-resources(706e7a7c-d8e0-40e4-b57b-f32813ae8a0d)\"" pod="gpu-operator-resources/nvidia-driver-daemonset-f482n" podUID=706e7a7c-d8e0-40e4-b57b-f32813ae8a0d
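For more detail on the failing init container, the pod can also be inspected directly, e.g. with something like:

microk8s.kubectl describe pod nvidia-driver-daemonset-f482n -n gpu-operator-resources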

This is everything that's running on the microk8s installation:

varuna:log$ microk8s.kubectl get pods --all-namespaces
NAMESPACE                NAME                                                          READY   STATUS                  RESTARTS         AGE
default                  vector-add                                                    0/1     Pending                 0                6h14m
gpu-operator-resources   gpu-operator-node-feature-discovery-worker-d2776              1/1     Running                 37 (6m57s ago)   6h20m
kube-system              calico-node-mrdfg                                             1/1     Running                 1 (5h54m ago)    6h22m
kube-system              calico-kube-controllers-5755cd6ddb-kv9bx                      1/1     Running                 0                7m8s
gpu-operator-resources   gpu-operator-798c6ddc97-tgd8n                                 1/1     Running                 0                7m8s
gpu-operator-resources   gpu-operator-node-feature-discovery-master-6c65c99969-bxmp2   1/1     Running                 0                7m8s
kube-system              coredns-66bcf65bb8-lf69h                                      1/1     Running                 0                7m8s
kube-system              metrics-server-5f8f64cb86-thks7                               1/1     Running                 0                6m52s
kube-system              kubernetes-dashboard-765646474b-f2bk5                         1/1     Running                 0                5m43s
kube-system              dashboard-metrics-scraper-6b6f796c8d-5gfcs                    1/1     Running                 0                5m43s
gpu-operator-resources   nvidia-dcgm-exporter-bgv7b                                    0/1     Init:0/1                0                2m23s
gpu-operator-resources   nvidia-device-plugin-daemonset-6pfs7                          0/1     Init:0/1                0                2m23s
gpu-operator-resources   gpu-feature-discovery-r8fbh                                   0/1     Init:0/1                0                2m23s
gpu-operator-resources   nvidia-operator-validator-5zz4n                               0/1     Init:0/4                0                2m23s
gpu-operator-resources   nvidia-container-toolkit-daemonset-cpptw                      0/1     Init:0/1                0                2m23s
gpu-operator-resources   nvidia-driver-daemonset-f482n                                 0/1     Init:CrashLoopBackOff   75 (2m23s ago)   6h19m
varuna:log$ 

It appears something is trying to "unload the driver". The other nodes are running the same driver, the latest from the NVIDIA CUDA repo (515.48.07).

These are the logs from the nvidia-driver-daemonset containers:

varuna:log$ microk8s.kubectl logs pod/nvidia-driver-daemonset-f482n -n gpu-operator-resources --all-containers
Getting current value of the 'nvidia.com/gpu.deploy.operator-validator' node label
Current value of 'nvidia.com/gpu.deploy.operator-validator=true'
Getting current value of the 'nvidia.com/gpu.deploy.container-toolkit' node label
Current value of 'nvidia.com/gpu.deploy.container-toolkit=true'
Getting current value of the 'nvidia.com/gpu.deploy.device-plugin' node label
Current value of 'nvidia.com/gpu.deploy.device-plugin=true'
Getting current value of the 'nvidia.com/gpu.deploy.gpu-feature-discovery' node label
Current value of 'nvidia.com/gpu.deploy.gpu-feature-discovery=true'
Getting current value of the 'nvidia.com/gpu.deploy.dcgm-exporter' node label
Current value of 'nvidia.com/gpu.deploy.dcgm-exporter=true'
Getting current value of the 'nvidia.com/gpu.deploy.dcgm' node label
Current value of 'nvidia.com/gpu.deploy.dcgm=true'
Getting current value of the 'nvidia.com/gpu.deploy.mig-manager' node label
Current value of 'nvidia.com/gpu.deploy.mig-manager='
Getting current value of the 'nvidia.com/gpu.deploy.nvsm' node label
Current value of 'nvidia.com/gpu.deploy.nvsm='
nvidia driver module is already loaded with refcount 236
Shutting down all GPU clients on the current node by disabling their component-specific nodeSelector labels
node/varuna labeled
Waiting for the operator-validator to shutdown
pod/nvidia-operator-validator-5zz4n condition met
Waiting for the container-toolkit to shutdown
Waiting for the device-plugin to shutdown
Waiting for gpu-feature-discovery to shutdown
Waiting for dcgm-exporter to shutdown
Waiting for dcgm to shutdown
Unloading NVIDIA driver kernel modules...
nvidia_drm             69632  5
drm_kms_helper        307200  1 nvidia_drm
nvidia_uvm           1282048  0
nvidia_modeset       1142784  7 nvidia_drm
nvidia              40800256  236 nvidia_uvm,nvidia_modeset
drm                   606208  9 drm_kms_helper,nvidia,nvidia_drm
Could not unload NVIDIA driver kernel modules, driver is in use
Unable to cleanup driver modules, attempting again with node drain...
Draining node varuna...
node/varuna cordoned
error: unable to drain node "varuna" due to error:cannot delete Pods with local storage (use --delete-emptydir-data to override): kube-system/metrics-server-5f8f64cb86-thks7, kube-system/kubernetes-dashboard-765646474b-f2bk5, kube-system/dashboard-metrics-scraper-6b6f796c8d-5gfcs, continuing command...
There are pending nodes to be drained:
 varuna
cannot delete Pods with local storage (use --delete-emptydir-data to override): kube-system/metrics-server-5f8f64cb86-thks7, kube-system/kubernetes-dashboard-765646474b-f2bk5, kube-system/dashboard-metrics-scraper-6b6f796c8d-5gfcs
Uncordoning node varuna...
node/varuna uncordoned
Rescheduling all GPU clients on the current node by enabling their component-specific nodeSelector labels
node/varuna unlabeled

varuna:log$ 

The difference appears to be that the machine with the 3090 card actually tries to run an nvidia-driver-daemonset (which tries to unload the driver and is failing), while the machines with the 2080 cards don't. Why they behave differently is beyond me, since both cards are fairly old by now and I have the latest kernel drivers installed; there shouldn't be any need to unload/reload the driver on any of them.

All machines run GPU containers fine under docker and podman, so the driver is perfectly functional.
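For example, a check along these lines works on all of them (the exact CUDA image tag here is only illustrative):

docker run --rm --gpus all nvidia/cuda:11.7.0-base-ubuntu22.04 nvidia-smi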

I have tried booting the machine in text mode (no processes using the GPU according to nvidia-smi), and the nvidia-driver-daemonset fails in the same way.

varuna:~$ nvidia-smi
Wed Jun 29 22:35:02 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.48.07    Driver Version: 515.48.07    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:01:00.0 Off |                  N/A |
|  0%   34C    P8    15W / 350W |      1MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
varuna:~$

So, how can I fix this? Is there some way to tell the gpu-operator not to even attempt to run the nvidia-driver-daemonset? Any other suggestions?

shivamerla commented 2 years ago

@tmbdev Please install with the driver container disabled, since you seem to have drivers pre-installed on the node already: --set driver.enabled=false. Or use the latest version of the operator, v1.11.0, where the driver container will detect this and stay in the init phase.
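For reference, with a plain Helm install of the operator the equivalent would be roughly the following (the release name and namespace below are only placeholders):

helm install gpu-operator nvidia/gpu-operator \
    -n gpu-operator-resources --create-namespace \
    --set driver.enabled=false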

shivamerla commented 2 years ago

Also, note that you don't have to pre-install the drivers in the first place; the operator takes care of that.

  - name: installing CUDA from NVIDIA
    shell: |
      cd /tmp
      wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-ubuntu2204.pin
      mv cuda-ubuntu2204.pin /etc/apt/preferences.d/cuda-repository-pin-600
      apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/3bf863cc.pub
      add-apt-repository -y "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/ /"
      apt-get update
      apt-get -y install cuda
tmbdev commented 2 years ago

Ah, I see, just adding the option works: microk8s enable gpu --set driver.enabled=false
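After that, the pods can be checked again as before, e.g.:

microk8s.kubectl get pods -n gpu-operator-resources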

This leaves the mystery of why this was working on two machines and failing on one.

tmbdev commented 2 years ago

There is no choice: Ubuntu desktop machines necessarily have the NVIDIA drivers installed; the only question is whether they come from the Ubuntu repo or the NVIDIA repo.

Microk8s is frequently used on desktop machines, alongside Docker, Podman, and other GPU software; that's another reason the driver needs to be preinstalled.

shivamerla commented 2 years ago

Got it, it seems this option needs to be updated for microk8s installs. This problem will not happen with v1.11.0 of the operator, as it will not try to overwrite drivers that are already pre-installed.

wjentner commented 1 year ago

What helped me was to identify the pods named in this error (i.e., in your error logs):

cannot delete Pods with local storage (use --delete-emptydir-data to override): kube-system/metrics-server-5f8f64cb86-thks7, kube-system/kubernetes-dashboard-765646474b-f2bk5, kube-system/dashboard-metrics-scraper-6b6f796c8d-5gfcs
  1. Cordon the node
  2. Manually delete the pods mentioned in the error
  3. Restart the nvidia-driver-daemonset pod (not sure if this is necessary)

Afterwards, when the pod was restarted, it actually drained the node successfully, installed the drivers, and automatically uncordoned the node.
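In command form, those steps look roughly like this (a sketch using microk8s.kubectl as elsewhere in this thread; the pod names are the ones from the error above and will differ on other clusters):

microk8s.kubectl cordon varuna
microk8s.kubectl delete pod -n kube-system \
    metrics-server-5f8f64cb86-thks7 \
    kubernetes-dashboard-765646474b-f2bk5 \
    dashboard-metrics-scraper-6b6f796c8d-5gfcs
# Optionally restart the driver daemonset pod; on restart its k8s-driver-manager
# drains the node, installs the driver, and uncordons the node by itself.
microk8s.kubectl delete pod -n gpu-operator-resources nvidia-driver-daemonset-f482n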