Open tmbdev opened 2 years ago
@tmbdev Please install with the driver container disabled, as you seem to have drivers pre-installed on the node already: --set driver.enabled=false. Or use the latest version of the operator, v1.11.0, where the driver container will detect this and stay in the init phase.
Also, note that you don't have to pre-install the drivers in the first place; the operator takes care of it.
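For reference, a minimal sketch of that install with Helm (chart name and namespace are assumptions; with microk8s the same flag can be passed to the gpu addon, as noted later in the thread):

helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace \
  --set driver.enabled=false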
- name: installing CUDA from NVIDIA
  shell: |
    cd /tmp
    wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-ubuntu2204.pin
    mv cuda-ubuntu2204.pin /etc/apt/preferences.d/cuda-repository-pin-600
    apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/3bf863cc.pub
    add-apt-repository -y "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/ /"
    apt-get update
    apt-get -y install cuda
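Not part of the original task, but a quick check that the cuda meta-package also pulled in the kernel driver (which is exactly why these nodes end up with a pre-installed driver):

nvidia-smi                      # should report the driver version and the GPUs
dpkg -l | grep nvidia-driver    # driver packages pulled in by the cuda meta-package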
Ah, I see, just adding the option works: microk8s enable gpu --set driver.enabled=false
This leaves the mystery of why this was working on two machines and failing on one.
There is no choice here: Ubuntu desktop machines necessarily have the NVIDIA drivers installed; the only question is whether they come from the Ubuntu repo or the NVIDIA repo.
Microk8s is frequently used on desktop machines alongside Docker, Podman, and other GPU software; that's another reason the driver needs to be preinstalled.
Got it; it seems this option needs to be updated for microk8s installs. This problem will not happen with v1.11.0 of the operator, as it will not try to overwrite drivers that are already pre-installed.
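A rough sketch of moving to that release with Helm, assuming the operator came from the nvidia/gpu-operator chart (release name and chart version string are assumptions):

helm upgrade gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --version v1.11.0 \
  --set driver.enabled=false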
What helped me was to identify the pods named in this error (i.e., in your error logs):
cannot delete Pods with local storage (use --delete-emptydir-data to override): kube-system/metrics-server-5f8f64cb86-thks7, kube-system/kubernetes-dashboard-765646474b-f2bk5, kube-system/dashboard-metrics-scraper-6b6f796c8d-5gfcs
Once the pod was restarted, it actually drained the node successfully, installed the drivers, and automatically uncordoned the node afterwards.
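The error itself points at a workaround: the listed pods use emptyDir volumes, so a manual drain needs the override flag named in the message (node name is a placeholder):

microk8s kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
# once the driver daemonset has finished, bring the node back:
microk8s kubectl uncordon <node-name>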
I'm on Ubuntu 22.04. I'm installing everything with the following Ansible script, including microk8s 1.24/edge.
Note that channel 1.24/edge is needed due to unrelated bugs in microk8s.
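Not part of the quoted snippet, but the corresponding install command for that channel would be something along these lines (a sketch, assuming a plain snap install as the Ansible script presumably does):

snap install microk8s --classic --channel=1.24/edge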
I have installed this on three machines. It works fine on two of them (each has two 2080 cards). I can kubectl apply the vector-add example on both of them.
The third machine has a 3090 GPU and it is failing. It appears there is a problem with the k8s-driver-manager:
This is everything that's running on the microk8s installation:
It appears something is trying to "unload the driver". The other nodes are running the same driver, the latest from the NVIDIA CUDA repo (515.48.07).
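A quick way to compare the driver across nodes (a standard nvidia-smi query, nothing operator-specific):

nvidia-smi --query-gpu=name,driver_version --format=csv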
These are the logs from the nvidia-driver-daemonset containers:

The difference appears to be that the machine with the 3090 card actually tries to run an nvidia-driver-daemonset (which tries to unload the driver and is failing), while the machines with the 2080 cards don't. Why they behave differently is beyond me, since both cards are fairly old by now and I have the latest kernel drivers installed; there shouldn't be any need to unload/reload the driver on either of them.

All machines run GPU containers fine under docker and podman, so the driver is perfectly functional.
I have tried booting the machine in text mode (no processes using the GPU according to nvidia-smi), and the nvidia-driver-daemonset fails in the same way.
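One thing worth checking when the unload step fails even in text mode (plain shell tools, nothing operator-specific; lsof may need installing):

lsmod | grep '^nvidia'                 # nvidia modules and their reference counts
sudo lsof /dev/nvidia* 2>/dev/null     # processes still holding the device nodes, if any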
So, how can I fix this? Is there some way to tell the gpu-operator not to even attempt to run the nvidia-driver-daemonset? Any other suggestions?
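For completeness, after disabling the driver container one can confirm that no nvidia-driver-daemonset gets created; the namespace below is an assumption based on the microk8s gpu addon defaults, so adjust if yours differs:

microk8s kubectl -n gpu-operator-resources get daemonset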