aipredict opened this issue 3 years ago
An update: the pod nvidia-driver-daemonset will not be created after I install the GPU driver on the host machine, and all related pods work well:
kubectl get po -A
NAMESPACE               NAME                                                           READY   STATUS      RESTARTS   AGE
gpu-operator-resources  gpu-feature-discovery-8jxrm                                    1/1     Running     2          12h
gpu-operator-resources  nvidia-dcgm-exporter-886qt                                     1/1     Running     2          12h
gpu-operator-resources  nvidia-device-plugin-daemonset-qxk7z                           1/1     Running     2          12h
gpu-operator-resources  nvidia-container-toolkit-daemonset-9fxjl                       1/1     Running     2          12h
kube-system             coredns-7f9c69c78c-qxw9j                                       1/1     Running     11         24h
default                 gpu-operator-7db468cfdf-ghrfd                                  1/1     Running     2          12h
default                 gpu-operator-node-feature-discovery-master-867c4f7bfb-9mv2m   1/1     Running     2          12h
kube-system             calico-kube-controllers-8695b994-jlhfd                         1/1     Running     3          13h
gpu-operator-resources  nvidia-cuda-validator-lv2t7                                    0/1     Completed   0          80m
gpu-operator-resources  nvidia-device-plugin-validator-svl2c                           0/1     Completed   0          80m
kube-system             calico-node-h5rwh                                              1/1     Running     3          13h
gpu-operator-resources  nvidia-operator-validator-khdjk                                1/1     Running     2          12h
default                 gpu-operator-node-feature-discovery-worker-rhcgj               1/1     Running     3          12h
kube-system             hostpath-provisioner-5c65fbdb4f-66dpn                          1/1     Running     0          79m
kube-system             metrics-server-8bbfb4bdb-88j8s                                 1/1     Running     0          79m
I checked the script that enables gpu-operator in MicroK8s; it sets the Helm argument driver.enabled based on whether the GPU driver is already installed on the host machine.
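For context, here is a minimal sketch of that logic, assuming a host-driver check via lsmod; the actual MicroK8s addon script may use different checks and flags:

```sh
# Sketch only: choose the Helm value based on whether the host driver is present.
# The lsmod check below is an assumption, not the exact MicroK8s addon logic.
if lsmod | grep -q "^nvidia "; then
  DRIVER_ENABLED=false   # host already provides the NVIDIA kernel driver
else
  DRIVER_ENABLED=true    # let gpu-operator deploy the driver container
fi

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm install gpu-operator nvidia/gpu-operator --set driver.enabled="${DRIVER_ENABLED}"
```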
The current issue focuses on the case where the GPU driver is installed by gpu-operator: the pod nvidia-driver-daemonset always fails on Ubuntu 20.04.2.
Interesting, the error is not related to the kernel version. The base image used to build this container is nvidia/cuda:11.3.0-base-ubuntu20.04, and the first package-related command it runs is apt-get update. The same works fine with AWS Ubuntu 20.04 kernels (e.g. 5.4.0-1048-aws). I will check with the CUDA team here to verify whether this has been seen.
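If you want to reproduce that step outside the cluster, and assuming Docker is available on the node and the image tag is still published, the same apt-get call can be run directly in the base image:

```sh
# Run the first package-related command in the same base image the driver container uses
docker run --rm nvidia/cuda:11.3.0-base-ubuntu20.04 apt-get update
```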
Installing microk8s v1.22 worked. It looks like an issue with microk8s: https://github.com/canonical/microk8s/issues/2634
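For anyone hitting the same problem, moving to MicroK8s 1.22 via snap and re-enabling the addon would look roughly like this (channel name shown is the standard 1.22 track):

```sh
# Fresh install of MicroK8s 1.22 (use "snap refresh" instead to move an existing install)
sudo snap install microk8s --classic --channel=1.22/stable
microk8s enable gpu
```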
The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.
1. Quick Debug Checklist

- Do you have i2c_core and ipmi_msghandler loaded on the nodes?
- Did you apply the CRD (kubectl describe clusterpolicies --all-namespaces)?

I'm running MicroK8s on Ubuntu 20.04.2. I did not install any GPU driver on the host machine; the GPU is a GeForce GTX 1650.
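To answer those checklist items on a MicroK8s node, something along these lines should work (note that i2c_core may not show up in lsmod if it is built into the kernel):

```sh
# Check whether the modules from the checklist are loaded
lsmod | grep -E 'i2c_core|ipmi_msghandler'

# Inspect the ClusterPolicy created by gpu-operator
microk8s kubectl describe clusterpolicies --all-namespaces
```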
1. Issue or feature description

Enable gpu-operator in MicroK8s: the pod nvidia-driver-daemonset always fails. I checked its logs; it seems an error occurred while fetching a package. I checked the Helm chart; the latest release, v1.7.0, is installed.
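For reference, the failing pod and its logs can be inspected like this (the pod-name suffix is a placeholder, not the actual name from this cluster):

```sh
microk8s kubectl get pods -n gpu-operator-resources
# Substitute the real suffix reported by the previous command
microk8s kubectl logs -n gpu-operator-resources nvidia-driver-daemonset-xxxxx
```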
2. Steps to reproduce the issue

On Ubuntu 20.04.2 with an NVIDIA GPU:
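A minimal reproduction sketch, assuming a fresh machine with no NVIDIA driver installed; the snap channel is an assumption based on the MicroK8s releases current at the time:

```sh
# Fresh Ubuntu 20.04.2 machine, no NVIDIA driver installed on the host
sudo snap install microk8s --classic --channel=1.21/stable   # channel assumed
microk8s enable gpu

# Watch the nvidia-driver-daemonset pod fail
microk8s kubectl get pods -A
```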