GPU Kubeflow cluster timeline and advice

odellus commented 3 years ago

I've been having some issues enabling GPUs on the kubeflow cluster I recently set up.

Per this discussion, it seems that microk8s enable gpu works best on microk8s version 1.22 for people who already have the nvidia-container-runtime installed on their system. However, as is well known by now, the kubeflow add-on is only supported up to version 1.21 of microk8s. I've tried both of the following:

Going through the steps to enable gpus with microk8s v1.21. The logs show the operator still installing its own nvidia-container-runtime, even though I explicitly passed --set driver.enabled=false when calling helm3 install.
Going through the steps of using juju and charmed operators to bootstrap a kubeflow cluster in microk8s v1.22, where I see the same seldon error as reported in #2496.

What should I do? Uninstall nvidia-container-runtime on my host and cross my fingers microk8s enable gpu will work in that case? If there's any way I can contribute to getting kubeflow running in microk8s v1.22 I'm willing to chip in and help. Any guidance at all on solving this problem would be greatly appreciated.

inspection-report-20211025_173607.tar.gz

odellus commented 3 years ago

I tried purging nvidia-container-runtime and I got the same error when passing --set driver.enabled=true. Getting rid of the drivers on the host did not solve my problem with not being able to access the gpu with microk8s v1.21.

I have successfully enabled the gpu add-on with microk8s v1.22 using the nvidia-container-runtime installed on the host and I have enabled the kubeflow add-on with microk8s v1.21. It's getting them both working together that I'm having trouble with.

So given that purging my host's nvidia-container-runtime did not work, is there a timeline for when kubeflow might be enabled for microk8s v1.22? Should I go bother the kubeflow people about this seldon crash? It seems they're actively working on getting their system set up for k8s v1.22.

ktsakalozos commented 3 years ago

Hi @odellus, a suggestion would be to use v1.20 because the GPU support on 1.21 is not in a good state and 1.22 does not have kubeflow.

odellus commented 3 years ago

Thank you for your advice. One thing I noticed when going down to v1.20 is that juju apparently doesn't have the refresh command I was using earlier to get the jupyter-ui pod working, as discussed here.

microk8s juju refresh jupyter-ui --revision 10
ERROR juju: "refresh" is not a juju command. See "juju --help".

I'm also not able to run nvidia-smi when I log into the pod with kubectl exec -it:

NVIDIA-SMI couldn't find libnvidia-ml.so library in your system. Please make sure that the NVIDIA Display Driver is properly installed and present in your system.
Please also try adding directory that contains libnvidia-ml.so to your system PATH.

So it doesn't seem like downgrading to v1.20 fixed my issues with being able to use the GPU.
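
The nvidia-smi message above says that libnvidia-ml.so could not be located. A minimal sketch of how one might check where (or whether) that library is visible, either on the host or inside the pod, follows. The helper name find_nvml and the default search directories are illustrative assumptions, not something taken from this thread.

```shell
#!/bin/sh
# find_nvml: print every libnvidia-ml.so* file found under the given
# directories. Hypothetical helper; the default directories below are
# assumed common locations and may not match your system.
find_nvml() {
    if [ "$#" -eq 0 ]; then
        set -- /usr/lib /usr/lib64 /usr/local/nvidia
    fi
    for dir in "$@"; do
        # Skip directories that do not exist instead of failing.
        [ -d "$dir" ] || continue
        find "$dir" -name 'libnvidia-ml.so*' 2>/dev/null
    done
}

# Usage (not run here):
#   find_nvml                            # search the assumed defaults
#   find_nvml /usr/lib/x86_64-linux-gnu  # search a specific directory
```

If the function prints nothing on the host, the driver libraries are probably not installed where the container runtime expects them. If it prints a path on the host but nothing inside the pod, the libraries are likely not being mounted into the container.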

odellus commented 3 years ago

I tried to install the nvidia-device-plugin from nvidia's helm chart. I'm seeing the same error as with microk8s 1.21.

$ microk8s helm3 install --generate-name nvdp/nvidia-device-plugin
$ kubectl logs -n kube-system ${POD} # name of the nvidia-device-plugin pod
2021/10/28 21:44:51 Loading NVML
2021/10/28 21:44:51 Failed to initialize NVML: could not load NVML library.
2021/10/28 21:44:51 If this is a GPU node, did you set the docker default runtime to `nvidia`?
2021/10/28 21:44:51 You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
2021/10/28 21:44:51 You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
2021/10/28 21:44:51 If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes
2021/10/28 21:44:51 Error: failed to initialize NVML: could not load NVML library

Logs for the nvidia-device-plugin pod installed with microk8s enable gpu show the same error.
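
The ${POD} placeholder in the commands above has to be filled in with the name of the device plugin pod. One way to do that is sketched below, under the assumption that the pod name contains the string nvidia-device-plugin. The helper name pick_plugin_pod is hypothetical.

```shell
#!/bin/sh
# pick_plugin_pod: read `kubectl get pods -o name` output on stdin and print
# the first pod whose name contains "nvidia-device-plugin", with the leading
# "pod/" prefix removed. Prints nothing if no pod matches.
pick_plugin_pod() {
    grep 'nvidia-device-plugin' | head -n 1 | sed 's|^pod/||'
}

# Usage (not run here):
#   POD=$(microk8s kubectl get pods -n kube-system -o name | pick_plugin_pod)
#   microk8s kubectl logs -n kube-system "$POD"
```

Taking only the first match keeps the sketch simple; on a multi-node cluster there may be one plugin pod per node, in which case each would need to be checked separately.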

odellus commented 3 years ago

I was able to get seldon v1.12-dev working with microk8s v1.22 yesterday in the hopes of being able to build the rest of kubeflow around it. Having access to the GPU is a much higher priority than getting kubeflow working. I can build what I need out of REST APIs in docker containers as long as they can access the GPU. Kubeflow is more a "nice to have" than the gpu.

odellus commented 2 years ago

The trick to enabling the GPU on microk8s v1.20 was to install the cuda drivers with the local .run script instead of installing the .deb files with dpkg. Closing.

canonical / microk8s

GPU Kubeflow cluster timeline and advice #2682