canonical / microk8s

MicroK8s is a small, fast, single-package Kubernetes for datacenters and the edge.
https://microk8s.io
Apache License 2.0

GPU Kubeflow cluster timeline and advice #2682

Closed: odellus closed this issue 2 years ago

odellus commented 3 years ago

I've been having some issues enabling GPUs on the Kubeflow cluster I recently set up.

Per this discussion, it seems that microk8s enable gpu works best on microk8s v1.22 for people who already have nvidia-container-runtime installed on their system. However, as is well known by now, the kubeflow add-on is only supported up to microk8s v1.21. I've tried both:

  1. Going through the steps to enable GPUs with microk8s v1.21. Logs show the operator still installing its own nvidia-container-runtime, despite my passing --set driver.enabled=false when calling helm3 install (see the sketch below this list).

  2. Going through the steps of using juju and charmed operators to bootstrap a Kubeflow cluster on microk8s v1.22, and seeing the same Seldon error as reported in #2496.
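
For reference, step 1 was roughly along the lines below (a sketch assuming the NVIDIA gpu-operator chart from helm.ngc.nvidia.com; the exact repo URL, chart name, and extra flags may differ from what I actually ran):

$ microk8s helm3 repo add nvidia https://helm.ngc.nvidia.com/nvidia
$ microk8s helm3 repo update
$ microk8s helm3 install gpu-operator nvidia/gpu-operator --set driver.enabled=false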

What should I do? Uninstall nvidia-container-runtime on my host and cross my fingers that microk8s enable gpu will work in that case? If there's any way I can contribute to getting Kubeflow running on microk8s v1.22, I'm willing to chip in and help. Any guidance at all on solving this problem would be greatly appreciated.

inspection-report-20211025_173607.tar.gz

odellus commented 3 years ago

I tried purging nvidia-container-runtime and got the same error when passing --set driver.enabled=true. Getting rid of the drivers on the host did not solve my problem of not being able to access the GPU with microk8s v1.21.
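
For completeness, the purge was just the stock apt removal (a sketch assuming Ubuntu; exact package names vary by distro and driver version):

$ sudo apt-get purge nvidia-container-runtime
$ sudo apt-get autoremove   # removes now-unneeded dependencies such as nvidia-container-toolkit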

I have successfully enabled the gpu add-on with microk8s v1.22 using the nvidia-container-runtime installed on the host and I have enabled the kubeflow add-on with microk8s v1.21. It's getting them both working together that I'm having trouble with.

So given that purging my host's nvidia-container-runtime did not work, is there a timeline for when kubeflow might be enabled for microk8s v1.22? Should I go bother the kubeflow people about this seldon crash? It seems they're actively working on getting their system set up for k8s v1.22.
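
To be clear, each add-on works on its own; roughly (one snap channel at a time):

# on a microk8s v1.22 node, with nvidia-container-runtime present on the host
$ microk8s enable gpu

# on a microk8s v1.21 node
$ microk8s enable kubeflow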

ktsakalozos commented 3 years ago

Hi @odellus, a suggestion would be to use v1.20 because the GPU support on 1.21 is not in a good state and 1.22 does not have kubeflow.
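
One way to get onto 1.20, assuming microk8s was installed from the snap (note that removing the snap discards existing cluster state and add-ons need to be re-enabled afterwards):

$ sudo snap remove microk8s
$ sudo snap install microk8s --classic --channel=1.20/stable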

odellus commented 3 years ago

Thank you for your advice. One thing I noticed when going down to v1.20 is that juju apparently doesn't have the refresh command I was using earlier to get the jupyter-ui pod working, as discussed here.

microk8s juju refresh jupyter-ui --revision 10
ERROR juju: "refresh" is not a juju command. See "juju --help".
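
In case it helps anyone else: older juju releases appear to expose the same operation as upgrade-charm rather than refresh, so something like the following might be the 1.20-era equivalent (unverified against the bundled juju):

$ microk8s juju upgrade-charm jupyter-ui --revision 10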

I'm also not able to run nvidia-smi when I log into the pod with kubectl exec -it:

NVIDIA-SMI couldn't find libnvidia-ml.so library in your system. Please make sure that the NVIDIA Display Driver is properly installed and present in your system.
Please also try adding directory that contains libnvidia-ml.so to your system PATH.

So it doesn't seem like downgrading to v1.20 fixed my issue with using the GPU.
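
For what it's worth, that NVML error inside a pod usually means the driver libraries were never mounted into the container. A quick host-side sanity check (assuming the driver is supposed to live on the host) is:

$ ldconfig -p | grep libnvidia-ml
$ nvidia-smi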

odellus commented 3 years ago

I tried installing the nvidia-device-plugin from NVIDIA's Helm chart and I'm seeing the same error as on microk8s v1.21.

$ microk8s helm3 install --generate-name nvdp/nvidia-device-plugin
$ kubectl logs -n kube-system ${POD} # name of the nvidia-device-plugin pod
2021/10/28 21:44:51 Loading NVML
2021/10/28 21:44:51 Failed to initialize NVML: could not load NVML library.
2021/10/28 21:44:51 If this is a GPU node, did you set the docker default runtime to `nvidia`?
2021/10/28 21:44:51 You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
2021/10/28 21:44:51 You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
2021/10/28 21:44:51 If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes
2021/10/28 21:44:51 Error: failed to initialize NVML: could not load NVML library

Logs for the nvidia-device-plugin pod installed with microk8s enable gpu show the same error.
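A related check that might save someone time: whether the node ever advertises the GPU resource at all (it typically won't until the device plugin initializes NVML successfully):

$ microk8s kubectl describe node | grep -i 'nvidia.com/gpu'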

odellus commented 3 years ago

I was able to get Seldon v1.12-dev working with microk8s v1.22 yesterday, in the hopes of building the rest of Kubeflow around it. Having access to the GPU is a much higher priority than getting Kubeflow working: I can build what I need out of REST APIs in Docker containers as long as they can access the GPU. Kubeflow is more of a "nice to have" than the GPU is.
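
As a sanity check for that route, a GPU container can be smoke-tested outside Kubernetes entirely, assuming Docker plus the NVIDIA container toolkit on the host (the image tag below is only an example and may need adjusting):

$ docker run --rm --gpus all nvidia/cuda:11.4.2-base-ubuntu20.04 nvidia-smi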

odellus commented 2 years ago

The trick to enabling GPU support on microk8s v1.20 was to install the CUDA drivers with the local .run script instead of the .deb packages installed with dpkg. Closing.
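
Roughly, that means removing the dpkg-installed packages and using NVIDIA's runfile installer instead (a sketch; the exact .run filename comes from NVIDIA's download page and the version placeholder below is illustrative):

$ sudo apt-get purge 'nvidia-*' 'cuda-*'                          # drop the .deb-based driver/toolkit
$ sudo sh cuda_<version>_linux.run --silent --driver --toolkit    # install driver and toolkit from the runfile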