Azure / kubeflow-aks

Official repository for the Kubeflow on Azure and AKS project
https://azure.github.io/kubeflow-aks/main/
MIT License

Cannot Use GPU With Kubeflow AKS #37

Open tfontana1 opened 8 months ago

tfontana1 commented 8 months ago

Cannot use GPU when deploying a new notebook on Kubeflow.

  1. Created a GPU node pool following this guide: https://learn.microsoft.com/en-us/azure/aks/gpu-cluster?tabs=add-ubuntu-gpu-node-pool

  2. Created a new notebook and gave it access to 1 NVIDIA GPU. [Screenshot 2024-02-06 at 3:33:40 PM]

  3. The notebook deployment fails with this error on the Kubernetes pod: [Screenshot 2024-02-06 at 3:34:53 PM]

The notebook server never becomes ready and goes into a crash loop. I am able to run the GPU test example from the Microsoft tutorial, so I know that the GPU on AKS is configured properly.
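One way to double-check the node side of the setup described above is to look at what the nodes actually advertise. The following is only a minimal sketch, not something from this repository: it assumes the NVIDIA device plugin exposes GPUs under the standard `nvidia.com/gpu` resource name, and the helper name `summarize_gpu_nodes` is hypothetical. It takes the parsed output of `kubectl get nodes -o json` and lists each GPU node together with its taints, since a tainted GPU pool (for example `sku=gpu:NoSchedule`, as in the linked guide) only accepts pods that carry a matching toleration.

```python
GPU_RESOURCE = "nvidia.com/gpu"  # assumed resource name (NVIDIA device plugin default)


def summarize_gpu_nodes(node_list):
    """Return one summary dict per node that advertises allocatable GPUs.

    `node_list` is the parsed JSON from `kubectl get nodes -o json`.
    """
    summaries = []
    for node in node_list.get("items", []):
        allocatable = node.get("status", {}).get("allocatable", {})
        gpu_count = int(allocatable.get(GPU_RESOURCE, "0"))
        if gpu_count == 0:
            continue  # not a GPU node, or the device plugin is not running on it
        # Taints are omitted from the JSON when a node has none.
        taints = [
            f"{t['key']}={t.get('value', '')}:{t['effect']}"
            for t in (node.get("spec", {}).get("taints") or [])
        ]
        summaries.append({
            "name": node["metadata"]["name"],
            "gpus": gpu_count,
            "taints": taints,
        })
    return summaries
```

To use it, save the output of `kubectl get nodes -o json` to a file, load it with `json.load`, and pass the result to `summarize_gpu_nodes`. An empty result would suggest that no node currently exposes a GPU to the scheduler. This check does not explain the crash loop by itself; it only confirms what the cluster offers to the notebook pod.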

seenu433 commented 6 months ago

The cluster needs a node pool on a VM SKU that supports GPUs: https://learn.microsoft.com/en-us/azure/aks/gpu-cluster?tabs=add-ubuntu-gpu-node-pool

The construction set can be updated to add a new node pool with GPU SKUs.
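For reference, adding a GPU node pool can also be done directly with the Azure CLI, as the linked guide does, rather than through the construction set. The sketch below only assembles the `az aks nodepool add` arguments and does not run them. The helper name `build_nodepool_add_command` is hypothetical, and the defaults (pool name `gpunp`, VM size `Standard_NC6s_v3`, taint `sku=gpu:NoSchedule`) are illustrative assumptions, not values taken from this issue.

```python
def build_nodepool_add_command(resource_group, cluster_name,
                               pool_name="gpunp",
                               vm_size="Standard_NC6s_v3",
                               node_count=1,
                               taint="sku=gpu:NoSchedule"):
    """Build the argument list for `az aks nodepool add` (nothing is executed).

    All default values are placeholders; pick a GPU SKU that is available
    in your region and subscription.
    """
    cmd = [
        "az", "aks", "nodepool", "add",
        "--resource-group", resource_group,
        "--cluster-name", cluster_name,
        "--name", pool_name,
        "--node-count", str(node_count),
        "--node-vm-size", vm_size,
    ]
    if taint:
        # A tainted pool only schedules pods that carry a matching toleration.
        cmd += ["--node-taints", taint]
    return cmd
```

The returned list can be passed to `subprocess.run(cmd, check=True)` on a machine where the Azure CLI is installed and logged in. Passing `taint=None` leaves the pool untainted, so pods that request a GPU would not need a toleration to land on it.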