dsatizabal opened this issue 5 months ago
I wrote an article on how to properly get the Aliyun scheduler extender working on Azure Kubernetes Service (AKS):
https://medium.com/@diego.satizabal_81239/using-aliyun-gpu-share-in-an-azure-aks-717cf7392d05
Hope it helps.
Scenario
I am trying to use the Aliyun scheduler extender to share a single NVIDIA T4 GPU among multiple Pods. I have a managed AKS cluster with a default node pool of standard VMs (Standard_D2_v3) and added a user node pool with Standard_NC4as_T4_v3 instances, all running Ubuntu 22.04.4. I enabled the default NVIDIA driver installation, and the driver and nvidia-smi are working:
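For context, the point of the gpushare extender is that Pods request a slice of GPU memory rather than a whole device. A minimal sketch of such a Pod spec, assuming the gpushare device plugin is advertising the `aliyun.com/gpu-mem` resource (the Pod name, image, and memory value below are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-share-test          # placeholder name
spec:
  containers:
  - name: cuda
    image: nvidia/cuda:12.2.0-base-ubuntu22.04   # any CUDA-capable image
    resources:
      limits:
        # Request 4 GiB of GPU memory instead of the whole T4.
        # This extended resource is advertised by the gpushare device plugin.
        aliyun.com/gpu-mem: 4
```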
I am following the installation instructions given here. I have already enabled SSH access to the GPU nodes, placed the scheduler-policy-config.yaml file in /etc/kubernetes, and placed the kube-scheduler.yaml file in the /etc/kubernetes/manifests folder.
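For reference, the extender wiring in scheduler-policy-config.yaml usually follows the defaults from the gpushare-scheduler-extender project. A minimal sketch, assuming the extender service is reachable on the default port 32766 (the kubeconfig path and URL are the usual defaults, not something specific to this cluster):

```yaml
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
clientConnection:
  # The scheduler authenticates to the API server with this kubeconfig
  kubeconfig: /etc/kubernetes/scheduler.conf
extenders:
- urlPrefix: "http://127.0.0.1:32766/gpushare-scheduler"
  filterVerb: filter
  bindVerb: bind
  enableHTTPS: false
  nodeCacheCapable: true
  managedResources:
  # Only aliyun.com/gpu-mem requests are delegated to the extender
  - name: aliyun.com/gpu-mem
    ignoredByScheduler: false
```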
My cluster runs Kubernetes 1.28:
Problem
My problem is that when I put the kube-scheduler.yaml file into the /etc/kubernetes/manifests folder, the Pod does not run: it stays in CrashLoopBackOff status and its logs show an authentication failure:
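For completeness, a typical kube-scheduler.yaml static Pod manifest mounts both the policy config and the scheduler kubeconfig from the host, which is why an empty or invalid /etc/kubernetes/scheduler.conf surfaces as an auth failure at startup. A sketch, assuming a stock kubeadm-style manifest (the image tag and paths are assumptions to adjust for the actual cluster):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: kube-scheduler
  namespace: kube-system
spec:
  hostNetwork: true
  containers:
  - name: kube-scheduler
    image: registry.k8s.io/kube-scheduler:v1.28.0   # match the cluster version
    command:
    - kube-scheduler
    # The config file below references /etc/kubernetes/scheduler.conf;
    # if that kubeconfig is empty, authentication to the API server fails here
    - --config=/etc/kubernetes/scheduler-policy-config.yaml
    volumeMounts:
    - name: kubeconfig
      mountPath: /etc/kubernetes/scheduler.conf
      readOnly: true
    - name: policy
      mountPath: /etc/kubernetes/scheduler-policy-config.yaml
      readOnly: true
  volumes:
  - name: kubeconfig
    hostPath:
      path: /etc/kubernetes/scheduler.conf
      type: FileOrCreate
  - name: policy
    hostPath:
      path: /etc/kubernetes/scheduler-policy-config.yaml
      type: FileOrCreate
```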
I tried setting the KUBERNETES_MASTER environment variable to the cluster's DNS name, including the port, but with no luck; I see that those variables get injected when the Pod runs.
I've noticed that the /etc/kubernetes/scheduler.conf file, which the command in this manifest relies on, is empty. I tried to generate certificates to build a valid scheduler configuration file, both from the token of a ServiceAccount and from the kubelet's kubeconfig, but I've failed.
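One way to get a non-empty scheduler.conf is to build a kubeconfig around a ServiceAccount token. A hypothetical sketch, where the API server address, CA bundle path, and token are placeholders to substitute from the actual cluster (and the ServiceAccount must be bound to scheduler RBAC roles):

```yaml
apiVersion: v1
kind: Config
clusters:
- name: aks
  cluster:
    # CA bundle and API server address are cluster-specific placeholders
    certificate-authority: /etc/kubernetes/certs/ca.crt
    server: https://<your-cluster-fqdn>:443
users:
- name: gpushare-scheduler
  user:
    token: <serviceaccount-token>   # token of a SA with scheduler permissions
contexts:
- name: gpushare-scheduler@aks
  context:
    cluster: aks
    user: gpushare-scheduler
current-context: gpushare-scheduler@aks
```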
I wanted to ask if someone has managed to successfully install Aliyun's extender on a managed AKS cluster with user node pools.
Thanks in advance!