AliyunContainerService / gpushare-scheduler-extender

GPU Sharing Scheduler for Kubernetes Cluster
Apache License 2.0

[AKS] kube-scheduler static POD not running for Aliyun GPU Scheduler Extender #224

Open dsatizabal opened 1 month ago

dsatizabal commented 1 month ago

Scenario: I am trying to use the Aliyun scheduler extender so that multiple Pods can share a single NVIDIA T4 GPU. I have a managed AKS cluster with a default NodePool of standard VMs (Standard_D2_v3), and I added a User NodePool with Standard_NC4as_T4_v3 instances, all running Ubuntu 22.04.4. I enabled the default NVIDIA driver installation, and both the driver and nvidia-smi are working:

image

I am following the instructions given here for the Aliyun installation. I have already enabled SSH access to the GPU nodes and placed the scheduler-policy-config.yaml file in /etc/kubernetes and the kube-scheduler.yaml file in the /etc/kubernetes/manifests folder.

My cluster runs K8S 1.28:

image

Problem: When I put the kube-scheduler.yaml file into the /etc/kubernetes/manifests folder, the Pod does not run. It stays in CrashLoopBackOff status and its logs show authentication failures:

image

I tried setting the KUBERNETES_MASTER environment variable to the cluster's DNS name, including the port, but with no luck. I can see that those variables get injected when the Pod runs.

I've noticed that the /etc/kubernetes/scheduler.conf file, which is used to run the command in this file, is empty. I tried to produce a valid scheduler configuration file by generating certificates, by using the token of a ServiceAccount, and by using the kubeconfig of the kubelet, but every attempt failed.
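
For readers unfamiliar with the token approach mentioned above, here is a minimal sketch of the general shape of a kubeconfig that authenticates with a ServiceAccount bearer token. It is not the configuration that was tried in this issue, and nothing here establishes that a token-based kubeconfig is enough for kube-scheduler on AKS. The function name `build_token_kubeconfig`, the entry name `gpushare-scheduler`, the server URL, the CA bytes, and the token are all hypothetical placeholders. The output is JSON, which should load as a kubeconfig because JSON is a subset of YAML.

```python
import base64
import json


def build_token_kubeconfig(server, ca_pem, token, name="gpushare-scheduler"):
    """Return a kubeconfig dict that authenticates with a bearer token.

    server : API server URL, including the port (placeholder in the demo below)
    ca_pem : cluster CA certificate as PEM bytes
    token  : ServiceAccount token string
    name   : hypothetical label reused for the cluster, user, and context
    """
    return {
        "apiVersion": "v1",
        "kind": "Config",
        "clusters": [{
            "name": name,
            "cluster": {
                "server": server,
                # kubeconfig expects the CA certificate base64-encoded
                "certificate-authority-data": base64.b64encode(ca_pem).decode("ascii"),
            },
        }],
        "users": [{"name": name, "user": {"token": token}}],
        "contexts": [{"name": name, "context": {"cluster": name, "user": name}}],
        "current-context": name,
    }


if __name__ == "__main__":
    # All three values below are placeholders, not real cluster data.
    config = build_token_kubeconfig(
        server="https://<api-server-host>:443",
        ca_pem=b"-----BEGIN CERTIFICATE-----\n...\n-----END CERTIFICATE-----\n",
        token="<service-account-token>",
    )
    print(json.dumps(config, indent=2))
```

The ServiceAccount behind the token would still need RBAC permissions suitable for a scheduler; that part is outside this sketch.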

I wanted to ask whether anyone has managed to successfully install the Aliyun extender on a managed AKS cluster with User NodePools.

Thanks in advance!

dsatizabal commented 13 hours ago

I wrote an article on how to properly get Aliyun scheduler extender to work in Azure Kubernetes Service (AKS):

https://medium.com/@diego.satizabal_81239/using-aliyun-gpu-share-in-an-azure-aks-717cf7392d05

Hope it helps.