Azure / AKS

Azure Kubernetes Service
1.92k stars 284 forks source link

[Feature] Extend node-problem detector to support health check for HPC/AI specialty SKU's #4198

Open garvct opened 1 month ago

garvct commented 1 month ago

Is your feature request related to a problem? Please describe. It is strongly recommended that AI customers should run some specific node health checks before attempting to run a large training job. Nodes which fail the health checks are not included in the training job. AKS currently does not have an integration of HPC/AI SKU's recommended health checks.

Describe the solution you'd like The recommended HPC/AI SKU healthchecks are contained here https://github.com/Azure/azurehpc-health-checks . The proposed solution would be to use/leverage K8s node problem detector framework and integrate the azurehpc-health-checks into the node problem detector framework, allowing AKS users to opt-in to these tests and have control over which azurehpc-health-checks run via NPD.

Similar work has been done with the Cyclecloud SLURM scheduler (integrating HPC/AI specific healthchecks wit hthe SLURM scheduler). https://techcommunity.microsoft.com/t5/azure-high-performance-computing/automated-hpc-ai-compute-node-health-checks-integrated-with-the/ba-p/3113454