Azure / azurehpc

This repository provides easy automation scripts for building a HPC environment in Azure. It also includes examples to build e2e environment and run some of the key HPC benchmarks and applications.
MIT License
122 stars 65 forks source link

Integrate GPU node health checks into AKS #746

Closed garvct closed 4 months ago

garvct commented 4 months ago

Reference implementation of how to integrate Azurehpc-health-checks with AKS. Azurehpc-health-checks are integrated into k8s node problem detector (NPD) and then draino looks for specific GPU failed conditions and cordons/drains nodes.

NPD and Draino modified scripts are provided.