This repository provides easy automation scripts for building a HPC environment in Azure. It also includes examples to build e2e environment and run some of the key HPC benchmarks and applications.
Reference implementation of how to integrate Azurehpc-health-checks with AKS.
Azurehpc-health-checks are integrated into k8s node problem detector (NPD) and then draino looks for specific GPU failed conditions and cordons/drains nodes.
Reference implementation of how to integrate Azurehpc-health-checks with AKS. Azurehpc-health-checks are integrated into k8s node problem detector (NPD) and then draino looks for specific GPU failed conditions and cordons/drains nodes.
NPD and Draino modified scripts are provided.