Azure / azurehpc

This repository provides easy automation scripts for building an HPC environment in Azure. It also includes examples that build an end-to-end environment and run some of the key HPC benchmarks and applications.
MIT License

Fix Cc slurm nhc autoscaling bug #708

Closed garvct closed 1 year ago

garvct commented 1 year ago

Fixed a bug that occurred when NHC is used with autoscaling enabled (i.e. AUTOSCALING=1). The PROLOG now returns an exit code: 0 if NHC passed, 1 if NHC failed.

Added a PROLOG option that applies only when AUTOSCALING=1. If PROLOG_NOHOLD_REQUEUE=1 and NHC fails (prolog exit code = 1), the job is requeued with no hold, i.e. an attempt will be made to allocate new VMs for the job. The default (PROLOG_NOHOLD_REQUEUE=0) is that the job is requeued with a hold and the node is deallocated.
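
For illustration only, the decision described above could be sketched roughly as follows. This is not the code from this PR: the function name `decide_prolog_action` and the action labels it prints are hypothetical, and the sketch only models the AUTOSCALING=1 case. It takes the NHC exit code as an argument and reads PROLOG_NOHOLD_REQUEUE from the environment.

```shell
#!/bin/bash
# Hypothetical sketch of the prolog decision when AUTOSCALING=1.
# decide_prolog_action and the printed labels are illustrative names,
# not identifiers taken from the azurehpc scripts.

decide_prolog_action() {
    # $1: exit code returned by NHC (0 = passed, non-zero = failed)
    nhc_rc="$1"
    # Default matches the PR description: requeue with a hold.
    nohold="${PROLOG_NOHOLD_REQUEUE:-0}"

    if [ "$nhc_rc" -eq 0 ]; then
        # NHC passed: the prolog exits 0 and the job can start.
        echo "run"
        return 0
    fi

    if [ "$nohold" = "1" ]; then
        # NHC failed and the no-hold option is set:
        # requeue without a hold so new VMs can be tried.
        echo "requeue_nohold"
    else
        # NHC failed, default behaviour:
        # requeue with a hold and deallocate the node.
        echo "requeue_hold"
    fi
    # NHC failed: the prolog exits 1 in both cases.
    return 1
}
```

A prolog script could call `decide_prolog_action "$nhc_rc"` after running NHC, act on the printed label, and then exit with the function's return code.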