Azure / cyclecloud-slurm

Azure CycleCloud project to enable users to create, configure, and use Slurm HPC clusters.
MIT License

Lazily start slurmd so that additional cluster init can fail nodes #162

Closed by ryanhamel 1 year ago

ryanhamel commented 1 year ago

In our move away from Chef, we introduced a regression: slurmd is now started immediately. This, combined with Slurm's weak contract with ResumeProgram (Slurm does not wait for this program to exit before starting a job), can cause a job to start on a node where additional cluster init has failed.
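
To illustrate the ordering this issue asks for, here is a minimal sketch in Python. It is not the project's actual implementation. The function names, the `init_steps` parameter, and the use of `systemctl start slurmd` as the start command are assumptions made for this example. The idea is that slurmd is started only after every additional init step has exited successfully, so a node with a failed init never runs slurmd and cannot accept a job.

```python
import subprocess
from typing import Callable, Sequence

Command = Sequence[str]


def run_command(cmd: Command) -> int:
    """Run a command and return its exit code."""
    return subprocess.run(list(cmd)).returncode


def finish_node_setup(
    init_steps: Sequence[Command],
    # Assumption: slurmd is managed by systemd on the node.
    start_slurmd: Command = ("systemctl", "start", "slurmd"),
    run: Callable[[Command], int] = run_command,
) -> bool:
    """Start slurmd lazily: only if every init step exits with 0.

    Returns True if slurmd was started, False otherwise.
    """
    for step in init_steps:
        if run(step) != 0:
            # An init step failed: stop here and leave slurmd stopped,
            # so no job can be launched on this node.
            return False
    # All init steps succeeded, so it is now safe to start slurmd.
    return run(start_slurmd) == 0
```

The `run` parameter exists only so the ordering can be exercised without touching a real node, for example by passing a function that records the commands it receives and returns a chosen exit code.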