In our move away from Chef, we introduced a regression: we now start slurmd immediately. Combined with Slurm's weak contract with ResumeProgram (Slurm does not wait for this program to exit before starting a job), this can cause a job to start on a node where additional cluster init failed.
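As an illustration of the ordering problem, here is a minimal sketch of gating slurmd on successful init. It is not the actual fix in this change. The names `bring_up_node`, `init_steps`, and `start_slurmd` are hypothetical, and the sketch assumes that a node whose slurmd never comes up will not be handed a job.

```python
from typing import Callable, Iterable


def bring_up_node(
    init_steps: Iterable[Callable[[], None]],
    start_slurmd: Callable[[], None],
) -> bool:
    """Run every init step in order; start slurmd only if all succeed.

    Hypothetical helper: both arguments are placeholders for whatever
    the real provisioning code does.
    """
    for step in init_steps:
        try:
            step()
        except Exception as exc:
            name = getattr(step, "__name__", repr(step))
            print(f"cluster init step {name} failed: {exc}")
            # Leave slurmd stopped so the node does not register as ready.
            return False
    # All init steps succeeded, so it is now safe to start slurmd.
    start_slurmd()
    return True
```

In a real setup `start_slurmd` might wrap something like `subprocess.run(["systemctl", "start", "slurmd"], check=True)`, assuming slurmd is managed as a systemd unit on the node.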