Instead of shutting down the nodes manually in the fail script and setting the node to resume, the fail script now sets the node state to POWER_DOWN. Slurm then automatically calls terminate.sh, which terminates the node.
This seems to prevent the NOT_RESPONDING flag. Either way, it involves Slurm more in the shutdown process and is therefore probably the better solution.
This still needs to be tested on a larger run.
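
For reference, the fail-script behavior described above could look roughly like the sketch below. It is not the actual script from this change. It assumes the fail script is registered as Slurm's ResumeFailProgram (so the failed nodes arrive as a hostlist in the first argument) and that terminate.sh is configured as the SuspendProgram. The function name `power_down_nodes` and the `SCONTROL` override are hypothetical and only there to make the sketch easy to dry-run.

```shell
#!/bin/bash
# Sketch of a fail script that hands failed nodes back to Slurm.
# SCONTROL is a hypothetical override for dry runs; it defaults to scontrol.
SCONTROL="${SCONTROL:-scontrol}"

power_down_nodes() {
    # $1: Slurm hostlist expression of the failed nodes, e.g. "worker[1-3]"
    nodes="$1"
    if [ -z "$nodes" ]; then
        echo "no nodes given" >&2
        return 1
    fi
    # Mark the nodes POWER_DOWN instead of terminating them by hand.
    # Assumption: Slurm then runs its SuspendProgram (terminate.sh here).
    "$SCONTROL" update NodeName="$nodes" State=POWER_DOWN
}

# Assumption: Slurm passes the failed nodes as the first argument.
if [ "$#" -gt 0 ]; then
    power_down_nodes "$1"
fi
```

Setting `SCONTROL=echo` before running the script prints the command instead of executing it, which may help when checking the behavior outside a live cluster.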