BiBiServ / bibigrid

BiBiGrid is a tool for an easy cluster setup inside a cloud environment.
Apache License 2.0

Fixes NOT_RESPONDING kept while shutdown #503

XaverStiensmeier closed this 1 month ago

XaverStiensmeier commented 2 months ago

Instead of shutting down the nodes manually in the fail script and setting the node to resume, the fail script now sets the node state to POWER_DOWN. Slurm then automatically calls terminate.sh, which terminates the node.
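To illustrate the idea, here is a minimal sketch of what the state change could look like. It is not the fail script from this PR. The function names are hypothetical, and it assumes that `scontrol` is on the PATH and that terminate.sh is what Slurm runs when a node is powered down. The only Slurm call used is `scontrol update NodeName=... State=POWER_DOWN`.

```python
import subprocess


def build_power_down_command(node_names: str) -> list[str]:
    """Build the scontrol call that asks Slurm to power the node(s) down.

    node_names is whatever Slurm passes to the fail script, e.g. a single
    node name or a hostlist expression (hypothetical helper, not BiBiGrid code).
    """
    return ["scontrol", "update", f"NodeName={node_names}", "State=POWER_DOWN"]


def power_down(node_names: str) -> int:
    """Hand the shutdown over to Slurm instead of deleting the node directly.

    Assumption: Slurm reacts to POWER_DOWN by running the configured
    power-down script (terminate.sh in this PR's description).
    """
    result = subprocess.run(build_power_down_command(node_names), check=False)
    return result.returncode
```

A fail script built along these lines would call `power_down` with the node list it receives from Slurm and do nothing else, so the actual termination stays under Slurm's control.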

This seems to prevent the NOT_RESPONDING flag from being kept. In any case, it involves Slurm more in the shutdown process and is therefore probably the better solution.

This should still be tested on a larger run.