ECP-VeloC / VELOC

Very-Low Overhead Checkpointing System
http://veloc.rtfd.io
MIT License

SLURM restart-in-place script hangs when forcing prolog on down node #22

Closed CamStan closed 1 year ago

CamStan commented 5 years ago

If a node is part of the allocation but is down (i.e., listed in down_nodes), the restart-in-place script hangs when it tries to force the prolog to run on that node.

https://github.com/ECP-VeloC/VELOC/blob/4144d924d9ae8d562ab2f187236a359a4eb0bab7/scripts/SLURM/veloc_srun.in#L57-L59
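One possible workaround, sketched here under assumptions rather than taken from the VELOC scripts, is to drop the down nodes from the node list before launching anything on them. The helper name `filter_down_nodes` is hypothetical, and it assumes both lists are already expanded, comma-separated hostnames (for example, as produced from `scontrol show hostnames`):

```shell
#!/bin/bash
# Hypothetical helper (not part of VELOC): remove down nodes from a
# comma-separated node list so that nothing is launched on them.
# Assumes both arguments are expanded hostnames, not SLURM range syntax.
filter_down_nodes() {
    local all_nodes="$1" down_nodes="$2"
    local node up_nodes=""
    local IFS=','
    for node in $all_nodes; do
        # Wrap both sides in commas so "node1" does not match "node10".
        case ",$down_nodes," in
            *",$node,"*) ;;   # node is down: skip it
            *) up_nodes="${up_nodes:+$up_nodes,}$node" ;;
        esac
    done
    echo "$up_nodes"
}
```

The filtered list could then be passed to `srun --nodelist`, or the down list to `srun --exclude`. Whether that fits the script's later restart logic is not something this sketch verifies.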

bnicolae commented 1 year ago

This issue has been inactive for a long time. Please reopen it if it is still relevant.