ECP-VeloC / VELOC

Very-Low Overhead Checkpointing System
http://veloc.rtfd.io
MIT License

restart-in-place: record number of nodes used in first run, so restart logic knows whether enough healthy nodes exist #14

Closed adammoody closed 1 year ago

adammoody commented 5 years ago

To know whether enough nodes are left for a restart, it would be useful to have the first job that runs record the number of nodes it used in a file. The restart scripts can then read that file to find out how many nodes are needed and check whether enough healthy nodes remain. As a workaround, the user can set a variable or config param stating the number of nodes they need, like VELOC_MIN_NODES. However, it would be nice to automate this, since it's one less setting for the user.
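
A rough sketch of the idea, not actual VeloC code: the first run writes its node count to a file, and the restart logic reads it back and compares it against the number of healthy nodes. The file name `nodes_used.txt`, the function names, and the choice to let `VELOC_MIN_NODES` take precedence over the recorded value are all assumptions made for illustration.

```python
import os
from pathlib import Path

# Hypothetical file name; VeloC does not define this.
NODE_COUNT_FILE = "nodes_used.txt"


def record_nodes_used(state_dir, num_nodes):
    """Record the node count of the first run; later runs leave the file alone."""
    path = Path(state_dir) / NODE_COUNT_FILE
    if not path.exists():
        path.write_text(f"{num_nodes}\n")
    return path


def nodes_needed(state_dir, env=None):
    """Return the required node count, or None if it is unknown.

    Assumption: a user-provided VELOC_MIN_NODES overrides the recorded value.
    """
    env = os.environ if env is None else env
    override = env.get("VELOC_MIN_NODES")
    if override:
        return int(override)
    path = Path(state_dir) / NODE_COUNT_FILE
    if path.exists():
        return int(path.read_text().strip())
    return None


def can_restart(state_dir, healthy_nodes, env=None):
    """True if the required node count is known and enough healthy nodes remain."""
    needed = nodes_needed(state_dir, env)
    return needed is not None and healthy_nodes >= needed
```

With something like this, the first job would call `record_nodes_used(state_dir, n)` once at startup, and the restart script would call `can_restart(state_dir, healthy)` after counting healthy nodes. If neither the file nor the variable is present, the sketch refuses to restart rather than guessing.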

bnicolae commented 1 year ago

This issue stayed inactive for a long time. Please reopen if still relevant.