ECP-VeloC / VELOC

Very-Low Overhead Checkpointing System
http://veloc.rtfd.io
MIT License

restart-in-place: detect halt file from library to know when to stop restarting #13

Closed adammoody closed 1 year ago

adammoody commented 5 years ago

Unless told otherwise, the restart-in-place scripts assume the job must always be restarted, even when it actually ran to completion. To keep the scripts from auto-restarting a finished job, they need a way to know that the job ended on purpose.

Note that the exit code of the launch command is not a sufficient signal, because some jobs return a non-zero exit code to convey other information, e.g., that the calculation went bad.

With SCR, we ended up writing a "halt" file in SCR_Finalize, and then we look for that "halt" file in the scripts. If we see it, we assume the job completed and we won't try to restart it. If there is no file, the scripts will try to restart the job.
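As a rough illustration of the script-side logic described above, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the function name `run_until_halt`, the `max_attempts` cap, the halt file path being passed in by the caller, and the choice to delete a stale halt file before the first launch. It does not reflect the actual SCR or VELOC scripts.

```python
import subprocess
from pathlib import Path


def run_until_halt(launch_cmd, halt_file, max_attempts=3):
    """Relaunch launch_cmd until halt_file exists or attempts run out.

    Hypothetical helper: returns (launches_performed, completed), where
    completed is True only if the halt file was found after a launch.
    """
    halt_file = Path(halt_file)
    # Assumption of this sketch: a halt file left over from an earlier
    # job is removed first, since it would otherwise suppress restarts.
    halt_file.unlink(missing_ok=True)

    for attempt in range(1, max_attempts + 1):
        # The exit code is deliberately ignored: a non-zero code may just
        # carry application-specific information.
        subprocess.run(launch_cmd)
        if halt_file.exists():
            # The library wrote the halt file at finalize, so the job
            # ended on purpose and must not be restarted.
            return attempt, True
    return max_attempts, False
```

A wrapper script could call `run_until_halt(["srun", "./app"], "/path/to/halt")` (both arguments are placeholders) and treat `completed == False` as "gave up after repeated failures".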

bnicolae commented 1 year ago

This issue has been inactive for a long time. Please reopen it if it is still relevant.