SCR caches checkpoint data in storage on the compute nodes of a Linux cluster to provide a fast, scalable checkpoint / restart capability for MPI codes.
This adds a --min-nodes option to the scr_should_exit command, which lets a user specify how many nodes are required to continue running in an allocation that may include spare nodes. By default, the command uses $SCR_MIN_NODES if it is set. Otherwise, it tries to read the .scr/nodes.scr file and use the number of nodes from the previous run. Failing that, it assumes all nodes in the allocation are required. The new option overrides all of these mechanisms.
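The lookup order for the required node count can be sketched roughly as follows. This is an illustration of the precedence only, not the code in scr_should_exit: the function name `resolve_min_nodes` and its parameters are hypothetical, and the node count from the previous run is passed in already parsed because the layout of .scr/nodes.scr is not described here.

```python
import os
from typing import Mapping, Optional


def resolve_min_nodes(
    cli_min_nodes: Optional[int],
    prev_run_nodes: Optional[int],
    allocation_nodes: int,
    env: Mapping[str, str] = os.environ,
) -> int:
    """Return the node count required to keep running (illustrative only).

    cli_min_nodes:    value given with --min-nodes, or None if not given
    prev_run_nodes:   count recovered from .scr/nodes.scr, or None if the
                      file could not be read (parsing is out of scope here)
    allocation_nodes: total number of nodes in the allocation
    """
    # 1. An explicit --min-nodes value overrides every other mechanism.
    if cli_min_nodes is not None:
        return cli_min_nodes

    # 2. Otherwise use $SCR_MIN_NODES if it is set.
    #    Assumption: an empty string is treated the same as unset.
    env_value = env.get("SCR_MIN_NODES")
    if env_value:
        return int(env_value)

    # 3. Otherwise reuse the node count from the previous run, if known.
    if prev_run_nodes is not None:
        return prev_run_nodes

    # 4. Last resort: assume every node in the allocation is required.
    return allocation_nodes
```

For example, in an allocation of 8 nodes where the previous run used 6, the sketch returns 6 when neither the option nor the environment variable is given, and 4 when a caller passes 4 as the --min-nodes value.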
This adds a --runs option to the scr_should_exit command. This is a temporary option, which simplifies the example job scripts. Ultimately, it would be nice to update the SCR library to support this logic instead. For example, SCR could read the halt file during SCR_Init and update any "runs remaining" value, similar to how it decrements the checkpoints remaining value. At that point, this temporary option could be dropped.