LLNL / scr

SCR caches checkpoint data in storage on the compute nodes of a Linux cluster to provide a fast, scalable checkpoint / restart capability for MPI codes.
http://computing.llnl.gov/projects/scalable-checkpoint-restart-for-mpi

flux: test scalable restart correctness and performance #553

Open adammoody opened 1 year ago

adammoody commented 1 year ago

Here is the old issue opened with Flux about tolerating node failures.

https://github.com/flux-framework/flux-core/issues/4417

We want to be sure that scalable restart is supported in a Flux allocation in which a user has allocated a spare node. As an example, the full test process for that would be:

adammoody commented 1 year ago

As a first pass, let's create Flux versions of the job scripts that we already have for SLURM, scr_srun.sh and scr_srun_loop.sh:

https://github.com/LLNL/scr/tree/develop/scripts/jobscripts
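For discussion, here is a minimal sketch of the kind of restart loop a Flux job script might wrap around the launch command. It is not taken from the existing SLURM scripts. The function name `run_with_restarts`, the `LAUNCH_CMD` and `MAX_RUNS` variables, and the example `flux run` line are all assumptions made for illustration, and the SCR-specific steps (detecting and excluding failed nodes, checking for a halt condition) are left out.

```shell
#!/bin/bash
# Hypothetical sketch only; not the SCR project's actual job script.

# run_with_restarts MAX_RUNS CMD [ARGS...]
# Runs CMD until it exits with status 0 or MAX_RUNS attempts have been made.
# Returns 0 on the first successful run, 1 if every attempt failed.
run_with_restarts() {
  local max_runs=$1
  shift
  local attempt=1
  while [ "$attempt" -le "$max_runs" ]; do
    echo "attempt $attempt of $max_runs: $*"
    if "$@"; then
      return 0
    fi
    # A real script would check here for failed nodes and decide whether
    # enough healthy nodes (including the spare) remain to relaunch.
    attempt=$((attempt + 1))
  done
  return 1
}

# Assumed usage inside a Flux allocation (node and task counts are made up):
#   LAUNCH_CMD="flux run -N 2 -n 4 ./my_app" MAX_RUNS=3 ./this_script.sh
if [ -n "${LAUNCH_CMD:-}" ]; then
  # LAUNCH_CMD is intentionally unquoted so it splits into command and args.
  run_with_restarts "${MAX_RUNS:-3}" $LAUNCH_CMD
fi
```

The loop relaunches the same command each time, so it only exercises scalable restart if the application itself restarts from its most recent SCR checkpoint.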