LLNL / scr

SCR caches checkpoint data in storage on the compute nodes of a Linux cluster to provide a fast, scalable checkpoint / restart capability for MPI codes.
http://computing.llnl.gov/projects/scalable-checkpoint-restart-for-mpi

scripts: improve job launcher interface #578

Open adammoody opened 10 months ago

adammoody commented 10 months ago

The JobLauncher interface for launch_run() is ambiguous: some launchers require the list of nodes to run on (like aprun and mpirun), while others take the list of nodes to avoid (like srun and jsrun). For proper polymorphism, we need to settle on one convention so that launch_run() means the same thing across all implementations.
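One way to picture a consistent interface is sketched below. It assumes, purely for illustration, that callers always pass the nodes to run on and that launchers needing an avoid list derive it from the allocation. The issue does not choose between the two conventions, and every name here (`nodes_to_exclude`, `node_selection`, `IncludeLauncher`, `ExcludeLauncher`) is hypothetical rather than part of the SCR scripts.

```python
def nodes_to_exclude(alloc_nodes, run_nodes):
    """Return the allocated nodes that are not in run_nodes, in allocation order."""
    keep = set(run_nodes)
    return [node for node in alloc_nodes if node not in keep]


class JobLauncher:
    """Hypothetical base class: callers always pass the nodes to run on."""

    def node_selection(self, alloc_nodes, run_nodes):
        """Return (mode, nodes) in the form the underlying launcher expects."""
        raise NotImplementedError


class IncludeLauncher(JobLauncher):
    """Stands in for launchers that want the list of nodes to use."""

    def node_selection(self, alloc_nodes, run_nodes):
        return ("include", list(run_nodes))


class ExcludeLauncher(JobLauncher):
    """Stands in for launchers that want the list of nodes to avoid."""

    def node_selection(self, alloc_nodes, run_nodes):
        # Convert the common "run on" list into the "avoid" list internally.
        return ("exclude", nodes_to_exclude(alloc_nodes, run_nodes))
```

With this shape, the caller never needs to know which style a launcher uses; the conversion lives inside each implementation. The opposite convention (always pass the nodes to avoid) would work equally well with the conversion reversed.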

Additionally, let's create a JobLauncherRun class to be returned by launch_run(), which will later be passed to wait_run() or kill_run(). This class will represent a launched parallel run, and it will encapsulate the proc and jobid values currently returned by launch_run(). It could also provide stdout(), stderr(), and rc() functions so one can query the results of the parallel run.
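A minimal sketch of what such a class might look like follows. It assumes the run is backed by a `subprocess.Popen` handle with piped output. The `collect()` helper, the `LocalLauncher` stand-in, and the use of the process id as `jobid` are assumptions made for illustration, not the existing SCR code.

```python
import subprocess


class JobLauncherRun:
    """Represents one launched parallel run (hypothetical layout)."""

    def __init__(self, proc, jobid=None):
        self.proc = proc      # subprocess.Popen handle for the launcher process
        self.jobid = jobid    # launcher-specific job id, if the launcher has one
        self._collected = False
        self._stdout = None
        self._stderr = None

    def collect(self, timeout=None):
        """Wait for the process to exit and capture its output exactly once."""
        if not self._collected:
            self._stdout, self._stderr = self.proc.communicate(timeout=timeout)
            self._collected = True

    def rc(self):
        """Return the exit code, or None while the run is still active."""
        return self.proc.poll()

    def stdout(self):
        """Return captured standard output, or None before the run is collected."""
        return self._stdout

    def stderr(self):
        """Return captured standard error, or None before the run is collected."""
        return self._stderr


class LocalLauncher:
    """Stand-in launcher that runs argv directly instead of via srun, jsrun, etc."""

    def launch_run(self, argv):
        proc = subprocess.Popen(
            argv, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True
        )
        # Assumption: the pid serves as a placeholder job id in this sketch.
        return JobLauncherRun(proc, jobid=str(proc.pid))

    def wait_run(self, run, timeout=None):
        run.collect(timeout=timeout)
        return run.rc()

    def kill_run(self, run):
        run.proc.kill()
        run.collect()
        return run.rc()
```

A caller would hold on to the object returned by `launch_run()`, hand it back to `wait_run()` or `kill_run()`, and then query `run.rc()`, `run.stdout()`, and `run.stderr()` without needing to know how a particular launcher tracks its process or job id.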