LLNL / scr

SCR caches checkpoint data in storage on the compute nodes of a Linux cluster to provide a fast, scalable checkpoint / restart capability for MPI codes.
http://computing.llnl.gov/projects/scalable-checkpoint-restart-for-mpi

flux launcher: get jobspec using runproc #523

Closed ofaaland closed 1 year ago

ofaaland commented 1 year ago

The existing implementation needs a flux "Jobspec", which describes what is to be run and its resource needs, in order to submit the job to flux. It obtains the Jobspec through the flux Python interface JobspecV1.from_command(), which requires the number of nodes, tasks, etc. to be specified as arguments.

This in turn requires the flux launcher to parse the command line the user provided, including its flux arguments, to get the number of nodes, tasks, etc.
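To illustrate the parsing burden described above, here is a rough sketch of what such a launcher-side parser might look like. It is not SCR's actual code: the function name `parse_flux_run_args` and the sample command line are hypothetical, and only a few of the 'flux mini run' options (`-N/--nodes`, `-n/--ntasks`, `-c/--cores-per-task`) are handled.

```python
import argparse

def parse_flux_run_args(argv):
    """Hypothetical helper: pull out the resource counts that
    JobspecV1.from_command() needs from a user-supplied argument list."""
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("-N", "--nodes", type=int, default=None)
    parser.add_argument("-n", "--ntasks", type=int, default=1)
    parser.add_argument("-c", "--cores-per-task", type=int, default=1)
    # Everything argparse does not recognize is returned as leftovers,
    # which this sketch treats as the application command line.
    opts, rest = parser.parse_known_args(argv)
    return opts, rest

# Example with a made-up application command
opts, command = parse_flux_run_args(
    ["-N", "2", "-n", "4", "./my_app", "--input", "data.in"]
)
# opts.nodes, opts.ntasks, and opts.cores_per_task would then be passed
# along to JobspecV1.from_command() together with the command.
```

A parser like this has to be kept in sync with every option flux accepts, and it can misread application arguments that happen to look like flux options, which is the maintenance cost this change avoids.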

Instead of re-implementing 'flux mini run' argument parsing, run the command line via Popen.subcommand() with the additional flux option "--dry-run". Flux responds with the Jobspec we need, which eliminates the need for an argument parser.
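A minimal sketch of the dry-run approach, using only the Python standard library rather than SCR's own helper. The function names `build_dry_run_argv` and `get_jobspec` are hypothetical, and the sketch assumes that flux prints the Jobspec as JSON on stdout when "--dry-run" is given.

```python
import json
import subprocess

def build_dry_run_argv(user_args):
    """Prefix the user's arguments with 'flux mini run --dry-run'."""
    return ["flux", "mini", "run", "--dry-run"] + list(user_args)

def get_jobspec(user_args, run=subprocess.run):
    """Ask flux for the Jobspec instead of parsing the arguments ourselves.

    `run` is injectable so the function can be exercised without a
    flux installation; by default it is subprocess.run.
    """
    argv = build_dry_run_argv(user_args)
    # check=True raises CalledProcessError if flux rejects the arguments
    result = run(argv, capture_output=True, text=True, check=True)
    # Assumption: the Jobspec arrives as a JSON document on stdout
    return json.loads(result.stdout)
```

On a system with flux available, `get_jobspec(["-N", "2", "-n", "4", "./my_app"])` would return the Jobspec as a dictionary, with flux itself responsible for interpreting the options.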

Also add a TODO indicating that down_nodes may not be excluded, since excluding nodes does not seem to be supported.