LLNL / scr

SCR caches checkpoint data in storage on the compute nodes of a Linux cluster to provide a fast, scalable checkpoint / restart capability for MPI codes.
http://computing.llnl.gov/projects/scalable-checkpoint-restart-for-mpi
Other
99 stars 36 forks source link

WIP: launchers/flux.py: Use argparse not handwritten code #520

Closed ofaaland closed 1 year ago

ofaaland commented 2 years ago

The handwritten argument parsing code handled the form --option=value

but not equivalent forms such as -o value -ovalue --option value

Use argparse instead. Depends on python >= 3.7 to handle flux options before user script argument and user script options (if any).

Update list of arguments handled to include "exclusive" and "gpus-per-task" which are required by the flux interface.

ofaaland commented 2 years ago

@adammoody I'm not sure where we should have the discussion about the python version dependency - maybe at our next team meeting? But I wanted to call it out.

ofaaland commented 2 years ago

@adammoody I'm not sure where we should have the discussion about the python version dependency - maybe at our next team meeting? But I wanted to call it out.

We have python3.8 on toss systems, and it looks from https://forums.centos.org/viewtopic.php?t=74796 like python 3.8 is available in any rhel 8/ centos 8 system, which presumably means available nearly everywhere since rhel tracks so far behind on such things. But it would mean changing all the python scripts from

#! /usr/bin/env python3
...

to

#! /usr/bin/env python3.8
...

I could also investigate "getopt" and "optparse" python modules, which are both older than argparse and might have some other way to accomplish this without requiring a newer python.

ofaaland commented 2 years ago

... change from /usr/bin/env python3 to /usr/bin/env python3.8

On my current branch for testing I just made that change, to confirm everything works, but presumably that would be better handled via cmake at build time.

ofaaland commented 2 years ago

Added commit https://github.com/LLNL/scr/pull/520/commits/c8a08401b9c05fb5063db79e7ba5aa296514df52 to detect the python interpreter at build time using cmake and then use that. We can discuss other options.

ofaaland commented 2 years ago

Updated commit fixed incorrect indenting of return statement in launchers/flux.py:parsefluxargs().

ofaaland commented 1 year ago

@adammoody I think the root of the issue is that for flux, scr is trying to use the flux python interface to launch a job rather than just running "flux mini run ..." as specified by the user. This saves creating one process, but then requires we parse the command line, in the same way "flux mini" would parse it. This seems fragile, and therefore a mistake, to me.

I suspect the (python) launchers for slurm, etc. in scr do not try to do this, and that's why I ran across this problem when I started testing flux. I'll confirm that.

ofaaland commented 1 year ago

@adammoody OK, I confirmed that the scr flux launcher calls a flux-provided python interface to launch the job, and that interface requires we provide # of nodes, # of tasks per node, etc. explicitly. Since the scr launcher launchruncmd() takes a command line type string or a list of strings (ie argv) as its input, this means that by using the flux python interface to launch, we have to parse the user's args to extract nodes, tasks, etc.

The other launchers (ie slurm) just call subprocess.Popen() and pass the user's argv list in, requiring no parsing.

I think it would be better for the flux launcher's implementation of launchruncmd() to use subprocess.Popen() like the other launchers. We could provide an additional interface that allows the user to specify nodes, tasks-per-node, etc. that then calls the flux interface.

adammoody commented 1 year ago

Thanks, @ofaaland . Yes, I agree. I'm also in favor of avoiding parsing if we can do it. Just passing the user's string to the underlying command will be more reliable and future-proof.

ofaaland commented 1 year ago

Superseded by #523