LLNL / LaunchMON

LaunchMON is a software infrastructure that enables HPC run-time tools to co-locate tool daemons with a parallel job. Its API allows a tool to identify all the remote processes of a job and to scalably launch daemons into the relevant nodes.

Integrate Slurm with Travis CI of Launchmon #25

Open mcfadden8 opened 8 years ago

mcfadden8 commented 8 years ago

Just creating a ticket as a way to inform others that this would be a useful feature and that I'm interested in doing it at some point.

dongahn commented 8 years ago

@mcfadden8:

As discussed, this will be a great contribution and represents a huge undertaking on your part! I think it would be wise to break this down into multiple smaller tasks and take some baby steps along the way.

Some of the interesting early experiments would be:

  1. Add gcc+slurm in .travis.yml, and for this testing instance, see if one can install a SLURM package and run it as the system resource manager for the Travis VM instance.
  2. Add automake's testing rig for two very basic tests (test/test.launch_1.in and test/test.attach_1.in) in such a way that make check will run them and the new testing rig can harvest TAP output. Sharness has been working great for writing the tests in my other project.
  3. This also means one will have to modify these test codes and scripts in such a way that success/failure can be reliably harvested with no apparent race condition. In my experience, setting this up for a distributed/parallel environment is not trivial.
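To make "harvesting TAP output" concrete, here is a minimal Python sketch of what a harness does with the lines a test prints. It is only an illustration, not automake's or Sharness's actual implementation; the names `summarize_tap` and `TapSummary` are made up for this example. It treats a missing or mismatched plan line as a failure, on the assumption that a test which dies partway through should not count as a pass.

```python
import re
from typing import NamedTuple, Optional

# TAP test lines look like "ok 1 - launch" or "not ok 2 - attach # TODO flaky".
_TEST_LINE = re.compile(r"^(not ok|ok)\b(?:\s+(\d+))?(.*)$")
# The plan line announces how many tests are expected, e.g. "1..2".
_PLAN_LINE = re.compile(r"^1\.\.(\d+)")
_TODO = re.compile(r"#\s*todo\b", re.IGNORECASE)


class TapSummary(NamedTuple):
    planned: Optional[int]  # test count from the plan line, if one was seen
    seen: int               # number of "ok"/"not ok" lines actually found
    failed: int             # "not ok" lines without a TODO directive

    @property
    def success(self) -> bool:
        # A run passes only if a plan was printed, every planned test
        # reported a result, and none of them failed.
        return (
            self.planned is not None
            and self.planned == self.seen
            and self.failed == 0
        )


def summarize_tap(text: str) -> TapSummary:
    """Count results in TAP output (hypothetical helper for illustration)."""
    planned = None
    seen = failed = 0
    for raw in text.splitlines():
        line = raw.strip()
        plan = _PLAN_LINE.match(line)
        if plan:
            planned = int(plan.group(1))
            continue
        test = _TEST_LINE.match(line)
        if not test:
            continue  # diagnostics ("# ...") and unrelated output are ignored
        seen += 1
        if test.group(1) == "not ok" and not _TODO.search(test.group(3)):
            failed += 1
    return TapSummary(planned, seen, failed)
```

For example, `summarize_tap("1..2\nok 1 - test.launch_1\n")` reports one result against a plan of two, so `success` is false even though nothing printed `not ok`.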

In any case, once 1 and 2 are done, let's discuss how we can make our test cases and scripts more suitable for automatic regression testing environments.
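As a starting point for that discussion, one option is to wrap each test command so that its outcome is decided only by exit status and a hard timeout, then reported as a single TAP line. The Python sketch below is a rough illustration under that assumption; `run_as_tap` and its parameters are hypothetical names, and a real wrapper would also need to clean up any daemons or grandchild processes the test leaves behind, which this does not attempt.

```python
import subprocess
import sys


def run_as_tap(number, name, argv, timeout_s=60.0):
    """Run one test command and return its result as TAP text.

    Hypothetical helper: the verdict depends only on the exit status
    and the timeout, never on scraping the test's own output.
    """
    try:
        result = subprocess.run(
            argv,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            timeout=timeout_s,  # the direct child is killed if this expires
        )
    except subprocess.TimeoutExpired:
        return f"not ok {number} - {name}\n# timed out after {timeout_s:g}s"
    except OSError as exc:
        return f"not ok {number} - {name}\n# could not start: {exc}"
    if result.returncode == 0:
        return f"ok {number} - {name}"
    return f"not ok {number} - {name}\n# exit status {result.returncode}"


if __name__ == "__main__":
    # Placeholder command; a real run would invoke the generated test script.
    print("1..1")
    print(run_as_tap(1, "smoke", [sys.executable, "-c", "pass"]))
```

A hung test then shows up as an explicit `not ok` with a diagnostic line instead of stalling the whole CI job.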

Once your effort establishes feasibility, I can see why we will ultimately want to add testing instances for

I probably won't be able to set up a testing instance for CORAL Sierra, as desired by Issue #17, or for Blue Gene/Q. But we should try to cover as many RM environments as possible.