LLNL / ATS

ATS - Automated Testing System - is an open-source, Python-based tool for automating the running of tests of an application across a broad range of high performance computers.
BSD 3-Clause "New" or "Revised" License
7 stars 5 forks source link

flux: add compatibility for Flux in the ATS codebase #89

Closed wihobbs closed 2 years ago

wihobbs commented 2 years ago

Per our discussion today with @dawson6, @davidbloss, and @jameshcorbett, this PR is opened to test the compatibility of a work-in-progress Flux executor and monitor the progress of expanding ATS compatibility for Flux.

Currently, the Flux executor has option compatibility for:

To test this on Ruby (or another slurm-runnning LC cluster), install ATS in a virtual environment and request an allocation from slurm: salloc -N 3 -p pdebug -A cbronze -t 60

And then start a flux instance using srun: srun -N 3 --pty flux start

wihobbs commented 2 years ago

I'm setting the base branch as python3 to isolate the changes that I made. When I have the okay from @dawson6 I'll work out the kinks between my branch and main.

wihobbs commented 2 years ago

Thanks, @jameshcorbett, and sorry for the delay in pushing this -- still finding my way around in git. I think I addressed all of your comments (hopefully more to come ;))

jameshcorbett commented 2 years ago

If you wanted to, you could then read the jobspec that Flux constructed, and verify that the numbers check out.

Upon further thought I think that's a good idea but I would actually use Flux's jobspec class to read in the jobspec that was constructed. Then you don't need to hard-code any assumptions about jobspec layout. And I had forgotten that you won't be able to import the FluxScheduled class unless you have the flux package, so the check of whether Flux is installed is somewhat redundant.

wihobbs commented 2 years ago

Thanks @jameshcorbett. Using the jobspec class to test seems like an excellent idea and takes it one step further than I thought. I will work on this tomorrow and see if I can get it running in one of the Flux Docker containers. I will test both valid and invalid jobspecs.

And, thinking more deeply, I wonder if validating a jobspec should be part of calculateCommandList, so that invalid jobs are automatically skipped by ATS and never submitted to the queue. I will add this to our discussion for Thursday and we can see what @davidbloss and @dawson6 would prefer.

wihobbs commented 2 years ago

Update: the jobspec verification wouldn't provide any additional functionality. On reflection, it would stop jobs from being submitted to the queue but this wouldn't provide a justifiable speedup compared to the amount of implementation work/code.

wihobbs commented 2 years ago

Well groan. I'm sure I did something wrong to make the working tree this out of whack, but this should be aligned with the main branch.

Making a note to myself that this is what I did and in the future I should not do this again (it seems like rebase should do this better but I must be using it wrong)

(fluke_ats) [hobbs17@corona173:ATS]$ git config pull.rebase true
(fluke_ats) [hobbs17@corona173:ATS]$ git pull -X theirs upstream main

Tomorrow I'll go through machines.py and the commit history and make sure none of the commits from @dawson6 and @davidbloss change any of my implementation for Flux (but at first glance they shouldn't). And find a way to make my branch's commit history...less yucky.

wihobbs commented 2 years ago

All right, I believe I fixed the commit so that it will not have merge conflicts and can be merged into main once WIP is dropped. Making a note here that these are the pending actions for dropping WIP:

wihobbs commented 2 years ago

I believe at this point I have addressed all of the must-haves for ATS/Flux but will run more tests on Monday to be sure.

wihobbs commented 2 years ago

I removed work-in-progress because I believe we are good for final review here. Thanks @dawson6, @jameshcorbett, and @davidbloss -- I believe I have addressed all of your comments from here and our meetings.