LLNL / ATS

ATS - Automated Testing System - is an open-source, Python-based tool for automating the running of tests of an application across a broad range of high performance computers.
BSD 3-Clause "New" or "Revised" License

Return resource control to ats to avoid scheduling issues with the flux adapter. #107

Open jwhite242 opened 1 year ago

jwhite242 commented 1 year ago

I found in testing that splitting resource tracking between ats and flux was confusing the schedulers, resulting in much lower than expected throughput when running many short-lived jobs.

Additionally, the scheduled time limit on the flux mini run was disabled to return control of that to ats's scheduling: flux was refusing to schedule jobs that were too close to the end time, but ats could not detect that and simply left resources idle.
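
To illustrate the idea of keeping both the resource check and the time-limit check in a single scheduler, here is a minimal sketch. It is not code from ats or flux: the `Job` class, the `pick_runnable` function, and all parameter names are hypothetical, and it assumes the scheduler knows the free core count and the seconds remaining before the end time.

```python
from dataclasses import dataclass


@dataclass
class Job:
    """Hypothetical job record; not an ats or flux type."""
    name: str
    cores: int
    time_limit_s: float  # requested wall time in seconds


def pick_runnable(jobs, free_cores, remaining_s):
    """Return the jobs that can start now.

    Both checks live in one place, so a job that cannot finish
    before the end time is skipped rather than submitted and refused.
    """
    runnable = []
    for job in jobs:
        fits_cores = job.cores <= free_cores
        fits_time = job.time_limit_s <= remaining_s
        if fits_cores and fits_time:
            runnable.append(job)
            free_cores -= job.cores  # reserve cores for the job just chosen
    return runnable


jobs = [Job("short", 2, 60), Job("long", 2, 3600), Job("wide", 8, 30)]
launched = pick_runnable(jobs, free_cores=4, remaining_s=300)
print([j.name for j in launched])
```

With 4 free cores and 300 seconds left, only the `short` job is selected: `long` would run past the end time and `wide` needs more cores than are free. The remaining cores stay visible to the one scheduler that decides what to launch next.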

This MR addresses the issues mentioned in #106.

dawson6 commented 1 year ago

@jwhite242 can you go ahead and push your branch(es) to main, please? I like to do a pull so I can test the changes as part of the approval process.

dawson6 commented 1 year ago

Jeremy, can you test with the current version of ATS (7.0.114 or the main branch) on rzvernal? We have reworked how we track jobs under flux, and this may have helped with this issue.

jwhite242 commented 1 year ago

@dawson6 We've been using 7.0.114 for ~3 weeks now and things seem to be OK at the moment. We're not using any of the limiters (concurrency, time limit, etc.), though, in case you're looking for feedback on those.

I think we can close this PR for now.