flux-framework / flux-sched

Fluxion Graph-based Scheduler
GNU Lesser General Public License v3.0
90 stars 41 forks source link

readers: fix rv1exec cancel with multiple entries in `R_lite` array #1307

Closed grondo closed 1 month ago

grondo commented 1 month ago

This PR fixes resource cancel with rv1exec when there is more than one entry in the Rv1 R_lite array.

Currently for every resource free request (whether a "partial" cancel or not) where not all ranks have the same set of cores or gpus allocated, users will see these nuisance error message in their logs:

sched-fluxion-qmanager[0]: remove: Final .free RPC failed to remove all resources for jobid 228417863981989888: Success
sched-fluxion-qmanager[0]: jobmanager_free_cb: remove (queue=default id=228417863981989888): Protocol error

Since my home cluster has a differing number of cores between nodes, I get these errors for almost every job.

The issue is that while resource_reader_rv1exec_t::partial_cancel_internal() loops over the entries in R_lite, it doesn't accumulate the visited ranks idsets, instead overwriting the ranks string each time and only processing the final one.

Also, there clearly wasn't a test for this case, so I added a new sharness test that contains a reproducer, along with some sanity testing with housekeeping that could later be expanded.

garlick commented 1 month ago

This fixes it for me!

grondo commented 1 month ago

Setting MWP.

codecov[bot] commented 1 month ago

Codecov Report

Attention: Patch coverage is 71.42857% with 2 lines in your changes missing coverage. Please review.

Project coverage is 75.3%. Comparing base (4961e7b) to head (e6dad53). Report is 4 commits behind head on master.

Files with missing lines Patch % Lines
resource/readers/resource_reader_rv1exec.cpp 71.4% 2 Missing :warning:
Additional details and impacted files ```diff @@ Coverage Diff @@ ## master #1307 +/- ## ======================================== - Coverage 75.3% 75.3% -0.1% ======================================== Files 111 111 Lines 15297 15300 +3 ======================================== + Hits 11529 11531 +2 - Misses 3768 3769 +1 ``` | [Files with missing lines](https://app.codecov.io/gh/flux-framework/flux-sched/pull/1307?dropdown=coverage&src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=flux-framework) | Coverage Δ | | |---|---|---| | [resource/readers/resource\_reader\_rv1exec.cpp](https://app.codecov.io/gh/flux-framework/flux-sched/pull/1307?src=pr&el=tree&filepath=resource%2Freaders%2Fresource_reader_rv1exec.cpp&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=flux-framework#diff-cmVzb3VyY2UvcmVhZGVycy9yZXNvdXJjZV9yZWFkZXJfcnYxZXhlYy5jcHA=) | `70.9% <71.4%> (-0.1%)` | :arrow_down: |