flux-framework / flux-sched

Fluxion Graph-based Scheduler

Resource: support partial cancel of resources external to broker ranks #1292

Closed milroy closed 3 weeks ago

milroy commented 2 months ago

Issue #1284 identified a problem where rabbits are not released due to a traverser error during partial cancellation. The traverser should skip the rest of the mod_plan function when an allocation is found and mod_data.mod_type == job_modify_t::PARTIAL_CANCEL. ~This PR adds a goto statement to return 0 under this circumstance.~ This PR has been significantly expanded in scope to reflect a more comprehensive understanding of the problem.

This line is causing some of the errors reported in the related issues: https://github.com/flux-framework/flux-sched/blob/996f999c9bc398845f91ea58851e517de63ad677/resource/traversers/dfu_impl_update.cpp#L436 That error condition (rc != 0) occurs because a partial cancel successfully removes the allocations of the other resource vertices (in particular core, which is installed in all pruning filters by default), since those vertices have broker ranks. However, when the final .free RPC fails to remove an ssd vertex allocation, the subsequent full cleanup cancel exits with an error once it hits the vertices it has already cancelled.

This PR adds support for a cleanup cancel after a partial cancel: when removing planner_multi spans, the error check for non-existent planner spans is skipped, since those spans may already have been removed by the earlier partial cancel.

Two related problems needed to be solved: handling partial cancel for brokerless resources when only the default pruning filter is set (ALL:core), and when pruning filters are also set for the resources external to broker ranks (e.g., ALL:ssd). In preliminary testing, supporting both was challenging: with ALL:core configured, the final .free RPC frees all planner_multi-tracked resources, which prevents a full cleanup cancel. However, when additional resources are tracked (e.g., ALL:ssd), the allocations on those vertices are only removed by a cleanup cancel.

This PR adds support for rank-based partial cancel of resources that don't have a rank, using the rv1_nosched match format.
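For concreteness, here is a hedged sketch of how the two pruning-filter configurations discussed above might be set when loading the resource module. The option names follow the sched-fluxion-resource module options; the exact reload sequence and the JGF file name are illustrative assumptions, not part of this PR.

```sh
# Illustrative sketch only: loading fluxion with the two pruning-filter
# configurations discussed above. Option names are assumed to match the
# sched-fluxion-resource module options; the JGF file name is a placeholder.
flux module remove sched-fluxion-qmanager
flux module remove sched-fluxion-resource

# Default configuration: only cores are tracked in the pruning filters.
flux module load sched-fluxion-resource prune-filters=ALL:core \
    match-format=rv1_nosched

# Alternative discussed above: also track ssd vertices, which sit outside
# the broker ranks (e.g., rabbits).
# flux module load sched-fluxion-resource prune-filters=ALL:core,ALL:ssd \
#     match-format=rv1_nosched

flux module load sched-fluxion-qmanager
```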

Updates: after further investigation, issue #1305 is related as well. This PR also aims to address issue #1309.

milroy commented 2 months ago

@jameshcorbett can I set MWP or do we need a test case to check for this in Fluxion?

It would be good to know if the partial cancel behaves as expected when encountering this issue.

jameshcorbett commented 2 months ago

Yeah, it would be good to have a test case somehow. I put this PR into flux-sched v0.38.0 via a patch, so I don't think there's as much hurry to merge it. I'm also working on a test case in the flux-coral2 environment, but it's not working quite right yet, and the tests are slow, so it's taking a while.

I can provide a graph and jobspec that will hit the issue, but I don't know about the pattern of partial-cancel RPCs.

milroy commented 2 months ago

After more thought, I think we need to add a testsuite check for this issue.

jameshcorbett commented 2 months ago

Some of my flux-coral2 tests are suggesting to me that the rabbit resources aren't freed, even though the error message goes away. I submitted a bunch of identical rabbit jobs back-to-back and the first couple go through and then one gets stuck in SCHED, which is what I expect to happen when fluxion thinks all of the ssd resources are allocated.
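For reference, one way to observe the wedged state described above (standard flux-core CLI; the interpretation of the output is an assumption based on this description):

```sh
# List all jobs with their states; in the scenario above, the earlier rabbit
# jobs complete while a later identical one stays stuck in SCHED.
flux jobs -a -o "{id.f58} {state} {name}"
```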

milroy commented 2 months ago

> Some of my flux-coral2 tests are suggesting to me that the rabbit resources aren't freed, even though the error message goes away.

Let's make a reproducer similar to this one: https://github.com/flux-framework/flux-sched/blob/master/t/issues/t1129-match-overflow.sh

What are the scheduler and queue policies set to in the coral2 tests? Edit: I just noticed this PR: https://github.com/flux-framework/flux-coral2/pull/208, but it would be good to have a test in sched, too.
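To make that suggestion concrete, here is a rough sketch of what such a t/issues/ reproducer could look like, modeled loosely on the t1129 script linked above. Everything in it is an assumption for illustration: the JGF fixture (issue1284.json), the jobspec file name, and the exact module options may differ from the test that eventually lands.

```sh
#!/bin/bash
# Hypothetical reproducer sketch for the rabbit-release bug (issue #1284).
# Fixture and jobspec names are placeholders; runs inside a flux instance.
set -e

# Load fluxion so that ssd vertices (which have no broker rank) are tracked
# in the pruning filters alongside the default core filter.
flux module remove sched-fluxion-qmanager || true
flux module remove sched-fluxion-resource || true
flux module load sched-fluxion-resource load-file=issue1284.json \
    load-format=jgf prune-filters=ALL:core,ALL:ssd match-format=rv1_nosched
flux module load sched-fluxion-qmanager

# Submit identical rabbit jobs back to back; if the ssd allocations are not
# released on free, a later job wedges in SCHED instead of being allocated.
job1=$(flux job submit rabbit-job.yaml)
flux job wait-event -t 30 ${job1} clean
job2=$(flux job submit rabbit-job.yaml)
flux job wait-event -t 30 ${job2} alloc

# The traverser error from issue #1284 must not show up in the broker logs.
flux dmesg > dmesg.out
test "$(grep -c 'free RPC failed to remove all resources' dmesg.out)" -eq 0
```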

jameshcorbett commented 2 months ago

> What are the scheduler and queue policies set to in the coral2 tests?

Whatever the defaults are I think, there's no configuration done.

> Let's make a reproducer similar to this one: https://github.com/flux-framework/flux-sched/blob/master/t/issues/t1129-match-overflow.sh

Thanks for the pointer, I should be able to look into this tomorrow!

milroy commented 1 month ago

@jameshcorbett this PR is almost ready for review. First, I'd like to integrate the tests you wrote that reproduced the behavior and helped me fix it. Can you make a PR to my fork on the issue-1284 branch (https://github.com/milroy/flux-sched/tree/issue-1284)?

Alternatively I can cherry pick your commits and push them to my fork.

jameshcorbett commented 1 month ago

> @jameshcorbett this PR is almost ready for review. First, I'd like to integrate the tests you wrote that reproduced the behavior and helped me fix it. Can you make a PR to my fork on the issue-1284 branch (https://github.com/milroy/flux-sched/tree/issue-1284)?
>
> Alternatively I can cherry pick your commits and push them to my fork.

Working on that now.

milroy commented 1 month ago

Could you refactor the tests and put them under t/issues/ so the issue test driver script executes them?

milroy commented 1 month ago

I realized the brokerless resource cancellation currently implemented in this PR will lead to highly undesirable behavior. All the brokerless vertices will get cancelled in the first partial cancel traversal. That means they would become allocatable before the final .free, which can lead to state inconsistency between sched and core.

The best approach may be to propagate the final free to the traverser, which can then execute a cleanup, full cancel upon detecting brokerless resources. That has the added bonus of not triggering the if (full_removal || final) condition in qmanager, which will allow that condition to remain an error state.

A simpler alternative would be to always execute a full cancel when the final .free RPC is received. The disadvantage of that approach is additional overhead if there are no brokerless resources.

Thoughts @trws and @jameshcorbett?

trws commented 1 month ago

If propagating the final free state to the traverser is feasible, I think that's a great choice; if not, doing the full cancel unconditionally would be a good stopgap while we figure things out.

milroy commented 1 month ago

I was able to implement propagating the free state to the traverser. I ran performance comparisons between the two approaches (propagation vs unconditional full cleanup cancel) and observed that the unconditional full cleanup cancel is almost always faster than propagating the free state. The exception appears to be when processing more than 1 free RPC for a single jobid.

However, I also observed errors with the free state propagation implementation under certain pruning filter configurations. For example, with the ALL:core,ALL:ssd pruning filter configuration, the partial removal exits with an error on the rack1 vertex in the test JGF (issue1284.json). The error occurs when attempting to remove core types from the pruning filter span, because the broker rank allocation being released includes N cores. However, rack1 only contains ssd vertices, so the attempt to remove N cores from its core filter span fails because N > 0 (i.e., it attempts to remove more resources than the span holds).

I can't think of a way to fix that error without adding substantial complexity to the traverser and planner logic. Given these findings I think the best course is to do a full cancel unconditionally upon receipt of the final .free. @trws what do you think?

milroy commented 1 month ago

> The exception appears to be when processing more than 1 free RPC for a single jobid.

I expect this will be the most common case in production.

In terms of the performance difference, I timed the cancellation in qmanager for testsuite tests 1026 and 1027. For cancellations with a single .free RPC (test 1027), the full cleanup cancel takes 102.3 microseconds on average (over 20 runs), compared with 121.8 microseconds for the free-state propagation implementation (a 19% difference).

For test 1026, with a series of four .free RPCs, the average (across 20 runs) sum of the four cancels is 495.2 microseconds for the full cleanup cancel, compared with 424.3 microseconds for the free-state propagation implementation (a 14% difference).

trws commented 1 month ago

It sounds like the complexity favors doing the unconditional full cancel, and the performance doesn't really push away from that much either. Would you agree @milroy? If so, then let's do that. I have a feeling we should keep this in mind as we think through what we want to do when evolving the planner setup.

milroy commented 1 month ago

> Would you agree @milroy?

Yes, that's my conclusion, too.

> I have a feeling we should keep this in mind as we think through what we want to do when evolving the planner setup.

Agreed. There's extra complexity and brittleness related to dealing with the planners in the context of allocation and resource dynamism and it manifests in unexpected ways.

trws commented 1 month ago

Looks like there's a failure on el8, specifically failing to find an error output that's there?

```
[Oct24 23:36] sched-fluxion-qmanager[0]: remove: Final .free RPC failed to remove all resources for jobid 88030052352: Success
test_must_fail: command succeeded: grep free RPC failed to remove all resources log.out
not ok 8 - no fluxion errors logged
```

Maybe it's going to stdout instead of stderr and getting missed? Not sure, seems odd that it's only there and not on the other ones though.
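For context, the failing check is presumably a sharness assertion along these lines (reconstructed from the output above; the actual test body may differ):

```sh
test_expect_success 'no fluxion errors logged' '
	flux dmesg > log.out &&
	test_must_fail grep "free RPC failed to remove all resources" log.out
'
```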

Is this otherwise ready to go? Trying to plan my reviews etc. for the day.

milroy commented 4 weeks ago

> Maybe it's going to stdout instead of stderr and getting missed? Not sure, seems odd that it's only there and not on the other ones though.

Unfortunately this looks like an error in the job removal. I don't yet know why it's only happening on el8.

> Is this otherwise ready to go? Trying to plan my reviews etc. for the day.

Beyond the CI error you identified, I also need to figure out how to test an error condition related to running the full cleanup cancel after a partial cancel. So no, unfortunately this isn't ready to go yet.

milroy commented 4 weeks ago

> Unfortunately this looks like an error in the job removal. I don't yet know why it's only happening on el8.

The final flag on the .free RPC isn't getting set for t1027 on el8 for some reason.

milroy commented 4 weeks ago

@trws I thought of a much simpler and cleaner way to check whether a planner_multi sub span was removed by a prior partial cancel: https://github.com/flux-framework/flux-sched/pull/1292/commits/a42de67ef8dcfe0e8bbdc9f49c37ed2d28f66f6d

The PR is ready for a final review.

milroy commented 3 weeks ago

I should add that I don't think the PR needs additional testsuite tests for whether a planner_multi sub span was removed by a prior partial cancel. The change is very small and should have been included in the first partial cancel PR.

trws commented 3 weeks ago

Looks good to me @milroy, thanks for all the work on this!

milroy commented 3 weeks ago

Thanks for the review! Setting MWP.

codecov[bot] commented 3 weeks ago

Codecov Report

Attention: Patch coverage is 53.33333% with 7 lines in your changes missing coverage. Please review.

Project coverage is 75.2%. Comparing base (93589f5) to head (655dcc3). Report is 6 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| qmanager/policies/base/queue_policy_base.hpp | 45.4% | 6 Missing :warning: |
| resource/planner/c/planner_multi_c_interface.cpp | 66.6% | 1 Missing :warning: |
Additional details and impacted files

```diff
@@           Coverage Diff            @@
##           master   #1292     +/-   ##
========================================
- Coverage    75.3%   75.2%   -0.2%
========================================
  Files         111     111
  Lines       15979   15983      +4
========================================
- Hits        12048   12032     -16
- Misses       3931    3951     +20
```

| [Files with missing lines](https://app.codecov.io/gh/flux-framework/flux-sched/pull/1292?dropdown=coverage&src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=flux-framework) | Coverage Δ | |
|---|---|---|
| [resource/traversers/dfu\_impl\_update.cpp](https://app.codecov.io/gh/flux-framework/flux-sched/pull/1292?src=pr&el=tree&filepath=resource%2Ftraversers%2Fdfu_impl_update.cpp&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=flux-framework#diff-cmVzb3VyY2UvdHJhdmVyc2Vycy9kZnVfaW1wbF91cGRhdGUuY3Bw) | `75.2% <100.0%> (-1.2%)` | :arrow_down: |
| [resource/planner/c/planner\_multi\_c\_interface.cpp](https://app.codecov.io/gh/flux-framework/flux-sched/pull/1292?src=pr&el=tree&filepath=resource%2Fplanner%2Fc%2Fplanner_multi_c_interface.cpp&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=flux-framework#diff-cmVzb3VyY2UvcGxhbm5lci9jL3BsYW5uZXJfbXVsdGlfY19pbnRlcmZhY2UuY3Bw) | `62.6% <66.6%> (+0.1%)` | :arrow_up: |
| [qmanager/policies/base/queue\_policy\_base.hpp](https://app.codecov.io/gh/flux-framework/flux-sched/pull/1292?src=pr&el=tree&filepath=qmanager%2Fpolicies%2Fbase%2Fqueue_policy_base.hpp&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=flux-framework#diff-cW1hbmFnZXIvcG9saWNpZXMvYmFzZS9xdWV1ZV9wb2xpY3lfYmFzZS5ocHA=) | `78.2% <45.4%> (-0.9%)` | :arrow_down: |

... and [4 files with indirect coverage changes](https://app.codecov.io/gh/flux-framework/flux-sched/pull/1292/indirect-changes?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=flux-framework)