@jameshcorbett can I set MWP or do we need a test case to check for this in Fluxion?
It would be good to know if the partial cancel behaves as expected when encountering this issue.
Yeah it would be good to have a test case somehow. I put this PR in flux-sched v0.38.0 via a patch so I don't think there's as much hurry to merge it. I'm also working on a test case in flux-coral2 environment but it's not working quite right yet and the tests are slow so it's taking a while.
I can provide a graph and jobspec that will hit the issue, but I don't know about the pattern of partial-cancel RPCs.
After more thought, I think we need to add a testsuite check for this issue.
Some of my flux-coral2 tests are suggesting to me that the rabbit resources aren't freed, even though the error message goes away. I submitted a bunch of identical rabbit jobs back-to-back and the first couple go through and then one gets stuck in SCHED, which is what I expect to happen when fluxion thinks all of the `ssd` resources are allocated.
Some of my flux-coral2 tests are suggesting to me that the rabbit resources aren't freed, even though the error message goes away.
Let's make a reproducer similar to this one: https://github.com/flux-framework/flux-sched/blob/master/t/issues/t1129-match-overflow.sh
What are the scheduler and queue policies set to in the coral2 tests? Edit: I just noticed this PR: https://github.com/flux-framework/flux-coral2/pull/208, but it would be good to have a test in sched, too.
What are the scheduler and queue policies set to in the coral2 tests?
Whatever the defaults are I think, there's no configuration done.
Let's make a reproducer similar to this one: https://github.com/flux-framework/flux-sched/blob/master/t/issues/t1129-match-overflow.sh
Thanks for the pointer, I should be able to look into this tomorrow!
@jameshcorbett this PR is almost ready for review. First, I'd like to integrate the tests you wrote that reproduced the behavior and helped me fix it. Can you make a PR to my fork on the issue-1284 branch (https://github.com/milroy/flux-sched/tree/issue-1284)?
Alternatively I can cherry pick your commits and push them to my fork.
@jameshcorbett this PR is almost ready for review. First, I'd like to integrate the tests you wrote that reproduced the behavior and helped me fix it. Can you make a PR to my fork on the issue-1284 branch (https://github.com/milroy/flux-sched/tree/issue-1284)?
Alternatively I can cherry pick your commits and push them to my fork.
Working on that now.
Could you refactor the tests and put them under t/issues/ so the issue test driver script executes them?
I realized the brokerless resource cancellation currently implemented in this PR will lead to highly undesirable behavior. All the brokerless vertices will get cancelled in the first partial cancel traversal. That means they will be allocatable before the final `.free`, which can lead to state inconsistency between sched and core.
The best approach may be to propagate the final free to the traverser, which can then execute a cleanup, full cancel upon detecting brokerless resources. That has the added bonus of not triggering the `if (full_removal || final)` condition in qmanager, which will allow that condition to remain an error state.
A simpler alternative would be to always execute a full cancel when the final `.free` RPC is received. The disadvantage of that approach is additional overhead if there are no brokerless resources.
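For concreteness, a minimal sketch of the two options; everything here (names, signatures) is a hypothetical placeholder rather than the actual qmanager/traverser interface:

```cpp
// Illustrative sketch only: all names are hypothetical stand-ins, not the
// flux-sched qmanager or traverser API.
#include <cstdint>
#include <iostream>

struct traverser_stub {
    // Stand-in for the partial (rank-based) cancel entry point.
    int partial_cancel (uint64_t jobid, bool is_final)
    {
        std::cout << "partial cancel of " << jobid
                  << (is_final ? " (final free)\n" : "\n");
        return 0;
    }
    // Stand-in for the full cleanup cancel.
    int full_cancel (uint64_t jobid)
    {
        std::cout << "cleanup full cancel of " << jobid << "\n";
        return 0;
    }
};

// Option A: propagate the final-free state so the traverser can run its own
// cleanup cancel when it detects brokerless resources.
int free_option_a (traverser_stub &t, uint64_t jobid, bool is_final)
{
    return t.partial_cancel (jobid, is_final);
}

// Option B: unconditionally follow the final .free with a cleanup full cancel,
// at the cost of extra work when nothing brokerless remains.
int free_option_b (traverser_stub &t, uint64_t jobid, bool is_final)
{
    int rc = t.partial_cancel (jobid, false);
    if (rc == 0 && is_final)
        rc = t.full_cancel (jobid);
    return rc;
}

int main ()
{
    traverser_stub t;
    free_option_a (t, 42, true);
    free_option_b (t, 42, true);
    return 0;
}
```

In option B the cleanup full cancel runs even when the partial cancels already freed everything, which is the extra overhead mentioned above.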
Thoughts @trws and @jameshcorbett?
If propagating the final free state to the traverser is feasible, I think that's a great choice, if not doing the full cancel unconditionally would be a good stopgap while we figure things out.
I was able to implement propagating the free state to the traverser. I ran performance comparisons between the two approaches (propagation vs. unconditional full cleanup cancel) and observed that the unconditional full cleanup cancel is almost always faster than propagating the free state. The exception appears to be when processing more than 1 `free` RPC for a single jobid.
However, I also observed errors with the free state propagation implementation under certain pruning filter configurations. For example, with the `ALL:core,ALL:ssd` pruning filter configuration the partial removal exits with an error on the `rack1` vertex in the test JGF (`issue1284.json`). The error occurs when attempting to remove `core` types from the pruning filter span because the broker rank allocation being released includes N cores. However, `rack1` only contains `ssd` vertices, so the attempt to remove N cores from the pruning filter `core` span fails because N > 0 (i.e., it attempts to remove more resources than exist).
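A toy model of that failure mode, with hypothetical names rather than the real planner_multi interface:

```cpp
// Toy model of the failure mode only; not the planner_multi API.
#include <cstdint>
#include <iostream>
#include <map>
#include <string>

// Hypothetical pruning-filter span at one vertex: resource type -> tracked count.
using filter_span = std::map<std::string, int64_t>;

// Try to release `count` units of `type` from a vertex's span.  Returns -1
// (the error the partial removal hits) when the release asks for more than
// the vertex tracks, e.g. removing N cores at rack1, which tracks only ssd.
int remove_from_span (filter_span &span, const std::string &type, int64_t count)
{
    auto it = span.find (type);
    const int64_t tracked = (it == span.end ()) ? 0 : it->second;
    if (count > tracked)
        return -1;              // N > tracked: more than exists under this vertex
    if (it != span.end ())
        it->second -= count;
    return 0;
}

int main ()
{
    filter_span rack1 = {{"ssd", 16}};  // rack1-like vertex: tracks ssd only
    std::cout << remove_from_span (rack1, "core", 4) << "\n";  // -1: no cores tracked
    std::cout << remove_from_span (rack1, "ssd", 8) << "\n";   // 0: succeeds
    return 0;
}
```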
I can't think of a way to fix that error without adding substantial complexity to the traverser and planner logic. Given these findings I think the best course is to do a full cancel unconditionally upon receipt of the final `.free`. @trws what do you think?
The exception appears to be when processing more than 1 `free` RPC for a single jobid.
I expect this will be the most common case in production.
In terms of performance difference, I timed the cancellation in qmanager for testsuite tests 1026 and 1027. For cancellations with a single `.free` RPC (test 1027), the full cleanup cancel takes 102.3 microseconds on average (20 tests) versus 121.8 microseconds for the free-state-propagation implementation (a 19% difference). For test 1026, with a series of four `.free` RPCs, the average (across 20 tests) sum of the four cancels is 495.2 microseconds for the full cleanup cancel versus 424.3 microseconds for the free-state-propagation implementation (a 14% difference).
It sounds like the complexity favors doing the unconditional full cancel, and the performance doesn't really push away from that much either. Would you agree @milroy? If so, then let's do that. I have a feeling we should keep this in mind as we think through what we want to do when evolving the planner setup.
Would you agree @milroy?
Yes, that's my conclusion, too.
I have a feeling we should keep this in mind as we think through what we want to do when evolving the planner setup.
Agreed. There's extra complexity and brittleness in how the planners interact with allocations and resource dynamism, and it manifests in unexpected ways.
Looks like there's a failure on el8, specifically failing to find an error output that's there?
[Oct24 23:36] sched-fluxion-qmanager[0]: remove: Final .free RPC failed to remove all resources for jobid 88030052352: Success
test_must_fail: command succeeded: grep free RPC failed to remove all resources log.out
not ok 8 - no fluxion errors logged
Maybe it's going to stdout instead of stderr and getting missed? Not sure, seems odd that it's only there and not on the other ones though.
Is this otherwise ready to go? Trying to plan my reviews etc. for the day.
Maybe it's going to stdout instead of stderr and getting missed? Not sure, seems odd that it's only there and not on the other ones though.
Unfortunately this looks like an error in the job removal. I don't yet know why it's only happening on el8.
Is this otherwise ready to go? Trying to plan my reviews etc. for the day.
Beyond the CI error you identified, I also need to figure out how to test an error condition related to running the full cleanup cancel after a partial cancel. So no, unfortunately this isn't ready to go yet.
Unfortunately this looks like an error in the job removal. I don't yet know why it's only happening on el8.
The `final` flag on the `.free` RPC isn't getting set for `t1027` on el8 for some reason.
@trws I thought of a much simpler and cleaner way to check whether a planner_multi sub span was removed by a prior partial cancel: https://github.com/flux-framework/flux-sched/pull/1292/commits/a42de67ef8dcfe0e8bbdc9f49c37ed2d28f66f6d
The PR is ready for a final review.
I should add that I don't think the PR needs additional testsuite tests for whether a planner_multi sub span was removed by a prior partial cancel. The change is very small and should have been included in the first partial cancel PR.
Looks good to me @milroy, thanks for all the work on this!
Thanks for the review! Setting MWP.
Attention: Patch coverage is 53.33333% with 7 lines in your changes missing coverage. Please review.
Project coverage is 75.2%. Comparing base (93589f5) to head (655dcc3). Report is 6 commits behind head on master.
Files with missing lines | Patch % | Lines |
---|---|---|
qmanager/policies/base/queue_policy_base.hpp | 45.4% | 6 Missing :warning: |
resource/planner/c/planner_multi_c_interface.cpp | 66.6% | 1 Missing :warning: |
Issue #1284 identified a problem where rabbits are not released due to a traverser error during partial cancellation. The traverser should skip the rest of the `mod_plan` function when an allocation is found and `mod_data.mod_type == job_modify_t::PARTIAL_CANCEL`. ~~This PR adds a `goto` statement to return `0` under this circumstance.~~ This PR is significantly updated in scope to reflect a more comprehensive understanding of the problem.

This line is causing some of the errors reported in the related issues: https://github.com/flux-framework/flux-sched/blob/996f999c9bc398845f91ea58851e517de63ad677/resource/traversers/dfu_impl_update.cpp#L436

That error condition (`rc != 0`) occurs because a partial cancel successfully removes the allocations of the other resource vertices (especially `core`, which is installed in all pruning filters by default) because they have broker ranks. However, when the final `.free` RPC fails to remove an `ssd` vertex allocation, the full cleanup cancel exits with an error when it hits the vertices it has already cancelled.

This PR adds support for a cleanup `cancel` post partial cancel that skips the inapplicable error check for non-existent `planner` spans in the error criterion for removal of `planner_multi` spans.

Two related problems needed to be solved: handling partial cancel for brokerless resources when default pruning filters are set (`ALL:core`) and when pruning filters are set for the resources excluded from broker ranks (e.g., `ALL:ssd`). In preliminary testing, supporting both was challenging because with `ALL:core` configured, the final `.free` RPC frees all `planner_multi`-tracked resources, which prevents a cleanup, full `cancel`. However, tracking additional resources (e.g., `ALL:ssd`) successfully removes resource allocations on those vertices only with a cleanup cancel.

This PR adds support for rank-based partial cancel with resources that don't have a rank with the `rv1_nosched` match format.
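As a toy illustration of the skipped error check (hypothetical names, not the flux-sched planner interfaces):

```cpp
// Toy illustration only; not the flux-sched planner interfaces.
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical record of a job's remaining sub spans at one vertex.
using sub_spans = std::map<std::string, int64_t>;  // resource type -> span id

// During a cleanup cancel that follows a partial cancel, a missing sub span is
// expected (the partial cancel already removed it) and is skipped; during a
// plain full cancel it is still treated as an error.
int cleanup_remove (sub_spans &spans, const std::string &type, bool post_partial_cancel)
{
    auto it = spans.find (type);
    if (it == spans.end ())
        return post_partial_cancel ? 0 : -1;
    spans.erase (it);
    return 0;
}

int main ()
{
    sub_spans job = {{"ssd", 7}};  // the core span was already removed by partial cancel
    assert (cleanup_remove (job, "core", true) == 0);    // skipped, no error
    assert (cleanup_remove (job, "ssd", true) == 0);     // removed normally
    assert (cleanup_remove (job, "core", false) == -1);  // a plain full cancel would error
    return 0;
}
```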
Updates: after further investigation, issue #1305 is related as well. This PR also aims to address issue #1309.