flux-framework / flux-sched

Fluxion Graph-based Scheduler
GNU Lesser General Public License v3.0
84 stars 39 forks source link

qmanager: split remove to separate pending #1175

Closed trws closed 4 weeks ago

trws commented 1 month ago

problem: The "remove" method was used for both handling the free and cancel RPCs. Under some code paths, remove de-allocates resources, so using it for a pending cancellation, even if it seems to be protected by checks, is unnecessarily dangerous. Given the recent double-booking issue and a lack so far of other theories on how we get there, this is a possible culprit.

solution: Re-factor queue_policy_base_t remove into remove and remove_pending, the latter of which only operates on the pending states and has no code-paths that can result in a resource.cancel RPC being sent. The remove_pending is used for both the cancel RPC and the part of the free RPC where the code was used before, should be a pure refactor.

trws commented 1 month ago

Sure, I can add that in no problem. I have a test that should do that, but I haven't managed to make that test fail even with the old code. This ensures there's no code path where it could get from sched.cancel to sched-fluxion-resource.cancel, but I haven't found the race or whatever it is that makes that actually occur. Either way I think this is a test we want, adding shortly.

trws commented 1 month ago

Test pushed.

trws commented 1 month ago

Anyone want a last look here before MWP? I think the added test tickles the issue but another set of eyes would be great.

milroy commented 4 weeks ago

@trws I have a nit on the new commit message, but apart from that this is good to go.