flux-framework / flux-sched

Fluxion Graph-based Scheduler
GNU Lesser General Public License v3.0

qmanager: always process unblocked jobs and make bf reactive during sched loop #1211

Closed by trws 5 months ago

trws commented 5 months ago

fixes #1210

This is a bit of a big one, and may benefit from having each commit reviewed separately. If anyone prefers a stacked push I can do that.

The main changes here are:

I would really appreciate extra eyes on this, and testing if anyone has time. It should not only fix the blocking issue, but also our responsiveness issue (at least up to a point, async would be better, still working on that).

trws commented 5 months ago

Ok, figured this one out. It was a loop ordering issue, but I think this is now fixed, and should be a significant improvement. The one question I think we need to figure out is what a reasonable trip count is for this.

trws commented 5 months ago

@milroy this should be ready for another look. To answer your question about the performance comment here in #1210: I did this in two pieces. The first backfill commit makes it so that we process jobs until we have considered queue_depth jobs that are not blocked. The first version I put up, using the 1000-enqueued-jobs example I've been using for perf testing, took quite a while to become responsive because it wasn't breaking every 32 jobs; it considered all 1032 in one pass.
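A minimal sketch of that first change, using hypothetical types (the real qmanager job and queue classes differ): blocked jobs are skipped without consuming the queue_depth budget, so each pass always considers up to queue_depth schedulable jobs.

```cpp
#include <cassert>
#include <deque>
#include <vector>

// Hypothetical job record; the real qmanager types differ.
struct job_t {
    int id;
    bool blocked;   // blocked jobs are skipped, not scheduled
};

// Walk the pending queue until `queue_depth` *unblocked* jobs have been
// considered. Blocked jobs no longer consume the budget, which is the
// behavior change described above.
std::vector<int> consider_jobs (const std::deque<job_t> &pending,
                                unsigned queue_depth)
{
    std::vector<int> considered;
    unsigned unblocked = 0;
    for (const auto &job : pending) {
        if (unblocked == queue_depth)
            break;
        if (job.blocked)
            continue;           // skipped, but does not use up the budget
        considered.push_back (job.id);
        ++unblocked;
    }
    return considered;
}
```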

I considered that unacceptable, so the next commit refactors our base backfill policy implementation to periodically leave the scheduling loop, while still in scheduling state, and re-enter the reactor. This version still processes all 1032 jobs without needing to be poked, but instead of processing all 1032 before handling incoming messages, it now processes 5 at a time. It should be both more "correct", in that it will run work that can run when it's available, and substantially more responsive than before, at the potential cost of a slightly slower scheduling loop from repeatedly re-entering it. Thanks to your tests, I'm not really concerned about that reactor-loop delay anymore, so I set the trip count to a small but not singular value; we should have good reactivity while we work out how to make this thing actually async.

Also adjusted the unreservable test to ensure we can't miss this case going forward.
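The chunked-loop idea above can be illustrated with a toy reactor (assumed names; not the flux-sched API): each reactor callback schedules at most `chunk` jobs, then re-posts itself, so incoming messages interleave with scheduling while a flag keeps the manager "in scheduling state".

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <queue>

// Toy single-threaded reactor: runs queued callbacks in FIFO order.
struct toy_reactor {
    std::queue<std::function<void ()>> events;
    void post (std::function<void ()> f) { events.push (std::move (f)); }
    void run () {
        while (!events.empty ()) {
            auto f = std::move (events.front ());
            events.pop ();
            f ();
        }
    }
};

// Toy queue manager: processes `chunk` jobs per callback, then
// re-enters the reactor so other watchers (messages) get a turn.
struct toy_qmanager {
    std::deque<int> pending;
    int scheduled = 0;
    int messages_handled = 0;
    bool in_sched_loop = false;   // "still in scheduling state"
    unsigned chunk = 5;           // small but not singular trip count

    void sched_chunk (toy_reactor &r) {
        in_sched_loop = true;
        for (unsigned i = 0; i < chunk && !pending.empty (); ++i) {
            pending.pop_front ();
            ++scheduled;
        }
        if (!pending.empty ())
            // Leave the loop; events posted in the meantime run first.
            r.post ([this, &r] { sched_chunk (r); });
        else
            in_sched_loop = false;   // loop actually ended
    }
};
```

With 12 pending jobs and a message posted behind the first chunk, the message is handled after 5 jobs rather than after all 12, which is the responsiveness win described above.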

trws commented 5 months ago

The new test has intermittent failures (timeouts), seemingly on fedora38 and bookworm. I'm trying to track this down but haven't been able to reproduce it in my local dev container yet.

grondo commented 5 months ago

In case it is helpful: sometimes running the test under taskset -c 0 or with FLUX_TEST_VALGRIND=t can more easily reproduce failures that only occur in GitHub Actions. Otherwise, I've had success using the action-tmate action to run on the actual builders when failures can only be reproduced there.

trws commented 5 months ago

Thanks @grondo, the taskset did it. Now trying to figure out how this can happen. Just to be sure, flux job wait -av doesn't require any specific order of completion, does it? I'm trying to work out what could be causing this hang.

grondo commented 5 months ago

> flux job wait -av doesn't require any specific order of completion does it?

No, it should just wait on all waitable jobs AFAIK.

trws commented 5 months ago

Ok, that was valuable. I'm pretty sure we ended up with a race condition where the signal to reconsider blocked jobs invalidated the scheduling loop's in-progress iterator, causing a hang. This should take care of it. I really want a more general mechanism for deferring work during the sched loop, but we need this now if we can stabilize it.
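A sketch of the hazard and the fix, with toy types (the real containers and signal plumbing differ): if the "reconsider blocked jobs" signal moves entries into the pending container while the sched loop iterates it, the live iterator can be invalidated; deferring the requeue until the loop exits avoids that.

```cpp
#include <cassert>
#include <vector>

struct toy_queue {
    std::vector<int> pending;
    std::vector<int> blocked;
    std::vector<int> deferred;        // work collected during the loop
    bool sched_loop_active = false;

    // Signal handler: inserting into `pending` mid-loop would
    // invalidate the sched loop's live iterator (UB for std::vector),
    // so when the loop is active the unblocked jobs are deferred.
    void reconsider_blocked () {
        auto &dst = sched_loop_active ? deferred : pending;
        dst.insert (dst.end (), blocked.begin (), blocked.end ());
        blocked.clear ();
    }

    int run_sched_loop () {
        sched_loop_active = true;
        int considered = 0;
        for (auto it = pending.begin (); it != pending.end (); ++it) {
            ++considered;
            if (considered == 1)
                reconsider_blocked ();  // signal lands mid-loop: deferred
        }
        pending.clear ();
        sched_loop_active = false;
        // Deferred work is applied only after the loop has ended.
        pending.insert (pending.end (), deferred.begin (), deferred.end ());
        deferred.clear ();
        return considered;
    }
};
```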

trws commented 5 months ago

Ok, now I think this finally does it. @milroy, it turns out that crazy fcfs issue was because fcfs wasn't deterministically invoking post_sched_loop after the sched loop actually ended; it relied on it being invoked whenever the sched loop was started. Protecting post_sched_loop from running while a sched loop was in progress broke that, so this now fixes fcfs as well.
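The ordering fix can be sketched like this (assumed names; the real policy classes differ): post_sched_loop stays guarded against a loop in progress, and the policy invokes it deterministically when its loop actually ends rather than relying on the next loop start.

```cpp
#include <cassert>

struct toy_policy {
    bool sched_loop_active = false;
    int post_calls = 0;

    void post_sched_loop () {
        if (sched_loop_active)
            return;              // guard: never run mid-loop
        ++post_calls;
    }

    void run_fcfs_loop (int njobs) {
        sched_loop_active = true;
        for (int i = 0; i < njobs; ++i) {
            // schedule job i ...
        }
        sched_loop_active = false;
        post_sched_loop ();      // deterministic: runs when the loop ends
    }
};
```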

I really, really want to refactor this whole mess, there are far too many non-local flags going on, but this now passes all the tests I've managed to come up with for it, at least locally. We'll see what happens in CI. 🤞

trws commented 5 months ago

@grondo, note the checks_run change here by the way. If you're good with it I can port that over to core as well. Probably doesn't matter as much there because it's a sub-make rather than ctest, but it makes a huge difference for live-output on CI here.

grondo commented 5 months ago

> checks_run change here by the way. If you're good with it I can port that over to core as well.

Looks great to me!

codecov[bot] commented 2 months ago

Codecov Report

Attention: Patch coverage is 89.06250% with 7 lines in your changes missing coverage. Please review.

Project coverage is 74.0%. Comparing base (9bdd6d6) to head (3ca473c). Report is 173 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| qmanager/policies/queue_policy_bf_base_impl.hpp | 86.0% | 7 Missing :warning: |

Additional details and impacted files

```diff
@@           Coverage Diff            @@
##           master    #1211    +/-  ##
=======================================
  Coverage    74.0%    74.0%
=======================================
  Files         103      103
  Lines       14619    14631    +12
=======================================
+ Hits        10819    10829    +10
- Misses       3800     3802     +2
```

| Files with missing lines | Coverage Δ | |
|---|---|---|
| qmanager/modules/qmanager_callbacks.cpp | 74.7% <100.0%> (+0.3%) | :arrow_up: |
| qmanager/policies/base/queue_policy_base.hpp | 73.0% <100.0%> (+0.3%) | :arrow_up: |
| qmanager/policies/queue_policy_fcfs.hpp | 100.0% <ø> (ø) | |
| qmanager/policies/queue_policy_fcfs_impl.hpp | 73.8% <100.0%> (+0.6%) | :arrow_up: |
| qmanager/policies/queue_policy_bf_base_impl.hpp | 86.4% <86.0%> (-2.8%) | :arrow_down: |