trws closed this 5 months ago
Ok, figured this one out. It was a loop ordering issue, but I think this is now fixed, and should be a significant improvement. The one question I think we need to figure out is what a reasonable trip count is for this.
@milroy this should be ready for another look. To answer your question about the performance comment in #1210, I did this in two pieces. The first backfill commit here makes it so that we process jobs until we have considered `queue_depth` jobs that are not blocked. The first version I put up, using the 1000-jobs-enqueued example I've been using for perf testing, took quite a while to become responsive, because it wasn't breaking out every 32 jobs; it considered all 1032 in one pass.
I considered that unacceptable, so the next commit here refactors our base backfill policy implementation to periodically leave the scheduling loop, while still in scheduling state, to re-enter the reactor. This version still processes all 1032 jobs without needing to be poked, but instead of processing all 1032 before handling incoming messages, it now processes 5 at a time. This should be both more "correct", in that it will run work that can run when it's available, and substantially more responsive than before, at the potential cost of running the scheduling loop slightly slower because of repeatedly leaving and re-entering it. Thanks to your tests, I'm not really concerned about that reactor loop delay anymore, so I set the break count to a small but not single-job value; we should have good reactivity while we work out how to make this thing actually async.
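To make the two-piece change concrete, here is a minimal sketch of the shape that loop takes. This is illustrative only, not the actual qmanager code; `Job`, `SchedState`, `sched_loop_step`, and `reactor_break` are hypothetical names standing in for the real policy members.

```cpp
// Hypothetical sketch of a backfill loop that (a) counts only unblocked jobs
// against queue_depth and (b) yields back to the reactor every few jobs.
// Names are illustrative, not the real flux-sched qmanager API.
#include <cstdint>
#include <deque>

struct Job {
    uint64_t id;
    bool blocked = false;       // true if the job can't be considered right now
};

struct SchedState {
    std::deque<Job> pending;    // jobs awaiting consideration
    unsigned queue_depth = 32;  // unblocked jobs to consider per full sched loop
    unsigned reactor_break = 5; // jobs to look at before re-entering the reactor
    unsigned considered = 0;    // unblocked jobs considered so far in this loop
};

// One slice of the scheduling loop.  Returns true when the loop is finished
// (queue_depth unblocked jobs considered, or queue drained); false means
// "yield to the reactor and call me again while still in scheduling state".
bool sched_loop_step (SchedState &s)
{
    unsigned looked_at = 0;
    while (!s.pending.empty () && s.considered < s.queue_depth) {
        if (looked_at == s.reactor_break)
            return false;               // leave the loop; reactor handles messages
        Job job = s.pending.front ();
        s.pending.pop_front ();
        ++looked_at;
        if (!job.blocked)
            ++s.considered;             // blocked jobs don't count toward queue_depth
        // ... try to allocate or reserve resources for `job` here ...
    }
    s.considered = 0;
    return true;                        // loop complete
}
```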
Also adjusted the unreservable test to ensure we can't miss this case going forward.
The new test has intermittent failures (timeouts), seemingly on fedora38 and bookworm. I'm trying to track this down, but I haven't been able to repro in my local dev container yet.
In case it is helpful: sometimes running the test under `taskset -c 0` or with `FLUX_TEST_VALGRIND=t` can more easily reproduce failures that only occur in GitHub Actions. Otherwise, I've had success using the action-tmate action to run on the actual builders when failures only reproduce there.
Thanks @grondo, the `taskset` did it. Now trying to figure out how this can happen. Just to be sure, `flux job wait -av` doesn't require any specific order of completion, does it? Trying to work out what could be causing this hang.
> `flux job wait -av` doesn't require any specific order of completion, does it?
No, it should just wait on all waitable jobs AFAIK.
Ok, that was valuable. I'm pretty sure we ended up with a race condition: the signal to reconsider blocked jobs invalidated the in-progress iterator of the scheduling loop, causing a hang. This should take care of it. I really want a more general mechanism for deferring work during the sched loop, but we need this now if we can stabilize it.
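For context on the failure mode, the shape of the bug is roughly this (hypothetical names, not the real qmanager types): an event handler reshuffles the blocked-job container while the sched loop still holds an iterator into it, invalidating the iterator; the fix is to record the request and act on it only after the loop exits.

```cpp
// Hypothetical illustration of the iterator-invalidation race and the deferral
// fix; LoopState and these handlers are made up for this example.
#include <cstdint>
#include <map>

struct LoopState {
    std::map<uint64_t, int> blocked;            // blocked jobs keyed by jobid
    std::map<uint64_t, int>::iterator cursor;   // held across reactor re-entries
    bool in_sched_loop = false;
    bool reconsider_requested = false;          // deferred work flag
};

// Handler for the "reconsider blocked jobs" signal.
void on_reconsider_blocked (LoopState &st)
{
    if (st.in_sched_loop) {
        // Erasing or moving entries out of `blocked` here would invalidate
        // `cursor` mid-loop (the hang we hit), so defer instead.
        st.reconsider_requested = true;
        return;
    }
    // Safe to move blocked jobs back to pending immediately.
    // ...
}

// Called once the sched loop has genuinely finished.
void on_sched_loop_complete (LoopState &st)
{
    st.in_sched_loop = false;
    if (st.reconsider_requested) {
        st.reconsider_requested = false;
        // Now it is safe to reshuffle the blocked queue and restart the loop.
    }
}
```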
Ok, now I think this finally does it. @milroy it turns out that crazy fcfs issue happened because fcfs wasn't deterministically invoking `post_sched_loop` after the sched loop actually ended, but was relying on it being invoked whenever the sched loop was started. Protecting `post_sched_loop` from processing a sched loop in progress broke that, so this now fixes fcfs as well.
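In other words, `post_sched_loop` should run exactly once, when a loop genuinely completes, and never for a loop that merely yielded back to the reactor. A rough sketch of that control flow, with made-up names rather than the real policy classes:

```cpp
// Hypothetical sketch: post_sched_loop is tied to loop completion rather than
// loop start.  PolicyBase and these members are illustrative only.
struct PolicyBase {
    bool in_progress = false;

    // Returns true once the loop has considered its quota of jobs; returns
    // false when it yields back to the reactor to be re-invoked later.
    virtual bool do_sched_iteration () = 0;
    // Post-loop bookkeeping that must only ever see a finished loop.
    virtual void post_sched_loop () = 0;
    virtual ~PolicyBase () = default;

    void run_sched_loop ()
    {
        in_progress = true;
        if (do_sched_iteration ()) {
            in_progress = false;
            post_sched_loop ();   // deterministic: runs when the loop truly ends
        }
        // If the iteration yielded, we simply return to the reactor; calling
        // post_sched_loop here, or relying on the next loop start, would
        // process a loop still in progress.
    }
};
```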
I really, really want to refactor this whole mess; there are far too many non-local flags going on. But this now passes all the tests I've managed to come up with for it, at least locally. We'll see what happens in CI. 🤞
@grondo, note the `checks_run` change here by the way. If you're good with it I can port that over to core as well. It probably doesn't matter as much there because it's a sub-make rather than ctest, but it makes a huge difference for live output on CI here.
> `checks_run` change here by the way. If you're good with it I can port that over to core as well.
Looks great to me!
Attention: Patch coverage is 89.06250% with 7 lines in your changes missing coverage. Please review.
Project coverage is 74.0%. Comparing base (9bdd6d6) to head (3ca473c). Report is 173 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| qmanager/policies/queue_policy_bf_base_impl.hpp | 86.0% | 7 Missing :warning: |
fixes #1210
This is a bit of a big one, and may benefit from having each commit reviewed separately. If anyone prefers a stacked push I can do that.
The main changes here are:
I would really appreciate extra eyes on this, and testing if anyone has time. It should fix not only the blocking issue but also our responsiveness issue (at least up to a point; async would be better, and I'm still working on that).