As noted in flux-framework/flux-core#5990, some follow-up is needed on why jobs weren't scheduled when the "alloc leak" occurred on El Cap.
[Edit from @trws]
Currently, the loop over jobs to schedule runs until it has considered `queue_depth` jobs or made `max_reservations` reservations, whichever comes first, and jobs in the blocked queue count toward `queue_depth`. In a system where jobs are flowing in and out, this is not a problem: the scheduling loop gets kicked on each of those events and doesn't always move blocked jobs into the pending queue, so it eventually processes everything. In a system with many blocked jobs and relatively few new jobs, though, the scheduler may not re-enter enough times to work through all the unreservable jobs and reach the ones that can actually be processed. As an example:
Full sharness test below; TL;DR: drain some nodes, enqueue 1000 jobs requiring one of those nodes, enqueue two jobs that can run on anything, wait for the two jobs, hang forever 😞
Anyway, the core problem is that a bunch of jobs that can't possibly run are eating up `queue_depth`, which is usually 32. I have a tweak that lets the scheduling loop continue until it has processed 32 meaningful jobs, which solves this but makes qmanager unresponsive for even longer during a scheduling loop, so I'm loath to push that up by itself.
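To make the starvation concrete, here is a minimal, self-contained sketch of the behavior described above. It is not the fluxion code; `job_t`, `schedule_pass`, and the `satisfiable` flag are hypothetical stand-ins, with the first 1000 pending jobs pinned to a drained node and two runnable jobs queued behind them:

```cpp
// Illustrative sketch only (not fluxion source): a queue_depth-limited
// scheduling pass in which jobs that cannot run still consume the budget.
#include <cstdio>
#include <deque>

struct job_t {
    int id;
    bool satisfiable;   // could the resource graph match this job right now?
};

// One pass over the pending queue: every job examined counts toward
// queue_depth, whether or not it produced an allocation.
static int schedule_pass (std::deque<job_t> &pending, int queue_depth)
{
    int scheduled = 0;
    int examined = 0;
    auto it = pending.begin ();
    while (it != pending.end () && examined < queue_depth) {
        ++examined;
        if (it->satisfiable) {
            std::printf ("scheduled job %d\n", it->id);
            it = pending.erase (it);
            ++scheduled;
        } else {
            ++it;   // stays queued, but it still ate a slot of queue_depth
        }
    }
    return scheduled;
}

int main ()
{
    std::deque<job_t> pending;
    for (int i = 0; i < 1000; i++)
        pending.push_back ({i, false});   // pinned to a drained node
    pending.push_back ({1000, true});     // two jobs that could run anywhere
    pending.push_back ({1001, true});

    // With queue_depth = 32, this pass never reaches the two runnable jobs.
    int n = schedule_pass (pending, 32);
    std::printf ("scheduled %d job(s); %zu still pending\n", n, pending.size ());
    return 0;
}
```

With `queue_depth = 32`, every pass spends its whole budget on the pinned jobs, so the runnable jobs are only reached if other events keep re-kicking the loop; counting only meaningful jobs toward the budget (the tweak above) avoids the starvation but lengthens each pass.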
Ok, I'm going to edit this a bit to reflect the issue that was found as part of the discussion. It's somewhat distinct but still related to the original.
Full sharness test:

```sh
#!/usr/bin/env bash
#
test_description=' '

. `dirname $0`/sharness.sh

export TEST_UNDER_FLUX_QUORUM=1
export TEST_UNDER_FLUX_START_MODE=leader

rpc() {
    flux python -c \
        "import flux, json; print(flux.Flux().rpc(\"$1\").get_str())"
}

test_under_flux 16384 system

test_expect_success 'unload sched-simple' '
    flux module remove -f sched-simple
'

test_expect_success 'update configuration' '
    flux config load <<-'EOF'
[[resource.config]]
hosts = "fake[0-16383]"
cores = "0-63"
gpus = "0-3"

[[resource.config]]
hosts = "fake[0-9999]"
properties = ["compute"]

[[resource.config]]
hosts = "fake[10000-16000]"
properties = ["test"]

[[resource.config]]
hosts = "fake[16001-16383]"
properties = ["debug"]

[sched-fluxion-qmanager]
queue-policy = "easy"

[sched-fluxion-resource]
match-policy = "firstnodex"
prune-filters = "ALL:core,ALL:gpu,cluster:node,rack:node"
match-format = "rv1_nosched"
EOF
'

test_expect_success 'reload resource with monitor-force-up' '
    flux module reload -f resource noverify monitor-force-up
'

test_expect_success 'drain a few nodes' '
    flux resource drain 1-1000 test with drained nodes
'

test_expect_success 'load fluxion modules' '
    flux module load sched-fluxion-resource &&
    flux module load sched-fluxion-qmanager
'

test_expect_success 'wait for fluxion to be ready' '
    time flux python -c \
        "import flux, json; print(flux.Flux().rpc(\"sched.resource-status\").get_str())"
'

test_expect_success 'create a set of 10 inactive jobs' '
    flux submit --cc=1-1000 --quiet \
        -N 1 --exclusive \
        --requires="host:fake[1005]" \
        --progress --jps \
        --setattr=exec.test.run_duration=0.01s \
        hostname
'

test_expect_success 'create a set of 2 running jobs' '
    time flux submit --progress --jps --quiet --cc=1-2 --wait-event=start -N1 \
        --flags=waitable \
        --requires=compute \
        --setattr=exec.test.run_duration=5m hostname
'

test_expect_success 'get match stats' '
    time flux job wait -av
'

test_expect_success 'get match stats' '
    flux jobs -Ano "{id} {duration}" && \
    flux resource undrain 1-1000 && \
    flux jobs -Ano "{id} {duration}" && \
    sleep 10 && \
    flux dmesg && \
    flux jobs -a && \
    rpc sched-fluxion-resource.stats-get | jq
'

test_expect_success 'unload fluxion' '
    flux module remove sched-fluxion-qmanager &&
    flux module remove sched-fluxion-resource &&
    flux module load sched-simple
'

test_done
```