flux-framework / flux-sched

Fluxion Graph-based Scheduler
GNU Lesser General Public License v3.0

submitting a large number of jobs and canceling them caused scheduler "hang" #1222

Closed by grondo 4 months ago

grondo commented 5 months ago

On elcap a user submitted a large number of jobs, each requesting a specific node, then thought better of it and canceled them all. For the past few hours the scheduler has not scheduled any jobs and the "alloc pending" count is stuck at many thousands.

Attempting to stop the queues has not improved the situation.

On the recommendation of @trws I've grabbed a core of the rank 0 broker (available in my homedir).

Here are the resource module stats at the time of this failure:

{
 "match": {
  "succeeded": {
   "njobs": 11940,
   "njobs-reset": 11940,
   "max-match-jobid": 429461867754015744,
   "max-match-iters": 52,
   "stats": {
    "min": 0.001353974,
    "max": 32.755777653999999,
    "avg": 0.29216099639715115,
    "variance": 3.2035476969734025
   }
  },
  "failed": {
   "njobs": 83916,
   "njobs-reset": 83916,
   "max-match-jobid": 429655446928752640,
   "max-match-iters": 328,
   "stats": {
    "min": 0.0073356740000000004,
    "max": 68.145720294,
    "avg": 0.29968277191067089,
    "variance": 0.886960191744261
   }
  }
 }
}

The use case here is that a user is looking for bad/slow nodes and needs to run a test on each node in a queue. For now I've suggested they allocate all the nodes from the queue and then use bulksubmit in that subinstance, which should work better.
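
As a rough sketch, that workaround could look like this (node count, queue name, hostnames, and test script below are placeholders, not taken from the actual system):

```sh
# Allocate the queue's nodes as a sub-instance, then fan out one
# single-node job per node inside it.
flux alloc -N1500 --queue=pbatch

# ...then, inside the sub-instance:
flux bulksubmit -N1 --exclusive --requires="host:{}" ./check_node.sh \
    ::: node0001 node0002 node0003
```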

milroy commented 5 months ago

The following script reproduces an approximately linear scheduling slowdown as a function of the number of jobs submitted and then cancelled with a constraint specifying a single node. For example, after 500 constrained jobs are submitted and then cancelled, 10 unconstrained 4-node jobs running hostname take 113s to complete; after 1000 submitted-and-cancelled jobs, the same 4-node jobs take 248s. I ran tests with up to 5000 submitted and cancelled jobs.

It doesn't reproduce the stuck "alloc pending" count.

Sharness test script:

```bash
#!/usr/bin/env bash
#
test_description=' '

. `dirname $0`/sharness.sh

export TEST_UNDER_FLUX_QUORUM=1
export TEST_UNDER_FLUX_START_MODE=leader
export TEST_UNDER_FLUX_EXIT_TIMEOUT=0

test_under_flux 16384 system

test_expect_success 'unload sched-simple' '
    flux module remove -f sched-simple
'
test_expect_success 'update configuration' '
    flux config load <<-'EOF'
[[resource.config]]
hosts = "fake[0-16383]"
cores = "0-63"
gpus = "0-3"

[[resource.config]]
hosts = "fake[0-9999]"
properties = ["compute"]

[[resource.config]]
hosts = "fake[10000-16000]"
properties = ["test"]

[[resource.config]]
hosts = "fake[16001-16383]"
properties = ["debug"]

[sched-fluxion-qmanager]
queue-policy = "easy"

[sched-fluxion-resource]
match-policy = "firstnodex"
prune-filters = "ALL:core,ALL:gpu,cluster:node,rack:node"
match-format = "rv1_nosched"
EOF
'
test_expect_success 'reload resource with monitor-force-up' '
    flux module reload -f resource noverify monitor-force-up
'
test_expect_success 'load fluxion modules' '
    flux module load sched-fluxion-resource &&
    flux module load sched-fluxion-qmanager
'
test_expect_success 'wait for fluxion to be ready' '
    time -p flux module stats sched-fluxion-resource
'
test_expect_success 'create a set of 500 jobs' '
    flux submit --cc=1-500 --quiet \
        -N 1 --exclusive \
        --requires="host:fake[1005]" \
        --setattr=exec.test.run_duration=0.01s \
        hostname
'
test_expect_success 'get the queue status' '
    flux queue status -vvv
'
test_expect_success 'cancel the jobs' '
    time -p flux cancel --all
'
test_expect_success 'create a set of 10 jobs' '
    flux submit --cc=1-10 --quiet \
        -N 4 --exclusive \
        --flags=waitable \
        --setattr=exec.test.run_duration=0.01s \
        hostname
'
test_expect_success 'get logs and match stats' '
    flux dmesg && \
    flux jobs -a && \
    time -p flux module stats sched-fluxion-resource
'
test_expect_success 'get the queue status' '
    flux queue status -vvv
'
test_expect_success 'ensure all succeeded' '
    time -p flux job wait -av
'
test_expect_success 'get logs and match stats' '
    flux jobs -a
'
test_expect_success 'get the queue status' '
    flux queue status -vvv
'
test_expect_success 'unload fluxion' '
    time -p flux module remove sched-fluxion-qmanager &&
    time -p flux module remove sched-fluxion-resource
'
test_done
```

I'm going to experiment with adding running and blocked jobs to try to reproduce the hang.

milroy commented 5 months ago

I believe I've reproduced the scheduler hang (no progress after more than 85 minutes, with CPU usage roughly constant at 120%) with the following test:

Sharness test script to reproduce scheduler hang:

```bash
#!/usr/bin/env bash
#
test_description=' '

. `dirname $0`/sharness.sh

export TEST_UNDER_FLUX_QUORUM=1
export TEST_UNDER_FLUX_START_MODE=leader
export TEST_UNDER_FLUX_EXIT_TIMEOUT=0

test_under_flux 16384 system

test_expect_success 'unload sched-simple' '
    flux module remove -f sched-simple
'
test_expect_success 'update configuration' '
    flux config load <<-'EOF'
[[resource.config]]
hosts = "fake[0-16383]"
cores = "0-63"
gpus = "0-3"

[[resource.config]]
hosts = "fake[0-9999]"
properties = ["compute"]

[[resource.config]]
hosts = "fake[10000-16000]"
properties = ["test"]

[[resource.config]]
hosts = "fake[16001-16383]"
properties = ["debug"]

[sched-fluxion-qmanager]
queue-policy = "easy"

[queues.compute]
requires = [ "compute" ]

[queues.test]
requires = [ "test" ]

[queues.debug]
requires = [ "debug" ]

[sched-fluxion-resource]
match-policy = "firstnodex"
prune-filters = "ALL:core,ALL:gpu,cluster:node,rack:node"
match-format = "rv1_nosched"
EOF
'
test_expect_success 'reload resource with monitor-force-up' '
    flux module reload -f resource noverify monitor-force-up
'
test_expect_success 'load fluxion modules' '
    flux module load sched-fluxion-resource &&
    flux module load sched-fluxion-qmanager &&
    flux queue start --all
'
test_expect_success 'wait for fluxion to be ready' '
    time -p flux module stats sched-fluxion-resource
'
test_expect_success 'drain a few nodes' '
    flux resource drain 10000-16000
'
test_expect_success 'create a set of 100 jobs' '
    flux submit --cc=1-100 --quiet \
        -N 4 --exclusive \
        --queue=test \
        --setattr=exec.test.run_duration=0.01s \
        hostname
'
test_expect_success 'create a set of 10000 jobs' '
    flux submit --cc=1-10000 --quiet \
        -N 1 --exclusive \
        --requires="host:fake[1005]" \
        --queue=compute \
        --setattr=exec.test.run_duration=0.01s \
        hostname
'
test_expect_success 'get the queue status' '
    flux queue status -vvv
'
test_expect_success 'cancel the jobs' '
    jobids=$(flux job last 10000) &&
    time -p flux cancel ${jobids}
'
test_expect_success 'create a set of 100 jobs' '
    flux submit --cc=1-100 --quiet \
        -N 4 --exclusive \
        --queue=compute \
        --flags=waitable \
        --setattr=exec.test.run_duration=0.01s \
        hostname
'
test_expect_success 'get logs and match stats' '
    flux dmesg && \
    flux jobs -a -c 11000 && \
    time -p flux module stats sched-fluxion-resource
'
test_expect_success 'get the queue status' '
    flux queue status -vvv
'
test_expect_success 'ensure all succeeded' '
    time -p flux job wait -av
'
test_expect_success 'get logs and match stats' '
    flux jobs -a -c 11000
'
test_expect_success 'get the queue status' '
    flux queue status -vvv
'
test_expect_success 'unload fluxion' '
    flux module stats sched-fluxion-resource &&
    time -p flux module remove sched-fluxion-qmanager &&
    time -p flux module remove sched-fluxion-resource
'
test_done
```

The hang doesn't seem to occur until somewhere beyond 4000 immediately cancelled jobs (note the 10,000 jobs above). The test throws two errors. The first is for:

jobids=$(flux job last 10000) &&
time -p flux cancel ${jobids}

I did verify that the 10,000 jobs are actually cancelled, so the error could be related to the number of jobids passed to flux cancel. The other error is:

Jun 13 01:10:57.267291 UTC sched-fluxion-resource.err[0]: update_resource: exiting due to resource.acquire failure: Operation canceled
flux-broker: broker module 'sched-fluxion-qmanager' was not properly shut down
flux-broker: broker module 'sched-fluxion-resource' was not properly shut down
FATAL: exception not rethrown
milroy commented 5 months ago

Update: I can reproduce a hang without the additional jobs by simply creating and cancelling 10,000 jobs:

Sharness test script to reproduce scheduler hang:

```sh
#!/usr/bin/env bash
#
test_description=' '

. `dirname $0`/sharness.sh

export TEST_UNDER_FLUX_QUORUM=1
export TEST_UNDER_FLUX_START_MODE=leader
export TEST_UNDER_FLUX_EXIT_TIMEOUT=0

test_under_flux 16384 system

test_expect_success 'unload sched-simple' '
    flux module remove -f sched-simple
'
test_expect_success 'update configuration' '
    flux config load <<-'EOF'
[[resource.config]]
hosts = "fake[0-16383]"
cores = "0-63"
gpus = "0-3"

[[resource.config]]
hosts = "fake[0-9999]"
properties = ["compute"]

[[resource.config]]
hosts = "fake[10000-16000]"
properties = ["test"]

[[resource.config]]
hosts = "fake[16001-16383]"
properties = ["debug"]

[sched-fluxion-qmanager]
queue-policy = "easy"

[queues.compute]
requires = [ "compute" ]

[queues.test]
requires = [ "test" ]

[queues.debug]
requires = [ "debug" ]

[sched-fluxion-resource]
match-policy = "firstnodex"
prune-filters = "ALL:core,ALL:gpu,cluster:node,rack:node"
match-format = "rv1_nosched"
EOF
'
test_expect_success 'reload resource with monitor-force-up' '
    flux module reload -f resource noverify monitor-force-up
'
test_expect_success 'load fluxion modules' '
    flux module load sched-fluxion-resource &&
    flux module load sched-fluxion-qmanager &&
    flux queue start --all
'
test_expect_success 'wait for fluxion to be ready' '
    time -p flux module stats sched-fluxion-resource
'
test_expect_success 'create a set of 10000 jobs' '
    flux submit --cc=1-10000 --quiet \
        -N 1 --exclusive \
        --requires="host:fake[1005]" \
        --queue=compute \
        --setattr=exec.test.run_duration=0.01s \
        hostname
'
test_expect_success 'get the queue status' '
    flux queue status -vvv
'
test_expect_success 'cancel the jobs' '
    jobids=$(flux job last 10000) &&
    time -p flux cancel --all
'
test_expect_success 'get logs and match stats' '
    flux dmesg && \
    flux jobs -a -c 11000 && \
    time -p flux module stats sched-fluxion-resource
'
test_expect_success 'get the queue status' '
    flux queue status -vvv
'
test_expect_success 'ensure all succeeded' '
    time -p flux job wait -av
'
test_expect_success 'get logs and match stats' '
    flux jobs -a -c 11000
'
test_expect_success 'get the queue status' '
    flux queue status -vvv
'
test_expect_success 'unload fluxion' '
    flux module stats sched-fluxion-resource &&
    time -p flux module remove sched-fluxion-qmanager &&
    time -p flux module remove sched-fluxion-resource &&
    flux module stats sched-fluxion-resource
'
test_done
```

That test generates this error:

Jun 13 07:21:46.174028 UTC sched-fluxion-resource.err[0]: update_resource: exiting due to resource.acquire failure: Operation canceled
Jun 13 07:21:46.174269 UTC sched-fluxion-qmanager.err[0]: update_on_resource_response: exiting due to sched-fluxion-resource.notify failure: Operation canceled
flux-broker: broker module 'sched-fluxion-resource' was not properly shut down
terminate called without an active exception

Flux uses about 120% CPU indefinitely at this point.

trws commented 5 months ago

Oh wow. They died on a cancel from resource.acquire. I'm actually not sure how we could get to a FATAL: exception not rethrown like that; it looks like what would happen if an uncaught throw inside a destructor terminated the process. Did it OOM or something?

milroy commented 5 months ago

Yes, it looks like the test that creates blocked jobs causes flux-broker-0 to get killed by the OOM killer after consuming 9,089,876 kB of total-vm. That doesn't seem to happen without blocked jobs, but I'll rerun the test.

milroy commented 5 months ago

I can confirm that the test that simply creates and cancels 10,000 jobs doesn't get killed by the OOM killer. It does eventually complete after 47 minutes, so it may not reproduce the title issue.

milroy commented 5 months ago

I've continued the investigation but haven't turned up much. First I generated a perf mem trace. Here are some relevant output snippets:

Loads:

[screenshot of perf mem load samples, 2024-06-16]

Stores:

[screenshot of perf mem store samples, 2024-06-16]

I also tried Valgrind massif to look at the heap and ended up with a bunch of garbage:

#-----------
snapshot=3
#-----------
time=311735
mem_heap_B=9961
mem_heap_extra_B=383
mem_stacks_B=0
heap_tree=detailed
n4: 9961 (heap allocation functions) malloc/new/new[], --alloc-fns, etc.
 n5: 9211 0x115E88: ??? (in /usr/bin/dash)
  n1: 8193 0x11371D: ??? (in /usr/bin/dash)
   n1: 8193 0x115D77: ??? (in /usr/bin/dash)
    n1: 8193 0x10ED44: ??? (in /usr/bin/dash)
     n1: 8193 0x10F452: ??? (in /usr/bin/dash)
      n1: 8193 0x10E20E: ??? (in /usr/bin/dash)
       n1: 8193 0x115BC4: ??? (in /usr/bin/dash)
        n1: 8193 0x10C70E: ??? (in /usr/bin/dash)
         n0: 8193 0x4876249: (below main) (libc_start_call_main.h:58)
  n2: 736 0x11D92C: ??? (in /usr/bin/dash)
   n1: 576 0x11E434: ??? (in /usr/bin/dash)
    n1: 576 0x10C5FA: ??? (in /usr/bin/dash)
     n0: 576 0x4876249: (below main) (libc_start_call_main.h:58)
   n0: 160 in 3 places, all below massif's threshold (1.00%)
  n1: 136 0x1136D9: ??? (in /usr/bin/dash)
   n1: 136 0x115D77: ??? (in /usr/bin/dash)
    n1: 136 0x10ED44: ??? (in /usr/bin/dash)
     n1: 136 0x10F452: ??? (in /usr/bin/dash)
      n1: 136 0x10E20E: ??? (in /usr/bin/dash)
       n1: 136 0x115BC4: ??? (in /usr/bin/dash)
        n1: 136 0x10C70E: ??? (in /usr/bin/dash)
         n0: 136 0x4876249: (below main) (libc_start_call_main.h:58)

I also collected a perf trace and saw typical culprits: boost::iterators::filter_iterator::satisfy_predicate took about 28% of Fluxion execution time, and std::char_traits::compare took 12%:

[screenshot of perf report output, 2024-06-16]

In sched-fluxion-qmanager, the only thing that stood out was flux_requeue took 12% of the execution time.

I think the next thing to check is the core Mark referenced in the first comment.

trws commented 5 months ago

We probably need a tcmalloc, jemalloc, or heaptrack trace to see where the allocations are coming from. perf mem will show us the load/store pattern but doesn't track allocations AFAIK. I would have thought massif would work, but that does seem like some non-useful output.
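
For example, a jemalloc profiling run might look roughly like this (a sketch, assuming a jemalloc build with `--enable-prof`; the library path, test name, and output paths are placeholders):

```sh
# Run the reproducer with jemalloc heap profiling enabled.
LD_PRELOAD=/usr/lib64/libjemalloc.so.2 \
MALLOC_CONF="prof:true,prof_prefix:jeprof.out,lg_prof_interval:30" \
    ./t9999-cancel-reproducer.t -v

# Inspect the resulting heap dumps against the broker binary.
jeprof --text /path/to/flux-broker jeprof.out.<pid>.*.heap
```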

I'm pretty sure the cause of the actual hang behavior is that the process_provisional_reconsider method needs a set_schedulability(true). That solves it for me, and some other things too, but it changes the meaning of queue_depth such that at least two tests fail. They fail because they successfully backfill one more job than they are expected to, since that job is supposed to be past the queue depth. I'm still working through whether that's behavior we need to preserve, or if it's an artifact of running the scheduling loop once more in a condition we didn't before because we didn't have a concept of "scheduling but blocked" jobs.
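
For reference, a minimal sketch of the kind of change being described (class and method names are assumed from the discussion above, not copied from the tree):

```cpp
// Hedged sketch only: queue_policy_base_t, the method signature, and the
// surrounding logic are assumptions based on the comment above.
int queue_policy_base_t::process_provisional_reconsider ()
{
    // ... existing logic that moves provisionally blocked jobs back into
    //     consideration after a cancel ...

    // Mark the queue schedulable again so qmanager re-enters the scheduling
    // loop; otherwise reconsidered jobs can sit until some unrelated event
    // triggers another pass, which presents as a hang.
    set_schedulability (true);
    return 0;
}
```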

milroy commented 5 months ago

> I'm pretty sure the cause of the actual hang behavior is that the process_provisional_reconsider method needs a set_schedulability(true). It solves it for me

I tested your fix in PR #1223 against the test script above that creates blocked jobs and cancels 10,000 jobs. The runtime was 4353 seconds with the fix and 4422 seconds without it. Strangely enough, the OOM condition didn't occur in either test.

trws commented 4 months ago

At least it finishes, I guess. I had a test where, in about one out of three runs, just bulk-submitting 100 jobs asking for the same drained node and then two jobs asking for 1 node (no requirement) would hang after scheduling the first 3-5 jobs. The cancels probably provide the necessary trips back into the scheduling loop, though that performance is just horrendous. I have a couple of things stacked up to help that somewhat. Hoping maybe we can work it so that a cancel can apply to an in-progress scheduling loop. I think I can make that safe, possibly without even restarting it.

trws commented 4 months ago

New information: making the cancels instantaneous actually doesn't help, because we don't even see them until far too late. When the test gets to the second flux jobs invocation and shows them all cancelled, qmanager has only seen about 300 of the alloc requests! It would take a long time to even get to the first cancel message. I'm not sure why that is yet, but it's a very different problem than I expected.

trws commented 4 months ago

Documenting as I go since I have to drop it for the day. It looks like we have at least two problems: 1) sched is entering the scheduling loop once after each alloc call, which is much too often for a high-rate situation like this; 2) the job-manager processes all the cancels but keeps sending all the no-longer-valid alloc requests anyway, for something like 10 minutes, before we start getting cancels to remove them.

grondo commented 4 months ago

I'll open an issue in flux-core to investigate the job-manager sending alloc requests after the jobs have a fatal exception raised, but I wonder if the alloc requests have all already been sent (I think Fluxion uses the unlimited alloc limit, so no alloc requests are "held back") and are sitting in the message queue of the qmanager? We could perhaps check if the behavior is duplicated with sched-simple.
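
A quick way to check that could be to rerun the reproducer with the core scheduler swapped in (sketch):

```sh
# Replace Fluxion with sched-simple in the reproducer instance.
flux module remove sched-fluxion-qmanager
flux module remove sched-fluxion-resource
flux module load sched-simple
```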

If an alloc_limit could be used with Fluxion, then the alloc requests for canceled jobs would never be sent. However, I think to do that the job manager would need to separate jobs into queues so that one queue full of jobs doesn't starve out the others.

Edit: opened flux-framework/flux-core#6051

trws commented 4 months ago

That's entirely possible. The combination is really the problem. If we can figure out why fluxion is only seeing one alloc request at a time, we can just jam them into the queue, get to the cancels, and be ok. Does the job manager need to see a reply before it releases the next one or something? It seems really odd to me that we get into the idle watcher between each and every one otherwise.

garlick commented 4 months ago

In order to be fair to all sources of reactor events, I believe the handle watcher (and hence one message handler) only gets called once per event loop iteration. The fact that the event loop is at the "bottom" is not an indication that there are no more events to handle.

garlick commented 4 months ago

We've used a pattern elsewhere involving a "batch timer". On receipt of the first message in a batch, the timer is armed. When the timer expires, the batch is handled. That might be helpful here for driving the scheduling loop, if qmanager isn't already doing something like that.
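
A rough sketch of that pattern against the flux-core reactor API, as qmanager (C++) might use it; the struct, callback names, and the 30 ms window are illustrative, not taken from qmanager:

```cpp
#include <flux/core.h>

struct batch_t {
    flux_watcher_t *timer = nullptr;
    bool armed = false;
    // pending alloc requests would be queued here
};

// Timer callback: the batch window closed; run one scheduling pass over
// everything that accumulated, then disarm until the next batch begins.
static void batch_cb (flux_reactor_t *, flux_watcher_t *, int, void *arg)
{
    auto *b = static_cast<batch_t *> (arg);
    b->armed = false;
    // ... run the scheduling loop once over all queued alloc requests ...
}

// Message handler: queue the alloc request and arm the one-shot timer on
// the first message of a batch instead of scheduling immediately.
static void alloc_cb (flux_t *, flux_msg_handler_t *, const flux_msg_t *,
                      void *arg)
{
    auto *b = static_cast<batch_t *> (arg);
    // ... decode msg and append the request to the pending queue ...
    if (!b->armed) {
        flux_timer_watcher_reset (b->timer, 0.030, 0.);  // 30 ms window
        flux_watcher_start (b->timer);
        b->armed = true;
    }
}

static int batch_init (flux_t *h, batch_t *b)
{
    b->timer = flux_timer_watcher_create (flux_get_reactor (h), 0.030, 0.,
                                          batch_cb, b);
    return b->timer ? 0 : -1;
}
```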

trws commented 4 months ago

It isn't using a batch timer currently, but I'd been considering doing something like that to keep it from spending too long in the actual scheduling loop before re-entering the reactor, and it could certainly work here too. It wouldn't be too hard to do the other way around. Any idea what makes a good baseline delay or is it too case-specific? The main potential downside is that it will add latency to each job in a slow submission scenario.

Do we not have a way to run something when the reactor is currently drained of messages, or a way to detect the depth of the message queue? It feels like a way to track that depth would probably help working out back-pressure, but maybe the timer is sufficient.

garlick commented 4 months ago

> Any idea what makes a good baseline delay or is it too case-specific?

In other use cases it's often 10s of milliseconds so that response latency doesn't suffer too much. In this case I guess it depends on how quickly you can drain the queue, e.g. how long it takes to decode a message and insert an alloc request into the queue. It might be useful to allow it to be live tuned by TOML config.
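
For instance, the delay could be exposed as something like this (a hypothetical option name; no such key exists in the current TOML schema):

```toml
[sched-fluxion-qmanager]
# hypothetical: seconds to batch incoming alloc requests before entering
# the scheduling loop
sched-batch-delay = 0.03
```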

> Do we not have a way to run something when the reactor is currently drained of messages, or a way to detect the depth of the message queue?

There isn't, but now that we no longer use zeromq for interthread messages, we could easily add a connector-specific flux_opt_get() hook for that.
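
For illustration, such a hook might be queried like this (the option name is hypothetical; it does not exist today):

```cpp
#include <flux/core.h>

// Returns the interthread connector's receive-queue depth, or -1 if the
// connector does not support the (hypothetical) option.
static int get_recv_queue_depth (flux_t *h)
{
    int depth = 0;
    if (flux_opt_get (h, "interthread::recv-queue-depth",
                      &depth, sizeof (depth)) < 0)
        return -1;
    return depth;
}
```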

trws commented 4 months ago

On the memory issue you brought up @milroy, I haven't seen it hit 9 GB, but just loading all of the resource graph from the test and letting a few hundred of the jobs load got me well into the 6.5 GB+ range. With some modifications to core I managed to get heaptrack to work; you can take a look at this trace for fine detail, but it looks like the resource graph is consuming almost 6 GB all by itself, mostly in vertex, edge, and planner objects. Still drilling down into this to try and understand it better, but that's a whole lot of memory.
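
For anyone trying to reproduce this, the collection looks roughly like the following (a sketch; the trace here was gathered with flux-core modifications that aren't shown, and the PID and filename are placeholders):

```sh
# Attach heaptrack to the running rank-0 broker, then open the trace.
heaptrack --pid <flux-broker-0-pid>
heaptrack --analyze heaptrack.flux-broker.<pid>.gz
```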

trws commented 4 months ago

Actually including trace: heaptrack.flux-broker.457190.gz

Flamegraph of peak memory usage: [image not reproduced]