The following script reproduces an approximately linear scheduling slowdown as a function of the number of jobs submitted and then cancelled with a constraint specifying a single node. For example, after 500 constrained jobs are submitted and then cancelled, 10 unconstrained 4-node jobs running hostname take 113s to complete; after 1000 submitted and cancelled jobs, the 4-node jobs take 248s. I ran tests with up to 5000 submitted and cancelled jobs.
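Roughly, the test follows this pattern (a sketch with an assumed node name and flags, not the script itself):

```sh
# Sketch of the slowdown test (flags and node name are assumptions).
NJOBS=${1:-500}          # number of constrained jobs to submit and cancel
NODE=node0001            # single node named in the constraint (assumption)

# Submit NJOBS single-task jobs constrained to one node, then cancel them all.
flux submit --cc=1-${NJOBS} --requires="host:${NODE}" -n1 hostname >/dev/null
flux cancel $(flux job last ${NJOBS})

# Time how long 10 unconstrained 4-node hostname jobs take to complete.
time ( flux submit --cc=1-10 -N4 hostname >/dev/null && flux queue drain )
```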
It doesn't reproduce the stuck "alloc pending" count.
I'm going to experiment with adding running and blocked jobs to try to reproduce the hang.
I believe I've reproduced the scheduler hang (no progress after more than 85 minutes at a roughly constant 120% CPU usage) with the following test:
The hang doesn't seem to occur until beyond 4000 immediately cancelled jobs (note the 10,000 jobs above). The test throws two errors, one from:
jobids=$(flux job last 10000) &&
time -p flux cancel ${jobids}
I did verify that the 10,000 jobs are actually cancelled, so the error could be related to the number of jobids passed to flux cancel. The other error is:
Jun 13 01:10:57.267291 UTC sched-fluxion-resource.err[0]: update_resource: exiting due to resource.acquire failure: Operation canceled
flux-broker: broker module 'sched-fluxion-qmanager' was not properly shut down
flux-broker: broker module 'sched-fluxion-resource' was not properly shut down
FATAL: exception not rethrown
Update: I can reproduce a hang without the additional jobs by simply creating and cancelling 10,000 jobs:
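A minimal sketch of that test (again with an assumed node name and flags rather than the original script):

```sh
# Sketch only: create 10,000 constrained jobs and immediately cancel them.
NODE=node0001            # node named in the constraint (assumption)
flux submit --cc=1-10000 --requires="host:${NODE}" -n1 hostname >/dev/null
jobids=$(flux job last 10000) &&
time -p flux cancel ${jobids}
```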
Which generates this error:
Jun 13 07:21:46.174028 UTC sched-fluxion-resource.err[0]: update_resource: exiting due to resource.acquire failure: Operation canceled
Jun 13 07:21:46.174269 UTC sched-fluxion-qmanager.err[0]: update_on_resource_response: exiting due to sched-fluxion-resource.notify failure: Operation canceled
flux-broker: broker module 'sched-fluxion-resource' was not properly shut down
terminate called without an active exception
Flux uses about 120% CPU indefinitely at this point.
Oh wow. They died on a cancel from resource.acquire. I'm actually not sure how we could get to a FATAL: exception not rethrown like that; it looks like what would happen if you hit an uncaught throw inside a destructor and terminated the process. Did it OOM or something?
Yes, it looks like the test that creates blocked jobs causes flux-broker-0 to get killed by the OOM killer after consuming 9,089,876kB total-vm. That doesn't seem to happen without blocked jobs, but I'll rerun the test.
I can confirm that the test that simply creates and cancels 10,000 jobs doesn't get killed by the OOM killer. It does eventually complete after 47 minutes, so it may not reproduce the title issue.
I've continued the investigation but haven't turned up much. First I generated a perf mem trace. Here are some relevant output snippets:
I also tried Valgrind massif to look at the heap and ended up with a bunch of garbage:
#-----------
snapshot=3
#-----------
time=311735
mem_heap_B=9961
mem_heap_extra_B=383
mem_stacks_B=0
heap_tree=detailed
n4: 9961 (heap allocation functions) malloc/new/new[], --alloc-fns, etc.
n5: 9211 0x115E88: ??? (in /usr/bin/dash)
n1: 8193 0x11371D: ??? (in /usr/bin/dash)
n1: 8193 0x115D77: ??? (in /usr/bin/dash)
n1: 8193 0x10ED44: ??? (in /usr/bin/dash)
n1: 8193 0x10F452: ??? (in /usr/bin/dash)
n1: 8193 0x10E20E: ??? (in /usr/bin/dash)
n1: 8193 0x115BC4: ??? (in /usr/bin/dash)
n1: 8193 0x10C70E: ??? (in /usr/bin/dash)
n0: 8193 0x4876249: (below main) (libc_start_call_main.h:58)
n2: 736 0x11D92C: ??? (in /usr/bin/dash)
n1: 576 0x11E434: ??? (in /usr/bin/dash)
n1: 576 0x10C5FA: ??? (in /usr/bin/dash)
n0: 576 0x4876249: (below main) (libc_start_call_main.h:58)
n0: 160 in 3 places, all below massif's threshold (1.00%)
n1: 136 0x1136D9: ??? (in /usr/bin/dash)
n1: 136 0x115D77: ??? (in /usr/bin/dash)
n1: 136 0x10ED44: ??? (in /usr/bin/dash)
n1: 136 0x10F452: ??? (in /usr/bin/dash)
n1: 136 0x10E20E: ??? (in /usr/bin/dash)
n1: 136 0x115BC4: ??? (in /usr/bin/dash)
n1: 136 0x10C70E: ??? (in /usr/bin/dash)
n0: 136 0x4876249: (below main) (libc_start_call_main.h:58)
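Every frame in that snapshot is in /usr/bin/dash, so massif appears to have profiled a shell wrapper rather than flux-broker itself. One option might be to wrap the broker directly via flux start's --wrap option (a sketch; the massif options and script name are assumptions):

```sh
# Sketch: wrap the broker in massif so the heap profile covers flux-broker
# rather than an intermediate shell (reproducer script name is hypothetical).
flux start --wrap=valgrind,--tool=massif,--massif-out-file=massif.out.%p \
    ./reproducer.sh
ms_print massif.out.<broker-pid>
```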
I also collected a perf trace and saw typical culprits: boost::iterators::filter_iterator::satisfy_predicate took about 28% of Fluxion execution time, and std::char_traits::compare took 12%:
In sched-fluxion-qmanager, the only thing that stood out was that flux_requeue took 12% of the execution time.
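For reference, a rough sketch of how traces like these can be gathered (the exact perf invocations used in this investigation may have differed):

```sh
# Sketch: collect load/store samples system-wide (needs perf privileges),
# and a call-graph CPU profile of the local rank 0 broker, 60 seconds each.
perf mem record -a -- sleep 60 && perf mem report

BROKER_PID=$(flux getattr broker.pid)
perf record -g -p ${BROKER_PID} -- sleep 60 && perf report
```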
I think the next thing to check is the core file that Mark referenced in the first comment.
We probably need a tcmalloc, jemalloc, or heaptrack trace to see where the allocations are coming from. perf mem will show us the load/store pattern but doesn't track allocations AFAIK. I would have thought massif would have worked, but that does seem like non-useful output.
I'm pretty sure the cause of the actual hang behavior is that the process_provisional_reconsider method needs a set_schedulability(true). It solves it for me, and some other things too, but it changes the meaning of queue_depth such that at least two tests fail. They fail because they successfully backfill one more job than they are expected to, because that job is supposed to be past the queue depth. I'm still working through whether that's behavior we need to preserve, or whether it's an artifact of running the scheduling loop once more in a condition we didn't hit before, because we didn't previously have a concept of "scheduling but blocked" jobs.
> I'm pretty sure the cause of the actual hang behavior is that the process_provisional_reconsider method needs a set_schedulability(true). It solves it for me
I tested your fix in PR #1223 against the test script above that creates blocked jobs and cancels 10,000 jobs. The runtime was 4353 seconds with the fix and 4422 seconds without it. Strangely enough, the OOM condition didn't occur in either test.
At least it finishes, I guess. I had a test where, in about one out of three runs, just bulk submitting 100 jobs asking for the same drained node and then two jobs asking for 1 node (no requirement) would hang after scheduling the first 3-5 jobs. The cancels probably provide the necessary trips back into the scheduling loop, though that performance is just horrendous. I have a couple of things stacked up to help that somewhat. Hoping maybe we can work it so that a cancel can apply to an in-progress scheduling loop. I think I can make that safe, possibly without even restarting it.
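That reproducer might look roughly like this (a sketch; the node name, counts, and flags are assumptions):

```sh
# Sketch of the drained-node hang reproducer (host name and counts assumed).
NODE=node0001                                   # node to drain (assumption)
flux resource drain ${NODE} "fluxion hang test"

# 100 jobs that can only run on the drained node, then 2 unconstrained jobs.
flux submit --cc=1-100 --requires="host:${NODE}" -n1 hostname >/dev/null
flux submit --cc=1-2 -N1 hostname >/dev/null
flux jobs -a     # in the bad case, scheduling stalls after the first few jobs
```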
New information: making the cancels instantaneous actually doesn't help, because we don't even see them until far too late. When the test gets to the second flux jobs invocation and shows them all cancelled, qmanager has only seen about 300 of the alloc requests! It would take a long time to even get to the first cancel message. I'm not sure why that is yet, but it's a very different problem than I expected.
Documenting as I go since I have to drop it for the day. It looks like we have at least two problems: 1) sched is entering the sched loop once after each alloc call, which is much too often for a high-rate situation like this; 2) the job-manager processes all the cancels but keeps sending all the no-longer-valid alloc requests anyway, for something like 10 minutes, before we start getting cancels to remove them.
I'll open an issue in flux-core to investigate the job-manager sending alloc requests after a fatal exception has been raised on the jobs, but I wonder if the alloc requests have all already been sent (I think Fluxion uses the unlimited alloc limit, so no alloc requests are "held back") and are sitting in the qmanager's message queue? We could perhaps check whether the behavior is duplicated with sched-simple.
If an alloc_limit could be used with Fluxion, then the alloc requests for canceled jobs would never be sent. However, I think to do that the job manager would need to separate jobs into queues so that one queue full of jobs doesn't starve out the others.
Edit: opened flux-framework/flux-core#6051
That’s entirely possible. The combination is really the problem. If we can figure out why fluxion is only seeing one alloc request at a time, so we can just jam them into the queue and get to the cancels, we would be ok. Does job manager need to see a reply before it releases the next one or something? It seems really odd to me that we get into the idle watcher between each and every one otherwise.
In order to be fair to all sources of reactor events, I believe the handle watcher (and hence one message handler) only gets called once per event loop iteration. The fact that the event loop is at the "bottom" is not an indication that there are no more events to handle.
We've used a pattern elsewhere involving a "batch timer". On receipt of the first message in a batch, the timer is armed. When the timer expires, the batch is handled. That might be helpful here for driving the scheduling loop, if qmanager isn't already doing something like that.
It isn’t using a batch timer currently, but I’d been considering doing something like that to keep it from spending too long in the actual scheduling loop before re-entering the reactor and it could certainly work here too. Wouldn’t be too hard to do the other way around. Any idea what makes a good baseline delay or is it too case-specific? The main potential downside is it will add latency to each job in a slow submission scenario.
Do we not have a way to run something when the reactor is currently drained of messages, or a way to detect the depth of the message queue? It feels like a way to track that depth would probably help working out back-pressure, but maybe the timer is sufficient.
> Any idea what makes a good baseline delay or is it too case-specific?
In other use cases it's often 10s of milliseconds so that response latency doesn't suffer too much. In this case I guess it depends on how quickly you can drain the queue, e.g. how long it takes to decode a message and insert an alloc request into the queue. It might be useful to allow it to be live tuned by TOML config.
> do we not have a way to run something when the reactor is currently drained of messages, or a way to detect the depth of the message queue?
There isn't, but now that we no longer use zeromq for interthread messages we could easily add a connector-specific flux_opt_get() hook for that.
On the memory issue you brought up @milroy, I haven't seen it hit 9GB, but just loading all of the resource graph from the test and letting a few hundred of the jobs load got me well into the 6.5GB+ range. With some modifications to core I managed to get heaptrack to work; you can take a look at this trace for fine detail, but it looks like the resource graph is consuming almost 6GB all by itself, mostly in vertex, edge, and planner objects. Still drilling down into this to try to understand it better, but that's a whole lot of memory.
Actually including trace: heaptrack.flux-broker.457190.gz
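For anyone who wants to poke at it, the standard heaptrack tools can read the trace, e.g.:

```sh
# Text summary of the largest allocation sites in the attached trace.
heaptrack_print heaptrack.flux-broker.457190.gz | less

# Or open the interactive analyzer (flame graph, peak memory views).
heaptrack --analyze heaptrack.flux-broker.457190.gz
```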
flamegraph of peak memory usage:
On elcap a user submitted a large number of jobs, each requesting a specific node, then thought better of it and canceled them all. For the past few hours the scheduler has not scheduled any jobs and the "alloc pending" count is stuck at many thousands.
Attempting to stop the queues has not improved the situation.
On the recommendation of @trws I've grabbed a core of the rank 0 broker (available in my homedir).
Here are the resource module stats at the time of this failure:
The use case here is that a user is looking for bad/slow nodes and needs to run a test on each node in a queue. For now I've suggested they allocate all the nodes from the queue and then use bulksubmit in that subinstance, which should work better.