flux-framework / flux-sched

Fluxion Graph-based Scheduler

double booking observed on cray system #1170

Open garlick opened 1 month ago

garlick commented 1 month ago

Problem: two jobs were allocated the same resources.

The jobs in question are foxZW969KtT and foy5qJFL9Qs, both of which got allocations on cray8292 (rank 8048). Thanks @adambertsch for catching this. This was actually detected when flux resource list threw a duplicate allocation error (because the job manager now builds the allocated resource set from all the R fragments assigned to running jobs and that failed).

You can use flux job info <jobid> R | jq to view the resources assigned to each. On rank 8048, cores 80-95 were assigned to foxZW969KtT and cores 0-95 to foy5qJFL9Qs.
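
For illustration, here is a standalone sketch (not the flux-core code) of the kind of check that tripped: expand the two core idsets from the R fragments and look for cores that are booked twice.

// Standalone sketch (not flux-core code): detect overlapping core ids
// between two R fragments' core sets, given simple idset strings
// such as "80-95" and "0-95" or "0-3,8-11".
#include <iostream>
#include <set>
#include <sstream>
#include <string>

// Expand an idset string ("0-95", "0-3,8-11") into a set of integers.
static std::set<int> expand_idset (const std::string &s)
{
    std::set<int> ids;
    std::stringstream ss (s);
    std::string tok;
    while (std::getline (ss, tok, ',')) {
        auto dash = tok.find ('-');
        int lo = std::stoi (tok.substr (0, dash));
        int hi = (dash == std::string::npos) ? lo
                                             : std::stoi (tok.substr (dash + 1));
        for (int i = lo; i <= hi; i++)
            ids.insert (i);
    }
    return ids;
}

int main ()
{
    // cores assigned on rank 8048 to each job in this issue
    auto job1 = expand_idset ("80-95");  // foxZW969KtT
    auto job2 = expand_idset ("0-95");   // foy5qJFL9Qs

    std::set<int> dup;
    for (int id : job1)
        if (job2.count (id))
            dup.insert (id);

    if (!dup.empty ())
        std::cout << "duplicate allocation: " << dup.size ()
                  << " cores booked twice\n";
    return dup.empty () ? 0 : 1;
}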

Here is the scheduler config (excerpted from flux config get | jq)

  "sched-fluxion-qmanager": {
    "queue-policy": "easy"
  },
  "sched-fluxion-resource": {
    "match-policy": "firstnodex",
    "prune-filters": "ALL:core,ALL:gpu,cluster:node,rack:node",
    "match-format": "rv1_nosched"
  },
trws commented 1 month ago

Information pulled for later debugging, partially redacted; contact me for the original.

# flux job info foy5qJFL9Qs eventlog | jq -c '.timestamp -= 1712877164.122406 |  {timestamp, name}'
{"timestamp":0,"name":"submit"}
{"timestamp":0.027739763259887695,"name":"validate"}
{"timestamp":0.048435211181640625,"name":"depend"}
{"timestamp":0.04851555824279785,"name":"priority"}
{"timestamp":1.1492242813110352,"name":"alloc"}
{"timestamp":1.1503560543060303,"name":"prolog-start"}
{"timestamp":1.1503760814666748,"name":"prolog-start"}
{"timestamp":1.5522873401641846,"name":"cray_port_distribution"}
{"timestamp":1.5525174140930176,"name":"prolog-finish"}
{"timestamp":2.7350993156433105,"name":"prolog-finish"}
{"timestamp":2.8903696537017822,"name":"start"}
{"timestamp":10.688645601272583,"name":"finish"}
{"timestamp":10.689984560012817,"name":"epilog-start"}
{"timestamp":11.628239631652832,"name":"release"}
{"timestamp":747.695139169693,"name":"epilog-finish"}
{"timestamp":747.695253610611,"name":"free"}
{"timestamp":747.6952958106995,"name":"clean"}
# flux job info foxZW969KtT eventlog | jq -c '.timestamp -= 1712877164.122406 |  {timestamp, name}'
{"timestamp":-3991.654640197754,"name":"submit"}
{"timestamp":-3991.628582715988,"name":"validate"}
{"timestamp":-3991.6084730625153,"name":"depend"}
{"timestamp":-3991.6084220409393,"name":"priority"}
{"timestamp":-3969.4593319892883,"name":"alloc"}
{"timestamp":-3969.45880818367,"name":"prolog-start"}
{"timestamp":-3969.458781003952,"name":"prolog-start"}
{"timestamp":-3969.0725326538086,"name":"cray_port_distribution"}
{"timestamp":-3969.072403907776,"name":"prolog-finish"}
{"timestamp":-3959.6299834251404,"name":"prolog-finish"}
{"timestamp":-3959.24777674675,"name":"start"}
{"timestamp":-3952.4265785217285,"name":"exception"} #NOTE: exception 1
{"timestamp":-3951.0524611473083,"name":"finish"}
{"timestamp":-3951.0521211624146,"name":"epilog-start"}
{"timestamp":-3950.356454849243,"name":"release"}
{"timestamp":-343.0432062149048,"name":"exception"} #NOTE: exception 2
{"timestamp":3523.777511358261,"name":"epilog-finish"}
{"timestamp":3523.7776334285736,"name":"free"}
{"timestamp":3523.7776606082916,"name":"clean"}

The context for the two exceptions is important:

exception 1: {"type":"exec","severity":0,"note":"task 288 (host hostname1050): start failed: /home/user/src/ior/src/ior: No such file or directory","userid":uid}
exception 2: {"type":"cancel","severity":0,"note":"","userid":uid}

Note that the cancel is sent 343ms before the start of foy5qJFL9Qs. Somehow we're getting the cancel and releasing the resources before the epilog-finish and free events.

garlick commented 1 month ago

That is odd. I checked earlier and it does not appear that fluxion subscribes to the job-exception event messages that are published as the exception occurs. In theory it shouldn't be getting any cues that anything happened until the job manager sends the sched.free request corresponding to the free event in the job eventlog.

milroy commented 1 month ago

Note that the cancel is sent 343ms before the start of foy5qJFL9Qs. Somehow we're getting the cancel and releasing the resources before the epilog-finish and free events.

Are the timestamps milliseconds? Looking at some of the raw logs they seem to be seconds. Comparing with other logs the cancel appears to have occurred around 5 minutes before the second job started.

Does the log from exception 2 mean it was a user-initiated cancel, or is it coming from the job manager?

I ask because there may be a path around the typical sched.free request in resource: https://github.com/flux-framework/flux-sched/blob/c8e03f8416b3d7803c6108de5ea8f15292728568/resource/modules/resource_match.cpp#L2023

which is the callback corresponding to the sched-fluxion-resource.cancel topic.

trws commented 1 month ago

Is it possible we're getting the cancel event through qmanager from schedutil sched.cancel somehow? That code path does free up the resources, so it would actually produce this result.

I was just digging through the RFC and looked at sched-simple, and it's a bit ill-defined what cancel is supposed to do when the job is already allocated. The RFC says something like "some of these stages may not be necessary and the job enters cleanup"; fluxion clears the job out immediately, while sched-simple removes it from the queue if it's still waiting for resources and otherwise ignores the event entirely.

It was logged as a user-initiated cancel, but there might be some other way to generate that I don't know about. That said, I haven't found the path by which that would get invoked; it ties back quite far, but I haven't found the final link to either the job-exception event or a watch on the eventlog.

trws commented 1 month ago

Ah, and you're right, they're seconds. I clearly need some sleep; I'll look back in on this first thing in the morning after I literally sleep on it.

garlick commented 1 month ago

The cancel request to the scheduler is supposed to cancel the sched.alloc RPC, not the allocation. The scheduler does not respond to the cancel request directly. Instead, it must terminate the alloc RPC, either with a success response (resources allocated) or an error (such as "canceled"). If the alloc RPC is already retired, it should do nothing.

In this case, the alloc event in the eventlog indicates that the alloc RPC had already been retired, so a cancel request should not have been generated. If it was (because of a bug, maybe), then the scheduler should be ignoring it. So I think the theory boils down to an unlikely scenario where two things have to be true: flux-core has to be sending cancel requests for alloc requests that have already been retired, and fluxion has to be handling them as though the alloc request were not already retired. It's worth checking though!
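
In other words, a conforming scheduler only ever acts on a cancel for a job whose alloc RPC is still outstanding. A minimal sketch of that rule, using stand-in types rather than the real schedutil handler:

// Sketch of the intended cancel semantics (stand-in types; the real
// handler goes through schedutil): a cancel targets the pending
// sched.alloc RPC, never an allocation that was already granted.
#include <cstdint>
#include <iostream>
#include <map>

enum class alloc_state { pending, allocated, retired };

struct sched {
    std::map<uint64_t, alloc_state> jobs;

    void handle_cancel (uint64_t id)
    {
        auto it = jobs.find (id);
        if (it == jobs.end () || it->second != alloc_state::pending) {
            // alloc RPC already retired (or unknown job): do nothing.
            std::cout << id << ": cancel ignored\n";
            return;
        }
        // Terminate the outstanding alloc RPC with an error ("canceled")
        // and drop the request from the pending queue.  Resources were
        // never allocated, so nothing is freed here.
        it->second = alloc_state::retired;
        std::cout << id << ": pending alloc canceled\n";
    }
};

int main ()
{
    sched s;
    s.jobs[1] = alloc_state::pending;    // still waiting for resources
    s.jobs[2] = alloc_state::allocated;  // alloc RPC already retired
    s.handle_cancel (1);  // -> pending alloc canceled
    s.handle_cancel (2);  // -> cancel ignored: must wait for sched.free
    return 0;
}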

milroy commented 1 month ago

It was logged as a user-initiated cancel, but there might be some other way to generate that I don't know about.

I was wondering if somehow a user-initiated cancel is bypassing qmanager and directly invoking cancel_request_cb in resource_module. It sounds like that's not likely.

The cancel request to the scheduler is supposed to cancel the sched.alloc RPC not the allocation.

The cancel request to the scheduler might be cancelling the allocation. Examining the complex control flow starting with the .cancel in qmanager: https://github.com/flux-framework/flux-sched/blob/c8e03f8416b3d7803c6108de5ea8f15292728568/qmanager/modules/qmanager.cpp#L511

The control flow enters qmanager_safe_cb_t and then qmanager_cb_t: qmanager_cb_t::jobmanager_cancel_cb --> remove --> queue_policy_base_impl_t::remove

Assuming the job was in job_state_kind_t::RUNNING, it's inserted into the m_complete map.

From there it will eventually be popped in the queue_policy_bf_base_t: https://github.com/flux-framework/flux-sched/blob/c8e03f8416b3d7803c6108de5ea8f15292728568/qmanager/policies/queue_policy_bf_base_impl.hpp#L34

immediately before it's canceled: https://github.com/flux-framework/flux-sched/blob/c8e03f8416b3d7803c6108de5ea8f15292728568/qmanager/policies/queue_policy_bf_base_impl.hpp#L35

The EASY queue policy inherits from queue_policy_bf_base_t, which inherits from queue_policy_base_t, so the lines above would have been called if I've traced this correctly.

I hope that combination of markdown and URLs isn't too confusing and that I haven't misunderstood the .cancel origin in qmanager.
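
To make the suspected path concrete, here is a simplified sketch with stand-in types (not the real queue_policy classes): remove() on a RUNNING job lands it in m_complete, and the next pop pass sends a cancel to resource, which clears the allocation.

// Simplified sketch of the suspected path (stand-in types, not the real
// queue_policy_base_t / queue_policy_bf_base_t classes): remove() on a
// RUNNING job moves it into m_complete, and the next pop pass sends a
// cancel to the resource module, clearing its allocation.
#include <cstdint>
#include <iostream>
#include <map>

enum class job_state { pending, running, complete };

struct resource_module {
    // stand-in for the sched-fluxion-resource.cancel handler,
    // which removes the job's allocation from the resource graph
    void cancel (uint64_t id) { std::cout << id << ": allocation cleared\n"; }
};

struct queue_policy {
    resource_module &resource;
    std::map<uint64_t, job_state> jobs;
    std::map<uint64_t, job_state> m_complete;  // jobs waiting to be retired

    // stand-in for queue_policy_base_impl_t::remove
    void remove (uint64_t id)
    {
        auto it = jobs.find (id);
        if (it == jobs.end ())
            return;
        // a RUNNING job gets inserted into m_complete ...
        m_complete.emplace (id, it->second);
        jobs.erase (it);
    }

    // stand-in for the pop loop in queue_policy_bf_base_impl.hpp, which
    // cancels each completed job with the resource module
    void pop_complete ()
    {
        for (auto &kv : m_complete)
            resource.cancel (kv.first);  // frees resources of a running job
        m_complete.clear ();
    }
};

int main ()
{
    resource_module res;
    queue_policy q{res};
    q.jobs[42] = job_state::running;
    q.remove (42);       // cancel arrives while the job is still running
    q.pop_complete ();   // -> allocation cleared before free/epilog-finish
    return 0;
}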

garlick commented 1 month ago

That's the correct origin in qmanager. However, in jobmanager_cancel_cb(), it looks like remove is only called if the job is in job_state_kind::PENDING state (checked by job->is_pending).

trws commented 1 month ago

Ok, I don't have a full answer, but here's at least some further tests and analysis. I started with this flow:

flux submit --wait-event epilog-start -N 1 -c 4 -n 1 /non-existant/path > id
flux job raise $(cat id)
flux submit --wait -N 1 --exclusive -n 1 hostname > id2
J1_EXCEPTION=$(flux job info $(cat id) eventlog | grep exception | tail -n 1 | jq -c '.timestamp')
echo JOB 1
flux job info $(cat id) eventlog | jq -c ".timestamp -= $J1_EXCEPTION"
echo JOB 2
flux job info $(cat id2) eventlog | jq -c ".timestamp -= $J1_EXCEPTION"

It produces the same flow of events that we see in the eventlog, except that the two allocations do not overlap. Clearly it's not that simple. If I add an explicit flux ion-resource cancel $(cat id) around the raise, it does exactly what we see in the double booking trace.

I'm trying to come up with how to actually replicate the cancel. I'm getting the impression that this was a flux run rather than a flux submit, and a ctrl-c-raised exception rather than an explicit flux job raise, but that isn't normally possible because flux run returns at finish or release rather than after epilog-finish. Maybe something strange happens if a double ctrl-c hits run while it is stuck waiting to be released? I'm not sure how to test that, though.

grondo commented 1 month ago

I don't think this was a ctrl-C because those exceptions are logged like this:

[  +0.807710] exception type="cancel" severity=0 note="interrupted by ctrl-C" userid=6885

i.e. the note explicitly says interrupted by ctrl-C.

trws commented 1 month ago

Yup, the note's empty, so it must have been a raise. I'm getting progressively less sure how we got to this place unless we ended up with a race condition. I can easily go in and make it so the cancel RPC doesn't clear resource allocations, but if we can't repro I'm not sure that's what it is.

trws commented 1 month ago

I've been banging on this every way I can think of, and so far can't reproduce the scenario without a resource.cancel RPC coming from somewhere.

The best I have right now is that the interface between qmanager and resource doesn't differentiate between free and cancel. There's only a cancel RPC on the resource side, so if for any reason qmanager considers the job complete, or gets into the remove method while the job is in the run or alloc_run state, it sends the same message to resource, which clears the allocation status of the job. Even considering race conditions, I haven't found a way that can actually happen yet; everything seems to be protected with checks that the job is in pending state. That said, it would be really easy for there to be a code path in here that deallocates things when it shouldn't, because the two are conflated.

It should be relatively straightforward to split this so that a cancel RPC to qmanager has no code path to the resource-side cancel RPC. Pending a way to reproduce this explicitly, that's the most useful thing I can think of to do. It will at least make sure that the cancel RPC on qmanager only ever cancels a pending job.
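
Roughly, the split would look like this (a sketch with stand-in types, not a patch against qmanager): the job-manager cancel callback can only retract a pending job, and only the free path reaches the resource-side cancel.

// Rough sketch of the proposed split (stand-in types, not a real patch):
// the job-manager cancel callback can only retract a pending alloc, and
// only the free path is allowed to clear an allocation in resource.
#include <cstdint>
#include <iostream>
#include <map>

enum class job_state { pending, running };

struct resource_iface {
    void remove_allocation (uint64_t id)  // today: sched-fluxion-resource.cancel
    {
        std::cout << id << ": allocation removed\n";
    }
};

struct qmanager {
    resource_iface &resource;
    std::map<uint64_t, job_state> jobs;

    // cancel: only retract a job that is still waiting for resources
    void jobmanager_cancel (uint64_t id)
    {
        auto it = jobs.find (id);
        if (it == jobs.end () || it->second != job_state::pending) {
            std::cout << id << ": cancel ignored (not pending)\n";
            return;  // no path from here to the resource-side cancel
        }
        jobs.erase (it);
        std::cout << id << ": pending job removed from queue\n";
    }

    // free: the only way an allocation is ever released
    void jobmanager_free (uint64_t id)
    {
        jobs.erase (id);
        resource.remove_allocation (id);
    }
};

int main ()
{
    resource_iface res;
    qmanager qm{res};
    qm.jobs[1] = job_state::running;
    qm.jobmanager_cancel (1);  // ignored: running job keeps its allocation
    qm.jobmanager_free (1);    // allocation released only via free
    return 0;
}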