Open garlick opened 1 month ago
Information pulled for later debugging, partially redacted, contact me for an original.
```
# flux job info foy5qJFL9Qs eventlog | jq -c '.timestamp -= 1712877164.122406 | {timestamp, name}'
{"timestamp":0,"name":"submit"}
{"timestamp":0.027739763259887695,"name":"validate"}
{"timestamp":0.048435211181640625,"name":"depend"}
{"timestamp":0.04851555824279785,"name":"priority"}
{"timestamp":1.1492242813110352,"name":"alloc"}
{"timestamp":1.1503560543060303,"name":"prolog-start"}
{"timestamp":1.1503760814666748,"name":"prolog-start"}
{"timestamp":1.5522873401641846,"name":"cray_port_distribution"}
{"timestamp":1.5525174140930176,"name":"prolog-finish"}
{"timestamp":2.7350993156433105,"name":"prolog-finish"}
{"timestamp":2.8903696537017822,"name":"start"}
{"timestamp":10.688645601272583,"name":"finish"}
{"timestamp":10.689984560012817,"name":"epilog-start"}
{"timestamp":11.628239631652832,"name":"release"}
{"timestamp":747.695139169693,"name":"epilog-finish"}
{"timestamp":747.695253610611,"name":"free"}
{"timestamp":747.6952958106995,"name":"clean"}
```
```
# flux job info foxZW969KtT eventlog | jq -c '.timestamp -= 1712877164.122406 | {timestamp, name}'
{"timestamp":-3991.654640197754,"name":"submit"}
{"timestamp":-3991.628582715988,"name":"validate"}
{"timestamp":-3991.6084730625153,"name":"depend"}
{"timestamp":-3991.6084220409393,"name":"priority"}
{"timestamp":-3969.4593319892883,"name":"alloc"}
{"timestamp":-3969.45880818367,"name":"prolog-start"}
{"timestamp":-3969.458781003952,"name":"prolog-start"}
{"timestamp":-3969.0725326538086,"name":"cray_port_distribution"}
{"timestamp":-3969.072403907776,"name":"prolog-finish"}
{"timestamp":-3959.6299834251404,"name":"prolog-finish"}
{"timestamp":-3959.24777674675,"name":"start"}
{"timestamp":-3952.4265785217285,"name":"exception"} #NOTE: exception 1
{"timestamp":-3951.0524611473083,"name":"finish"}
{"timestamp":-3951.0521211624146,"name":"epilog-start"}
{"timestamp":-3950.356454849243,"name":"release"}
{"timestamp":-343.0432062149048,"name":"exception"} #NOTE: exception 2
{"timestamp":3523.777511358261,"name":"epilog-finish"}
{"timestamp":3523.7776334285736,"name":"free"}
{"timestamp":3523.7776606082916,"name":"clean"}
```
The context for the two exceptions is important:
```
exception 1: {"type":"exec","severity":0,"note":"task 288 (host hostname1050): start failed: /home/user/src/ior/src/ior: No such file or directory","userid":uid}
exception 2: {"type":"cancel","severity":0,"note":"","userid":uid}
```
Note that the cancel is sent 343ms before the start of foy5qJFL9Qs. Somehow we're getting the cancel and releasing the resources before the epilog-finish and free events.
That is odd. I checked earlier and it does not appear that fluxion subscribes to the job-exception event messages that are published as the exception occurs. In theory it shouldn't be getting any cues that anything happened until the job manager sends the `sched.free` request corresponding to the `free` event in the job eventlog.
> Note that the cancel is sent 343ms before the start of foy5qJFL9Qs. Somehow we're getting the cancel and releasing the resources before the epilog-finish and free events.
Are the timestamps milliseconds? Looking at some of the raw logs they seem to be seconds. Comparing with other logs the cancel appears to have occurred around 5 minutes before the second job started.
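As a quick sanity check, treating the values as epoch seconds (and noting that the jq filter above subtracted foy5qJFL9Qs's submit timestamp) makes the numbers line up with the "around 5 minutes" estimate:

```python
# Eventlog timestamps are UNIX epoch seconds; the jq filter above
# subtracted foy5qJFL9Qs's submit time (1712877164.122406), so each
# value is seconds relative to that job's submission.
cancel_rel = -343.0432062149048   # exception 2 (the cancel) in foxZW969KtT

minutes_before = -cancel_rel / 60
print(f"cancel landed {minutes_before:.1f} minutes before foy5qJFL9Qs was submitted")
# prints: cancel landed 5.7 minutes before foy5qJFL9Qs was submitted
```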
Does the log from exception 2 mean it was a user-initiated cancel or is it coming from job manager?
I ask because there may be a path around the typical `sched.free` request in resource: https://github.com/flux-framework/flux-sched/blob/c8e03f8416b3d7803c6108de5ea8f15292728568/resource/modules/resource_match.cpp#L2023, which is a callback corresponding to the `sched-fluxion-resource.cancel` topic.
Is it possible we're getting the cancel event through qmanager from schedutil `sched.cancel` somehow? That code path does free up the resources, so it would actually produce this result.
I was just digging through the RFC and looked at sched-simple, and it looks like it's a bit ill-defined what cancel is supposed to do when the job is already allocated. The RFC says something like "some of these stages may not be necessary and the job enters cleanup", fluxion clears the job out immediately, sched-simple removes it from the queue if it's waiting for resources and otherwise ignores the event entirely.
It was logged as a user-initiated cancel, but there might be some other way to generate that I don't know about. That said I haven't found the path by which that would get invoked, it ties back quite far, but I haven't found the last link to either the job-exception event or a watch on the event log.
Ah, and you're right, they're seconds. Clearly need some sleep, looking back in on this first thing in the morning after I literally sleep on it.
The cancel request to the scheduler is supposed to cancel the `sched.alloc` RPC, not the allocation. The scheduler does not respond to the cancel request directly. Instead it must terminate the alloc RPC, either with a success response (resources allocated) or an error (such as "canceled"). If the alloc RPC is already retired, it should do nothing.

In this case, the `alloc` event in the eventlog indicates that the alloc RPC has already been retired, so a cancel request should not have been generated. If it was (because of a bug, maybe), then the scheduler should be ignoring it. So I think the theory boils down to an unlikely scenario where two things have to be true: flux-core has to be sending cancel requests for alloc requests that have already been retired, and fluxion has to be handling it as though the alloc request were not already retired. It's worth checking though!
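To make that two-part theory concrete, here is a minimal Python sketch (hypothetical names, not actual flux-sched or flux-core code) contrasting the correct behavior, where a cancel for a retired alloc RPC is a no-op, with the buggy behavior, where it releases the allocation while the job's epilog is still running:

```python
class ToyScheduler:
    """Illustrative model only; not the real fluxion data structures."""

    def __init__(self, ncores):
        self.free_cores = set(range(ncores))
        self.allocations = {}  # jobid -> cores held by a running job

    def alloc(self, jobid, ncores):
        # The alloc RPC retires here: cores are committed to the job.
        cores = set(sorted(self.free_cores)[:ncores])
        self.free_cores -= cores
        self.allocations[jobid] = cores
        return cores

    def cancel(self, jobid, buggy=False):
        # Correct: a cancel only terminates a still-pending alloc RPC;
        # for an already-retired alloc it should do nothing.
        if buggy and jobid in self.allocations:
            # Bug: treating the cancel like a free, releasing cores
            # while the job's epilog is still running on them.
            self.free_cores |= self.allocations.pop(jobid)

sched = ToyScheduler(96)
job1 = sched.alloc("foxZW969KtT", 16)
sched.cancel("foxZW969KtT", buggy=True)  # premature release
job2 = sched.alloc("foy5qJFL9Qs", 96)    # node looks entirely free
assert job1 & job2                       # overlap: the double booking
```

With `buggy=False` the cancel is ignored, the second alloc can only get the 80 genuinely free cores, and the two allocations stay disjoint.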
> It was logged as a user-initiated cancel, but there might be some other way to generate that I don't know about.
I was wondering if somehow a user-initiated cancel is bypassing qmanager and directly invoking `cancel_request_cb` in `resource_module`. It sounds like that's not likely.
> The cancel request to the scheduler is supposed to cancel the `sched.alloc` RPC not the allocation.
The cancel request to the scheduler might be cancelling the allocation. Examining the complex control flow starting with the `.cancel` in `qmanager`:
https://github.com/flux-framework/flux-sched/blob/c8e03f8416b3d7803c6108de5ea8f15292728568/qmanager/modules/qmanager.cpp#L511
The control flow enters `qmanager_safe_cb_t` and then `qmanager_cb_t`:

`qmanager_cb_t::jobmanager_cancel_cb` --> `remove` --> `queue_policy_base_impl_t::remove`

Assuming the job was in `job_state_kind_t::RUNNING`, it's inserted into the `m_complete` map. From there it will eventually be popped in the `queue_policy_bf_base_t`: https://github.com/flux-framework/flux-sched/blob/c8e03f8416b3d7803c6108de5ea8f15292728568/qmanager/policies/queue_policy_bf_base_impl.hpp#L34 immediately before it's canceled: https://github.com/flux-framework/flux-sched/blob/c8e03f8416b3d7803c6108de5ea8f15292728568/qmanager/policies/queue_policy_bf_base_impl.hpp#L35
The EASY queue policy inherits from `queue_policy_bf_base_t`, which inherits from `queue_policy_base_t`, so the lines above would have been called if I've traced correctly.
I hope that combination of markdown and URLs isn't too confusing and that I haven't misunderstood the `.cancel` origin in `qmanager`.
That's the correct origin in qmanager. However, in `jobmanager_cancel_cb()`, it looks like `remove` is only called if the job is in `job_state_kind::PENDING` state (checked by `job->is_pending`).
Ok, I don't have a full answer, but here's at least some further tests and analysis. I started with this flow:
```
flux submit --wait-event epilog-start -N 1 -c 4 -n 1 /non-existant/path > id
flux job raise $(cat id)
flux submit --wait -N 1 --exclusive -n 1 hostname > id2
J1_EXCEPTION=$(flux job info $(cat id) eventlog | grep exception | tail -n 1 | jq -c '.timestamp')
echo JOB 1
flux job info $(cat id) eventlog | jq -c ".timestamp -= $J1_EXCEPTION"
echo JOB 2
flux job info $(cat id2) eventlog | jq -c ".timestamp -= $J1_EXCEPTION"
```
It produces the same flow of events that we see in the eventlog, except that it does not overlap. Clearly it's not that simple. If I add an explicit `flux ion-resource cancel $(cat id)` around the raise, it does exactly what we see in the double booking trace.
I'm trying to come up with how to actually replicate the cancel. I'm getting the impression that this was a `flux run` rather than a `flux submit`, and a ctrl-c raised exception rather than an explicit `flux job raise`, but that isn't normally possible because `flux run` returns at finish or release rather than after epilog-finish. Maybe something strange happens if a double ctrl-c hits `run` while it is stuck waiting to be released? I'm not sure how to test that though.
I don't think this was a ctrl-C because those exceptions are logged like this:

```
[ +0.807710] exception type="cancel" severity=0 note="interrupted by ctrl-C" userid=6885
```

i.e. the `note` explicitly says `interrupted by ctrl-C`.
Yup, the note's empty, so it must have been a raise. I'm getting progressively less sure how we got to this place unless we ended up with a race condition. I can easily enough go in and make it so the cancel RPC doesn't clear resource allocations, but if we can't repro, I'm not sure that's what it is.
I've been banging on this every way I can think of, and so far can't reproduce the scenario without a resource.cancel RPC coming from somewhere.
The best I have right now is that the interface between qmanager and resource doesn't differentiate between `free` and `cancel`. There's only a cancel RPC on the resource side, so if for any reason the qmanager considers the job complete, or gets into the remove method during run or alloc_run state, it sends the same message to resource, which clears the allocation status of the job. Even considering race conditions, I haven't found a way that can actually happen yet; everything seems to be protected with checks that the job is in pending state. That said, it would be really easy for there to be a code path in here that deallocates things when it shouldn't, because the two are conflated.
It should be relatively straightforward to split this so the job-manager cancel RPC has no code path to the resource-side cancel RPC. Pending a way to explicitly reproduce the bug, that's the most useful thing I can think of to do. At minimum it will make sure the cancel RPC on qmanager only ever cancels a pending job.
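A minimal sketch of what that split could look like on the qmanager side (hypothetical names and states; the real code lives behind the qmanager/resource RPC boundary):

```python
# Hypothetical split: a cancel request may only terminate a pending
# alloc; only a free request (driven by the job manager's sched.free,
# which arrives after epilog-finish) releases a running job's resources.
PENDING, RUNNING, COMPLETE = "pending", "running", "complete"

def handle_cancel(job, resource_release):
    # Only a job still waiting on its alloc RPC can be canceled.
    if job["state"] == PENDING:
        job["state"] = COMPLETE
        resource_release(job["id"])  # nothing allocated yet; safe
    # RUNNING jobs are left alone: no path from cancel to deallocation.

def handle_free(job, resource_release):
    # The only path that clears a running job's allocation.
    if job["state"] == RUNNING:
        job["state"] = COMPLETE
        resource_release(job["id"])
```

With this shape, even a spurious cancel arriving for a running job (as theorized above) cannot release its resources early.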
Problem: two jobs were allocated the same resources.
The jobs in question are foxZW969KtT and foy5qJFL9Qs, both of which got allocations on cray8292 (rank 8048). Thanks @adambertsch for catching this. This was actually detected when `flux resource list` threw a duplicate allocation error (because the job manager now builds the allocated resource set from all the R fragments assigned to running jobs, and that failed). You can use `flux job info <jobid> R | jq` to view the resources assigned to each. On rank 8048, cores 80-95 were assigned to foxZW969KtT and cores 0-95 were assigned to foy5qJFL9Qs. Here is the scheduler config (excerpted from `flux config get | jq`)