Open grondo opened 4 months ago
Users are seeing this error again in a slightly different scenario. Example:
flux run
or flux start
is issued. this occurs between E1 and E2match_multi_request_cb: match failed due to match error
I will see if I can reproduce this scenario.
I'm also not sure why the duration-validator isn't rejecting the job from step 6 if this is indeed the sequence of events.
Users have been seeing this error in their batch job logs frequently on elcap:
Here's some data from one such job:
timestamp of
alloc
andstart
events:starttime
andexpiration
in R:The precise error was:
where
Some things to note:
The timestamps in R are off, the time of the
alloc
event was 3584s after the starttime in R, and thus the job was already expired by the time it ran. It sounds like this could cause the match request failure above, so the question is how this could happen.If the scheduler is stuck in one of its resource match death spirals, could that delay the alloc response to the job manager for this long? Could that explain the behavior here?