Closed: chu11 closed this 8 months ago
We've been seeing this in CI too. I was working on debugging it yesterday but got stuck with not being able to reproduce. It's been happening on quartz.
The test doesn't issue flux shutdown --gc, so how is the content.dump attr getting set here?
This did just happen twice in one CI run though...
> The test doesn't issue flux shutdown --gc so how is the content.dump attr getting set here?
Ahhhh ... because I read it wrong: -n is for non-zero string length :-) so I went down the wrong path there.
> This did just happen twice in one CI run though...
Interesting. I can't reproduce without make -j16, so perhaps it is a parallelism issue like I thought.
Yeah, I can definitely reproduce this by running two of the tests in parallel. Interesting that this only started happening recently.
Can you see if adding -o exit-timeout=none to the flux alloc job in the test helps? Maybe shutdown is racing with the exit timeout.
diff --git a/t/issues/t4583-free-range-test.sh b/t/issues/t4583-free-range-test.sh
index de787989d..fbfa8e847 100755
--- a/t/issues/t4583-free-range-test.sh
+++ b/t/issues/t4583-free-range-test.sh
@@ -36,7 +36,7 @@ fi
 # Start a job with tbon.topo=kary:0
 log "Starting a child instance with flat topology\n"
-jobid=$(flux alloc -N4 --bg --broker-opts=-Stbon.topo=kary:0)
+jobid=$(flux alloc -N4 -o exit-timeout=none --bg --broker-opts=-Stbon.topo=kary:0)
 log "Started job $jobid\n"
Good guess. I think I was closing in on that, as I realized the error ultimately came from
if (!future_is_ready (f)) {
    if (flux_reactor_run (f->now->r, 0) < 0) {
        if (errno == EEXIST)
            fprintf (stderr, "eexist %s:%d\n", __FUNCTION__, __LINE__);
        return -1; // errno set by now_timer_cb or other watcher
    }
}
in flux_future_wait_for(). EEXIST is a strange errno though ... it points you in a completely different direction. Not sure if it's a "fallthrough" errno from something else within libev.
I'm not seeing anything that looks like eexist function:line in my output, are you?
Edit: Oh, now I'm assuming that was debugging added by you?
> Edit: Oh, now I'm assuming that was debugging added by you?
Yeah, that was my debugging.
Did you try setting errno = 0 before the call into flux_reactor_run()? My guess is that errno is just left over from a previous call.
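For illustration, a minimal standalone C sketch (not flux code; quiet_failure() is a hypothetical stand-in) of the stale-errno pitfall this suggestion probes for: an earlier failing call leaves errno set, and a later call that returns -1 without touching errno gets blamed for it unless errno is cleared first.

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>

/* Stand-in for a call that reports failure by return value only,
 * without setting errno (hypothetical, for illustration). */
static int quiet_failure (void)
{
    return -1;
}

int main (void)
{
    /* This call fails and sets errno (ENOENT). */
    if (open ("/no/such/path", O_RDONLY) < 0)
        printf ("open: %s\n", strerror (errno));

    /* Without clearing errno, the stale ENOENT would be misattributed
     * to quiet_failure().  Clearing it first makes a stale value
     * obvious: errno == 0 after the failure means nobody set it. */
    errno = 0;
    if (quiet_failure () < 0)
        printf ("quiet_failure: errno=%d (%s)\n",
                errno, errno ? strerror (errno) : "not set");
    return 0;
}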
My guess as to what is happening is that the broker to which flux shutdown is connected gets terminated during the shutdown, thus removing the connector socket. That isn't handled very gracefully, though I'm not sure what should be returned in this situation... maybe a connection reset error?
I think this test started taking longer recently, which is why we're seeing the exit-timeout being reached, thus triggering the issue.
The test should indeed be fixed to remove the exit-timeout, meanwhile I'll look into why the test is taking so long.
The handling of the remote broker going away underneath a connected Flux handle should probably get its own issue. It should also be easy to create a simpler reproducer for it (if the assumption about the root cause is correct).
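A minimal sketch of such a reproducer, assuming the libflux-core C API and a target broker that is killed externally (e.g. kill the child instance's flux-broker from another terminal) while this handle stays connected; the polled attribute and the overall flow are illustrative, not taken from the test:

#include <flux/core.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Connect to the instance named by FLUX_URI, then poll a broker
 * attribute once a second.  Kill the target flux-broker from another
 * terminal and observe what error the still-connected handle reports. */
int main (void)
{
    flux_t *h = flux_open (NULL, 0);
    if (!h) {
        perror ("flux_open");
        return 1;
    }
    while (1) {
        errno = 0;
        if (!flux_attr_get (h, "rank")) {
            fprintf (stderr, "attr get failed: errno=%d (%s)\n",
                     errno, strerror (errno));
            break;
        }
        sleep (1);
    }
    flux_close (h);
    return 0;
}

Building it with something like gcc repro.c $(pkg-config --cflags --libs flux-core) should be enough to try it, assuming flux-core is installed.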
> Did you try setting errno = 0 before the call into flux_reactor_run()? My guess is that errno is just left over from a previous call.
Gave that a shot and it didn't work :-(
Ahhh, I hadn't guessed that the test was faster before, which makes sense now given what I saw in the output. Notice the wait event below is 24 seconds, even running only t4000 by hand with only this one issue test. It sat on "waiting for resources" for quite a long time.
# ./t4000-issues-test-driver.t: flux session size will be 2
free-range-test: Re-launching test script under flux-start
free-range-test: Starting a child instance with flat topology
free-range-test: Started job ƒ2iQcqx3
free-range-test: Current overlay status of ƒ2iQcqx3:
0 corona212: full
├─ 1 corona212: full
├─ 2 corona212: full
└─ 3 corona212: full
free-range-test: Launch a sleep job within ƒ2iQcqx3:
ƒ2fyn2BH
JOBID USER ST NTASKS NNODES RUNTIME
. achu R 4 4 10.97s .
ƒ2iQcqx3 achu R 4 4 5.796s └── flux
ƒ2fyn2BH achu R 4 4 0.386s └── sleep
free-range-test: Killing rank 3 (pid 3522537) and all children
free-range-test: Wait for exception event in ƒ2iQcqx3
flux-start: 3 (pid 3522537) Killed
Mar 21 00:13:03.939077 broker.err[0]: corona212 (rank 3) transitioning to LOST due to EHOSTUNREACH error on send
1710979983.939847 exception type="node-failure" severity=2 userid=8556 note="lost contact with job shell on corona212 (shell rank 3)"
free-range-test: But running a 3 node job in ƒ2iQcqx3 still works:
flux-job: ƒ49Kzg8f started 00:00:24
corona212
corona212
corona212
free-range-test: Overlay status of ƒ2iQcqx3 should show rank lost:
0 corona212: degraded
├─ 1 corona212: full
├─ 2 corona212: full
└─ 3 corona212: lost
free-range-test: Call flux shutdown on ƒ2iQcqx3
broker.info[0]: cleanup.0: flux queue stop --quiet --all --nocheckpoint Exited (rc=0) 0.3s
broker.info[0]: cleanup.1: flux cancel --user=all --quiet --states RUN Exited (rc=0) 0.3s
broker.info[0]: cleanup.2: flux queue idle --quiet Exited (rc=0) 0.3s
broker.info[0]: cleanup-success: cleanup->shutdown 0.856045s
broker.info[0]: children-complete: shutdown->finalize 1.33805s
broker.info[0]: online: corona[212,212] (ranks 0-1)
broker.info[0]: rc3.0: /g/g0/achu/chaos/git/flux-framework/flux-core/etc/rc3 Exited (rc=0) 1.2s
broker.info[0]: rc3-success: finalize->goodbye 1.22378s
free-range-test: job ƒ2iQcqx3 should exit cleanly (no hang) and a zero exit code:
1710980012.670174 finish status=0
Yes, I found the root cause of the slowness and will post a PR soon.
Getting this failure semi-regularly running make -j16 check on corona. Debugging shows that the flux shutdown in issues/t4583-free-range-test.sh is failing (the "flux shutdown exit" message is my debug addition). My theory is that in rc3, which handles the content.dump / RESTORE symlink, there is a conflict ("File exists") on the RESTORE symlink now that we're running the t4000 regression tests in parallel.
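For context on why a "File exists" conflict would implicate the symlink: a standalone sketch (the file names are hypothetical stand-ins for the dump file and RESTORE link) showing that the second of two racing symlink(2) calls on the same link name fails with EEXIST:

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main (void)
{
    /* Hypothetical names standing in for the dump file and RESTORE link. */
    const char *dump = "dump-20240321_001303.tgz";
    const char *link = "dump-RESTORE.tgz";

    /* First creator wins... */
    if (symlink (dump, link) < 0)
        printf ("first symlink: %s\n", strerror (errno));

    /* ...a second instance racing to create the same link sees EEXIST. */
    if (symlink (dump, link) < 0)
        printf ("second symlink: %s\n", strerror (errno)); /* File exists */

    unlink (link); /* clean up */
    return 0;
}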