flux-framework / flux-core

core services for the Flux resource management framework
GNU Lesser General Public License v3.0

t2406 and t2900 fail in GitLab CI, possible matching problem? #6146

Open wihobbs opened 1 month ago

wihobbs commented 1 month ago

Sort of a head-scratcher: for a few days now, the t2406 test "job-exec: kill-timeout > original value" has been failing:

expecting success:
    flux module stats job-exec | jq .jobs.${jobid}.kill_timeout &&
    flux module stats job-exec | jq -e ".jobs.${jobid}.kill_timeout > 0.1"

0.40000000000000002
false
not ok 12 - job-exec: kill-timeout > original value (0.1)

As a separate issue, t2900 also fails intermittently:

0.311s: flux-shell[0]: FATAL: task 0 (host tioga16): start failed: sleep: No such file or directory
0.306s: job.exception type=timeout severity=0 resource allocation expired
0.312s: job.exception type=exec severity=0 task 0 (host tioga16): start failed: sleep: No such file or directory
flux-job: task(s) exited with exit code 127
test_expect_code: command exited with 127, we wanted 142 flux run --time-limit=0.25s sleep 30

grondo commented 1 month ago

I see these fail occasionally in GitHub CI as well. My guess is that each involves some kind of race condition.

grondo commented 1 month ago

The first issue reported here should be fixed by #6187.

The second issue seems strange now that I look at it:

0.312s: job.exception type=exec severity=0 task 0 (host tioga16): start failed: sleep: No such file or directory
flux-job: task(s) exited with exit code 127

The test is not finding sleep in PATH? I'd assume we're just missing /usr/bin in PATH, but the report is that the test only sometimes fails, so there must be something else going on. Also, we use sleep jobs in a lot of places, so I'd assume we'd see failures elsewhere as well if missing sleep was really the cause.

Edit: Oh, I see we are hitting the timeout, then we get the start failed error a few milliseconds later. I wonder if sending the SIGALRM at an inopportune time could cause this error instead of something more sensible?
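
For reference, here is a minimal sketch of where the two exit codes in the t2900 output come from under the usual shell convention. It runs no Flux code at all. The command name no-such-command-for-demo is a made-up placeholder, and the 142 figure assumes SIGALRM is signal 14, as on Linux:

```shell
#!/bin/sh
# Sketch only: reproduces the two exit codes from the t2900 log
# using plain sh, not flux. Command names are hypothetical.

# 127: the shell could not find the command to execute
# (same code as the "start failed: ... No such file or directory" case).
rc_missing=0
sh -c 'no-such-command-for-demo' 2>/dev/null || rc_missing=$?

# 128 + signal number: the process was terminated by a signal.
# Here the child shell sends SIGALRM to itself.
rc_alarm=0
sh -c 'kill -ALRM $$' 2>/dev/null || rc_alarm=$?

echo "command not found: $rc_missing"
echo "killed by SIGALRM: $rc_alarm"
```

So the test wants 142 (the task was killed by the time-limit SIGALRM), while 127 means the task was reported as never having started.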