flux-framework / flux-core

core services for the Flux resource management framework
GNU Lesser General Public License v3.0

increase default prolog kill-timeout from 10s to 1m #6431

Closed (garlick closed 1 week ago)

garlick commented 1 week ago

Problem: on a slow system, the prolog kill timeout may be exceeded due to reasons other than an intransigent prolog process.

Increase the timeout from 10s to 1m.

Fixes #6420
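
For readers unfamiliar with the term, a kill timeout generally means a grace period between asking a process to stop and forcibly killing it. The sketch below illustrates that general pattern with Python's standard `subprocess` module. It is not the flux-core implementation; the helper name `stop_with_kill_timeout` and the 5 second value are hypothetical and chosen only for illustration.

```python
import subprocess
import sys


def stop_with_kill_timeout(proc, kill_timeout):
    """Ask proc to exit with SIGTERM, then escalate to SIGKILL.

    Hypothetical helper, not Flux code. Returns True if the process
    exited within kill_timeout seconds of SIGTERM, False if it had
    to be killed.
    """
    proc.terminate()  # polite request: SIGTERM on POSIX
    try:
        proc.wait(timeout=kill_timeout)
        return True
    except subprocess.TimeoutExpired:
        # The process did not exit in time. On a slow system this can
        # happen even when the process is well behaved, which is why a
        # short grace period can misfire.
        proc.kill()  # forceful: SIGKILL on POSIX
        proc.wait()
        return False


# Stand-in for a long-running prolog script.
proc = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
graceful = stop_with_kill_timeout(proc, kill_timeout=5.0)
```

A longer grace period delays escalation for a process that truly will not exit, but it lowers the chance of killing one that is merely slow to shut down.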

garlick commented 1 week ago

Hmm, got a failure in t2274-manager-perilog.t in the el8 - test install builder:

2024-11-13T14:53:17.5390080Z expecting success: 
2024-11-13T14:53:17.5390383Z    printf "#!/bin/sh\nsleep 60" > prolog.d/sleep.sh &&
2024-11-13T14:53:17.5390759Z    chmod +x prolog.d/sleep.sh &&
2024-11-13T14:53:17.5391130Z    test_when_finished "rm -f prolog.d/sleep.sh" &&
2024-11-13T14:53:17.5391569Z    jobid=$(flux submit --job-name=cancel hostname) &&
2024-11-13T14:53:17.5392027Z    flux job wait-event -t 15 $jobid prolog-start &&
2024-11-13T14:53:17.5392379Z    flux cancel $jobid &&
2024-11-13T14:53:17.5392717Z    flux job wait-event -t 15 $jobid prolog-finish &&
2024-11-13T14:53:17.5393154Z    flux job wait-event -t 15 $jobid exception &&
2024-11-13T14:53:17.5393557Z    test_must_fail flux job attach -vE $jobid
2024-11-13T14:53:17.5393790Z 
2024-11-13T14:53:17.5394040Z 1731506489.783088 prolog-start description="job-manager.prolog"
2024-11-13T14:53:17.5394571Z flux-job: wait-event timeout on event 'prolog-finish'
2024-11-13T14:53:17.5395333Z not ok 12 - perilog: job can be canceled while prolog is running

I wouldn't think the kill-timeout would be used here. I'll restart and see if it pops up again.

garlick commented 1 week ago

OK, setting MWP (merge-when-passing). Thanks!