Is there any reason we didn't try setting an alarm when launching the PBS files with msub? https://computing.llnl.gov/tutorials/moab/#TimeExpired
I'm guessing the answer is that the option is not supported by qsub?
Anyhow, if the workers are indeed receiving two SIGTERM signals, maybe we should send a SIGALRM instead?
@MarcCote
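On the worker side that could look roughly like this (a minimal sketch, not the actual smart-dispatch worker; the checkpoint/requeue hook is an assumption):

# Minimal sketch: a worker that treats SIGALRM as a "walltime is near"
# warning instead of relying on the pair of SIGTERMs.
import signal
import sys

def on_alarm(signum, frame):
    # Hypothetical hook: the real worker would checkpoint its state and
    # resubmit its remaining commands here before exiting cleanly.
    print("SIGALRM received, checkpointing and exiting for requeue")
    sys.exit(0)

signal.signal(signal.SIGALRM, on_alarm)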
-notify  Available for qsub, qrsh (with command) and qalter only.
         This flag, when set, causes Sun Grid Engine to send "warning"
         signals to a running job prior to sending the signals themselves.
         If a SIGSTOP is pending, the job will receive a SIGUSR1 several
         seconds before the SIGSTOP. If a SIGKILL is pending, the job
         will receive a SIGUSR2 several seconds before the SIGKILL. ...
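In other words, a job launched with qsub -notify could react to those warning signals like this (a sketch following the man-page semantics quoted above, not smart-dispatch code):

import signal

def on_warning(signum, frame):
    # Per the -notify semantics: SIGUSR1 warns of a pending SIGSTOP,
    # SIGUSR2 warns of a pending SIGKILL.
    pending = "SIGSTOP" if signum == signal.SIGUSR1 else "SIGKILL"
    print("got warning signal %d, %s is imminent" % (signum, pending))
    # checkpoint here, before the real signal arrives

signal.signal(signal.SIGUSR1, on_warning)
signal.signal(signal.SIGUSR2, on_warning)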
@ddtm @MarcCote
Did some quick testing.
smart-dispatch -q gpu_1 --walltime=0:03:00 --pbsFlags="-lsignal=14@120" launch python -u test_signal.py
Fri Nov 18 18:01:33 2016 - Started
Fri Nov 18 18:02:28 2016 - 14 SIGALRM
Fri Nov 18 18:02:28 2016 - 14 SIGALRM
Fri Nov 18 18:02:28 2016 - 14 SIGALRM
Fri Nov 18 18:02:28 2016 - 14 SIGALRM
Fri Nov 18 18:02:28 2016 - 14 SIGALRM
Fri Nov 18 18:02:28 2016 - 14 SIGALRM
Fri Nov 18 18:02:28 2016 - 15 SIGTERM
There is something wrong happening; I'll investigate more. First, the SIGALRM should not be there 6 times (it's always 6). Second, the SIGTERM is arriving 1 minute early when I use the SIGALRM flag.
This was on helios.
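For reference, test_signal.py could be as simple as this (a hypothetical reconstruction, just enough to produce the log above: one timestamped line at startup and one per signal received):

import signal
import time

NAMES = {signal.SIGALRM: "SIGALRM", signal.SIGTERM: "SIGTERM"}

def log(msg):
    # Matches the "Fri Nov 18 18:01:33 2016 - ..." format in the log.
    print("%s - %s" % (time.strftime("%a %b %d %H:%M:%S %Y"), msg))

def handler(signum, frame):
    log("%d %s" % (signum, NAMES.get(signum, "?")))

signal.signal(signal.SIGALRM, handler)
signal.signal(signal.SIGTERM, handler)

log("Started")
while True:
    time.sleep(1)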
@mgermain Is it possible that on helios the grace period is actually two minutes instead of 60 seconds (as you or @ddtm mentioned before)?
Can you try: smart-dispatch -q gpu_1 --walltime=0:04:00 --pbsFlags="-lsignal=14@90" launch python -u test_signal.py
I have some experiments to do, so I'll try this awesome feature right away. :)
I've been using it since last night and my experiments have resumed successfully many times. @mgermain, all good on my side.
@mgermain anything you want to add? Otherwise you can go ahead and merge it. Thanks again @ddtm
Everything is fine now. But for cleanliness we should add -l depends=<current_jobid> to the relaunched jobs.
The reason is that for small jobs, the new job sometimes starts before the scheduler is done cleaning up the old one, and you end up with two jobs theoretically doing the same thing in the queue. See the sketch below.
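Something along these lines when building the resubmission command (a sketch; note that -W depend=afterany:<jobid> is the Torque/Moab spelling of a job dependency, and PBS_JOBID is the environment variable Torque sets inside a running job):

import os

# The scheduler exports the current job's id as PBS_JOBID on Torque/Moab.
current_jobid = os.environ.get("PBS_JOBID")

extra_flags = []
if current_jobid is not None:
    # Make the relaunched job wait until the current one has left the
    # queue, so two copies never run at the same time.
    extra_flags.append("-W depend=afterany:%s" % current_jobid)
# ... these flags would then be appended to the qsub/msub resubmission call ...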
Also, a rebase seems to be needed.
Implements #138
A user can now add
--autoresume
to automatically requeue her jobs if the running time exceeds the maximum walltime allowed on the cluster.
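For example, reusing the test command from above (the queue and walltime are just placeholders):

smart-dispatch -q gpu_1 --walltime=0:03:00 --autoresume launch python -u test_signal.py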