SMART-Lab / smartdispatch

An easy to use job launcher for supercomputers with PBS compatible job manager.
Do What The F*ck You Want To Public License

Autoresume feature #145

Closed (ddtm closed 7 years ago)

ddtm commented 8 years ago

Implements #138

A user can now add --autoresume to automatically requeue her jobs if their running time exceeds the maximum walltime allowed on the cluster.

coveralls commented 8 years ago

Coverage Status

Coverage decreased (-0.2%) to 94.185% when pulling 82159bc6354c4b2fd46c2f8dbda82aa449e5d6d0 on ddtm:autoresume into f0793309f25b03e1a2dced75b51a19d5e31dc820 on SMART-Lab:master.

coveralls commented 8 years ago

Coverage Status

Coverage increased (+0.1%) to 94.47% when pulling c47cadee3b7f74d96897396ccefff3c2231ffd0d on ddtm:autoresume into f0793309f25b03e1a2dced75b51a19d5e31dc820 on SMART-Lab:master.

coveralls commented 8 years ago

Coverage Status

Coverage increased (+0.1%) to 94.486% when pulling 86512e9c0bf1ee8f359501772ba4cff8a05b85f4 on ddtm:autoresume into f0793309f25b03e1a2dced75b51a19d5e31dc820 on SMART-Lab:master.

MarcCote commented 8 years ago

Is there any reason we didn't try setting an alarm when launching the PBS files with msub? https://computing.llnl.gov/tutorials/moab/#TimeExpired

I'm guessing the answer is that the option is not supported by qsub?

Anyhow, if the workers are indeed receiving two SIGTERM signals, maybe we should send a SIGALRM instead?
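To make the idea of a dedicated warning signal concrete, here is a minimal sketch of a worker loop that stops cleanly when SIGALRM arrives and hands back the work it did not get to. This is not the code from this PR; `run_until_warned`, `StopRequested`, and the `process` callback are hypothetical names used only for illustration.

```python
import signal


class StopRequested:
    """Callable signal handler that only records that a warning arrived."""

    def __init__(self):
        self.flag = False

    def __call__(self, signum, frame):
        self.flag = True


def run_until_warned(work_items, process, warning_signal=signal.SIGALRM):
    """Process items in order until done or until warning_signal arrives.

    Returns the list of items that were not processed, so that a caller
    could save them and requeue the job. Must run in the main thread,
    because Python only allows installing signal handlers there.
    """
    stop = StopRequested()
    previous = signal.signal(warning_signal, stop)
    try:
        remaining = list(work_items)
        while remaining and not stop.flag:
            process(remaining.pop(0))
        return remaining
    finally:
        # Restore whatever handler was installed before, when Python knows it.
        if previous is not None:
            signal.signal(warning_signal, previous)
```

Using a signal other than SIGTERM for the warning keeps "time is almost up" distinguishable from "you are being terminated now", which seems to be the point of the suggestion above.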

mgermain commented 8 years ago

@MarcCote

-notify Available for qsub, qrsh (with command) and qalter only.

      This flag, when set causes Sun Grid Engine to send
      "warning" signals to a running job prior to sending the
      signals themselves. If a SIGSTOP is pending, the job
      will receive a SIGUSR1 several seconds before the
      SIGSTOP. If a SIGKILL is pending, the job will receive a
      SIGUSR2 several seconds before the SIGKILL. .....

mgermain commented 8 years ago

@ddtm @MarcCote

Did some quick testing:

smart-dispatch -q gpu_1 --walltime=0:03:00 --pbsFlags="-lsignal=14@120" launch python -u test_signal.py

Fri Nov 18 18:01:33 2016 - Started
Fri Nov 18 18:02:28 2016 - 14 SIGALRM
Fri Nov 18 18:02:28 2016 - 14 SIGALRM
Fri Nov 18 18:02:28 2016 - 14 SIGALRM
Fri Nov 18 18:02:28 2016 - 14 SIGALRM
Fri Nov 18 18:02:28 2016 - 14 SIGALRM
Fri Nov 18 18:02:28 2016 - 14 SIGALRM
Fri Nov 18 18:02:28 2016 - 15 SIGTERM

Something is wrong here; I'll investigate more. First, the SIGALRM should not be there 6 times (it's always 6). Second, the SIGTERM is arriving 1 minute early when I use SIGTERM.

This was on helios.
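The test_signal.py script itself is not shown in this thread. As a rough guess at what such a script could look like, the sketch below logs every watched signal in the same "timestamp - number NAME" shape as the output above. The function names and the list of watched signals are assumptions, not the original script.

```python
import signal
import time

# Signals to watch. The log above shows 14 (SIGALRM) and 15 (SIGTERM);
# SIGUSR1 and SIGUSR2 are included because of the -notify discussion.
WATCHED = (signal.SIGALRM, signal.SIGTERM, signal.SIGUSR1, signal.SIGUSR2)

# Every logged line is also kept here so it can be inspected afterwards.
EVENTS = []


def format_event(signum, now=None):
    """Return a line such as 'Fri Nov 18 18:02:28 2016 - 14 SIGALRM'."""
    name = signal.Signals(signum).name
    return "{} - {} {}".format(time.ctime(now), int(signum), name)


def log_signal(signum, frame):
    """Signal handler: record and print the event, then keep running."""
    line = format_event(signum)
    EVENTS.append(line)
    print(line, flush=True)


def install_handlers():
    for sig in WATCHED:
        signal.signal(sig, log_signal)


def run(duration_seconds):
    """Log a start line, then idle so the scheduler's signals can be observed."""
    install_handlers()
    print("{} - Started".format(time.ctime()), flush=True)
    deadline = time.time() + duration_seconds
    while time.time() < deadline:
        time.sleep(1)
```

Calling `run(600)` at the bottom of the file would keep the job alive well past a 3-minute walltime, so any signal the scheduler sends shows up in the job's output.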

MarcCote commented 8 years ago

@mgermain Is it possible that on helios the grace period is actually two minutes instead of 60 seconds (as you or @ddtm mentioned before)?

Can you try: smart-dispatch -q gpu_1 --walltime=0:04:00 --pbsFlags="-lsignal=14@90" launch python -u test_signal.py
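The timing in question can be worked out explicitly. The sketch below assumes that `signal=<sig>@<seconds>` means "send the signal that many seconds before the walltime expires", which is how the linked Moab page reads; the helper names are made up for this example.

```python
from datetime import datetime, timedelta


def parse_walltime(text):
    """Convert an 'H:MM:SS' walltime string into a timedelta."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


def expected_signal_offset(walltime, lead_seconds):
    """Time after job start at which the warning signal is due.

    Assumes the lead is counted back from the end of the walltime.
    """
    return parse_walltime(walltime) - timedelta(seconds=lead_seconds)


# Timestamps copied from the log above (Fri Nov 18 2016).
started = datetime(2016, 11, 18, 18, 1, 33)
first_signal = datetime(2016, 11, 18, 18, 2, 28)

observed = first_signal - started
expected_first_run = expected_signal_offset("0:03:00", 120)
expected_second_run = expected_signal_offset("0:04:00", 90)

print("observed:           ", observed.total_seconds(), "s")
print("expected (3:00@120):", expected_first_run.total_seconds(), "s")
print("expected (4:00@90): ", expected_second_run.total_seconds(), "s")
```

In the first run the SIGALRM was due 60 s after start and showed up after 55 s, measured from the script's own "Started" line. If the grace period really were two minutes, a SIGTERM would also be due around the 60 s mark in that run, which would explain the two signals landing together. With a 4-minute walltime and a 90 s lead, the SIGALRM would be due at 150 s, so the two signals should no longer coincide.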

coveralls commented 8 years ago

Coverage Status

Coverage increased (+0.1%) to 94.486% when pulling 54c182bc296ef94ac1c90f2b1f9d3425f091b2ca on ddtm:autoresume into f0793309f25b03e1a2dced75b51a19d5e31dc820 on SMART-Lab:master.

MarcCote commented 8 years ago

I have some experiments to do, so I'll try this awesome feature right away. :)

coveralls commented 8 years ago

Coverage Status

Coverage increased (+0.1%) to 94.486% when pulling f6551ef41423de21a45259dd2c1c3e623c2594ed on ddtm:autoresume into f0793309f25b03e1a2dced75b51a19d5e31dc820 on SMART-Lab:master.

coveralls commented 8 years ago

Coverage Status

Coverage decreased (-0.8%) to 93.507% when pulling f16f1f507e9156cbd85bbcf9288fda525c22f771 on ddtm:autoresume into f0793309f25b03e1a2dced75b51a19d5e31dc820 on SMART-Lab:master.

coveralls commented 8 years ago

Coverage Status

Coverage increased (+0.1%) to 94.491% when pulling 92f5ccff193c231f2ed62559e236852c42a9e31b on ddtm:autoresume into f0793309f25b03e1a2dced75b51a19d5e31dc820 on SMART-Lab:master.

coveralls commented 8 years ago

Coverage Status

Coverage increased (+0.1%) to 94.491% when pulling 7ce8a282d62d97511f459e03c52d1425357976fd on ddtm:autoresume into f0793309f25b03e1a2dced75b51a19d5e31dc820 on SMART-Lab:master.

MarcCote commented 8 years ago

I've been using it since last night and my experiments have resumed successfully many times. @mgermain all good on my side.

MarcCote commented 7 years ago

@mgermain anything you want to add? Otherwise you can go ahead and merge it. Thanks again @ddtm

coveralls commented 7 years ago

Coverage Status

Coverage increased (+0.1%) to 94.491% when pulling 4c90741d263dd20b504ae09b5a3a0be4b9c9c489 on ddtm:autoresume into f0793309f25b03e1a2dced75b51a19d5e31dc820 on SMART-Lab:master.

mgermain commented 7 years ago

Everything is fine now. But for cleanliness we should add -l depends=<current_jobid> to the relaunched jobs.

The reason is that, for small jobs, the new job sometimes starts before the scheduler is done cleaning up the old one, and you end up with 2 jobs theoretically doing the same thing in the queue.
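A minimal sketch of what building such a resubmission command might look like. The `-l depends=<jobid>` spelling is taken verbatim from the comment above and should be checked against the scheduler's documentation; `build_resubmit_command` and the `job.pbs` file name are hypothetical and not part of smartdispatch.

```python
import os


def build_resubmit_command(pbs_file, current_job_id=None, launcher="msub"):
    """Build the argument list for requeueing a job behind the current one.

    If no job id is given, fall back to the PBS_JOBID environment variable,
    which PBS/Torque sets inside a running job. When no id is available,
    the dependency option is simply left out.
    """
    if current_job_id is None:
        current_job_id = os.environ.get("PBS_JOBID")

    command = [launcher]
    if current_job_id:
        # Option spelling copied from the discussion, not verified here.
        command += ["-l", "depends={}".format(current_job_id)]
    command.append(pbs_file)
    return command


print(build_resubmit_command("job.pbs", current_job_id="1234.helios"))
```

Returning an argument list rather than a shell string means the command can be passed straight to `subprocess` without quoting concerns.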

MarcCote commented 7 years ago

Also, a rebase seems to be needed.

coveralls commented 7 years ago

Coverage Status

Coverage increased (+0.1%) to 94.532% when pulling 2e548652e1559e0f557ddbb465c02ce70ae998f9 on ddtm:autoresume into da978d6de8f0b9c596d2496ccc8239aebc41a4e4 on SMART-Lab:master.