Handling pre-emption on slurm with fairtask? #108

Closed bamos closed 4 years ago

bamos commented 5 years ago

I'd like to run a large number of jobs on scavenge that can handle pre-emption. Do I need to modify any hydra/fairtask config for this? Here is an MWE where I try to get 6_sweep to restart when pre-empted, and I'm having some trouble with it:

We can add import time; time.sleep(1e6) to experiment.py and then run ./experiment.py -m. We can see this job on the cluster:

And I have a dask dashboard for it:


I then send a USR1 signal to my job, which according to https://our.internmc.facebook.com/intern/wiki/FAIR/Platforms/FAIRClusters/SLURMGuide/ is what gets sent for pre-emptions:


$ scancel --signal=USR1 4817474

But then my job just gets killed and never comes back online:


And I can see in the logs that my job got the USR1 signal, but I'm not sure of the best way to trigger a restart when this happens:

6_sweep(master*)$ tail /checkpoint/bda/outputs/2019-08-28_08-26-42/.slurm/slurm-4817474.* -n 100
==> /checkpoint/bda/outputs/2019-08-28_08-26-42/.slurm/slurm-4817474.err <==
distributed.nanny - INFO -         Start Nanny at: 'tcp://'
distributed.diskutils - INFO - Found stale lock file and directory '/private/home/bda/.fairtask/dask-worker-space/worker-_z3oli3u', purging
distributed.worker - INFO -       Start worker at:  tcp://
distributed.worker - INFO -          Listening to:  tcp://
distributed.worker - INFO - Waiting to connect to:  tcp://
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                         10
distributed.worker - INFO -                Memory:                   64.00 GB
distributed.worker - INFO -       Local Directory: /private/home/bda/.fairtask/dask-worker-space/worker-6z34zo8f
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -         Registered to:  tcp://
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection
/private/home/bda/anaconda3/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 6 leaked semaphores to clean up at shutdown
srun: error: learnfair087: task 0: User defined signal 1

==> /checkpoint/bda/outputs/2019-08-28_08-26-42/.slurm/slurm-4817474.out <==
[2019-08-28 08:27:10,572][__main__][INFO] - optimizer:
  lr: 0.001
  type: nesterov

6_sweep(master*)$ tail /checkpoint/bda/outputs/2019-08-28_08-26-42/0_4817474/UNKNOWN_NAME.log -n 100
[2019-08-28 08:27:10,572][__main__][INFO] - optimizer:
  lr: 0.001
  type: nesterov
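
For reference, reacting to USR1 by hand usually means installing a signal handler in the process that receives the signal and asking SLURM to requeue the job. Below is a minimal sketch of that idea, independent of whatever fairtask does internally. `scontrol requeue` and the `SLURM_JOB_ID` environment variable are standard SLURM; the function names `requeue_command`, `make_requeue_handler` and `install` are hypothetical and exist only for this illustration.

```python
import os
import signal
import subprocess


def requeue_command(job_id):
    """Build the SLURM command that puts a job back in the queue."""
    return ["scontrol", "requeue", str(job_id)]


def make_requeue_handler(run=subprocess.run):
    """Return a signal handler that requeues the current SLURM job.

    `run` is injectable (an assumption made for this sketch) so the
    handler can be exercised on a machine without SLURM.
    """
    def handler(signum, frame):
        # SLURM exports SLURM_JOB_ID inside a job allocation.
        job_id = os.environ.get("SLURM_JOB_ID")
        if job_id is not None:
            run(requeue_command(job_id))

    return handler


def install():
    """Register the handler for SIGUSR1; call this from the main thread."""
    signal.signal(signal.SIGUSR1, make_requeue_handler())
```

Calling `install()` early in `experiment.py` would replace the default USR1 action, which terminates the process, with the requeue request. Whether the job is allowed to be requeued still depends on the cluster's SLURM configuration.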

omry commented 5 years ago

fairtask should requeue your job. Are you sure it's not getting requeued?

cc @calebho

bamos commented 5 years ago

Yes, I'm sure it's not getting requeued.

calebho commented 5 years ago

Will take a look this afternoon

calebho commented 5 years ago

@bamos What happens if you try to cancel the job without specifying the signal, e.g. scancel 4817474? When a worker dies, the task it was processing should be returned to the scheduler's queue and the scheduler should start a replacement worker. You shouldn't need to implement any signal handling in your code.

omry commented 5 years ago

@bamos, taking a guess here - but is it possible that it does get re-queued to a new output directory?

bamos commented 5 years ago

@calebho - I just tried using scancel without the signal and it's still not coming back online, and no new job with a different id is coming online in place of it. A timeout error is coming up in the error log though:

distributed.nanny - INFO -         Start Nanny at: 'tcp://'
distributed.worker - INFO -       Start worker at:  tcp://
distributed.worker - INFO -          Listening to:  tcp://
distributed.worker - INFO - Waiting to connect to:  tcp://
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                         10
distributed.worker - INFO -                Memory:                   64.00 GB
distributed.worker - INFO -       Local Directory: /private/home/bda/.fairtask/dask-worker-space/worker-qefhdgge
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -         Registered to:  tcp://
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection
distributed.core - INFO - Event loop was unresponsive in Worker for 4.19s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd-learnfair073: error: *** JOB 4818895 ON learnfair073 CANCELLED AT 2019-08-28T16:45:14 ***
slurmstepd-learnfair073: error: *** STEP 4818895.0 ON learnfair073 CANCELLED AT 2019-08-28T16:45:14 ***
distributed.dask_worker - INFO - Exiting on signal 15
distributed.nanny - INFO - Closing Nanny at 'tcp://'
distributed.dask_worker - INFO - End worker
distributed.worker - INFO - Stopping worker at tcp://
Traceback (most recent call last):
  File "/private/home/bda/anaconda3/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/private/home/bda/anaconda3/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/private/home/bda/anaconda3/lib/python3.7/site-packages/distributed/cli/dask_worker.py", line 405, in <module>
  File "/private/home/bda/anaconda3/lib/python3.7/site-packages/distributed/cli/dask_worker.py", line 401, in go
  File "/private/home/bda/anaconda3/lib/python3.7/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/private/home/bda/anaconda3/lib/python3.7/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/private/home/bda/anaconda3/lib/python3.7/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/private/home/bda/anaconda3/lib/python3.7/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/private/home/bda/anaconda3/lib/python3.7/site-packages/distributed/cli/dask_worker.py", line 392, in main
    raise TimeoutError("Timed out starting worker.") from None
tornado.util.TimeoutError: Timed out starting worker.
distributed.process - WARNING - reaping stray process <ForkServerProcess(Dask Worker process (from Nanny), started)>

calebho commented 5 years ago

Hmm, I'm not observing this behavior: a replacement job was queued soon after I scancel'ed the original one. Can you paste the output of conda list?

calebho commented 5 years ago

Note this is what my output directory looks like because two SLURM jobs were submitted:

(base) calebh@devfair020:/checkpoint/calebh/outputs/2019-08-28_17-04-15$ ls
0_4818901  0_4818902

bamos commented 5 years ago

Hmm, interesting -- my process from ~30 minutes ago is still running and no second job has been launched:

6_sweep(master*)$ ls /checkpoint/bda/outputs/2019-08-28_16-44-06/

omry commented 5 years ago

@bamos, if the conclusion of this investigation is that we get a new job directory on preemption, please file an issue against Hydra. The re-queued job should run in the same directory to allow resuming from a checkpoint.

calebho commented 5 years ago

It may be because your versions of dask* and distributed are incompatible; fairtask was written when both were v1. Let me double check. If they turn out to be incompatible, I'll open an issue in fairtask to bump the versions to v2.
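
A quick way to check for this kind of mismatch is to compare the installed major versions against the one fairtask was written for. This is a minimal sketch, assuming Python 3.8+ for `importlib.metadata`; the helper names (`major`, `mismatched`, `installed_versions`) are hypothetical, and the version strings in the usage note are illustrative, not taken from anyone's environment.

```python
from importlib import metadata  # standard library since Python 3.8


def major(version):
    """Return the leading integer of a version string such as '2.3.0'."""
    return int(version.split(".")[0])


def mismatched(installed, expected_major=1):
    """List packages whose major version differs from `expected_major`."""
    return sorted(
        name for name, version in installed.items()
        if major(version) != expected_major
    )


def installed_versions(names=("dask", "distributed")):
    """Look up installed versions; packages that are not installed are skipped."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            pass
    return found
```

For example, `mismatched({"dask": "2.3.0", "distributed": "1.28.1"})` would flag only `dask`, and `mismatched(installed_versions())` runs the same check against the current environment.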

bamos commented 5 years ago

@calebho - I just downgraded to the dask* and distributed versions that are in the setup.py files in fairtask and fairtask-slurm and can confirm that the newer versions are causing the issue I filed here -- the example I posted above is working with the older versions.

@omry - this is creating a new job output directory since hydra.job.id is the SLURM job ID, and pre-emptions cause a new SLURM job ID. Filing a new issue to further discuss this.
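
To make the effect concrete, here is a small sketch of why a directory name that embeds the SLURM job ID changes after a requeue, while one keyed only on the sweep index does not. The `<job_num>_<slurm_job_id>` pattern is inferred from the `0_4818901  0_4818902` listing earlier in the thread; the function names are hypothetical and are not Hydra's API.

```python
def job_dir_with_id(job_num, slurm_job_id):
    """Directory name that embeds the SLURM job ID (changes on requeue)."""
    return f"{job_num}_{slurm_job_id}"


def job_dir_stable(job_num):
    """Directory name keyed only on the sweep index (survives a requeue)."""
    return str(job_num)


# The same sweep job before and after being pre-empted and requeued.
before = job_dir_with_id(0, 4818901)
after = job_dir_with_id(0, 4818902)
```

Here `before` and `after` differ, so a checkpoint written by the first run is not found by the second, whereas `job_dir_stable(0)` gives both runs the same directory.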

omry commented 5 years ago

@bamos, yes - I realized it by now. I think I will recommend not having the job id in the directory name in the future. Once I get some support from @calebho, I will be able to have a symlink from the hydra job directory to the stdout and stderr files created by fairtask.
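
As a rough illustration of the symlink idea, the sketch below links a job directory to the `slurm-<job_id>.out` and `slurm-<job_id>.err` files, following the file naming visible in the `.slurm` directory earlier in the thread. The function name, the link names `slurm.out` / `slurm.err`, and the directory layout are assumptions made for the example, not what Hydra or fairtask actually do.

```python
import os


def link_slurm_logs(job_dir, slurm_dir, job_id):
    """Create symlinks in `job_dir` pointing at the SLURM stdout/stderr files."""
    links = []
    for ext in ("out", "err"):
        target = os.path.join(slurm_dir, f"slurm-{job_id}.{ext}")
        link = os.path.join(job_dir, f"slurm.{ext}")
        if os.path.lexists(link):
            # Replace a link left over from an earlier (pre-empted) run.
            os.remove(link)
        os.symlink(target, link)
        links.append(link)
    return links
```

Re-running the function with the new job ID after a requeue repoints the links, so the job directory always exposes the logs of the most recent SLURM job.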

omry commented 5 years ago

@calebho, putting this one on your plate as you are the one actually dealing with it.

omry commented 4 years ago

I filed a more focused task here: https://github.com/fairinternal/hydra-fair-plugins/issues/8