Closed: bamos closed this issue 4 years ago
fairtask should requeue your job. Are you sure it's not getting requeued?
cc @calebho
Yes, I'm sure it's not getting requeued
Will take a look this afternoon
@bamos What happens if you try to cancel the job without specifying the signal, e.g. scancel 4817474? When a worker dies, the task it was processing should be returned to the scheduler's queue, and the scheduler should start a replacement worker. You shouldn't need to implement any signal handling in your code.
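For context, here's a minimal sketch of that resilience behavior using plain dask.distributed (an illustration, not fairtask itself; the LocalCluster setup is an assumption):

```python
# Sketch of dask.distributed's resilience: if a worker dies mid-task,
# the scheduler returns the task to its queue and reruns it elsewhere.
from dask.distributed import Client, LocalCluster

def slow_square(x):
    import time
    time.sleep(5)
    return x * x

if __name__ == "__main__":
    cluster = LocalCluster(n_workers=2, threads_per_worker=1)
    client = Client(cluster)
    future = client.submit(slow_square, 6)
    # Even if the worker running slow_square is killed here (e.g. by a
    # pre-emption), the scheduler resubmits the task to a surviving worker.
    print(future.result())  # 36
```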
@bamos, taking a guess here - but is it possible that it does get re-queued to a new output directory?
@calebho - I just tried using scancel without the signal and it's still not coming back online, and no new job with a different ID is coming up in its place. A timeout error does show up in the error log, though:
distributed.nanny - INFO - Start Nanny at: 'tcp://100.97.16.199:33719'
distributed.worker - INFO - Start worker at: tcp://100.97.16.199:33281
distributed.worker - INFO - Listening to: tcp://100.97.16.199:33281
distributed.worker - INFO - Waiting to connect to: tcp://100.97.17.198:46029
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Threads: 10
distributed.worker - INFO - Memory: 64.00 GB
distributed.worker - INFO - Local Directory: /private/home/bda/.fairtask/dask-worker-space/worker-qefhdgge
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Registered to: tcp://100.97.17.198:46029
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection
distributed.core - INFO - Event loop was unresponsive in Worker for 4.19s. This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd-learnfair073: error: *** JOB 4818895 ON learnfair073 CANCELLED AT 2019-08-28T16:45:14 ***
slurmstepd-learnfair073: error: *** STEP 4818895.0 ON learnfair073 CANCELLED AT 2019-08-28T16:45:14 ***
distributed.dask_worker - INFO - Exiting on signal 15
distributed.nanny - INFO - Closing Nanny at 'tcp://100.97.16.199:33719'
distributed.dask_worker - INFO - End worker
distributed.worker - INFO - Stopping worker at tcp://100.97.16.199:33281
Traceback (most recent call last):
  File "/private/home/bda/anaconda3/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/private/home/bda/anaconda3/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/private/home/bda/anaconda3/lib/python3.7/site-packages/distributed/cli/dask_worker.py", line 405, in <module>
    go()
  File "/private/home/bda/anaconda3/lib/python3.7/site-packages/distributed/cli/dask_worker.py", line 401, in go
    main()
  File "/private/home/bda/anaconda3/lib/python3.7/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/private/home/bda/anaconda3/lib/python3.7/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/private/home/bda/anaconda3/lib/python3.7/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/private/home/bda/anaconda3/lib/python3.7/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/private/home/bda/anaconda3/lib/python3.7/site-packages/distributed/cli/dask_worker.py", line 392, in main
    raise TimeoutError("Timed out starting worker.") from None
tornado.util.TimeoutError: Timed out starting worker.
distributed.process - WARNING - reaping stray process <ForkServerProcess(Dask Worker process (from Nanny), started)>
Hmm, I'm not observing this behavior: a replacement job was queued soon after I scancel'd the original one. Can you paste the output of conda list?
Note this is what my output directory looks like because two SLURM jobs were submitted:
(base) calebh@devfair020:/checkpoint/calebh/outputs/2019-08-28_17-04-15$ ls
0_4818901 0_4818902
Hmm, interesting -- my process from ~30 minutes ago is still running and no second job has been launched:
6_sweep(master*)$ ls /checkpoint/bda/outputs/2019-08-28_16-44-06/
0_4818895
Here's my conda list output:
@bamos, if the conclusion of this investigation is that we get a new job directory on preemption, please file an issue against Hydra. The re-queued job should run in the same directory to allow resuming from a checkpoint.
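To make the requirement concrete, here's a rough sketch of what resume-from-checkpoint relies on (the filename and helpers are illustrative, not a Hydra API):

```python
import os
import pickle

# Hydra chdirs into the job's output directory, so a relative path lands there.
CKPT = "checkpoint.pkl"  # illustrative filename, not a Hydra convention

def load_or_init_state():
    # This lookup only succeeds after a requeue if the replacement job runs
    # in the *same* directory as the original -- the point being made above.
    if os.path.exists(CKPT):
        with open(CKPT, "rb") as f:
            return pickle.load(f)
    return {"step": 0}

def save_state(state):
    with open(CKPT, "wb") as f:
        pickle.dump(state, f)
```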
It may be because your versions of dask and distributed are incompatible; fairtask was written when both were v1. Let me double-check. If they turn out to be incompatible, I'll open an issue on fairtask to bump the versions to v2.
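If that turns out to be the fix, pinning the 1.x series in a setup.py would look roughly like this (an illustrative sketch, not fairtask's actual file):

```python
from setuptools import setup

setup(
    name="example-project",  # placeholder, not fairtask's real metadata
    install_requires=[
        "dask<2",         # fairtask was written against the 1.x series
        "distributed<2",
    ],
)
```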
@calebho - I just downgraded to the dask and distributed versions pinned in the setup.py files of fairtask and fairtask-slurm, and I can confirm that the newer versions are causing the issue I filed here -- the example I posted above works with the older versions.
@omry - this is creating a new job output directory since hydra.job.id is the SLURM job ID, and a pre-empted job comes back as a new SLURM job with a new ID. Filing a new issue to discuss this further.
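To illustrate the mechanism (a sketch; the replacement worker is a fresh SLURM submission, so the ID changes):

```python
import os

# SLURM exposes the current job's ID through this environment variable.
job_id = os.environ.get("SLURM_JOB_ID")  # e.g. "4818895"

# An output path derived from the ID changes on every resubmission,
# which is why the listings above show directories like 0_4818901.
out_dir = f"0_{job_id}"
```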
@bamos, yes - I've realized that by now. I think I will recommend not including the job ID in the directory name in the future. Once I get some support from @calebho, I will be able to have a symlink from the Hydra job directory to the stdout and stderr files created by fairtask.
@calebho, putting this one on your plate as you are the one actually dealing with it.
I filed a more focused task here: https://github.com/fairinternal/hydra-fair-plugins/issues/8
I'd like to run a large number of jobs on scavenge that can handle pre-emption. Do I need to modify any hydra/fairtask config for this? Here's an MWE of me trying to get 6_sweep to restart when pre-empted that I'm having some trouble with. We can add import time; time.sleep(1e6) to experiment.py and then run ./experiment.py -m (see the sketch below). We can see this job on the cluster, and I have a dask dashboard for it.
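A minimal reconstruction of that experiment.py (the config path and decorator usage are assumptions based on the Hydra API of the time):

```python
import time

import hydra

@hydra.main(config_path="config.yaml")  # assumed config file
def main(cfg):
    # Sleep long enough that the job is still running when we
    # trigger a pre-emption by hand.
    time.sleep(1e6)

if __name__ == "__main__":
    main()
```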
I then send a USR1 signal to my job, which according to https://our.internmc.facebook.com/intern/wiki/FAIR/Platforms/FAIRClusters/SLURMGuide/ is what gets sent on pre-emption. But then my job just gets killed and never comes back online. I can see in the logs that my job got the USR1 signal, but I'm not sure of the best way to trigger a restart when this happens.
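For reference, a hedged sketch of handling SIGUSR1 by hand (the discussion above suggests fairtask should make this unnecessary; scontrol requeue also assumes the cluster permits requeueing the job):

```python
import os
import signal
import subprocess
import sys

def handle_usr1(signum, frame):
    # ... save a checkpoint here before the job is killed ...
    job_id = os.environ.get("SLURM_JOB_ID")
    if job_id:
        # Ask SLURM to put this batch job back in the queue.
        subprocess.run(["scontrol", "requeue", job_id])
    sys.exit(0)

signal.signal(signal.SIGUSR1, handle_usr1)
```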