facebookresearch / hydra

Hydra is a framework for elegantly configuring complex applications
https://hydra.cc
MIT License

Handling pre-emption on slurm with fairtask? #108

Closed · bamos closed this 4 years ago

bamos commented 5 years ago

I'd like to run a large number of jobs on scavenge that can handle pre-emption. Do I need to modify any hydra/fairtask config for this? Here's an MWE of me trying to get 6_sweep to restart when pre-empted, which I'm having some trouble with:
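
For reference, here is a minimal sketch of what experiment.py in the 6_sweep example looks like, reconstructed from the tutorial layout and the optimizer log lines below; the decorator signature and config path are assumptions for this pre-release Hydra:

```python
import logging

import hydra

log = logging.getLogger(__name__)


@hydra.main(config_path="conf/config.yaml")
def experiment(cfg):
    # Log the composed config; this produces the "optimizer: lr/type"
    # lines that show up in the job logs below.
    log.info(cfg.pretty())


if __name__ == "__main__":
    experiment()
```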

We can add `import time; time.sleep(1e6)` to experiment.py and then run `./experiment.py -m`. We can see this job on the cluster:

[screenshot]

And I have a dask dashboard for it:

[screenshot]

I then send a USR1 signal to my job, which according to https://our.internmc.facebook.com/intern/wiki/FAIR/Platforms/FAIRClusters/SLURMGuide/ is what gets sent for pre-emptions:

[screenshot]

$ scancel --signal=USR1 4817474

But then my job just gets killed and never comes back online:

[screenshot]

And I can see in the logs that my job got the USR1 signal, but I'm not sure of the best way to trigger a restart when this happens (one possible handler is sketched after the logs):

6_sweep(master*)$ tail /checkpoint/bda/outputs/2019-08-28_08-26-42/.slurm/slurm-4817474.* -n 100
==> /checkpoint/bda/outputs/2019-08-28_08-26-42/.slurm/slurm-4817474.err <==
distributed.nanny - INFO -         Start Nanny at: 'tcp://100.97.16.233:34063'
distributed.diskutils - INFO - Found stale lock file and directory '/private/home/bda/.fairtask/dask-worker-space/worker-_z3oli3u', purging
distributed.worker - INFO -       Start worker at:  tcp://100.97.16.233:42249
distributed.worker - INFO -          Listening to:  tcp://100.97.16.233:42249
distributed.worker - INFO - Waiting to connect to:  tcp://100.97.17.198:41779
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                         10
distributed.worker - INFO -                Memory:                   64.00 GB
distributed.worker - INFO -       Local Directory: /private/home/bda/.fairtask/dask-worker-space/worker-6z34zo8f
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -         Registered to:  tcp://100.97.17.198:41779
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection
/private/home/bda/anaconda3/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 6 leaked semaphores to clean up at shutdown
  len(cache))
srun: error: learnfair087: task 0: User defined signal 1

==> /checkpoint/bda/outputs/2019-08-28_08-26-42/.slurm/slurm-4817474.out <==
[2019-08-28 08:27:10,572][__main__][INFO] - optimizer:
  lr: 0.001
  type: nesterov

6_sweep(master*)$ tail /checkpoint/bda/outputs/2019-08-28_08-26-42/0_4817474/UNKNOWN_NAME.log -n 100
[2019-08-28 08:27:10,572][__main__][INFO] - optimizer:
  lr: 0.001
  type: nesterov
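
For context, one common pattern here (an illustration only, not something hydra or fairtask provides; the handler below and its use of `scontrol requeue` are assumptions about one way to react): catch USR1 inside the job, checkpoint, and ask SLURM to requeue the same job id.

```python
# Hypothetical self-requeueing USR1 handler; assumes SLURM_JOB_ID is set in
# the environment and that the partition permits `scontrol requeue`.
import os
import signal
import subprocess


def _on_usr1(signum, frame):
    # Save a checkpoint here, then ask SLURM to put this same job id back
    # in the queue so it restarts after the preemption.
    subprocess.check_call(["scontrol", "requeue", os.environ["SLURM_JOB_ID"]])


signal.signal(signal.SIGUSR1, _on_usr1)
```
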
omry commented 5 years ago

fairtask should requeue your job. Are you sure it's not getting requeued?

cc @calebho

bamos commented 5 years ago

Yes, I'm sure it's not getting requeued

calebho commented 5 years ago

Will take a look this afternoon

calebho commented 5 years ago

@bamos What happens if you try to cancel the job without specifying the signal, e.g. `scancel 4817474`? When a worker dies, the task it was processing should be returned to the scheduler's queue, and the scheduler should start a replacement worker. You shouldn't need to implement any signal handling in your code.
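
That fault tolerance is plain dask.distributed behavior rather than anything fairtask-specific; a minimal local sketch of it (distributed 2.x API, the function names here are just for illustration):

```python
# Minimal illustration of dask's task resilience: if the worker running a
# task dies, the scheduler reschedules the task on a surviving worker and
# the future still completes.
import time

from distributed import Client, LocalCluster


def slow_inc(x):
    time.sleep(5)
    return x + 1


if __name__ == "__main__":
    client = Client(LocalCluster(n_workers=2, threads_per_worker=1))
    future = client.submit(slow_inc, 41)
    # Killing the worker process holding this task does not fail the future;
    # the scheduler re-runs it on the other worker and result() returns 42.
    print(future.result())
```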

omry commented 5 years ago

@bamos, taking a guess here - but is it possible that it does get re-queued to a new output directory?

bamos commented 5 years ago

@calebho - I just tried `scancel` without the signal and the job is still not coming back online; no new job with a different id is coming online in its place. A timeout error is showing up in the error log though:

distributed.nanny - INFO -         Start Nanny at: 'tcp://100.97.16.199:33719'
distributed.worker - INFO -       Start worker at:  tcp://100.97.16.199:33281
distributed.worker - INFO -          Listening to:  tcp://100.97.16.199:33281
distributed.worker - INFO - Waiting to connect to:  tcp://100.97.17.198:46029
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                         10
distributed.worker - INFO -                Memory:                   64.00 GB
distributed.worker - INFO -       Local Directory: /private/home/bda/.fairtask/dask-worker-space/worker-qefhdgge
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -         Registered to:  tcp://100.97.17.198:46029
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection
distributed.core - INFO - Event loop was unresponsive in Worker for 4.19s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd-learnfair073: error: *** JOB 4818895 ON learnfair073 CANCELLED AT 2019-08-28T16:45:14 ***
slurmstepd-learnfair073: error: *** STEP 4818895.0 ON learnfair073 CANCELLED AT 2019-08-28T16:45:14 ***
distributed.dask_worker - INFO - Exiting on signal 15
distributed.nanny - INFO - Closing Nanny at 'tcp://100.97.16.199:33719'
distributed.dask_worker - INFO - End worker
distributed.worker - INFO - Stopping worker at tcp://100.97.16.199:33281
Traceback (most recent call last):
  File "/private/home/bda/anaconda3/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/private/home/bda/anaconda3/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/private/home/bda/anaconda3/lib/python3.7/site-packages/distributed/cli/dask_worker.py", line 405, in <module>
    go()
  File "/private/home/bda/anaconda3/lib/python3.7/site-packages/distributed/cli/dask_worker.py", line 401, in go
    main()
  File "/private/home/bda/anaconda3/lib/python3.7/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/private/home/bda/anaconda3/lib/python3.7/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/private/home/bda/anaconda3/lib/python3.7/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/private/home/bda/anaconda3/lib/python3.7/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/private/home/bda/anaconda3/lib/python3.7/site-packages/distributed/cli/dask_worker.py", line 392, in main
    raise TimeoutError("Timed out starting worker.") from None
tornado.util.TimeoutError: Timed out starting worker.
distributed.process - WARNING - reaping stray process <ForkServerProcess(Dask Worker process (from Nanny), started)>
calebho commented 5 years ago

Hmm, I'm not observing this behavior: a replacement job is queued soon after I `scancel` the original one. Can you paste the output of `conda list`?

calebho commented 5 years ago

Note this is what my output directory looks like, because two SLURM jobs (the original and its replacement) were submitted:

(base) calebh@devfair020:/checkpoint/calebh/outputs/2019-08-28_17-04-15$ ls
0_4818901  0_4818902
bamos commented 5 years ago

Hmm, interesting -- my process from ~30 minutes ago is still running and no second job has been launched:

6_sweep(master*)$ ls /checkpoint/bda/outputs/2019-08-28_16-44-06/
0_4818895

Here's my `conda list` output:

``` # packages in environment at /private/home/bda/anaconda3: # # Name Version Build Channel _ipyw_jlab_nb_ext_conf 0.1.0 py37_0 _libgcc_mutex 0.1 main absl-py 0.7.1 pypi_0 pypi alabaster 0.7.12 py37_0 anaconda-client 1.7.2 py37_0 anaconda-navigator 1.9.7 py37_0 anaconda-project 0.8.2 py37_0 asn1crypto 0.24.0 py37_0 aspy-yaml 1.3.0 pypi_0 pypi astor 0.8.0 pypi_0 pypi astroid 2.2.5 py37_0 astropy 3.1.2 py37h7b6447c_0 async-generator 1.10 pypi_0 pypi atomicwrites 1.3.0 py37_1 attrs 19.1.0 py37_1 babel 2.6.0 py37_0 backcall 0.1.0 py37_0 backports 1.0 py37_1 backports.os 0.1.1 py37_0 backports.shutil_get_terminal_size 1.0.0 py37_2 beautifulsoup4 4.7.1 py37_1 bitarray 0.8.3 py37h14c3975_0 bkcharts 0.2 py37_0 blas 1.0 mkl bleach 3.1.0 py37_0 blessings 1.7 pypi_0 pypi block 0.0.5 pypi_0 pypi blosc 1.15.0 hd408876_0 bokeh 1.0.4 py37_0 boto 2.49.0 py37_0 bottleneck 1.2.1 py37h035aef0_1 bzip2 1.0.6 h14c3975_5 ca-certificates 2019.5.15 1 cairo 1.14.12 h8948797_3 certifi 2019.6.16 py37_1 cffi 1.12.2 py37h2e261b9_1 cfgv 2.0.1 pypi_0 pypi chardet 3.0.4 py37_1 click 7.0 py37_0 cloudpickle 1.2.1 pypi_0 pypi clyent 1.2.2 py37_1 colorama 0.4.1 py37_0 conda 4.7.11 py37_0 conda-build 3.17.8 py37_0 conda-env 2.6.0 1 conda-package-handling 1.3.11 py37_0 conda-verify 3.1.1 py37_0 contextlib2 0.5.5 py37_0 coverage 5.0a5 pypi_0 pypi cryptography 2.6.1 py37h1ba5d50_0 cudatoolkit 10.0.130 0 curl 7.64.0 hbc83047_2 cvxpy 1.0.21 pypi_0 pypi cycler 0.10.0 py37_0 cython 0.29.6 py37he6710b0_0 cytoolz 0.9.0.1 py37h14c3975_1 dask 2.0.0 pypi_0 pypi dask-jobqueue 0.6.0 pypi_0 pypi dbus 1.13.6 h746ee38_0 decorator 4.4.0 py37_1 defusedxml 0.5.0 py37_1 deprecated 1.2.5 pypi_0 pypi dictdiffer 0.8.0 pypi_0 pypi diffcp 1.0.2 pypi_0 pypi diffcvxpy 0.1 dev_0 dill 0.2.9 pypi_0 pypi distributed 2.1.0 pypi_0 pypi dm-control 0.0.0 pypi_0 pypi dm-env 1.0 pypi_0 pypi dmc2gym 1.0.0 dev_0 docrep 0.2.7 pypi_0 pypi docutils 0.14 py37_0 dotmap 1.3.8 pypi_0 pypi easyprocess 0.2.7 pypi_0 pypi ecos 2.0.7.post1 pypi_0 pypi entrypoints 0.3 py37_0 enum34 1.1.6 pypi_0 pypi et_xmlfile 1.0.1 py37_0 expat 2.2.6 he6710b0_0 fairtask 0.1 dev_0 fairtask-slurm 0.1.1 dev_0 fastcache 1.0.2 py37h14c3975_2 ffmpeg 4.0 hcdf2ecd_0 filelock 3.0.10 py37_0 flask 1.0.2 py37_1 flatbuffers 1.11 pypi_0 pypi fontconfig 2.13.0 h9420a91_0 freeglut 3.0.0 hf484d3e_5 freetype 2.9.1 h8a8886c_1 fribidi 1.0.5 h7b6447c_0 funcsigs 1.0.2 pypi_0 pypi future 0.17.1 py37_0 futures 3.1.1 pypi_0 pypi gast 0.2.2 pypi_0 pypi geoopt 0.0.1 pypi_0 pypi get_terminal_size 1.0.0 haa9412d_0 gevent 1.4.0 py37h7b6447c_0 glfw 1.8.1 pypi_0 pypi glib 2.56.2 hd408876_0 glob2 0.6 py37_1 gmp 6.1.2 h6c8ec71_1 gmpy2 2.0.8 py37h10f8cd9_2 google-pasta 0.1.7 pypi_0 pypi gpflow 1.4.1 pypi_0 pypi gpustat 0.6.0 pypi_0 pypi graphite2 1.3.13 h23475e2_0 greenlet 0.4.15 py37h7b6447c_0 grpcio 1.21.1 pypi_0 pypi gst-plugins-base 1.14.0 hbbd80ab_1 gstreamer 1.14.0 hb453b48_1 gtimer 1.0.0b5 pypi_0 pypi gym 0.14.0 pypi_0 pypi h5py 2.8.0 py37h989c5e5_3 harfbuzz 1.8.8 hffaf4a1_0 hdf5 1.10.2 hba1933b_1 heapdict 1.0.0 py37_2 higher 0.1.1 dev_0 html5lib 1.0.1 py37_0 hydra 0.1.1 dev_0 hydra-fairtask 0.1.0 dev_0 icu 58.2 h9c2bf20_1 identify 1.4.5 pypi_0 pypi idna 2.8 py37_0 imageio 2.5.0 py37_0 imageio-ffmpeg 0.3.0 pypi_0 pypi imagesize 1.1.0 py37_0 importlib_metadata 0.8 py37_0 intel-openmp 2019.3 199 ipdb 0.12 pypi_0 pypi ipykernel 5.1.0 py37h39e3cac_0 ipython 7.4.0 py37h39e3cac_0 ipython_genutils 0.2.0 py37_0 ipywidgets 7.4.2 py37_0 isort 4.3.16 py37_0 itsdangerous 1.1.0 py37_0 jasper 2.0.14 h07fcdf6_1 jbig 2.1 hdba287a_0 
jdcal 1.4 py37_0 jedi 0.13.3 py37_0 jeepney 0.4 py37_0 jinja2 2.10 py37_0 jpeg 9b h024ee3a_2 jsonschema 3.0.1 py37_0 jupyter 1.0.0 py37_7 jupyter_client 5.2.4 py37_0 jupyter_console 6.0.0 py37_0 jupyter_core 4.4.0 py37_0 jupyterlab 0.35.4 py37hf63ae98_0 jupyterlab_server 0.2.0 py37_0 keras-applications 1.0.8 pypi_0 pypi keras-preprocessing 1.1.0 pypi_0 pypi keyring 18.0.0 py37_0 kiwisolver 1.0.1 py37hf484d3e_0 krb5 1.16.1 h173b8e3_7 lazy-object-proxy 1.3.1 py37h14c3975_2 libarchive 3.3.3 h5d8350f_5 libcurl 7.64.0 h20c2e04_2 libedit 3.1.20181209 hc058e9b_0 libffi 3.2.1 hd88cf55_4 libgcc-ng 8.2.0 hdf63c60_1 libgfortran 3.0.0 1 https://repo.anaconda.com/pkgs/free libgfortran-ng 7.3.0 hdf63c60_0 libglu 9.0.0 hf484d3e_1 liblief 0.9.0 h7725739_2 libopencv 3.4.2 hb342d67_1 libopus 1.3 h7b6447c_0 libpng 1.6.36 hbc83047_0 libsodium 1.0.16 h1bed415_0 libssh2 1.8.0 h1ba5d50_4 libstdcxx-ng 8.2.0 hdf63c60_1 libtiff 4.0.10 h2733197_2 libtool 2.4.6 h7b6447c_5 libuuid 1.0.3 h1bed415_2 libvpx 1.7.0 h439df22_0 libxcb 1.13 h1bed415_1 libxml2 2.9.9 he19cac6_0 libxslt 1.1.33 h7d1a2b0_0 line-profiler 2.1.1 dev_0 llvmlite 0.28.0 py37hd408876_0 lml 0.0.1 dev_0 locket 0.2.0 py37_1 lockfile 0.12.2 pypi_0 pypi lxml 4.3.2 py37hefd8a0e_0 lz4-c 1.8.1.2 h14c3975_0 lzo 2.10 h49e0be7_2 markdown 3.1.1 pypi_0 pypi markupsafe 1.1.1 py37h7b6447c_0 matplotlib 3.0.3 py37h5429711_0 mbbl 0.1 dev_0 mbrl 0.1.dev0 dev_0 mccabe 0.6.1 py37_1 mistune 0.8.4 py37h7b6447c_0 mkl 2019.3 199 mkl_fft 1.0.10 py37ha843d7b_0 mkl_random 1.0.2 py37hd81dba3_0 more-itertools 6.0.0 py37_0 moviepy 1.0.0 pypi_0 pypi mpc 0.0.3 dev_0 mpfr 4.0.1 hdf1c602_3 mpmath 1.1.0 py37_0 msgpack-python 0.6.1 py37hfd86e86_1 mujoco-py 0.5.7 pypi_0 pypi multipledispatch 0.6.0 py37_0 multiprocess 0.70.7 pypi_0 pypi multiworld 0.0.0 pypi_0 pypi mypy 0.711 pypi_0 pypi mypy-extensions 0.4.1 pypi_0 pypi natsort 6.0.0 pypi_0 pypi navigator-updater 0.2.1 py37_0 nbconvert 5.4.1 py37_3 nbformat 4.4.0 py37_0 ncurses 6.1 he6710b0_1 networkx 2.2 py37_1 ninja 1.9.0 py37hfd86e86_0 nltk 3.4 py37_1 nodeenv 1.3.3 pypi_0 pypi nose 1.3.7 py37_2 notebook 5.7.8 py37_0 numba 0.43.1 py37h962f231_0 numdifftools 0.9.39 pypi_0 pypi numexpr 2.6.9 py37h9e4a6bb_0 numpy 1.16.2 py37h7e9f1db_0 numpy-base 1.16.2 py37hde5b4d6_0 numpydoc 0.8.0 py37_0 nvidia-ml-py3 7.352.0 pypi_0 pypi olefile 0.46 py37_0 omegaconf 1.3.0 pypi_0 pypi openblas 0.2.19 0 kidzik opencv 3.4.2 py37h6fd60c2_1 openpyxl 2.6.1 py37_1 opensim 4.0.0 15 kidzik openssl 1.1.1c h7b6447c_1 opt-einsum 3.0.0 pypi_0 pypi osim-rl 3.0.3 pypi_0 pypi osqp 0.5.0 pypi_0 pypi osqpth 0.0.1 dev_0 packaging 19.0 py37_0 pandas 0.24.2 py37he6710b0_0 pandoc 2.2.3.2 0 pandocfilters 1.4.2 py37_1 pango 1.42.4 h049681c_0 parso 0.3.4 py37_0 partd 0.3.10 py37_1 patchelf 0.9 he6710b0_3 path.py 11.5.0 py37_0 pathlib2 2.3.3 py37_0 pathos 0.2.3 pypi_0 pypi patsy 0.5.1 py37_0 pcre 8.43 he6710b0_0 pep8 1.7.1 py37_0 pexpect 4.6.0 py37_0 pickleshare 0.7.5 py37_0 pillow 5.4.1 py37h34e0f95_0 pip 19.0.3 py37_0 pixman 0.38.0 h7b6447c_0 pkginfo 1.5.0.1 py37_0 pluggy 0.9.0 py37_0 ply 3.11 py37_0 pox 0.2.5 pypi_0 pypi ppft 1.6.4.9 pypi_0 pypi pre-commit 1.17.0 pypi_0 pypi proglog 0.1.9 pypi_0 pypi prometheus_client 0.6.0 py37_0 prompt_toolkit 2.0.9 py37_0 protobuf 3.8.0 pypi_0 pypi psutil 5.6.1 py37h7b6447c_0 ptyprocess 0.6.0 py37_0 py 1.8.0 py37_0 py-lief 0.9.0 py37h7725739_2 py-opencv 3.4.2 py37hb342d67_1 pycodestyle 2.5.0 py37_0 pycosat 0.6.3 py37h14c3975_0 pycparser 2.19 py37_0 pycrypto 2.6.1 py37h14c3975_9 pycurl 7.43.0.2 py37h1ba5d50_0 pydantic 0.29 pypi_0 pypi pyflakes 
2.1.1 py37_0 pyglet 1.3.2 pypi_0 pypi pygments 2.3.1 py37_0 pylint 2.3.1 py37_0 pyodbc 4.0.26 py37he6710b0_0 pyopengl 3.1.0 pypi_0 pypi pyopenssl 19.0.0 py37_0 pyparsing 2.3.1 py37_0 pyqt 5.9.2 py37h05f1152_2 pyro-ppl 0.3.3+de764530 dev_0 pyrsistent 0.14.11 py37h7b6447c_0 pysocks 1.6.8 py37_0 pytables 3.4.4 py37ha205bf6_0 pytest 4.3.1 py37_0 pytest-arraydiff 0.3 py37h39e3cac_0 pytest-astropy 0.5.0 py37_0 pytest-cov 2.7.1 pypi_0 pypi pytest-doctestplus 0.3.0 py37_0 pytest-openfiles 0.3.2 py37_0 pytest-remotedata 0.3.1 py37_0 python 3.7.3 h0371630_0 python-dateutil 2.8.0 py37_0 python-graphviz 0.11.1 pypi_0 pypi python-libarchive-c 2.8 py37_6 pytorch 1.2.0 py3.7_cuda10.0.130_cudnn7.6.2_0 pytorch pytz 2018.9 py37_0 pyvirtualdisplay 0.2.3 pypi_0 pypi pywavelets 1.0.2 py37hdd07704_0 pyyaml 5.1 py37h7b6447c_0 pyzmq 18.0.0 py37he6710b0_0 qpth 0.0.13 dev_0 qt 5.9.7 h5867ecd_1 qtawesome 0.5.7 py37_1 qtconsole 4.4.3 py37_0 qtpy 1.7.0 py37_1 ray 0.7.1 pypi_0 pypi readline 7.0 h7b6447c_5 redis 3.2.1 pypi_0 pypi requests 2.21.0 py37_0 rlkit 0.2.1.dev0 dev_0 rope 0.12.0 py37_0 ruamel-yaml 0.15.97 pypi_0 pypi ruamel_yaml 0.15.46 py37h14c3975_0 satnet 0.1.2 dev_0 scikit-image 0.14.2 py37he6710b0_0 scikit-learn 0.20.3 py37hd81dba3_0 scipy 1.2.1 py37h7c811a0_0 scipyplot 0.0.6 pypi_0 pypi scs 2.1.0 pypi_0 pypi seaborn 0.9.0 py37_0 secretstorage 3.1.1 py37_0 semantic-version 2.6.0 pypi_0 pypi send2trash 1.5.0 py37_0 setgpu 0.0.7 pypi_0 pypi setproctitle 1.1.10 pypi_0 pypi setuptools 41.0.1 pypi_0 pypi simplegeneric 0.8.1 py37_2 singledispatch 3.4.0.3 py37_0 sip 4.19.8 py37hf484d3e_0 six 1.12.0 py37_0 snappy 1.1.7 hbae5bb6_3 snowballstemmer 1.2.1 py37_0 sortedcollections 1.1.2 py37_0 sortedcontainers 2.1.0 py37_0 soupsieve 1.8 py37_0 sphinx 1.8.5 py37_0 sphinxcontrib 1.0 py37_1 sphinxcontrib-websupport 1.1.0 py37_1 spyder 3.3.3 py37_0 spyder-kernels 0.4.2 py37_0 sqlalchemy 1.3.1 py37h7b6447c_0 sqlite 3.27.2 h7b6447c_0 statsmodels 0.9.0 py37h035aef0_0 submitit 0.1.0 dev_0 sympy 1.3 py37_0 tb-nightly 1.14.0a20190603 pypi_0 pypi tblib 1.3.2 py37_0 tensorboard 1.14.0 pypi_0 pypi tensorflow-estimator 1.14.0 pypi_0 pypi tensorflow-gpu 1.14.0 pypi_0 pypi termcolor 1.1.0 pypi_0 pypi terminado 0.8.1 py37_1 testpath 0.4.2 py37_0 tf-estimator-nightly 1.14.0.dev2019060501 pypi_0 pypi timeout-decorator 0.4.1 pypi_0 pypi tk 8.6.8 hbc83047_0 toml 0.10.0 pypi_0 pypi toolz 0.9.0 py37_0 torch 1.1.0 pypi_0 pypi torch-cg 1.0.1 pypi_0 pypi torchfile 0.1.0 pypi_0 pypi torchvision 0.4.0 py37_cu100 pytorch tornado 6.0.2 py37h7b6447c_0 tqdm 4.31.1 py37_1 traitlets 4.3.2 py37_0 typed-ast 1.4.0 pypi_0 pypi typing 3.7.4 pypi_0 pypi typing-extensions 3.7.4 pypi_0 pypi unicodecsv 0.14.1 py37_0 unixodbc 2.3.7 h14c3975_0 urllib3 1.24.1 py37_0 virtualenv 16.6.2 pypi_0 pypi visdom 0.1.8.8 pypi_0 pypi wcwidth 0.1.7 py37_0 webencodings 0.5.1 py37_1 websocket-client 0.56.0 pypi_0 pypi werkzeug 0.14.1 py37_0 wheel 0.33.1 py37_0 widgetsnbextension 3.4.2 py37_0 wrapt 1.11.1 py37h7b6447c_0 wurlitzer 1.0.2 py37_0 xlrd 1.2.0 py37_0 xlsxwriter 1.1.5 py37_0 xlwt 1.3.0 py37_0 xz 5.2.4 h14c3975_4 yaml 0.1.7 had09818_2 zeromq 4.3.1 he6710b0_3 zict 0.1.4 py37_0 zipp 0.3.3 py37_1 zlib 1.2.11 h7b6447c_3 zstd 1.3.7 h0b5b093_0 ```
omry commented 5 years ago

@bamos, if the conclusion of this investigation is that we get a new job directory on preemption, please file an issue against Hydra. The re-queued job should run in the same directory to allow resuming from a checkpoint.

calebho commented 5 years ago

It may be because your versions of dask* and distributed are incompatible; fairtask was written when both were v1. Let me double-check. If it turns out to be incompatible, I'll open an issue in fairtask to bump the versions to v2.
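
A quick way to check (nothing beyond the packages' standard version attributes; the v1-vs-v2 threshold is from the comment above), and indeed the `conda list` output above shows dask 2.0.0 and distributed 2.1.0:

```python
# Print the installed dask/distributed versions; fairtask targeted the 1.x
# series, so seeing 2.x here is consistent with a version-skew failure.
import dask
import distributed

print("dask:", dask.__version__)
print("distributed:", distributed.__version__)
```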

bamos commented 5 years ago

@calebho - I just downgraded to the dask* and distributed versions that are in the setup.py files in fairtask and fairtask-slurm and can confirm that the newer versions are causing the issue I filed here -- the example I posted above is working with the older versions.

@omry - this is creating a new job output directory since hydra.job.id is the SLURM job id, and pre-emption assigns a new job id on requeue. Filing a new issue to further discuss this.
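
To make the failure mode concrete (hypothetical helpers, not hydra API): any output path keyed on the SLURM job id changes across a requeue, while one keyed on values fixed at submission time does not.

```python
import os


def dir_from_job_id(base, task_idx, job_id):
    # Breaks on preemption: the requeued job gets a new SLURM job id,
    # cf. the 0_4818901 vs 0_4818902 directories in the listing above.
    return os.path.join(base, f"{task_idx}_{job_id}")


def dir_from_sweep(base, sweep_ts, task_idx):
    # Stable across requeues: the sweep timestamp and task index are fixed
    # when the sweep is launched and do not change with the job id.
    return os.path.join(base, sweep_ts, str(task_idx))
```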

omry commented 5 years ago

@bamos, yes - I realized it by now. I think I will recommend not including the job id in the directory in the future. Once I get some support from @calebho, I will be able to have a symlink from the hydra job directory to the stdout and stderr files created by fairtask.
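
Roughly like this (concrete paths taken from the listings above; the filenames and layout are assumptions about what the plugin would do):

```python
import os

# Link the fairtask/SLURM log files into the hydra job directory so the
# run's logs live alongside its outputs. All concrete paths hypothetical.
slurm_dir = "/checkpoint/bda/outputs/2019-08-28_16-44-06/.slurm"
job_dir = "/checkpoint/bda/outputs/2019-08-28_16-44-06/0_4818895"

for ext in ("out", "err"):
    target = os.path.join(slurm_dir, f"slurm-4818895.{ext}")
    link = os.path.join(job_dir, f"slurm.{ext}")
    if not os.path.lexists(link):
        os.symlink(target, link)
```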

omry commented 5 years ago

@calebho, putting this one on your plate as you are the one actually dealing with it.

omry commented 4 years ago

I filed a more focused task here: https://github.com/fairinternal/hydra-fair-plugins/issues/8