dask / distributed

A distributed task scheduler for Dask
https://distributed.dask.org

dask-worker not handling KeyboardInterrupt correctly #2788

Closed. TomAugspurger closed this issue 4 years ago.

TomAugspurger commented 5 years ago

Connect a dask-worker to the scheduler and then press ctrl-c.

That should exit cleanly, but instead the worker raises a TimeoutError on shutdown:

2019-06-19 14:53:49,386 distributed.worker[54182] INFO -------------------------------------------------
2019-06-19 14:53:49,393 distributed.worker[54182] INFO         Registered to:    tcp://192.168.7.20:8786
2019-06-19 14:53:49,393 distributed.worker[54182] INFO -------------------------------------------------
2019-06-19 14:53:49,394 distributed.core[54182] INFO Starting established connection
^C2019-06-19 14:53:51,525 distributed.dask_worker[54155] INFO Exiting on signal 2
2019-06-19 14:53:51,526 distributed.nanny[54155] INFO Closing Nanny at 'tcp://192.168.7.20:62826'
2019-06-19 14:53:51,528 distributed.dask_worker[54155] INFO End worker
Traceback (most recent call last):
  File "/Users/taugspurger/.virtualenvs/dask-dev/bin/dask-worker", line 11, in <module>
    load_entry_point('distributed', 'console_scripts', 'dask-worker')()
  File "/Users/taugspurger/sandbox/distributed/distributed/cli/dask_worker.py", line 387, in go
    main()
  File "/Users/taugspurger/Envs/dask-dev/lib/python3.7/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/Users/taugspurger/Envs/dask-dev/lib/python3.7/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/Users/taugspurger/Envs/dask-dev/lib/python3.7/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/taugspurger/Envs/dask-dev/lib/python3.7/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/Users/taugspurger/sandbox/distributed/distributed/cli/dask_worker.py", line 380, in main
    raise TimeoutError("Timed out starting worker.") from None
tornado.util.TimeoutError: Timed out starting worker.
2019-06-19 14:53:51,531 distributed.process[54155] WARNING reaping stray process <ForkServerProcess(Dask Worker process (from Nanny), started daemon)>

This is related to my PR. Will take a look later.

TomAugspurger commented 5 years ago

I looked into this a bit today.

I think there's a bug in https://github.com/dask/distributed/blob/da6a01bdee1c6d90934c61ae056b14610cd56a6c/distributed/cli/dask_worker.py#L376. That should be a gen.coroutine; otherwise any yield within that call stack will immediately exit the handler. I tested this with a yield gen.sleep(0) in the close_all right above that: anything after the yield gen.sleep(0) isn't run.
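
To illustrate the failure mode described above, here is a minimal sketch. It is an analogy written with the standard library's asyncio, not the tornado gen.coroutine code in dask_worker.py, and the names close_all, handler_not_awaited, handler_awaited and run are placeholders chosen for this example. It assumes the problem is that the cleanup coroutine is started but nothing waits for it, so it is abandoned at its first suspension point.

```python
import asyncio


async def close_all(log):
    """Stand-in for the worker cleanup; `log` records how far it got."""
    log.append("close started")
    # Analogous to `yield gen.sleep(0)`: hands control back to the event loop.
    await asyncio.sleep(0)
    log.append("close finished")


async def handler_not_awaited(log):
    # Schedules the cleanup but returns without waiting for it.
    asyncio.ensure_future(close_all(log))


async def handler_awaited(log):
    # Waits for the cleanup, so code after the suspension point also runs.
    await close_all(log)


def run(handler):
    """Run one handler to completion and return what the cleanup logged."""
    log = []
    # asyncio.run cancels any task that is still pending when the handler returns.
    asyncio.run(handler(log))
    return log


if __name__ == "__main__":
    print("not awaited:", run(handler_not_awaited))
    print("awaited:    ", run(handler_awaited))
```

In the awaited variant the cleanup runs to the end. In the other variant the step after the sleep is never reached, which is the behaviour reported above.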

But even after fixing that, I'm still seeing a ctrl-c cause a TimeoutError. Will come back to this later.

mrocklin commented 5 years ago

Additionally, we might consider having SIGINT call something like the following in order to cleanly move data away:

worker.scheduler.close_workers(..., workers=[self.address])

cc @jcrist @jakirkham @andersy005 @carreau

Carreau commented 4 years ago

According to the Slurm documentation, processes on a preemptible queue are sent SIGCONT, SIGTERM, then SIGKILL, in that order. I'm guessing SIGCONT is sent first because they might already be suspended. So maybe we want to also trigger this on SIGTERM.
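
As a rough sketch of what triggering on both signals could look like, the snippet below registers one callback for SIGINT and SIGTERM using only the standard library signal module. It is not how dask-worker installs its handlers; install_shutdown_handler and restore_handlers are hypothetical helpers for this example, and it assumes the code runs in the main thread on Python 3.8 or later (for signal.raise_signal).

```python
import signal


def install_shutdown_handler(callback, signums=(signal.SIGINT, signal.SIGTERM)):
    """Register `callback(signum, frame)` for each signal.

    Returns a dict mapping each signal to its previous handler.
    """
    previous = {}
    for signum in signums:
        previous[signum] = signal.signal(signum, callback)
    return previous


def restore_handlers(previous):
    """Put back the handlers saved by install_shutdown_handler."""
    for signum, handler in previous.items():
        # A handler not installed from Python is reported as None; skip those.
        if handler is not None:
            signal.signal(signum, handler)


if __name__ == "__main__":
    received = []
    previous = install_shutdown_handler(
        lambda signum, frame: received.append(signum)
    )
    signal.raise_signal(signal.SIGTERM)  # simulate the scheduler's SIGTERM
    restore_handlers(previous)
    print("signals received:", received)
```

A real handler would start the same clean shutdown path as ctrl-c instead of appending to a list.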

TomAugspurger commented 4 years ago

FWIW, I don't see this any more.

bash-5.0$ dask-worker tcp://192.168.7.20:8786
distributed.nanny - INFO -         Start Nanny at: 'tcp://192.168.7.20:61655'
distributed.worker - INFO -       Start worker at:   tcp://192.168.7.20:61657
distributed.worker - INFO -          Listening to:   tcp://192.168.7.20:61657
distributed.worker - INFO -          dashboard at:         192.168.7.20:61656
distributed.worker - INFO - Waiting to connect to:    tcp://192.168.7.20:8786
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          8
distributed.worker - INFO -                Memory:                   17.18 GB
distributed.worker - INFO -       Local Directory: /Users/taugspurger/sandbox/distributed/dask-worker-space/worker-r_2ftwxz
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -         Registered to:    tcp://192.168.7.20:8786
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection
^Cdistributed.nanny - INFO - Closing Nanny at 'tcp://192.168.7.20:61655'
distributed.dask_worker - INFO - End worker
bash-5.0$ echo $?
0

Closing.