Closed TomAugspurger closed 4 years ago
I looked into this a bit today.
I think there's a bug in https://github.com/dask/distributed/blob/da6a01bdee1c6d90934c61ae056b14610cd56a6c/distributed/cli/dask_worker.py#L376. That should be a `gen.coroutine`, else any `yield` within that call stack will immediately exit the handler (tested with `yield gen.sleep(0)` in the `close_all` right above that: anything after the `yield gen.sleep(0)` isn't run).
But even after fixing that, I'm still seeing a ctrl-c cause a `TimeoutError`. Will come back to this later.
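For anyone following along, here is a minimal plain-Python sketch of the general mechanism that can produce the "nothing after the `yield` runs" symptom. It is not the dask code: `close_all` here is a stand-in that takes a hypothetical `events` list, the bare `yield` stands in for `yield gen.sleep(0)`, and the three caller functions are made up for illustration. The point is only that a function containing `yield` does nothing past the `yield` unless something (such as a coroutine decorator plus a running event loop) keeps driving it.

```python
def close_all(events):
    """Stand-in for a shutdown routine written in generator style."""
    events.append("before yield")
    yield  # stand-in for `yield gen.sleep(0)`
    events.append("after yield")


def call_only():
    """Call the generator function without driving it: the body never starts."""
    events = []
    close_all(events)  # creates a generator object and discards it
    return events


def step_once():
    """Advance to the first yield and stop: the tail of the body is skipped."""
    events = []
    pending = close_all(events)
    next(pending)  # runs up to the first `yield`, then nobody resumes it
    return events


def drive_to_completion():
    """What a coroutine runner effectively does: resume until exhausted."""
    events = []
    for _ in close_all(events):
        pass
    return events


if __name__ == "__main__":
    print(call_only())
    print(step_once())
    print(drive_to_completion())
```

Only the last variant reaches the code after the `yield`, which matches the observation that the cleanup tail was silently skipped.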
Additionally, we might consider having SIGINT call something like the following in order to cleanly move data away:
worker.scheduler.close_workers(..., workers=[self.address])
cc @jcrist @jakirkham @andersy005 @carreau
According to the Slurm documentation, processes on a preemptible queue are sent SIGCONT, SIGTERM, then SIGKILL, in that order. I'm guessing SIGCONT comes first because they might already be suspended. So maybe we want to also trigger this on SIGTERM.
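A rough sketch of what "also trigger this on SIGTERM" could look like, using only the standard-library `signal` module. The helper names `install_shutdown_handlers` and `restore_handlers` and the `on_shutdown` callback are hypothetical, not anything in `distributed`; in the real worker the callback would have to kick off the actual close logic on the event loop.

```python
import signal


def install_shutdown_handlers(on_shutdown, signums=(signal.SIGINT, signal.SIGTERM)):
    """Register one callback for several signals.

    Returns the previously installed handlers so they can be restored.
    Must be called from the main thread, as `signal.signal` requires.
    """
    previous = {}
    for signum in signums:
        # The handler receives (signum, frame); only the number is forwarded.
        previous[signum] = signal.signal(
            signum, lambda received, frame: on_shutdown(received)
        )
    return previous


def restore_handlers(previous):
    """Put back the handlers returned by install_shutdown_handlers."""
    for signum, handler in previous.items():
        # `signal.signal` returns None for handlers not installed from Python;
        # those cannot be reinstalled, so skip them.
        if handler is not None:
            signal.signal(signum, handler)


if __name__ == "__main__":
    received = []
    previous = install_shutdown_handlers(received.append)
    try:
        signal.raise_signal(signal.SIGTERM)  # simulate the scheduler's SIGTERM
    finally:
        restore_handlers(previous)
    print(received)
```

SIGKILL cannot be caught, so whatever cleanup runs on SIGTERM has to finish within the grace period the batch system allows before it escalates.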
FWIW, I don't see this any more.
bash-5.0$ dask-worker tcp://192.168.7.20:8786
distributed.nanny - INFO - Start Nanny at: 'tcp://192.168.7.20:61655'
distributed.worker - INFO - Start worker at: tcp://192.168.7.20:61657
distributed.worker - INFO - Listening to: tcp://192.168.7.20:61657
distributed.worker - INFO - dashboard at: 192.168.7.20:61656
distributed.worker - INFO - Waiting to connect to: tcp://192.168.7.20:8786
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Threads: 8
distributed.worker - INFO - Memory: 17.18 GB
distributed.worker - INFO - Local Directory: /Users/taugspurger/sandbox/distributed/dask-worker-space/worker-r_2ftwxz
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Registered to: tcp://192.168.7.20:8786
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection
^Cdistributed.nanny - INFO - Closing Nanny at 'tcp://192.168.7.20:61655'
distributed.dask_worker - INFO - End worker
bash-5.0$ echo $?
0
Closing.
Connect a `dask-worker` to the scheduler and then `ctrl-c`. That should exit cleanly.
This is related to my PR. Will take a look later.