Open scottyhq opened 4 years ago
pinging @jhamman - not sure this has come up on the binderhub running on GCE
(un)Fortunately, I have not seen this on our GCP binderhub.
Thanks for raising this.
Looking at the tracebacks, there seems to be nothing related to dask-kubernetes in them. It seems like the distributed cluster itself is failing to exit cleanly, so I'm going to move this over to the distributed repo and track it there.
I don't know what would cause the CommClosedErrors. If I were to dive in here, I would just start looking at the lines mentioned in the traceback, and maybe try to back out what happens if a connection dies at that point.
Were your workers running with any sort of --death-timeout value that wasn't respected?
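For context, a minimal sketch of what a respected death timeout should look like (illustrative only, not from the thread; the address is a placeholder with nothing listening on it, and the exact failure mode varies between distributed versions):

```python
# Sketch of the intended behaviour of death_timeout: a worker that cannot
# reach its scheduler should give up after ~death_timeout seconds instead of
# running indefinitely. Behaviour details differ across distributed versions.
import asyncio
from distributed import Worker

async def main():
    try:
        # placeholder address with no scheduler listening on it
        worker = await Worker("tcp://127.0.0.1:12345", death_timeout=10)
    except Exception as exc:
        # newer distributed versions raise once the timeout expires
        print(f"worker gave up starting: {exc!r}")
    else:
        # older versions may instead close the worker quietly
        print(f"worker status after timeout: {worker.status}")

asyncio.run(main())
```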
Thanks @jacobtomlinson and @mrocklin for looking into this. Our dask_config.yaml currently looks like this: https://github.com/pangeo-data/pangeo-stacks/blob/master/base-notebook/binder/dask_config.yaml
distributed:
  version: 2
  dashboard:
    link: /user/{JUPYTERHUB_USER}/proxy/{port}/status
  scheduler:
    idle-timeout: 3600s
  # uncomment to force new worker pods after 2 hrs
  # worker:
  #   lifetime:
  #     duration: "2 hours"
  #     stagger: "10 s"
  #     restart: true
  admin:
    tick:
      limit: 5s

logging:
  distributed: warning
  bokeh: critical
  # http://stackoverflow.com/questions/21234772/python-tornado-disable-logging-to-stderr
  tornado: critical
  tornado.application: error

kubernetes:
  name: dask-{JUPYTERHUB_USER}-{uuid}
  worker-template:
    spec:
      serviceAccount: daskkubernetes
      restartPolicy: Never
      containers:
        - name: dask-worker
          image: ${JUPYTER_IMAGE_SPEC}
          args:
            - dask-worker
            - --nthreads
            - '2'
            - --no-dashboard
            - --memory-limit
            - 7GB
            - --death-timeout
            - '60'
          resources:
            limits:
              cpu: "1.75"
              memory: 7G
            requests:
              cpu: 1
              memory: 7G

labextension:
  factory:
    module: dask_kubernetes
    class: KubeCluster
    args: []
    kwargs: {}
So --death-timeout '60' (I'm guessing this is seconds) is not being respected. I also see that we have a worker lifetime config under distributed that is currently commented out, which was discussed previously in https://github.com/pangeo-data/pangeo-stacks/pull/93. I suppose we could set that to a high value (something like 24 hours) just to ensure things don't run unintentionally for days when situations like this arise?
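As a rough sketch (values are illustrative, not a tested recommendation), the commented-out lifetime block above could also be applied programmatically before workers are created, using the 24-hour figure mentioned here:

```python
# Illustrative only: programmatic equivalent of uncommenting the worker
# lifetime block in the YAML above, with the high 24-hour value suggested
# in this thread.
import dask

dask.config.set({
    "distributed.worker.lifetime.duration": "24 hours",
    "distributed.worker.lifetime.stagger": "10 s",  # spread restarts out a little
    "distributed.worker.lifetime.restart": True,
})
```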
The pod showing as Running is the one that concerns me. Any that show as Completed will not be taking up any resources on the cluster.
I think the CommClosedError exceptions we are seeing are unrelated. The worker has lost its connection to the scheduler but is still trying to send keep-alive messages, which are failing. I've raised #3493 to try to resolve this.
The issue we are seeing here seems to be related to this loop in _register_with_scheduler.

The death timeout works using asyncio.wait_for, which can be foiled by blocking sync code. I suspect there is something in the _register_with_scheduler method which is blocking.
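To illustrate that point about asyncio.wait_for (a toy sketch, not the actual distributed code): a timeout cannot fire while synchronous code is blocking the event loop, it is only enforced once the coroutine reaches an await.

```python
# Sketch: why asyncio.wait_for cannot interrupt blocking synchronous code.
# The timeout only takes effect at an await point, so a sync call that never
# yields to the event loop (time.sleep here as a stand-in) blows past it.
import asyncio
import time

async def register():
    # stand-in for a registration coroutine that accidentally blocks
    time.sleep(5)           # blocks the event loop; the timeout cannot fire here
    await asyncio.sleep(1)  # cancellation can only land once we reach an await

async def main():
    start = time.monotonic()
    try:
        await asyncio.wait_for(register(), timeout=1)
    except asyncio.TimeoutError:
        # prints ~5s rather than 1s: the timeout was "foiled" by the sync sleep
        print(f"timed out after {time.monotonic() - start:.1f}s")

asyncio.run(main())
```

By the same logic, a blocking call inside _register_with_scheduler would let the worker sail well past its death timeout.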
Hello, I am quite new to dask, but I am trying to launch some workers on a SLURM cluster using dask distributed, and I get pretty much the same errors in my logs, while the workers continue to run even after my script was killed.
Was this issue fixed somehow, or is there a workaround? Thanks for your suggestions!
Here is the log of one of my workers. Let me know if you need more information.
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Registered to: tcp://192.168.104.12:35330
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection
distributed.worker - INFO - Connection to scheduler broken. Reconnecting...
distributed.worker - WARNING - Heartbeat to scheduler failed
tornado.application - ERROR - Exception in callback <function Worker._register_with_scheduler.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/opt/ebsofts/bokeh/1.3.4-foss-2019a-Python-3.7.2/lib/python3.7/site-packages/tornado/ioloop.py", line 743, in _run_callback
    ret = callback()
  File "/opt/ebsofts/bokeh/1.3.4-foss-2019a-Python-3.7.2/lib/python3.7/site-packages/tornado/ioloop.py", line 767, in _discard_future_result
    future.result()
  File "/opt/ebsofts/dask/2.3.0-foss-2019a-Python-3.7.2/lib/python3.7/site-packages/distributed/worker.py", line 868, in heartbeat
    metrics=await self.get_metrics(),
  File "/opt/ebsofts/dask/2.3.0-foss-2019a-Python-3.7.2/lib/python3.7/site-packages/distributed/core.py", line 747, in send_recv_from_rpc
    comm = await self.pool.connect(self.addr)
  File "/opt/ebsofts/dask/2.3.0-foss-2019a-Python-3.7.2/lib/python3.7/site-packages/distributed/core.py", line 874, in connect
    connection_args=self.connection_args,
  File "/opt/ebsofts/dask/2.3.0-foss-2019a-Python-3.7.2/lib/python3.7/site-packages/distributed/comm/core.py", line 227, in connect
    _raise(error)
  File "/opt/ebsofts/dask/2.3.0-foss-2019a-Python-3.7.2/lib/python3.7/site-packages/distributed/comm/core.py", line 204, in _raise
    raise IOError(msg)
OSError: Timed out trying to connect to 'tcp://192.168.104.12:35330' after 10 s: in <distributed.comm.tcp.TCPConnector object at 0x2b4680ce8b38>: ConnectionRefusedError: [Errno 111] Connection refused
tornado.application - ERROR - Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOLoop object at 0x2b464fa8f9e8>>, <Task finished coro=<Worker.heartbeat() done, defined at /opt/ebsofts/dask/2.3.$
Traceback (most recent call last):
  File "/opt/ebsofts/dask/2.3.0-foss-2019a-Python-3.7.2/lib/python3.7/site-packages/distributed/comm/core.py", line 215, in connect
    quiet_exceptions=EnvironmentError,
tornado.util.TimeoutError: Timeout
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/opt/ebsofts/bokeh/1.3.4-foss-2019a-Python-3.7.2/lib/python3.7/site-packages/tornado/ioloop.py", line 743, in _run_callback
    ret = callback()
  File "/opt/ebsofts/bokeh/1.3.4-foss-2019a-Python-3.7.2/lib/python3.7/site-packages/tornado/ioloop.py", line 767, in _discard_future_result
    future.result()
  File "/opt/ebsofts/dask/2.3.0-foss-2019a-Python-3.7.2/lib/python3.7/site-packages/distributed/worker.py", line 868, in heartbeat
    metrics=await self.get_metrics(),
  File "/opt/ebsofts/dask/2.3.0-foss-2019a-Python-3.7.2/lib/python3.7/site-packages/distributed/core.py", line 747, in send_recv_from_rpc
    comm = await self.pool.connect(self.addr)
  File "/opt/ebsofts/dask/2.3.0-foss-2019a-Python-3.7.2/lib/python3.7/site-packages/distributed/core.py", line 874, in connect
    connection_args=self.connection_args,
  File "/opt/ebsofts/dask/2.3.0-foss-2019a-Python-3.7.2/lib/python3.7/site-packages/distributed/comm/core.py", line 227, in connect
    _raise(error)
  File "/opt/ebsofts/dask/2.3.0-foss-2019a-Python-3.7.2/lib/python3.7/site-packages/distributed/comm/core.py", line 204, in _raise
    raise IOError(msg)
OSError: Timed out trying to connect to 'tcp://192.168.104.12:35330' after 10 s: in <distributed.comm.tcp.TCPConnector object at 0x2b4680ce86a0>: ConnectionRefusedError: [Errno 111] Connection refused
tornado.application - ERROR - Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOLoop object at 0x2b464fa8f9e8>>, <Task finished coro=<Worker.heartbeat() done, defined at /opt/ebsofts/dask/2.3.$
Traceback (most recent call last):
  File "/opt/ebsofts/dask/2.3.0-foss-2019a-Python-3.7.2/lib/python3.7/site-packages/distributed/comm/core.py", line 215, in connect
    quiet_exceptions=EnvironmentError,
tornado.util.TimeoutError: Timeout
Hey @2d1r. While similar, I think this is a different issue. My initial guess would be that your scheduler is being lost or killed somehow.
I recommend you raise a new issue with this problem and share more information on how you are constructing your cluster, preferably with a code example.
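For reference, the kind of minimal reproducer that helps here looks roughly like the sketch below (assuming dask-jobqueue's SLURMCluster is being used; all resource values are placeholders):

```python
# Minimal sketch of a useful reproducer: show how the cluster is built,
# how it is scaled, and what the client does. Values are placeholders.
from dask_jobqueue import SLURMCluster
from distributed import Client

cluster = SLURMCluster(
    cores=4,
    memory="16GB",
    walltime="01:00:00",
    death_timeout=60,   # seconds a worker waits for the scheduler before exiting
)
cluster.scale(jobs=2)   # submit two SLURM jobs' worth of workers

client = Client(cluster)
print(client.submit(sum, [1, 2, 3]).result())

client.close()
cluster.close()
```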
We recently encountered an issue on binderhub where a dask pod failed to terminate, resulting in a node running for hours:

The pod listed as Running had a log showing an infinite loop of CommClosedErrors:

And the pods listed as Completed had the following traceback in their logs:
Versions: