Open sk1p opened 1 month ago
Log from this error (hard to reproduce!):
[2024-05-21 11:13:55,880] INFO [libertem.web.state.snooze:106] Snoozing...
2024-05-21 11:13:55,894 - distributed.worker - ERROR - Failed to communicate with scheduler during heartbeat.
Traceback (most recent call last):
  File "/nobackup/mb265392/.pyenv/versions/3.11.6/envs/dask_heartbeat/lib/python3.11/site-packages/distributed/comm/tcp.py", line 225, in read
    frames_nosplit_nbytes_bin = await stream.read_bytes(fmt_size)
                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
tornado.iostream.StreamClosedError: Stream is closed

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/nobackup/mb265392/.pyenv/versions/3.11.6/envs/dask_heartbeat/lib/python3.11/site-packages/distributed/worker.py", line 1252, in heartbeat
    response = await retry_operation(
               ^^^^^^^^^^^^^^^^^^^^^^
  File "/nobackup/mb265392/.pyenv/versions/3.11.6/envs/dask_heartbeat/lib/python3.11/site-packages/distributed/utils_comm.py", line 452, in retry_operation
    return await retry(
           ^^^^^^^^^^^^
  File "/nobackup/mb265392/.pyenv/versions/3.11.6/envs/dask_heartbeat/lib/python3.11/site-packages/distributed/utils_comm.py", line 431, in retry
    return await coro()
           ^^^^^^^^^^^^
  File "/nobackup/mb265392/.pyenv/versions/3.11.6/envs/dask_heartbeat/lib/python3.11/site-packages/distributed/core.py", line 1395, in send_recv_from_rpc
    return await send_recv(comm=comm, op=key, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/nobackup/mb265392/.pyenv/versions/3.11.6/envs/dask_heartbeat/lib/python3.11/site-packages/distributed/core.py", line 1154, in send_recv
    response = await comm.read(deserializers=deserializers)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/nobackup/mb265392/.pyenv/versions/3.11.6/envs/dask_heartbeat/lib/python3.11/site-packages/distributed/comm/tcp.py", line 236, in read
    convert_stream_closed_error(self, e)
  File "/nobackup/mb265392/.pyenv/versions/3.11.6/envs/dask_heartbeat/lib/python3.11/site-packages/distributed/comm/tcp.py", line 142, in convert_stream_closed_error
    raise CommClosedError(f"in {obj}: {exc}") from exc
distributed.comm.core.CommClosedError: in <TCP (closed) ConnectionPool.heartbeat_worker local=tcp://10.8.164.164:56438 remote=tcp://10.8.164.164:42579>: Stream is closed
This appears to be the same issue as https://github.com/dask/distributed/issues/7891, although there it occurs in an interactive context.
These issues may be relevant: https://github.com/dask/distributed/issues/6384 https://github.com/dask/distributed/issues/6354
And more recently: https://github.com/dask/distributed/pull/8522
While testing for 0.14 (#1623), we hit an issue where a dask worker tried to heartbeat shortly (~10 ms) after snoozing. It would be good to write a reproducer and report this upstream. The impact was mostly limited to an error being printed to the log: the executor managed to unsnooze without issue afterwards.
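Since the visible symptom is mostly the error in the log, one possible stopgap until this is fixed upstream might be a logging filter on the `distributed.worker` logger that drops this one message while we are snoozing. This is only a rough sketch using the standard library: `SnoozeHeartbeatFilter`, `install_filter` and the `is_snoozing` callable are hypothetical names, not existing LiberTEM or distributed APIs, and the matched text is simply taken from the log above.

```python
import logging

# Message text copied from the log above; matching on it is an assumption
# and would break if distributed rewords the message.
HEARTBEAT_MSG = "Failed to communicate with scheduler during heartbeat"


class SnoozeHeartbeatFilter(logging.Filter):
    """Drop the heartbeat error record while `is_snoozing()` returns True.

    `is_snoozing` is a hypothetical zero-argument callable that the
    application would have to provide; all other records pass through.
    """

    def __init__(self, is_snoozing):
        super().__init__()
        self._is_snoozing = is_snoozing

    def filter(self, record):
        if self._is_snoozing() and HEARTBEAT_MSG in record.getMessage():
            return False  # suppress this record
        return True  # keep everything else


def install_filter(is_snoozing, logger_name="distributed.worker"):
    """Attach the filter to the logger named in the log output above."""
    flt = SnoozeHeartbeatFilter(is_snoozing)
    logging.getLogger(logger_name).addFilter(flt)
    return flt
```

This would only hide the message and does nothing about the closed comm itself, so a reproducer and an upstream report would still be needed. A filter attached to a logger only sees records logged directly on that logger, which matches the `distributed.worker` name shown in the log line.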