LiberTEM / LiberTEM

Open pixelated STEM framework
https://libertem.github.io/LiberTEM/
GNU General Public License v3.0

Snooze stability #1643

Open sk1p opened 1 month ago

sk1p commented 1 month ago

While testing for 0.14 (#1623), we hit an issue where a dask worker tried to send a heartbeat very shortly (~10 ms) after snoozing. It would be good to write a reproducer and report this upstream. The impact was mostly limited to an error being printed to the log: the executor managed to unsnooze without issue afterwards.
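
Before writing a real reproducer, the suspected race can be reasoned about with a toy model in plain asyncio. This is not dask code: `ToyScheduler`, `worker_heartbeats` and `simulate` are made-up names, and the assumption that the scheduler side closes before the worker's heartbeat loop has stopped is a guess based on the symptom, not something confirmed.

```python
import asyncio


class ConnectionClosed(Exception):
    """Stand-in for a closed-stream error (hypothetical, not a dask class)."""


class ToyScheduler:
    """Hypothetical scheduler stub: answers heartbeats until it is closed."""

    def __init__(self):
        self.closed = False

    async def heartbeat(self):
        if self.closed:
            raise ConnectionClosed("Stream is closed")
        return "ok"


async def worker_heartbeats(scheduler, interval, stop, log):
    """Periodic heartbeat loop that logs failures instead of crashing."""
    while not stop.is_set():
        try:
            log.append(await scheduler.heartbeat())
        except ConnectionClosed as exc:
            # Mirrors the observed behaviour: an error is logged, nothing else breaks
            log.append(f"ERROR: {exc}")
        try:
            # Sleep for one interval, but wake up early if asked to stop
            await asyncio.wait_for(stop.wait(), timeout=interval)
        except asyncio.TimeoutError:
            pass


async def simulate():
    log = []
    scheduler = ToyScheduler()
    stop = asyncio.Event()
    task = asyncio.create_task(worker_heartbeats(scheduler, 0.01, stop, log))
    await asyncio.sleep(0.025)   # a few healthy heartbeats
    scheduler.closed = True      # assumed: "snooze" closes the scheduler side first
    await asyncio.sleep(0.05)    # the worker has not been stopped yet and keeps beating
    stop.set()                   # only now is the worker loop told to stop
    await task
    return log


if __name__ == "__main__":
    print(asyncio.run(simulate()))
```

In this model the fix would be to stop the heartbeat loop before closing the scheduler side. Whether that ordering is what actually happens in dask during snoozing is exactly what a real reproducer would need to establish.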

matbryan52 commented 1 month ago

Log from this error (hard to reproduce!):

[2024-05-21 11:13:55,880] INFO [libertem.web.state.snooze:106] Snoozing...
2024-05-21 11:13:55,894 - distributed.worker - ERROR - Failed to communicate with scheduler during heartbeat.
Traceback (most recent call last):
  File "/nobackup/mb265392/.pyenv/versions/3.11.6/envs/dask_heartbeat/lib/python3.11/site-packages/distributed/comm/tcp.py", line 225, in read
    frames_nosplit_nbytes_bin = await stream.read_bytes(fmt_size)
                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
tornado.iostream.StreamClosedError: Stream is closed

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/nobackup/mb265392/.pyenv/versions/3.11.6/envs/dask_heartbeat/lib/python3.11/site-packages/distributed/worker.py", line 1252, in heartbeat
    response = await retry_operation(
               ^^^^^^^^^^^^^^^^^^^^^^
  File "/nobackup/mb265392/.pyenv/versions/3.11.6/envs/dask_heartbeat/lib/python3.11/site-packages/distributed/utils_comm.py", line 452, in retry_operation
    return await retry(
           ^^^^^^^^^^^^
  File "/nobackup/mb265392/.pyenv/versions/3.11.6/envs/dask_heartbeat/lib/python3.11/site-packages/distributed/utils_comm.py", line 431, in retry
    return await coro()
           ^^^^^^^^^^^^
  File "/nobackup/mb265392/.pyenv/versions/3.11.6/envs/dask_heartbeat/lib/python3.11/site-packages/distributed/core.py", line 1395, in send_recv_from_rpc
    return await send_recv(comm=comm, op=key, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/nobackup/mb265392/.pyenv/versions/3.11.6/envs/dask_heartbeat/lib/python3.11/site-packages/distributed/core.py", line 1154, in send_recv
    response = await comm.read(deserializers=deserializers)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/nobackup/mb265392/.pyenv/versions/3.11.6/envs/dask_heartbeat/lib/python3.11/site-packages/distributed/comm/tcp.py", line 236, in read
    convert_stream_closed_error(self, e)
  File "/nobackup/mb265392/.pyenv/versions/3.11.6/envs/dask_heartbeat/lib/python3.11/site-packages/distributed/comm/tcp.py", line 142, in convert_stream_closed_error
    raise CommClosedError(f"in {obj}: {exc}") from exc
distributed.comm.core.CommClosedError: in <TCP (closed) ConnectionPool.heartbeat_worker local=tcp://10.8.164.164:56438 remote=tcp://10.8.164.164:42579>: Stream is closed
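
As a quick check of the "small time delta" in this log, the two timestamps (the `Snoozing...` line and the `distributed.worker` error line) can be subtracted. This is only a sketch: the variable and function names are made up for illustration, and the timestamps are copied by hand from the log above.

```python
from datetime import datetime, timedelta

# Timestamps copied from the two log lines above
snooze_stamp = "2024-05-21 11:13:55,880"  # libertem.web.state.snooze "Snoozing..."
error_stamp = "2024-05-21 11:13:55,894"   # distributed.worker heartbeat error


def parse_log_time(stamp: str) -> datetime:
    # Python's logging writes milliseconds after a comma; %f pads them to microseconds
    return datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S,%f")


delta = parse_log_time(error_stamp) - parse_log_time(snooze_stamp)
delta_ms = delta // timedelta(milliseconds=1)  # whole milliseconds as an int
print(f"{delta_ms} ms")  # prints "14 ms"
```

So the failed heartbeat was logged 14 ms after the snooze message, consistent with the ~10 ms estimate.
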

matbryan52 commented 1 month ago

This appears to be the same issue: https://github.com/dask/distributed/issues/7891, although there it occurs in an interactive context.

These issues may also be relevant: https://github.com/dask/distributed/issues/6384 and https://github.com/dask/distributed/issues/6354

And, more recently: https://github.com/dask/distributed/pull/8522