dask / distributed

A distributed task scheduler for Dask
https://distributed.dask.org
BSD 3-Clause "New" or "Revised" License

Update to `Server.digests_total_since_heartbeat` during heartbeat may kill worker #8669

Closed by hendrikmakait 3 weeks ago

hendrikmakait commented 3 weeks ago

I don't have a reproducer, but here are some error logs from my test suite:

```
2024-06-03 19:40:34,118 - distributed.worker - ERROR - Unexpected exception during heartbeat. Closing worker.
Traceback (most recent call last):
  File "/Users/hendrikmakait/projects/dask/distributed/distributed/worker.py", line 1256, in heartbeat
    metrics=await self.get_metrics(),
            ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/hendrikmakait/projects/dask/distributed/distributed/worker.py", line 1043, in get_metrics
    for k, v in self.digests_total_since_heartbeat.items():
RuntimeError: dictionary changed size during iteration
2024-06-03 19:40:34,131 - distributed.worker - INFO - Stopping worker at tcp://127.0.0.1:52143. Reason: worker-heartbeat-error
```
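
For readers unfamiliar with this error class, here is a minimal standalone sketch of the failure mode in the traceback. It does not use `distributed` at all: `sum_digests_unsafe`, `sum_digests_snapshot`, and the `on_item` callback are hypothetical names made up for illustration, and the callback only stands in for whatever code path inserts a new key while the heartbeat is iterating. Iterating over a snapshot is one common way to avoid the error; it is not necessarily how this issue was resolved in the library.

```python
from typing import Callable, Optional


def sum_digests_unsafe(
    digests: dict[str, float],
    on_item: Optional[Callable[[str], None]] = None,
) -> float:
    """Iterate over the live dict view, like the loop in the traceback.

    If `on_item` (a hypothetical stand-in for a concurrent writer) adds a
    key, the next iteration step raises RuntimeError.
    """
    total = 0.0
    for key, value in digests.items():
        if on_item is not None:
            on_item(key)
        total += value
    return total


def sum_digests_snapshot(
    digests: dict[str, float],
    on_item: Optional[Callable[[str], None]] = None,
) -> float:
    """Iterate over a list copy of the items.

    Keys inserted during the loop do not affect the iteration; they are
    simply not part of this pass.
    """
    total = 0.0
    for key, value in list(digests.items()):
        if on_item is not None:
            on_item(key)
        total += value
    return total


if __name__ == "__main__":
    digests = {"a": 1.0, "b": 2.0}

    def add_key(key: str) -> None:
        # Simulates an update arriving in the middle of the iteration.
        digests.setdefault(f"new-{key}", 0.5)

    try:
        sum_digests_unsafe(digests, add_key)
    except RuntimeError as exc:
        print(f"unsafe iteration failed: {exc}")

    print("snapshot iteration total:", sum_digests_snapshot(digests, add_key))
```

The snapshot version trades a small copy per heartbeat for robustness against concurrent inserts. Whether the concurrent writer in the real worker is another thread or another code path is not established by the logs above.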