dask / distributed

A distributed task scheduler for Dask
https://distributed.dask.org
BSD 3-Clause "New" or "Revised" License

Network glitch during scatter() may result in memory leak #6412

Open crusaderky opened 2 years ago

crusaderky commented 2 years ago

If scatter fails either between these two lines (direct=True) https://github.com/dask/distributed/blob/9bb999d4b66670570d9d8f53d04501dec6a25e7e/distributed/client.py#L2275-L2281

or these (direct=False) https://github.com/dask/distributed/blob/9bb999d4b66670570d9d8f53d04501dec6a25e7e/distributed/scheduler.py#L4978-L4982

e.g. because the network falls over after the data has reached the worker but before the worker can acknowledge over the RPC channel, then the scheduler will never learn about the data and the worker will not inform it. The data just sits there consuming memory, invisible to the scheduler, and only a new scatter/compute of the same key on the same worker will fix the issue.
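A minimal sketch of the race, using hypothetical stand-in classes rather than the real `distributed` internals: the data lands on the worker first, the RPC reply is lost, and the scheduler-side bookkeeping never happens.

```python
# Hypothetical stand-ins (not the real Worker/Scheduler classes) showing
# the failure window: data reaches the worker, but the OK reply is lost
# before the scheduler records the key.

class Worker:
    def __init__(self):
        self.data = {}          # in-memory store, analogous to Worker.data

    def receive_scatter(self, key, value):
        self.data[key] = value  # data lands on the worker first
        return "OK"             # ...but this reply may never arrive


class Scheduler:
    def __init__(self):
        self.tasks = {}         # keys the scheduler knows about

    def update_data(self, worker, keys):
        for key in keys:
            self.tasks[key] = worker


def scatter(scheduler, worker, key, value, network_fails=False):
    worker.receive_scatter(key, value)    # step 1: data reaches the worker
    if network_fails:
        raise OSError("connection lost")  # step 2: RPC reply is lost
    scheduler.update_data(worker, [key])  # step 3: never runs on failure


scheduler, worker = Scheduler(), Worker()
try:
    scatter(scheduler, worker, "x", [1, 2, 3], network_fails=True)
except OSError:
    pass

# The worker holds the data, but the scheduler has no record of it:
leaked = set(worker.data) - set(scheduler.tasks)
print(leaked)  # {'x'}
```

Re-running `scatter` for the same key on the same worker would call `update_data` and register the key, which is why a repeated scatter/compute clears the leak.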

fjetter commented 2 years ago

If the worker disconnects, we're closing it now. Is this problem still possible? I would expect either one of the calls to raise an exception or the key to be transitioned to lost.

crusaderky commented 2 years ago

I expect we're not closing the worker on the scheduler if an RPC channel between the client and the worker dies.

> I would expect either one of the calls to raise an exception or the key to be transitioned to lost

Precisely, if scatter_to_workers raises an exception, update_data is never called. There's nothing to transition, as the scheduler is completely unaware of the key.
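This control flow can be sketched with simplified stubs (the names mirror the real functions, but the bodies are hypothetical): when `scatter_to_workers` raises, execution never reaches `update_data`, so the scheduler has no key to transition.

```python
# Simplified stubs illustrating the flow crusaderky describes: if
# scatter_to_workers raises, update_data is never called, so the
# scheduler remains unaware of the key.

calls = []

def scatter_to_workers(data):
    calls.append("scatter_to_workers")
    raise OSError("comm failed mid-scatter")  # simulated network failure

def update_data(keys):
    calls.append("update_data")               # would register the keys

def scatter(data):
    # Simplified shape of the scheduler-side scatter path: update_data
    # only runs if scatter_to_workers returns cleanly.
    scatter_to_workers(data)
    update_data(list(data))

try:
    scatter({"x": 123})
except OSError:
    pass

print(calls)  # ['scatter_to_workers'] -- update_data never ran
```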