dask / distributed

A distributed task scheduler for Dask
https://distributed.dask.org

Restart P2P if cluster significantly grew in size #8673

Open · fjetter opened 1 week ago

fjetter commented 1 week ago

A downside of the new P2P algorithm is that it locks in the set of participating workers as soon as the very first P2P task starts executing on a worker. This is a well-known problem for downscaling clusters (and also when a worker dies), and it is currently handled by restarting the entire P2P run.

For upscaling clusters, no such logic is currently implemented. New workers are only allowed to participate in new P2P runs or in non-P2P tasks.

This is particularly disturbing if one starts a cluster with only one or even zero workers and expects adaptivity to handle the scaling. The most likely failure in this situation is that the entire P2P operation is pinned to a single worker, which eventually dies with an out-of-disk exception (unless the dataset is small, of course).

In the past we discussed more sophisticated implementations involving ring hashing that would let us resume work, but I would like to explicitly declare this out of scope for the moment and instead pursue a simpler approach.

With the tools available to us, I assume the easiest way to do this would be to restart a P2P operation whenever a certain heuristic is true.

For example: if the cluster size increased by X% and P2P transfer progress is below Y%, restart the P2P operation.

This heuristic should capture the cases where restarting would finish at least as quickly as waiting would.
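
A minimal sketch of what such a check could look like (the function name, arguments, and default thresholds are all hypothetical, not existing distributed APIs; `transfer_progress` is the fraction of completed transfer tasks):

```python
def should_restart_p2p(
    workers_at_start: int,
    workers_now: int,
    transfer_progress: float,
    growth_threshold: float = 0.5,     # X: restart only if the cluster grew by >= 50%
    progress_threshold: float = 0.25,  # Y: ...and less than 25% of transfers are done
) -> bool:
    """Hypothetical heuristic: restart a running P2P operation if the
    cluster grew significantly while the transfer phase is still early."""
    growth = (workers_now - workers_at_start) / max(workers_at_start, 1)
    return growth >= growth_threshold and transfer_progress < progress_threshold
```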

fjetter commented 1 week ago

I think a very simple heuristic like

- Rechunk still below 25% (I actually expect the transfer phase to scale linearly, so we could crank this up to 50% or even further)

would go a long way to avoid the very obvious failure case described here https://discourse.pangeo.io/t/rechunking-large-data-at-constant-memory-in-dask-experimental/3266/8?u=fjetter, where the user starts with one worker and runs into an error.

We can still fine-tune this later. When implementing it, we should also think about which metrics would be useful for further tuning.
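
For intuition on why low progress favors restarting, here is a rough back-of-envelope model (assuming the transfer phase scales linearly with worker count, as suggested above, and ignoring the cost of the restart itself; the function is illustrative only):

```python
def restart_beats_waiting(n_old: int, n_new: int, progress: float) -> bool:
    """Finishing the remaining (1 - progress) of the transfer on the
    locked-in n_old workers takes (1 - progress) / n_old time units;
    redoing everything on n_new workers takes 1 / n_new.  Restarting
    wins when progress < 1 - n_old / n_new."""
    return progress < 1 - n_old / n_new
```

For a cluster that doubled (`n_new == 2 * n_old`), the break-even point is 50% progress, which is consistent with the remark above that a 25% threshold is conservative and could be cranked up.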

hendrikmakait commented 1 week ago

> Rechunk still below 25% (I actually expect the transfer phase to scale linearly, so we could crank this up to 50% or even further)

"below 25%" as in less than 25% of the transfer tasks have been completed?

fjetter commented 1 week ago

"below 25%" as in less than 25% of the transfer tasks have been completed?

yes