Restart P2P on `OSError: [Errno 28] No space left on device` instead of failing if the cluster has grown since we started

hendrikmakait commented 3 months ago

The idea here is similar to https://github.com/dask/distributed/issues/8673:

Since P2P fixes the set of involved workers during the initialization of a shuffle run, we don't benefit from workers who join the cluster afterward. This is particularly important because P2P can't succeed if the sum of available disk space across all involved workers is smaller than the size of the (serialized) data.

Even if we don't hit the heuristic suggested in https://github.com/dask/distributed/issues/8673, I think we should restart a P2P operation if the disk buffer on an involved worker encounters a OSError: [Errno 28] No space left on device and the worker count has grown since we started. We should add a circuit-breaker to this similar to the suspicious_count to avoid errors that are genuinely caused by inhomogeneous partitions (or because the cluster refuses to scale to a sufficient size).

fjetter commented 3 months ago

I'm mildly concerned about the suspicious count logic and what that would entail (from a complexity perspective) but otherwise +1

hendrikmakait commented 3 months ago

I'm mildly concerned about the suspicious count logic and what that would entail (from a complexity perspective) but otherwise +1

The circuit-breaker logic is the thing I'm least concerned about. I'd keep this pretty simple: Add yet-another dictionary to the scheduler plugin that keeps a count which we increase every time we restart due to an Errno28. Don't restart if we exceed some threshold. Done. (Clear out when we clear out all the other stuff.)

dask / distributed

Restart P2P on `OSError: [Errno 28] No space left on device` instead of failing if the cluster has grown since we started #8674