dask / distributed

A distributed task scheduler for Dask
https://distributed.dask.org
BSD 3-Clause "New" or "Revised" License

[P2P] Be robust to timeouts during shuffle barrier #8659

Open · hendrikmakait opened 3 weeks ago

hendrikmakait commented 3 weeks ago

Similar to #8011 (https://github.com/dask/distributed/issues/8011), the shuffle barrier task is not robust to timeouts. We should add a similar retry mechanism here.
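A retry mechanism of the kind requested here could look roughly like the following minimal sketch. It assumes the barrier is awaited as an async call that raises `asyncio.TimeoutError` (or `OSError`) on transient failures, and that repeating the call is safe (idempotent). `retry_with_backoff` and `flaky_barrier` are hypothetical names, not part of the `distributed` API, the backoff parameters are illustrative, and this does not reproduce the mechanism from #8011.

```python
import asyncio
import random
from typing import Awaitable, Callable, Tuple, Type, TypeVar

T = TypeVar("T")


async def retry_with_backoff(
    make_call: Callable[[], Awaitable[T]],
    *,
    attempts: int = 5,
    delay_min: float = 0.1,
    delay_max: float = 2.0,
    jitter_fraction: float = 0.1,
    retry_on: Tuple[Type[BaseException], ...] = (asyncio.TimeoutError, OSError),
) -> T:
    """Await make_call(), retrying transient errors with exponential backoff.

    Hypothetical helper. make_call must build a fresh awaitable on every
    invocation, because a coroutine object can only be awaited once.
    """
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    delay = delay_min
    for attempt in range(1, attempts + 1):
        try:
            return await make_call()
        except retry_on:
            if attempt == attempts:
                raise  # out of retries: surface the last error to the caller
            # Jitter the sleep so retries from many tasks do not line up.
            await asyncio.sleep(delay * (1 + random.uniform(-jitter_fraction, jitter_fraction)))
            delay = min(delay * 2, delay_max)
    raise AssertionError("unreachable")


async def main() -> Tuple[str, int]:
    calls = 0

    async def flaky_barrier() -> str:
        # Hypothetical stand-in for the barrier call: times out twice, then succeeds.
        nonlocal calls
        calls += 1
        if calls < 3:
            raise asyncio.TimeoutError("barrier timed out")
        return "barrier complete"

    result = await retry_with_backoff(flaky_barrier, delay_min=0.01)
    return result, calls


if __name__ == "__main__":
    print(asyncio.run(main()))  # prints ('barrier complete', 3)
```

Passing a zero-argument factory instead of a coroutine object is deliberate: each retry needs a fresh call. Once the attempts are exhausted, the last timeout is re-raised, so a persistently unreachable barrier still fails loudly instead of hanging.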

hendrikmakait commented 3 weeks ago

@charlesbluca: Since you implemented #8011, would you be interested in working on this, if you have the capacity?