dask / distributed

A distributed task scheduler for Dask
https://distributed.dask.org
BSD 3-Clause "New" or "Revised" License
1.55k stars 712 forks source link

P2P tasks log compute failures even if they are later restarted #8679

Open hendrikmakait opened 2 weeks ago

hendrikmakait commented 2 weeks ago

Problem

There is a race condition in P2P that causes tasks to log compute failures on the worker even though those tasks will get restarted later on and then succeed. This happens when:

  1. A worker involved in the P2P operation is removed
  2. We restart the P2P operation on the scheduler and schedule the messages to be sent to the workers
  3. A task on worker A is not cancelled yet, but its RPC calls fail because the remote worker B has already closed the shuffle run, throwing a P2PConsistencyError
  4. The task raises the P2PConsistencyError and fails while still seen as executing by worker A, which causes the error to get logged.

Solution

Instead of failing directly on a P2PConsistencyError, the task could double-check with the scheduler whether its shuffle run is still supposed to be active. If not, it could instead silently succeed as the result will get rejected by the scheduler as outdated.