Problem

There is a race condition in P2P that causes tasks to log compute failures on the worker even though those tasks will get restarted later on and then succeed. This happens when:

1. A worker involved in the P2P operation is removed.
2. The P2P operation is restarted on the scheduler, and the corresponding messages to the workers are scheduled to be sent.
3. A task on worker A has not been cancelled yet, but its RPC calls fail with a P2PConsistencyError because remote worker B has already closed the shuffle run.
4. The task raises the P2PConsistencyError and fails while worker A still considers it to be executing, which causes the error to be logged.

Solution

Instead of failing directly on a P2PConsistencyError, the task could double-check with the scheduler whether its shuffle run is still supposed to be active. If it is not, the task could silently succeed instead, since the scheduler will reject its result as outdated anyway.
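
To make the idea concrete, here is a minimal, self-contained sketch of the proposed check. It does not use the real distributed internals: `StubScheduler`, `is_run_active`, `run_p2p_task`, and the `STALE` sentinel are hypothetical names introduced only for illustration, and `P2PConsistencyError` is redefined locally as a stand-in for the real exception.

```python
import asyncio
import logging

logger = logging.getLogger("p2p_sketch")


class P2PConsistencyError(RuntimeError):
    """Local stand-in for the error raised when a peer has closed the shuffle run."""


class StubScheduler:
    """Hypothetical scheduler stub that tracks the active run ID per shuffle."""

    def __init__(self):
        self.active_runs = {}

    async def is_run_active(self, shuffle_id, run_id):
        # Assumed query: is `run_id` still the current run of this shuffle?
        return self.active_runs.get(shuffle_id) == run_id


# Hypothetical sentinel for "this result belongs to an outdated run".
STALE = object()


async def run_p2p_task(scheduler, shuffle_id, run_id, send_shards):
    """Run the task body, suppressing consistency errors from stale runs."""
    try:
        return await send_shards()
    except P2PConsistencyError:
        if await scheduler.is_run_active(shuffle_id, run_id):
            # The run is still current, so this is a genuine failure.
            raise
        # The run was restarted; the scheduler would reject the result anyway.
        logger.debug("Ignoring error from stale run %s of %s", run_id, shuffle_id)
        return STALE


async def main():
    scheduler = StubScheduler()
    scheduler.active_runs["shuffle-1"] = 2  # restarted, so run 1 is stale

    async def failing_send():
        raise P2PConsistencyError("remote worker closed the shuffle run")

    # Stale run: the error is swallowed and the task "succeeds".
    stale_result = await run_p2p_task(scheduler, "shuffle-1", 1, failing_send)

    # Current run: the error still propagates.
    try:
        await run_p2p_task(scheduler, "shuffle-1", 2, failing_send)
    except P2PConsistencyError:
        current_raised = True
    else:
        current_raised = False

    return stale_result is STALE, current_raised


if __name__ == "__main__":
    print(asyncio.run(main()))
```

The sketch only suppresses the error after confirming with the scheduler that the run is outdated, so genuine consistency errors on an active run would still fail the task and be logged.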

dask / distributed

P2P tasks log compute failures even if they are later restarted #8679
