dask / distributed

A distributed task scheduler for Dask
https://distributed.dask.org
BSD 3-Clause "New" or "Revised" License

Better exception if scheduler disconnects from client #8690

Closed: fjetter closed this issue 2 days ago

fjetter commented 2 weeks ago

If the connection between the scheduler and the client is lost (e.g. because the scheduler dies), the client enters a reconnect loop to reestablish the connection. If the scheduler is still alive, users will not notice this failure unless they are working with previously created Futures. Those Futures are cancelled automatically as soon as the client initiates a reconnect (see here).

The next time such a Future is used, it raises a CancelledError(<key>) without further context, and users frequently cannot tell what this means.

Instead, the user should receive an informative message telling them to check on their scheduler.

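import pytest
from distributed import CancelledError, wait
from distributed.utils_test import gen_cluster, inc

@gen_cluster(client=True)
async def test_client_scheduler_lost_sane_exception(c, s, a, b):
    # Create a future and wait for it to finish while the scheduler is up.
    fut = c.submit(inc, 1)
    await wait(fut)

    # Simulate losing the scheduler.
    await s.close()

    # The error should point the user at the lost scheduler connection.
    with pytest.raises(CancelledError, match='connection to scheduler'):
        await fut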

This issue is particularly troublesome when the user is not working with futures directly and the futures are instead embedded in a persisted collection, because the cancellation renders the entire collection unusable.

fjetter commented 2 weeks ago

A fairly straightforward way to improve this is to let the Future.cancel method, which is invoked in the reconnect method, accept an exception or message that is then forwarded and raised to the user.
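To make the proposal concrete, here is a minimal sketch of the idea using a toy class. `ToyFuture`, its `reason` parameter, and `on_reconnect` are hypothetical names chosen for illustration; they are not the actual `distributed` API, and the real change may look different.

```python
from concurrent.futures import CancelledError


class ToyFuture:
    """Hypothetical stand-in for a client-side future (not distributed.Future)."""

    def __init__(self, key):
        self.key = key
        self.status = "pending"
        self._reason = None

    def cancel(self, reason=None):
        # Remember why the future was cancelled so it can be reported later.
        self.status = "cancelled"
        self._reason = reason

    def result(self):
        if self.status == "cancelled":
            message = self.key
            if self._reason:
                message = f"{self.key} cancelled for reason: {self._reason}"
            raise CancelledError(message)
        raise RuntimeError("result not available in this toy example")


def on_reconnect(futures):
    # Hypothetical reconnect hook: cancel every known future and say why.
    reason = "lost connection to scheduler; check that the scheduler is still running"
    for fut in futures:
        fut.cancel(reason=reason)


fut = ToyFuture("inc-1234")
on_reconnect([fut])
try:
    fut.result()
except CancelledError as exc:
    # The message now carries both the key and the cause of the cancellation.
    error_text = str(exc)
```

With this shape, the exception keeps its type, so existing `except CancelledError` handlers would continue to work, while the message gives the user a hint about what to check.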