dask / distributed

A distributed task scheduler for Dask
https://distributed.dask.org
BSD 3-Clause "New" or "Revised" License
1.58k stars 718 forks source link

Worker addresses are treated as unique identifiers, but may not be #6392

Open gjoseph92 opened 2 years ago

gjoseph92 commented 2 years ago

The __hash__ of a WorkerState object is just its address: https://github.com/dask/distributed/blob/33fc50ca9817216bb4105b68f5e0859ebfb80fdb/distributed/scheduler.py#L480

As is the equality check (https://github.com/dask/distributed/pull/3321 https://github.com/dask/distributed/pull/3483): https://github.com/dask/distributed/blob/33fc50ca9817216bb4105b68f5e0859ebfb80fdb/distributed/scheduler.py#L501-L504

And in general, there are a number of places where we store things in dicts keyed by worker address, and assume that if ws.address in self.workers, then ws is self.workers[ws.address]. (stealing.py is especially guilty—most of its logic is basically built around this.)

However, it's completely valid for a worker to disconnect, then for a new worker to connect from the same address. (Even with reconnection removed https://github.com/dask/distributed/pull/6361, a Nanny https://github.com/dask/distributed/issues/6387 or a user script could do this.) These are logically different workers, though they happen to have the same address.

This can cause:

Outcomes:

Alternatives:

Causes https://github.com/dask/distributed/issues/6356, https://github.com/dask/distributed/issues/3256, https://github.com/dask/distributed/issues/6263, maybe https://github.com/dask/distributed/issues/3892

cc @crusaderky @fjetter @bnaul

fjetter commented 2 years ago

I think the first step to make the hash a bit more reliable is easy https://github.com/dask/distributed/pull/6398

changing the usage of the address is a bit more elaborate but I'm also in favor. I think we should rather use Worker.id in the entire code base. I don't think this is a very disruptive change, it is just a bit of work

crusaderky commented 2 years ago

I suspect it's going to be a lot of legwork. But I agree it's the sane thing to do.

fjetter commented 2 years ago

In https://github.com/dask/distributed/pull/6585 I'm extending the hash to incorporate a unique counter. apart from hash collisions, the hash function should now be sufficiently unique. We might even considering defining the hash as only the id.

I'm not sure if there is anything else we should do. Replacing all usage of address vs id/hash seems to be a bit excessive after giving it a bit more thought and would not protect us from state drift between scheduler and an extension as it is the case in the stealing deadlock. as long as we can rely on the equal methods, I'm good. thoughts?

hendrikmakait commented 2 years ago

I'm not sure if there is anything else we should do. Replacing all usage of address vs id/hash seems to be a bit excessive after giving it a bit more thought

I think that we should eventually go through the legwork of cleaning up the use of address as an identifier. This will define the best practice of identifying workers. Diving into the codebase, I have been under the impression that this is fine given that workers are quite often identified by their addresses or stored in dicts keyed by them. I suspect we will also find a number of places where we carelessly check for addresses where an ID-based check would have been needed.

and would not protect us from state drift between scheduler and an extension as it is the case in the stealing deadlock.

I'm not sure if I'm following, wasn't the fix in #6585 quite literally replacing an address-based check with an equality-based (i.e., ID-based) one?