gjoseph92 opened this issue 3 years ago
Thanks for the detailed description and example test @gjoseph92!

As mentioned offline, when `get_worker` is called from inside a task, `thread_state.execution_state["worker"]` should point to the corresponding worker which is running the task. In the case that `get_worker` is called outside of a task, what should the "correct" worker be?
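For context, a rough paraphrase of the lookup being discussed (a sketch, not the exact distributed source; in the real code `thread_state` is a module-level thread local and the fallback also filters on worker status):

```python
import threading

from distributed import Worker

thread_state = threading.local()  # stand-in for distributed's module-level thread local


def get_worker():
    try:
        # Inside a task: the executing thread records which worker is
        # running it, so this is always the right worker.
        return thread_state.execution_state["worker"]
    except AttributeError:
        # Outside a task: fall back to any live in-process Worker. With an
        # async local cluster every worker shares the process, so this
        # returns the same instance regardless of the caller.
        for w in Worker._instances:
            return w
        raise ValueError("No workers found")
```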
Just adding a note here since I didn't see it written down: setting `thread_state.execution_state["worker"]` while deserializing (and serializing) on a worker would probably alleviate most of the problems we see with this issue. It typically seems to come up with stateful things that interact with worker machinery like Actors, ShuffleService, etc. that define a custom `__setstate__` which tries to store the current `get_worker()` in an instance variable when unpickled.
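To illustrate that pattern, a hypothetical stateful class (not from the distributed codebase) whose `__setstate__` grabs the current worker when it is unpickled:

```python
from distributed import get_worker


class WorkerBoundThing:
    """Hypothetical object that binds itself to the worker it lands on."""

    def __init__(self):
        self.worker = None

    def __getstate__(self):
        # The worker handle itself is not picklable; drop it.
        return {}

    def __setstate__(self, state):
        # Runs while the object is deserialized on the receiving worker.
        # With an async local cluster, get_worker() here may return a
        # different worker than the one actually doing the deserializing.
        self.worker = get_worker()
```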
Split out from https://github.com/dask/distributed/pull/4937#issuecomment-866234111.
What happened:
When using a local cluster in async mode, `get_worker` always returns the same Worker instance, no matter which worker it's being called within.

What you expected to happen:

`get_worker` to return the worker it's actually being called within.
Minimal Complete Verifiable Example:
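(The original example is not reproduced here; the following is a minimal sketch of the kind of test that shows the behavior, assuming distributed's `gen_cluster` test utility and that `get_worker` is reached outside of a task, e.g. via `Client.run`.)

```python
from distributed import get_worker
from distributed.utils_test import gen_cluster


@gen_cluster(client=True)
async def test_get_worker_identity(c, s, a, b):
    # Client.run executes the function on each worker outside of a task,
    # so get_worker() takes the Worker._instances fallback path.
    results = await c.run(lambda: get_worker().address)

    # Expected: each worker reports its own address. Observed with an
    # async local cluster: both report the same worker.
    assert results[a.address] == a.address
    assert results[b.address] == b.address
```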
Anything else we need to know?:
This probably almost never affects users directly. But since most tests use an async local cluster with `@gen_cluster`, I'm concerned about what edge-case behavior we might be testing incorrectly.

Also, note that the same issue affects `get_client`. This feels a tiny bit less bad (at least it's always the right client, unlike `get_worker`), but it can still have some strange effects. In multiple places, worker code updates the default Client instance assuming it's in a separate process. With multiple workers trampling the default client, I wonder if this affects tests around advanced secede/client-within-task workloads.

I feel like the proper solution here would be to set a contextvar for the current worker that's updated as we context-switch in and out of that worker (see the sketch below). Identifying the points where those switches have to happen seems tricky, though.
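A minimal sketch of that idea, using hypothetical names (the hard part, resetting the variable at every switch into and out of a worker, is not shown):

```python
import contextvars

# Hypothetical contextvar tracking "the worker whose context we are in".
_current_worker = contextvars.ContextVar("current_worker")


def get_worker():
    try:
        return _current_worker.get()
    except LookupError:
        raise ValueError("No worker found in the current context") from None


# Each worker would set/reset the variable as control enters and leaves its
# context, e.g. around handler and task-execution boundaries:
#   token = _current_worker.set(worker)
#   try:
#       ...
#   finally:
#       _current_worker.reset(token)
```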
I also think it would be reasonable for `get_worker` to error if `len(Worker._instances) > 1` (rough sketch below).
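A rough sketch of that guard, slotted into the fallback path of the paraphrase shown earlier (hypothetical wording of the error):

```python
from distributed import Worker

# Inside the fallback branch of get_worker(), before picking an instance:
if len(Worker._instances) > 1:
    raise ValueError(
        "get_worker() was called outside of a task while multiple "
        "in-process workers exist; cannot tell which worker is intended"
    )
```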
Environment: