Open pfackeldey opened 3 years ago
Thanks for raising an issue @pfackeldey. Looking at your example, `client.who_has` doesn't support an `asynchronous` keyword argument. Whether client operations are blocking or asynchronous is dictated by the `asynchronous=` keyword argument when constructing your `Client` object (see https://distributed.dask.org/en/latest/client.html#async-await-operation; the default value is `asynchronous=False`). If you need your client operations to be asynchronous, you should pass `asynchronous=True` when you create the `Client`, instead of to individual methods.
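The construction-time switch can be illustrated with a small self-contained sketch. `MiniClient` below is purely illustrative and is not the real `distributed.Client`; it only mimics the pattern where the `asynchronous=` flag chosen at construction decides whether a method blocks or hands back an awaitable:

```python
import asyncio

class MiniClient:
    """Illustrative sketch: whether methods block or return awaitables
    is decided once, at construction time, not per call."""

    def __init__(self, asynchronous=False):
        self.asynchronous = asynchronous

    async def _who_has(self):
        # Stand-in for a round trip to the scheduler.
        await asyncio.sleep(0)
        return {"key-1": ["worker-a"]}

    def who_has(self):
        if self.asynchronous:
            # Async mode: hand the coroutine back for the caller to await.
            return self._who_has()
        # Sync mode: drive the coroutine to completion and block.
        return asyncio.run(self._who_has())

# Blocking usage:
sync_result = MiniClient(asynchronous=False).who_has()

# Async usage -- the very same method must now be awaited:
async def main():
    return await MiniClient(asynchronous=True).who_has()

async_result = asyncio.run(main())
print(sync_result == async_result)  # True
```

The point of the design is that a single flag keeps every method's calling convention consistent, rather than each call site deciding independently.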
I should also add that active memory management in Dask is actively being worked on (xref https://github.com/dask/distributed/issues/4982), so this type of replica tracking should be much more transparent in the future.
Thank you for your fast reply @jrbourbeau !
The documentation link https://distributed.dask.org/en/latest/client.html#async-await-operation states in the second part:
If you want to reuse the same client in asynchronous and synchronous environments you can apply the asynchronous=True keyword at each method call.
Thus I expected the above-mentioned code to work. Otherwise, I'd be happy to help update or remove this part of the documentation so that others won't run into the same misunderstanding. What is your opinion here?
Also thank you very much for pointing to the ongoing work on the active memory management, I'll keep an eye on this!
Best, Peter
What happened:
Dear `dask-distributed` developers,

First of all, thank you for this wonderful project! We are using `dask-distributed` on our local HTCondor computing cluster. In our use case we periodically kill and spawn `dask-worker`s in HTCondor jobs, so that HTCondor jobs from other users can slide in between our computing runs. We also need to work with heavy input, which we distribute to the `dask-worker`s beforehand using `client.scatter`. Of course we want to replicate this input as soon as new `dask-worker`s are spawned, so we added an asynchronous periodic callback to the `client`'s IOLoop which takes care of this replication. Unfortunately we noticed that the `client.who_has(..., asynchronous=True)` call deadlocks our scheduler (unfortunately without a stack trace); any connection to the scheduler then results in a timeout.

What you expected to happen:
We expected that we could add an asynchronous callback, which uses `client.who_has(..., asynchronous=True)`, to the `client`'s IOLoop without deadlocking the scheduler.

Minimal Complete Verifiable Example:
This is a minimal reproducible example which shows the above-mentioned problem. Since it also happens on a `LocalCluster`, the problem seems to be batch-system-agnostic.

The output (only once!):

Afterwards the scheduler is stuck.
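The periodic-replication pattern described in the report can be sketched with plain asyncio. `FakeClient` and `replicate_to_new_workers` below are placeholders standing in for the real `distributed.Client` calls; this is not the original reproducer, only the shape of one tick of such a callback:

```python
import asyncio

class FakeClient:
    """Placeholder for distributed.Client; tracks which workers hold a key."""
    def __init__(self):
        self.replicas = {"big-input": {"worker-a"}}
        self.workers = {"worker-a", "worker-b", "worker-c"}

    async def who_has(self, key):
        await asyncio.sleep(0)  # stand-in for a scheduler round trip
        return self.replicas[key]

    async def replicate(self, key, worker):
        await asyncio.sleep(0)
        self.replicas[key].add(worker)

async def replicate_to_new_workers(client, key):
    """One tick of the periodic callback: copy `key` to workers missing it."""
    holders = await client.who_has(key)
    for worker in client.workers - holders:
        await client.replicate(key, worker)

async def main():
    client = FakeClient()
    # A real deployment would schedule this on the client's IOLoop
    # (e.g. with tornado's PeriodicCallback); here we just run two ticks.
    for _ in range(2):
        await replicate_to_new_workers(client, "big-input")
        await asyncio.sleep(0.01)
    return client.replicas["big-input"]

holders = asyncio.run(main())
print(sorted(holders))  # ['worker-a', 'worker-b', 'worker-c']
```

In the actual issue, the equivalent `who_has` call inside such a callback is what deadlocked the scheduler.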
Anything else we need to know?:
-
Environment:
`client.get_versions(check=True)` does not throw an error and outputs:

Thank you very much in advance for your input and help!
Best, Peter