Hello Ruben, thank you for your question. You should be able to do so. I tested this by running the quickstart and attempting to connect to the cluster using its name while it was computing the tip amount's mean.
It might be useful to know what the Python process is doing. I had a look at your logs and saw a KeyError on some workers, followed by a "TLS handshake failed with remote" log entry.
Looking at some worker logs I also noticed this message:
Event loop was unresponsive in Worker for 14.20s. This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
My guess is that the Python process made the event loop unresponsive around the time you tried to connect, which would explain the SSL handshake failure and the timeout.
Could you launch a cluster, run some computation, and then try to connect to the same cluster in the middle of that computation? If the connection times out again in that situation, that would support my suspicion 🤔
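A minimal sketch of what I mean (the cluster name, sizes, and dummy computation are illustrative, not code from this thread):

# Session 1: start a named cluster and keep the workers busy
import coiled
import dask.array as da
from distributed import Client

cluster = coiled.Cluster(name="test-cluster", n_workers=4)
client = Client(cluster)

x = da.random.random((100_000, 100_000), chunks=(5_000, 5_000))
running = client.compute(x.mean())  # long-running work that occupies the workers

# Session 2 (a separate Python process): attach to the same cluster by name
# while the computation above is still running
import coiled
from distributed import Client

cluster = coiled.Cluster(name="test-cluster")
client = Client(cluster)
print(client.scheduler_info()["workers"].keys())  # did we connect, or time out?

If the second session only times out while the first one is mid-computation, that would point towards the unresponsive event loop.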
Thanks! Will look into it. The workers are indeed reading large (compressed) chunks of data.
Ruben, have you seen this issue pop up again? Or were you able to run your computations without issues?
Thanks for checking in. We are still somewhat unsure how/where to set the configuration (e.g. timeouts) for distributed clusters.
We now do something along the lines of:
import coiled
from distributed import Client

cluster = coiled.Cluster(name="clustert")
client = Client(cluster)

def set_config(key: str, value: str) -> str:
    import dask  # imported inside the function so it resolves on the remote process
    dask.config.set({key: value})
    return dask.config.get(key)

# apply the setting on the scheduler and on every worker
client.run_on_scheduler(set_config, "distributed.comm.timeouts.connect", "300s")
client.run(set_config, "distributed.comm.timeouts.connect", "300s")
which seems to work, but we're still unsure if this is the proper way to go.
What is best practice for setting dask config? Should it be set on the workers as well? Also, you can set a timeout via Client, but that seems to be ignored for a distributed cluster.
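For reference, this is roughly what we tried locally before settling on the client.run approach above (values are illustrative; cluster is the coiled.Cluster object from the snippet above):

import dask
from distributed import Client

# option 1: raise the connect timeout in the local dask config before connecting
dask.config.set({"distributed.comm.timeouts.connect": "300s"})

# option 2: pass a timeout to the Client constructor; presumably this only
# applies to the initial connection from this process, not to the scheduler
# or workers, which would explain why it seemed to be ignored
client = Client(cluster, timeout="300s")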
All in all, we're making progress :-).
We also set distributed.comm.timeouts.tcp to 600 btw. But from the dask GitHub issues you can see that timeout handling is a problem more people run into (especially when running long-running tasks on workers).
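Concretely, we now push both timeout settings with the same kind of helper as above; a sketch (the helper name is ours and illustrative, and client is the Client from the earlier snippet):

def set_timeouts(connect: str = "300s", tcp: str = "600s") -> dict:
    import dask
    settings = {
        "distributed.comm.timeouts.connect": connect,
        "distributed.comm.timeouts.tcp": tcp,
    }
    dask.config.set(settings)
    return {key: dask.config.get(key) for key in settings}

# apply on the scheduler and on every worker; the return values let us
# verify that the settings actually landed
client.run_on_scheduler(set_timeouts)
client.run(set_timeouts)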
Thank you for the update - I've also noticed that we might get a timeout error with the latest distributed version when trying to connect to a running cluster. Just something to be aware of if your cluster is running the latest version and locally you are running an older version.
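If you want to rule that out, one way to compare the local and cluster versions is via the standard distributed API (a sketch; client is a connected Client):

versions = client.get_versions(check=False)
print(versions["client"]["packages"]["distributed"])
print(versions["scheduler"]["packages"]["distributed"])
# passing check=True instead raises an error if required packages do not match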
Looking at how you are setting up your timeouts, I'd say that's a good way to do it. I will double-check with the team to see if we can set this up in a different way.
Hello Ruben, just wanted to update my previous comment. This is the correct way to update the timeout settings on the scheduler. I've added a quick troubleshooting article to our knowledge base, and once we implement #75 you will be able to change the configuration for the scheduler and/or workers with the coiled.Cluster constructor.
I am closing this issue now, but if you need any further help please feel free to open a new issue or reach out to us.
I have a cluster running ("test-cluster"), which is in use by a Python process (and currently running stuff).
If I try to connect to the same cluster in another Python session, I get a timeout:
I'm running the default coiled env.
Is this expected behavior? I.e. is it impossible to connect to an existing cluster if it's in use by another process?