Closed phobson closed 1 year ago
Update from the user:
The timeout error was because I was not connecting the spun-up cluster to the client. Now it is running okay, so we can close the GitHub issue. Thank you so much for the support.
However, I think (not sure) this may be an issue (see the attached screenshot).
I receive this 'software' error the first time I try to spin up a GPU cluster, but on the subsequent run I don't get it and the cluster gets created. I forgot to mention this before. Attaching a new screenshot of the error, which I received again.
Hm, seems like some sort of dask config system weirdness. It sounds like this happened just once and then went away, is that right?
Yes, it happens every time I open a notebook and want to spin up a cluster. On the first run it shows this error; when I run it again, the cluster spins up.
Also, how can I increase the timeout? Are there any parameters that we can pass?
I am surprised by this failure. I just removed my coiled.yaml
files in my config, and then called dask.config.get("coiled.software")
and things worked fine. I can't think of a reason for this failure.
Do things fail if you do the following?
import coiled, dask
dask.config.get("coiled.software")
@mrocklin The above line works fine for me. Thanks for checking.
But I am getting intermittent connection issues. It worked yesterday after a couple of tries, but didn't work today. In the first screenshot, I tried to connect to the client in the same cell: the cluster got created but couldn't connect to the client (it timed out after 10s). In the second screenshot, I tried to connect to the client in a different cell: the cluster got created but again couldn't connect to the client (it timed out again). Both worked fine the day before. Can you please suggest an alternative? I am using an AWS SageMaker Notebook.
After a couple of tries it worked. Not sure what the issue is.
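For intermittent connection timeouts like the ones described above, one possible workaround is to retry the connection with progressively longer timeouts. The following is only a minimal sketch, not Coiled's or Dask's own retry logic: `connect_with_retries` and its `connect` argument are hypothetical names introduced here, and the sketch assumes that a failed connection is raised as `OSError` or `TimeoutError`.

```python
import time


def connect_with_retries(connect, timeouts=(10, 30, 60), pause=0.0):
    """Call connect(timeout) with each timeout in turn until one succeeds.

    `connect` is a hypothetical callable supplied by the caller, for example
    lambda t: Client(cluster, timeout=t) when using dask.distributed.
    """
    last_error = None
    for timeout in timeouts:
        try:
            return connect(timeout)
        except (OSError, TimeoutError) as exc:
            # Assumption: connection failures surface as OSError/TimeoutError.
            last_error = exc
            time.sleep(pause)
    raise TimeoutError(
        f"could not connect after {len(timeouts)} attempts"
    ) from last_error
```

With `distributed`, the callable could wrap `Client(cluster, timeout=t)`, since `Client` accepts a `timeout` argument. The default connection timeout can also be raised through the Dask configuration key `distributed.comm.timeouts.connect`.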
@nilanjanroy1 I noticed that you have dask and distributed version mismatches between the client and scheduler. You want them to match; the easiest way to get there is to update your local environment in SageMaker. I'm not sure whether this has anything to do with the connection problem, but we should rule it out.
Those are pretty old versions of dask and distributed on the client (i.e., the SageMaker Notebook). Using more recent versions might address some of the issues (or might not, hard to say for sure, but worth trying).
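To rule out the version mismatch mentioned above, it can help to compare package versions between the notebook and the scheduler. `distributed` offers `Client.get_versions(check=True)` for this purpose; the sketch below only illustrates the comparison itself on plain dictionaries. The helper name `find_version_mismatches` and the version numbers are made up for illustration and are not taken from this thread.

```python
def find_version_mismatches(client_pkgs, scheduler_pkgs,
                            packages=("dask", "distributed")):
    """Return {package: (client_version, scheduler_version)} where they differ."""
    mismatches = {}
    for name in packages:
        client_version = client_pkgs.get(name)
        scheduler_version = scheduler_pkgs.get(name)
        if client_version != scheduler_version:
            mismatches[name] = (client_version, scheduler_version)
    return mismatches


# Illustrative version numbers only (assumed, not from the thread).
client = {"dask": "2021.1.0", "distributed": "2021.1.0"}
scheduler = {"dask": "2022.6.0", "distributed": "2022.6.0"}
print(find_version_mismatches(client, scheduler))
```

An empty result means the listed packages agree on both sides; any entry names a package to upgrade on the client.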
Hi @ncclementi @ntabris, I will upgrade both dask and distributed on the client and check whether I still get any issues. Thanks for checking.
Package sync should solve the mismatch issue. Closing as stale/resolved otherwise.
via support@coiled.io email
Hey folks,
We've got a user using GPUs. At first they were having issues with service limits, but we got that squared away. With that no longer a blocker, they're seeing this timeout issue connecting to a cluster that seems to be successfully created.
The user says:
Does this mean that e.g., the AWS CLI isn't setup on the local machine?
More info: