coiled / feedback

A place to provide Coiled feedback

Timeout/credentials issue with GPU Cluster #163

Closed phobson closed 1 year ago

phobson commented 2 years ago

via support@coiled.io email

Hey folks,

We've got a user using GPUs. At first they were having issues with service limits, but we got that squared away. With that no longer a blocker, they're seeing this timeout issue when connecting to a cluster that seems to have been created successfully.

WARNING:root:error sending AWS credentials to cluster: Timed out during handshake while connecting to tls://35.87.219.87:8786 after 10 s
OSError: Timed out during handshake while connecting to tls://35.87.219.87:8786 after 10 s

The user says:

Security group and all other parameters are being set up by Coiled only.

Does this mean that, e.g., the AWS CLI isn't set up on the local machine?

More info:

```
TimeoutError                              Traceback (most recent call last)
~/anaconda3/lib/python3.7/site-packages/distributed/comm/core.py in connect(addr, timeout, deserialize, handshake_overrides, **connection_args)
    318             # write, handshake = await asyncio.gather(comm.write(local_info), comm.read())
--> 319             handshake = await asyncio.wait_for(comm.read(), time_left())
    320             await asyncio.wait_for(comm.write(local_info), time_left())

~/anaconda3/lib/python3.7/asyncio/tasks.py in wait_for(fut, timeout, loop)
    448             await _cancel_and_wait(fut, loop=loop)
--> 449             raise futures.TimeoutError()
    450     finally:

TimeoutError:

The above exception was the direct cause of the following exception:

OSError                                   Traceback (most recent call last)
in
----> 1 cluster = coiled.Cluster(worker_gpu=1, worker_vm_types=['g4dn.xlarge'])

~/anaconda3/lib/python3.7/site-packages/coiled/_beta/cluster.py in __init__(self, name, software, n_workers, worker_class, worker_options, worker_vm_types, worker_cpu, worker_memory, worker_gpu, worker_gpu_type, scheduler_class, scheduler_options, scheduler_vm_types, scheduler_cpu, scheduler_memory, asynchronous, cloud, account, shutdown_on_close, use_scheduler_public_ip, credentials, timeout, environ, backend_options, show_widget, configure_logging, wait_for_workers)
    378         except Exception as e:
    379             self.close()
--> 380             raise e
    381
    382     def _ipython_display_(self):

~/anaconda3/lib/python3.7/site-packages/coiled/_beta/cluster.py in __init__(self, name, software, n_workers, worker_class, worker_options, worker_vm_types, worker_cpu, worker_memory, worker_gpu, worker_gpu_type, scheduler_class, scheduler_options, scheduler_vm_types, scheduler_cpu, scheduler_memory, asynchronous, cloud, account, shutdown_on_close, use_scheduler_public_ip, credentials, timeout, environ, backend_options, show_widget, configure_logging, wait_for_workers)
    363         # a problem), just spam created by clusters who failed initial creation.
    364         try:
--> 365             self.sync(self._start)
    366         except ClusterCreationError as e:
    367             self.close()

~/anaconda3/lib/python3.7/site-packages/coiled/cluster.py in sync(self, func, asynchronous, callback_timeout, *args, **kwargs)
    530             asynchronous=asynchronous,
    531             callback_timeout=callback_timeout,
--> 532             **kwargs,
    533         )
    534

~/anaconda3/lib/python3.7/site-packages/distributed/deploy/cluster.py in sync(self, func, asynchronous, callback_timeout, *args, **kwargs)
    187             return future
    188         else:
--> 189             return sync(self.loop, func, *args, **kwargs)
    190
    191     def _log(self, log):

~/anaconda3/lib/python3.7/site-packages/distributed/utils.py in sync(loop, func, callback_timeout, *args, **kwargs)
    338     if error[0]:
    339         typ, exc, tb = error[0]
--> 340         raise exc.with_traceback(tb)
    341     else:
    342         return result[0]

~/anaconda3/lib/python3.7/site-packages/distributed/utils.py in f()
    322             if callback_timeout is not None:
    323                 future = asyncio.wait_for(future, callback_timeout)
--> 324             result[0] = yield future
    325         except Exception as exc:
    326             error[0] = sys.exc_info()

~/anaconda3/lib/python3.7/site-packages/tornado/gen.py in run(self)
    760
    761                 try:
--> 762                     value = future.result()
    763                 except Exception:
    764                     exc_info = sys.exc_info()

~/anaconda3/lib/python3.7/site-packages/coiled/context.py in wrapper(*args, **kwargs)
     75     else:
     76         with operation_context(name=f"{func.__module__}.{func.__qualname__}"):
---> 77             return await func(*args, **kwargs)
     78
     79     return wrapper

~/anaconda3/lib/python3.7/site-packages/coiled/_beta/cluster.py in _start(self)
    598             raise
    599
--> 600         await super(Cluster, self)._start()
    601
    602         # Set adaptive maximum value based on available config and user quota

~/anaconda3/lib/python3.7/site-packages/distributed/deploy/cluster.py in _start(self)
     71
     72     async def _start(self):
---> 73         comm = await self.scheduler_comm.live_comm()
     74         await comm.write({"op": "subscribe_worker_status"})
     75         self.scheduler_info = await comm.read()

~/anaconda3/lib/python3.7/site-packages/distributed/core.py in live_comm(self)
    748                 self.timeout,
    749                 deserialize=self.deserialize,
--> 750                 **self.connection_args,
    751             )
    752             comm.name = "rpc"

~/anaconda3/lib/python3.7/site-packages/distributed/comm/core.py in connect(addr, timeout, deserialize, handshake_overrides, **connection_args)
    324         raise IOError(
    325             f"Timed out during handshake while connecting to {addr} after {timeout} s"
--> 326         ) from exc
    327
    328     comm.remote_info = handshake

OSError: Timed out during handshake while connecting to tls://35.87.219.87:8786 after 10 s
```
(screenshots attached: gpu-err1, gpu-err2)
phobson commented 2 years ago

Update from the user:

The timeout error was because I was not connecting the spun-up cluster to the client. Now it is running okay; we can close the GitHub issue. Thank you so much for the support.
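
For anyone hitting the same symptom, here is a minimal sketch of what "connecting the cluster to the client" looks like; the worker arguments are just the ones from the traceback above:

```python
import coiled
from dask.distributed import Client

# Create the GPU cluster (same arguments as in the traceback above)
cluster = coiled.Cluster(worker_gpu=1, worker_vm_types=["g4dn.xlarge"])

# Attach a Dask client so work is actually submitted to the spun-up cluster
client = Client(cluster)
print(client.dashboard_link)
```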

However, I think (not sure) this may be an issue (screenshot attached).

I receive this 'software' error the first time I try to spin up a GPU cluster, but on the subsequent run I don't get this issue and the cluster gets created. I forgot to mention this before. Attaching a new screenshot of the error, which I received again.

(screenshot attached: gpu-err3)
ntabris commented 2 years ago

I receive this 'software' error the first time I try to spin up a GPU cluster, but on the subsequent run I don't get this issue and the cluster gets created.

Hm, seems like some sort of dask config system weirdness. It sounds like this happened just once and then went away, is that right?

nilanjanroy1 commented 2 years ago

Yes, it happens every time I open a notebook and want to spin up a cluster. The first run shows this error; when I run it again, the cluster spins up.

Also, how do I increase the timeout? Are there any parameters we can pass?
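
For reference, a hedged sketch of two ways to raise the timeouts involved, assuming the standard `distributed.comm.timeouts.connect` config key and the `timeout` argument that appears in the `coiled.Cluster.__init__` signature in the traceback above:

```python
import dask
import coiled

# Give the scheduler handshake more than the default 10 s
dask.config.set({"distributed.comm.timeouts.connect": "60s"})

# coiled.Cluster also accepts a timeout argument for overall cluster startup;
# the exact units/behavior may vary by coiled version
cluster = coiled.Cluster(
    worker_gpu=1,
    worker_vm_types=["g4dn.xlarge"],
    timeout=600,
)
```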

mrocklin commented 2 years ago

I am surprised by this failure. I just removed my coiled.yaml files in my config, and then called dask.config.get("coiled.software") and things worked fine. I can't think of a reason for this failure.

mrocklin commented 2 years ago

Do things fail if you do the following?

```python
import coiled, dask

dask.config.get("coiled.software")
```
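
If it only fails on the first run in a fresh notebook, it may also help to see which config files dask is actually picking up; a small debugging sketch using standard `dask.config` helpers:

```python
import dask

# Directories/files dask searches for YAML config (including any coiled.yaml)
print(dask.config.paths)

# Re-read config from disk, then check the key again (default avoids a KeyError)
dask.config.refresh()
print(dask.config.get("coiled.software", default=None))
```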
nilanjanroy1 commented 2 years ago

@mrocklin The above line works fine for me. Thanks for checking.

But I am still getting these intermittent connection issues. It worked yesterday after a couple of tries, but didn't work today. In the first screenshot, I tried to connect the client in the same cell: the cluster got created but the client couldn't connect (it times out after 10 s). In the second screenshot, I tried to connect the client in a different cell: the cluster got created but the client still couldn't connect (it timed out again). Both worked fine the day before. Can you please suggest an alternative? I am using an AWS SageMaker Notebook.

(screenshots attached: gpu-conn1, gpu-conn2)
nilanjanroy1 commented 2 years ago

After a couple of tries it worked. Not sure what the issue is. (screenshot attached)

ncclementi commented 2 years ago

@nilanjanroy1 I noticed that you have dask and distributed version mismatches between the client and the scheduler. You want them to match, and the easiest way is to update your local environment in SageMaker. I'm not sure whether this has anything to do with the connection problems, but we should rule it out.

ntabris commented 2 years ago

Those are pretty old versions of dask and distributed on the client (i.e., the SageMaker Notebook). Using more recent versions might address some of the issues (or might not, it's hard to say for sure, but it's worth trying).
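
A hedged sketch of how to confirm the mismatch from the notebook, using the standard `Client.get_versions` check:

```python
from dask.distributed import Client

client = Client(cluster)  # cluster created with coiled.Cluster(...)

# Compares package versions (dask, distributed, etc.) across client,
# scheduler, and workers; check=True raises on a mismatch instead of warning
client.get_versions(check=True)
```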

nilanjanroy1 commented 2 years ago

Hi @ncclementi @ntabris, I will upgrade both dask and distributed on the client and check whether I still get the issue. Thanks for checking.

shughes-uk commented 1 year ago

Package sync should solve the mismatch issue. Closing as stale/resolved otherwise.
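
For reference, a hedged sketch of what that looks like, assuming the `package_sync=True` flag available in coiled around this time (it builds the cluster software environment from the local, e.g. SageMaker, environment):

```python
import coiled
from dask.distributed import Client

# Mirror the local environment onto the cluster so dask/distributed
# versions match between client, scheduler, and workers
cluster = coiled.Cluster(
    worker_gpu=1,
    worker_vm_types=["g4dn.xlarge"],
    package_sync=True,
)
client = Client(cluster)
```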