coiled / feedback

A place to provide Coiled feedback

Cluster creation fails running quickstart #138

Closed fonnesbeck closed 1 year ago

fonnesbeck commented 3 years ago

Following the quickstart example fails at the call to coiled.Cluster, yet my dashboard shows that a cluster has been created and is running. Here is the error:

Cluster deleted successfully.
distributed.protocol.core - CRITICAL - Failed to deserialize
Traceback (most recent call last):
  File "/home/cfonnesbeck/miniconda3/envs/nn_matchup/lib/python3.8/site-packages/distributed/protocol/core.py", line 107, in loads
    small_payload = frames.pop()
IndexError: pop from empty list
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
~/miniconda3/envs/nn_matchup/lib/python3.8/site-packages/distributed/comm/core.py in connect(addr, timeout, deserialize, handshake_overrides, **connection_args)
    318         # write, handshake = await asyncio.gather(comm.write(local_info), comm.read())
--> 319         handshake = await asyncio.wait_for(comm.read(), time_left())
    320         await asyncio.wait_for(comm.write(local_info), time_left())

~/miniconda3/envs/nn_matchup/lib/python3.8/asyncio/tasks.py in wait_for(fut, timeout, loop)
    493         if fut.done():
--> 494             return fut.result()
    495         else:

~/miniconda3/envs/nn_matchup/lib/python3.8/site-packages/distributed/comm/tcp.py in read(self, deserializers)
    216 
--> 217                 msg = await from_frames(
    218                     frames,

~/miniconda3/envs/nn_matchup/lib/python3.8/site-packages/distributed/comm/utils.py in from_frames(frames, deserialize, deserializers, allow_offload)
     79     else:
---> 80         res = _from_frames()
     81 

~/miniconda3/envs/nn_matchup/lib/python3.8/site-packages/distributed/comm/utils.py in _from_frames()
     62         try:
---> 63             return protocol.loads(
     64                 frames, deserialize=deserialize, deserializers=deserializers

~/miniconda3/envs/nn_matchup/lib/python3.8/site-packages/distributed/protocol/core.py in loads(frames, deserialize, deserializers)
    106         small_header = frames.pop()
--> 107         small_payload = frames.pop()
    108         msg = loads_msgpack(small_header, small_payload)

IndexError: pop from empty list

The above exception was the direct cause of the following exception:

OSError                                   Traceback (most recent call last)
~/GitHub/nn_matchup/models/matchup_model_estimator.py in 
----> 2 cluster = coiled.Cluster(n_workers=10)

~/miniconda3/envs/nn_matchup/lib/python3.8/site-packages/coiled/cluster.py in __init__(self, n_workers, configuration, software, worker_cpu, worker_gpu, worker_memory, worker_class, worker_options, scheduler_cpu, scheduler_memory, scheduler_class, scheduler_options, name, asynchronous, cloud, account, shutdown_on_close, backend_options, credentials, timeout)
    162         self._name = "coiled.Cluster"  # Used in Dask's Cluster._ipython_display_
    163         if not self.asynchronous:
--> 164             self.sync(self._start)
    165 
    166     @property

~/miniconda3/envs/nn_matchup/lib/python3.8/site-packages/distributed/deploy/cluster.py in sync(self, func, asynchronous, callback_timeout, *args, **kwargs)
    187             return future
    188         else:
--> 189             return sync(self.loop, func, *args, **kwargs)
    190 
    191     def _log(self, log):

~/miniconda3/envs/nn_matchup/lib/python3.8/site-packages/distributed/utils.py in sync(loop, func, callback_timeout, *args, **kwargs)
    349     if error[0]:
    350         typ, exc, tb = error[0]
--> 351         raise exc.with_traceback(tb)
    352     else:
    353         return result[0]

~/miniconda3/envs/nn_matchup/lib/python3.8/site-packages/distributed/utils.py in f()
    332             if callback_timeout is not None:
    333                 future = asyncio.wait_for(future, callback_timeout)
--> 334             result[0] = yield future
    335         except Exception as exc:
    336             error[0] = sys.exc_info()

~/miniconda3/envs/nn_matchup/lib/python3.8/site-packages/tornado/gen.py in run(self)
    760 
    761                     try:
--> 762                         value = future.result()
    763                     except Exception:
    764                         exc_info = sys.exc_info()

~/miniconda3/envs/nn_matchup/lib/python3.8/site-packages/coiled/cluster.py in _start(self)
    248                 raise
    249 
--> 250             await super()._start()
    251 
    252             # TODO: Come up with a better long-term solution. Below we raise an informative error message

~/miniconda3/envs/nn_matchup/lib/python3.8/site-packages/distributed/deploy/cluster.py in _start(self)
     71 
     72     async def _start(self):
---> 73         comm = await self.scheduler_comm.live_comm()
     74         await comm.write({"op": "subscribe_worker_status"})
     75         self.scheduler_info = await comm.read()

~/miniconda3/envs/nn_matchup/lib/python3.8/site-packages/distributed/core.py in live_comm(self)
    744             del self.comms[s]
    745         if not open or comm.closed():
--> 746             comm = await connect(
    747                 self.address,
    748                 self.timeout,

~/miniconda3/envs/nn_matchup/lib/python3.8/site-packages/distributed/comm/core.py in connect(addr, timeout, deserialize, handshake_overrides, **connection_args)
    322         with suppress(Exception):
    323             await comm.close()
--> 324         raise IOError(
    325             f"Timed out during handshake while connecting to {addr} after {timeout} s"
    326         ) from exc

OSError: Timed out during handshake while connecting to tls://ec2-52-15-36-97.us-east-2.compute.amazonaws.com:8786 after 5 s

FabioRosado commented 3 years ago

Hello @fonnesbeck, thank you for creating this issue. Could you run the command coiled.list_local_versions(json=True) and let me know which versions you have installed?

When did the error appear? I can see the message "Cluster deleted successfully.", which would appear when you run coiled.delete_cluster(name=<cluster name>). I'm not sure whether this is related, but it would be good to know.

Could you try the quickstart again and let us know if you see the same issue? Thank you.
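For readers who cannot call coiled.list_local_versions, the version report requested above can be approximated with the standard library alone. This is a minimal sketch, not Coiled's implementation: the helper name `local_versions` and the shape of its output are assumptions made for illustration, and it relies on `importlib.metadata`, which requires Python 3.8 or later.

```python
import json
import platform
from importlib import metadata  # standard library, Python 3.8+


def local_versions(packages=("coiled", "dask", "distributed")):
    """Return installed versions by package name; None marks a missing package.

    Hypothetical helper for illustration; it is not part of the coiled API.
    """
    report = {"python": platform.python_version()}
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = None
    return report


if __name__ == "__main__":
    # Print the report as JSON so it can be pasted into an issue.
    print(json.dumps(local_versions(), indent=2))
```

Pasting the printed JSON into the issue gives roughly the same information as the requested command.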

fonnesbeck commented 3 years ago

I have: python=3.8.8 coiled=0.0.38 dask=2021.03.0 distributed=2021.03.0

The error occurs just after the "Creating Cluster. ..." message appears. I deleted the cluster manually from the dashboard after the failure.

fonnesbeck commented 3 years ago

Still happening. Could this be because I'm running from WSL on Windows? Some sort of port forwarding issue? Will try it from my Mac.
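One way to test the port-forwarding hypothesis is to check whether a plain TCP connection to the scheduler can be opened at all. The sketch below uses only the standard library; `can_reach` is a hypothetical helper name, and the check covers TCP reachability only, not the TLS handshake that actually timed out in the traceback.

```python
import socket


def can_reach(host, port, timeout=5.0):
    """Return True if a plain TCP connection to host:port opens within timeout.

    Hypothetical helper for illustration. Resolution failures, refused
    connections, and timeouts are all subclasses of OSError and return False.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling `can_reach("ec2-52-15-36-97.us-east-2.compute.amazonaws.com", 8786)` with the address from the error message would indicate whether the port is reachable from the local network. If it returns True, the problem is more likely in the handshake itself than in port forwarding.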

fonnesbeck commented 3 years ago

Can confirm same problem from macOS:

{'python_version': LooseVersion ('3.7.10'), 'coiled_version': '0.0.38', 'dask_version': '2021.03.0', 'distributed_version': '2021.03.0'}

FabioRosado commented 3 years ago

Thank you for the update. Could you please upgrade dask and distributed to the latest version, 2021.03.1, and test again?

You can do it with pip install --upgrade dask distributed

FabioRosado commented 3 years ago

Hello @fonnesbeck, how are you doing? Were you able to get this working? We have added a more informative error message for when we notice version mismatches that might break things. Hopefully, this will help.
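The kind of version-mismatch check described above could look roughly like the following. This is a sketch under stated assumptions, not the check Coiled ships: `version_mismatches` is a hypothetical function, and the cluster-side versions are invented for illustration, since the thread only reports the client-side versions.

```python
def version_mismatches(local, remote):
    """Return {package: (local_version, remote_version)} where the two differ.

    Hypothetical helper for illustration; a package missing on one side is
    reported with None for that side.
    """
    packages = sorted(set(local) | set(remote))
    return {
        name: (local.get(name), remote.get(name))
        for name in packages
        if local.get(name) != remote.get(name)
    }


# Client versions as reported in this thread.
local = {"dask": "2021.03.0", "distributed": "2021.03.0"}
# Assumed cluster versions, for illustration only.
remote = {"dask": "2021.03.1", "distributed": "2021.03.1"}

for package, (mine, theirs) in version_mismatches(local, remote).items():
    print(f"{package}: local {mine} != cluster {theirs}")
```

An empty result means the two environments agree on every listed package.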

shughes-uk commented 1 year ago

stale