coiled / feedback

A place to provide Coiled feedback

Requesting more than 4 worker cpus creates a zombie cluster #101

Closed drorspei closed 3 years ago

drorspei commented 3 years ago

When asking for more than 4 worker CPUs, creation of the Fargate task fails (which is expected). However, the cluster is then listed as pending in the dashboard with 0/0 "Num Workers", and subsequent calls to coiled.Cluster try to connect to this cluster, which will never finish starting.

Here is my IPython log; the call at the end never returns:

In [28]: cluster = coiled.Cluster(n_workers=1, worker_cpu=8, name="drorspei1")
Creating Cluster. This takes about a minute ...Checking environment images
Valid environment image found
('Could not create task definition for drorspei-****-scheduler', ClientException('An error occurred (ClientException) when calling the RegisterTaskDefinition operation: No Fargate configuration exists for given values.'))
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-28-7dcfd21d0c0f> in <module>
----> 1 cluster = coiled.Cluster(n_workers=1, worker_cpu=8, name="drorspei1")

~/.pyenv/versions/3.8.5/envs/coiled/lib/python3.8/site-packages/coiled/cluster.py in __init__(self, n_workers, configuration, software, worker_cpu, worker_gpu, worker_memory, worker_class, worker_options, scheduler_cpu, scheduler_memory, scheduler_class, scheduler_options, name, asynchronous, cloud, account, shutdown_on_close, backend_options, credentials)
    151         self._name = "coiled.Cluster"  # Used in Dask's Cluster._ipython_display_
    152         if not self.asynchronous:
--> 153             self.sync(self._start)
    154 
    155     @property

~/.pyenv/versions/3.8.5/envs/coiled/lib/python3.8/site-packages/distributed/deploy/cluster.py in sync(self, func, asynchronous, callback_timeout, *args, **kwargs)
    187             return future
    188         else:
--> 189             return sync(self.loop, func, *args, **kwargs)
    190 
    191     def _log(self, log):

~/.pyenv/versions/3.8.5/envs/coiled/lib/python3.8/site-packages/distributed/utils.py in sync(loop, func, callback_timeout, *args, **kwargs)
    338     if error[0]:
    339         typ, exc, tb = error[0]
--> 340         raise exc.with_traceback(tb)
    341     else:
    342         return result[0]

~/.pyenv/versions/3.8.5/envs/coiled/lib/python3.8/site-packages/distributed/utils.py in f()
    322             if callback_timeout is not None:
    323                 future = asyncio.wait_for(future, callback_timeout)
--> 324             result[0] = yield future
    325         except Exception as exc:
    326             error[0] = sys.exc_info()

~/.pyenv/versions/3.8.5/envs/coiled/lib/python3.8/site-packages/tornado/gen.py in run(self)
    760 
    761                     try:
--> 762                         value = future.result()
    763                     except Exception:
    764                         exc_info = sys.exc_info()

~/.pyenv/versions/3.8.5/envs/coiled/lib/python3.8/site-packages/coiled/cluster.py in _start(self)
    192                 raise ValueError(error_msg)
    193 
--> 194             self.cluster_id = await self.cloud.create_cluster(
    195                 account=self.account,
    196                 configuration=self.configuration,  # type: ignore

~/.pyenv/versions/3.8.5/envs/coiled/lib/python3.8/site-packages/coiled/core.py in _create_cluster(self, name, configuration, software, worker_cpu, worker_gpu, worker_memory, worker_class, worker_options, scheduler_cpu, scheduler_memory, scheduler_class, scheduler_options, account, workers, log_output, backend_options)
    363         error_details = await self._websocket_stream(ws, log_output, use_spinner=False)
    364         if error_details:
--> 365             raise ValueError(f"Unable to create cluster: {error_details}")
    366 
    367         return await self._get_cluster_by_name(name=name, account=account)

ValueError: Unable to create cluster: ('Could not create task definition for drorspei-****-scheduler', ClientException('An error occurred (ClientException) when calling the RegisterTaskDefinition operation: No Fargate configuration exists for given values.'))

In [29]: cluster = coiled.Cluster(n_workers=1, worker_cpu=4, name="drorspei1")
Using existing cluster: drorspei1

The immediate workaround was to manually stop the zombie cluster in the dashboard.
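
For anyone hitting the same error, a client-side check before calling coiled.Cluster avoids sending the failing request at all. This is only a minimal sketch: it assumes the Fargate CPU/memory table that AWS documented around the time of this report (tasks capped at 4 vCPUs, memory in GB), and `check_fargate_size` is a hypothetical helper, not part of coiled.

```python
# Assumed Fargate task sizes (vCPU -> allowed memory values in GB),
# as documented by AWS when tasks were capped at 4 vCPUs.
FARGATE_SIZES = {
    0.25: (0.5, 1, 2),
    0.5: tuple(range(1, 5)),    # 1..4 GB
    1: tuple(range(2, 9)),      # 2..8 GB
    2: tuple(range(4, 17)),     # 4..16 GB
    4: tuple(range(8, 31)),     # 8..30 GB
}


def check_fargate_size(cpu, memory_gb):
    """Hypothetical helper: raise ValueError if the size is not in the table."""
    if cpu not in FARGATE_SIZES:
        raise ValueError(
            f"{cpu} vCPUs is not a Fargate size; choose one of {sorted(FARGATE_SIZES)}"
        )
    allowed = FARGATE_SIZES[cpu]
    if memory_gb not in allowed:
        raise ValueError(
            f"{memory_gb} GB is not valid with {cpu} vCPUs; "
            f"allowed range is {allowed[0]} to {allowed[-1]} GB"
        )


# A size inside the table passes silently.
check_fargate_size(4, 16)
```

With this table, `check_fargate_size(8, 16)` raises a ValueError locally, so no cluster is ever requested with that size.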

FabioRosado commented 3 years ago

Hello @drorspei, thank you for reporting this to us, and I'm sorry for any inconvenience this issue caused. I will test this on my end and discuss a fix with the team 😄

necaris commented 3 years ago

Agent Rami Chowdhury linked Freshdesk ticket 13 for this issue.

necaris commented 3 years ago

@FabioRosado test comment

necaris commented 3 years ago

Agent Rami Chowdhury created comment.

@Fabio Rosado it looks like the GitHub comment syncing is not working :-(

FabioRosado commented 3 years ago

Hello @drorspei, first of all, I apologise for using this issue to test our integration.

I wanted to give you an update on this issue; we have worked on a few fixes that address the validation issue and the creation of zombie clusters.

Thank you for taking the time to test this and report it to us. We are planning a release soon that should address these issues.

For now, I'll close this issue, but please let us know if you need any further help or if you encounter any other issue.
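
The zombie-cluster half of the problem comes down to cleaning up when creation fails partway. The pattern can be sketched as below; this is an illustration only, not Coiled's actual fix, and `create` and `clean_up` are hypothetical callables standing in for the cluster request and for whatever stops the half-created cluster.

```python
def create_or_clean_up(create, clean_up):
    """Run create(); if it raises, call clean_up() and re-raise the error."""
    try:
        return create()
    except Exception:
        # Creation failed partway: remove the leftover record so a later
        # request with the same name does not attach to a dead cluster.
        clean_up()
        raise


# Usage with stand-in functions (no real cluster is involved).
stopped = []


def failing_create():
    raise ValueError("Unable to create cluster")


try:
    create_or_clean_up(failing_create, lambda: stopped.append("drorspei1"))
except ValueError:
    pass  # the original error still reaches the caller
```

After this runs, `stopped` holds the cluster name, showing that the cleanup ran before the error propagated.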