coiled / feedback

A place to provide Coiled feedback

Fargate spot clusters fail to create #91

Closed ericjeske closed 1 year ago

ericjeske commented 3 years ago

Intermittent failures to create spot clusters (within owned AWS account).

The cluster was created in the coiled UI and workers appeared in ECS.

Attempted to spin up non-spot cluster without any issues. Subsequent (and intermittent) attempts to spin up a spot cluster were successful.


```
~/SageMaker/lvr/lvr/ingest/utils_dask.py in create_client_cluster(version, n_workers, parallel, logging_func)
     49                 region='us-east-1',
     50                 shutdown_on_close=True,
---> 51                 backend_options={"fargate_spot": parallel == 'cloud_spot'}
     52             )
     53         else:

~/anaconda3/lib/python3.7/site-packages/coiled/cluster.py in __init__(self, n_workers, configuration, software, worker_cpu, worker_gpu, worker_memory, worker_class, worker_options, scheduler_cpu, scheduler_memory, scheduler_class, scheduler_options, name, asynchronous, cloud, account, shutdown_on_close, region, backend_options)
    145 
    146         if not self.asynchronous:
--> 147             self.sync(self._start)
    148 
    149     @property

~/anaconda3/lib/python3.7/site-packages/distributed/deploy/cluster.py in sync(self, func, asynchronous, callback_timeout, *args, **kwargs)
    181             return future
    182         else:
--> 183             return sync(self.loop, func, *args, **kwargs)
    184 
    185     def _log(self, log):

~/anaconda3/lib/python3.7/site-packages/distributed/utils.py in sync(loop, func, callback_timeout, *args, **kwargs)
    338     if error[0]:
    339         typ, exc, tb = error[0]
--> 340         raise exc.with_traceback(tb)
    341     else:
    342         return result[0]

~/anaconda3/lib/python3.7/site-packages/distributed/utils.py in f()
    322             if callback_timeout is not None:
    323                 future = asyncio.wait_for(future, callback_timeout)
--> 324             result[0] = yield future
    325         except Exception as exc:
    326             error[0] = sys.exc_info()

~/anaconda3/lib/python3.7/site-packages/tornado/gen.py in run(self)
    760 
    761                     try:
--> 762                         value = future.result()
    763                     except Exception:
    764                         exc_info = sys.exc_info()

~/anaconda3/lib/python3.7/site-packages/coiled/cluster.py in _start(self)
    205             )
    206             if self._start_n_workers:
--> 207                 await self._scale(self._start_n_workers)
    208 
    209         self.security, info = await self.cloud.security(

~/anaconda3/lib/python3.7/site-packages/coiled/cluster.py in _scale(self, n)
    314             account=self.account,
    315             cluster_id=self.cluster_id,  # type: ignore
--> 316             n=n,
    317         )
    318 

~/anaconda3/lib/python3.7/site-packages/coiled/core.py in _scale(self, cluster_id, n, account)
    505         if response.status >= 400:
    506             text = await response.text()
--> 507             raise Exception(text)
    508 
    509     def scale(self, cluster_id: int, n: int, account: str = None) -> None:

Exception: <html>
<head><title>502 Bad Gateway</title></head>
<body>
<center><h1>502 Bad Gateway</h1></center>
</body>
</html>
```

necaris commented 3 years ago

More context:

I think the issue I'm running into with cluster creation failures (both for high-worker-count fargate clusters and for ~medium-count fargate_spot clusters) is a race condition while waiting for workers. I've found that if I create a small cluster (5 workers // 10 cores) and then scale it up manually by 10-20 workers, I don't run into any issues.
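
The workaround described above (start small, then grow in increments) could be sketched roughly as follows. This is only an illustration: `scale_in_steps` is a hypothetical helper, not part of coiled or distributed, and the `step` and `pause` values are guesses rather than tested settings. It assumes only that the cluster object exposes `scale(n)`, as Dask cluster objects do.

```python
import time


def scale_in_steps(cluster, start, target, step=10, pause=0.0):
    """Grow a cluster from `start` to `target` workers, `step` at a time.

    `cluster` is any object with a `scale(n)` method.
    Returns the list of worker counts that were requested, in order.
    """
    if step <= 0:
        raise ValueError("step must be positive")
    requested = []
    n = start
    while n < target:
        # Never overshoot the final target.
        n = min(n + step, target)
        cluster.scale(n)
        requested.append(n)
        # Optionally wait between increments (assumed to give workers time to join).
        if pause and n < target:
            time.sleep(pause)
    return requested
```

With a real cluster this would look something like creating `coiled.Cluster(n_workers=5, ...)` first and then calling `scale_in_steps(cluster, start=5, target=50, step=15, pause=30)`. Whether any particular pause avoids the suspected race has not been confirmed here.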

shughes-uk commented 1 year ago

Fargate spot deprecated