dask / dask-cloudprovider

Cloud provider cluster managers for Dask. Supports AWS, Google Cloud, Azure and more.
https://cloudprovider.dask.org
BSD 3-Clause "New" or "Revised" License

Error on RegisterTaskDefinition for worker when instantiating a FargateCluster #290

Open AndrewHannigan opened 3 years ago

AndrewHannigan commented 3 years ago

Occasionally I see the following error when dask-cloudprovider attempts to register the worker task definition while creating a Fargate cluster:

Unexpected error: ClientException('An error occurred (ClientException) when calling the RegisterTaskDefinition operation: Too many concurrent attempts to create a new revision of the specified family.')
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/prefect/engine/runner.py", line 48, in inner
    new_state = method(self, state, *args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/prefect/engine/flow_runner.py", line 442, in get_flow_run_state
    with self.check_for_cancellation(), executor.start():
  File "/usr/lib/python3.8/contextlib.py", line 113, in __enter__
    return next(self.gen)
  File "/usr/local/lib/python3.8/dist-packages/prefect/executors/dask.py", line 223, in start
    with self.cluster_class(**self.cluster_kwargs) as cluster:
  File "/usr/local/lib/python3.8/dist-packages/dask_cloudprovider/aws/ecs.py", line 1367, in __init__
    super().__init__(fargate_scheduler=True, fargate_workers=True, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/dask_cloudprovider/aws/ecs.py", line 733, in __init__
    super().__init__(**kwargs)
  File "/usr/local/lib/python3.8/dist-packages/distributed/deploy/spec.py", line 282, in __init__
    self.sync(self._start)
  File "/usr/local/lib/python3.8/dist-packages/distributed/deploy/cluster.py", line 188, in sync
    return sync(self.loop, func, *args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/distributed/utils.py", line 353, in sync
    raise exc.with_traceback(tb)
  File "/usr/local/lib/python3.8/dist-packages/distributed/utils.py", line 336, in f
    result[0] = yield future
  File "/usr/local/lib/python3.8/dist-packages/tornado/gen.py", line 762, in run
    value = future.result()
  File "/usr/local/lib/python3.8/dist-packages/dask_cloudprovider/aws/ecs.py", line 887, in _start
    await self._create_worker_task_definition_arn()
  File "/usr/local/lib/python3.8/dist-packages/dask_cloudprovider/aws/ecs.py", line 1159, in _create_worker_task_definition_arn
    response = await ecs.register_task_definition(
  File "/usr/local/lib/python3.8/dist-packages/aiobotocore/client.py", line 154, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.errorfactory.ClientException: An error occurred (ClientException) when calling the RegisterTaskDefinition operation: Too many concurrent attempts to create a new revision of the specified family.

AndrewHannigan commented 3 years ago

@jacobtomlinson any thoughts on what could be going on here? What surprises me is that multiple calls to create a new task definition revision seem to be made at the same time. If multiple requests were necessary (because something was blocking the revision on the AWS side), I would have expected those attempts to be made serially.

jacobtomlinson commented 3 years ago

Not sure. We call register_task_definition twice when the FargateCluster object is created, once for the scheduler and once for the worker.

Are you creating multiple FargateCluster objects or something?
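
For anyone hitting this intermittently, one possible workaround (not something dask-cloudprovider is confirmed to do) is to retry the registration with backoff when this specific error comes back. Below is a minimal sketch; `call_with_backoff`, `THROTTLE_MARKER` and the parameter defaults are hypothetical names and values chosen for illustration, and the error is matched on its message text because that is the only detail the traceback above gives us.

```python
import random
import time

# Message fragment taken from the error reported in this issue.
THROTTLE_MARKER = "Too many concurrent attempts"


def call_with_backoff(fn, attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call fn() and retry with exponential backoff plus jitter.

    Only the concurrent-revision error from this issue is retried;
    any other exception is re-raised immediately. This is an
    illustrative helper, not part of dask-cloudprovider or boto3.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except Exception as exc:
            is_throttle = THROTTLE_MARKER in str(exc)
            is_last_attempt = attempt == attempts - 1
            if not is_throttle or is_last_attempt:
                raise
            # Delay grows as base_delay * 2**attempt, with random jitter
            # so that concurrent callers do not retry in lockstep.
            sleep(base_delay * 2 ** attempt + random.uniform(0, base_delay))


# Hypothetical usage with a synchronous boto3 ECS client:
#   ecs = boto3.client("ecs")
#   call_with_backoff(lambda: ecs.register_task_definition(**task_def_kwargs))
```

The library itself uses aiobotocore, so the equivalent inside an async code path would await the call and use `asyncio.sleep` instead of `time.sleep`. If several clusters are started at once with the same task definition family, retrying only works around the contention rather than removing it.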