PrefectHQ / prefect

Prefect is a workflow orchestration framework for building resilient data pipelines in Python.
https://prefect.io
Apache License 2.0

Avoid rate limiting while using ECS Agent #4402

Closed zanieb closed 2 years ago

zanieb commented 3 years ago

There are very low rate limits on registering task definitions with the same name. Now that we create a task definition for every flow run and use the flow slug as the name, users can easily hit this limit. We need to

We can change the task definition name to include a short unique id, or we can consider using the flow run id instead of the slug (I think this may be the best option), i.e. prefect-flow-run-{id}
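A minimal sketch of the naming idea above, assuming a hypothetical helper `task_definition_family` (not part of Prefect). It relies on the documented ECS constraint that a family name may contain up to 255 letters, digits, hyphens, and underscores, and replaces anything else with a hyphen.

```python
import re

# Characters outside the set ECS accepts in a task definition family name.
_INVALID_FAMILY_CHARS = re.compile(r"[^A-Za-z0-9_-]")


def task_definition_family(flow_run_id: str, prefix: str = "prefect-flow-run") -> str:
    """Build a per-flow-run family name (hypothetical helper, for illustration)."""
    safe_id = _INVALID_FAMILY_CHARS.sub("-", flow_run_id)
    # Truncate to the 255-character limit on family names.
    return f"{prefix}-{safe_id}"[:255]
```

Because every flow run gets its own family, concurrent registrations no longer compete to create revisions of the same family, although they still count against the overall API rate.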

We should be able to set up retries via the boto3 client, e.g.

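import boto3
from botocore.config import Config

# Retry throttled calls with client-side rate limiting ("adaptive" mode).
config = Config(
    retries={
        "max_attempts": 10,
        "mode": "adaptive",
    }
)

# The agent talks to ECS, so the client is created for "ecs" here.
ecs = boto3.client("ecs", config=config)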

I think that change may require a look from someone on our team so we can ensure that the config is handled properly across our uses of the boto client.
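One way to keep the retry config consistent across every boto client is to route client creation through a single factory. The sketch below is an assumption about how that could look, not Prefect's implementation: `resolve_retries` and `get_boto_client` are hypothetical names, and the defaults are only applied when the matching `AWS_*` environment variable is absent, so that user-supplied environment settings are not overridden.

```python
import os
from typing import Any, Dict, Mapping, Optional


def resolve_retries(
    explicit: Optional[Dict[str, Any]] = None,
    environ: Optional[Mapping[str, str]] = None,
) -> Dict[str, Any]:
    """Pick retry settings: explicit values win, then the environment, then defaults."""
    environ = os.environ if environ is None else environ
    retries: Dict[str, Any] = {}
    # Only fill in a default when the user has not configured it via the environment,
    # leaving botocore free to read AWS_MAX_ATTEMPTS / AWS_RETRY_MODE itself.
    if "AWS_MAX_ATTEMPTS" not in environ:
        retries["max_attempts"] = 10
    if "AWS_RETRY_MODE" not in environ:
        retries["mode"] = "adaptive"
    retries.update(explicit or {})
    return retries


def get_boto_client(service: str, retries: Optional[Dict[str, Any]] = None, **kwargs: Any):
    """Create a boto3 client with shared retry handling (hypothetical factory)."""
    import boto3  # imported lazily so the helper above stays dependency-free
    from botocore.config import Config

    config = Config(retries=resolve_retries(retries) or None)
    user_config = kwargs.pop("config", None)
    if user_config is not None:
        # Values in the caller's Config take precedence over the shared defaults.
        config = config.merge(user_config)
    return boto3.client(service, config=config, **kwargs)
```

With this shape, `get_boto_client("ecs")` and any other client the agent needs would share the same retry behavior.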

Originally worked on by @joe1981al at https://github.com/PrefectHQ/prefect/issues/4380

joe1981al commented 3 years ago

@madkinsz Just an update: even with unique IDs we are still hitting rate limits for register/deregister. I'm not sure whether boto3 is using the environment variables or whether they are overridden by L182-184

I'm wondering if there is a way to have the agent code generate the task definition then have that first task definition's ARN written back to the run_config for the flow... Either default or "reuse-task-definition" option...

zanieb commented 3 years ago

@joe1981al can you show what errors you're encountering? I thought the rate limits were for the task definition family names and a unique id would remove that bottleneck?

joe1981al commented 3 years ago

@madkinsz A unique ID did get us past the rate limit on family names, which is lower than the overall register/deregister limit. Amazon won't publish rate limits for ECS because they say the limits are dynamic...

An error occurred (ThrottlingException) when calling the DeregisterTaskDefinition operation: Rate exceeded
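For reference, a throttled call like the one above can also be retried at the call site. This is a minimal sketch of exponential backoff with full jitter, assuming the exception exposes a botocore-style `response["Error"]["Code"]`; `is_rate_limited` and `call_with_backoff` are hypothetical names, not Prefect or boto3 APIs.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")

# Error codes treated as retryable in this sketch.
RETRYABLE_CODES = {"ThrottlingException"}


def is_rate_limited(exc: Exception) -> bool:
    """Return True if the exception carries a retryable AWS error code."""
    response = getattr(exc, "response", None) or {}
    return response.get("Error", {}).get("Code") in RETRYABLE_CODES


def call_with_backoff(
    fn: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying rate-limit errors with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception as exc:
            # Re-raise anything that is not throttling, or the final failure.
            if not is_rate_limited(exc) or attempt == max_attempts:
                raise
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Full jitter spreads out retries from concurrent callers.
            sleep(random.uniform(0, delay))
    raise RuntimeError("max_attempts must be at least 1")
```

Usage would look like `call_with_backoff(lambda: ecs.deregister_task_definition(taskDefinition=arn))`, where `ecs` and `arn` are whatever client and ARN the caller already has.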

Rongrs commented 3 years ago

Hi, any progress on that? This issue is preventing us from using ECS agents right now, and it looked like there's a valid solution, no?

joe1981al commented 3 years ago

@Rongrs

You could try setting these env variables; it does help:

- Name: AWS_RETRY_MODE
  Value: adaptive
- Name: AWS_MAX_ATTEMPTS
  Value: 100

ryan-cf commented 3 years ago

We're also hitting this issue. Even with the above env vars, it seems we only need 10 or so flows launching at once for this error to appear consistently (this happens if we're not careful when scheduling backfills).

Does that mean these vars aren't being picked up correctly or is this just still a limitation of ECS? I've set them in the container definition for the ECS Agent we have running in our fargate cluster
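A quick way to rule out the first possibility is to log what the agent process itself sees. The sketch below uses a hypothetical helper, `retry_env_report`. It only confirms that the variables reach the process; it cannot show whether code later overrides them when building the boto3 client.

```python
import os
from typing import Dict, Mapping, Optional

RETRY_ENV_VARS = ("AWS_RETRY_MODE", "AWS_MAX_ATTEMPTS")


def retry_env_report(environ: Optional[Mapping[str, str]] = None) -> Dict[str, Optional[str]]:
    """Return the retry-related environment variables visible to this process."""
    environ = os.environ if environ is None else environ
    # A value of None means the variable is not set in this process at all.
    return {name: environ.get(name) for name in RETRY_ENV_VARS}


if __name__ == "__main__":
    for name, value in retry_env_report().items():
        print(f"{name}={value!r}")
```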

Any other ideas out there beyond the env variables or UUID? I'm happy to try modifying the ECS agent code; I'm just not sure what approach would actually solve the problem at this point. Maybe we can make the agent smart enough to avoid re-registering the task definition for the flow if it hasn't changed?
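The "skip re-registration if unchanged" idea could be sketched as a fingerprint-keyed cache. Everything here is hypothetical rather than existing agent code: `fingerprint`, `TaskDefinitionCache`, and the injected `register` callable, which with boto3 would wrap `ecs.register_task_definition(**task_definition)` and return the ARN from its response. It also assumes the family name is stable per flow, since a per-run unique name would change the fingerprint every time.

```python
import hashlib
import json
from typing import Callable, Dict


def fingerprint(task_definition: dict) -> str:
    """Hash a task definition so identical definitions map to the same key."""
    # sort_keys makes the hash independent of dictionary key order.
    canonical = json.dumps(task_definition, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class TaskDefinitionCache:
    """Remember the ARN returned for each distinct task definition (in memory)."""

    def __init__(self, register: Callable[[dict], str]) -> None:
        self._register = register  # hypothetical callable: definition -> ARN
        self._arns: Dict[str, str] = {}

    def arn_for(self, task_definition: dict) -> str:
        key = fingerprint(task_definition)
        if key not in self._arns:
            # Only call the registration API for definitions not seen before.
            self._arns[key] = self._register(task_definition)
        return self._arns[key]
```

A cache like this only helps if the agent also stops deregistering the task definition after each run; otherwise the cached ARN would point at an inactive revision.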

joe1981al commented 3 years ago

@ryan-cf can you post the error message you are getting? Is it related to the family, like the first one below, or to general register/deregister, like the second? The UUID modification in #4380 (closed, not merged) helped me get past the family issue. The Prefect team merged a fix, #4417, that resolved the issue where AWS_RETRY_MODE was being overridden by code. Since then I have not had any issues with tasks failing to register or deregister (I am still using the custom UUID code for the agent).

An error occurred (ClientException) when calling the RegisterTaskDefinition operation: Too many concurrent attempts to create a new revision of the specified family.
An error occurred (ThrottlingException) when calling the DeregisterTaskDefinition operation: Rate exceeded

bennnym commented 3 years ago

Hi @joe1981al,

Can I confirm that these environment variables need to be set on the agent container, or will it suffice to add them to the ECSRun (the run config object)?

Thanks in advance

joe1981al commented 3 years ago

@bennnym They must be set in the agent container. The ECSRun config applies only to the ECS container created for the flow run, and this issue concerns the AWS API calls made through boto3 from the agent container.

bennnym commented 3 years ago

Hmm, didn't fix the issue for me. I'm still consistently getting the same error.