zanieb closed this 2 years ago
@madkinsz Just an update, even with unique IDs we are still hitting rate limits for register / deregister. Not sure if boto3 is using the environment variables or if it's overwritten by L182-184
I'm wondering if there is a way to have the agent code generate the task definition then have that first task definition's ARN written back to the run_config for the flow... Either default or "reuse-task-definition" option...
@joe1981al can you show what errors you're encountering? I thought the rate limits were for the task definition family names and a unique id would remove that bottleneck?
@madkinsz Having a unique ID did get us past the rate limit on family names, which is lower than the overall register / deregister limit. Amazon won't publish rate limits for ECS because they say the limits are dynamic...
An error occurred (ThrottlingException) when calling the DeregisterTaskDefinition operation: Rate exceeded
Hi, any progress on that? This issue is preventing us from using ECS agents right now, and it looked like there's a valid solution, no?
@Rongrs You could try setting these env variables, it does help:
- Name: AWS_RETRY_MODE
  Value: adaptive
- Name: AWS_MAX_ATTEMPTS
  Value: 100
We're also hitting this issue. E.g. even with the above ENV vars, it seems like we only need to have 10 or so flows trying to be launched at once for this error to pop up consistently (this happens if we're not careful when scheduling backfills).
Does that mean these vars aren't being picked up correctly or is this just still a limitation of ECS? I've set them in the container definition for the ECS Agent we have running in our fargate cluster
Any other ideas out there beyond the env variables or UUID? Happy to try modifying the ECS agent code, just not quite sure what approach to take at this point that would actually solve the problem. Maybe we can make the agent smart enough to avoid re-registering the task definition for the flow if it hasn't changed?
@ryan-cf Can you post the error message you are getting? Is it related to the family, like the first error below, or to general register/deregister, like the second? The UUID modification in #4380 (closed, not merged) helped me get past the family issue. The Prefect team merged a fix, #4417, that resolved the issue where AWS_RETRY_MODE was being overridden by code. After that I have not had any issues (still using custom UUID code for the agent) with tasks not registering or deregistering.
An error occurred (ClientException) when calling the RegisterTaskDefinition operation: Too many concurrent attempts to create a new revision of the specified family.
An error occurred (ThrottlingException) when calling the DeregisterTaskDefinition operation: Rate exceeded
Hi @joe1981al ,
Can I confirm whether these environment variables need to be set on the agent container, or will it suffice to add them to the ECSRun (run config object)?
Thanks in advance
@bennnym They must be set in the agent container. The ECSRun config applies only to the ECS container created for the flow run, and this issue is related to the AWS API calls made through boto3 from the agent container.
Hmm, didn't fix the issue for me. I'm still consistently getting the same error.
Task definition registration has very low rate limits for definitions with the same name. Now that we create a task definition for every flow run and use the flow slug as the name, users can easily hit this limit. We need to:
We can change the task definition name to include a short unique id or we can consider (and I think this may be the best option) using the flow run id instead of the slug. i.e.
prefect-flow-run-{id}
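A minimal sketch of that naming scheme, assuming the flow run id is available as a string. The function name and the random fallback are hypothetical and only here for illustration:

```python
import uuid
from typing import Optional


def task_definition_family(flow_run_id: Optional[str] = None) -> str:
    """Build a per-run task definition family name: prefect-flow-run-{id}."""
    # Hypothetical fallback: use a random hex id when no flow run id is given.
    run_id = flow_run_id or uuid.uuid4().hex
    return f"prefect-flow-run-{run_id}"
```

Because every run gets its own family, concurrent runs of the same flow no longer compete to create revisions of a single family. The overall register / deregister limits still apply.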
We should be able to set up a retry via the boto3 client e.g.
I think that change may require a look from someone on our team so we can ensure that the config is handled properly across our uses of the boto client.
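For readers unfamiliar with what a retry on throttling involves, here is a generic sketch of exponential backoff with jitter around a throttled call. It is not the boto3 client configuration mentioned above and not Prefect's implementation: `ThrottledError` and `call_with_backoff` are hypothetical names, and the delays are arbitrary defaults.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class ThrottledError(Exception):
    """Hypothetical stand-in for a 'Rate exceeded' style error."""


def call_with_backoff(
    func: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 20.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `func`, retrying on ThrottledError with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except ThrottledError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the throttling error
            # The delay cap doubles per attempt; a random wait below the cap
            # keeps many concurrent callers from retrying in lockstep.
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, cap))
    raise ValueError("max_attempts must be at least 1")
```

The jitter matters in the scenario from this thread: when ten flow runs are throttled at the same moment, fixed delays would make them all retry together and hit the limit again.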
Originally worked on by @joe1981al at https://github.com/PrefectHQ/prefect/issues/4380