Closed: gabcoyne closed this issue 1 year ago
OK, I tested this out a bit more, and it looks like when the user provides the task_definition_arn in this format (with the family name but without a revision), everything works without hitting the ECS API limits: "arn:aws:ecs:us-east-1:338306982838:task-definition/prefect"
my block:
from prefect_aws.ecs import ECSTask

ecs = ECSTask(
    aws_credentials=aws_creds,  # a previously created AwsCredentials block
    cluster="prefect",
    task_definition_arn="arn:aws:ecs:us-east-1:111111111111111:task-definition/prefect",
)
ecs.save(DEFAULT_BLOCK, overwrite=True)
I triggered a bunch of runs (more than 12, which would otherwise have hit the limit) and everything worked fine,
and even though I didn't set any custom CloudWatch logging here, all logs still show up in the Prefect backend.
@madkinsz LMK if we can close the issue for now and advise users running things at scale to register the task definition themselves and use it without an explicit revision (which uses the latest active revision): "arn:aws:ecs:us-east-1:338306982838:task-definition/prefect"
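For illustration, a minimal sketch of what registering such a task definition yourself could look like with boto3 (the family, role, and image below are placeholders; the container name "prefect" is an assumption based on what the ECSTask block targets by default and is worth verifying for your version):

import boto3

ecs_client = boto3.client("ecs", region_name="us-east-1")

response = ecs_client.register_task_definition(
    family="prefect",  # the family referenced by the revision-less ARN above
    requiresCompatibilities=["FARGATE"],
    networkMode="awsvpc",
    cpu="512",
    memory="1024",
    executionRoleArn="arn:aws:iam::XXX:role/prefect-execution-role",  # placeholder
    containerDefinitions=[
        {
            "name": "prefect",  # assumed default container name for the ECSTask block
            "image": "XXX.dkr.ecr.us-east-1.amazonaws.com/prefect:latest",  # placeholder
        }
    ],
)
# dropping the ":<revision>" suffix yields the family-level ARN used above,
# which always resolves to the latest ACTIVE revision
arn = response["taskDefinition"]["taskDefinitionArn"].rsplit(":", 1)[0]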
I think there is a separate issue here, at least one that I saw a user report in the community, specifically receiving this error: An error occurred (ClientException) when calling the RegisterTaskDefinition: Too many concurrent attempts to create a new revision of the specified family.
This error was returned in the AWS CloudWatch logs, and it was determined that a fix for this was included in Prefect v1 (https://github.com/PrefectHQ/prefect/pull/5059/files) but does not exist for the agent in v2.
This was hot-fixed by the customer in line 949 here: https://github.com/PrefectHQ/prefect-aws/blob/main/prefect_aws/ecs.py#L949
# set a default family if none is present, then append the flow run ID,
# so every flow run registers against its own task family
task_definition.setdefault("family", f"prefect-{self.labels['prefect.io/flow-run-id']}")
task_definition['family'] = f"{task_definition['family']}-{self.labels['prefect.io/flow-run-id']}"
I'm not sure if it's exactly the same core issue, but the reported error is the same, and the fix seems to have resolved it for this customer. The core of it is that the family name was missing the flow run ID?
it is the same issue indeed
the option of recommending that users register the task definition themselves, e.g. as part of CI/CD, and set that ARN explicitly on the block makes sense to me; this way it wouldn't require any changes to the current logic (which works great)
I defer to @madkinsz to determine how to approach it
Too many concurrent attempts to create a new revision of the specified family.
I do not want to use a UUID for the family as we'll generate a bunch of families that way. Ideally, it'd be the deployment name. I'll see if I can find a clean way to get the deployment name to the infrastructure.
ideally, the flow name combined with the deployment name (flow_name/deployment_name), since many users name all their deployments "default" or "prod" or "k8s", and the deployment name alone wouldn't be helpful
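A sketch of how such a family name might be derived (illustrative only, not the actual implementation):

import re

def family_name(flow_name: str, deployment_name: str) -> str:
    # ECS family names allow up to 255 letters, numbers, hyphens, and underscores,
    # so anything else is replaced before the name is used
    raw = f"prefect_{flow_name}_{deployment_name}"
    return re.sub(r"[^a-zA-Z0-9_-]", "-", raw)[:255]

family_name("betfair_transactions", "prod")  # -> "prefect_betfair_transactions_prod"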
what is the downside of too many families? I believe this is how it worked in v1, and we didn't have issues. I'm open to other solutions, just curious.
Further context in the community here.
I am indeed experiencing this issue.
We have started our migration from 1.0 to 2.0, but this issue is going to force us to pause until it is rectified, as we are unable to run our production environment at the scale required.
I would also like to understand the downside of multiple families? This worked for us in Prefect 1.0.
there aren't many downsides; it's just a little weird because a task definition is supposed to work like a Docker image: a template from which you run many instances. In the same way that you wouldn't build a Docker image at runtime every time you run a flow, you usually wouldn't want a task definition per run UUID. We can potentially do a bit with caching here, or set the name to flow_name/deployment_name if it's possible to retrieve that on the infrastructure block; Michael mentioned that he'll investigate whether that's possible.
Workaround: for now, I would recommend building the task definition as part of your CI/CD and updating the task definition ARN on your ECSTask block from CI as well; this way, the AWS rate limit issue should be gone.
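A minimal sketch of that CI step, assuming the block is named "default" and the task definition was already registered separately (the ARN is a placeholder):

from prefect_aws.ecs import ECSTask

# load the existing block, point it at the CI-registered task definition
# (family-level ARN without a revision), and save it back
ecs = ECSTask.load("default")
ecs.task_definition_arn = "arn:aws:ecs:us-east-1:XXX:task-definition/prefect"
ecs.save("default", overwrite=True)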
Thanks @anna-geller.
With your proposed workaround, will the infra overrides still work at run time?
Isn't your proposed solution suggesting that the ECSTask block could just create a task definition on save, and that this would also mitigate the issue?
It seems a bit careless to have to use a workaround during a migration and then be forced to fix it down the line. But if it's the only option, I guess I'd prefer to delay the migration until a valid solution is in place.
This is a valid solution. In fact, it's how AWS actually wants us to use this service; that's why this rate limiting exists. Which infra overrides do you have in mind? I would recommend setting all infrastructure-related attributes as part of your CI: e.g. you build an image in CI, push it to ECR, build a task definition, and tell Prefect on the block which one to use by specifying an explicit task definition ARN (Prefect will always use the latest).
I think what we're discussing here is more of a convenience enhancement that avoids having to register the task definition through CI and instead does it at flow runtime, even at scale. This shouldn't block migration, though. If you have issues figuring out CI, I can point you to resources that can help.
This is what I currently do with my CI.
from prefect_aws.ecs import ECSTask

def build_and_save_ecs_task(repo_name: str):
    # Sts is the author's helper returning the AWS account ID
    # (presumably wrapping boto3's sts get_caller_identity)
    account_number = Sts.get_caller_identity()
    ecs = ECSTask(
        name=repo_name,
        image=f"{account_number}.dkr.ecr.ap-southeast-2.amazonaws.com/prefect-{repo_name}:latest",
        cpu=512,
        memory=256,
        **{
            "stream_output": True,
            "configure_cloudwatch_logs": True,
            "cluster": "prefect-cluster",
            "execution_role_arn": "arn:aws:iam::XXX:role/prefect-execution-role",
            "task_role_arn": "arn:aws:iam::XXX:role/prefect-task-role",
            "launch_type": "FARGATE",
            "vpc_id": "vpc-XXX",
            "env": {"PYTHONPATH": "$PYTHONPATH:.", "AWS_RETRY_MODE": "adaptive", "AWS_MAX_ATTEMPTS": "100"},
            "task_customizations": [
                {"op": "replace", "path": "/networkConfiguration/awsvpcConfiguration/assignPublicIp", "value": "DISABLED"},
                {
                    "op": "add",
                    "path": "/networkConfiguration/awsvpcConfiguration/subnets",
                    "value": ["subnet-XXX", "subnet-XXX", "subnet-XXX"],
                },
                {
                    "op": "add",
                    "path": "/networkConfiguration/awsvpcConfiguration/securityGroups",
                    "value": ["sg-XXX"],
                },
            ],
        },
    )
    ecs.save(name=repo_name, overwrite=True)
I deploy the flows and deployments from a deployments.yml file that I maintain with the required info, e.g.:
flows:
  - flow_name: betfair_transactions
    flow_path: flows/betfair/betfair_transactions.py
    deployments:
      - name: strategy
        parameters: {"account_identifier": "strategy", "record_count": 5000}
        schedule: {"cron": "0 * * * *", "timezone": "Australia/Brisbane"}
        infra_overrides: {"cpu": 1024, "memory": 8192}
      - name: automated
        parameters: {"account_identifier": "automated", "record_count": 5000}
        schedule: {"cron": "0 * * * *", "timezone": "Australia/Brisbane"}
This YAML file is used to template what I need for the prefect deployment build and prefect deployment apply commands. Does that make sense?
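For illustration, the templating described above could look roughly like this (the exact CLI flags should be verified against the installed Prefect 2 version):

import json
import subprocess
import yaml

with open("deployments.yml") as f:
    config = yaml.safe_load(f)

for flow in config["flows"]:
    for dep in flow["deployments"]:
        cmd = [
            "prefect", "deployment", "build",
            f"{flow['flow_path']}:{flow['flow_name']}",  # path/to/file.py:flow_function
            "-n", dep["name"],
            "-ib", "ecs-task/default",  # the ECSTask block saved from CI
            "--cron", dep["schedule"]["cron"],  # a timezone flag may also be available
            "--params", json.dumps(dep.get("parameters", {})),
            "-a",  # apply immediately instead of a separate `prefect deployment apply`
        ]
        for key, value in dep.get("infra_overrides", {}).items():
            cmd += ["--override", f"{key}={value}"]
        subprocess.run(cmd, check=True)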
Given that I can still configure my memory and cpu, I see no issues with your proposal.
Am I right that you're just calling boto3's client.run_task and providing the cpu and memory keys at run time? If that is the case, I think this will work fine.
Are you suggesting that, instead of supplying all the configuration to the ECSTask, I make my own task definition and then just provide its ARN? Could I still create just one task definition for the whole repository? Which arguments would I still need to provide to the ECSTask? I assume it's any that are mapped to client.run_task?
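For reference, a rough sketch of the boto3 call being asked about (placeholder ARN and network values; this shows the mechanism in question, not necessarily exactly what prefect-aws does internally). run_task does accept task-level cpu and memory overrides without registering a new revision:

import boto3

ecs_client = boto3.client("ecs", region_name="ap-southeast-2")

ecs_client.run_task(
    cluster="prefect-cluster",
    taskDefinition="arn:aws:ecs:ap-southeast-2:XXX:task-definition/prefect",  # latest active revision
    launchType="FARGATE",
    networkConfiguration={
        "awsvpcConfiguration": {
            "subnets": ["subnet-XXX"],
            "securityGroups": ["sg-XXX"],
            "assignPublicIp": "DISABLED",
        }
    },
    # task-level overrides: no new task definition revision is registered
    overrides={"cpu": "1024", "memory": "8192"},
)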
thx for sharing, step 2 seems unnecessary since you are using the latest image and no changes are made to the ECSTask block
you would need a different ECSTask definition for each distinct cpu/memory setting
yes, exactly: you would register the task definition from your CI, and all you need on your block is to set the task_definition_arn to "arn:aws:ecs:us-east-1:338306982838:task-definition/prefect"; Prefect will take the latest active revision automatically
so both the ECSTask block creation and the task definition registration would actually need to happen only once if you are using the latest ECR image
to make it easier, perhaps you could have two ECSTask blocks, one configured with minimal cpu/memory and another more generous one? This way, it's trivial to switch when some deployment turns out to need more memory, e.g. from -ib ecs-task/default to -ib ecs-task/high-memory
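A sketch of that suggestion: two ECSTask blocks that differ only in cpu/memory (the values here are just illustrative Fargate-valid combinations):

from prefect_aws.ecs import ECSTask

common = dict(cluster="prefect-cluster", launch_type="FARGATE")

# deployments then pick one via -ib ecs-task/default or -ib ecs-task/high-memory
ECSTask(cpu=512, memory=1024, **common).save("default", overwrite=True)
ECSTask(cpu=2048, memory=8192, **common).save("high-memory", overwrite=True)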
Thanks @anna-geller.
Ah, yeah, that is not an ideal solution for our use case at all then.
We have many specific and different memory/CPU combinations in use, and the entire reason we'd be against this option is that it would end up with many, many ECS blocks. Seems like overkill, tbh.
Step 2 is there mostly for IaC and reproducibility in the repo.
I think for now the hotfix shown in the thread above is probably the best solution for our needs.
AWS rate-limits task definition registration for a given task family. There are two factors that make it easy for a relatively low volume of Prefect flows to hit those limits:
RegisterTaskDefinition operation: Too many concurrent attempts to create a new revision of the specified family.
Solution: I feel like creating task families per deployment would be the easiest way around this, as it reduces the number of registrations for a given family. However, this does not fully resolve the issue. It is nice to see a new log stream for every flow run, but the need to register task definitions still leaves a chance for rate limits to be hit.