Closed: gabcoyne closed this issue 1 year ago
OK, I tested this out a bit more, and it looks like when the user provides the task_definition_arn in this format (with the family name but without a revision), everything works without hitting the ECS API limits: "arn:aws:ecs:us-east-1:338306982838:task-definition/prefect"
my block:
from prefect_aws.ecs import ECSTask

ecs = ECSTask(
    aws_credentials=aws_creds,  # a previously created AwsCredentials block
    cluster="prefect",
    task_definition_arn="arn:aws:ecs:us-east-1:111111111111111:task-definition/prefect",
)
ecs.save(DEFAULT_BLOCK, overwrite=True)
I triggered a bunch of runs (more than 12, which would otherwise have hit the limit) and everything worked fine,
and even though I didn't set any custom CloudWatch logging here, all logs still show up in the Prefect backend.
@madkinsz LMK if we can close the issue for now and advise users running things at scale to register the task definition themselves and use it without an explicit revision (which uses the latest active revision): "arn:aws:ecs:us-east-1:338306982838:task-definition/prefect"
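For illustration, a minimal sketch of what registering such a task definition yourself could look like with boto3 (the family, role, and image below are placeholders; the container name "prefect" is an assumption based on what the ECSTask block targets by default and is worth verifying for your version):

import boto3

ecs_client = boto3.client("ecs", region_name="us-east-1")

response = ecs_client.register_task_definition(
    family="prefect",  # the family referenced by the revision-less ARN above
    requiresCompatibilities=["FARGATE"],
    networkMode="awsvpc",
    cpu="512",
    memory="1024",
    executionRoleArn="arn:aws:iam::XXX:role/prefect-execution-role",  # placeholder
    containerDefinitions=[
        {
            "name": "prefect",  # assumed default container name for the ECSTask block
            "image": "XXX.dkr.ecr.us-east-1.amazonaws.com/prefect:latest",  # placeholder
        }
    ],
)
# dropping the ":<revision>" suffix yields the family-level ARN used above,
# which always resolves to the latest ACTIVE revision
arn = response["taskDefinition"]["taskDefinitionArn"].rsplit(":", 1)[0]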
I think there is a separate issue here, at least one that I saw a user report in the community, specifically receiving this error: An error occurred (ClientException) when calling the RegisterTaskDefinition: Too many concurrent attempts to create a new revision of the specified family.
This error was returned in the AWS CloudWatch logs, and it was determined that a fix for this was included in Prefect v1 (https://github.com/PrefectHQ/prefect/pull/5059/files) but does not exist for the agent in v2.
This was hot-fixed by the customer in line 949 here: https://github.com/PrefectHQ/prefect-aws/blob/main/prefect_aws/ecs.py#L949
# set a default family if none is present, then append the flow run ID,
# so every flow run registers against its own task family
task_definition.setdefault("family", f"prefect-{self.labels['prefect.io/flow-run-id']}")
task_definition['family'] = f"{task_definition['family']}-{self.labels['prefect.io/flow-run-id']}"
I'm not sure if it's exactly the same core issue, but the reported error is the same, and the fix seems to have resolved it for this customer. The core of it is that the family name was missing the flow run ID?
it is the same issue indeed
the option of recommending that users register the task definition themselves, e.g. as part of CI/CD, and set that ARN explicitly on the block makes sense to me; this way it wouldn't require any changes to the current logic (which works great)
I defer to @madkinsz to determine how to approach it
Too many concurrent attempts to create a new revision of the specified family.
I do not want to use a UUID for the family as we'll generate a bunch of families that way. Ideally, it'd be the deployment name. I'll see if I can find a clean way to get the deployment name to the infrastructure.
ideally, the flow name combined with the deployment name (flow_name/deployment_name), since many users name all their deployments "default" or "prod" or "k8s", and the deployment name alone wouldn't be helpful
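A sketch of how such a family name might be derived (illustrative only, not the actual implementation):

import re

def family_name(flow_name: str, deployment_name: str) -> str:
    # ECS family names allow up to 255 letters, numbers, hyphens, and underscores,
    # so anything else is replaced before the name is used
    raw = f"prefect_{flow_name}_{deployment_name}"
    return re.sub(r"[^a-zA-Z0-9_-]", "-", raw)[:255]

family_name("betfair_transactions", "prod")  # -> "prefect_betfair_transactions_prod"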
what is the downside of too many families? I believe this is how it worked in v1, and we didn't have issues. I'm open to other solutions, just curious.
Further context in the community here.
I am indeed experiencing this issue.
We have started our migration from 1.0 to 2.0, but this issue is going to force us to pause until it is rectified, as we are unable to run our production environment at the scale required.
I would also like to understand the downside of multiple families? This worked for us in Prefect 1.0.
there aren't many downsides; it's just a little weird because a task definition is supposed to work like a Docker image: a template from which you run many instances. In the same way that you wouldn't build a Docker image at runtime every time you run a flow, you usually wouldn't want a task definition per run UUID. We can potentially do a bit with caching here, or set the name to flow_name/deployment_name if it's possible to retrieve that on the infrastructure block; Michael mentioned that he'll investigate whether that's possible.
Workaround: for now, I would recommend building the task definition as part of your CI/CD and updating the task definition ARN on your ECSTask block from CI as well; this way, the AWS rate limit issue should be gone.
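A minimal sketch of that CI step, assuming the block is named "default" and the task definition was already registered separately (the ARN is a placeholder):

from prefect_aws.ecs import ECSTask

# load the existing block, point it at the CI-registered task definition
# (family-level ARN without a revision), and save it back
ecs = ECSTask.load("default")
ecs.task_definition_arn = "arn:aws:ecs:us-east-1:XXX:task-definition/prefect"
ecs.save("default", overwrite=True)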
Thanks @anna-geller.
With your proposed workaround, will the infra overrides still work at run time?
Isn't your proposed solution suggesting that the ECSTask block could just create a task definition on save, and that this would also mitigate the issue?
It seems a bit careless to have to use a workaround during a migration and then be forced to fix it down the line. But if it's the only option, I guess I'd prefer to delay the migration until a valid solution is in place.
This is a valid solution. In fact, it's how AWS actually wants us to use this service; that's why this rate limiting exists. Which infra overrides do you have in mind? I would recommend setting all infrastructure-related attributes as part of your CI: e.g. you build an image in CI, push it to ECR, build a task definition, and tell Prefect on the block which one to use by specifying an explicit task definition ARN (Prefect will always use the latest).
I think what we're discussing here is more of a convenience enhancement that avoids having to register the task definition through CI and instead does it at flow runtime, even at scale. This shouldn't block migration, though. If you have issues figuring out CI, I can point you to resources that can help.
This is what I currently do with my CI.
from prefect_aws.ecs import ECSTask

def build_and_save_ecs_task(repo_name: str):
    # Sts is the author's helper returning the AWS account ID
    # (presumably wrapping boto3's sts get_caller_identity)
    account_number = Sts.get_caller_identity()
    ecs = ECSTask(
        name=repo_name,
        image=f"{account_number}.dkr.ecr.ap-southeast-2.amazonaws.com/prefect-{repo_name}:latest",
        cpu=512,
        memory=256,
        **{
            "stream_output": True,
            "configure_cloudwatch_logs": True,
            "cluster": "prefect-cluster",
            "execution_role_arn": "arn:aws:iam::XXX:role/prefect-execution-role",
            "task_role_arn": "arn:aws:iam::XXX:role/prefect-task-role",
            "launch_type": "FARGATE",
            "vpc_id": "vpc-XXX",
            "env": {"PYTHONPATH": "$PYTHONPATH:.", "AWS_RETRY_MODE": "adaptive", "AWS_MAX_ATTEMPTS": "100"},
            "task_customizations": [
                {"op": "replace", "path": "/networkConfiguration/awsvpcConfiguration/assignPublicIp", "value": "DISABLED"},
                {
                    "op": "add",
                    "path": "/networkConfiguration/awsvpcConfiguration/subnets",
                    "value": ["subnet-XXX", "subnet-XXX", "subnet-XXX"],
                },
                {
                    "op": "add",
                    "path": "/networkConfiguration/awsvpcConfiguration/securityGroups",
                    "value": ["sg-XXX"],
                },
            ],
        },
    )
    ecs.save(name=repo_name, overwrite=True)
I deploy the flows and deployments from a deployments.yml file that I maintain with the required info, e.g.:
flows:
  - flow_name: betfair_transactions
    flow_path: flows/betfair/betfair_transactions.py
    deployments:
      - name: strategy
        parameters: {"account_identifier": "strategy", "record_count": 5000}
        schedule: {"cron": "0 * * * *", "timezone": "Australia/Brisbane"}
        infra_overrides: {"cpu": 1024, "memory": 8192}
      - name: automated
        parameters: {"account_identifier": "automated", "record_count": 5000}
        schedule: {"cron": "0 * * * *", "timezone": "Australia/Brisbane"}
This YAML file is used to template what I need for the prefect deployment build and prefect deployment apply commands. Does that make sense?
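For illustration, the templating described above could look roughly like this (the exact CLI flags should be verified against the installed Prefect 2 version):

import json
import subprocess
import yaml

with open("deployments.yml") as f:
    config = yaml.safe_load(f)

for flow in config["flows"]:
    for dep in flow["deployments"]:
        cmd = [
            "prefect", "deployment", "build",
            f"{flow['flow_path']}:{flow['flow_name']}",  # path/to/file.py:flow_function
            "-n", dep["name"],
            "-ib", "ecs-task/default",  # the ECSTask block saved from CI
            "--cron", dep["schedule"]["cron"],  # a timezone flag may also be available
            "--params", json.dumps(dep.get("parameters", {})),
            "-a",  # apply immediately instead of a separate `prefect deployment apply`
        ]
        for key, value in dep.get("infra_overrides", {}).items():
            cmd += ["--override", f"{key}={value}"]
        subprocess.run(cmd, check=True)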
Given that I can still configure my memory and cpu, I see no issues with your proposal.
Am I right that you're just calling boto3's client.run_task and providing the cpu and memory keys at run time? If that is the case, I think this will work fine.
Are you suggesting that, instead of supplying all the configuration to the ECSTask, I make my own task definition and then just provide its ARN? Could I still create just one task definition for the whole repository? Which arguments would I still need to provide to the ECSTask? I assume it's any that are mapped to client.run_task?
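For reference, a rough sketch of the boto3 call being asked about (placeholder ARN and network values; this shows the mechanism in question, not necessarily exactly what prefect-aws does internally). run_task does accept task-level cpu and memory overrides without registering a new revision:

import boto3

ecs_client = boto3.client("ecs", region_name="ap-southeast-2")

ecs_client.run_task(
    cluster="prefect-cluster",
    taskDefinition="arn:aws:ecs:ap-southeast-2:XXX:task-definition/prefect",  # latest active revision
    launchType="FARGATE",
    networkConfiguration={
        "awsvpcConfiguration": {
            "subnets": ["subnet-XXX"],
            "securityGroups": ["sg-XXX"],
            "assignPublicIp": "DISABLED",
        }
    },
    # task-level overrides: no new task definition revision is registered
    overrides={"cpu": "1024", "memory": "8192"},
)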
thx for sharing, step 2 seems unnecessary since you are using the latest image and no changes are made to the ECSTask block
you would need a different ECSTask definition for each distinct cpu/memory setting
yes, exactly: you would register the task definition from your CI, and all you need on your block is to set the task_definition_arn to "arn:aws:ecs:us-east-1:338306982838:task-definition/prefect"; Prefect will take the latest active revision automatically
so both the ECSTask block creation and the task definition registration would actually need to happen only once if you are using the latest ECR image
to make it easier, perhaps you could have two ECSTask blocks, one configured with minimal cpu/memory and another more generous one? This way, it's trivial to switch when some deployment turns out to need more memory, e.g. from -ib ecs-task/default to -ib ecs-task/high-memory
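A sketch of that suggestion: two ECSTask blocks that differ only in cpu/memory (the values here are just illustrative Fargate-valid combinations):

from prefect_aws.ecs import ECSTask

common = dict(cluster="prefect-cluster", launch_type="FARGATE")

# deployments then pick one via -ib ecs-task/default or -ib ecs-task/high-memory
ECSTask(cpu=512, memory=1024, **common).save("default", overwrite=True)
ECSTask(cpu=2048, memory=8192, **common).save("high-memory", overwrite=True)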
Thanks @anna-geller.
Ah, yeah, that is not an ideal solution for our use case at all then.
We have many specific and different memory/CPU combinations in use, and the entire reason we'd be against this option is that it would end up with many, many ECS blocks. Seems like overkill, tbh.
Step 2 is there mostly for IaC and reproducibility in the repo.
I think for now the hotfix shown in the thread above is probably the best solution for our needs.
AWS rate-limits task definition registration for a given task family. There are two factors that make it easy for a relatively low volume of Prefect flows to hit those limits:
RegisterTaskDefinition operation: Too many concurrent attempts to create a new revision of the specified family.
Solution: I feel like creating task families per deployment would be the easiest way around this, as it reduces the number of registrations for a given family. However, this does not fully resolve the issue. It is nice to see a new log stream for every flow run, but the need to register task definitions still leaves a chance for rate limits to be hit.