Closed taylor-curran closed 6 months ago
A small example of when this issue presents an issue for us:
from prefect import flow, deploy
import asyncio
from prefect.deployments.deployments import run_deployment
@flow(log_prints=True)
def hello(deployment_index):
print(f'hello {deployment_index}')
async def main():
deployments=[]
for x in range(200):
deployments.append(
await hello.to_deployment(
name=f"deplyment-{x}",
parameters=dict(
deployment_index=x,
),
)
)
deployed_deployments = await deploy(
*deployments,
image="same image for all deployments",
work_pool_name="same workpool for all deployments",
build=False,
)
for uuid in deployed_deployments:
await run_deployment(
name=uuid,
timeout=0,
)
if __name__ == "__main__":
asyncio.run(main())
Basically, a set of deployments, different parameters, but same image and work_pool, which results in one task_definition revision created per deployment, and we get the ClientException
.
I ran into this very same issue yesterday while trying out ECS worker pool deployments. This is not a problem we see today with our infra block agent deployments.
Expectation / Proposal
We just moved from defining our own task-definition for prefect-ecs-task, to letting prefect defining them by itself in runtime.
After the change, we seem to have an issue where prefect defines a task_definition per task. Error from aws:
Task-definition diff is:
taskDefinitionArn
, the revision-number is changedrevision
, the revision-number is changedregisteredAt
, the timestampThe main think I see I can change in the work_pool-config regarding this, is:
Auto Deregister Task Definition (Optional)
, we have tried with this both enabled and disabled.When doing a simple test like, where the same deployment is deployed multiple times, it runs ok, and same task_definition is used, both with and with out
Auto Deregister Task Definition (Optional)
:Traceback / Example
In the case where we are being throttled, we have different deployments using the same image/image-tag, and then could also use the same task-definition.
From looking at: prefect_aws/workers/ecs_worker.py#L735 You can see that we cache the task definition to avoid the over-registering of task definitions, but for this user in particular, they have too many different deployments.
We might have a situation, where what is used as cache-key is the deployment-id, so that results in the same task-definition is added to cache multiple times, and then being created multiple times. Maybe we could use a hash from the task-definition as cache-key, together with some other values from the worker/deployment, but not the deployment-id?