Open DbCrWk opened 5 days ago
hey @DbCrWk
We're happy to contribute documentation
any updates to the existing guides + a specialized guide with more of your exact situation would be super appreciated!
let us know if you need any help with the contribution process or have any questions!
How should I provide an update? Do you want documentation + a terraform template? @zzstoatzz
First check
Describe the issue
We use prefect to launch ECS-based jobs. Our CPU-only jobs work great on the push work queues. However, we also have several jobs that need GPUs. It has been very difficult to set this up properly and there is no single source of documentation that covers the end-to-end flow. In particular, our expectation is that we should be able to easily spin up GPU-based jobs on ECS and correctly autoscale the number of EC2 instances, including winding down to 0 instances if there are no jobs.
Here's what we've figured out (please correct us if there's a better way). We're happy to contribute documentation, sample code, and terraform templates for our solution:
You cannot use a serverless push work pool, and instead must use the hybrid AWS ECS pool. This fact is only gently hinted at, as the logical consequence of two things:
a. AWS Fargate does not support GPU-based machines, see: https://github.com/aws/containers-roadmap/issues/88
b. AWS ECS Push work pools only support Fargate, see: https://docs.prefect.io/latest/concepts/work-pools/
You have to set up the following resources:
a. An ECS cluster for a prefect worker. We recommend setting up a dedicated ECS cluster for just this prefect worker.
b. An appropriate autoscaling group (ASG) that spins up very carefully configured EC2 instances. This ASG has to be set up exactly correctly with the right AMIs because of the vagaries of ECS, see here, here, and here. The desired capacity should be set to 0.
c. An ECS cluster for GPU-based jobs with the previous ASG set up as a capacity provider, also with a very specific configuration.
However, it is unclear from the relevant prefect documentation that you actually cannot specify a launch type; if you do, you will get errors when submitting the flow to the infrastructure.
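To make the points above concrete, here is a minimal sketch of the request shapes involved, written as plain Python dictionaries in the form that boto3's `create_capacity_provider`, `register_task_definition` (container definitions), and `run_task` accept. Everything specific in it is an assumption for illustration and not our exact configuration: the helper function names, the resource names, the ARN, the image, and the managed scaling values are all hypothetical. The key detail is the last helper: the run request carries a `capacityProviderStrategy` and deliberately has no `launchType`, since the ECS API requires `launchType` to be omitted when a capacity provider strategy is given.

```python
from typing import Any


def capacity_provider_request(name: str, asg_arn: str) -> dict[str, Any]:
    """Kwargs for ecs.create_capacity_provider, backed by an existing ASG.

    The scaling values are illustrative assumptions. With managed scaling
    enabled and the ASG minimum size set to 0, ECS can scale the group in
    when no tasks are running.
    """
    return {
        "name": name,
        "autoScalingGroupProvider": {
            "autoScalingGroupArn": asg_arn,
            "managedScaling": {
                "status": "ENABLED",
                "targetCapacity": 100,
                "minimumScalingStepSize": 1,
                "maximumScalingStepSize": 1,
            },
            "managedTerminationProtection": "DISABLED",
        },
    }


def gpu_container_definition(name: str, image: str, gpus: int) -> dict[str, Any]:
    """One container definition that asks ECS to reserve `gpus` GPUs."""
    return {
        "name": name,
        "image": image,
        # ECS expects the GPU count as a string value.
        "resourceRequirements": [{"type": "GPU", "value": str(gpus)}],
    }


def gpu_run_task_request(
    cluster: str, task_definition: str, capacity_provider: str
) -> dict[str, Any]:
    """Kwargs for ecs.run_task that route the task through the capacity provider.

    Note that there is intentionally no "launchType" key here.
    """
    return {
        "cluster": cluster,
        "taskDefinition": task_definition,
        "capacityProviderStrategy": [
            {"capacityProvider": capacity_provider, "weight": 1, "base": 0}
        ],
    }


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    request = gpu_run_task_request("gpu-cluster", "gpu-task:1", "gpu-asg-provider")
    print(sorted(request))
```

With real credentials, each dictionary could be passed to the matching boto3 ECS client call, for example `boto3.client("ecs").run_task(**request)`. The same shape presumably applies to whatever the Prefect ECS work pool sends on our behalf, which is why leaving the launch type unset matters.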
The final fact above (that you cannot specify a launch type) seems to have caused a lot of confusion:
Describe the proposed change
We would recommend that:
Additional context
No response