PrefectHQ / prefect

Prefect is a workflow orchestration tool empowering developers to build, observe, and react to data pipelines
https://prefect.io
Apache License 2.0

Clarify documentation for GPU-based AWS jobs #13969

Open DbCrWk opened 5 days ago

DbCrWk commented 5 days ago


Describe the issue

We use prefect to launch ECS-based jobs. Our CPU-only jobs work great on the push work queues. However, we also have several jobs that need GPUs. It has been very difficult to set this up properly and there is no single source of documentation that covers the end-to-end flow. In particular, our expectation is that we should be able to easily spin up GPU-based jobs on ECS and correctly autoscale the number of EC2 instances, including winding down to 0 instances if there are no jobs.

Here's what we've figured out (please correct us if there's a better way). We're happy to contribute documentation, sample code, and terraform templates for our solution:

  1. You cannot use a serverless push work pool; instead you must use the hybrid AWS ECS work pool. This fact is only gently hinted at, as the logical consequence of two things:
     a. AWS Fargate does not support GPU-based machines, see: https://github.com/aws/containers-roadmap/issues/88
     b. AWS ECS push work pools only support Fargate, see: https://docs.prefect.io/latest/concepts/work-pools/

    AWS Elastic Container Service - Push: Execute flow runs within containers on AWS ECS. Works with existing ECS clusters and serverless execution via AWS Fargate.

  2. You have to set up the following resources:
     a. An ECS cluster for a Prefect worker. We recommend setting up a dedicated ECS cluster for just this worker.
     b. An appropriate autoscaling group (ASG) that spins up very carefully configured EC2 instances. This ASG has to be set up exactly right, with the right AMIs, because of the vagaries of ECS, see here, here, and here. The desired capacity should be set to 0.
     c. An ECS cluster for the GPU-based jobs, with the previous ASG registered as a capacity provider, also with a very specific configuration.

  3. Most importantly, you have to set a capacity provider strategy and not a launch type. You can set this on the work pool or on a deployment itself (see the sketch below). This fact is not documented directly; instead it is a logical consequence of the fact that the AWS RunTask API will, for whatever reason, ignore a capacity provider if a launch type is set, see: here and here.

    When you use cluster auto scaling, you must specify capacityProviderStrategy and not launchType.

However, the relevant Prefect documentation does not make it clear that you cannot specify a launch type here; if you do, you get errors when the flow is submitted to the infrastructure.
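For illustration, here is a minimal deployment-level sketch of point 3. All resource names (work pool, cluster, capacity provider, image) are placeholders, and the exact job variable keys, including whether the default launch type must instead be cleared on the work pool's base job template, depend on the prefect-aws version backing the ECS work pool:

```python
from prefect import flow


@flow(log_prints=True)
def train_on_gpu():
    print("running on a GPU container instance")


if __name__ == "__main__":
    # Assumes the image below already exists in ECR and contains this flow's code.
    train_on_gpu.deploy(
        name="gpu-training",
        work_pool_name="ecs-gpu-pool",  # hybrid ECS work pool polled by a Prefect worker
        image="<account>.dkr.ecr.<region>.amazonaws.com/gpu-flows:latest",
        build=False,
        push=False,
        job_variables={
            "cluster": "gpu-jobs-cluster",  # the ECS cluster backed by the GPU ASG
            # Set a capacity provider strategy and do NOT set a launch type,
            # otherwise the strategy is ignored when the worker calls RunTask.
            # Field names here mirror the AWS API; check the work pool's base
            # job template for the exact shape your prefect-aws version expects.
            "capacity_provider_strategy": [
                {"capacityProvider": "gpu-asg-capacity-provider", "weight": 1, "base": 0}
            ],
        },
    )
```

The same variables can instead be set once on the work pool's base job template so that every deployment on the pool inherits them.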

That final fact seems to have caused a lot of confusion.
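Concretely, at the AWS API level the constraint from the quote above looks like the following (a boto3 sketch with placeholder names, not what Prefect runs verbatim):

```python
import boto3

ecs = boto3.client("ecs")

# Placeholder names; the point is only which parameters are (not) passed.
response = ecs.run_task(
    cluster="gpu-jobs-cluster",
    taskDefinition="gpu-training-task",
    count=1,
    # Route the task through the ASG-backed capacity provider so ECS cluster
    # auto scaling can grow the EC2 fleet from zero and shrink it back down.
    capacityProviderStrategy=[
        {"capacityProvider": "gpu-asg-capacity-provider", "weight": 1, "base": 0}
    ],
    # Deliberately no launchType: per the AWS docs quoted above, cluster auto
    # scaling requires capacityProviderStrategy and not launchType.
)
print(response["tasks"], response["failures"])
```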

Describe the proposed change

We would recommend that:

  1. There should be a dedicated page for best practices with GPU-based jobs on AWS.
  2. The fact that EC2-based jobs need a hybrid work pool should be made more explicit.
  3. The fact that you need a capacity provider strategy and not a launch type should be made very clear in the relevant pages on work pools and the AWS integration.
  4. The sample terraform templates should be updated to include an end-to-end setup for GPU-based jobs (a rough sketch of the capacity-provider wiring follows below).
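To give a sense of what such an end-to-end template needs to wire together (point 2 in the walkthrough above), the capacity-provider side boils down to something like this, shown with boto3 rather than terraform purely as a sketch, with placeholder names and ARNs:

```python
import boto3

ecs = boto3.client("ecs")

# 2b. The ASG itself (GPU instance type, the right AMI, desired capacity 0)
#     is assumed to already exist; only its ARN is needed here.
ASG_ARN = (
    "arn:aws:autoscaling:<region>:<account>:autoScalingGroup:<uuid>:"
    "autoScalingGroupName/gpu-asg"
)

# 2c. Register the ASG as a capacity provider with managed scaling so ECS
#     scales the EC2 fleet up from zero when GPU tasks are queued.
ecs.create_capacity_provider(
    name="gpu-asg-capacity-provider",
    autoScalingGroupProvider={
        "autoScalingGroupArn": ASG_ARN,
        "managedScaling": {"status": "ENABLED", "targetCapacity": 100},
        "managedTerminationProtection": "DISABLED",
    },
)

# 2c. Create the GPU job cluster with that capacity provider as its default
#     strategy, so tasks land on the ASG-backed instances.
ecs.create_cluster(
    clusterName="gpu-jobs-cluster",
    capacityProviders=["gpu-asg-capacity-provider"],
    defaultCapacityProviderStrategy=[
        {"capacityProvider": "gpu-asg-capacity-provider", "weight": 1, "base": 0}
    ],
)
```

The terraform equivalents are roughly the aws_ecs_capacity_provider, aws_ecs_cluster, and aws_ecs_cluster_capacity_providers resources.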

Additional context

No response

zzstoatzz commented 5 days ago

hey @DbCrWk

We're happy to contribute documentation

any updates to the existing guides + a specialized guide with more of your exact situation would be super appreciated!

let us know if you need any help with the contribution process or have any questions!

DbCrWk commented 5 days ago

How should I provide an update? Do you want documentation + a terraform template? @zzstoatzz

discdiver commented 5 days ago

Thank you @DbCrWk! Re: the docs, we're updating our contributing section, so the README here is probably most useful at the moment.