PrefectHQ / prefect

Prefect is a workflow orchestration framework for building resilient data pipelines in Python.
https://prefect.io
Apache License 2.0
16.16k stars 1.58k forks source link

Setting capacityProviderStrategy not working in Push Work Pool #13030

Open kaaloo opened 1 year ago

kaaloo commented 1 year ago

Expectation / Proposal

I would like to be able to use a Push Work Pool on AWS that can scale to zero because of the cost of idle GPU instances. I have created the following template in the Advanced Tab of the Work Pool.

https://gist.github.com/kaaloo/de723421fb6fda6965fda3d3af5b6dc2

It attempts to replace the launchType section in the task run request with a capacityProviderStrategy section by setting up a capacity_provider variable which is then used as follows. The launch_type variable and launchType section in the template have been removed.

      "capacityProviderStrategy": [
        {
          "base": 0,
          "weight": 1,
          "capacityProvider": "{{ capacity_provider }}"
        }
      ]

Unfortunately, the current code base considers the launchType section to be mandatory and provides a default value of FARGATE which is added back in. This behavior is not compatible with what I'm trying to achieve.

https://repost.aws/questions/QUnKdakxvQROuUEjq2UWpa9g/should-ecs-ec2-asgprovider-capacity-provider-be-able-to-scale-up-from-zero-0-1

Traceback / Example

It seems to me that the issue stems somewhere around these lines of code:

https://github.com/PrefectHQ/prefect-aws/blob/main/prefect_aws/workers/ecs_worker.py#L854C1-L854C1

However, I'm not familiar enough with the workings of the new Push Work Pools to get much further in resolving the issue myself. However, with some guidance I could give it a try. Especially if that can help get the fix out faster.

jakekaplan commented 1 year ago

Hi @kaaloo thanks for filing the issue. You're right that right now ECS Work pool templates don't support capacityProviderStrategy. I'm happy to work with you if you'd like to contribute the feature!

Right now we ensure that there is a default launch type no matter what: https://github.com/PrefectHQ/prefect-aws/blob/main/prefect_aws/workers/ecs_worker.py#L1233

We also validate other fields based on the launch type: https://github.com/PrefectHQ/prefect-aws/blob/main/prefect_aws/workers/ecs_worker.py#L854C1-L854C1

It seems like we'll want to make sure launchType is not specified if capacityProviderStrategy and maybe some other relevant validation.

Push work pool are a closed source implementation, although they mimic a lot of the logic from the ECSWorker in this repo. If you're willing to contribute a fix I'm happy to port the implementation to push work pool version.