flyteorg / flyte

Scalable and flexible workflow orchestration platform that seamlessly unifies data, ML and analytics stacks.
https://flyte.org
Apache License 2.0
5.72k stars 645 forks source link

[Core feature] Workflow level throttling #5125

Open a4501150 opened 7 months ago

a4501150 commented 7 months ago

Motivation: Why do you think this is important?

Currently flyte does not support workflow level throttling that limits the same concurrent workflows that can be running under a flyte project / domain. (same workflow means the same binary identified by workflow name).

In our use case, we use flyte both for workflow scheduling and orchestration. One of our workflow will hit our backend API heavily and we want to limit the resource pressure on our backend API. Although we can implement throttling on backend API but backend API will also use in synchronized manner and we don't want to affect the current synchronized API performance and behavior, thus implement throttling on workflow side is the best choice.

Compare to other workflow engines, this is the major feature that lacks in flyte.

Goal: What should the final outcome look like, ideally?

For instance, when we trigger a workflow by creating a CreateExecution with a launch plan (preferred) OR when we register a workflow, we should be able to configure an extra parameter, eg. max_concurrent_execution as type long which can limit the concurrent workflows that running at the same time.

For example, let's say we have a workflow A, with a Task A. We created a launch plan for this workflow with max_concurrent_execution as 5.

Describe alternatives you've considered

We are now using a hacky way by creating a resource lock task to achieve the functionalities:

This effectively limit the concurrent executions of a given launchPlan (workflow) by have the throttling task as the first task in the given workflow DAG.

However, this will create many semi-dangling resource lock tasks if the amount of concurrent execution requests is large and max_concurrent_execution is relatively small.

Another alternatives solution

Task level throttling can further enhance the fine control level, however this maybe harder to implement than workflow level throttling.

Propose: Link/Inline OR Additional context

No response

Are you sure this issue hasn't been raised already?

Have you read the Code of Conduct?

welcome[bot] commented 7 months ago

Thank you for opening your first issue here! 🛠

dosubot[bot] commented 7 months ago

Hey there, @a4501150! Great to see you bringing up another interesting point for discussion. How have you been?

Thank you for your detailed feature request. Workflow level throttling is indeed a valuable feature that can help manage resource utilization and prevent overloading of backend APIs.

Based on the current Flyte architecture, it is feasible to implement this feature by extending the ExecutionManager and related components. Here's a high-level approach:

  1. Track Active Executions: Extend the ExecutionManager to include a mechanism for tracking the number of currently active executions for each project and domain. This could involve maintaining a counter or a more complex data structure within the ExecutionManager or leveraging an external datastore.

  2. Check Limits Before Execution: Modify the execution creation process to check the current number of active executions against a predefined limit before allowing a new execution to start. If the limit is reached, the request to start a new execution could be rejected or queued until the number of active executions decreases.

  3. Configuration for Limits: Introduce configuration options within the FlyteAdmin configuration to define the limits for concurrent executions at the project and domain levels. These limits could be specified globally, per project, or per domain.

  4. Update Metrics: Utilize the existing metrics infrastructure to monitor and report on the number of active executions and any throttling events. This would be valuable for administrators to understand the system's behavior and adjust limits as necessary.

  5. User Feedback: Enhance the API responses to include information about throttling, such as rejection messages or estimated wait times, to provide feedback to users when their executions are being limited.

The current state management in Flyte, as represented in the NodeExecution and NodeExecutionClosure structures, can be leveraged to monitor and control the concurrency of executions. Additional logic can be introduced in the state transition functions or by introducing new functions and states that specifically handle throttling conditions. For example, a new state like QUEUED could be introduced for executions that are ready to run but are being held due to throttling constraints.

Please note that this is a high-level approach and the actual implementation would require a detailed design and thorough testing to ensure it meets the requirements and does not introduce any regressions.

For more details on the current execution management in Flyte, you can refer to the ExecutionManager code in the Flyte repository.

I hope this provides a good starting point for implementing workflow level throttling in Flyte. If you have any further questions or need more details, feel free to ask.

Sources

#### About Dosu This response is meant to be useful and save you time. It is **not meant to be a precise solution**, but rather a starting point for your own research. Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant. If you want to continue the conversation, start your reply with @dosu-bot.
kumare3 commented 7 months ago

I think this is related to - https://github.com/flyteorg/flyte/issues/267

kumare3 commented 7 months ago

Also I think this is a great feature to add, would Spotify folks be willing to collaborate on this?

Incase concurrency limits are specified, the execution is created in Planned state and has a description field that explains why it is not running right now. Then Admin has a reconciliation loop that looks at all Planned workflows to ensure they meet the constraint before launching it.

a4501150 commented 7 months ago

Hey @kumare3 thanks for reply! I think we will be able to collaborate on this, needs some coordination with the team for planning.

Incase concurrency limits are specified, the execution is created in Planned state and has a description field that explains why it is not running right now. Then Admin has a reconciliation loop that looks at all Planned workflows to ensure they meet the constraint before launching it.

This sounds like a great starter solution!

sshardool commented 5 months ago

Wondering if this is fixed or if someone is looking into this @a4501150 ? If not, we can contribute this.

kumare3 commented 5 months ago

I have not seen any contributions. @a4501150 / @RRap0so i think Spotify wants to work on this?

a4501150 commented 4 months ago

hey sorry for late reply, @sshardool @kumare3 I don't think Spotify has capacity to work on this at the moment. It's great if @sshardool can help on this.

Cc @andresgomezfrr @RRap0so to double confirm on this so work didn't get duplicated.

sshardool commented 4 months ago

Thanks folks. We are evaluating this and should be to actively engage on this in the next couple of weeks.

davidmirror-ops commented 2 months ago

@sshardool do you know if this is in progress? A few other users are hitting the same limitation. Thanks!

sshardool commented 2 months ago

Hello @davidmirror-ops - unfortunately we have not been able to take this up yet and will likely take us some time to get to it. If some else is willing to contribute I would be happy to review, otherwise I will update here once we start work on this.

kumare3 commented 1 month ago

This is something we are thinking of working on with the LinkedIn team @sshardool