Open a4501150 opened 7 months ago
Thank you for opening your first issue here! 🛠
Hey there, @a4501150! Great to see you bringing up another interesting point for discussion. How have you been?
Thank you for your detailed feature request. Workflow level throttling is indeed a valuable feature that can help manage resource utilization and prevent overloading of backend APIs.
Based on the current Flyte architecture, it is feasible to implement this feature by extending the ExecutionManager
and related components. Here's a high-level approach:
Track Active Executions: Extend the ExecutionManager
to include a mechanism for tracking the number of currently active executions for each project and domain. This could involve maintaining a counter or a more complex data structure within the ExecutionManager
or leveraging an external datastore.
Check Limits Before Execution: Modify the execution creation process to check the current number of active executions against a predefined limit before allowing a new execution to start. If the limit is reached, the request to start a new execution could be rejected or queued until the number of active executions decreases.
Configuration for Limits: Introduce configuration options within the FlyteAdmin configuration to define the limits for concurrent executions at the project and domain levels. These limits could be specified globally, per project, or per domain.
Update Metrics: Utilize the existing metrics infrastructure to monitor and report on the number of active executions and any throttling events. This would be valuable for administrators to understand the system's behavior and adjust limits as necessary.
User Feedback: Enhance the API responses to include information about throttling, such as rejection messages or estimated wait times, to provide feedback to users when their executions are being limited.
The current state management in Flyte, as represented in the NodeExecution
and NodeExecutionClosure
structures, can be leveraged to monitor and control the concurrency of executions. Additional logic can be introduced in the state transition functions or by introducing new functions and states that specifically handle throttling conditions. For example, a new state like QUEUED
could be introduced for executions that are ready to run but are being held due to throttling constraints.
Please note that this is a high-level approach and the actual implementation would require a detailed design and thorough testing to ensure it meets the requirements and does not introduce any regressions.
For more details on the current execution management in Flyte, you can refer to the ExecutionManager
code in the Flyte repository.
I hope this provides a good starting point for implementing workflow level throttling in Flyte. If you have any further questions or need more details, feel free to ask.
I think this is related to - https://github.com/flyteorg/flyte/issues/267
Also I think this is a great feature to add, would Spotify folks be willing to collaborate on this?
Incase concurrency limits are specified, the execution is created in Planned
state and has a description field that explains why it is not running right now.
Then Admin has a reconciliation loop that looks at all Planned
workflows to ensure they meet the constraint before launching it.
Hey @kumare3 thanks for reply! I think we will be able to collaborate on this, needs some coordination with the team for planning.
Incase concurrency limits are specified, the execution is created in Planned state and has a description field that explains why it is not running right now. Then Admin has a reconciliation loop that looks at all Planned workflows to ensure they meet the constraint before launching it.
This sounds like a great starter solution!
Wondering if this is fixed or if someone is looking into this @a4501150 ? If not, we can contribute this.
I have not seen any contributions. @a4501150 / @RRap0so i think Spotify wants to work on this?
hey sorry for late reply, @sshardool @kumare3 I don't think Spotify has capacity to work on this at the moment. It's great if @sshardool can help on this.
Cc @andresgomezfrr @RRap0so to double confirm on this so work didn't get duplicated.
Thanks folks. We are evaluating this and should be to actively engage on this in the next couple of weeks.
@sshardool do you know if this is in progress? A few other users are hitting the same limitation. Thanks!
Hello @davidmirror-ops - unfortunately we have not been able to take this up yet and will likely take us some time to get to it. If some else is willing to contribute I would be happy to review, otherwise I will update here once we start work on this.
This is something we are thinking of working on with the LinkedIn team @sshardool
Motivation: Why do you think this is important?
Currently flyte does not support workflow level throttling that limits the
same
concurrent workflows that can be running under a flyte project / domain. (same workflow means the same binary identified by workflow name).In our use case, we use flyte both for workflow scheduling and orchestration. One of our workflow will hit our backend API heavily and we want to limit the resource pressure on our backend API. Although we can implement throttling on backend API but backend API will also use in synchronized manner and we don't want to affect the current synchronized API performance and behavior, thus implement throttling on workflow side is the best choice.
Compare to other workflow engines, this is the major feature that lacks in flyte.
Goal: What should the final outcome look like, ideally?
For instance, when we trigger a workflow by creating a
CreateExecution
with a launch plan (preferred) OR when we register a workflow, we should be able to configure an extra parameter, eg.max_concurrent_execution
as type long which can limit the concurrent workflows that running at the same time.For example, let's say we have a
workflow A
, with aTask A
. We created a launch plan for this workflow withmax_concurrent_execution
as 5.CreateExecution
requests with same launch plan to flyte grpc APIQUEUED
max_concurrent_execution
workflows in status ofRUNNING
at the same timeDescribe alternatives you've considered
We are now using a hacky way by creating a
resource lock
task to achieve the functionalities:for loop
andThread.sleep()
inside the for loop along with querying the flyte endpoint to get the topmax_concurrent_execution
running executions sorted by execution created time.max_concurrent_execution
, and the execution id obtained by the the task via ENVFLYTE_INTERNAL_EXECUTION_ID
is not one of the returned top X executions, then the task will just wait there in the for loop withthread.sleep()
This effectively limit the concurrent executions of a given launchPlan (workflow) by have the throttling task as the first task in the given workflow DAG.
However, this will create many semi-dangling
resource lock
tasks if the amount of concurrent execution requests is large andmax_concurrent_execution
is relatively small.Another alternatives solution
Task level throttling can further enhance the fine control level, however this maybe harder to implement than workflow level throttling.
Propose: Link/Inline OR Additional context
No response
Are you sure this issue hasn't been raised already?
Have you read the Code of Conduct?