flyteorg / flyte

Scalable and flexible workflow orchestration platform that seamlessly unifies data, ML and analytics stacks.
https://flyte.org
Apache License 2.0

[Core feature] Reduce unnecessary workflow state stored on etcd to avoid hitting etcd size limits #4569

Open Tom-Newton opened 10 months ago

Tom-Newton commented 10 months ago

Motivation: Why do you think this is important?

I think the bigger the workflows Flyte can support, the better. As I understand it, the current limitation is usually the workflow state growing too big to be stored in etcd. The limit is usually about 2MiB. This is certainly the limit I was running into when trying to run workflows with about 50,000 nodes and a max parallelism of 500.
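A back-of-envelope calculation suggests why a 50,000-node workflow hits this limit; the per-node byte figure below is a made-up illustrative number, not a measurement of the actual FlyteWorkflow status encoding:

```python
# Illustrative arithmetic only: why ~50,000 nodes can blow past etcd's
# default object size limit. The per-node figure is hypothetical.
ETCD_LIMIT_BYTES = 2 * 1024 * 1024  # ~2MiB default etcd request size limit

node_count = 50_000
# Hypothetical average serialized size of one node's status entry
# (phase, timestamps, attempt counts, error message, ...).
avg_node_status_bytes = 200

status_bytes = node_count * avg_node_status_bytes
print(status_bytes)                      # 10000000 -> ~10MB
print(status_bytes > ETCD_LIMIT_BYTES)   # True: status alone exceeds the limit
```

Even with a much smaller per-node status, the total scales linearly with node count, so any fixed etcd limit caps the achievable fanout.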

There are ways to work around this by wrapping stuff in sub-launch plans. When stuff is in a sub-launch plan it is considered separately and its state is stored in a separate flyteworkflow custom resource.
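To make the workaround concrete, here is a sketch (with the same hypothetical per-node byte figure as above) of how splitting one large fanout across sub-launch plans keeps each flyteworkflow custom resource under the limit:

```python
# Sketch of the sub-launch-plan workaround: split one 50,000-node fanout
# into chunks, each backed by its own flyteworkflow custom resource.
# All byte figures are hypothetical illustrative numbers.
ETCD_LIMIT_BYTES = 2 * 1024 * 1024
avg_node_status_bytes = 200

def max_nodes_per_launchplan(limit=ETCD_LIMIT_BYTES, per_node=avg_node_status_bytes):
    # Leave headroom for the rest of the custom resource (spec, metadata, ...).
    headroom = 0.5
    return int(limit * headroom) // per_node

chunk = max_nodes_per_launchplan()
total_nodes = 50_000
num_launchplans = -(-total_nodes // chunk)  # ceiling division
print(chunk, num_launchplans)
```

The cost is exactly the sub-optimal user experience mentioned below: the workflow author has to restructure their code around an etcd implementation detail.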

Goal: What should the final outcome look like, ideally?

Increase the workflow size achievable in a single launchplan by adjusting how workflows are collapsed to avoid storing unnecessary data.

Describe alternatives you've considered

There was some discussion on slack https://flyte-org.slack.com/archives/CP2HDHKE1/p1699902595630389?thread_ts=1699469696.078829&cid=CP2HDHKE1

  1. Enable launching other Flyte entities from map tasks - I don't know how this would work; it's probably more complicated.
  2. Add more layers to the fanout and put launch plan layers between them - sub-optimal user experience.
  3. Use a more compressed format e.g. protobuf to store the workflow state - I still quite like the idea but it's more complicated.
  4. useOffloadedWorkflowClosure - this avoids writing the spec field to etcd, but the status field is also really big.
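To illustrate the intuition behind idea 3, the sketch below uses gzip over JSON as a stand-in for a denser binary encoding like protobuf; the status structure is invented and the ratios are only indicative:

```python
# Illustration of idea 3: storing workflow state in a denser encoding.
# gzip over JSON stands in for a binary format like protobuf; the node
# status entries here are made up, not the real FlyteWorkflow schema.
import gzip
import json

# A fabricated node-status entry, repeated to mimic a large fanout.
status = {
    f"n{i}": {"phase": "SUCCEEDED", "attempts": 1,
              "started_at": "2023-12-01T00:00:00Z",
              "stopped_at": "2023-12-01T00:05:00Z"}
    for i in range(1000)
}
raw = json.dumps(status).encode()
packed = gzip.compress(raw)
print(len(raw), len(packed))
assert len(packed) < len(raw)  # highly repetitive state compresses well
```

Compression buys a constant factor, though; it raises the ceiling but does not remove the linear growth of status with node count.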

None of these ideas are mutually exclusive. Implementing all of them would enable the biggest workflows.

Propose: Link/Inline OR Additional context

Add 2 new features around collapsing of workflow state, each behind a feature flag.

  1. Allow sub nodes to be collapsed irrespective of their terminal state.
  2. When using failure_policy=WorkflowFailurePolicy.FAIL_AFTER_EXECUTABLE_NODES_COMPLETE, only keep the error that we want to return as the error of the whole execution. Other errors can be cleared to avoid storing lots of big tracebacks in etcd.
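Proposal 2 can be sketched as follows; the function and field names are hypothetical and do not reflect flytepropeller's actual data model, only the pruning idea:

```python
# Sketch of proposal 2 (hypothetical names, not flytepropeller's real
# structures): once the execution's reported error is chosen, clear the
# large error payloads (tracebacks) from all other failed nodes before
# the FlyteWorkflow custom resource is written back to etcd.
def prune_node_errors(node_statuses, reported_node_id):
    for node_id, status in node_statuses.items():
        if node_id != reported_node_id and "error" in status:
            # Keep the fact that the node failed; drop the big traceback.
            status["error"] = None
    return node_statuses

statuses = {
    "n0": {"phase": "FAILED", "error": "Traceback ... (10KB of text)"},
    "n1": {"phase": "FAILED", "error": "Traceback ... (12KB of text)"},
    "n2": {"phase": "SUCCEEDED"},
}
pruned = prune_node_errors(statuses, reported_node_id="n0")
print(pruned["n0"]["error"] is not None, pruned["n1"]["error"])  # True None
```

With many failed map-task sub-nodes, this turns O(failures) stored tracebacks into a single retained error.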

I have been using a version of flytepropeller with these 2 changes in prod for a few weeks. It seems to work quite nicely and it gains us some room on the etcd size limit.

Are you sure this issue hasn't been raised already?

Have you read the Code of Conduct?

dosubot[bot] commented 10 months ago

🤖

Hello @Tom-Newton,

I'm Dosu, a friendly bot here to help while we wait for a human maintainer to join us. I'm designed to assist with bugs, answer questions, and guide you in becoming a contributor if you're interested.

I'm taking a good look at the issue you've raised and will provide a more comprehensive answer in a short while. Stay tuned!

Tom-Newton commented 10 months ago

@hamersaw looked at my initial PRs and agreed that they were on a sensible track. I will update them to put these changes behind feature flags and add some tests.