Open Tom-Newton opened 10 months ago
🤖
Hello @Tom-Newton,
I'm Dosu, a friendly bot here to help while we wait for a human maintainer to join us. I'm designed to assist with bugs, answer questions, and guide you in becoming a contributor if you're interested.
I'm taking a good look at the issue you've raised and will provide a more comprehensive answer in a short while. Stay tuned!
@hamersaw looked at my initial PRs and agreed that they were on a sensible track. I will update them to put these changes behind feature flags and add some tests.
Motivation: Why do you think this is important?
I think the bigger workflows flyte can support the better. As I understand limitation currently is usually the workflow state growing too big to be stored in etcd. The limit is usually about 2MiB. This is certainly the limit I was running into when trying to run workflows with about 50,000 nodes and max parallelism of 500.
There are ways to workaround this by wrapping stuff in sub-launch plans. When stuff is in a sub launchplan it is considered separately and the state is stored in a separate fltyeworkflow custom resource.
Goal: What should the final outcome look like, ideally?
Increase the workflow size achievable in a single launchplan by adjusting how workflows are collapsed to avoid storing unnecessary data.
Describe alternatives you've considered
There was some discussion on slack https://flyte-org.slack.com/archives/CP2HDHKE1/p1699902595630389?thread_ts=1699469696.078829&cid=CP2HDHKE1
useOffloadedWorkflowClosure
- this avoids writing thespec
field to etcd but thestatus
field is also really big.None of these ideas are mutually exclusive. Implementing all of them would enable the biggest workflows.
Propose: Link/Inline OR Additional context
Add 2 new features around collapsing of workflow state. These should have feature flags to control them.
failure_policy=WorkflowFailurePolicy.FAIL_AFTER_EXECUTABLE_NODES_COMPLETE
only keep the error that we want to return as the error of the whole execution. Other errors can be cleared to avoid storing lots of big tracebacks in etcd.I have been using a version of flytepropeller with these 2 changes in prod for a few weeks. It seems to work quite nicely and it gains us some room on the etcd size limit.
Are you sure this issue hasn't been raised already?
Have you read the Code of Conduct?