kumare3 opened 3 years ago
cc @EngHabu / @cosmicBboy / @kanterov
As old as this issue may be, we would absolutely LOVE this and it has been a stable feature on other orchestration engines (such as Kubeflow) for many years.
@kumare3 is this a flytekit-only change, or would it require changes to propeller to propagate error state?
@dylanwilder the changes in the backend are mostly already done... it's possible they have regressed because of the lack of end-to-end testing for it (because it's not implemented in flytekit)... would you be able to help with the flytekit side of things?
Potentially we could pitch in, since we'd like to see this. Do you have anything outlining what's required?
@dylanwilder I think this is really close. Maybe attempt to use it in an example and start the debugging journey from there? Happy to be pulled in once you get flytekit to produce the spec in case you deem it a problem with the backend...
Thanks will take a look and see!
@eapolinario is probably also looking into this
Wondering if there is a way to support inputs and outputs in the failure handler that are different from the workflow interface. Below is an example use case we were trying to implement:
Was just brainstorming with @pingsutw now on this... here are my thoughts on UX:
```python
@workflow
def my_wf(a: int) -> str:
    b = my_task(a=a)
    flytekit.current_context().on_failure = clean_up(a=a, b=b)
    return b

@task
def clean_up(err: Error, a: Optional[int], b: Optional[str]) -> str:
    ...
```

or is

```python
return b or clean_up(a=a, b=b)
```

as an alternative syntax too hacky?

`clean_up` must take `Error` as the first parameter and can take any number of extra inputs, as long as they are all `Optional`. Propeller will fill them in if they are available, or `None` otherwise... it's the implementor's job to handle those cases correctly within `clean_up`. wdyt @kumare3 @eapolinario @gitgraghu
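To make that `Optional` contract concrete, here is a plain-Python sketch of how an implementor might guard against inputs Propeller could not fill in. This is a simulation, not real flytekit code: the `Error` class and the resource names here are stand-ins.

```python
from typing import Optional

# Hypothetical stand-in for flytekit's error type; not a real flytekit class.
class Error:
    def __init__(self, message: str):
        self.message = message

def clean_up(err: Error, a: Optional[int] = None, b: Optional[str] = None) -> str:
    # Propeller would pass None for any input not yet produced when the
    # failure occurred, so every extra input must be checked before use.
    released = []
    if a is not None:
        released.append(f"input-resource-{a}")
    if b is not None:
        released.append(f"output-resource-{b}")
    return f"cleaned {len(released)} resource(s) after: {err.message}"
```

For example, if the workflow failed before `b` was ever produced, the handler still runs and simply skips the `b`-related cleanup.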
Need some discussion about
PRs for failure node. (still WIP)
also cc @cosmicBboy @wild-endeavor
Is this okay to close?
**Motivation: Why do you think this is important?**
The Flyte backend supports a failure node for every workflow / sub-workflow. This is not currently exposed in flytekit (Python or Java).
**Goal: What should the final outcome look like, ideally?**
Users should be able to define failure nodes for their workflows. An example for the Python SDK is as follows:
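The inline example did not survive in this copy of the issue, so here is a rough plain-Python sketch of the semantics being asked for. This is not flytekit code: the `workflow` decorator, its `on_failure` parameter, and `my_error_handler` are assumptions used for illustration.

```python
# Plain-Python simulation of the proposed UX; not actual flytekit APIs.
def workflow(on_failure=None):
    def deco(fn):
        def wrapper(*args, **kwargs):
            try:
                return fn(*args, **kwargs)
            except Exception as err:
                if on_failure is not None:
                    on_failure(err)  # handler receives error context
                raise                # the execution is still marked as failed
        return wrapper
    return deco

handled = []

def my_error_handler(err: Exception) -> None:
    # Clean up resources, log, send customized notifications, etc.
    handled.append(str(err))

@workflow(on_failure=my_error_handler)
def my_wf(a: int) -> str:
    raise RuntimeError(f"task failed for a={a}")
```

Note the `raise` after the handler runs: the handler gets to clean up, but it cannot turn the failure into a success, matching the behavior described below.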
If my_wf() fails at any point during execution, it will call the my_error_handler() task and pass it some context (error info, etc.) to allow it to handle the error. The expectation is that my_error_handler() would do things like clean up resources and log or send customized notifications. The one thing it will NOT let you do is recover from failure: the execution of this workflow will still fail, be marked as failed, and upstream callers will still be notified of its failure.
An example of sub-workflows:
In this case, my_parent_wf will continue running even if any of its nodes fails. The overall status of the execution will again be marked as failed, but it will let as many nodes as possible execute. Whenever my_sub_wf fails, it will invoke an instance of the my_error_handler task to clean up resources, etc.
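The sub-workflow example is likewise missing from this copy, but the described run-to-completion semantics can be sketched in plain Python. This is a simulation, not flytekit: `run_parent`, the node names, and the status strings are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

def run_parent(
    nodes: List[Tuple[str, Callable[[], str]]],
    error_handler: Callable[[str, Exception], None],
) -> Tuple[str, Dict[str, str]]:
    """Run every node even if some fail; the overall status is still FAILED."""
    results: Dict[str, str] = {}
    failed = False
    for name, fn in nodes:
        try:
            results[name] = fn()
        except Exception as err:
            failed = True
            # e.g. my_error_handler cleaning up after a failed my_sub_wf
            error_handler(name, err)
    return ("FAILED" if failed else "SUCCEEDED"), results
```

The key point is that a failing sub-workflow triggers its handler without aborting the siblings, yet the parent's final status still reflects the failure.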
**Describe alternatives you've considered**
N/A
**[Optional] Propose: Link/Inline OR Additional context**
More discussion in https://github.com/flyteorg/flyte/issues/1012
Related flytekit java issue - #1012