flyteorg / flyte

Scalable and flexible workflow orchestration platform that seamlessly unifies data, ML and analytics stacks.
https://flyte.org
Apache License 2.0
5.47k stars 585 forks source link

[Core feature] Flytekit plugin termination hook #2810

Open hajapy opened 2 years ago

hajapy commented 2 years ago

Motivation: Why do you think this is important?

As a flytekit plugin author, I would like to be able to define cleanup actions (in python) that should be run when a flyte task or workflow is aborted (or terminates unexpectedly). For example, this could be used to let the plugin cancel long-running operations in external services if a user aborts a workflow execution.

Goal: What should the final outcome look like, ideally?

The flytekit PythonTask interface could expose a new method (analogous to execute) that would be called upon termination of the task via an external signal.

Describe alternatives you've considered

An effort was made to use the stdlib's signal module to register handlers from the custom plugin module, but this didn't ultimately work (the signal seemed to be intercepted by pyflyte-execute).

Propose: Link/Inline OR Additional context

This suggestion was discussed in a flyte slack thread.

Are you sure this issue hasn't been raised already?

Have you read the Code of Conduct?

dyu-bot commented 1 year ago

Hi there, wondering if there's been any progress on this one? (:

This is a feature we need currently for our use case. Cheers

kumare3 commented 1 year ago

There has been no progress on this one, but it is infact doable. Only thing is we need async support.

kumare3 commented 1 year ago

From the thread to preserve it

i have a custom task plugin which creates an AWS resource. what i want is, when i execute this task as part of workflow, and i terminate the workflow, the AWS resource should be deleted. i would assume that when a workflow is aborted a SIGTERM is sent to the pod running that task (lmk if thats not the case?) and i am trying to define a shutdown hook which will run some cleanup code.

i am not receiving a SIGTERM or a SIGINT in my code. how can i attach a shutdown hook that is run when the workflow is terminated

SIGTERM should be sent to the container: https://cloud.google.com/blog/products/containers-kubernetes/kubernetes-best-practices-terminating-with-grace

Some reasons for why you may not be catching it: https://stackoverflow.com/questions/72544506/sigterm-not-sent-on-pod-delete

It’s probably easier to just do a docker run with the same command that you can copy from the flyte Pod spec to try things locally… I would check the running processes in the container to see what’s process 1… etc. is it the shell or is it the python process

the shell command is PID 1. hmm, so the SIGTERM signal is sent to the the pyflyte_execute and its not forwarded to the python process.

kumare3 commented 1 year ago

Work is to be done to make pyflyte execute handle sigint and sigterm

dyu-bot commented 1 year ago

Thanks @kumare3 for the updates. I'm not sure where the voting booth is, but I'd like to cast a ➕1 for this. I imagine orphaned resources from early workflow termination is a common occurrence.

github-actions[bot] commented 1 year ago

Hello 👋, This issue has been inactive for over 9 months. To help maintain a clean and focused backlog, we'll be marking this issue as stale and will close the issue if we detect no activity in the next 7 days. Thank you for your contribution and understanding! 🙏

github-actions[bot] commented 1 year ago

Hello 👋, This issue has been inactive for over 9 months and hasn't received any updates since it was marked as stale. We'll be closing this issue for now, but if you believe this issue is still relevant, please feel free to reopen it. Thank you for your contribution and understanding! 🙏

eapolinario commented 9 months ago

This is a pretty cool idea that increases the utility of plugins. I'd like to mention that as all graceful shutdown mechanisms, this one is also best effort (so not guarantees that it will execute 100% of the time).

github-actions[bot] commented 1 week ago

Hello 👋, this issue has been inactive for over 9 months. To help maintain a clean and focused backlog, we'll be marking this issue as stale and will engage on it to decide if it is still applicable. Thank you for your contribution and understanding! 🙏