reattach_on_restart doesn't work properly as exepected

paramjeet01 commented 4 months ago

Apache Airflow version

main (development)

If "Other Airflow 2 version" selected, which one?

2.8.3

What happened?

Issue : When a worker pod is killed , it is expected not to kill the task pods when reattach_on_restart is configured as True.

Case: Our current configuration of airflow includes usage of aws EC2 spot instances so we can say that workers are expected to be killed occasionally when the EC2 instance has be interrupted. When the EC2 instance has been removed from kubernetes nodes , it sends the SIGTERM to the worker pod which invokes the below _execute_task_with_callbacks method and on-kill method is called which will kill the task pod. So , reattach_on_restart won't work as expected since the task pod is deleted when a worker pod is killed. https://github.com/apache/airflow/blob/2d53c1089f78d8d1416f51af60e1e0354781c661/airflow/models/taskinstance.py#L2592-L2613

What you think should happen instead?

The task pod should remain active when a worker pod is terminated if reattach_on_restart is set to True.

How to reproduce

Create a task that uses kuberenetes pod operator.
Set reattach_on_restart to True in the task.
Delete the worker pod which will result in deletion of worker pod and the task pod.

Operating System

Amazon Linux 2

Versions of Apache Airflow Providers

apache-airflow-providers-cncf-kubernetes==8.2.0

Deployment

Official Apache Airflow Helm Chart

Deployment details

No response

Anything else?

Suggestion : The on_kill method should send another parameter which states who called this method (Is it called by airflow UI clear task button or due to SIGTERM signal ) so based on it we can keep the pod or delete the pod. https://github.com/apache/airflow/blob/fe4605a10e26f1b8a180979ba5765d1cb7fb0111/airflow/providers/cncf/kubernetes/operators/pod.py#L989-L1003

If someone helps me on the taskinstance.py code to add the above recommendation , I'll be able to refactor the on_kill method.

Are you willing to submit PR?

[X] Yes I am willing to submit a PR!

Code of Conduct

[X] I agree to follow this project's Code of Conduct

RNHTTR commented 4 months ago

~If I understand correctly, when reattach_on_restart is set to true, the KubernetesPodOperator will use the same worker pod if the container within the pod dies instead of creating a new pod. In the case where the pod itself is killed (which I think would be the case when the node that the worker pod is running on is wiped out), I don't think reattach_on_restart applies -- there will have to be a new worker pod.~

I was conflating KPO and KubernetesExecutor in my head

paramjeet01 commented 4 months ago

When reattach_on_restart is set to true, it is expected that if a worker pod dies, a new worker pod will adopt the task pod. However, due to the on_kill method called by task_instance, if a worker pod is killed, the task pod will also be terminated.

shahar1 commented 2 months ago

@paramjeet01 Would you still like to work on this issue? Also, it would be helpful to state in the title that the issue is related to KPO.

apache / airflow