yeachan153 opened this issue 2 years ago
Do you have a proposal to change the behaviour? Opening a PR for that would be useful. Airflow has ~2000 contributors, so you can become one of them. How do you think it can be improved?
@yeachan153 Did you ever solve this problem? We would love to be able to keep pods running during environment restarts, and it looks like your idea might work.
@wircho Increasing the `termination_grace_period` should help to mitigate this issue.
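(For anyone arriving here later, a minimal sketch of how those operator parameters can be set. The import path and the availability of `termination_grace_period` depend on the cncf.kubernetes provider version, and the image/command values are just placeholders.)

```python
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import KubernetesPodOperator

run_job = KubernetesPodOperator(
    task_id="run_job",
    name="run-job",
    namespace="default",
    image="python:3.9",
    cmds=["python", "-c", "print('hello')"],
    # Reattach to a still-running pod after a scheduler restart instead of
    # launching a new one (the behaviour this issue asks to extend to worker
    # restarts as well).
    reattach_on_restart=True,
    # Grace period in seconds forwarded to the pod deletion call, which is the
    # mitigation suggested above.
    termination_grace_period=600,
)
```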
Did you find a solution to this issue using KubernetesPodOperator parameters? We tried `termination_grace_period`. Curious if anyone has any other solutions for workers reconnecting to the running pod?
Description
The `kubernetes_pod_operator` currently has a `reattach_on_restart` parameter that attempts to reattach to running pods instead of creating a new pod when the scheduler dies while a task is running. We would like this feature to also work when the worker dies. Currently, a dying worker receives a SIGTERM and triggers the `on_kill` method: https://github.com/apache/airflow/blob/ace8c6e942ff5554639801468b971915b7c0e9b9/airflow/models/taskinstance.py#L1425

This ends up deleting the pod that was created: https://github.com/apache/airflow/blob/ace8c6e942ff5554639801468b971915b7c0e9b9/airflow/providers/cncf/kubernetes/operators/kubernetes_pod.py#L438
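For context, the chain referenced above looks roughly like this (a condensed paraphrase of the linked code, not an exact copy; the details vary between Airflow and provider versions):

```python
# Paraphrased from airflow/models/taskinstance.py: the worker installs a SIGTERM
# handler that calls the task's on_kill() before failing the task.
def signal_handler(signum, frame):
    task_copy.on_kill()
    raise AirflowException("Task received SIGTERM signal")

# Paraphrased from kubernetes_pod.py: on_kill() deletes the pod that the
# operator launched, so a worker restart currently takes the running pod down
# with it. (Newer provider versions also forward termination_grace_period as
# grace_period_seconds here.)
def on_kill(self):
    if self.pod:
        self.client.delete_namespaced_pod(
            name=self.pod.metadata.name,
            namespace=self.pod.metadata.namespace,
        )
```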
We currently work around this problem by removing the `on_kill` call upon receiving a SIGTERM and pushing an XCom indicating that the worker was killed. We then enabled retries for the `kubernetes_pod_operator` and modified the `is_eligible_to_retry` function to check for the presence of this XCom, retrying only when it is found, i.e. only when the worker was killed.

Unfortunately, this is not a perfect solution, because clearing or stopping a task via the UI triggers the same signal handler as an external worker kill. With this workaround, stopping a task via the UI no longer kills the pod, and clearing a task via the UI causes a reattach when we would ideally like a fresh start.
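For concreteness, here is a rough sketch of that workaround (with the caveats above). The XCom key name and the monkey-patching style are illustrative only, not Airflow API, and the real SIGTERM handler is defined inside `TaskInstance._run_raw_task`, so in practice this means patching Airflow itself:

```python
from airflow.exceptions import AirflowException
from airflow.models.taskinstance import TaskInstance

WORKER_KILLED_XCOM_KEY = "worker_killed"  # hypothetical marker key

def make_sigterm_handler(task_instance: TaskInstance):
    """Replacement for the handler normally defined in TaskInstance._run_raw_task."""
    def handler(signum, frame):
        # Skip task.on_kill() so the KubernetesPodOperator pod keeps running,
        # record that the worker was killed externally, then fail the task as
        # Airflow normally would on SIGTERM.
        task_instance.xcom_push(key=WORKER_KILLED_XCOM_KEY, value=True)
        raise AirflowException("Task received SIGTERM signal")
    return handler

_original_is_eligible_to_retry = TaskInstance.is_eligible_to_retry

def _is_eligible_to_retry(self: TaskInstance):
    # Retry only when the "worker killed" marker is present, so genuine task
    # failures are not retried merely because retries had to be enabled.
    worker_killed = self.xcom_pull(task_ids=self.task_id, key=WORKER_KILLED_XCOM_KEY)
    return bool(worker_killed) and _original_is_eligible_to_retry(self)

TaskInstance.is_eligible_to_retry = _is_eligible_to_retry
```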
Use case/motivation
Since the pod itself may fail for a valid reason, we don't want to simply add more retries. In that situation a retry would also not reattach; it would start a completely new pod, since the original pod would have been cleaned up.
We specifically want reattaching to happen when the worker dies for infrastructure-related reasons. This is useful, for instance, during deployment updates in Kubernetes. That is currently quite a disruptive process: all the running pods are killed first, and if retries are not enabled (for the reasons mentioned above), we have to restart all of them again, potentially losing all the progress of any expensive operations that were running before the deployment.
Related issues
No response