apache / airflow

Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
https://airflow.apache.org/
Apache License 2.0

On Airflow with Kubernetes Executor, sometimes the task keeps running even though the executor pod has failed, and later it is cleaned up by the scheduler as a daemon task #40888

Closed. shakeelansari63 closed this issue 1 month ago

shakeelansari63 commented 1 month ago

Apache Airflow version

Other Airflow 2 version (please specify below)

If "Other Airflow 2 version" selected, which one?

2.9.2

What happened?

Hello,
After upgrading to Airflow 2.9.2 we have started seeing a strange issue with our Airflow tasks. Sometimes when a task fails, the executor pod is deleted, but somehow Airflow does not get notified of the failure, so the task keeps running.
After some time, the scheduler finds it as a daemon task and kills it. I don't know whether this is an Airflow issue or a Kubernetes provider issue, but there seems to be some drop in communication between the two.
This has been occurring very frequently since the 2.9.2 upgrade.

In some rare instances, I have also seen this behavior after a task completes successfully. Although the task finishes successfully, Airflow is not notified of the task state, so the task is later killed by the scheduler and its status changes to failed.

What you think should happen instead?

When the task fails and the pod is killed, Airflow should be notified appropriately, and the task should reflect the correct status.

How to reproduce

I am not sure how to reproduce it. It does not happen every time; it affects only some jobs, and on re-run the job completes correctly.

Operating System

Ubuntu Linux (on AKS)

Versions of Apache Airflow Providers

apache-airflow==2.9.2
apache-airflow-providers-cncf-kubernetes==8.3.0
apache-airflow-providers-celery==3.7.1
apache-airflow-providers-ssh==3.11.1
apache-airflow-providers-ftp==3.9.1
apache-airflow-providers-sftp==4.9.0
apache-airflow-providers-common-sql==1.14.0
apache-airflow-providers-redis==3.6.0
apache-airflow-providers-jdbc==4.2.2
apache-airflow-providers-odbc==4.4.1
apache-airflow-providers-apache-spark==4.8.1

Deployment

Official Apache Airflow Helm Chart

Deployment details

We have deployed it on AKS.

Anything else?

No response

Are you willing to submit PR?

Code of Conduct

ahipp13 commented 1 month ago

I am also seeing this issue, although in my case it has mostly been completed tasks that for some reason don't kill the pod and still show as "Running" in Kubernetes.

I have also seen this on two Airflow versions (2.8.3 and 2.9.2). We are running on Kubernetes version 1.29. Could this be the cause? If anybody has solved this or has debugging tips, that would be greatly appreciated!

potiuk commented 1 month ago

Likely database connectivity stability or possibly timeouts. It might be that the task fails to update its status. Another option (and you can try it) is the "schedule_after_task_execution" configuration - you can set it to false and see if it helps. You can also take a look at your database locks for suspicious things (maybe your database runs out of resources, or gets restarted from time to time, or there is a temporary connectivity issue between the pods and the DB).
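For reference, a minimal sketch of that setting - the config key lives in the [scheduler] section, and the environment-variable form is what you would typically inject through the Helm chart's env values (exact wiring depends on your deployment):

    # airflow.cfg: skip the "mini scheduler" run that the task process performs after it finishes
    [scheduler]
    schedule_after_task_execution = False

    # equivalent environment variable on the scheduler/worker pods
    AIRFLOW__SCHEDULER__SCHEDULE_AFTER_TASK_EXECUTION=False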

Converting it to a discussion as it is not a concrete "issue", but the best thing for you to do is to try to find some correlated events in other logs when it happens.
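If you want to check the metadata database for the lock issues mentioned above, here is a minimal sketch, assuming a PostgreSQL metadata DB (adapt for MySQL):

    -- list lock requests that are currently waiting, with the session's state and query
    SELECT a.pid, l.locktype, l.mode, l.granted, a.state, a.query
    FROM pg_locks l
    JOIN pg_stat_activity a ON a.pid = l.pid
    WHERE NOT l.granted;

A build-up of ungranted locks or long-idle transactions around the time a task "hangs" would point at the database rather than the Kubernetes executor.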