kand617 opened 1 week ago
Thanks for opening your first issue here! Be sure to follow the issue template! If you are willing to raise PR to address this issue please do so, no need to wait for approval.
+1
In our current setup, we are using Celery workers as Airflow workers, and we have applied a memory limit on these workers as per our DevOps guidelines. When a Celery worker exceeds its memory limit, it encounters an Out-Of-Memory (OOM) error and restarts. This behavior leads to tasks that were in a running state becoming zombie tasks, which the Airflow scheduler detects.
However, we have observed that despite the scheduler detecting these tasks as zombie tasks, Airflow does not mark them as failed.
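For context, zombie handling is governed by two scheduler settings. A minimal sketch of inspecting them through Airflow's config accessor (the comments give the stock 2.x defaults, not our deployment's values):

```python
# Sketch: read the two [scheduler] settings that drive zombie detection.
from airflow.configuration import conf

# Seconds a running task may go without a heartbeat before it is
# declared a zombie (default 300).
threshold = conf.getfloat("scheduler", "scheduler_zombie_task_threshold")

# How often, in seconds, the scheduler scans for zombies (default 10).
interval = conf.getfloat("scheduler", "zombie_detection_interval")

print(f"zombie threshold={threshold}s, scan interval={interval}s")
```

In our case detection clearly happens, so the threshold is being crossed; the missing step is the detected zombie ever being marked failed.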
Apache Airflow version
Other Airflow 2 version (please specify below)
If "Other Airflow 2 version" selected, which one?
2.8.3
What happened?
We have upgraded to 2.8.3 and have been noticing a lot more zombie jobs. I have not upgraded to the latest version as I don't see anything relevant in the changelogs. An observation is that the worker pod that was working on the task is no longer present (likely scaled down).
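To see which running task instances have been orphaned, here is a rough sketch using Airflow's metadata-DB session helper (illustrative only); the recorded hostname can then be compared against the worker pods that still exist:

```python
# Sketch: list task instances still recorded as "running" along with the
# hostname of the worker that claimed them.
from airflow.models import TaskInstance
from airflow.utils.session import create_session
from airflow.utils.state import TaskInstanceState

with create_session() as session:
    running = (
        session.query(TaskInstance)
        .filter(TaskInstance.state == TaskInstanceState.RUNNING)
        .all()
    )
    for ti in running:
        # If ti.hostname no longer matches a live pod, the task is a
        # zombie candidate.
        print(ti.dag_id, ti.task_id, ti.run_id, ti.hostname)
```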
Sample screenshot from the UI.
After many moments it is still not stopped. [screenshot failed to upload]
Sample Pod
What you think should happen instead?
I would have expected the tasks to be picked up for retry. Even after waiting 4 hours it is not working. What's even stranger is that `default_task_execution_timeout` was set but not respected (see the sketch at the end of the "How to reproduce" section below for why an in-process timeout cannot fire once the worker pod is gone).
How to reproduce
I was unable to reproduce this; for that matter, I was unable to reproduce a zombie DAG at all. Ideally my attempt should have caused the task to time out as a zombie, but it ran without any issues.
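For reference, this is roughly how the retry/timeout expectation looks on a task; the DAG and task ids below are invented for illustration. Note that `execution_timeout` (and the `default_task_execution_timeout` fallback) is enforced by a timer inside the worker process itself, which would explain why it cannot fire once the worker pod is gone:

```python
# Sketch of a task with an explicit timeout and retries; ids are hypothetical.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="zombie_repro_example",  # hypothetical
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    BashOperator(
        task_id="long_running",
        bash_command="sleep 600",
        # Enforced in-process on the worker, so it cannot trigger if the
        # worker pod is OOM-killed or scaled away mid-task.
        execution_timeout=timedelta(minutes=5),
        # A task failed by zombie detection should be rescheduled while
        # retries remain; in our case this never happens.
        retries=2,
    )
```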
Operating System
Kubernetes (K8s)
Versions of Apache Airflow Providers
Deployment
Official Apache Airflow Helm Chart
Deployment details
Running on Kubernetes
Anything else?
The issue happens too often. We have about
Are you willing to submit PR?
Code of Conduct