nclaeys opened this issue 1 month ago
I was able to resolve the issue I reported in #40516 by removing the problematic piece of code in kubernetes_executor_utils.py, as suggested. Maybe that case can be used as an example of why this functionality should be changed?
Hi,
We run multiple Airflow instances and we observed that the scheduler thinks tasks are still running and fails to schedule anything else, i.e. it gets stuck, whenever GKE cluster node upgrades are happening. We suspect that when a task pod running on a node is deleted/drained by the cluster (during the node upgrade process), the scheduler goes into this state. The only fix was to restart the scheduler, after which it started working again as expected.
We have weekly GKE upgrades scheduled and we notice this issue in our Airflow instances exactly during the upgrade window.
I think, as @nclaeys suggested and @vlieven tested, we should remove the block that wrongly assumes that pod termination can only be issued by the Airflow scheduler and not by the cluster.
Apache Airflow version
2.9.3
If "Other Airflow 2 version" selected, which one?
No response
What happened?
We had several sensors that failed to be rescheduled by the scheduler because it still thought that the worker tasks were running.
The root cause was that the scheduler missed an update event from the worker task: the Kubernetes node the Airflow worker pod was running on got deleted soon after the worker finished successfully. This breaks the assumption in the code that a deletion of the worker pod is only ever issued by Airflow itself. The problematic code is in `kubernetes_executor_utils.py`. In the logs we only see "skipping event" messages for the worker pods, instead of a first event that was actually processed. The comment on the `if` check says the following, which is not necessarily true:
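For reference, the block in question looks roughly like this (a paraphrase of `KubernetesJobWatcher.process_status` in the 8.x cncf-kubernetes provider, reconstructed from memory; the exact wording may differ between versions):

```python
elif status == "Succeeded":
    # We get multiple events once the pod hits a terminal state, and we only
    # want to send it along to the scheduler once.
    # If our event type is DELETED, or the pod has a deletion timestamp, we've
    # already seen the initial Succeeded event and sent it to the scheduler.
    if event["type"] == "DELETED" or pod.metadata.deletion_timestamp:
        self.log.info(
            "Skipping event for Succeeded pod %s - event for this pod already sent to executor",
            pod_name,
        )
        return
```

The comment assumes the initial Succeeded event was always observed and forwarded before any DELETED event arrives. When GKE drains the node shortly after the pod succeeds, the very first event the watcher sees can already be of type DELETED (or carry a deletion timestamp), so the success is never forwarded to the scheduler.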
Our sensor failed, so it needed to be rescheduled and the task was re-queued. From then on the scheduler never scheduled the task, because it thought an instance was still running, and it logged the following:
{base_executor.py:284} INFO - queued but still running;
The only way to fix it was to restart the scheduler, as that brought the scheduler's internal state back in sync with Kubernetes.
What you think should happen instead?
The fundamental problem is that the watcher for events on Kubernetes pods skipped the Succeeded event of the worker.
In order to make sure we process all events there are two options. The one suggested and tested in the comments above is to remove the faulty check; see the sketch below.
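Removing the faulty check could look roughly like the sketch below. This is an illustrative sketch only, not the exact provider patch; it assumes the provider's `POD_EXECUTOR_DONE_KEY` label, which the executor patches onto pods whose completion it has already processed, as the only reliable "already handled" signal:

```python
elif status == "Succeeded":
    # Only skip if Airflow itself already marked this pod as handled.
    # A DELETED event or a deletion timestamp can also originate from the
    # cluster (e.g. a GKE node drain), so neither is a safe signal that the
    # Succeeded event was already forwarded to the scheduler.
    if POD_EXECUTOR_DONE_KEY in pod.metadata.labels:
        self.log.info("Skipping event for Succeeded pod %s - already handled", pod_name)
        return
    self.log.info("Event: %s Succeeded, annotations: %s", pod_name, annotations_string)
    self.watcher_queue.put((pod_name, namespace, None, annotations, resource_version))
```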
How to reproduce
The issue is difficult to reproduce reliably; we notice it in our large production environment from time to time. It is, however, easy to see that the code is wrong in certain edge cases, as illustrated below.
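As a minimal, self-contained illustration of the edge case (a hypothetical standalone sketch, not provider code; `should_skip` and its parameters are made-up names):

```python
# Hypothetical sketch of the guard condition in the watcher: "DELETED or a
# deletion timestamp" is taken to mean "we already forwarded the Succeeded
# event to the scheduler", which only holds when Airflow itself deletes pods.
def should_skip(event_type: str, deletion_timestamp: str | None) -> bool:
    return event_type == "DELETED" or deletion_timestamp is not None

# Normal case: the watcher first sees a plain Succeeded event and forwards it;
# the later DELETED event (issued by Airflow's own cleanup) is rightly skipped.
assert should_skip("MODIFIED", None) is False
assert should_skip("DELETED", "2024-08-05T10:00:00Z") is True

# Edge case: the pod succeeds, but before the watcher observes that, GKE
# drains the node. The first and only event the watcher sees already has a
# deletion timestamp, so the success is silently dropped and the scheduler
# keeps believing the task is still running.
assert should_skip("MODIFIED", "2024-08-05T10:00:00Z") is True  # skipped!
```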
Operating System
kubernetes: apache/airflow:slim-2.9.3-python3.11
Versions of Apache Airflow Providers
apache-airflow-providers-cncf-kubernetes==8.0.1
apache-airflow-providers-common-io==1.3.2
apache-airflow-providers-common-sql==1.14.2
apache-airflow-providers-fab==1.2.2
apache-airflow-providers-ftp==3.10.0
apache-airflow-providers-http==4.12.0
apache-airflow-providers-imap==3.6.1
apache-airflow-providers-opsgenie==4.0.0
apache-airflow-providers-postgres==5.11.2
apache-airflow-providers-slack==7.3.2
apache-airflow-providers-smtp==1.7.1
apache-airflow-providers-sqlite==3.8.1
Deployment
Other 3rd-party Helm chart
Deployment details
Kubernetes deployment
Anything else?
/
Are you willing to submit PR?
Code of Conduct