hbc-acai opened this issue 5 months ago
Thanks for opening your first issue here! Be sure to follow the issue template! If you are willing to raise PR to address this issue please do so, no need to wait for approval.
I'm having the same issue. Airflow tasks get stuck after a few days of running. I handled it by restarting the scheduler pods.
Could you please provide more details for reproducing the issue (e.g., Airflow settings, which operators the tasks use, operator versions, etc.)? More details are needed so that someone can reproduce and fix the issue.
I upgraded from 2.7.3 directly to 2.9.1 and this issue showed up. We did not catch it during testing because it only happens after the scheduler has been running for some time. Today I tested it further in our dev cluster by running more tasks, and the same issue showed up.
We use the KubernetesExecutor, and mostly the PythonOperator and PythonVirtualenvOperator.
Metadata database: PostgreSQL 15.6
DAGs and logs persistent volumes: Azure Blob Storage
Single scheduler
Assuming you haven't installed a different version of the k8s provider, you're running on 8.1.1.
@hbc-acai @bangnh1 do either of you see pods stacking up and not being deleted in k8s? If you are exporting metrics to statsd/prometheus, what do you see for executor.open_slots and executor.running_tasks during that time period?
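If it helps, here is a rough way to count the leftover completed pods. This is just a sketch using the official kubernetes Python client, and it assumes the workers run in the `airflow` namespace:

```python
# Sketch only: count worker pods that have completed but were never deleted.
from kubernetes import client, config

config.load_kube_config()  # use config.load_incluster_config() if run inside a pod

v1 = client.CoreV1Api()
pods = v1.list_namespaced_pod(namespace="airflow", field_selector="status.phase=Succeeded")

print(f"{len(pods.items)} completed pods still present")
for pod in pods.items:
    print(pod.metadata.name, pod.status.phase)
```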
@jedcunningham Yes, I do see more than 10 pods showing READY=0/1 and STATUS=Completed when I run kubectl get pods -n airflow.
Sorry, how can I export metrics to statsd/prometheus?
Also, what is the correct way to check the k8s provider version? It is strange that I do not see a k8s provider in the package list for my scheduler:
airflow@airflow-scheduler-79b6db7fd8-x77mk:/opt/airflow$ pip freeze
adal==1.2.7
adlfs==2024.4.1
aiobotocore==2.12.3
aiofiles==23.2.1
aiohttp==3.9.5
aioitertools==0.11.0
aioodbc==0.4.1
aiosignal==1.3.1
alembic==1.13.1
amqp==5.2.0
annotated-types==0.6.0
anyio==4.3.0
apache-airflow==2.9.1
apache-airflow-providers-amazon==8.20.0
apache-airflow-providers-celery==3.6.2
apache-airflow-providers-cncf-kubernetes==8.1.1
apache-airflow-providers-common-io==1.3.1
apache-airflow-providers-common-sql==1.12.0
apache-airflow-providers-docker==3.10.0
apache-airflow-providers-elasticsearch==5.3.4
apache-airflow-providers-fab==1.0.4
apache-airflow-providers-ftp==3.8.0
apache-airflow-providers-google==10.17.0
apache-airflow-providers-grpc==3.4.1
apache-airflow-providers-hashicorp==3.6.4
apache-airflow-providers-http==4.10.1
apache-airflow-providers-imap==3.5.0
apache-airflow-providers-microsoft-azure==10.0.0
apache-airflow-providers-mysql==5.5.4
apache-airflow-providers-odbc==4.5.0
apache-airflow-providers-openlineage==1.7.0
apache-airflow-providers-postgres==5.10.2
apache-airflow-providers-redis==3.6.1
apache-airflow-providers-sendgrid==3.4.0
apache-airflow-providers-sftp==4.9.1
apache-airflow-providers-slack==8.6.2
apache-airflow-providers-smtp==1.6.1
apache-airflow-providers-snowflake==5.4.0
apache-airflow-providers-sqlite==3.7.1
apache-airflow-providers-ssh==3.10.1
apispec==6.6.1
It's this one, apache-airflow-providers-cncf-kubernetes==8.1.1. Sorry, I tend to just refer to it as the "k8s provider", but it's actually named "CNCF Kubernetes".
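If you'd rather not scan the whole pip freeze output, a quick check from inside the scheduler pod could look like this (plain stdlib, assuming Python 3.8+):

```python
# Print the installed version of the CNCF Kubernetes provider package.
from importlib.metadata import version

print(version("apache-airflow-providers-cncf-kubernetes"))
```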
For metrics info, see https://airflow.apache.org/docs/apache-airflow/stable/administration-and-deployment/logging-monitoring/metrics.html.
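As a quick sanity check (just a sketch, assuming the Airflow 2.x `[metrics]` section names), you can see what the scheduler thinks its StatsD settings are:

```python
# Rough check of whether StatsD metrics are enabled, run inside the scheduler pod.
from airflow.configuration import conf

print("statsd_on:", conf.getboolean("metrics", "statsd_on"))
print("statsd_host:", conf.get("metrics", "statsd_host"))
print("statsd_port:", conf.get("metrics", "statsd_port"))
print("statsd_prefix:", conf.get("metrics", "statsd_prefix"))
```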
I suspect you might be hitting #36998, and 8.3.0 from #39551 might fix it. Can you try upgrading to that version?
Thanks @jedcunningham. I upgraded to apache-airflow-providers-cncf-kubernetes==8.3.0 and have not seen the same error in the last few hours.
As no further action seems required, I'll close this issue. If you encounter this behavior again, please create another issue.
We have the same issue. I looked at executor.open_slots and saw that during a spike this metric shows almost 0 open slots. I wonder why that is, because we have 4 schedulers and AIRFLOW__CORE__PARALLELISM set to 128. Is there a different configuration for executor slots, and does "executor" here also refer to the scheduler? I also see executor.running_tasks go up to 512 (4 * 128), so as I understand it, the executor.open_slots metric isn't correct and only reflects one scheduler.
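For what it's worth, my understanding is that each scheduler runs its own executor instance and emits its own gauge, so the value you see can reflect a single busy scheduler rather than the whole fleet. A purely illustrative sketch (made-up numbers, not the actual executor source):

```python
# Illustrative only: each scheduler's executor reports open slots from its own view.
parallelism = 128                 # AIRFLOW__CORE__PARALLELISM applies per scheduler
running_on_this_scheduler = 127   # tasks this executor instance tracks as running

open_slots = parallelism - running_on_this_scheduler
print(open_slots)  # 1 -> the gauge looks "almost 0" even if other schedulers have free slots
```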
I think it must be some kind of race condition or missing error-condition handling. I wonder whether it is involved in cases where people manually kill tasks, for example from the UI? (This is a totally wild guess; I base it on the fact that it happens pretty often but not always, so there might be some unusual trigger.)
Airflow 2.10.3 is now out and it includes fix #42932, which is likely to fix the problems you reported. Please upgrade and let us know if the issue persists.
Apache Airflow version
2.9.1
If "Other Airflow 2 version" selected, which one?
No response
What happened?
After upgrading to 2.9.1, we find that tasks get stuck in the scheduled state about 1 hour after the scheduler starts. During the first hour, all tasks run fine. Then I restarted the scheduler, and it successfully moved the "stuck" task instances to the queued state and then ran them. But new tasks got stuck again after one hour.
This is reproducible in my production cluster. It happens every time after we restart our scheduler, but we are not able to replicate it in our dev cluster.
There are no errors in the scheduler log. Here are some logs from when things went wrong: I manually cleared one DAG with 3 tasks; 2 of the 3 tasks ran successfully, but the third task got stuck in the scheduled state. In the log below I only found information about the 2 tasks (database_setup, positions_extract) that ran successfully.
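A rough way to list the task instances sitting in the scheduled state (just a sketch using the Airflow ORM, assuming it is run from a Python shell inside the scheduler pod so the metadata DB connection is already configured):

```python
# Sketch only: list task instances stuck in "scheduled" via the Airflow ORM.
from airflow.models import TaskInstance
from airflow.utils.session import create_session
from airflow.utils.state import TaskInstanceState

with create_session() as session:
    stuck = (
        session.query(TaskInstance)
        .filter(TaskInstance.state == TaskInstanceState.SCHEDULED)
        .all()
    )
    for ti in stuck:
        print(ti.dag_id, ti.task_id, ti.run_id, ti.state)
```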
What you think should happen instead?
No response
How to reproduce
I can easily reproduce it in my production cluster, but I cannot reproduce it in our dev cluster. Both clusters have almost exactly the same setup.
Operating System
Azure Kubernetes Service
Versions of Apache Airflow Providers
No response
Deployment
Official Apache Airflow Helm Chart
Deployment details
Using apache-airflow:2.9.1-python3.10 image
Anything else?
No response
Are you willing to submit PR?
Code of Conduct