apache / airflow

Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
https://airflow.apache.org/
Apache License 2.0

Tasks Stuck at Scheduled State #40106

Open hbc-acai opened 5 months ago

hbc-acai commented 5 months ago

Apache Airflow version

2.9.1

If "Other Airflow 2 version" selected, which one?

No response

What happened?

After upgrading to 2.9.1, we find that tasks get stuck in the scheduled state about 1 hour after the scheduler starts. During the first hour, all tasks run fine. When I restarted the scheduler, it successfully moved the "stuck" task instances to the queued state and then ran them, but new tasks got stuck again after one hour.

This is reproducible in my production cluster. It happens every time we restart our scheduler, but we are not able to replicate it in our dev cluster.

There are no errors in the scheduler log. Here are some logs from when things went wrong. I manually cleared one DAG with 3 tasks. 2 of the 3 tasks ran successfully, but one task got stuck in the scheduled state. In the log below I only found information about the 2 tasks (database_setup, positions_extract) that ran successfully.

[2024-06-07T01:12:52.113+0000] {kubernetes_executor.py:240} INFO - Found 0 queued task instances
[2024-06-07T01:13:52.199+0000] {kubernetes_executor.py:240} INFO - Found 0 queued task instances
[2024-06-07T01:14:52.284+0000] {kubernetes_executor.py:240} INFO - Found 0 queued task instances
[2024-06-07T01:15:37.976+0000] {scheduler_job_runner.py:417} INFO - 2 tasks up for execution:
        <TaskInstance: update_risk_exposure_store.database_setup scheduled__2024-05-28T10:10:00+00:00 [scheduled]>
        <TaskInstance: update_risk_exposure_store.positions_extract scheduled__2024-05-28T10:10:00+00:00 [scheduled]>
[2024-06-07T01:15:37.976+0000] {scheduler_job_runner.py:480} INFO - DAG update_risk_exposure_store has 0/16 running and queued tasks
[2024-06-07T01:15:37.976+0000] {scheduler_job_runner.py:480} INFO - DAG update_risk_exposure_store has 1/16 running and queued tasks
[2024-06-07T01:15:37.977+0000] {scheduler_job_runner.py:596} INFO - Setting the following tasks to queued state:
        <TaskInstance: update_risk_exposure_store.database_setup scheduled__2024-05-28T10:10:00+00:00 [scheduled]>
        <TaskInstance: update_risk_exposure_store.positions_extract scheduled__2024-05-28T10:10:00+00:00 [scheduled]>
[2024-06-07T01:15:37.980+0000] {scheduler_job_runner.py:639} INFO - Sending TaskInstanceKey(dag_id='update_risk_exposure_store', task_id='database_setup', run_id='scheduled__2024-05-28T10:10:00+00:00', try_number=3, map_index=-1) to executor with priority 25 and queue default
[2024-06-07T01:15:37.980+0000] {base_executor.py:149} INFO - Adding to queue: ['airflow', 'tasks', 'run', 'update_risk_exposure_store', 'database_setup', 'scheduled__2024-05-28T10:10:00+00:00', '--local', '--subdir', 'DAGS_FOLDER/update_risk_exposure_store.py']
[2024-06-07T01:15:37.980+0000] {scheduler_job_runner.py:639} INFO - Sending TaskInstanceKey(dag_id='update_risk_exposure_store', task_id='positions_extract', run_id='scheduled__2024-05-28T10:10:00+00:00', try_number=3, map_index=-1) to executor with priority 25 and queue default
[2024-06-07T01:15:37.981+0000] {base_executor.py:149} INFO - Adding to queue: ['airflow', 'tasks', 'run', 'update_risk_exposure_store', 'positions_extract', 'scheduled__2024-05-28T10:10:00+00:00', '--local', '--subdir', 'DAGS_FOLDER/update_risk_exposure_store.py']
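
For reference, a rough way to list the task instances sitting in the scheduled state via the Airflow ORM (a minimal sketch, not the exact check we use):

from airflow.models import TaskInstance
from airflow.utils.session import create_session
from airflow.utils.state import TaskInstanceState

# List task instances that are still sitting in the "scheduled" state
with create_session() as session:
    stuck = (
        session.query(TaskInstance)
        .filter(TaskInstance.state == TaskInstanceState.SCHEDULED)
        .all()
    )
    for ti in stuck:
        print(ti.dag_id, ti.task_id, ti.run_id)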

What do you think should happen instead?

No response

How to reproduce

I can easily reproduce it in my production cluster. But I cannot reproduce it in our dev cluster. Both clusters have almost exactly the same setup.

Operating System

Azure Kubernetes Service

Versions of Apache Airflow Providers

No response

Deployment

Official Apache Airflow Helm Chart

Deployment details

Using apache-airflow:2.9.1-python3.10 image

Anything else?

No response

Are you willing to submit PR?

Code of Conduct

boring-cyborg[bot] commented 5 months ago

Thanks for opening your first issue here! Be sure to follow the issue template! If you are willing to raise PR to address this issue please do so, no need to wait for approval.

bangnh1 commented 5 months ago

I'm having the same issue. Airflow tasks get stuck after a few days of running. I worked around it by restarting the scheduler pods.

shahar1 commented 5 months ago

Could you please provide more details for reproducing the issue? (e.g., Airflow settings, what operators are being used in the tasks, versions of operators, etc.) More details are needed so someone can reproduce and fix the issue themselves.

hbc-acai commented 5 months ago

I upgraded from 2.7.3 directly to 2.9.1 and this issue showed up. We did not catch it during testing because it only happened after running the scheduler for some time. Today I tested it more in our dev cluster by running more tasks, and the same issue showed up.

We use the KubernetesExecutor, mostly with Python tasks and the PythonVirtualenvOperator.

Meta database: PostgreSQL 15.6
DAGs and logs Persistent Volumes: Azure Blob Storage
Single scheduler
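
Roughly the kind of DAG we run (a minimal sketch with hypothetical task names and requirements, not one of our actual DAGs):

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonVirtualenvOperator


def extract_positions():
    # Runs inside a freshly created virtualenv; imports must live inside the callable
    import pandas as pd
    print(pd.__version__)


with DAG(
    dag_id="update_risk_exposure_store_example",  # hypothetical, modeled on the DAG in the logs
    start_date=datetime(2024, 5, 1),
    schedule="10 10 * * *",
    catchup=False,
):
    PythonVirtualenvOperator(
        task_id="positions_extract",
        python_callable=extract_positions,
        requirements=["pandas==2.2.2"],  # example dependency
        system_site_packages=False,
    )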

jedcunningham commented 5 months ago

Assuming you haven't installed a different version of the k8s provider, you're running on 8.1.1.

@hbc-acai @bangnh1 do either of you see pods stacking up and not being deleted in k8s? If you are exporting metrics to statsd/prometheus, what do you see for executor.open_slots and executor.running_tasks during that time period?

hbc-acai commented 5 months ago

@jedcunningham Yes, I do see more than 10 pods in READY=0/1 and STATUS=Completed state when I run kubectl get pods -n airflow.

Sorry, how can I export metrics to statsd/prometheus?

Also, what is the correct way to check the k8s provider version? It is strange that I do not see a k8s provider in the package list for my scheduler:

airflow@airflow-scheduler-79b6db7fd8-x77mk:/opt/airflow$ pip freeze
adal==1.2.7
adlfs==2024.4.1
aiobotocore==2.12.3
aiofiles==23.2.1
aiohttp==3.9.5
aioitertools==0.11.0
aioodbc==0.4.1
aiosignal==1.3.1
alembic==1.13.1
amqp==5.2.0
annotated-types==0.6.0
anyio==4.3.0
apache-airflow==2.9.1
apache-airflow-providers-amazon==8.20.0
apache-airflow-providers-celery==3.6.2
apache-airflow-providers-cncf-kubernetes==8.1.1
apache-airflow-providers-common-io==1.3.1
apache-airflow-providers-common-sql==1.12.0
apache-airflow-providers-docker==3.10.0
apache-airflow-providers-elasticsearch==5.3.4
apache-airflow-providers-fab==1.0.4
apache-airflow-providers-ftp==3.8.0
apache-airflow-providers-google==10.17.0
apache-airflow-providers-grpc==3.4.1
apache-airflow-providers-hashicorp==3.6.4
apache-airflow-providers-http==4.10.1
apache-airflow-providers-imap==3.5.0
apache-airflow-providers-microsoft-azure==10.0.0
apache-airflow-providers-mysql==5.5.4
apache-airflow-providers-odbc==4.5.0
apache-airflow-providers-openlineage==1.7.0
apache-airflow-providers-postgres==5.10.2
apache-airflow-providers-redis==3.6.1
apache-airflow-providers-sendgrid==3.4.0
apache-airflow-providers-sftp==4.9.1
apache-airflow-providers-slack==8.6.2
apache-airflow-providers-smtp==1.6.1
apache-airflow-providers-snowflake==5.4.0
apache-airflow-providers-sqlite==3.7.1
apache-airflow-providers-ssh==3.10.1
apispec==6.6.1
jedcunningham commented 5 months ago

It's this one, apache-airflow-providers-cncf-kubernetes==8.1.1. Sorry, I tend to just refer to it as the "k8s provider", but it's actually named "CNCF Kubernetes".
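
If you want to double-check programmatically inside the container, a small sketch using the standard library (Python 3.8+):

from importlib.metadata import version

# Prints the installed CNCF Kubernetes provider version, e.g. 8.1.1
print(version("apache-airflow-providers-cncf-kubernetes"))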

For metrics info, see https://airflow.apache.org/docs/apache-airflow/stable/administration-and-deployment/logging-monitoring/metrics.html.
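
In short, you point Airflow at a StatsD endpoint via the [metrics] options; as a rough sketch, expressed as the equivalent environment variables (host and port values here are just placeholders):

import os

# Equivalent to setting the [metrics] options in airflow.cfg
os.environ["AIRFLOW__METRICS__STATSD_ON"] = "True"
os.environ["AIRFLOW__METRICS__STATSD_HOST"] = "statsd-exporter.airflow.svc"  # placeholder host
os.environ["AIRFLOW__METRICS__STATSD_PORT"] = "9125"                         # placeholder port
os.environ["AIRFLOW__METRICS__STATSD_PREFIX"] = "airflow"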

I suspect you might be hitting #36998, and 8.3.0, which includes #39551, might fix it. Can you try upgrading to that version?

hbc-acai commented 5 months ago

Thanks @jedcunningham. I upgraded to apache-airflow-providers-cncf-kubernetes==8.3.0 and have not seen the same error in the last few hours.

shahar1 commented 5 months ago

As no further action seems required, I'll close this issue. If you encounter this behavior again, please create another issue.

romsharon98 commented 1 month ago

We have the same issue. I looked at executor_open_slots and saw that when we have a spike this metric shows almost 0 open slots. I wonder why that is, because we have 4 schedulers and AIRFLOW__CORE__PARALLELISM is set to 128. Is there a separate configuration for executor slots, and does "executor" here also refer to the scheduler? I also see executor_running_tasks go up to 512 (4 * 128), so as I understand it the executor_open_slots metric isn't correct and only reflects one scheduler.
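
Roughly the arithmetic as I understand it (a simplified sketch of how the per-scheduler gauge could end up near zero; not the actual executor code):

# Assumption: each of the 4 schedulers runs its own executor instance and
# reports executor_open_slots only for itself.
parallelism = 128                 # [core] parallelism, applied per scheduler/executor
running_on_this_scheduler = 126   # hypothetical load during a spike

open_slots = parallelism - running_on_this_scheduler
print(open_slots)                 # 2 -> reads as "almost 0 open slots"

schedulers = 4
print(schedulers * parallelism)   # 512 -> matches the peak seen in executor_running_tasks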

potiuk commented 3 weeks ago

I think it must be some kind of race condition or missing error-condition handling. I wonder if this one is involved in cases where people manually kill tasks, for example from the UI? (This is a totally wild guess - I base it on the fact that it happens pretty often but not always, so there might be some unusual thing that triggers it.)

potiuk commented 2 weeks ago

Airflow 2.10.3 is now out and it includes fix #42932, which is likely to fix the problems you reported. Please upgrade, check whether it fixes your problem, and report back, @hbc-acai.