apache / airflow

Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
https://airflow.apache.org/
Apache License 2.0

Tasks pods are getting stuck in scheduled state after open slot parallelism count is reached #42383

Open amrit2196 opened 1 week ago

amrit2196 commented 1 week ago

Apache Airflow version

Other Airflow 2 version (please specify below)

If "Other Airflow 2 version" selected, which one?

2.10.0

What happened?

I recently upgraded our Airflow environment from 2.5.3 to 2.10.0. Parallelism is set to 32 and we run three schedulers, so up to 96 tasks can run at once. Once more than 96 tasks have run, any newly scheduled task gets stuck in the scheduled state and the open slot count stays at zero, even though the earlier tasks have completed and been cleared.
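
For reference, one way we watch the counts from outside the scheduler is the stable REST API. The sketch below polls the default pool's slot counts; note this is the pool-level count, which is a separate limit from core parallelism, and the host and credentials here are placeholders, not our actual setup:

```python
# Hypothetical monitoring snippet: poll default_pool via the Airflow stable
# REST API and print how many slots are open vs. running/queued.
# Assumes basic auth is enabled and the webserver is reachable at localhost:8080.
import time

import requests

POOL_URL = "http://localhost:8080/api/v1/pools/default_pool"
AUTH = ("admin", "admin")  # placeholder credentials

while True:
    pool = requests.get(POOL_URL, auth=AUTH, timeout=10).json()
    print(
        f"open={pool['open_slots']} "
        f"running={pool['running_slots']} "
        f"queued={pool['queued_slots']}"
    )
    time.sleep(30)
```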

What you think should happen instead?

The open slot count should increase as tasks complete, and the queued-up tasks should then be scheduled.

How to reproduce

Upgraded and ran 5 or 6 DAGs with 10 tasks in each DAG, with parallelism set to 32 per scheduler; a minimal sketch of such a DAG is included below. Note that the same set of DAGs works fine on Airflow 2.5.3.
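
A minimal sketch of the kind of DAG used here (the sleep duration, URL, schedule, and DAG id are placeholders, not the exact values from our environment; run 5-6 copies to exceed the 96 concurrent task limit):

```python
# Hypothetical reproduction DAG: ~10 tasks, each sleeping and issuing a GET request.
import time
from datetime import datetime

import requests

from airflow import DAG
from airflow.operators.python import PythonOperator


def sleep_and_get():
    """Sleep briefly, then hit a placeholder endpoint."""
    time.sleep(60)
    requests.get("https://example.com/health", timeout=10)


with DAG(
    dag_id="repro_parallelism_stuck",
    start_date=datetime(2024, 1, 1),
    schedule="@hourly",
    catchup=False,
):
    for i in range(10):
        PythonOperator(
            task_id=f"sleep_and_get_{i}",
            python_callable=sleep_and_get,
        )
```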

Operating System

Redhat linux

Versions of Apache Airflow Providers

apache-airflow-providers-postgres==5.12.0
apache-airflow-providers-apache-hive==8.2.0
apache-airflow-providers-amazon==8.28.0
apache-airflow-providers-cncf-kubernetes==8.4.1
apache-airflow-providers-apache-livy==3.9.0
apache-airflow-providers-presto==5.6.0
apache-airflow-providers-http==4.13.0
apache-airflow-providers-trino==5.8.0
apache-airflow-providers-snowflake==5.7.0
apache-airflow-providers-salesforce==5.8.0
apache-airflow-providers-papermill==3.8.0
apache-airflow-providers-google==10.22.0
apache-airflow-providers-celery==3.8.1
apache-airflow-providers-redis==3.8.0
apache-airflow-providers-dbt-cloud==3.10.0
apache-airflow-providers-openlineage==1.11.0

Deployment

Official Apache Airflow Helm Chart

Deployment details

No response

Anything else?

No response

Are you willing to submit PR?

Code of Conduct

boring-cyborg[bot] commented 1 week ago

Thanks for opening your first issue here! Be sure to follow the issue template! If you are willing to raise PR to address this issue please do so, no need to wait for approval.

jscheffl commented 1 week ago

Can you create and post an example DAG to reproduce this? I am a bit curious what effect brings this bug out for you. There are probably hundreds of installations already using 2.10, and it would be a major bug if nobody had detected it, right before we release 2.10.2.

Can you tell which executor you are using?

amrit2196 commented 1 week ago

We are using the Kubernetes executor, but for task pod deletion we run a cronjob that deletes the task pods. This was working fine in 2.5.3, but not in this version.

amrit2196 commented 1 week ago

We are currently running a simple DAG with multiple tasks, each of which sleeps and then makes a GET request.

jscheffl commented 6 days ago

So, to be able to understand this (most likely it is something in the environment), please inspect the scheduler logs. In recent versions, logs are emitted when the scheduler is at the parallelism limit. Can you check for this?

Can you also please post an example DAG with which it is easy to reproduce? Then we could test it as a regression.