Closed hkc-8010 closed 2 years ago
Upgrading would be best, as this appears to have been fixed about a year ago in 2.3.4. Note that it's only safe to disable use_row_level_locking if you're only using a single scheduler.
The specific deadlock I raised in this issue was fixed, yes, which was a big win. I was happy to help with that.
Unfortunately there are still other types of deadlocks that can occur. See for example https://github.com/apache/airflow/issues/27473 , which was closed for some reason.
Apache Airflow version
2.2.5 (latest released)
What happened
Customer has a dag that generates around 2500 tasks dynamically using a task group. While running the dag, a subset of the tasks (~1000) run successfully with no issue and (~1500) of the tasks are getting "skipped", and the dag fails. The same DAG runs successfully in Airflow v2.1.3 with same Airflow configuration.
While investigating the Airflow processes, We found that both the scheduler got restarted with below error during the DAG execution.
This issue seems to be related to #19957
What you think should happen instead
This issue was observed while running huge number of concurrent task created dynamically by a DAG. Some of the tasks are getting skipped due to restart of scheduler with Deadlock exception.
How to reproduce
DAG file:
Operating System
kubernetes cluster running on GCP linux (amd64)
Versions of Apache Airflow Providers
pip freeze | grep apache-airflow-providers
apache-airflow-providers-amazon==1!3.2.0 apache-airflow-providers-cncf-kubernetes==1!3.0.0 apache-airflow-providers-elasticsearch==1!2.2.0 apache-airflow-providers-ftp==1!2.1.2 apache-airflow-providers-google==1!6.7.0 apache-airflow-providers-http==1!2.1.2 apache-airflow-providers-imap==1!2.2.3 apache-airflow-providers-microsoft-azure==1!3.7.2 apache-airflow-providers-mysql==1!2.2.3 apache-airflow-providers-postgres==1!4.1.0 apache-airflow-providers-redis==1!2.0.4 apache-airflow-providers-slack==1!4.2.3 apache-airflow-providers-snowflake==2.6.0 apache-airflow-providers-sqlite==1!2.1.3 apache-airflow-providers-ssh==1!2.4.3
Deployment
Astronomer
Deployment details
Airflow v2.2.5-2 Scheduler count: 2 Scheduler resources: 20AU (2CPU and 7.5GB) Executor used: Celery Worker count : 2 Worker resources: 24AU (2.4 CPU and 9GB) Termination grace period : 2mins
Anything else
This issue happens in all the dag runs. Some of the tasks are getting skipped and some are getting succeeded and the scheduler fails with the Deadlock exception error.
Are you willing to submit PR?
Code of Conduct