apache / airflow

Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
https://airflow.apache.org/
Apache License 2.0

Scheduler crashes with psycopg2.errors.DeadlockDetected exception #23361

Closed hkc-8010 closed 2 years ago

hkc-8010 commented 2 years ago

Apache Airflow version

2.2.5 (latest released)

What happened

A customer has a DAG that generates around 2500 tasks dynamically using a task group. While running the DAG, a subset of the tasks (~1000) run successfully with no issue, the remaining tasks (~1500) get "skipped", and the DAG fails. The same DAG runs successfully in Airflow v2.1.3 with the same Airflow configuration.

While investigating the Airflow processes, we found that both schedulers restarted with the error below during DAG execution.

[2022-04-27 20:42:44,347] {scheduler_job.py:742} ERROR - Exception when executing SchedulerJob._run_scheduler_loop
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1256, in _execute_context
    self.dialect.do_executemany(
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/dialects/postgresql/psycopg2.py", line 912, in do_executemany
    cursor.executemany(statement, parameters)
psycopg2.errors.DeadlockDetected: deadlock detected
DETAIL:  Process 1646244 waits for ShareLock on transaction 3915993452; blocked by process 1640692.
Process 1640692 waits for ShareLock on transaction 3915992745; blocked by process 1646244.
HINT:  See server log for query details.
CONTEXT:  while updating tuple (189873,4) in relation "task_instance"

This issue seems to be related to #19957
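
For context (this is not part of the original report): when the Postgres server log is not accessible, the blocked and blocking sessions can also be inspected live on the Airflow metadata database. A minimal sketch using psycopg2 and Postgres's pg_blocking_pids(); the connection string is a placeholder and must point at your metadata database:

import psycopg2

# Placeholder DSN; point this at the Airflow metadata database.
conn = psycopg2.connect("dbname=airflow user=airflow host=localhost")
with conn, conn.cursor() as cur:
    # List sessions that are currently blocked and the backend PIDs blocking them.
    cur.execute("""
        SELECT pid, pg_blocking_pids(pid) AS blocked_by, state, query
        FROM pg_stat_activity
        WHERE cardinality(pg_blocking_pids(pid)) > 0
    """)
    for pid, blocked_by, state, query in cur.fetchall():
        print(f"pid={pid} blocked_by={blocked_by} state={state}\n  {query}")
conn.close()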

What you think should happen instead

This issue was observed while running a huge number of concurrent tasks created dynamically by a DAG. Some of the tasks get skipped because the scheduler restarts with a deadlock exception.

How to reproduce

DAG file:

from propmix_listings_details import BUCKET, ZIPS_FOLDER, CITIES_ZIP_COL_NAME, DETAILS_DEV_LIMIT, DETAILS_RETRY, DETAILS_CONCURRENCY, get_api_token, get_values, process_listing_ids_based_zip
from airflow.utils.task_group import TaskGroup
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator
from datetime import datetime, timedelta

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 0,
}

date = '{{ execution_date }}'
email_to = ['example@airflow.com']
# Using a DAG context manager, you don't have to specify the dag property of each task

state = 'Maha'
with DAG('listings_details_generator_{0}'.format(state),
        start_date=datetime(2021, 11, 18),
        schedule_interval=None,
        max_active_runs=1,
        concurrency=DETAILS_CONCURRENCY,
        dagrun_timeout=timedelta(minutes=10),
        catchup=False  # disable backfill of historical DAG runs
        ) as dag:
    t0 = DummyOperator(task_id='start')

    with TaskGroup(group_id='group_1') as tg1:
        token = get_api_token()
        zip_list = get_values(BUCKET, ZIPS_FOLDER+state, CITIES_ZIP_COL_NAME)
        for zip in zip_list[0:DETAILS_DEV_LIMIT]:
            details_operator = PythonOperator(
                task_id='details_{0}_{1}'.format(state, zip),  # task id is generated dynamically
                pool='pm_details_pool',
                python_callable=process_listing_ids_based_zip,
                task_concurrency=40,
                retries=3,
                retry_delay=timedelta(seconds=10),
                op_kwargs={'zip': zip, 'date': date, 'token':token, 'state':state}
            )

    t0 >> tg1
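
Because the DAG above imports from the private propmix_listings_details module, it cannot be run outside the reporter's environment. Below is a self-contained sketch, under the assumption that the deadlock is driven mainly by the number of dynamically generated tasks in a task group; the task count, sleep-based callable, and names are placeholders, not the reporter's actual values:

from datetime import datetime, timedelta
import time

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator
from airflow.utils.task_group import TaskGroup


def simulate_work(zip_code, **_):
    # Stand-in for process_listing_ids_based_zip: just burn a little time.
    time.sleep(5)
    return zip_code


with DAG('deadlock_repro_sketch',
        start_date=datetime(2021, 11, 18),
        schedule_interval=None,
        max_active_runs=1,
        catchup=False
        ) as dag:
    t0 = DummyOperator(task_id='start')

    with TaskGroup(group_id='group_1') as tg1:
        for i in range(2500):  # roughly the task count from the report
            PythonOperator(
                task_id='details_{0}'.format(i),
                python_callable=simulate_work,
                retries=3,
                retry_delay=timedelta(seconds=10),
                op_kwargs={'zip_code': str(i)}
            )

    t0 >> tg1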

Operating System

Kubernetes cluster running on GCP, Linux (amd64)

Versions of Apache Airflow Providers

pip freeze | grep apache-airflow-providers

apache-airflow-providers-amazon==1!3.2.0
apache-airflow-providers-cncf-kubernetes==1!3.0.0
apache-airflow-providers-elasticsearch==1!2.2.0
apache-airflow-providers-ftp==1!2.1.2
apache-airflow-providers-google==1!6.7.0
apache-airflow-providers-http==1!2.1.2
apache-airflow-providers-imap==1!2.2.3
apache-airflow-providers-microsoft-azure==1!3.7.2
apache-airflow-providers-mysql==1!2.2.3
apache-airflow-providers-postgres==1!4.1.0
apache-airflow-providers-redis==1!2.0.4
apache-airflow-providers-slack==1!4.2.3
apache-airflow-providers-snowflake==2.6.0
apache-airflow-providers-sqlite==1!2.1.3
apache-airflow-providers-ssh==1!2.4.3

Deployment

Astronomer

Deployment details

Airflow v2.2.5-2
Scheduler count: 2
Scheduler resources: 20AU (2 CPU and 7.5 GB)
Executor used: Celery
Worker count: 2
Worker resources: 24AU (2.4 CPU and 9 GB)
Termination grace period: 2 mins

Anything else

This issue happens in every DAG run. Some of the tasks get skipped, some succeed, and the scheduler fails with the deadlock exception.

Are you willing to submit PR?

Code of Conduct

dstaple commented 1 year ago

Upgrading would be best, as this appears to have been fixed about a year ago in 2.3.4. Note that disabling use_row_level_locking is only safe when running a single scheduler.

The specific deadlock I raised in this issue was fixed, yes, which was a big win. I was happy to help with that.

Unfortunately there are still other types of deadlocks that can occur. See, for example, https://github.com/apache/airflow/issues/27473, which was closed for some reason.
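
For reference on the use_row_level_locking setting mentioned above: it is a [scheduler] option in airflow.cfg (also settable via the corresponding environment variable). A minimal sketch of disabling it, which, as noted, is only safe when running a single scheduler:

[scheduler]
use_row_level_locking = False

# Equivalent environment variable for containerized deployments:
# AIRFLOW__SCHEDULER__USE_ROW_LEVEL_LOCKING=False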