apache / airflow

Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
https://airflow.apache.org/
Apache License 2.0
36.71k stars 14.21k forks source link

Deadlock caused by airflow statistics query #26237

Closed NickYadance closed 2 years ago

NickYadance commented 2 years ago

Apache Airflow version

Other Airflow 2 version

What happened

Scheduler got crashed soon after reboot.

[SQL: UPDATE task_instance SET state=%s WHERE task_instance.dag_id = %s AND task_instance.task_id = %s AND task_instance.run_id = %s AND task_instance.map_index = %s]
[parameters: (<TaskInstanceState.SCHEDULED: 'scheduled'>, 'xx', 'xx', 'scheduled__2022-09-07T13:05:00+00:00', -1)]
(Background on this error at: http://sqlalche.me/e/14/e3q8)
[2022-09-08 07:50:45,270] {kubernetes_executor.py:813} INFO - Shutting down Kubernetes executor
[2022-09-08 07:50:45,292] {manager.py:341} WARNING - Ending without manager process.
[2022-09-08 07:50:45,292] {scheduler_job.py:768} INFO - Exited execute loop
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1706, in _execute_context
    cursor, statement, parameters, context
  File "/home/airflow/.local/lib/python3.7/site-packages/sqlalchemy/engine/default.py", line 716, in do_execute
    cursor.execute(statement, parameters)
  File "/home/airflow/.local/lib/python3.7/site-packages/MySQLdb/cursors.py", line 206, in execute
    res = self._query(query)
  File "/home/airflow/.local/lib/python3.7/site-packages/MySQLdb/cursors.py", line 319, in _query
    db.query(q)
  File "/home/airflow/.local/lib/python3.7/site-packages/MySQLdb/connections.py", line 254, in query
    _mysql.connection.query(self, query)
MySQLdb._exceptions.OperationalError: (1205, 'Lock wait timeout exceeded; try restarting transaction')

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/share/miniconda2/envs/airflow/bin/airflow", line 8, in <module>
    sys.exit(main())
  File "/home/airflow/.local/lib/python3.7/site-packages/airflow/__main__.py", line 38, in main
    args.func(args)
  File "/home/airflow/.local/lib/python3.7/site-packages/airflow/cli/cli_parser.py", line 51, in command
    return func(*args, **kwargs)
  File "/home/airflow/.local/lib/python3.7/site-packages/airflow/utils/cli.py", line 99, in wrapper
    return f(*args, **kwargs)
  File "/home/airflow/.local/lib/python3.7/site-packages/airflow/cli/commands/scheduler_command.py", line 75, in scheduler
    _run_scheduler_job(args=args)
  File "/home/airflow/.local/lib/python3.7/site-packages/airflow/cli/commands/scheduler_command.py", line 46, in _run_scheduler_job
    job.run()
  File "/home/airflow/.local/lib/python3.7/site-packages/airflow/jobs/base_job.py", line 244, in run
    self._execute()
  File "/home/airflow/.local/lib/python3.7/site-packages/airflow/jobs/scheduler_job.py", line 730, in _execute
    self.executor.start()
  File "/home/airflow/.local/lib/python3.7/site-packages/airflow/executors/celery_kubernetes_executor.py", line 82, in start
    self.kubernetes_executor.start()
  File "/home/airflow/.local/lib/python3.7/site-packages/airflow/executors/kubernetes_executor.py", line 539, in start
    self.clear_not_launched_queued_tasks()
  File "/home/airflow/.local/lib/python3.7/site-packages/airflow/utils/session.py", line 71, in wrapper
    return func(*args, session=session, **kwargs)
  File "/home/airflow/.local/lib/python3.7/site-packages/airflow/executors/kubernetes_executor.py", line 515, in clear_not_launched_queued_tasks
    ).update({TaskInstance.state: State.SCHEDULED})
  File "/home/airflow/.local/lib/python3.7/site-packages/sqlalchemy/orm/query.py", line 3196, in update
    execution_options={"synchronize_session": synchronize_session},
  File "/home/airflow/.local/lib/python3.7/site-packages/sqlalchemy/orm/session.py", line 1670, in execute
    result = conn._execute_20(statement, params or {}, execution_options)
  File "/home/airflow/.local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1520, in _execute_20
    return meth(self, args_10style, kwargs_10style, execution_options)
  File "/home/airflow/.local/lib/python3.7/site-packages/sqlalchemy/sql/elements.py", line 314, in _execute_on_connection
    self, multiparams, params, execution_options
  File "/home/airflow/.local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1399, in _execute_clauseelement
    cache_hit=cache_hit,
  File "/home/airflow/.local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1749, in _execute_context
    e, statement, parameters, cursor, context
  File "/home/airflow/.local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1930, in _handle_dbapi_exception
    sqlalchemy_exception, with_traceback=exc_info[2], from_=e
  File "/home/airflow/.local/lib/python3.7/site-packages/sqlalchemy/util/compat.py", line 211, in raise_
    raise exception
  File "/home/airflow/.local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1706, in _execute_context
    cursor, statement, parameters, context
  File "/home/airflow/.local/lib/python3.7/site-packages/sqlalchemy/engine/default.py", line 716, in do_execute
    cursor.execute(statement, parameters)
  File "/home/airflow/.local/lib/python3.7/site-packages/MySQLdb/cursors.py", line 206, in execute
    res = self._query(query)
  File "/home/airflow/.local/lib/python3.7/site-packages/MySQLdb/cursors.py", line 319, in _query
    db.query(q)
  File "/home/airflow/.local/lib/python3.7/site-packages/MySQLdb/connections.py", line 254, in query
    _mysql.connection.query(self, query)
sqlalchemy.exc.OperationalError: (MySQLdb._exceptions.OperationalError) (1205, 'Lock wait timeout exceeded; try restarting transaction')
[SQL: UPDATE task_instance SET state=%s WHERE task_instance.dag_id = %s AND task_instance.task_id = %s AND task_instance.run_id = %s AND task_instance.map_index = %s]
[parameters: (<TaskInstanceState.SCHEDULED: 'scheduled'>, 'xx', 'xx', 'scheduled__2022-09-07T13:05:00+00:00', -1)]
(Background on this error at: http://sqlalche.me/e/14/e3q8)

Database engine status shows that some cycle dependency exists between 2 transactions, the select and update query towards table task_instance. And each time the scheduler reboot, the update transaction will timeout and causing scheduler crash.

Noticed that the select query named with airflow statistics so i tried shutdown statsd (delete k8s deployments) blindly and it surprisingly worked. After that the scheduler reboot and run normally.

LATEST DETECTED DEADLOCK
------------------------
2022-09-08 06:31:35 139737395513088
*** (1) TRANSACTION:
TRANSACTION 4413895189, ACTIVE 67 sec starting index read
mysql tables in use 2, locked 2
LOCK WAIT 7 lock struct(s), heap size 1128, 3 row lock(s)
MySQL thread id 1330145004, OS thread handle 139724943472384, query id 66415927240 10.244.2.0 airflow statistics
SELECT task_instance.try_number AS task_instance_try_number, task_instance.task_id AS task_instance_task_id, task_instance.dag_id AS task_instance_dag_id, task_instance.run_id AS task_instance_run_id, task_instance.map_index AS task_instance_map_index, task_instance.start_date AS task_instance_start_date, task_instance.end_date AS task_instance_end_date, task_instance.duration AS task_instance_duration, task_instance.state AS task_instance_state, task_instance.max_tries AS task_instance_max_tries, task_instance.hostname AS task_instance_hostname, task_instance.unixname AS task_instance_unixname, task_instance.job_id AS task_instance_job_id, task_instance.pool AS task_instance_pool, task_instance.pool_slots AS task_instance_pool_slots, task_instance.queue AS task_instance_queue, task_instance.priority_weight AS task_instance_priority_weight, task_instance.operator AS task_instance_operator, task_in

*** (1) HOLDS THE LOCK(S):
RECORD LOCKS space id 522 page no 542731 n bits 120 index PRIMARY of table `airflow`.`task_instance` trx id 4413895189 lock_mode X locks rec but not gap
Record lock, heap no 52 PHYSICAL RECORD: n_fields 29; compact format; info bits 0
 0: len 30; hex 7064705f737061727365646e6e5f70647076355f6d756c74697461736b5f; asc xx; (total 64 bytes);
 1: len 8; hex 747261696e5f4d59; asc xx;;
 2: len 30; hex 7363686564756c65645f5f323032322d30392d30355431373a30303a3030; asc scheduled__2022-09-05T17:00:00; (total 36 bytes);
 3: len 4; hex 7fffffff; asc     ;;
 4: len 6; hex 000107166d96; asc     m ;;
 5: len 7; hex 01000000dc188c; asc        ;;
 6: len 7; hex 631989a00430e4; asc c    0 ;;
 7: SQL NULL;
 8: SQL NULL;
 9: len 7; hex 72756e6e696e67; asc running;;
 10: len 4; hex 80000002; asc     ;;
 11: len 30; hex 73672d73686f7065652d7265636f6d6d656e642d7265636f6d6d656e642d; asc xx; (total 47 bytes);
 12: len 4; hex 726f6f74; asc root;;
 13: len 4; hex 804e4728; asc  NG(;;
 14: len 12; hex 64656661756c745f706f6f6c; asc default_pool;;
 15: len 7; hex 64656661756c74; asc default;;
 16: len 4; hex 80000003; asc     ;;
 17: len 12; hex 426173684f70657261746f72; asc BashOperator;;
 18: len 7; hex 6319892f08ff4f; asc c  /  O;;
 19: len 4; hex 80000060; asc    `;;
 20: len 4; hex 80000004; asc     ;;
 21: len 5; hex 80047d942e; asc   } .;;
 22: len 4; hex 80000001; asc     ;;
 23: len 4; hex 804e478a; asc  NG ;;
 24: SQL NULL;
 25: SQL NULL;
 26: SQL NULL;
 27: SQL NULL;
 28: SQL NULL;

*** (1) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 520 page no 21486 n bits 112 index PRIMARY of table `airflow`.`dag_run` trx id 4413895189 lock_mode X locks rec but not gap waiting
Record lock, heap no 45 PHYSICAL RECORD: n_fields 19; compact format; info bits 0
 0: len 4; hex 8005cdf8; asc     ;;
 1: len 6; hex 00010716a9ca; asc       ;;
 2: len 7; hex 02000002b72669; asc      &i;;
 3: len 30; hex 7064705f737061727365646e6e5f70647076355f6d756c74697461736b5f; asc xx; (total 64 bytes);
 4: len 7; hex 63162b10000000; asc c +    ;;
 5: len 6; hex 717565756564; asc queued;;
 6: len 30; hex 7363686564756c65645f5f323032322d30392d30355431373a30303a3030; asc scheduled__2022-09-05T17:00:00; (total 36 bytes);
 7: len 1; hex 80; asc  ;;
 8: len 5; hex 80047d942e; asc   } .;;
 9: SQL NULL;
 10: SQL NULL;
 11: len 9; hex 7363686564756c6564; asc scheduled;;
 12: SQL NULL;
 13: len 30; hex 316333653765323163363664396663356365326434373835343466323665; asc 1c3e7e21c66d9fc5ce2d478544f26e; (total 32 bytes);
 14: len 4; hex 804ba58d; asc  K  ;;
 15: len 7; hex 63198bfe0195b2; asc c      ;;
 16: len 7; hex 63162b10000000; asc c +    ;;
 17: len 7; hex 63163920000000; asc c 9    ;;
 18: len 4; hex 80000001; asc     ;;

*** (2) TRANSACTION:
TRANSACTION 4413893066, ACTIVE 73 sec starting index read
mysql tables in use 1, locked 1
LOCK WAIT 7 lock struct(s), heap size 1128, 3 row lock(s), undo log entries 2
MySQL thread id 1329985529, OS thread handle 139729063040768, query id 66416312012 10.244.0.215 airflow updating
UPDATE task_instance SET state='restarting' WHERE task_instance.task_id = 'xx' AND task_instance.dag_id = 'xx' AND task_instance.run_id = 'scheduled__2022-09-05T17:00:00+00:00' AND task_instance.map_index = -1

*** (2) HOLDS THE LOCK(S):
RECORD LOCKS space id 520 page no 21486 n bits 112 index PRIMARY of table `airflow`.`dag_run` trx id 4413893066 lock_mode X locks rec but not gap
Record lock, heap no 45 PHYSICAL RECORD: n_fields 19; compact format; info bits 0
 0: len 4; hex 8005cdf8; asc     ;;
 1: len 6; hex 00010716a9ca; asc       ;;
 2: len 7; hex 02000002b72669; asc      &i;;
 3: len 30; hex 7064705f737061727365646e6e5f70647076355f6d756c74697461736b5f; asc xx; (total 64 bytes);
 4: len 7; hex 63162b10000000; asc c +    ;;
 5: len 6; hex 717565756564; asc queued;;
 6: len 30; hex 7363686564756c65645f5f323032322d30392d30355431373a30303a3030; asc scheduled__2022-09-05T17:00:00; (total 36 bytes);
 7: len 1; hex 80; asc  ;;
 8: len 5; hex 80047d942e; asc   } .;;
 9: SQL NULL;
 10: SQL NULL;
 11: len 9; hex 7363686564756c6564; asc scheduled;;
 12: SQL NULL;
 13: len 30; hex 316333653765323163363664396663356365326434373835343466323665; asc 1c3e7e21c66d9fc5ce2d478544f26e; (total 32 bytes);
 14: len 4; hex 804ba58d; asc  K  ;;
 15: len 7; hex 63198bfe0195b2; asc c      ;;
 16: len 7; hex 63162b10000000; asc c +    ;;
 17: len 7; hex 63163920000000; asc c 9    ;;
 18: len 4; hex 80000001; asc     ;;

*** (2) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 522 page no 542731 n bits 120 index PRIMARY of table `airflow`.`task_instance` trx id 4413893066 lock_mode X locks rec but not gap waiting
Record lock, heap no 52 PHYSICAL RECORD: n_fields 29; compact format; info bits 0
 0: len 30; hex 7064705f737061727365646e6e5f70647076355f6d756c74697461736b5f; asc xx; (total 64 bytes);
 1: len 8; hex 747261696e5f4d59; asc xx;;
 2: len 30; hex 7363686564756c65645f5f323032322d30392d30355431373a30303a3030; asc scheduled__2022-09-05T17:00:00; (total 36 bytes);
 3: len 4; hex 7fffffff; asc     ;;
 4: len 6; hex 000107166d96; asc     m ;;
 5: len 7; hex 01000000dc188c; asc        ;;
 6: len 7; hex 631989a00430e4; asc c    0 ;;
 7: SQL NULL;
 8: SQL NULL;
 9: len 7; hex 72756e6e696e67; asc running;;
 10: len 4; hex 80000002; asc     ;;
 11: len 30; hex 73672d73686f7065652d7265636f6d6d656e642d7265636f6d6d656e642d; asc xx; (total 47 bytes);
 12: len 4; hex 726f6f74; asc root;;
 13: len 4; hex 804e4728; asc  NG(;;
 14: len 12; hex 64656661756c745f706f6f6c; asc default_pool;;
 15: len 7; hex 64656661756c74; asc default;;
 16: len 4; hex 80000003; asc     ;;
 17: len 12; hex 426173684f70657261746f72; asc BashOperator;;
 18: len 7; hex 6319892f08ff4f; asc c  /  O;;
 19: len 4; hex 80000060; asc    `;;
 20: len 4; hex 80000004; asc     ;;
 21: len 5; hex 80047d942e; asc   } .;;
 22: len 4; hex 80000001; asc     ;;
 23: len 4; hex 804e478a; asc  NG ;;
 24: SQL NULL;
 25: SQL NULL;
 26: SQL NULL;
 27: SQL NULL;
 28: SQL NULL;

What you think should happen instead

No response

How to reproduce

No response

Operating System

NAME="Ubuntu" VERSION="20.04.1 LTS (Focal Fossa)" ID=ubuntu ID_LIKE=debian PRETTY_NAME="Ubuntu 20.04.1 LTS" VERSION_ID="20.04" HOME_URL="https://www.ubuntu.com/" SUPPORT_URL="https://help.ubuntu.com/" BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/" PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy" VERSION_CODENAME=focal UBUNTU_CODENAME=focal

Versions of Apache Airflow Providers

No response

Deployment

Official Apache Airflow Helm Chart

Deployment details

No response

Anything else

No response

Are you willing to submit PR?

Code of Conduct

aaronolszewski commented 1 year ago

@NickYadance I am running into this exact same issue. Has this been fixed in a later version?

NickYadance commented 1 year ago

No and the issue description is not actually accurate. The statistics query mainly indicates that the query has been waiting
for a long time for something, not 100% deadlock between transactions, maybe due to slow query, high mysql server IO ...

For me , i resolve this by optimizing the IO load of the mysql server. For the daedlock, maybe you can find some help here #27000 with enough debugging info.