apache / airflow

Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
https://airflow.apache.org/
Apache License 2.0

Airflow Scheduler Deadlock - Transaction not rolled back on Exception? #24909

Closed amoGLingle closed 1 year ago

amoGLingle commented 2 years ago

Apache Airflow version

2.1.0

What happened

We have been running Airflow 2.1.0 with Scheduler HA (2 schedulers) and 4 worker nodes for about 8 months, having upgraded from 1.8. Recently (over the last 3-4 months) we've encountered a situation where the schedulers lock up with no tasks running.

Symptom: No tasks getting run. Nothing running at all. Restarted workers, no luck.

Looked at scheduler logs on 2 schedulers (syslogs) and saw numerous entries like:

[root@af2-dod-prod-master1 centos]# cat /var/log/messages | grep "list index"
Mar 29 03:10:03 af2-dod-prod-master1 scl: list index out of range
Mar 29 03:10:05 af2-dod-prod-master1 scl: list index out of range
--
Mar 29 03:10:23 af2-dod-prod-master1 scl: [2022-03-29 03:10:23,672] {celery_executor.py:295} ERROR - Error sending Celery task: This Session's transaction has been rolled back due to a previous exception during flush. To begin a new transaction with this Session, first issue Session.rollback(). Original exception was: Timeout, PID: 15673 (Background on this error at: http://sqlalche.me/e/13/7s2a)
Mar 29 03:10:23 af2-dod-prod-master1 scl: Celery Task ID: TaskInstanceKey(dag_id='dod_dsp_audience_edge', task_id='emit_datamine_druid_delay_to_influxdb', execution_date=datetime.datetime(2022, 3, 28, 20, 0, tzinfo=Timezone('UTC')), try_number=1)
--
Mar 29 03:10:03 af2-dod-prod-master1 scl: [2022-03-29 03:10:03,639] {dagrun.py:429} ERROR - Marking run <DagRun dod_queue_execution_monitor_worker4 @ 2022-03-29 03:05:00+00:00: scheduled__2022-03-29T03:05:00+00:00, externally triggered: False> failed
Mar 29 03:10:03 af2-dod-prod-master1 scl: [2022-03-29 03:10:03,639] {dagrun.py:608} WARNING - Failed to record first_task_scheduling_delay metric:
Mar 29 03:10:03 af2-dod-prod-master1 scl: list index out of range
--
Mar 29 03:10:01 af2-dod-prod-master1 scl: [2022-03-29 03:10:01,631] {celery_executor.py:295} ERROR - Error sending Celery task: This Session's transaction has been rolled back due to a previous exception during flush. To begin a new transaction with this Session, first issue Session.rollback(). Original exception was: Timeout, PID: 15673 (Background on this error at: http://sqlalche.me/e/13/7s2a)
Mar 29 03:10:01 af2-dod-prod-master1 scl: Celery Task ID: TaskInstanceKey(dag_id='dod_sync_monitor', task_id='load_dod_sync_post_data', execution_date=datetime.datetime(2022, 3, 29, 3, 5, tzinfo=Timezone('UTC')), try_number=1)

This looks like a bug in Airflow or Celery: the documentation at http://sqlalche.me/e/13/7s2a says that this happens when an application improperly ignores a transaction exception and doesn't roll back. Further explanation is at https://docs.sqlalchemy.org/en/13/faq/sessions.html#faq-session-rollback
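
For readers unfamiliar with the pattern the SQLAlchemy FAQ describes, here is a minimal sketch of rolling back on an exception so that later work on the same connection is not affected by an earlier failure. It uses the standard-library sqlite3 module as a stand-in for a SQLAlchemy Session (sqlite3 does not itself raise the "transaction has been rolled back" error), and the table and helper names are hypothetical, not Airflow code.

```python
import sqlite3
from contextlib import contextmanager


@contextmanager
def transaction(conn):
    """Commit on success; roll back on any exception, then re-raise it."""
    try:
        yield conn
        conn.commit()
    except Exception:
        # Without this rollback, the failed unit of work would stay open.
        conn.rollback()
        raise


def run_demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE task (id INTEGER PRIMARY KEY, state TEXT NOT NULL)")
    conn.commit()

    with transaction(conn):
        conn.execute("INSERT INTO task (id, state) VALUES (1, 'queued')")

    try:
        with transaction(conn):
            conn.execute("INSERT INTO task (id, state) VALUES (2, 'queued')")
            # Duplicate primary key: raises IntegrityError mid-transaction.
            conn.execute("INSERT INTO task (id, state) VALUES (1, 'duplicate')")
    except sqlite3.IntegrityError:
        pass  # The helper already rolled back, so row 2 is gone too.

    # The connection is usable again because the failure was rolled back.
    with transaction(conn):
        conn.execute("INSERT INTO task (id, state) VALUES (3, 'queued')")

    return [row[0] for row in conn.execute("SELECT id FROM task ORDER BY id")]
```

With a SQLAlchemy Session the equivalent step would be calling Session.rollback() in the except branch, as the error message in the logs suggests.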

A prior Airflow Jira issue shows this has been seen before: https://issues.apache.org/jira/browse/AIRFLOW-6202?jql=project%20%3D%20AIRFLOW%20AND%20text%20~%20%22This%20Session%27s%20transaction%20has%20been%20rolled%20back%20due%20to%20a%20previous%20exception%20during%20flush.%22

We have encountered this issue 3 times in the past ~4 months: twice on the PROD cluster and once on the QA cluster.

What you think should happen instead

The dual schedulers should not hang due to a locked transaction. Tasks should keep executing. As my description above says, pointing to the relevant SQLAlchemy documentation, there seems to be a point in the code where the transaction isn't rolled back when it should be.

How to reproduce

I have no idea how to reproduce this. It happens during the normal course of running DAGs.

Operating System

CentOS Linux 7

Versions of Apache Airflow Providers

prod-master1 centos]# pip list
apache-airflow                           2.1.0
apache-airflow-providers-apache-druid    2.0.0
apache-airflow-providers-apache-livy     2.0.0
apache-airflow-providers-cncf-kubernetes 2.0.0
apache-airflow-providers-ftp             1.1.0
apache-airflow-providers-http            2.0.0
apache-airflow-providers-imap            1.0.1
apache-airflow-providers-mysql           2.0.0
apache-airflow-providers-postgres        2.0.0
apache-airflow-providers-snowflake       2.1.0
apache-airflow-providers-sqlite          1.0.2

Deployment

Other

Deployment details

Manual hand deploy following instructions on Airflow website.

Anything else

This seems to occur only once every few months. When it does, our production DAGs just lock up. We have a monitoring DAG for each of our queues. Each runs a single small task that pushes to InfluxDB/Grafana, and Grafana alerting notifies PagerDuty when such lockups occur (or when other issues arise, such as networking outages or task runners being down).
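
To make the monitoring idea concrete, here is a rough sketch of the staleness check behind such an alert: each queue's monitoring task records a heartbeat, and a queue is flagged when its newest heartbeat is too old. In our setup this logic lives in Grafana alerting rather than in Python; the function name, queue names, and the 10-minute threshold below are all hypothetical.

```python
from datetime import datetime, timedelta, timezone


def stale_queues(last_heartbeat, now, max_age=timedelta(minutes=10)):
    """Return the sorted names of queues whose last heartbeat is older than max_age."""
    return sorted(
        queue for queue, seen_at in last_heartbeat.items() if now - seen_at > max_age
    )


# Hypothetical heartbeats, as they might be read back from a metrics store.
now = datetime(2022, 3, 29, 3, 30, tzinfo=timezone.utc)
heartbeats = {
    "worker1": now - timedelta(minutes=2),   # recent: healthy
    "worker4": now - timedelta(minutes=25),  # old: scheduler or worker is stuck
}
flagged = stale_queues(heartbeats, now)
```

A check like this cannot distinguish a scheduler lockup from a network outage or a dead worker; it only says that a queue has stopped running tasks.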

The description above shows logs with ERROR entries and a pointer to where the issue might be: possibly a transaction not being rolled back after an exception.

Hope this can be (or has already been) found and fixed.

Thank You.

boring-cyborg[bot] commented 2 years ago

Thanks for opening your first issue here! Be sure to follow the issue template!

amoGLingle commented 2 years ago

As an experiment, we're turning off one of the Schedulers - no HA - to see if we still get a deadlock.

potiuk commented 2 years ago

Question: Which database, and which version of it, do you have, @amoGLingle? And what kind of Celery setup (which broker, which Celery version, etc.)?

amoGLingle commented 2 years ago

Ah, sorry for the delay! DB is RDS MySQL 8.0.23, executor = CeleryExecutor, celery module version 4.4.2.

Full module list, just in case:

centos]# pip list
Package                                  Version
---------------------------------------- ---------
alembic                                  1.6.2
amqp                                     2.6.1
anyio                                    3.2.1
apache-airflow                           2.1.0
apache-airflow-providers-apache-druid    2.0.0
apache-airflow-providers-apache-livy     2.0.0
apache-airflow-providers-cncf-kubernetes 2.0.0
apache-airflow-providers-ftp             1.1.0
apache-airflow-providers-http            2.0.0
apache-airflow-providers-imap            1.0.1
apache-airflow-providers-mysql           2.0.0
apache-airflow-providers-postgres        2.0.0
apache-airflow-providers-snowflake       2.1.0
apache-airflow-providers-sqlite          1.0.2
apispec                                  3.3.2
argcomplete                              1.12.3
asn1crypto                               1.4.0
async-generator                          1.10
attrs                                    20.3.0
Authlib                                  0.15.5
azure-common                             1.1.27
azure-core                               1.17.0
azure-storage-blob                       12.8.1
Babel                                    2.9.1
bcrypt                                   3.2.0
billiard                                 3.6.4.0
blinker                                  1.4
boto3                                    1.17.102
botocore                                 1.20.102
cached-property                          1.5.2
cachetools                               4.2.2
cattrs                                   1.0.0
celery                                   4.4.2
certifi                                  2020.12.5
cffi                                     1.14.5
chardet                                  4.0.0
click                                    7.1.2
clickclick                               20.10.2
colorama                                 0.4.4
colorlog                                 5.0.1
commonmark                               0.9.1
contextvars                              2.4
croniter                                 1.0.13
cryptography                             3.4.7
dataclasses                              0.7
defusedxml                               0.7.1
dill                                     0.3.1.1
dnspython                                1.16.0
docutils                                 0.17.1
email-validator                          1.1.2
fab-oidc                                 0.0.9
Flask                                    1.1.2
Flask-Admin                              1.5.8
Flask-AppBuilder                         3.3.0
Flask-Babel                              1.0.0
Flask-Bcrypt                             0.7.1
Flask-Caching                            1.10.1
Flask-JWT-Extended                       3.25.1
Flask-Login                              0.4.1
Flask-Mail                               0.9.1
flask-oidc                               1.4.0
Flask-OpenID                             1.2.5
Flask-SQLAlchemy                         2.5.1
Flask-WTF                                0.14.3
google-auth                              1.32.0
graphviz                                 0.16
gunicorn                                 20.1.0
h11                                      0.12.0
httpcore                                 0.13.6
httplib2                                 0.20.2
httpx                                    0.18.2
idna                                     2.10
immutables                               0.15
importlib-metadata                       1.7.0
importlib-resources                      1.5.0
inflection                               0.5.1
influxdb                                 5.3.1
iso8601                                  0.1.14
isodate                                  0.6.0
itsdangerous                             1.1.0
Jinja2                                   2.11.3
jmespath                                 0.10.0
jsonschema                               3.2.0
kombu                                    4.6.11
kubernetes                               11.0.0
lazy-object-proxy                        1.4.3
ldap3                                    2.9
lockfile                                 0.12.2
Mako                                     1.1.4
Markdown                                 3.3.4
MarkupSafe                               1.1.1
marshmallow                              3.12.1
marshmallow-enum                         1.5.1
marshmallow-oneofschema                  2.1.0
marshmallow-sqlalchemy                   0.23.1
msgpack                                  1.0.2
msrest                                   0.6.21
mysql-connector-python                   8.0.22
mysqlclient                              2.0.3
numpy                                    1.19.5
oauth2client                             4.1.3
oauthlib                                 3.1.1
openapi-schema-validator                 0.1.5
openapi-spec-validator                   0.3.0
oscrypto                                 1.2.1
pandas                                   1.1.5
pendulum                                 2.1.2
pep562                                   1.0
pip                                      21.1.2
polling2                                 0.4.7
prison                                   0.1.3
protobuf                                 3.17.3
psutil                                   5.8.0
psycopg2-binary                          2.9.1
pyasn1                                   0.4.8
pyasn1-modules                           0.2.8
pycparser                                2.20
pycryptodomex                            3.10.1
pydruid                                  0.6.2
Pygments                                 2.9.0
PyJWT                                    1.7.1
pyOpenSSL                                20.0.1
pyparsing                                3.0.6
pyrsistent                               0.17.3
python-daemon                            2.3.0
python-dateutil                          2.8.1
python-editor                            1.0.4
python-ldap                              3.3.1
python-nvd3                              0.15.0
python-slugify                           4.0.1
python3-openid                           3.2.0
pytz                                     2021.1
pytzdata                                 2020.1
PyYAML                                   5.4.1
requests                                 2.25.1
requests-oauthlib                        1.3.0
rfc3986                                  1.5.0
rich                                     9.2.0
rsa                                      4.7.2
s3transfer                               0.4.2
semantic-version                         2.8.5
setproctitle                             1.2.2
setuptools                               57.0.0
setuptools-rust                          0.12.1
six                                      1.16.0
sniffio                                  1.2.0
snowflake-connector-python               2.5.1
snowflake-sqlalchemy                     1.2.5
SQLAlchemy                               1.3.24
SQLAlchemy-JSONField                     1.0.0
SQLAlchemy-Utils                         0.37.2
swagger-ui-bundle                        0.0.8
tabulate                                 0.8.9
tenacity                                 6.2.0
termcolor                                1.1.0
text-unidecode                           1.3
toml                                     0.10.2
typing                                   3.7.4.3
typing-extensions                        3.7.4.3
unicodecsv                               0.14.1
urllib3                                  1.26.6
vine                                     1.3.0
virtualenv                               15.1.0
websocket-client                         1.1.0
Werkzeug                                 1.0.1
wheel                                    0.36.2
WTForms                                  2.3.3
zipp                                     3.4.1

amoGLingle commented 2 years ago

Also, an update. We've been running with a single scheduler since the last hang and haven't seen the issue since then. Not saying that HA is the issue, just that we haven't seen the deadlock. Thx, G

potiuk commented 2 years ago

Thanks. That might help with pinpointing it.

amoGLingle commented 1 year ago

Hello, I think we found the culprit and can close this. We had been occasionally running the db cleanup DAG that is part of https://github.com/teamclairvoyant/airflow-maintenance-dags. There hadn't seemed to be a correlation, but the last time it was run, the system locked up within an hour. I did notice that there's an updated version that we weren't running, but we haven't bothered to install it: the risk of running it is too high.

Question: Does airflow have/support an official module/dag that does db cleanup?

potiuk commented 1 year ago

Question: Does airflow have/support an official module/dag that does db cleanup?

Look for the airflow db clean command (added in 2.3, I think).
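
For anyone landing here later, a small sketch of how one might assemble an airflow db clean invocation with a retention cutoff, defaulting to a dry run. The helper name and the retention value are hypothetical; the --clean-before-timestamp and --dry-run flags are from the Airflow CLI as I understand it, so check airflow db clean --help on your version before relying on them.

```python
from datetime import datetime, timedelta, timezone


def build_db_clean_command(retention_days, now=None, dry_run=True):
    """Build the argv list for `airflow db clean`, keeping `retention_days` of history."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=retention_days)
    command = [
        "airflow", "db", "clean",
        "--clean-before-timestamp", cutoff.isoformat(),
    ]
    if dry_run:
        # Report what would be deleted without deleting anything.
        command.append("--dry-run")
    return command


# Example: keep the last 30 days, relative to a fixed reference time.
example = build_db_clean_command(30, now=datetime(2023, 1, 31, tzinfo=timezone.utc))
```

The list can be passed to subprocess.run once the dry-run output looks right. Given the lockup described above, running any cleanup during a quiet window seems prudent.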

amoGLingle commented 1 year ago

thx