Closed: amoGLingle closed this issue 1 year ago.
Thanks for opening your first issue here! Be sure to follow the issue template!
As an experiment, we're turning off one of the Schedulers - no HA - to see if we still get a deadlock.
Question: Which version of which database are you running, @amoGLingle? And what kind of Celery setup do you have (which broker, which Celery version, etc.)?
Ah, sorry for the delay! DB is RDS MySQL 8.0.23, executor = CeleryExecutor, Celery module version 4.4.2.
Full module list, just in case:
centos]# pip list
Package Version
---------------------------------------- ---------
alembic 1.6.2
amqp 2.6.1
anyio 3.2.1
apache-airflow 2.1.0
apache-airflow-providers-apache-druid 2.0.0
apache-airflow-providers-apache-livy 2.0.0
apache-airflow-providers-cncf-kubernetes 2.0.0
apache-airflow-providers-ftp 1.1.0
apache-airflow-providers-http 2.0.0
apache-airflow-providers-imap 1.0.1
apache-airflow-providers-mysql 2.0.0
apache-airflow-providers-postgres 2.0.0
apache-airflow-providers-snowflake 2.1.0
apache-airflow-providers-sqlite 1.0.2
apispec 3.3.2
argcomplete 1.12.3
asn1crypto 1.4.0
async-generator 1.10
attrs 20.3.0
Authlib 0.15.5
azure-common 1.1.27
azure-core 1.17.0
azure-storage-blob 12.8.1
Babel 2.9.1
bcrypt 3.2.0
billiard 3.6.4.0
blinker 1.4
boto3 1.17.102
botocore 1.20.102
cached-property 1.5.2
cachetools 4.2.2
cattrs 1.0.0
celery 4.4.2
certifi 2020.12.5
cffi 1.14.5
chardet 4.0.0
click 7.1.2
clickclick 20.10.2
colorama 0.4.4
colorlog 5.0.1
commonmark 0.9.1
contextvars 2.4
croniter 1.0.13
cryptography 3.4.7
dataclasses 0.7
defusedxml 0.7.1
dill 0.3.1.1
dnspython 1.16.0
docutils 0.17.1
email-validator 1.1.2
fab-oidc 0.0.9
Flask 1.1.2
Flask-Admin 1.5.8
Flask-AppBuilder 3.3.0
Flask-Babel 1.0.0
Flask-Bcrypt 0.7.1
Flask-Caching 1.10.1
Flask-JWT-Extended 3.25.1
Flask-Login 0.4.1
Flask-Mail 0.9.1
flask-oidc 1.4.0
Flask-OpenID 1.2.5
Flask-SQLAlchemy 2.5.1
Flask-WTF 0.14.3
google-auth 1.32.0
graphviz 0.16
gunicorn 20.1.0
h11 0.12.0
httpcore 0.13.6
httplib2 0.20.2
httpx 0.18.2
idna 2.10
immutables 0.15
importlib-metadata 1.7.0
importlib-resources 1.5.0
inflection 0.5.1
influxdb 5.3.1
iso8601 0.1.14
isodate 0.6.0
itsdangerous 1.1.0
Jinja2 2.11.3
jmespath 0.10.0
jsonschema 3.2.0
kombu 4.6.11
kubernetes 11.0.0
lazy-object-proxy 1.4.3
ldap3 2.9
lockfile 0.12.2
Mako 1.1.4
Markdown 3.3.4
MarkupSafe 1.1.1
marshmallow 3.12.1
marshmallow-enum 1.5.1
marshmallow-oneofschema 2.1.0
marshmallow-sqlalchemy 0.23.1
msgpack 1.0.2
msrest 0.6.21
mysql-connector-python 8.0.22
mysqlclient 2.0.3
numpy 1.19.5
oauth2client 4.1.3
oauthlib 3.1.1
openapi-schema-validator 0.1.5
openapi-spec-validator 0.3.0
oscrypto 1.2.1
pandas 1.1.5
pendulum 2.1.2
pep562 1.0
pip 21.1.2
polling2 0.4.7
prison 0.1.3
protobuf 3.17.3
psutil 5.8.0
psycopg2-binary 2.9.1
pyasn1 0.4.8
pyasn1-modules 0.2.8
pycparser 2.20
pycryptodomex 3.10.1
pydruid 0.6.2
Pygments 2.9.0
PyJWT 1.7.1
pyOpenSSL 20.0.1
pyparsing 3.0.6
pyrsistent 0.17.3
python-daemon 2.3.0
python-dateutil 2.8.1
python-editor 1.0.4
python-ldap 3.3.1
python-nvd3 0.15.0
python-slugify 4.0.1
python3-openid 3.2.0
pytz 2021.1
pytzdata 2020.1
PyYAML 5.4.1
requests 2.25.1
requests-oauthlib 1.3.0
rfc3986 1.5.0
rich 9.2.0
rsa 4.7.2
s3transfer 0.4.2
semantic-version 2.8.5
setproctitle 1.2.2
setuptools 57.0.0
setuptools-rust 0.12.1
six 1.16.0
sniffio 1.2.0
snowflake-connector-python 2.5.1
snowflake-sqlalchemy 1.2.5
SQLAlchemy 1.3.24
SQLAlchemy-JSONField 1.0.0
SQLAlchemy-Utils 0.37.2
swagger-ui-bundle 0.0.8
tabulate 0.8.9
tenacity 6.2.0
termcolor 1.1.0
text-unidecode 1.3
toml 0.10.2
typing 3.7.4.3
typing-extensions 3.7.4.3
unicodecsv 0.14.1
urllib3 1.26.6
vine 1.3.0
virtualenv 15.1.0
websocket-client 1.1.0
Werkzeug 1.0.1
wheel 0.36.2
WTForms 2.3.3
zipp 3.4.1
Also, an update: we've been running with a single scheduler since the last hang and haven't seen the issue since. Not saying that HA is the issue, just that we haven't seen the deadlock. Thx, G
Thanks. That might help with pinpointing it.
Hello, I think we found the culprit and can close this. We had been occasionally running the db cleanup dag that is part of https://github.com/teamclairvoyant/airflow-maintenance-dags. There didn't seem to be a correlation at first, but the last time it ran, the system locked up within an hour. I did notice that there's an updated version we weren't running, but we haven't bothered to install it: the risk of running it is too high.
Question: Does airflow have/support an official module/dag that does db cleanup?
Look for the airflow db clean command (added in 2.3, I think).
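For anyone landing here later, the built-in command looks roughly like this on Airflow 2.3+ (the cutoff timestamp is illustrative; verify the flags against airflow db clean --help on your version):

```
# Preview what would be deleted without removing anything:
airflow db clean --clean-before-timestamp '2022-01-01 00:00:00' --dry-run

# Actually clean; by default rows are archived to side tables before deletion:
airflow db clean --clean-before-timestamp '2022-01-01 00:00:00' --yes
```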
thx
Apache Airflow version
2.1.0
What happened
We have been running Airflow 2.1.0 with Scheduler HA (2 schedulers) and 4 worker nodes for about 8 months, having upgraded from 1.8. Recently (the last 3-4 months) we've encountered a situation where the schedulers lock up with no tasks running.
Symptom: no tasks getting run, nothing running at all. Restarting the workers didn't help.
Looked at the scheduler logs on both schedulers (syslogs) and saw numerous entries of the error "This Session's transaction has been rolled back due to a previous exception during flush."
This seems to be a bug in Airflow or Celery: the documentation at http://sqlalche.me/e/13/7s2a says this happens when an app improperly ignores a transaction exception and doesn't roll back. Further explanation at https://docs.sqlalchemy.org/en/13/faq/sessions.html#faq-session-rollback
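For illustration, a minimal, self-contained sketch of the failure mode that FAQ describes (this is not Airflow's actual code; the table and in-memory SQLite database are stand-ins for the real models and MySQL):

```python
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker

Base = declarative_base()

class Job(Base):  # illustrative table, not an Airflow model
    __tablename__ = "job"
    id = Column(Integer, primary_key=True)
    state = Column(String(20), nullable=False)

engine = create_engine("sqlite://")  # in-memory stand-in for the real MySQL DB
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()

try:
    session.add(Job(id=1, state=None))  # violates NOT NULL, so the flush fails
    session.commit()
except Exception:
    # Skipping this rollback leaves the session permanently broken: every
    # later query raises the exact error seen in the scheduler logs above.
    session.rollback()

print(session.query(Job).count())  # usable again after the rollback
```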
A prior AIRFLOW jira shows this has been seen before: https://issues.apache.org/jira/browse/AIRFLOW-6202?jql=project%20%3D%20AIRFLOW%20AND%20text%20~%20%22This%20Session%27s%20transaction%20has%20been%20rolled%20back%20due%20to%20a%20previous%20exception%20during%20flush.%22
We have encountered this issue 3 times in the past ~4 months: twice on the PROD cluster and once on the QA one.
What you think should happen instead
The dual schedulers should not hang due to a locked transaction; tasks should keep executing. As my description above says, pointing to the relevant SQLAlchemy documentation, there seems to be a point in the code where the transaction isn't rolled back when it should be.
How to reproduce
I have no idea how to reproduce it. This happens during the normal course of running dags.
Operating System
CentOS Linux 7
Versions of Apache Airflow Providers
prod-master1 centos]# pip list
apache-airflow 2.1.0
apache-airflow-providers-apache-druid 2.0.0
apache-airflow-providers-apache-livy 2.0.0
apache-airflow-providers-cncf-kubernetes 2.0.0
apache-airflow-providers-ftp 1.1.0
apache-airflow-providers-http 2.0.0
apache-airflow-providers-imap 1.0.1
apache-airflow-providers-mysql 2.0.0
apache-airflow-providers-postgres 2.0.0
apache-airflow-providers-snowflake 2.1.0
apache-airflow-providers-sqlite 1.0.2
Deployment
Other
Deployment details
Manual deploy by hand, following the instructions on the Airflow website.
Anything else
This seems to occur only once every few months, and when it does, our production dags just lock up. We have a monitoring dag for each queue we run. Each runs a single small task that pushes a metric to Influx/Grafana, with Grafana alerting to PagerDuty when such lockups occur (or for other issues as well, like networking outages or task runners being down).
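A minimal sketch of that per-queue canary pattern (the queue name, Influx host, database, and measurement below are illustrative placeholders, not the actual setup; the influxdb 5.x client from the pip list above is assumed):

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator
from influxdb import InfluxDBClient  # influxdb 5.x client


def push_heartbeat():
    # Write a single "I ran" point; Grafana alerts when these stop arriving.
    client = InfluxDBClient(host="influx.example.com",
                            database="airflow_monitoring")
    client.write_points([{
        "measurement": "airflow_canary",
        "tags": {"queue": "default"},
        "fields": {"alive": 1},
    }])


with DAG(
    dag_id="canary_default_queue",
    start_date=datetime(2021, 1, 1),
    schedule_interval=timedelta(minutes=5),
    catchup=False,
) as dag:
    PythonOperator(
        task_id="heartbeat",
        python_callable=push_heartbeat,
        queue="default",  # route the task to the worker queue being monitored
    )
```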
The description above shows logs with the ERROR and a pointer to where the issue might be: possibly a transaction not being rolled back after an exception.
Hope this can be (or has already been) found and fixed.
Thank You.
Are you willing to submit PR?
Code of Conduct