islean opened this issue 1 year ago
The issue could be related to connection pooling. Not sure why we only notice it when applying revisions, though; it is especially confusing since the alembic revisions do not use the connection pool.
Looking through the logs on cg-prod-services around the time a lock arose:
sqlalchemy.exc.OperationalError: (pymysql.err.OperationalError) (1213, 'Deadlock found when trying to get lock; try restarting transaction')
Entire error attached: error.txt
A quick fix would be to turn off pooling, but it would not resolve the root cause. This is just a symptom of an issue in the application code.
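For reference, if we did decide to try that quick fix, turning pooling off in SQLAlchemy is a one-line change. A minimal sketch, assuming the engine is built via `create_engine` (the DSN is a placeholder):

```python
from sqlalchemy import create_engine
from sqlalchemy.pool import NullPool

# NullPool opens a fresh connection per checkout and closes it on
# release, so no connection outlives the unit of work that used it.
engine = create_engine(
    "mysql+pymysql://user:password@host/dbname",  # placeholder DSN
    poolclass=NullPool,
)
```

That trades the stale-connection symptom for extra connect latency on every checkout, which is another reason it is only a stopgap.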
Ran `SET GLOBAL innodb_print_all_deadlocks = ON;` in production to ensure we log the deadlock errors.
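Once it fires again, the deadlock details will land in the MySQL error log; in the meantime the most recent one can be pulled ad hoc with `SHOW ENGINE INNODB STATUS`. A quick sketch of doing that from Python (host and credentials are placeholders):

```python
import pymysql

# Placeholder credentials; point this at cg-prod-services.
conn = pymysql.connect(host="localhost", user="root",
                       password="secret", database="mysql")
try:
    with conn.cursor() as cur:
        # Returns a single row of (Type, Name, Status); Status is the
        # full InnoDB monitor text, including the latest deadlock.
        cur.execute("SHOW ENGINE INNODB STATUS")
        _, _, status = cur.fetchone()
    idx = status.find("LATEST DETECTED DEADLOCK")
    if idx != -1:
        # A generous slice of the monitor text is enough for triage.
        print(status[idx:idx + 4000])
    else:
        print("No deadlock recorded since server start.")
finally:
    conn.close()
```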
Part of the problem might be that we have one codebase for both a web API and a CLI. The CLI has processes that run for 10+ hours, while the connection timeout is set to 8 hours, so these long-running CLI commands fail and the session is not torn down correctly.
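If that theory holds, recycling pooled connections well inside MySQL's 8-hour `wait_timeout` and pinging on checkout should keep long CLI runs from reusing connections the server has already dropped. A minimal sketch, again assuming a `create_engine` call site (placeholder DSN):

```python
from sqlalchemy import create_engine

engine = create_engine(
    "mysql+pymysql://user:password@host/dbname",  # placeholder DSN
    # Replace pooled connections after an hour, well inside MySQL's
    # default 8-hour wait_timeout, so a 10+ hour CLI run never picks
    # up a connection the server has already closed.
    pool_recycle=3600,
    # Issue a cheap liveness check on checkout and transparently
    # reconnect if the connection has gone stale.
    pool_pre_ping=True,
)
```

Independently of the pool settings, wrapping the CLI entry points so sessions are closed in a `finally` block would ensure they are torn down even when a command fails.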
Happened again. This is a severe issue affecting the stability of our systems and potentially causing data loss; we need to prioritize it.
Waiting for this to occur again now that deadlock logging has been enabled.
Description
When deploying #2560, the alembic upgrade froze in production twice, blocking much of the production team's work. Given that the deployment to stage succeeded several times, the issue is unlikely to be linked to the PR itself, which calls for further investigation.
Suggested solution
Find why the deployment froze and fix it.
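For what it's worth, a frozen `alembic upgrade` on MySQL is often an `ALTER TABLE` stuck behind a table metadata lock held by some other open transaction (`SHOW FULL PROCESSLIST` would show the DDL as "Waiting for table metadata lock"), which would also fit the long-running CLI theory above. Next time it hangs, something like the sketch below could identify the blocking transaction (connection parameters are placeholders):

```python
import pymysql

# Placeholder credentials; point this at the production server.
conn = pymysql.connect(host="localhost", user="root", password="secret")
try:
    with conn.cursor() as cur:
        # Open transactions, oldest first: an idle transaction that
        # started hours ago is a prime suspect for holding the
        # metadata lock the migration's DDL is waiting on.
        cur.execute(
            "SELECT trx_mysql_thread_id, trx_state, trx_started, trx_query "
            "FROM information_schema.innodb_trx "
            "ORDER BY trx_started"
        )
        for thread_id, state, started, query in cur.fetchall():
            print(f"thread {thread_id}: {state} since {started}: {query}")
finally:
    conn.close()
```

Killing the offending thread with `KILL <id>` would unblock the migration, though the real fix is whatever left that transaction open.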
This can be closed when
Alembic revisions are applied properly in production.