Clinical-Genomics / cg

Glue between Clinical Genomics apps
8 stars 2 forks source link

Database deadlocks blocking production #2576

Open islean opened 1 year ago

islean commented 1 year ago

Description

When deploying #2560, the alembic upgrade froze in production twice, blocking a lot of activity for the production team. Given that the deployment to stage succeeded several times, it is unlikely that the issue is linked to the PR, prompting the need for some investigation.

Suggested solution

Find why the deployment froze and fix it.

This can be closed when

Alembic revisions are applied properly in production.

seallard commented 1 year ago

The issue could be related to

  1. We use connection pooling.
  2. We do not always release the resources as intended when the web app/cli context is torn down.

Not sure why we only notice it when applying revisions though. It is especially confusing since the alembic revisions do not use the connection pool.

Looking through the logs on cg-prod-services around the time a lock arose:

sqlalchemy.exc.OperationalError: (pymysql.err.OperationalError) (1213, 'Deadlock found when trying to get lock; try restarting transaction')

Entire error error.txt

A quick fix would be to turn off pooling, but it would not resolve the root cause. This is just a symptom of an issue in the application code.

SET GLOBAL innodb_print_all_deadlocks = ON; in production to ensure we log the dead lock errors.

Part of the problem might be that we have one codebase for both a web api and a cli. The cli has processes which run for 10 + hours. The timeout for a connection is set to 8 hours, so these long running CLI commands fail and the session is not torn down correctly.

seallard commented 4 months ago

Happened again, this is a severe issue affecting the stability of our systems and potentially causing data loss. We need to prioritize it.

Vince-janv commented 2 months ago

Blocked by https://github.com/Clinical-Genomics/IT-issues/issues/880

Vince-janv commented 1 month ago

Waiting for this to occur again after logs have been enabled