DurhamARC / ManyFEWS

The Many Flood Early Warning System
GNU General Public License v3.0
7 stars 3 forks source link

FATAL: the database system is in recovery mode #95

Closed sjmf closed 2 years ago

sjmf commented 2 years ago

The connection to the database sometimes fails when running the flood model, with the exception django.db.utils.OperationalError: connection to server at "postgres" (172.28.0.4), port 5432 failed: FATAL: the database system is in recovery mode.

Viewing the stack traces, the postgres process is killed with signal 9 (SIGKILL).

I believe this is likely to be an Out of Memory error in Postgres.

Postgres Stack Trace:

``` manyfews_postgres | 2022-10-26 14:31:26.627 UTC [183] LOG: server process (PID 1768) was terminated by signal 9: Killed manyfews_postgres | 2022-10-26 14:31:26.627 UTC [183] DETAIL: Failed process was running: UPDATE "calculations_depthprediction" SET "model_version_id" = (CASE WHEN ("calculations_depthprediction"."id" = 2630526) THEN 2 WHEN ("calculations_depthprediction"."id" = 2630540) THEN 2 WHEN ("calculations_depthprediction"."id" = 2630541) THEN 2 WHEN ("calculations_depthprediction"."id" = 2630542) THEN 2 WHEN ("calculations_depthprediction"."id" = 2630543) THEN 2 WHEN ("calculations_depthprediction"."id" = 2630544) THEN 2 WHEN ("calculations_depthprediction"."id" = 2630545) THEN 2 WHEN ("calculations_depthprediction"."id" = 2630546) THEN 2 WHEN ("calculations_depthprediction"."id" = 2630547) THEN 2 WHEN ("calculations_depthprediction"."id" = 2630552) THEN 2 WHEN ("calculations_depthprediction"."id" = 2630556) THEN 2 WHEN ("calculations_depthprediction"."id" = 2630560) THEN 2 WHEN ("calculations_depthprediction"."id" = 2630565) THEN 2 WHEN ("calculations_depthprediction"."id" = 2630566) THEN 2 WHEN ("calculations_depthprediction"."id" = 2630567) THEN 2 WHEN ("calculations_depthprediction"."id" = 2630568) T manyfews_postgres | 2022-10-26 14:31:26.629 UTC [183] LOG: terminating any other active server processes manyfews_postgres | 2022-10-26 14:31:26.922 UTC [1848] manyfews@manyfews FATAL: the database system is in recovery mode ```

Celery Stack Trace:

``` [2022-10-26 15:31:31,870: ERROR/MainProcess] Pool callback raised exception: OperationalError('connection to server at "postgres" (172.28.0.4), port 5432 failed: FATAL: the database system is in recovery mode\n') manyfews_celery | Traceback (most recent call last): manyfews_celery | File "/venv/lib/python3.9/site-packages/django/db/backends/base/base.py", line 282, in ensure_connection manyfews_celery | self.connect() manyfews_celery | File "/venv/lib/python3.9/site-packages/django/utils/asyncio.py", line 26, in inner manyfews_celery | return func(*args, **kwargs) manyfews_celery | File "/venv/lib/python3.9/site-packages/django/db/backends/base/base.py", line 263, in connect manyfews_celery | self.connection = self.get_new_connection(conn_params) manyfews_celery | File "/venv/lib/python3.9/site-packages/django/utils/asyncio.py", line 26, in inner manyfews_celery | return func(*args, **kwargs) manyfews_celery | File "/venv/lib/python3.9/site-packages/django/db/backends/postgresql/base.py", line 215, in get_new_connection manyfews_celery | connection = Database.connect(**conn_params) manyfews_celery | File "/venv/lib/python3.9/site-packages/psycopg2/__init__.py", line 122, in connect manyfews_celery | conn = _connect(dsn, connection_factory=connection_factory, **kwasync) manyfews_celery | psycopg2.OperationalError: connection to server at "postgres" (172.28.0.4), port 5432 failed: FATAL: the database system is in recovery mode manyfews_celery | manyfews_celery | manyfews_celery | The above exception was the direct cause of the following exception: manyfews_celery | manyfews_celery | Traceback (most recent call last): manyfews_celery | File "/venv/lib/python3.9/site-packages/billiard/pool.py", line 1796, in safe_apply_callback manyfews_celery | fun(*args, **kwargs) manyfews_celery | File "/venv/lib/python3.9/site-packages/celery/worker/request.py", line 577, in on_failure manyfews_celery | self.task.backend.mark_as_failure( manyfews_celery | File "/venv/lib/python3.9/site-packages/celery/backends/base.py", line 172, in mark_as_failure manyfews_celery | self.store_result(task_id, exc, state, manyfews_celery | File "/venv/lib/python3.9/site-packages/celery/backends/base.py", line 528, in store_result manyfews_celery | self._store_result(task_id, result, state, traceback, manyfews_celery | File "/venv/lib/python3.9/site-packages/django_celery_results/backends/database.py", line 66, in _store_result manyfews_celery | self.TaskModel._default_manager.store_result( manyfews_celery | File "/venv/lib/python3.9/site-packages/django_celery_results/managers.py", line 46, in _inner manyfews_celery | return fun(*args, **kwargs) manyfews_celery | File "/venv/lib/python3.9/site-packages/django_celery_results/managers.py", line 168, in store_result manyfews_celery | obj, created = self.using(using).get_or_create(task_id=task_id, manyfews_celery | File "/venv/lib/python3.9/site-packages/django/db/models/query.py", line 928, in get_or_create manyfews_celery | return self.get(**kwargs), False manyfews_celery | File "/venv/lib/python3.9/site-packages/django/db/models/query.py", line 646, in get manyfews_celery | num = len(clone) manyfews_celery | File "/venv/lib/python3.9/site-packages/django/db/models/query.py", line 376, in __len__ manyfews_celery | self._fetch_all() manyfews_celery | File "/venv/lib/python3.9/site-packages/django/db/models/query.py", line 1866, in _fetch_all manyfews_celery | self._result_cache = list(self._iterable_class(self)) manyfews_celery | File "/venv/lib/python3.9/site-packages/django/db/models/query.py", line 87, in __iter__ manyfews_celery | results = compiler.execute_sql( manyfews_celery | File "/venv/lib/python3.9/site-packages/django/db/models/sql/compiler.py", line 1396, in execute_sql manyfews_celery | cursor = self.connection.cursor() manyfews_celery | File "/venv/lib/python3.9/site-packages/django/utils/asyncio.py", line 26, in inner manyfews_celery | return func(*args, **kwargs) manyfews_celery | File "/venv/lib/python3.9/site-packages/django/db/backends/base/base.py", line 323, in cursor manyfews_celery | return self._cursor() manyfews_celery | File "/venv/lib/python3.9/site-packages/django/db/backends/base/base.py", line 299, in _cursor manyfews_celery | self.ensure_connection() manyfews_celery | File "/venv/lib/python3.9/site-packages/django/utils/asyncio.py", line 26, in inner manyfews_celery | return func(*args, **kwargs) manyfews_celery | File "/venv/lib/python3.9/site-packages/django/db/backends/base/base.py", line 282, in ensure_connection manyfews_celery | self.connect() manyfews_celery | File "/venv/lib/python3.9/site-packages/django/db/utils.py", line 91, in __exit__ manyfews_celery | raise dj_exc_value.with_traceback(traceback) from exc_value manyfews_celery | File "/venv/lib/python3.9/site-packages/django/db/backends/base/base.py", line 282, in ensure_connection manyfews_celery | self.connect() manyfews_celery | File "/venv/lib/python3.9/site-packages/django/utils/asyncio.py", line 26, in inner manyfews_celery | return func(*args, **kwargs) manyfews_celery | File "/venv/lib/python3.9/site-packages/django/db/backends/base/base.py", line 263, in connect manyfews_celery | self.connection = self.get_new_connection(conn_params) manyfews_celery | File "/venv/lib/python3.9/site-packages/django/utils/asyncio.py", line 26, in inner manyfews_celery | return func(*args, **kwargs) manyfews_celery | File "/venv/lib/python3.9/site-packages/django/db/backends/postgresql/base.py", line 215, in get_new_connection manyfews_celery | connection = Database.connect(**conn_params) manyfews_celery | File "/venv/lib/python3.9/site-packages/psycopg2/__init__.py", line 122, in connect manyfews_celery | conn = _connect(dsn, connection_factory=connection_factory, **kwasync) manyfews_celery | django.db.utils.OperationalError: connection to server at "postgres" (172.28.0.4), port 5432 failed: FATAL: the database system is in recovery mode ```
sjmf commented 2 years ago

Although the Celery log doesn't confirm it, I suspect that the issue arises with the Django bulk_update method called in BulkConnectionManager.

Testing shows that Django generates a very large UPDATE with CASE WHEN statements.

I have decreased the DATABASE_CHUNK_SIZE variable to 1000, tested successfully on the development instance, and committed the change in e606554.

Further testing is required for the optimal DATABASE_CHUNK_SIZE per #92. Closing this issue for now as resolved, however, we should monitor the production database for such errors and consider increasing the work_mem setting in Postgres.