ppr-api: database Health-check should not commit for an invalid connection

jguertin commented 4 years ago

This issue is to address an issue appearing in Sentry: https://sentry.io/organizations/bc-government-registries/issues/1488039825/?project=1834187

It appears the connection was dropped when we performed "SELECT 1" against the database in the health check endpoint. Here's a bit more background on how we can optimistically deal with this kind of problem. https://docs.sqlalchemy.org/en/13/core/pooling.html#disconnect-handling-optimistic

The logs copied from the server provide a bit more context over what is available in Sentry:

Patroni health check failed
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1249, in _execute_context
    cursor, statement, parameters, context
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/default.py", line 580, in do_execute
    cursor.execute(statement, parameters)
psycopg2.OperationalError: server closed the connection unexpectedly
    This probably means the server terminated abnormally
    before or while processing the request.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/app/endpoints/healthcheck.py", line 60, in db_health
    session.execute("SELECT 1")
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/orm/session.py", line 1269, in execute
    clause, params or {}
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 988, in execute
    return meth(self, multiparams, params)
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/sql/elements.py", line 287, in _execute_on_connection
    return connection._execute_clauseelement(self, multiparams, params)
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1107, in _execute_clauseelement
    distilled_params,
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1253, in _execute_context
    e, statement, parameters, cursor, context
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1473, in _handle_dbapi_exception
    util.raise_from_cause(sqlalchemy_exception, exc_info)
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/util/compat.py", line 398, in raise_from_cause
    reraise(type(exception), exception, tb=exc_tb, cause=cause)
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/util/compat.py", line 152, in reraise
    raise value.with_traceback(tb)
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1249, in _execute_context
    cursor, statement, parameters, context
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/default.py", line 580, in do_execute
    cursor.execute(statement, parameters)
sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) server closed the connection unexpectedly
    This probably means the server terminated abnormally
    before or while processing the request.

[SQL: SELECT 1]
(Background on this error at: http://sqlalche.me/e/e3q8 )
[2020-01-31 22:21:28 +0000] [13] [INFO] ('172.51.20.1', 60996) - "GET /operations/patroni HTTP/1.1" 503
[2020-01-31 22:21:28 +0000] [13] [ERROR] Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/uvicorn/protocols/http/httptools_impl.py", line 375, in run_asgi
    result = await app(self.scope, self.receive, self.send)
  File "/usr/local/lib/python3.7/site-packages/fastapi/applications.py", line 139, in __call__
    await super().__call__(scope, receive, send)
  File "/usr/local/lib/python3.7/contextlib.py", line 678, in __aexit__
    raise exc_details[1]
  File "/usr/local/lib/python3.7/contextlib.py", line 661, in __aexit__
    cb_suppress = await cb(*exc_details)
  File "/usr/local/lib/python3.7/contextlib.py", line 552, in _exit_wrapper
    return await cm_exit(cm, exc_type, exc, tb)
  File "/usr/local/lib/python3.7/contextlib.py", line 177, in __aexit__
    await self.gen.__anext__()
  File "/usr/local/lib/python3.7/site-packages/fastapi/concurrency.py", line 45, in contextmanager_in_threadpool
    await run_in_threadpool(cm.__exit__, None, None, None)
  File "/usr/local/lib/python3.7/site-packages/starlette/concurrency.py", line 25, in run_in_threadpool
    return await loop.run_in_executor(None, func, *args)
  File "/usr/local/lib/python3.7/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/local/lib/python3.7/contextlib.py", line 119, in __exit__
    next(self.gen)
  File "/app/models/patroni.py", line 31, in get_session
    raise ex
  File "/app/models/patroni.py", line 28, in get_session
    db.commit()
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/orm/session.py", line 1027, in commit
    self.transaction.commit()
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/orm/session.py", line 498, in commit
    t[1].commit()
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1724, in commit
    self._do_commit()
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1755, in _do_commit
    self.connection._commit_impl()
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 763, in _commit_impl
    self._handle_dbapi_exception(e, None, None, None, None)
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1475, in _handle_dbapi_exception
    util.reraise(*exc_info)
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/util/compat.py", line 153, in reraise
    raise value
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 761, in _commit_impl
    self.engine.dialect.do_commit(self.connection)
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 390, in connection
    self._handle_dbapi_exception(e, None, None, None, None)
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1475, in _handle_dbapi_exception
    util.reraise(*exc_info)
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/util/compat.py", line 153, in reraise
    raise value
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 388, in connection
    return self._revalidate_connection()
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 463, in _revalidate_connection
    "Can't reconnect until invalid "
sqlalchemy.exc.InvalidRequestError: Can't reconnect until invalid transaction is rolled back
[2020-01-31 22:21:36 +0000] [14] [INFO] ('172.51.52.1', 48282) - "GET /operations/health HTTP/1.1" 200

jguertin commented 4 years ago

When solving the issue, we may instead look at pessimistic disconnect handling. The strategy is essentially to perform a SELECT 1 on connections when SQL Alchemy prepares to use it: https://docs.sqlalchemy.org/en/13/core/pooling.html#disconnect-handling-pessimistic

jguertin commented 4 years ago

Regarding pessimistic disconnect handling, Thor noted:

We've has been finding those settings aren't doing exactly what he needs. I'm using some the sqlalchemy settings and also have the DB liveness in the probe having to pod roll itself if it can't connect well. although with the DNS outtages in the cluster of late that makess for a fairly active set of rolls

Based on this we may need to use leverage the DB health check endpoint to reset connection as a secondary function.

bcgov / ppr-deprecated

ppr-api: database Health-check should not commit for an invalid connection #393