FSX / momoko

Wraps (asynchronous) Psycopg2 for Tornado.
http://momoko.61924.nl/

Restarting PostgreSQL sometimes causes Pool.getconn() to hang #147

Closed: gward closed this issue 8 years ago

gward commented 8 years ago

Imagine a simple application that runs an infinite loop, periodically querying a database via momoko:

@tornado.gen.coroutine
def run(dsn, io_loop):
    pool = momoko.Pool(dsn=dsn, size=1, ioloop=io_loop)
    yield pool.connect()
    while True:
        conn = yield pool.getconn()
        cur = yield conn.execute('select pg_backend_pid()')
        pid = cur.fetchall()[0][0]
        log.info('pid: %r', pid)
        cur.close()
        pool.putconn(conn)
        time.sleep(random.uniform(0.2, 0.8))

As long as I never restart my PostgreSQL server, this works perfectly. But when I restart PG, bad stuff happens. Specifically:

- most of the time, the next pool.getconn() fails with DatabaseNotAvailable, and the loop recovers on the following iteration with a new backend
- occasionally, pool.getconn() never returns at all: the coroutine hangs and the process is stuck until I kill it

The latter bug is what I'm really complaining about. If momoko's position is that PG restarts are an application problem rather than a connection pool problem, and it's my job to retry getconn() to handle them: not ideal, but I can live with it.

But if momoko occasionally freezes up when PG restarts, that's a more serious problem. IMHO that needs to be fixed.

gward commented 8 years ago

Oops, forgot to show my complete reproduction script:

$ cat pgrestart.py
#!/usr/bin/python

# What happens when momoko is connected to a database and the postgresql server is
# restarted? momoko should recover, but doesn't.

from __future__ import print_function

import sys
import logging
import random
import time

import psycopg2

from tornado import gen, ioloop
import momoko

log = logging.getLogger('pgrestart')

@gen.coroutine
def test(dsn, io_loop):
    log.info('creating pool: %r', dsn)
    pool = momoko.Pool(dsn=dsn, size=1, ioloop=io_loop)
    yield pool.connect()
    log.info('pool: %r', pool)

    while True:
        log.debug('calling pool.getconn()')
        try:
            conn = yield pool.getconn()
        except psycopg2.DatabaseError:
            log.exception('error from pool.getconn()')
            time.sleep(1)
            continue

        log.debug('querying conn %r', conn)
        cur = yield conn.execute('select pg_backend_pid()')
        pid = cur.fetchall()[0][0]
        log.info('pid: %r', pid)
        cur.close()
        pool.putconn(conn)

        time.sleep(random.uniform(0.2, 0.8))

    pool.close()

def main():
    dsn = sys.argv[1]

    logging.basicConfig(format='[%(asctime)s %(levelname)-1.1s %(name)s] %(message)s',
                        level=logging.DEBUG)
    logging.getLogger("momoko").setLevel(logging.INFO)

    io_loop = ioloop.IOLoop.instance()
    future = test(dsn, io_loop)
    io_loop.add_future(future, lambda x: io_loop.stop())
    io_loop.start()

main()

If I run this normally, the output is very boring:

$ python pgrestart.py 'host=localhost dbname=psycopg2_test'
[2016-07-04 16:25:46,807 I pgrestart] creating pool: 'host=localhost dbname=psycopg2_test'
[2016-07-04 16:25:46,819 I pgrestart] pool: <momoko.connection.Pool object at 0x7f55c52ffc50>
[2016-07-04 16:25:46,819 D pgrestart] calling pool.getconn()
[2016-07-04 16:25:46,820 D pgrestart] querying conn <momoko.connection.Connection object at 0x7f55bdec27d0>
[2016-07-04 16:25:46,821 I pgrestart] pid: 29496
[2016-07-04 16:25:47,528 D pgrestart] calling pool.getconn()
[2016-07-04 16:25:47,530 D pgrestart] querying conn <momoko.connection.Connection object at 0x7f55bdec27d0>
[2016-07-04 16:25:47,531 I pgrestart] pid: 29496
[...repeat forever...]

But if I start it running and then, in another terminal, run sudo service postgresql restart, here's what I get almost every time:

[2016-07-04 16:26:52,676 D pgrestart] calling pool.getconn()
[2016-07-04 16:26:52,682 D pgrestart] querying conn <momoko.connection.Connection object at 0x7f020b16a7d0>
[2016-07-04 16:26:52,683 I pgrestart] pid: 29510
[2016-07-04 16:26:53,445 D pgrestart] calling pool.getconn()
[2016-07-04 16:26:53,448 E pgrestart] error from pool.getconn()
Traceback (most recent call last):
  File "pgrestart.py", line 32, in test
    conn = yield pool.getconn()
  File "/var/lib/venv/renesys/local/lib/python2.7/site-packages/tornado/gen.py", line 1008, in run
    value = future.result()
  File "/var/lib/venv/renesys/local/lib/python2.7/site-packages/tornado/concurrent.py", line 232, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 3, in raise_exc_info
DatabaseNotAvailable: No database connection available
[2016-07-04 16:26:54,450 D pgrestart] calling pool.getconn()
[2016-07-04 16:26:54,470 D pgrestart] querying conn <momoko.connection.Connection object at 0x7f020b16a7d0>
[2016-07-04 16:26:54,471 I pgrestart] pid: 29543

That is, one query fails while PG is down, and then we're back in business with a new backend. I'd love it if momoko hid that and did the necessary retries for me, but that's not the serious problem.
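
For what it's worth, if retrying really is my job, here is roughly the helper I'd end up wrapping around getconn(). Just a sketch - getconn_with_retry is my own name, not a momoko API, and it uses gen.sleep instead of the blocking time.sleep my script does:

import logging

import psycopg2
from tornado import gen

log = logging.getLogger('pgrestart')

@gen.coroutine
def getconn_with_retry(pool, attempts=5, delay=1.0):
    # Hypothetical application-level helper: retry pool.getconn() a few times
    # so a PG restart costs a short pause instead of an error.
    for attempt in range(1, attempts + 1):
        try:
            conn = yield pool.getconn()
        except psycopg2.DatabaseError:
            log.exception('getconn() failed (attempt %d/%d)', attempt, attempts)
            yield gen.sleep(delay)  # non-blocking, unlike time.sleep()
        else:
            raise gen.Return(conn)
    raise psycopg2.DatabaseError('no connection after %d attempts' % attempts)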

Here is the serious problem, which I see occasionally:

[2016-07-04 16:27:12,581 D pgrestart] calling pool.getconn()
[2016-07-04 16:27:12,583 D pgrestart] querying conn <momoko.connection.Connection object at 0x7f020b16a7d0>
[2016-07-04 16:27:12,584 I pgrestart] pid: 29634
[2016-07-04 16:27:13,087 D pgrestart] calling pool.getconn()
[2016-07-04 16:27:13,088 E pgrestart] error from pool.getconn()
Traceback (most recent call last):
  File "pgrestart.py", line 32, in test
    conn = yield pool.getconn()
  File "/var/lib/venv/renesys/local/lib/python2.7/site-packages/tornado/gen.py", line 1008, in run
    value = future.result()
  File "/var/lib/venv/renesys/local/lib/python2.7/site-packages/tornado/concurrent.py", line 232, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 3, in raise_exc_info
DatabaseNotAvailable: No database connection available
[2016-07-04 16:27:14,089 D pgrestart] calling pool.getconn()
[2016-07-04 16:27:14,090 E tornado.application] Exception in callback <functools.partial object at 0x7f020a8fa890>
Traceback (most recent call last):
  File "/var/lib/venv/renesys/local/lib/python2.7/site-packages/tornado/ioloop.py", line 600, in _run_callback
    ret = callback()
  File "/var/lib/venv/renesys/local/lib/python2.7/site-packages/tornado/stack_context.py", line 275, in null_wrapper
    return fn(*args, **kwargs)
  File "/home/data/src/momoko/momoko/connection.py", line 579, in on_connection_available
    conn = fut.result()
  File "/var/lib/venv/renesys/local/lib/python2.7/site-packages/tornado/concurrent.py", line 232, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 3, in raise_exc_info
DatabaseNotAvailable: No database connection available

...and that's it. My app is dead, frozen, hung. Nothing happens until I kill it and restart it.

Note that I can greatly increase the frequency of hangs by decreasing the sleep after "error from pool.getconn()". E.g. with the 1-second sleep shown above, roughly 1 in 10 restarts results in a hung process. But if I reduce the sleep to 0.1 sec:

--- a/pgrestart.py
+++ b/pgrestart.py
@@ -32,7 +32,7 @@ def test(dsn, io_loop):
             conn = yield pool.getconn()
         except psycopg2.DatabaseError:
             log.exception('error from pool.getconn()')
-            time.sleep(1)
+            time.sleep(0.1)
             continue

         log.debug('querying conn %r', conn)

then my process hangs almost every single time. It's dead easy to reproduce.

gward commented 8 years ago

And here is the output of running pgrestart.py with momoko debugging cranked up:

[2016-07-04 16:31:47,029 D pgrestart] calling pool.getconn()
[2016-07-04 16:31:47,029 D momoko] Acquired free connection 8
[2016-07-04 16:31:47,030 D momoko] Obtained connection: 8
[2016-07-04 16:31:47,031 D pgrestart] querying conn <momoko.connection.Connection object at 0x7fe91745a7d0>
[2016-07-04 16:31:47,032 I pgrestart] pid: 29784
[2016-07-04 16:31:47,032 D momoko] About to release connection 8
[2016-07-04 16:31:47,033 D momoko] The connection is alive
[2016-07-04 16:31:47,033 D momoko] Handling free connection 8
[2016-07-04 16:31:47,033 D momoko] No outstanding requests - adding to free pool
[2016-07-04 16:31:47,611 D pgrestart] calling pool.getconn()
[2016-07-04 16:31:47,611 D momoko] Acquired free connection 8
[2016-07-04 16:31:47,611 D momoko] Obtained connection: 8
[2016-07-04 16:31:47,612 D momoko] Method failed Asynchronously
[2016-07-04 16:31:47,612 D momoko] About to release connection 8
[2016-07-04 16:31:47,612 D momoko] The connection is dead
[2016-07-04 16:31:47,612 D momoko] All connections are dead - aborting waiting queue
[2016-07-04 16:31:47,612 E pgrestart] error from pool.getconn()
Traceback (most recent call last):
  File "pgrestart.py", line 32, in test
    conn = yield pool.getconn()
  File "/var/lib/venv/renesys/local/lib/python2.7/site-packages/tornado/gen.py", line 1008, in run
    value = future.result()
  File "/var/lib/venv/renesys/local/lib/python2.7/site-packages/tornado/concurrent.py", line 232, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 3, in raise_exc_info
DatabaseNotAvailable: No database connection available
[2016-07-04 16:31:47,713 D pgrestart] calling pool.getconn()
[2016-07-04 16:31:47,713 D momoko] All connections are dead
[2016-07-04 16:31:47,715 D momoko] Adding dead connection
[2016-07-04 16:31:47,716 E tornado.application] Exception in callback <functools.partial object at 0x7fe916be8520>
Traceback (most recent call last):
  File "/var/lib/venv/renesys/local/lib/python2.7/site-packages/tornado/ioloop.py", line 600, in _run_callback
    ret = callback()
  File "/var/lib/venv/renesys/local/lib/python2.7/site-packages/tornado/stack_context.py", line 275, in null_wrapper
    return fn(*args, **kwargs)
  File "/home/data/src/momoko/momoko/connection.py", line 579, in on_connection_available
    conn = fut.result()
  File "/var/lib/venv/renesys/local/lib/python2.7/site-packages/tornado/concurrent.py", line 232, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 3, in raise_exc_info
DatabaseNotAvailable: No database connection available
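
Reading that traceback, my guess (and it is only a guess at the shape of the code; the names below are hypothetical, not momoko's actual internals) is that getconn() hands the caller a Future and kicks off a reconnect, and when the reconnect fails, the DatabaseNotAvailable raised by fut.result() inside on_connection_available escapes into the IOLoop instead of being handed to the caller's Future - so my coroutine is never woken up. The usual Tornado pattern for such a callback would be something like:

import sys

def make_on_connection_available(waiter):
    # `waiter` stands in for the Future that pool.getconn() returned to the
    # caller; purely illustrative, not momoko's actual code.
    def on_connection_available(fut):
        try:
            conn = fut.result()  # raises DatabaseNotAvailable if the reconnect failed
        except Exception:
            waiter.set_exc_info(sys.exc_info())  # fail the caller instead of hanging it
        else:
            waiter.set_result(conn)
    return on_connection_available
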
gward commented 8 years ago

Oh yeah, I'm seeing this on:

haizaar commented 8 years ago

You are right on both counts. Sorry for the late response - really busy at work, plus a 10-month-old daughter :)

The reconnect logic was only ever meant to revive idle connections that were silently dropped by the server - for example, if you run a lazy query once an hour and the server was restarted somewhere in that window. In my experience, implementing general retry logic at such a deep level is muddy water, which is why I leave it to the application developer.

Regarding the hangs - this is indeed a bug. It turns out there is an even simpler reproduction: instead of restarting, just shut down the server and your script will hang.
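
For example, with your pgrestart.py looping in one terminal, just run this in another and it freezes at the next getconn():

$ sudo service postgresql stop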

I've fixed this in master. Can you please have a look?

haizaar commented 8 years ago

And thanks for the thorough reproduction (as usual :)