FSX / momoko

Wraps (asynchronous) Psycopg2 for Tornado.
http://momoko.61924.nl/

Restarting PostgreSQL sometimes causes Pool.getconn() to hang #147

Closed: gward closed this issue 8 years ago

gward commented 8 years ago

Imagine a simple application that runs an infinite loop, periodically querying a database via momoko:

@tornado.gen.coroutine
def run(dsn, io_loop):
    pool = momoko.Pool(dsn=dsn, size=1, ioloop=io_loop)
    yield pool.connect()
    while True:
        conn = yield pool.getconn()
        cur = yield conn.execute('select pg_backend_pid()')
        pid = cur.fetchall()[0][0]
        log.info('pid: %r', pid)
        cur.close()
        pool.putconn(conn)
        time.sleep(random.uniform(0.2, 0.8))

As long as I never restart my PostgreSQL server, this works perfectly. But when I restart PG, bad stuff happens. Specifically:

- most of the time, the next pool.getconn() fails with DatabaseNotAvailable, and the loop recovers on the following iteration with a new backend
- occasionally, pool.getconn() never returns at all: the coroutine hangs and the process is stuck until I kill it

The latter bug is what I'm really complaining about. If momoko's position is that PG restarts are an application problem rather than a connection pool problem, and it's my job to retry getconn() to handle them: not ideal, but I can live with it.

But if momoko occasionally freezes up when PG restarts, that's a more serious problem. IMHO that needs to be fixed.

gward commented 8 years ago

Oops, forgot to show my complete reproduction script:

$ cat pgrestart.py
#!/usr/bin/python

# What happens when momoko is connected to a database and the postgresql server is
# restarted? momoko should recover, but doesn't.

from __future__ import print_function

import sys
import logging
import random
import time

import psycopg2

from tornado import gen, ioloop
import momoko

log = logging.getLogger('pgrestart')

@gen.coroutine
def test(dsn, io_loop):
    log.info('creating pool: %r', dsn)
    pool = momoko.Pool(dsn=dsn, size=1, ioloop=io_loop)
    yield pool.connect()
    log.info('pool: %r', pool)

    while True:
        log.debug('calling pool.getconn()')
        try:
            conn = yield pool.getconn()
        except psycopg2.DatabaseError:
            log.exception('error from pool.getconn()')
            time.sleep(1)
            continue

        log.debug('querying conn %r', conn)
        cur = yield conn.execute('select pg_backend_pid()')
        pid = cur.fetchall()[0][0]
        log.info('pid: %r', pid)
        cur.close()
        pool.putconn(conn)

        time.sleep(random.uniform(0.2, 0.8))

    pool.close()

def main():
    dsn = sys.argv[1]

    logging.basicConfig(format='[%(asctime)s %(levelname)-1.1s %(name)s] %(message)s',
                        level=logging.DEBUG)
    logging.getLogger("momoko").setLevel(logging.INFO)

    io_loop = ioloop.IOLoop.instance()
    future = test(dsn, io_loop)
    io_loop.add_future(future, lambda x: io_loop.stop())
    io_loop.start()

main()

If I run this normally, the output is very boring:

$ python pgrestart.py 'host=localhost dbname=psycopg2_test'
[2016-07-04 16:25:46,807 I pgrestart] creating pool: 'host=localhost dbname=psycopg2_test'
[2016-07-04 16:25:46,819 I pgrestart] pool: <momoko.connection.Pool object at 0x7f55c52ffc50>
[2016-07-04 16:25:46,819 D pgrestart] calling pool.getconn()
[2016-07-04 16:25:46,820 D pgrestart] querying conn <momoko.connection.Connection object at 0x7f55bdec27d0>
[2016-07-04 16:25:46,821 I pgrestart] pid: 29496
[2016-07-04 16:25:47,528 D pgrestart] calling pool.getconn()
[2016-07-04 16:25:47,530 D pgrestart] querying conn <momoko.connection.Connection object at 0x7f55bdec27d0>
[2016-07-04 16:25:47,531 I pgrestart] pid: 29496
[...repeat forever...]

But if I start it running and then, in another terminal, run sudo service postgresql restart, here's what I get almost every time:

[2016-07-04 16:26:52,676 D pgrestart] calling pool.getconn()
[2016-07-04 16:26:52,682 D pgrestart] querying conn <momoko.connection.Connection object at 0x7f020b16a7d0>
[2016-07-04 16:26:52,683 I pgrestart] pid: 29510
[2016-07-04 16:26:53,445 D pgrestart] calling pool.getconn()
[2016-07-04 16:26:53,448 E pgrestart] error from pool.getconn()
Traceback (most recent call last):
  File "pgrestart.py", line 32, in test
    conn = yield pool.getconn()
  File "/var/lib/venv/renesys/local/lib/python2.7/site-packages/tornado/gen.py", line 1008, in run
    value = future.result()
  File "/var/lib/venv/renesys/local/lib/python2.7/site-packages/tornado/concurrent.py", line 232, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 3, in raise_exc_info
DatabaseNotAvailable: No database connection available
[2016-07-04 16:26:54,450 D pgrestart] calling pool.getconn()
[2016-07-04 16:26:54,470 D pgrestart] querying conn <momoko.connection.Connection object at 0x7f020b16a7d0>
[2016-07-04 16:26:54,471 I pgrestart] pid: 29543

That is, one query fails while PG is down, and then we're back in business with a new backend. I'd love it if momoko hid that and did the necessary retries for me, but that's not the serious problem.
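
For what it's worth, if retrying really is my job, here is roughly the helper I'd end up wrapping around getconn(). Just a sketch - getconn_with_retry is my own name, not a momoko API, and it uses gen.sleep instead of the blocking time.sleep my script does:

import logging

import psycopg2
from tornado import gen

log = logging.getLogger('pgrestart')

@gen.coroutine
def getconn_with_retry(pool, attempts=5, delay=1.0):
    # Hypothetical application-level helper: retry pool.getconn() a few times
    # so a PG restart costs a short pause instead of an error.
    for attempt in range(1, attempts + 1):
        try:
            conn = yield pool.getconn()
        except psycopg2.DatabaseError:
            log.exception('getconn() failed (attempt %d/%d)', attempt, attempts)
            yield gen.sleep(delay)  # non-blocking, unlike time.sleep()
        else:
            raise gen.Return(conn)
    raise psycopg2.DatabaseError('no connection after %d attempts' % attempts)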

Here is the serious problem, which I see occasionally:

[2016-07-04 16:27:12,581 D pgrestart] calling pool.getconn()
[2016-07-04 16:27:12,583 D pgrestart] querying conn <momoko.connection.Connection object at 0x7f020b16a7d0>
[2016-07-04 16:27:12,584 I pgrestart] pid: 29634
[2016-07-04 16:27:13,087 D pgrestart] calling pool.getconn()
[2016-07-04 16:27:13,088 E pgrestart] error from pool.getconn()
Traceback (most recent call last):
  File "pgrestart.py", line 32, in test
    conn = yield pool.getconn()
  File "/var/lib/venv/renesys/local/lib/python2.7/site-packages/tornado/gen.py", line 1008, in run
    value = future.result()
  File "/var/lib/venv/renesys/local/lib/python2.7/site-packages/tornado/concurrent.py", line 232, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 3, in raise_exc_info
DatabaseNotAvailable: No database connection available
[2016-07-04 16:27:14,089 D pgrestart] calling pool.getconn()
[2016-07-04 16:27:14,090 E tornado.application] Exception in callback <functools.partial object at 0x7f020a8fa890>
Traceback (most recent call last):
  File "/var/lib/venv/renesys/local/lib/python2.7/site-packages/tornado/ioloop.py", line 600, in _run_callback
    ret = callback()
  File "/var/lib/venv/renesys/local/lib/python2.7/site-packages/tornado/stack_context.py", line 275, in null_wrapper
    return fn(*args, **kwargs)
  File "/home/data/src/momoko/momoko/connection.py", line 579, in on_connection_available
    conn = fut.result()
  File "/var/lib/venv/renesys/local/lib/python2.7/site-packages/tornado/concurrent.py", line 232, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 3, in raise_exc_info
DatabaseNotAvailable: No database connection available

...and that's it. My app is dead, frozen, hung. Nothing happens until I kill it and restart it.

Note that I can greatly increase the frequency of hangs by decreasing the sleep after "error from pool.getconn()". E.g. with the 1-second sleep shown above, roughly 1 in 10 restarts results in a hung process. But if I reduce the sleep to 0.1 sec:

--- a/pgrestart.py
+++ b/pgrestart.py
@@ -32,7 +32,7 @@ def test(dsn, io_loop):
             conn = yield pool.getconn()
         except psycopg2.DatabaseError:
             log.exception('error from pool.getconn()')
-            time.sleep(1)
+            time.sleep(0.1)
             continue

         log.debug('querying conn %r', conn)

then my process hangs almost every single time. It's dead easy to reproduce.

gward commented 8 years ago

And here is the output of running pgrestart.py with momoko debugging cranked up:

[2016-07-04 16:31:47,029 D pgrestart] calling pool.getconn()
[2016-07-04 16:31:47,029 D momoko] Acquired free connection 8
[2016-07-04 16:31:47,030 D momoko] Obtained connection: 8
[2016-07-04 16:31:47,031 D pgrestart] querying conn <momoko.connection.Connection object at 0x7fe91745a7d0>
[2016-07-04 16:31:47,032 I pgrestart] pid: 29784
[2016-07-04 16:31:47,032 D momoko] About to release connection 8
[2016-07-04 16:31:47,033 D momoko] The connection is alive
[2016-07-04 16:31:47,033 D momoko] Handling free connection 8
[2016-07-04 16:31:47,033 D momoko] No outstanding requests - adding to free pool
[2016-07-04 16:31:47,611 D pgrestart] calling pool.getconn()
[2016-07-04 16:31:47,611 D momoko] Acquired free connection 8
[2016-07-04 16:31:47,611 D momoko] Obtained connection: 8
[2016-07-04 16:31:47,612 D momoko] Method failed Asynchronously
[2016-07-04 16:31:47,612 D momoko] About to release connection 8
[2016-07-04 16:31:47,612 D momoko] The connection is dead
[2016-07-04 16:31:47,612 D momoko] All connections are dead - aborting waiting queue
[2016-07-04 16:31:47,612 E pgrestart] error from pool.getconn()
Traceback (most recent call last):
  File "pgrestart.py", line 32, in test
    conn = yield pool.getconn()
  File "/var/lib/venv/renesys/local/lib/python2.7/site-packages/tornado/gen.py", line 1008, in run
    value = future.result()
  File "/var/lib/venv/renesys/local/lib/python2.7/site-packages/tornado/concurrent.py", line 232, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 3, in raise_exc_info
DatabaseNotAvailable: No database connection available
[2016-07-04 16:31:47,713 D pgrestart] calling pool.getconn()
[2016-07-04 16:31:47,713 D momoko] All connections are dead
[2016-07-04 16:31:47,715 D momoko] Adding dead connection
[2016-07-04 16:31:47,716 E tornado.application] Exception in callback <functools.partial object at 0x7fe916be8520>
Traceback (most recent call last):
  File "/var/lib/venv/renesys/local/lib/python2.7/site-packages/tornado/ioloop.py", line 600, in _run_callback
    ret = callback()
  File "/var/lib/venv/renesys/local/lib/python2.7/site-packages/tornado/stack_context.py", line 275, in null_wrapper
    return fn(*args, **kwargs)
  File "/home/data/src/momoko/momoko/connection.py", line 579, in on_connection_available
    conn = fut.result()
  File "/var/lib/venv/renesys/local/lib/python2.7/site-packages/tornado/concurrent.py", line 232, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 3, in raise_exc_info
DatabaseNotAvailable: No database connection available
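
Reading that traceback, my guess (and it is only a guess at the shape of the code; the names below are hypothetical, not momoko's actual internals) is that getconn() hands the caller a Future and kicks off a reconnect, and when the reconnect fails, the DatabaseNotAvailable raised by fut.result() inside on_connection_available escapes into the IOLoop instead of being handed to the caller's Future - so my coroutine is never woken up. The usual Tornado pattern for such a callback would be something like:

import sys

def make_on_connection_available(waiter):
    # `waiter` stands in for the Future that pool.getconn() returned to the
    # caller; purely illustrative, not momoko's actual code.
    def on_connection_available(fut):
        try:
            conn = fut.result()  # raises DatabaseNotAvailable if the reconnect failed
        except Exception:
            waiter.set_exc_info(sys.exc_info())  # fail the caller instead of hanging it
        else:
            waiter.set_result(conn)
    return on_connection_available
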
gward commented 8 years ago

Oh yeah, I'm seeing this on:

haizaar commented 8 years ago

You are right on both counts. Sorry for the late response - really busy at work, plus a 10-month-old daughter :)

The reconnect logic was only ever meant to revive idle connections that were silently dropped by the server - for example, if you run a lazy query once an hour and the server was restarted somewhere in that window. In my experience, implementing general retry logic at such a deep level is muddy water, which is why I leave it to the application developer.

Regarding the hangs - this is indeed a bug. It turns out there is an even simpler reproduction: instead of restarting, just shut down the server and your script will hang.
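
For example, with your pgrestart.py looping in one terminal, just run this in another and it freezes at the next getconn():

$ sudo service postgresql stop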

I've fixed this in master. Can you please have a look?

haizaar commented 8 years ago

And thanks for the thorough reproduction (as usual :)