FSX / momoko

Wraps (asynchronous) Psycopg2 for Tornado.
http://momoko.61924.nl/
Other
362 stars 73 forks source link

Pool fails to reconnect when using Unix-domain socket (OperationalError: server closed the connection unexpectedly) #145

Closed gward closed 8 years ago

gward commented 8 years ago

If I use momoko.Pool to connect over TCP, it handles database disconnects and restarts perfectly. But if I connect over a Unix-domain socket, it does not. The first request after a disconnect crashes with "OperationalError: server closed the connection unexpectedly".

Here's an example script:

$ cat lostconn.py
#!/usr/bin/python

from __future__ import print_function

import sys
import subprocess

from tornado import gen, ioloop
import momoko

# what happens when momoko is connected to a database and the connection is
# killed? should recover, but doesn't if connection is over unix-domain
# socket

@gen.coroutine
def test(dsn, io_loop):
    pool = momoko.Pool(dsn=dsn, size=1, ioloop=io_loop)
    yield pool.connect()
    print('pool:', pool)

    cur = yield pool.execute('select pg_backend_pid()')
    pid = cur.fetchall()[0][0]
    print('pid 1:', pid)
    cur.close()

    sql = 'select pg_terminate_backend(%s)' % pid
    cmd = ['psql', '-q', '-o/dev/null', 'momoko', '-c', sql]
    print('running:', subprocess.list2cmdline(cmd))
    subprocess.check_call(cmd)

    cur = yield pool.execute('select pg_backend_pid()')
    print('pid 2:', cur.fetchall()[0][0])
    cur.close()

    pool.close()

def main():
    dsn = sys.argv[1]

    io_loop = ioloop.IOLoop.instance()
    future = test(dsn, io_loop)
    io_loop.add_future(future, lambda x: io_loop.stop())
    io_loop.start()

main()

I'm testing with:

I've set things up with a momoko database that I can login to without authentication, either via TCP or Unix-domain socket:

$ psql -h localhost momoko -c 'select pg_backend_pid()'
 pg_backend_pid 
----------------
          15413
$ psql -h /var/run/postgresql momoko -c 'select pg_backend_pid()' 
 pg_backend_pid 
----------------
          15428

First, let's run the test script in "happy" mode, demonstrating that momoko successfully reconnects when using a TCP socket:

$ ./lostconn.py 'host=localhost dbname=momoko'                          
pool: <momoko.connection.Pool object at 0x7f6c42633550>
pid 1: 15434
running: psql -q -o/dev/null momoko -c "select pg_terminate_backend(15434)"
pid 2: 15447

Good!

Now try the same thing via Unix-domain socket:

$ ./lostconn.py 'host=/var/run/postgresql dbname=momoko sslmode=disable'
pool: <momoko.connection.Pool object at 0x7f367e723550>
pid 1: 15470
running: psql -q -o/dev/null momoko -c "select pg_terminate_backend(15470)"
ERROR:tornado.application:Future exception was never retrieved: Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/tornado/gen.py", line 876, in run
    yielded = self.gen.throw(*exc_info)
  File "./lostconn.py", line 32, in test
    cur = yield pool.execute('select pg_backend_pid()')
  File "/usr/lib/python2.7/dist-packages/tornado/gen.py", line 870, in run
    value = future.result()
  File "/usr/lib/python2.7/dist-packages/tornado/concurrent.py", line 215, in result
    raise_exc_info(self._exc_info)
  File "/home/data/src/momoko/momoko/connection.py", line 462, in when_available
    future_or_result = method(conn, *args, **kwargs)
  File "/home/data/src/momoko/momoko/connection.py", line 761, in execute
    cursor.execute(operation)
OperationalError: server closed the connection unexpectedly
    This probably means the server terminated abnormally
    before or while processing the request.

I've seen the same behaviour with v2.2.1, v2.2.2, v2.2.3, and current git master.

Of possible interest: the failure isn't absolutely reliable. I did a couple of trials of 100 runs each, and I get the above failure ~70% of the time, but success ~30% of the time. Tried it with both pool size=1 and size=5: similar result.

gward commented 8 years ago

Oh yeah: this looks similar to #129, but I have a smaller repro script. Unclear if #129 turned out to be a bug in psycopg2 or momoko. I'll dig a little deeper tomorrow and see if I can figure that out.

gward commented 8 years ago

Oh yeah again: you can't see it in my little repro script, but this bug puts Pool in a bad state. Eg. in a long-running webapp, no database queries work after the disconnect. Everything fails immediately with that OperationalError, and momoko never recovers. Only workaround appears to be restarting the app.

haizaar commented 8 years ago

Can you please rerun your example with debug enabled -logging.getLogger("momoko").setLevel(logging.DEBUG) - and post the output of the failing section?

gward commented 8 years ago
$ ./lostconn.py 'host=/var/run/postgresql dbname=momoko'
DEBUG:momoko:Opening new connection
DEBUG:momoko:Handling free connection 7
DEBUG:momoko:No outstanding requests - adding to free pool
DEBUG:momoko:All initial connection requests complete
pool: <momoko.connection.Pool object at 0x7f2f1c72c0d0>
DEBUG:momoko:Acquired free connection 7
DEBUG:momoko:Obtained connection: 7
DEBUG:momoko:About to release connection 7
DEBUG:momoko:The connection is alive
DEBUG:momoko:Handling free connection 7
DEBUG:momoko:No outstanding requests - adding to free pool
pid 1: 21048
running: psql -q -o/dev/null momoko -c "select pg_terminate_backend(21048)"
DEBUG:momoko:Acquired free connection 7
DEBUG:momoko:Obtained connection: 7
DEBUG:momoko:Method failed synchronously
DEBUG:momoko:About to release connection 7
DEBUG:momoko:The connection is alive
DEBUG:momoko:Handling free connection 7
DEBUG:momoko:No outstanding requests - adding to free pool
ERROR:tornado.application:Future exception was never retrieved: Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/tornado/gen.py", line 876, in run
    yielded = self.gen.throw(*exc_info)
  File "./lostconn.py", line 33, in test
    cur = yield pool.execute('select pg_backend_pid()')
  File "/usr/lib/python2.7/dist-packages/tornado/gen.py", line 870, in run
    value = future.result()
  File "/usr/lib/python2.7/dist-packages/tornado/concurrent.py", line 215, in result
    raise_exc_info(self._exc_info)
  File "/home/data/src/momoko/momoko/connection.py", line 462, in when_available
    future_or_result = method(conn, *args, **kwargs)
  File "/home/data/src/momoko/momoko/connection.py", line 761, in execute
    cursor.execute(operation)
OperationalError: server closed the connection unexpectedly
    This probably means the server terminated abnormally
    before or while processing the request.
haizaar commented 8 years ago

Another check please - can you change momoko's code to log the value of conn.closed here?

It should be True, otherwise it's a bug in psycopg2 (because if momoko gets error, and connection is alive, then I assume it's a ProgrammingError, or something else that is not appropriate to retry on Momoko's level).

gward commented 8 years ago

conn.closed is false in _retry().

$ git diff
diff --git a/momoko/connection.py b/momoko/connection.py
index bcee634..116b57c 100644
--- a/momoko/connection.py
+++ b/momoko/connection.py
@@ -492,6 +492,7 @@ class Pool(object):
         return future

     def _retry(self, retry, what, conn, keep, future):
+        log.debug('Connection._retry(): conn.closed = %r (retry = %r, what = %r, conn = %r, keep = %r, future = %r)', conn.closed, retry, what, conn, keep, future)
         if conn.closed:
             if not retry:
                 retry.append(conn)

$ ./lostconn.py 'host=/var/run/postgresql dbname=momoko'
DEBUG:momoko:Opening new connection
DEBUG:momoko:Handling free connection 7
DEBUG:momoko:No outstanding requests - adding to free pool
DEBUG:momoko:All initial connection requests complete
pool: <momoko.connection.Pool object at 0x7f700a338c10>
DEBUG:momoko:Acquired free connection 7
DEBUG:momoko:Obtained connection: 7
DEBUG:momoko:About to release connection 7
DEBUG:momoko:The connection is alive
DEBUG:momoko:Handling free connection 7
DEBUG:momoko:No outstanding requests - adding to free pool
pid 1: 21709
running: psql -q -o/dev/null momoko -c "select pg_terminate_backend(21709)"
DEBUG:momoko:Acquired free connection 7
DEBUG:momoko:Obtained connection: 7
DEBUG:momoko:Method failed synchronously
DEBUG:momoko:Connection._retry(): conn.closed = False (retry = [], what = <function when_available at 0x7f700a343578>, conn = <momoko.connection.Connection object at 0x7f700a338d10>, keep = False, future = <tornado.concurrent.Future object at 0x7f700f0e0cd0>)
DEBUG:momoko:About to release connection 7
DEBUG:momoko:The connection is alive
DEBUG:momoko:Handling free connection 7
DEBUG:momoko:No outstanding requests - adding to free pool
ERROR:tornado.application:Future exception was never retrieved: Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/tornado/gen.py", line 876, in run
    yielded = self.gen.throw(*exc_info)
  File "./lostconn.py", line 33, in test
    cur = yield pool.execute('select pg_backend_pid()')
  File "/usr/lib/python2.7/dist-packages/tornado/gen.py", line 870, in run
    value = future.result()
  File "/usr/lib/python2.7/dist-packages/tornado/concurrent.py", line 215, in result
    raise_exc_info(self._exc_info)
  File "/home/data/src/momoko/momoko/connection.py", line 462, in when_available
    future_or_result = method(conn, *args, **kwargs)
  File "/home/data/src/momoko/momoko/connection.py", line 762, in execute
    cursor.execute(operation)
OperationalError: server closed the connection unexpectedly
    This probably means the server terminated abnormally
    before or while processing the request.

So, to clarify, this is a psycopg2 bug? What exactly is the bug? Reporting errors differently over TCP socket and Unix-domain socket?

haizaar commented 8 years ago

Yes, exactly. The exception says that "server closed the connection...", but the connection object states that it's not closed. And this happens only sometimes and only over Unix-domain sockets.

I remember back in early days of Momoko, the .closed properly did not work on Unix sockets at all in async mode. I reported a bug on that. It was fixed but not fully, as we can see now. You are welcome to report new bug/reopen the existing one to psycopg2. If you do, I'll appreciate if you reference it here.

I'm closing the issue since there is nothing really I can do on the subject rather to advise you to use TCP.

Feel free to disagree/reopen.

gward commented 8 years ago

I can't reproduce a problem with psycopg2. However, its connection object's closed attr is set to 2. Is momoko doing a closed == 1 somewhere, and incorrectly turning psycopg2 conn.closed == 2 to momoko conn.closed == False?

Anyways, here is my psycopg2 test script:

$ cat lostconn.py
#!/usr/bin/python

from __future__ import print_function

import sys

import logging
import select

import psycopg2
import psycopg2.extras

log = logging.getLogger()

print(psycopg2)
print(psycopg2.__version__)

def main():
    logging.basicConfig(level=logging.DEBUG)

    dsn = sys.argv[1]
    conn = psycopg2.connect(dsn, async=True)
    print('conn:', conn) #, 'fd:', conn.fileno())
    psycopg2.extras.wait_select(conn)

    with conn.cursor() as cur:
        cur.execute('select pg_backend_pid()')
        psycopg2.extras.wait_select(conn)
        pid = cur.fetchall()[0][0]
        print('pid:', pid)

        cur.execute('select pg_terminate_backend(pg_backend_pid())')
        try:
            psycopg2.extras.wait_select(conn)
            cur.fetchall()
        except psycopg2.DatabaseError as err:
            log.info('as expected, caught error reading query result: %s %s', type(err), err)
        print('self-terminated connection:', conn)
        assert conn.closed, 'closed: %r' % (conn.closed,)

main()

I'm using the exact same setup as described before: specifically, psycopg2 is the ubuntu package of 2.6.1. Output using TCP socket:

./lostconn.py 'host=localhost dbname=momoko sslmode=disable'
<module 'psycopg2' from '/usr/lib/python2.7/dist-packages/psycopg2/__init__.pyc'>
2.6.1 (dt dec mx pq3 ext lo64)
conn: <connection object at 0x7fcd11cee280; dsn: 'host=localhost dbname=momoko sslmode=disable', closed: 0>
pid: 23610
INFO:root:as expected, caught error reading query result: <class 'psycopg2.DatabaseError'> server closed the connection unexpectedly
    This probably means the server terminated abnormally
    before or while processing the request.

self-terminated connection: <connection object at 0x7fcd11cee280; dsn: 'host=localhost dbname=momoko sslmode=disable', closed: 2>

Output using Unix-domain socket:

$ ./lostconn.py 'host=/var/run/postgresql dbname=momoko'      
<module 'psycopg2' from '/usr/lib/python2.7/dist-packages/psycopg2/__init__.pyc'>
2.6.1 (dt dec mx pq3 ext lo64)
conn: <connection object at 0x7f1ce8fa5280; dsn: 'host=/var/run/postgresql dbname=momoko', closed: 0>
pid: 23615
INFO:root:as expected, caught error reading query result: <class 'psycopg2.DatabaseError'> server closed the connection unexpectedly
    This probably means the server terminated abnormally
    before or while processing the request.

self-terminated connection: <connection object at 0x7f1ce8fa5280; dsn: 'host=/var/run/postgresql dbname=momoko', closed: 2>
haizaar commented 8 years ago

Here is the .closed property logic:

@property
def closed(self):
    """
    Indicates whether the connection is closed or not.
    """
    # 0 = open, 1 = closed, 2 = 'something horrible happened'
    return self.connection.closed > 0 if self.connection else True

The scenario you are reproducing with pure psycopg2 is different. In your original test you kill server using shell, and then your code fails syncrhonously on cur.execute, without entering event loop at all. In your pure psycopg2 however, server goes away while you are waiting in the event loop, so this is a different scenario.

Can you update your latter test with killing the server the same way you did before and check again? - If I'm right, your code should raise even before entering try/except clause.

gward commented 8 years ago

The scenario you are reproducing with pure psycopg2 is different. In your original test you kill server using shell [...] Can you update your latter test with killing the server the same way you did before

Oops! My bad. I tried it... same result.

Here is my revised psycopg2 test script:

#!/usr/bin/python

from __future__ import print_function

import sys
import logging
import select
import subprocess

import psycopg2
import psycopg2.extras

log = logging.getLogger()

print(psycopg2)
print(psycopg2.__version__)

def main():
    logging.basicConfig(level=logging.DEBUG)

    dsn = sys.argv[1]
    conn = psycopg2.connect(dsn, async=True)
    print('conn:', conn) #, 'fd:', conn.fileno())
    psycopg2.extras.wait_select(conn)

    with conn.cursor() as cur:
        cur.execute('select pg_backend_pid()')
        psycopg2.extras.wait_select(conn)
        pid = cur.fetchall()[0][0]
        print('pid 1:', pid)

        sql = 'select pg_terminate_backend(%s)' % pid
        cmd = ['psql', '-q', '-o/dev/null', 'momoko', '-c', sql]
        print('running:', subprocess.list2cmdline(cmd))
        subprocess.check_call(cmd)

        cur.execute('select pg_backend_pid()')
        try:
            psycopg2.extras.wait_select(conn)
            pid = cur.fetchall()[0][0]
            print('pid 2:', pid)
        except psycopg2.DatabaseError as err:
            log.info('as expected, caught error reading query result: %s %s', type(err), err)

        print('terminated connection:', conn)
        assert conn.closed, 'closed: %r' % (conn.closed,)

main()

Output using TCP socket:

./lostconn.py 'host=localhost dbname=momoko sslmode=disable'
<module 'psycopg2' from '/usr/lib/python2.7/dist-packages/psycopg2/__init__.pyc'>
2.6.1 (dt dec mx pq3 ext lo64)
conn: <connection object at 0x7f1f9205eb40; dsn: 'host=localhost dbname=momoko sslmode=disable', closed: 0>
pid 1: 24629
running: psql -q -o/dev/null momoko -c "select pg_terminate_backend(24629)"
INFO:root:as expected, caught error reading query result: <class 'psycopg2.DatabaseError'> server closed the connection unexpectedly
    This probably means the server terminated abnormally
    before or while processing the request.

terminated connection: <connection object at 0x7f1f9205eb40; dsn: 'host=localhost dbname=momoko sslmode=disable', closed: 2>

Output using Unix-domain socket:

./lostconn.py 'host=/var/run/postgresql dbname=momoko'      
<module 'psycopg2' from '/usr/lib/python2.7/dist-packages/psycopg2/__init__.pyc'>
2.6.1 (dt dec mx pq3 ext lo64)
conn: <connection object at 0x7f92308dbb40; dsn: 'host=/var/run/postgresql dbname=momoko', closed: 0>
pid 1: 24647
running: psql -q -o/dev/null momoko -c "select pg_terminate_backend(24647)"
INFO:root:as expected, caught error reading query result: <class 'psycopg2.DatabaseError'> server closed the connection unexpectedly
    This probably means the server terminated abnormally
    before or while processing the request.

terminated connection: <connection object at 0x7f92308dbb40; dsn: 'host=/var/run/postgresql dbname=momoko', closed: 2>
haizaar commented 8 years ago

There must me some difference between the way mokoko example works and psycopg2 stand-alone. Because in mokoko output there is "Method failed synchronously" message, meaning that execute call raised even before entering to event loop. I.e. the scenario is not exactly the same.

So far the difference I see is that synthetic example reuses the same cursor object through context manager, while Momoko creates new cursor object for every execute call as written here.

Can you please change your sample to obtain new cursor object before each execute call?

gward commented 8 years ago

Can you please change your sample to obtain new cursor object before each execute call?

Sure. I just tried it, and it made no difference: the psycopg2 connection object's closed attr is 2 every time.

I'm going to take a different approach: start with my momoko example, copy the necessary bits of momoko into my example script, and then rip out as much code as I can while still seeing the incorrect closed attr.

gward commented 8 years ago

BTW, I can't reopen this bug. Would appreciate if you could reopen until I can clearly point to a problem with psycopg2.

gward commented 8 years ago

For the record, here is a repro script that uses momoko but focuses on the closed attribute. It clearly demonstrates inconsistent results coming from psycopg2.

$ cat badclosed.py
#!/usr/bin/python

from __future__ import print_function

import sys
import logging
import subprocess

import psycopg2
from tornado import gen, ioloop
import momoko

# if a connection is closed externally, its 'closed' attr should always be true
# (because psycopg2 should always set its 'closed' attr to non-zero)

@gen.coroutine
def test(dsn):
    conn = momoko.Connection(dsn=dsn)
    yield conn.connect()
    print('psycopg2 conn:', conn.connection)

    cur = yield conn.execute('select pg_backend_pid()')
    pid = cur.fetchall()[0][0]
    print('pid 1:', pid)
    cur.close()

    sql = 'select pg_terminate_backend(%s)' % pid
    cmd = ['psql', '-q', '-o/dev/null', 'momoko', '-c', sql]
    print('running:', subprocess.list2cmdline(cmd))
    subprocess.check_call(cmd)

    print('before trying another query: psycopg2 conn:', conn.connection)
    try:
        cur = yield conn.execute('select pg_backend_pid()')
    except psycopg2.DatabaseError as err:
        print('as expected, got %s: %s' % (err.__class__.__name__, err))
    print('after trying another query: psycopg2 conn: ', conn.connection)
    assert conn.connection.closed == 2, conn.connection.closed
    assert conn.closed, conn.closed

def main():
    dsn = sys.argv[1]

    logging.basicConfig()
    logging.getLogger("momoko").setLevel(logging.DEBUG)

    io_loop = ioloop.IOLoop.instance()
    future = test(dsn)
    io_loop.add_future(future, lambda x: io_loop.stop())
    io_loop.start()

main()

Run it once on a Unix-domain socket:

$ ./badclosed.py 'host=/var/run/postgresql dbname=momoko'
psycopg2 conn: <connection object at 0x7fce7b069e88; dsn: 'host=/var/run/postgresql dbname=momoko', closed: 0>
pid 1: 9792
running: psql -q -o/dev/null momoko -c "select pg_terminate_backend(9792)"
before trying another query: psycopg2 conn: <connection object at 0x7fce7b069e88; dsn: 'host=/var/run/postgresql dbname=momoko', closed: 0>
as expected, got OperationalError: server closed the connection unexpectedly
    This probably means the server terminated abnormally
    before or while processing the request.

after trying another query: psycopg2 conn:  <connection object at 0x7fce7b069e88; dsn: 'host=/var/run/postgresql dbname=momoko', closed: 0>
ERROR:tornado.application:Future exception was never retrieved: Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/tornado/gen.py", line 879, in run
    yielded = self.gen.send(value)
  File "./badclosed.py", line 39, in test
    assert conn.connection.closed == 2, conn.connection.closed
AssertionError: 0

Run it again, and get a different result:

$ ./badclosed.py 'host=/var/run/postgresql dbname=momoko'
psycopg2 conn: <connection object at 0x7f2c0d36ae88; dsn: 'host=/var/run/postgresql dbname=momoko', closed: 0>
pid 1: 9811
running: psql -q -o/dev/null momoko -c "select pg_terminate_backend(9811)"
before trying another query: psycopg2 conn: <connection object at 0x7f2c0d36ae88; dsn: 'host=/var/run/postgresql dbname=momoko', closed: 0>
as expected, got DatabaseError: server closed the connection unexpectedly
    This probably means the server terminated abnormally
    before or while processing the request.

after trying another query: psycopg2 conn:  <connection object at 0x7f2c0d36ae88; dsn: 'host=/var/run/postgresql dbname=momoko', closed: 2>
haizaar commented 8 years ago

Yeah, so there is definitely some fuzziness in psycopg2. I'm sure you'll catch it up. Sorry for not providing more assistance - pretty swamped with other things recently.

gward commented 8 years ago

OK I have a very clear reproduction script in psycopg2, and I found a fix. No PR for psycopg2 yet, but the bug is definitely there: https://github.com/psycopg/psycopg2/issues/443.

This bug is definitely not momoko's fault. @haizaar, thanks for your help in debugging this!

haizaar commented 8 years ago

Sure! Kudos on the thorough research.

gward commented 8 years ago

Extra confirmation: I just retried my two repro scripts that appeared to show a problem with momoko (lostconn.py and badclosed.py). I can still reproduce the problem using latest released psycopg2, but with my patch (https://github.com/psycopg/psycopg2/issues/443), momoko behaves perfectly.