FSX / momoko

Wraps (asynchronous) Psycopg2 for Tornado.
http://momoko.61924.nl/

Fixed error handling when getting a connection. #115

Closed jchumnanvech closed 9 years ago

jchumnanvech commented 9 years ago

Right now we are encountering a bug when someone restarts the postgres-db service: the connections become stale. When a psycopg2 Error is thrown, calling set_exc_info with just the error raises another exception, which causes the connection to not be put back in the connection pool (line 560 is not executed because line 559 blows up). It surfaces in the Tornado concurrent exception traceback logger.

It ends up passing the error into
self.formatted_tb = traceback.format_exception(*exc_info), but exc_info has only 1 element while format_exception expects 3. This is in concurrent.py, line 113.

Changing it to set_exception seems to make it work.

haizaar commented 9 years ago

Nice catch. It's a bug indeed, but of a different kind. The thing is that set_exception does not preserve tracebacks in Python 2 (see the relevant docs).

That's why I use set_exc_info where it makes sense. So instead of converting to set_exception, can you try this:

ping_future.set_exc_info(sys.exc_info())
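
For illustration, a minimal sketch of that pattern, where ping_future is assumed to be a tornado.concurrent.Future and the surrounding helper is hypothetical, not momoko's actual ping code:

import sys

import psycopg2

def _resolve_ping(ping_future, connection):
    # Hypothetical helper: run a trivial query on the connection and report
    # the outcome on a Tornado future; only the set_exc_info() call mirrors
    # the suggested fix.
    try:
        connection.cursor().execute("SELECT 1")
    except psycopg2.Error:
        # set_exc_info() expects the full (type, value, traceback) triple,
        # so pass sys.exc_info() rather than the bare exception object,
        # which is what tripped up traceback.format_exception().
        ping_future.set_exc_info(sys.exc_info())
    else:
        ping_future.set_result(connection)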

I also strive to cover all bugs with unittests. Can you please write one for this issue? I guess it would be enough to close conn's file descriptor to simulate server restart.

Thanks.

friedcell commented 9 years ago

Explored this for a bit and it seems a bit more complicated.

If Connection.ping dies in Pool._operate, it will do a Pool.putconn in all cases but one - conn = fut.result() - and that case can't happen from _ping_future_connection anyway, since the connection is passed in and fut is a fake future with its result already set.

If a file descriptor is closed, another error happens - AssertionError: Tried to release non-busy connection - because Pool.putconn in Pool._ping_future_connection is called on a Pool._operate error that has already done a Pool.putconn in its own error handling.

Checking how restarts work, I found psycopg#196, but restarting the server still doesn't mark the connection as closed properly (at least on OS X using psycopg2 2.6.1). Once that happens, connections are never recycled.

Connections also fail to retry: Connection._io_callback fails, the retry loop is not triggered because the exception is set on the future, and the query fails - though this time the connection is correctly marked as closed and gets recycled.

Am I missing something?

haizaar commented 9 years ago

OK, let's take it one thing at a time. @jchumnanvech's original bug is an issue with an incorrect API call. It's trivial and should be fixed. But the bug itself surfaced another issue - the unit tests do not properly verify the handling of connectivity issues.

Regarding psycopg#196: it was fixed back when I complained about it and still works as of psycopg2 2.6.1 on my Ubuntu 14.04. It's probably an OS X-specific issue. @friedcell, I suggest you reopen the ticket on the psycopg2 tracker.

Now, @friedcell discovered another bug: on connectivity errors, putconn is called twice on the same connection and hits an assertion error. We need to open another bug for that and write a proper reproduction.

@friedcell, regarding your last paragraph - I fail to understand what you are trying to say (that's probably me :) ). Can you please open another issue for this and we'll continue to analyze it there? (Some code that reproduces the concrete failure scenarios would really help.)

friedcell commented 9 years ago

I'll open another issue for reconnection issues - I have tests for that but need to do some more testing on Ubuntu.

I don't yet have tests for double putconn and the set_exc_info issue...

haizaar commented 9 years ago

Looks like server restarts can be simulated pretty well with the following:

>>> import psycopg2
>>> conn = psycopg2.connect("user=foo password=bar host=127.0.0.1 port=5432")
>>> import socket
>>> s = socket.fromfd(conn.fileno(), socket.AF_INET, socket.SOCK_STREAM)
>>> s.shutdown(socket.SHUT_WR)
>>> cur = conn.cursor(); cur.execute("SELECT 1")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
psycopg2.OperationalError: SSL connection has been closed unexpectedly

>>> conn.closed
2
>>> 

The real server restart causes the following:

>>> cur = conn.cursor(); cur.execute("SELECT 1")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
psycopg2.OperationalError: terminating connection due to administrator command
SSL connection has been closed unexpectedly

The next step is to incorporate connection killing into unittests, see where it fails and continue from there.
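
As a starting point, a rough unittest-style sketch of that idea using plain psycopg2 (the DSN is a placeholder, and this is not momoko's actual test harness; the exact error message may depend on whether SSL is in use):

import socket
import unittest

import psycopg2


class ServerRestartSimulationTest(unittest.TestCase):
    DSN = "user=foo password=bar host=127.0.0.1 port=5432"  # placeholder

    def test_killed_connection_is_marked_closed(self):
        conn = psycopg2.connect(self.DSN)
        # Shut down the write side of the underlying socket to mimic the
        # server going away underneath an idle connection.
        sock = socket.fromfd(conn.fileno(), socket.AF_INET, socket.SOCK_STREAM)
        sock.shutdown(socket.SHUT_WR)

        cur = conn.cursor()
        with self.assertRaises(psycopg2.OperationalError):
            cur.execute("SELECT 1")
        self.assertEqual(conn.closed, 2)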

haizaar commented 9 years ago

I don't yet have tests for double putconn and the set_exc_info issue...

I think I can tinker with it.

friedcell commented 9 years ago

I got different errors on OS X - my biggest problem was that psycopg2 does not report the connections as closed. I solved the double putconn with a simple if conn in self.conns.busy: guard in _ping_future_connection.
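
Roughly, the guard looks like this (a simplified sketch, not momoko's exact method body):

def _ping_future_connection(self, conn, fut):
    try:
        fut.result()
    except psycopg2.Error:
        # Pool._operate's own error handling may already have returned the
        # connection to the pool, so only release it if it is still busy.
        if conn in self.conns.busy:
            self.putconn(conn)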

haizaar commented 9 years ago

@friedcell, I finally see what you are saying - the retry code runs only when cursor.execute fails synchronously. Unfortunately, when the server restarts, the failure happens asynchronously, later on in _io_callback, and is propagated straight to the caller.
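
For illustration, a minimal sketch of the two paths (the function and its arguments are illustrative, not momoko's actual internals):

import psycopg2

def execute_with_retry(cursor, query, on_retry):
    # Illustrative only; momoko's real retry logic lives elsewhere.
    try:
        cursor.execute(query)
    except psycopg2.Error:
        # Synchronous failure: execute() itself raised, so the retry
        # path can run right here.
        return on_retry()
    # Asynchronous failure: execute() returned normally and the error only
    # surfaces later, when the IO loop polls the connection in
    # Connection._io_callback; there it is set on the future and propagated
    # straight to the caller, bypassing this retry path entirely.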

Something to work on...

Thanks for discovering this.

haizaar commented 9 years ago

With async code, shutting the socket down for writes does not set conn.closed, but shutting it down for reads does the trick. However, a DatabaseError is thrown instead of an OperationalError, so we need to catch both of them for retries:

from __future__ import print_function

import psycopg2
import select
import socket
import traceback

def wait(conn):
    while 1:
        state = conn.poll()
        if state == psycopg2.extensions.POLL_OK:
            break
        elif state == psycopg2.extensions.POLL_WRITE:
            select.select([], [conn.fileno()], [])
        elif state == psycopg2.extensions.POLL_READ:
            select.select([conn.fileno()], [], [])
        else:
            raise psycopg2.OperationalError("poll() returned %s" % state)

aconn = psycopg2.connect(dsn="user=foo password=bar host=127.0.0.1 port=5432", async=1)
wait(aconn)

s = socket.fromfd(aconn.fileno(), socket.AF_INET, socket.SOCK_STREAM)
s.shutdown(socket.SHUT_RD)

try:
    acur = aconn.cursor()
    acur.execute("SELECT 1")
    wait(aconn)
except Exception as err:
    print("Execution failed: %s" % err)
    traceback.print_exc()
    print("aconn.closed = %s" % aconn.closed)
else:
    print(acur.fetchone())

And the result is:

$ python asynclose.py 
Execution failed: SSL SYSCALL error: EOF detected

Traceback (most recent call last):
  File "asynclose.py", line 35, in <module>
    wait(aconn)
  File "asynclose.py", line 12, in wait
    state = conn.poll()
OperationalError: SSL SYSCALL error: EOF detected

aconn.closed = 2
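
Since OperationalError is a subclass of DatabaseError in psycopg2's DB-API exception hierarchy, a single except clause is enough to cover both. A minimal sketch of the caller-side handling, reusing aconn and wait() from the script above (the retry itself is left out):

acur = aconn.cursor()
try:
    acur.execute("SELECT 1")
    wait(aconn)
except psycopg2.DatabaseError as err:
    # OperationalError derives from DatabaseError, so this one handler
    # catches both failure modes and can decide whether to retry.
    print("Query failed (%s); aconn.closed = %s" % (err, aconn.closed))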

haizaar commented 9 years ago

I have, hopefully, fixed all of it in my master branch. Guys, can you try this version: https://github.com/haizaar/momoko? Both the ping issue and reconnects should work.

In the end, shutting down sockets did not work - it somehow alters psycopg2's state in a weird way, such that most of the time things work, but about one in every 30 test runs fails with a wonky error. I spent a lot of time trying to understand whether it was a problem with my code, but the conclusion was that it's not. After incorporating a third-party TCP proxy, it all works smoothly.

If you run tests, make sure to run make -C tcproxy first.

haizaar commented 9 years ago

Guys, does anyone have any experience with Travis CI? For some reason my proxy tests fail when run on Travis - I can't force psycopg2 to connect through TCP to my proxy.

Here is the failing build - https://travis-ci.org/haizaar/momoko/builds/77378307 (pulls from my master).

haizaar commented 9 years ago

Travis fixed. I've merged the code to master. Feedback is welcome.

friedcell commented 9 years ago

Tested in all scenarios that were failing for me before - works perfectly now. Sorry for the delay, fell into a zone where normal things don't happen very often.

haizaar commented 9 years ago

Excellent. I'll release 2.2.0 later this week.