Closed gward closed 8 years ago
Oops, forgot to show my complete reproduction script:
$ cat pgrestart.py
#!/usr/bin/python
# What happens when momoko is connected to a database and the postgresql server is
# restarted? momoko should recover, but doesn't.
from __future__ import print_function
import sys
import logging
import random
import time
import psycopg2
from tornado import gen, ioloop
import momoko
log = logging.getLogger('pgrestart')
@gen.coroutine
def test(dsn, io_loop):
log.info('creating pool: %r', dsn)
pool = momoko.Pool(dsn=dsn, size=1, ioloop=io_loop)
yield pool.connect()
log.info('pool: %r', pool)
while True:
log.debug('calling pool.getconn()')
try:
conn = yield pool.getconn()
except psycopg2.DatabaseError:
log.exception('error from pool.getconn()')
time.sleep(1)
continue
log.debug('querying conn %r', conn)
cur = yield conn.execute('select pg_backend_pid()')
pid = cur.fetchall()[0][0]
log.info('pid: %r', pid)
cur.close()
pool.putconn(conn)
time.sleep(random.uniform(0.2, 0.8))
pool.close()
def main():
dsn = sys.argv[1]
logging.basicConfig(format='[%(asctime)s %(levelname)-1.1s %(name)s] %(message)s',
level=logging.DEBUG)
logging.getLogger("momoko").setLevel(logging.INFO)
io_loop = ioloop.IOLoop.instance()
future = test(dsn, io_loop)
io_loop.add_future(future, lambda x: io_loop.stop())
io_loop.start()
main()
If I run this normally, the output is very boring:
python pgrestart.py 'host=localhost dbname=psycopg2_test'
[2016-07-04 16:25:46,807 I pgrestart] creating pool: 'host=localhost dbname=psycopg2_test'
[2016-07-04 16:25:46,819 I pgrestart] pool: <momoko.connection.Pool object at 0x7f55c52ffc50>
[2016-07-04 16:25:46,819 D pgrestart] calling pool.getconn()
[2016-07-04 16:25:46,820 D pgrestart] querying conn <momoko.connection.Connection object at 0x7f55bdec27d0>
[2016-07-04 16:25:46,821 I pgrestart] pid: 29496
[2016-07-04 16:25:47,528 D pgrestart] calling pool.getconn()
[2016-07-04 16:25:47,530 D pgrestart] querying conn <momoko.connection.Connection object at 0x7f55bdec27d0>
[2016-07-04 16:25:47,531 I pgrestart] pid: 29496
[...repeat forever...]
But if I start it running, and then in other terminal I run sudo service postgresql restart
, here's what I get almost every time:
[2016-07-04 16:26:52,676 D pgrestart] calling pool.getconn()
[2016-07-04 16:26:52,682 D pgrestart] querying conn <momoko.connection.Connection object at 0x7f020b16a7d0>
[2016-07-04 16:26:52,683 I pgrestart] pid: 29510
[2016-07-04 16:26:53,445 D pgrestart] calling pool.getconn()
[2016-07-04 16:26:53,448 E pgrestart] error from pool.getconn()
Traceback (most recent call last):
File "pgrestart.py", line 32, in test
conn = yield pool.getconn()
File "/var/lib/venv/renesys/local/lib/python2.7/site-packages/tornado/gen.py", line 1008, in run
value = future.result()
File "/var/lib/venv/renesys/local/lib/python2.7/site-packages/tornado/concurrent.py", line 232, in result
raise_exc_info(self._exc_info)
File "<string>", line 3, in raise_exc_info
DatabaseNotAvailable: No database connection available
[2016-07-04 16:26:54,450 D pgrestart] calling pool.getconn()
[2016-07-04 16:26:54,470 D pgrestart] querying conn <momoko.connection.Connection object at 0x7f020b16a7d0>
[2016-07-04 16:26:54,471 I pgrestart] pid: 29543
That is, one query fails while PG is down, and then we're back in business with a new backend. I'd love it if momoko hid that and did the necessary retries for me, but that's not the serious problem.
Here is the serious problem, which I see occasionally:
[2016-07-04 16:27:12,581 D pgrestart] calling pool.getconn()
[2016-07-04 16:27:12,583 D pgrestart] querying conn <momoko.connection.Connection object at 0x7f020b16a7d0>
[2016-07-04 16:27:12,584 I pgrestart] pid: 29634
[2016-07-04 16:27:13,087 D pgrestart] calling pool.getconn()
[2016-07-04 16:27:13,088 E pgrestart] error from pool.getconn()
Traceback (most recent call last):
File "pgrestart.py", line 32, in test
conn = yield pool.getconn()
File "/var/lib/venv/renesys/local/lib/python2.7/site-packages/tornado/gen.py", line 1008, in run
value = future.result()
File "/var/lib/venv/renesys/local/lib/python2.7/site-packages/tornado/concurrent.py", line 232, in result
raise_exc_info(self._exc_info)
File "<string>", line 3, in raise_exc_info
DatabaseNotAvailable: No database connection available
[2016-07-04 16:27:14,089 D pgrestart] calling pool.getconn()
[2016-07-04 16:27:14,090 E tornado.application] Exception in callback <functools.partial object at 0x7f020a8fa890>
Traceback (most recent call last):
File "/var/lib/venv/renesys/local/lib/python2.7/site-packages/tornado/ioloop.py", line 600, in _run_callback
ret = callback()
File "/var/lib/venv/renesys/local/lib/python2.7/site-packages/tornado/stack_context.py", line 275, in null_wrapper
return fn(*args, **kwargs)
File "/home/data/src/momoko/momoko/connection.py", line 579, in on_connection_available
conn = fut.result()
File "/var/lib/venv/renesys/local/lib/python2.7/site-packages/tornado/concurrent.py", line 232, in result
raise_exc_info(self._exc_info)
File "<string>", line 3, in raise_exc_info
DatabaseNotAvailable: No database connection available
...and that's it. My app is dead, frozen, hung. Nothing happens until I kill it and restart it.
Note that I can greatly increase the frequency of hangs by decreasing the sleep after "error from pool.getconn()". Eg. with a 1 sec sleep as shown, roughly 1 in 10 restarts result in a hung process. But if I reduce the sleep to 0.1 sec:
--- a/pgrestart.py
+++ b/pgrestart.py
@@ -32,7 +32,7 @@ def test(dsn, io_loop):
conn = yield pool.getconn()
except psycopg2.DatabaseError:
log.exception('error from pool.getconn()')
- time.sleep(1)
+ time.sleep(0.1)
continue
log.debug('querying conn %r', conn)
then my process hangs almost every single time. It's dead easy to reproduce.
And here is the output of running pgrestart.py
with momoko debugging cranked up:
[2016-07-04 16:31:47,029 D pgrestart] calling pool.getconn()
[2016-07-04 16:31:47,029 D momoko] Acquired free connection 8
[2016-07-04 16:31:47,030 D momoko] Obtained connection: 8
[2016-07-04 16:31:47,031 D pgrestart] querying conn <momoko.connection.Connection object at 0x7fe91745a7d0>
[2016-07-04 16:31:47,032 I pgrestart] pid: 29784
[2016-07-04 16:31:47,032 D momoko] About to release connection 8
[2016-07-04 16:31:47,033 D momoko] The connection is alive
[2016-07-04 16:31:47,033 D momoko] Handling free connection 8
[2016-07-04 16:31:47,033 D momoko] No outstanding requests - adding to free pool
[2016-07-04 16:31:47,611 D pgrestart] calling pool.getconn()
[2016-07-04 16:31:47,611 D momoko] Acquired free connection 8
[2016-07-04 16:31:47,611 D momoko] Obtained connection: 8
[2016-07-04 16:31:47,612 D momoko] Method failed Asynchronously
[2016-07-04 16:31:47,612 D momoko] About to release connection 8
[2016-07-04 16:31:47,612 D momoko] The connection is dead
[2016-07-04 16:31:47,612 D momoko] All connections are dead - aborting waiting queue
[2016-07-04 16:31:47,612 E pgrestart] error from pool.getconn()
Traceback (most recent call last):
File "pgrestart.py", line 32, in test
conn = yield pool.getconn()
File "/var/lib/venv/renesys/local/lib/python2.7/site-packages/tornado/gen.py", line 1008, in run
value = future.result()
File "/var/lib/venv/renesys/local/lib/python2.7/site-packages/tornado/concurrent.py", line 232, in result
raise_exc_info(self._exc_info)
File "<string>", line 3, in raise_exc_info
DatabaseNotAvailable: No database connection available
[2016-07-04 16:31:47,713 D pgrestart] calling pool.getconn()
[2016-07-04 16:31:47,713 D momoko] All connections are dead
[2016-07-04 16:31:47,715 D momoko] Adding dead connection
[2016-07-04 16:31:47,716 E tornado.application] Exception in callback <functools.partial object at 0x7fe916be8520>
Traceback (most recent call last):
File "/var/lib/venv/renesys/local/lib/python2.7/site-packages/tornado/ioloop.py", line 600, in _run_callback
ret = callback()
File "/var/lib/venv/renesys/local/lib/python2.7/site-packages/tornado/stack_context.py", line 275, in null_wrapper
return fn(*args, **kwargs)
File "/home/data/src/momoko/momoko/connection.py", line 579, in on_connection_available
conn = fut.result()
File "/var/lib/venv/renesys/local/lib/python2.7/site-packages/tornado/concurrent.py", line 232, in result
raise_exc_info(self._exc_info)
File "<string>", line 3, in raise_exc_info
DatabaseNotAvailable: No database connection available
Oh yeah, I'm seeing this on:
python --version
reports 2.7.10: weird!)You are right about both. Sorry for the late response - really busy at work and 10-month old daughter :)
The reconnect logic was only conceived to reconnect idle connections that were silently dropped by the server - for example, if you are running a lazy query once in an hour; and 30 minutes in the middle the server was restarted. Trying to implement general retry logic on such a deep level is a muddy water from my experience and that's why I leave it to application developer.
Regarding the hangs - this is bug indeed. It turns out that there even is a simpler reproduction - instead of doing restarts, just shutdown the server and your script would hang.
I've fixed this in master. Can you please have alook?
And thanks for thorough reproduction (as usual :)
Imagine a simple application that runs an infinite loop, periodically querying a database via momoko:
As long as I never restart my PostgreSQL server, this works perfectly. But when I restart PG, bad stuff happens. Specifically:
getconn()
fail withDatabaseNotAvailable: No database connection available
. That's annoying; I would think that a connection pool would take care of reconnecting when my database server restarts. I don't want to wrap a retry loop around momoko, because IMHO momoko should do that for me!pool.getconn()
never yields a result. My query-forever app is dead and has to be manually restarted.The latter bug is what I'm really complaining about. If momoko's position is that PG restarts are an application problem rather than a connection pool problem, and it's my job to retry
getconn()
to handle PG restarts: not ideal, but I can live with it.But if momoko occasionally freezes up when PG restarts, that's a more serious problem. IMHO that needs to be fixed.