Open crabique opened 1 year ago
Thanks for the report! What error message(s) or warnings are you getting with your patch installed? Adding pool_pre_ping=True
shouldn't hurt either way but we should try to figure out what we need to do to make the re-connection process completely seamless from the user's POV.
Session where it fails and then recovers even with pre-ping enabled:
INFO: [<IP>:32832] disconnecting
ERROR: Task exception was never retrieved
future: <Task finished coro=<NbdServer.handler() done, defined at /benji/lib64/python3.6/site-packages/benji/nbdserver.py:176> exception=OperationalError('(psycopg2.OperationalError) server closed the connection unexpectedly\n\tThis probably means the server terminated abnormally\n\tbefore or while processing the request.\n',)>
INFO: Incoming connection from <IP>:32834.
DEBUG: Client requested NBD_FLAG_NO_ZEROES.
DEBUG: [<IP>:32834]: opt=7, length=34, data=b'\x00\x00\x00\x1c<UID>\x00\x00'
DEBUG: [<IP>:32834]: opt=1, length=28, data=b'<UID>'
INFO: [<IP>:32834] Negotiated export: <UID>.
ERROR: Task exception was never retrieved
future: <Task finished coro=<NbdServer.handler() done, defined at /benji/lib64/python3.6/site-packages/benji/nbdserver.py:176> exception=InternalError('Attempt to acquire lock Version <UID> twice.',)>
INFO: Incoming connection from <IP>:32836.
DEBUG: Client requested NBD_FLAG_NO_ZEROES.
DEBUG: [<IP>:32836]: opt=7, length=34, data=b'\x00\x00\x00\x1c<UID>\x00\x00'
DEBUG: [<IP>:32836]: opt=1, length=28, data=b'<UID>'
INFO: [<IP>:32836] Negotiated export: <UID>.
I saw some commits with relevant changes pushed to master, can this be considered resolved? And is there a release/tag I can try?
Turns out, the problem was my TCP proxy (ingress-nginx) killing the tcp-proxied PSQL connections after 600s (default stream-timeout value), it works perfectly when the ingress is taken out of the equation or when the timeout is increased.
However, when such timeouts happen, it seems the pool_pre_ping
only fixes half the problem, the other half is that the operation fails (btw not only the NBD serve, but also long backup or restore, cleanup etc).
Describe the bug
When using
benji nbd
with configuration pointing to a PostgreSQL pool, the NBD server stops initiating new client connections after some period of inactivity.The following errors can be found in the process output:
After that, all new connection requests cause the server to show something like that:
ERROR: [<IP>:44590] NBD_CMD_READ: Can't reconnect until invalid transaction is rolled back. (Background on this error at: https://sqlalche.me/e/14/8s2b)
Traceback (most recent call last):
``` File "/benji/lib64/python3.6/site-packages/benji/nbdserver.py", line 346, in handler data = self.store.read(version, cow_version, offset, length) File "/benji/lib64/python3.6/site-packages/benji/benji.py", line 1340, in read read_list = self._block_list(version, offset, length) File "/benji/lib64/python3.6/site-packages/benji/benji.py", line 1311, in _block_list block = version.get_block_by_idx(block_idx) File "/benji/lib64/python3.6/site-packages/benji/database.py", line 586, in get_block_by_idx block = Session.query(Block).filter(Block.version_id == self.id, Block.idx == idx).one_or_none() File "/benji/lib64/python3.6/site-packages/sqlalchemy/orm/query.py", line 2849, in one_or_none return self._iter().one_or_none() File "/benji/lib64/python3.6/site-packages/sqlalchemy/orm/query.py", line 2910, in _iter execution_options={"_sa_orm_load_options": self.load_options}, File "/benji/lib64/python3.6/site-packages/sqlalchemy/orm/session.py", line 1712, in execute result = conn._execute_20(statement, params or {}, execution_options) File "/benji/lib64/python3.6/site-packages/sqlalchemy/engine/base.py", line 1705, in _execute_20 return meth(self, args_10style, kwargs_10style, execution_options) File "/benji/lib64/python3.6/site-packages/sqlalchemy/sql/elements.py", line 334, in _execute_on_connection self, multiparams, params, execution_options File "/benji/lib64/python3.6/site-packages/sqlalchemy/engine/base.py", line 1582, in _execute_clauseelement cache_hit=cache_hit, File "/benji/lib64/python3.6/site-packages/sqlalchemy/engine/base.py", line 1798, in _execute_context conn = self._revalidate_connection() File "/benji/lib64/python3.6/site-packages/sqlalchemy/engine/base.py", line 645, in _revalidate_connection self._invalid_transaction() File "/benji/lib64/python3.6/site-packages/sqlalchemy/engine/base.py", line 625, in _invalid_transaction code="8s2b", sqlalchemy.exc.PendingRollbackError: Can't reconnect until invalid transaction is rolled back. (Background on this error at: https://sqlalche.me/e/14/8s2b) ```This behaviour persists until the NBD server is restarted.
To Reproduce
Configure the following databaseEngine:
Try to connect to the NBD server using any client, then close the connection or keep it idle for 10-15 minutes (so that the server doesn't get any new requests), then try to connect again.
Expected behavior
NBD server works reliably without restarting it after every connection drop.
Platform and versions (please complete the following information):
bitnami/postgresql-ha@10.0.9
helm chart for the databaseAdditional context
Based on the suggestions here, I managed to make it self-recover in such cases by hotpatching this line in site-packages:
Still, this sometimes requires the client to initiate the connection 2+ times for the server to recover and actually accept it, but this more or less fixes the problem.
Please let me know if I you need any more details or if I can help in any way.