elemental-lf / benji

Benji Backup: A block based deduplicating backup software for Ceph RBD images, iSCSI targets, image files and block devices
https://benji-backup.me

NBD server losing connection to PostgreSQL #154

Open crabique opened 1 year ago

crabique commented 1 year ago

Describe the bug

When using benji nbd with a configuration pointing to a PostgreSQL pool, the NBD server stops serving new client connections after some period of inactivity.

The following errors can be found in the process output:

   INFO: Incoming connection from <IP>:60948.
   DEBUG: Client requested NBD_FLAG_NO_ZEROES.
   DEBUG: [<IP>:60948]: opt=7, length=34, data=b'\x00\x00\x00\x1c<UID>\x00\x00'
   DEBUG: [<IP>:60948]: opt=1, length=28, data=b'<UID>'
   ERROR: Task exception was never retrieved
future: <Task finished coro=<NbdServer.handler() done, defined at /benji/lib64/python3.6/site-packages/benji/nbdserver.py:176> exception=OperationalError('(psycopg2.OperationalError) server closed the connection unexpectedly\n\tThis probably means the server terminated abnormally\n\tbefore or while processing the request.\n',)>

After that, all new connection requests cause the server to log something like the following:

ERROR: [<IP>:44590] NBD_CMD_READ: Can't reconnect until invalid transaction is rolled back. (Background on this error at: https://sqlalche.me/e/14/8s2b)

```
Traceback (most recent call last):
  File "/benji/lib64/python3.6/site-packages/benji/nbdserver.py", line 346, in handler
    data = self.store.read(version, cow_version, offset, length)
  File "/benji/lib64/python3.6/site-packages/benji/benji.py", line 1340, in read
    read_list = self._block_list(version, offset, length)
  File "/benji/lib64/python3.6/site-packages/benji/benji.py", line 1311, in _block_list
    block = version.get_block_by_idx(block_idx)
  File "/benji/lib64/python3.6/site-packages/benji/database.py", line 586, in get_block_by_idx
    block = Session.query(Block).filter(Block.version_id == self.id, Block.idx == idx).one_or_none()
  File "/benji/lib64/python3.6/site-packages/sqlalchemy/orm/query.py", line 2849, in one_or_none
    return self._iter().one_or_none()
  File "/benji/lib64/python3.6/site-packages/sqlalchemy/orm/query.py", line 2910, in _iter
    execution_options={"_sa_orm_load_options": self.load_options},
  File "/benji/lib64/python3.6/site-packages/sqlalchemy/orm/session.py", line 1712, in execute
    result = conn._execute_20(statement, params or {}, execution_options)
  File "/benji/lib64/python3.6/site-packages/sqlalchemy/engine/base.py", line 1705, in _execute_20
    return meth(self, args_10style, kwargs_10style, execution_options)
  File "/benji/lib64/python3.6/site-packages/sqlalchemy/sql/elements.py", line 334, in _execute_on_connection
    self, multiparams, params, execution_options
  File "/benji/lib64/python3.6/site-packages/sqlalchemy/engine/base.py", line 1582, in _execute_clauseelement
    cache_hit=cache_hit,
  File "/benji/lib64/python3.6/site-packages/sqlalchemy/engine/base.py", line 1798, in _execute_context
    conn = self._revalidate_connection()
  File "/benji/lib64/python3.6/site-packages/sqlalchemy/engine/base.py", line 645, in _revalidate_connection
    self._invalid_transaction()
  File "/benji/lib64/python3.6/site-packages/sqlalchemy/engine/base.py", line 625, in _invalid_transaction
    code="8s2b",
sqlalchemy.exc.PendingRollbackError: Can't reconnect until invalid transaction is rolled back. (Background on this error at: https://sqlalche.me/e/14/8s2b)
```

This behaviour persists until the NBD server is restarted.

To Reproduce

Configure the following databaseEngine:

databaseEngine: postgresql://<PG_USER>:<PG_PASS>@<PG_HOST>/<BENJI_DB>

Try to connect to the NBD server using any client, then close the connection or keep it idle for 10-15 minutes (so that the server doesn't get any new requests), then try to connect again.

Expected behavior

NBD server works reliably without restarting it after every connection drop.

Platform and versions (please complete the following information):

Additional context

Based on the suggestions here, I managed to make it self-recover in such cases by hotpatching this line in site-packages:

    self._engine = sqlalchemy.create_engine(url, pool_pre_ping=True)

Still, the client sometimes has to initiate the connection two or more times before the server recovers and actually accepts it, but this more or less fixes the problem.
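
For reference, here is a sketch of the hotpatched engine creation (assumptions on my side: `url` holds the configured databaseEngine string, and any other keyword arguments Benji normally passes are omitted). Adding `pool_recycle` as well might help when something between the server and PostgreSQL silently drops idle connections:

```python
import sqlalchemy

# Placeholder URL mirroring the databaseEngine example above.
url = "postgresql://<PG_USER>:<PG_PASS>@<PG_HOST>/<BENJI_DB>"

engine = sqlalchemy.create_engine(
    url,
    pool_pre_ping=True,  # test each pooled connection with a lightweight query before handing it out
    pool_recycle=300,    # hypothetical value: replace connections older than 300 s to stay under idle timeouts
)
```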

Please let me know if you need any more details or if I can help in any way.

elemental-lf commented 1 year ago

Thanks for the report! What error message(s) or warnings are you getting with your patch installed? Adding pool_pre_ping=True shouldn't hurt either way but we should try to figure out what we need to do to make the re-connection process completely seamless from the user's POV.
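
For illustration, a minimal sketch (hypothetical, not Benji's actual code) of what seamless handling could look like: roll back the session when the connection turns out to be dead and retry the operation once, so the pool can hand out a fresh connection on the second attempt:

```python
from sqlalchemy.exc import OperationalError, PendingRollbackError

def run_with_reconnect(session, operation, retries=1):
    """Hypothetical helper: retry a database operation after a dropped connection.

    The rollback discards the invalidated transaction so SQLAlchemy can check
    out a fresh connection from the pool on the next attempt.
    """
    for attempt in range(retries + 1):
        try:
            return operation()
        except (OperationalError, PendingRollbackError):
            session.rollback()
            if attempt == retries:
                raise
```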

crabique commented 1 year ago

Session where it fails and then recovers even with pre-ping enabled:

    INFO: [<IP>:32832] disconnecting
   ERROR: Task exception was never retrieved
future: <Task finished coro=<NbdServer.handler() done, defined at /benji/lib64/python3.6/site-packages/benji/nbdserver.py:176> exception=OperationalError('(psycopg2.OperationalError) server closed the connection unexpectedly\n\tThis probably means the server terminated abnormally\n\tbefore or while processing the request.\n',)>
    INFO: Incoming connection from <IP>:32834.
   DEBUG: Client requested NBD_FLAG_NO_ZEROES.
   DEBUG: [<IP>:32834]: opt=7, length=34, data=b'\x00\x00\x00\x1c<UID>\x00\x00'
   DEBUG: [<IP>:32834]: opt=1, length=28, data=b'<UID>'
    INFO: [<IP>:32834] Negotiated export: <UID>.
   ERROR: Task exception was never retrieved
future: <Task finished coro=<NbdServer.handler() done, defined at /benji/lib64/python3.6/site-packages/benji/nbdserver.py:176> exception=InternalError('Attempt to acquire lock Version <UID> twice.',)>
    INFO: Incoming connection from <IP>:32836.
   DEBUG: Client requested NBD_FLAG_NO_ZEROES.
   DEBUG: [<IP>:32836]: opt=7, length=34, data=b'\x00\x00\x00\x1c<UID>\x00\x00'
   DEBUG: [<IP>:32836]: opt=1, length=28, data=b'<UID>'
    INFO: [<IP>:32836] Negotiated export: <UID>.

crabique commented 1 year ago

I saw some commits with relevant changes pushed to master; can this be considered resolved? And is there a release/tag I can try?

crabique commented 4 months ago

Turns out the problem was my TCP proxy (ingress-nginx) killing the TCP-proxied PostgreSQL connections after 600s (the default stream-timeout value). It works perfectly when the ingress is taken out of the equation or when the timeout is increased.
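
If taking the proxy out of the path is not an option, another thing that might help (an assumption on my side, not something Benji documents) is passing libpq's TCP keepalive parameters through the databaseEngine URL, so the connection never sits idle long enough to hit the proxy's stream timeout:

databaseEngine: postgresql://<PG_USER>:<PG_PASS>@<PG_HOST>/<BENJI_DB>?keepalives=1&keepalives_idle=60&keepalives_interval=10&keepalives_count=3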

However, when such timeouts happen, it seems pool_pre_ping only fixes half the problem; the other half is that the in-flight operation still fails (by the way, not only NBD serving, but also long backups, restores, cleanups, etc.).