Open jmacmahon opened 5 years ago
The pool removes clients pretty aggressively (e.g. even when the query error is an error in the query). Are you using pool.query
or acquiring clients manually? What does the error handling look like there?
We're using pg-promise
's .any
method, which from a quick look at the code looks like it uses the pool.
It looks like a pg-promise bug to me. https://github.com/vitaly-t/pg-promise/blob/3d2c670af42e9e1d93d4dc23095061550b40a7c3/lib/database.js#L333 The connection is returned to the pool even if the query fails because the connection dropped.
@charmander Could you, please point anywhere in documentation or otherwise, that a dropped connection object cannot be returned back to the pool? It's just this is the first time I hear something like that.
@vitaly-t It's in the docs: https://node-postgres.com/api/pool#releaseCallback
The API has been like that for as long as I can remember.
@sehrope What is in the docks about not allowing to release a dropped connection back to the pool? I don't see anything.
The pool is not aware of the connection status. It's up to the user to inform the pool that the connection is either still valid (via client.release()
) or invalid (via client.release(err)
).
From the docs:
The releaseCallback releases an acquired client back to the pool. If you pass a truthy value in the err position to the callback, instead of releasing the client to the pool, the pool will be instructed to disconnect and destroy this client, leaving a space within itself for a new client.
Using pool.query(...)
does this automatically. If you manage the connection yourself or provide your own wrapper around pool.connect()
then you need to ensure that errant connections are evicted via client.release(err)
.
A "smarter" wrapper might differentiate between transient errors (ex: UNIQUE key violation) or permanent ones (ex: connection killed), but a simple default is to design applications with queries that do not fail and evict connections from the pool when they do.
@sehrope Thanks! I may need to make some changes in pg-promise now :)
Hi, I also experienced the same issue some time back.
client.release
. We can send a PR for that now.Client
didn’t get restored properly when a query ended with an error. Was that a wrong finding? If not, does someone know more on how it could be fixed so that clients are still reusable even after an error?EDIT: @sehrope I found that clients couldn’t be reused even after a UNIQUE key violation, but perhaps that was just something wrong with my setup. I’m afraid I can’t confirm it right now (away from computer).
Best, Joel
- I think the docs should explicitly mention that connection errors (and query errors, from my findings) should be passed to
client.release
. We can send a PR for that now.
+1
- I never looked into it too much the last time, but it seemed that some internal state in
Client
didn’t get restored properly when a query ended with an error. Was that a wrong finding? If not, does someone know more on how it could be fixed so that clients are still reusable even after an error?
A client should be reusable after running an errant command. If you're within a transaction you'll need to issue a ROLLBACK
before the server let's you run anything else. But the connection itself should still be usable and allow you to issue further commands.
If you encounter an error where that is not the case please provide an example as that would be a bug.
@sehrope I found that clients couldn’t be reused even after a UNIQUE key violation, but perhaps that was just something wrong with my setup. I’m afraid I can’t confirm it right now (away from computer).
This is likely due to being within a transaction. You'll need to either:
ROLLBACK
on failure before returning the connection to the pool.SAVEPOINT
(and roll back to the save point if it fails and you want to proceed).INSERT ... ON CONFLICT ...
).@sehrope thank you. It wasn’t in a transaction though; I’ll try to reproduce it in a minimal setup when I get back to computer.
Here's what troubles me... this documentation implies that the client is supposed to be able to identify a dropped connection from all other types of errors, if it is to pass a truthy value into the release function.
But this is far from trivial, especially when executing a transaction, there are so many different types of failures that may occur, and how to filter them out from what actually represents an issue in communication with the server.
Some time ago I had to write a complex function to detect just that:
Which I still use to check whether or not to attempt executing ROLLBACK
on a failed transaction.
And this is far from a friendly approach that can be considered for the client-side code.
My understanding thus far, the driver must be able to identify connectivity issues on its own, and mark the client as disconnected, because it is too complex to expect the client to do so. And this is the piece of information that's totally missing in documentation here.
So how should this be approached?
A while back I proposed https://github.com/brianc/node-pg-pool/pull/119, which we use internally. This won’t catch everything, but at least handle those cases more nicely. Obviously those internal properties shouldn’t be accessed like that, so the client would need an external API for that.
@johanneswuerbach If it does throw away broken connections from the pool in a timely fashion, then it's all we need. And as per my previous post, I fully support that broken connections must be handled between the driver and the pool, and not not by the client. The client should only receive a notification when the connection dies, as it does now (in part), and that's it.
@vitaly-t wrote:
Here's what troubles me... this documentation implies that the client is supposed to be able to identify a dropped connection from all other types of errors, if it is to pass a truthy value into the release function.
But this is far from trivial, especially when executing a transaction, there are so many different types of failures that may occur, and how to filter them out from what actually represents an issue in communication with the server.
The simplest approach is to treat any error as non-recoverable and instruct the pool to evict the connection. That will always be correct though it could lead to trashing your pool if errors are common enough in normal routes. For example if you try to create a resource and get a duplicate key error.
A more advanced approach would be to default to discarding on errors, but if you're wrapped in a transaction then attempt a ROLLBACK. If the ROLLBACK succeeds then consider the connection salvaged. Otherwise, discard it. This will handle the previous thrashing situation yet still evict the connection if it is totally broken (ex: TCP error).
Some time ago I had to write a complex function to detect just that:
I wouldn't bother with anything general like that. In theory you could check for a FATAL error from the server or something similar from the internal driver, but much safer and easier to default to evicting and not try to be too clever.
My understanding thus far, the driver must be able to identify connectivity issues on its own, and mark the client as disconnected, because it is too complex to expect the client to do so. And this is the piece of information that's totally missing in documentation here.
You always need a way for the client to inform the driver that the connection should be evicted, even if the connection is still usable. Say you had a route that alters the session in some way that is not recoverable (some SET ...
or executing some PL that mutates internal state). You would not want the pool to give back that connection to anyone else so you could do client.release(true)
to make sure that does not happen.
If it does throw away broken connections from the pool in a timely fashion, then it's all we need.
The failures could happen at any time so the end user must always deal with errors directly. Even a "test after checkout from pool" could fail between the test and the user's actual DB work. It might work the majority of the time but there's a race condition if it's killed in between the test completing and the user performing the actual work.
https://github.com/brianc/node-pg-pool/pull/119 throws away the connections when the library user calls client.release and the pg client is not in a queryable state. The queryable state is determined by the driver itself and should already be a good source of truth.
Non-queryable clients directly return errors if you try to call client.query already today, so some of that logic is already out there and tested.
The only thing not detectable so far is returning a client in an open and failed, but not rolled back transaction as that condition is currently not detect by the driver.
I wouldn't bother with anything general like that. In theory you could check for a FATAL error from the server or something similar from the internal driver, but much safer and easier to default to evicting and not try to be too clever.
If we are not clever enough, we may dispense with alive connection while inside a transaction by mistake. When in failure inside a transaction, and before executing ROLLBACK
we need to know if it is at all possible to do ROLLBACK
on the current connection.
Otherwise, if we just start throwing away live connections, we will end up with transactions locked on the server, without a good reason.
@johanneswuerbach Are you still using pg-promise
or switched to the driver directly? I'm asking in case I start playing with some changes for this, whether you would be interested in helping...
If we are not clever enough, we may dispense with alive connection while inside a transaction by mistake.
If you're the one handling transaction management then you know if you're in a transaction and should attempt a ROLLBACK for a failed unit of work.
If you're not the one handling transaction management then you're not handling transaction management and should not attempt anything.
Are you trying to infer if the user of your library failed a transaction outside of a library-level helper and trying to recover from it?
When in failure inside a transaction, and before executing ROLLBACK we need to know if it is at all possible to do ROLLBACK on the current connection.
If you're the one managing the transaction then just try to do the ROLLBACK. It either works (connection is good) or it doesn't work (connection is definitely no good).
It's even possible that the connection is in a workable state (i.e. TCP is fine, DB is still open) when you "verify" that a ROLLBACK is possible and it gets killed before you actually ROLLBACK. Best to just try it and see if it worked.
Otherwise, if we just start throwing away live connections, we will end up with transactions locked on the server, without a good reason.
The PostgreSQL server will ROLLBACK any open transactions for closed connections. Otherwise you could permanently lock up any server by running:
BEGIN;
SELECT * FROM some_table FOR UPDATE;
-- Exit the connection...
If you're the one handling transaction management then you know if you're in a transaction and should attempt a ROLLBACK for a failed unit of work.
It is not that simple. See the original issue, which forced me to start using that error verification.
Presumably, it is now fixed. I'm gonna do some good re-testing of this again. And if it is indeed fixed, it does simplify things a bit.
Previously, we could often end up with the driver's query
method suddenly dead-lock when trying to execute ROLLBACK
while using a broken connection.
I will get back on this after some more testing.
@vitaly-t Okay that makes sense as a workaround but only because not bubbling up the connection error was a bug. Assuming that's fixed you shouldn't have to do anything like that and blindly issuing a ROLLBACK on error should work.
@vitaly-t no, I'm still using pg-promise and happy to help. Not sure how much time I've, but I can definitely test any changes against our internal test suite, which exercises various failover scenarios.
Hello, the company I work for has been using
pg-promise
for our database service and we've run into an issue with failover, which we believe is an error in the underlyingnode-postgres
library. We believe the issue is due to the way Linux handles socket timeout events differently to macOS.Steps to reproduce:
Note: we believe this scenario is not specific to RDS, but rather any network outage or server failure which does not send a TCP FIN packet.
Expected outcome and actual outcome on macOS:
Actual outcome on Linux:
Detailed order of events
macOS
Linux