shousper opened 3 years ago
I'm also having a similar issue. In my case, the client system shuts down so I have to restart the client, which is quite bad if you want reliability. It's probably related to my system not having error handling as suggested by https://github.com/brianc/node-postgres/issues/821#issuecomment-127101245 and https://github.com/brianc/node-postgres/issues/1075#issuecomment-232044034.
In your issue, are you using an error handler as suggested in the other threads to catch the errors until the DNS change is completed? Are you doing exponential back-off on your read/write SQL queries?
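(For context, the handler those threads suggest is roughly the following; a minimal sketch using `pg.Pool`, with the back-off strategy left to the application:)

```ts
import { Pool } from 'pg';

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

// Without a handler, an 'error' emitted by an idle client (e.g. the backend
// being terminated during a failover) will crash the Node process.
pool.on('error', (err) => {
  console.error('Unexpected error on idle client', err);
  // mark the service unhealthy / schedule a retry with back-off here
});
```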
@FagnerMartinsBrack I've not reached any solution I'm happy with yet.
`knex` doesn't allow you to manage connections (or pools) yourself. I could perhaps write an entire database driver wrapper around `pg` to handle the connections better, but I fear it'd require hacky interactions with the internals of `knex`. You can catch errors with some configuration hooks, but you can't evict a connection from the pool without creating more problems. The only other option I'm considering is wrapping the entire `Knex` object to implement an exponential back-off (EBO) retry on a specific set of errors, but these solutions are all terrifying and require exhausting & expensive levels of testing to verify.
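For illustration, the kind of wrapper I mean would look roughly like this (a sketch only; the retryable error codes are a guess, and `db` is assumed to be a knex instance):

```ts
// Hypothetical exponential back-off wrapper around a knex query function.
const RETRYABLE = new Set(['ECONNRESET', 'ETIMEDOUT', '57P01', '57P03']);

async function withBackoff<T>(fn: () => Promise<T>, attempts = 5): Promise<T> {
  for (let i = 0; ; i++) {
    try {
      return await fn();
    } catch (err: any) {
      if (i >= attempts - 1 || !RETRYABLE.has(err?.code)) throw err;
      // 100ms, 200ms, 400ms, ... between attempts
      await new Promise((r) => setTimeout(r, 100 * 2 ** i));
    }
  }
}

// usage: await withBackoff(() => db('users').where({ id }).first());
```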
EDIT: Also considering submitting to AWS and paying them for RDS Proxy to make the problem go away, lol.
@shousper I didn't know RDS Proxy even existed. I'm using `node-postgres` without knex. I guess if I use RDS Proxy, then I wouldn't need to use the `Pool` class from `pg` and could just point the host to the proxy? Sounds like a great alternative if that improves the Multi-AZ reliability to almost real-time!
Can you let us know if RDS Proxy worked for you after you try it?
Ok, I did test it. I used RDS Proxy, changing the "DB_HOST" field without any other change, including no change to the `Pool` code.
Result: less than 5s downtime after a reboot with failover.
I'll keep paying AWS for that stability and keep RDS Proxy + Multi-AZ active
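For anyone else trying this, the change really is just the host; a minimal sketch (the proxy endpoint below is made up, and RDS Proxy connections are typically configured to require TLS):

```ts
import { Pool } from 'pg';

const pool = new Pool({
  host: 'my-proxy.proxy-abc123.us-east-1.rds.amazonaws.com', // RDS Proxy endpoint (example)
  port: 5432,
  user: process.env.DB_USER,
  password: process.env.DB_PASSWORD,
  database: process.env.DB_NAME,
  ssl: true, // enable TLS if the proxy requires it
});
```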
Okay, I've done about as much thorough testing as I can. I'm honestly not sure any problem necessarily lies with either `pg` or `knex`, but I can say that RDS Proxy didn't solve anything for us. It actually made matters worse.
tl;dr. If you want fast failover, make sure your DNS resolution respects the TTL and use a heartbeat loop to determine if a connection should be discarded.
`knex` uses 100% prepared statements, which trigger a 100% session-pinned connection rate via RDS Proxy. We've found some folks hacking on the `tarn` config along with some event hooks to aggressively close connections, but I didn't find this desirable in our environment. Thrashing the `knex` connection pool while having RDS Proxy essentially do the same with every query didn't seem like the right path to me: each connection gets pinned, the query is executed, then the connection is released, and on top of that, establishing a database connection for each query execution isn't free. I understand the benefits of what RDS Proxy is offering, but we need to be able to pool connections in our applications for rapid reuse. (Side note: I'm fairly sure RDS Proxy is subscribing to events from the RDS API to know when a failover occurs, in order to redirect traffic in such a short period of time.)
After much digging and many simulated failovers & disconnections, I've concluded that neither `knex` nor `pg` is built to handle the network being ripped out from under your feet. I'm no network engineer, but I believe what RDS is doing when a failover occurs is just that: existing connections aren't closed gracefully, they're left dangling. On the plus side, this is a great way to test for such a network event. How does one automatically recover when your services are happily running, but the database (or another dependency) simply stops responding and your connections are left trying to send data into the ether? Should a database client be smart enough to tackle this? I don't know.
My approach to the problem is two-fold: timeouts & probes.
We'd previously neglected to set global query/statement timeouts on the `pg` client. Setting these at least yields timely errors for your API calls, etc. However, neither `pg` nor `knex` is clever enough to infer that a series of timeout errors might mean the connection is dead and should be discarded. In `knex`, the connection that encountered the timeout isn't provided with the error, so you cannot close it to have it removed from the pool. The timeout error originates from `pg`, but isn't emitted via its `'error'` event; I believe it's thrown and captured by `knex`, which is where the relationship is lost.
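For reference, the kind of global timeouts I mean can be set in the connection config that knex hands straight to `pg` (the numbers below are just examples, not recommendations):

```ts
import knex from 'knex';

const db = knex({
  client: 'pg',
  connection: {
    host: process.env.DB_HOST,
    user: process.env.DB_USER,
    password: process.env.DB_PASSWORD,
    database: process.env.DB_NAME,
    statement_timeout: 10_000,      // server aborts statements after 10s
    query_timeout: 15_000,          // pg stops waiting on the socket after 15s
    connectionTimeoutMillis: 5_000, // fail fast when a new connection can't be made
  },
});
```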
With this in mind, I created a probe or "heartbeat" loop to attach to each `pg.Client` (generated by `knex`). It essentially looks like this:
```ts
import type pg from 'pg';

// `sleep`, `withRetry` (retry with back-off, async-retry-style options)
// and `logger` are our own application helpers.
async function probe(c: pg.Client): Promise<void> {
  let connected = true;
  c.on('end', () => (connected = false));
  try {
    while (connected) {
      await sleep(5_000);
      // A trivial query that should return almost instantly on a healthy connection.
      await withRetry(() => c.query('SELECT 1'), {
        retries: 3,
        minTimeout: 0,
        maxTimeout: 10,
      });
    }
  } catch (error) {
    logger.warn({ error }, 'Probe failed, closing connection');
    await c.end();
  }
}
```
The query above should return in <10ms, and more than a few failures would indicate a connection issue, so we can end the connection. A connection ended in this way can then be identified and removed from the pool. `knex` exposes an `afterCreate` option for its pool (`tarn`) that can be used to manipulate new `pg.Client` instances like this.
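A rough sketch of that wiring (assuming the `probe` function above; untested):

```ts
import knex from 'knex';
import type pg from 'pg';

const db = knex({
  client: 'pg',
  connection: { /* host, user, password, database */ },
  pool: {
    // tarn calls afterCreate with the raw pg.Client before handing it out.
    afterCreate: (conn: pg.Client, done: (err: Error | null, conn: pg.Client) => void) => {
      // Fire-and-forget: if the heartbeat fails, probe() ends the client,
      // which is how it then gets identified and dropped from the pool.
      probe(conn).catch(() => undefined);
      done(null, conn);
    },
  },
});
```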
This completely eliminates the need for RDS Proxy for us. We can fail over the database and expect our services to automatically recover in roughly 15-30s (an estimate based on our testing) while managing their own pools of connections.
What surprised me about this is that I might've imagined PostgreSQL to have a heartbeat built into its client protocol, but apparently not? I'm unaware of a better way to deal with similar loss-of-connection scenarios, and the overhead seems negligible if queries can be serviced faster and failovers are still quite brief.
The "database system is shutting down" error is actually harder (rarer?) to encounter than before. I'll be doing further testing across all our services later this week to further validate the approach.
Hi 👋🏻
Thanks for this awesomely maintained & documented library 🙏🏻
Problem
We're using AWS RDS for PostgreSQL with Multi-AZ and found the failover process taking a little too long for our liking (~5 minutes). During that time, we saw a flood of these errors from our applications:
Digging
From what I understand, the only real option is to force DNS to refresh ASAP. I found `pg` is, in some part, performing DNS lookups itself? I'm not sure whether this is the case for just the native component or for all connections, though. I've also come to learn node has its own DNS troubles. The DNS TTL for RDS is 5 seconds, which is quite a bit less than the 60-120 seconds a failover is supposed to take, and both are much less than the 5-minute outage we saw.
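(For what it's worth, you can check the TTL node's resolver actually sees for the endpoint with `dns.promises.Resolver`; note that `dns.lookup`, which sockets use by default, goes through getaddrinfo and never surfaces the TTL. The hostname below is made up.)

```ts
import { promises as dns } from 'dns';

async function checkTtl(host: string): Promise<void> {
  const resolver = new dns.Resolver();
  // With { ttl: true }, resolve4 returns the remaining TTL for each A record;
  // for an RDS endpoint this should be around 5 seconds.
  const records = await resolver.resolve4(host, { ttl: true });
  console.log(records); // e.g. [ { address: '10.0.1.23', ttl: 5 } ]
}

// Hypothetical RDS endpoint for illustration.
checkTtl('mydb.abc123.us-east-1.rds.amazonaws.com').catch(console.error);
```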
I can't honestly be sure if DNS is the issue here, but it's the only angle I've got so far.
`pg` doesn't expose a way to have direct control over DNS resolution of the host name, which is where I'm at.
Question
Is this something `pg` could manage more gracefully or expose more control over? I'd appreciate any advice to cut down the outage window. However, we are using `pg` via `knex`, which has its own limitations.
Versions
`node`: v14.18
`pg`: v8.7.1
Related
Possibly related (old) issue: https://github.com/brianc/node-postgres/issues/1075