brianc / node-postgres

PostgreSQL client for node.js.
https://node-postgres.com
MIT License

AWS RDS failover triggers "the database system is shutting down" flood #2650

Open shousper opened 3 years ago

shousper commented 3 years ago

Hi 👋🏻

Thanks for this awesomely maintained & documented library 🙏🏻

Problem

We're using AWS RDS for PostgreSQL with Multi-AZ and found the failover process taking a little too long for our liking (~5 minutes). During that time, we saw a flood of these errors from our applications:

{
    "code": "57P03",
    "file": "postmaster.c",
    "length": 106,
    "line": "2344",
    "message": "the database system is shutting down",
    "name": "error",
    "routine": "ProcessStartupPacket",
    "severity": "FATAL",
    "stack": "error: the database system is shutting down\n    at Parser.parseErrorMessage (/app/node_modules/pg-protocol/dist/parser.js:287:98)\n    at Parser.handlePacket (/app/node_modules/pg-protocol/dist/parser.js:126:29)\n    at Parser.parse (/app/node_modules/pg-protocol/dist/parser.js:39:38)\n    at TLSSocket.<anonymous> (/app/node_modules/pg-protocol/dist/index.js:11:42)\n    at TLSSocket.emit (events.js:400:28)\n    at addChunk (internal/streams/readable.js:290:12)\n    at readableAddChunk (internal/streams/readable.js:265:9)\n    at TLSSocket.Readable.push (internal/streams/readable.js:204:10)\n    at TLSWrap.onStreamRead (internal/stream_base_commons.js:188:23)",
    "type": "DatabaseError"
}

Digging

From what I understand, the only real option is to force DNS to refresh as soon as possible. I found that pg appears to perform DNS lookups itself in some places, though I'm not sure whether that applies only to the native component or to all connections.

I've also come to learn that node has its own DNS troubles. The DNS TTL for RDS is 5 seconds, which is quite a bit less than the 60-120 seconds a failover is supposed to take, and both are much less than the 5 minute outage we saw.

I honestly can't be sure DNS is the issue here, but it's the only angle I've got so far. pg doesn't expose a way to take direct control over DNS resolution of the host name, which is where I'm stuck.
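
For anyone else digging into the DNS angle, a small diagnostic along these lines can show whether something is caching beyond the 5 second TTL: it compares what node's OS-level lookup returns against a direct DNS query that reports the record's TTL. This is a sketch, not part of pg, and the endpoint name is a placeholder.

import { promises as dns } from 'dns';

// Placeholder endpoint; substitute your own RDS endpoint.
const RDS_ENDPOINT = 'mydb.xxxxxxxx.ap-southeast-2.rds.amazonaws.com';

async function compareResolution(host: string): Promise<void> {
  // lookup() goes through the OS resolver, i.e. what net.connect would use.
  const cached = await dns.lookup(host);
  // resolve4() queries the name servers directly and reports each record's TTL.
  const live = await dns.resolve4(host, { ttl: true });
  console.log('lookup()  ->', cached.address);
  console.log('resolve4()->', live.map((r) => `${r.address} (ttl ${r.ttl}s)`).join(', '));
}

compareResolution(RDS_ENDPOINT).catch(console.error);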

Question

Is this something pg could manage more gracefully or expose more control over? I'd appreciate any advice on cutting down the outage window. However, we're using pg via knex, which has its own limitations...

Versions

node: v14.18
pg: v8.7.1

Related

Possibly related (old) issue: https://github.com/brianc/node-postgres/issues/1075

FagnerMartinsBrack commented 2 years ago

I'm also having a similar issue. In my case, the client system shuts down, so I have to restart the client. Quite bad if you want reliability. It's probably related to my system not having an error handler as suggested by https://github.com/brianc/node-postgres/issues/821#issuecomment-127101245 and https://github.com/brianc/node-postgres/issues/1075#issuecomment-232044034.

In your issue, are you using an error handler suggested in the other threads to catch the errors until the DNS change is completed? Are you doing exponential back-off on your read/write SQL queries?
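
For reference, the handler those threads suggest boils down to something like this (a minimal sketch; without it, an error emitted by an idle client, e.g. the backend dropping the connection mid-failover, surfaces as an unhandled 'error' event and can crash the process):

import { Pool } from 'pg';

const pool = new Pool({ /* connection settings */ });

// Errors on idle clients land here instead of crashing the process;
// the pool then discards the broken client.
pool.on('error', (err) => {
  console.error('Unexpected error on idle client', err);
});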

shousper commented 2 years ago

@FagnerMartinsBrack I've not reached any solution I'm happy with yet.

knex doesn't allow you to manage connections (or pools) yourself. I could perhaps write an entire database driver wrapper around pg to handle the connections better, but I fear it'd require hacky interactions with knex internals. You can catch errors with some configuration hooks, but you can't evict a connection from the pool without creating more problems. The only other option I'm considering is wrapping the entire Knex object to implement an exponential back-off retry on a specific set of errors, but these solutions are all terrifying and would require exhausting & expensive levels of testing to verify.

EDIT: Also considering submitting to AWS and paying them for RDS Proxy to make the problem go away, lol.
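
A rough sketch of that "wrap the entire Knex object in an exponential back-off retry" idea, for anyone curious; the retryable error codes and delays below are purely illustrative, not a vetted list:

// Which error codes are safe to retry depends entirely on your workload.
const RETRYABLE_CODES = new Set(['57P03', 'ECONNRESET', 'ETIMEDOUT']);

async function withBackoff<T>(fn: () => Promise<T>, retries = 5): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err: any) {
      if (attempt >= retries || !RETRYABLE_CODES.has(err?.code)) throw err;
      // Exponential back-off, capped: 100ms, 200ms, 400ms, ... up to 5s.
      const delay = Math.min(100 * 2 ** attempt, 5_000);
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}

// Usage (assuming a configured knex instance named `knex`):
// const user = await withBackoff(() => knex('users').where({ id }).first());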

FagnerMartinsBrack commented 2 years ago

@shousper I didn't know RDS Proxy even existed. I'm using node-postgres without knex. I guess if I use RDS Proxy, I wouldn't need to use the Pool class from pg and could just point the host at the proxy? Sounds like a great alternative if that improves the Multi-AZ reliability to almost real-time!

Can you let us know if RDS Proxy worked for you after you try it?

FagnerMartinsBrack commented 2 years ago

Can you let us know if RDS Proxy worked for you after you try it?

Ok, I did test it. I switched to RDS Proxy by changing only the "DB_HOST" field, with no other changes (including none to the Pool code). Result: less than 5s of downtime after a reboot with failover.

I'll keep paying AWS for that stability and keep RDS Proxy + Multi-AZ active.
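
For anyone following along, the change really is just pointing the client at the proxy endpoint; a minimal sketch (the env variable names are whatever your app already uses, and the endpoint is a placeholder):

import { Pool } from 'pg';

// Same pooling code as before; only the host changes to the RDS Proxy endpoint,
// e.g. my-proxy.proxy-xxxxxxxx.ap-southeast-2.rds.amazonaws.com
const pool = new Pool({
  host: process.env.DB_HOST,
  port: 5432,
  user: process.env.DB_USER,
  password: process.env.DB_PASSWORD,
  database: process.env.DB_NAME,
  ssl: true, // TLS is commonly required when connecting through RDS Proxy
});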

shousper commented 2 years ago

Okay, I've done about as much thorough testing as I can. I'm honestly not sure any problem necessarily lies with either pg or knex, but I can say that RDS Proxy didn't solve anything for us. It actually made matters worse.

tl;dr. If you want fast failover, make sure your DNS resolution respects the TTL and use a heartbeat loop to determine if a connection should be discarded.

knex uses prepared statements exclusively, which results in a 100% session-pinning rate on RDS Proxy. We found some folks hacking on the tarn config along with event hooks to aggressively close connections, but I didn't find that desirable in our environment: thrashing the knex connection pool while RDS Proxy essentially does the same with every query didn't seem like the right path to me. Each connection gets pinned, the query is executed, then the connection is released, and on top of that, establishing a database connection for every query execution isn't free. I understand the benefits of what RDS Proxy is offering, but we need to be able to pool connections in our applications for rapid reuse. (Side note: I'm fairly sure RDS Proxy subscribes to events from the RDS API to know when a failover occurs, which is how it can redirect traffic in such a short period of time.)
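
For completeness, the "hack on the tarn config" approach mentioned above looks roughly like this; values are illustrative, and as noted, it wasn't a path we wanted to take:

import knex from 'knex';

const db = knex({
  client: 'pg',
  connection: { host: process.env.DB_HOST /* user, password, database, ... */ },
  pool: {
    min: 0,                    // allow the pool to drain completely when idle
    max: 10,
    idleTimeoutMillis: 1_000,  // evict idle connections aggressively
    reapIntervalMillis: 500,   // check for evictable connections frequently
    acquireTimeoutMillis: 10_000,
  },
});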

After much digging and many simulated failovers & disconnections, it's clear that neither knex nor pg is built to handle the network being ripped out from under your feet. I'm no network engineer, but I believe that's exactly what RDS does when a failover occurs: existing connections aren't closed gracefully, they're left dangling. On the plus side, this is a great way to test for such a network event. How do you automatically recover when your services are happily running but the database (or another dependency) simply stops responding and your connections are left trying to send data into the ether? Should a database client be smart enough to tackle this? I don't know.

My approach to the problem is two-fold: timeouts & probes.

We'd previously neglected to set global query/statement timeouts on the pg client. Setting these at least yields timely errors for your API calls; however, neither pg nor knex is clever enough to infer that a series of timeout errors might mean the connection is dead and should be discarded. In knex, the connection that encountered the timeout isn't provided alongside the error, so you cannot close it to have it removed from the pool. The timeout error originates from pg but isn't emitted via its 'error' event; I believe it's thrown & captured by knex, which is where the link between error and connection is lost.
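
For reference, these timeouts can be set globally on the pg config; a sketch with illustrative values (when going through knex, they should pass through via the connection settings):

import { Pool } from 'pg';

const pool = new Pool({
  host: process.env.DB_HOST,
  connectionTimeoutMillis: 5_000,  // give up on establishing a connection
  statement_timeout: 10_000,       // enforced server-side, per statement
  query_timeout: 15_000,           // enforced client-side by pg
});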

With this in mind, I created a probe or "heartbeat" loop to attach to each pg.Client (generated by knex). It essentially looks like this:

import * as pg from 'pg';

// sleep, withRetry and logger are application helpers (not part of pg);
// see the stand-ins sketched below.
async function probe(c: pg.Client): Promise<void> {
  let connected = true;
  // Stop probing once the connection ends for any reason.
  c.on('end', () => (connected = false));

  try {
    while (connected) {
      await sleep(5_000);
      // A trivial round-trip; if it keeps failing, assume the connection is dead.
      await withRetry(() => c.query('SELECT 1'), {
        retries: 3,
        minTimeout: 0,
        maxTimeout: 10,
      });
    }
  } catch (error) {
    logger.warn({ error }, 'Probe failed, closing connection');
    await c.end();
  }
}
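
(The sleep and withRetry calls above are application helpers that aren't shown here; minimal stand-ins might look like the following, though the real implementations may differ.)

function sleep(ms: number): Promise<void> {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

interface RetryOptions {
  retries: number;
  minTimeout: number; // delay before the first retry, in ms
  maxTimeout: number; // upper bound on the retry delay, in ms
}

async function withRetry<T>(fn: () => Promise<T>, opts: RetryOptions): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= opts.retries; attempt++) {
    try {
      return await fn();
    } catch (error) {
      lastError = error;
      // Exponential back-off between attempts, bounded by maxTimeout.
      await sleep(Math.min(opts.minTimeout * 2 ** attempt, opts.maxTimeout));
    }
  }
  throw lastError;
}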

The query above should return in under 10ms, so more than a few consecutive failures indicate a connection issue and we can end it. A connection ended this way can then be identified and removed from the pool. knex exposes an afterCreate option for its pool (tarn) that can be used to hook into new pg.Client instances like this.
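
Wiring the probe up through afterCreate looks roughly like this (a sketch, assuming the probe function above and that knex/tarn hands the raw pg.Client to the hook):

import knex from 'knex';
import * as pg from 'pg';

const db = knex({
  client: 'pg',
  connection: { host: process.env.DB_HOST /* ... */ },
  pool: {
    // tarn calls afterCreate with the raw pg.Client before handing it out.
    afterCreate: (conn: pg.Client, done: (err: Error | null, conn: pg.Client) => void) => {
      // Fire-and-forget: the probe ends the connection itself if it stops responding.
      probe(conn).catch(() => { /* failures are logged inside probe */ });
      done(null, conn);
    },
  },
});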

This completely eliminates the need for RDS Proxy for us. We can fail over the database and expect our services to recover automatically in roughly 15-30s (an estimate based on our testing so far) while managing their own pools of connections.

What surprised me is that I'd have imagined PostgreSQL had a heartbeat built into its client protocol... but apparently not? I'm unaware of a better way to deal with this kind of connection-loss scenario, and the overhead seems negligible when queries can still be serviced quickly and failovers stay brief.
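
One related knob worth mentioning: pg can enable TCP keepalives on the socket, which may also help surface dead connections without an application-level heartbeat, though how quickly depends on the OS keepalive settings rather than anything pg controls. A sketch:

import { Pool } from 'pg';

const pool = new Pool({
  host: process.env.DB_HOST,
  keepAlive: true,                     // ask the OS to probe idle sockets
  keepAliveInitialDelayMillis: 10_000, // wait 10s of idleness before probing
});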

The "database system is shutting down" error is actually harder (rarer?) to encounter than before. I'll be doing further testing across all our services later this week to further validate the approach.