I'm able to reproduce this with pg v4.4.0 on Linux. What's interesting is that you don't even have to run a query.
Here's the sample CoffeeScript program I used to test it (output is below). It creates a pooled connection, runs a test query, then prints that it's still alive every 100 ms until it dies. Per the defaults, the pooled connection should remain open for 30 seconds, which is enough time to manually stop the PostgreSQL server.
pg = require 'pg'

testQuery = (cb) ->
  pg.connect process.env.DATABASE_URL, (err, client, done) ->
    if err then return cb(err)
    client.query 'SELECT 1 AS x', (err, result) ->
      done(err)
      if err then return cb(err)
      console.log '%s - Ran a test query successfully', new Date()
      return cb(null)

showAlive = () ->
  console.log "%s - Still alive", new Date()
  setTimeout showAlive, 100

testQuery (err) ->
  if err
    console.error "Gracefully received error: ", err
    process.exit(1)
  showAlive()
Running with the server offline gives (proper behavior):
Gracefully received error: [Error: Connection terminated]
Running with the server online and then shutting it down gives (node process dies):
Fri Jul 31 2015 19:10:38 GMT-0400 (EDT) - Ran a test query successfully
Fri Jul 31 2015 19:10:38 GMT-0400 (EDT) - Still alive
[... "Still alive" repeats every ~100 ms ...]
Fri Jul 31 2015 19:10:41 GMT-0400 (EDT) - Still alive
events.js:72
throw er; // Unhandled 'error' event
^
error: terminating connection due to administrator command
at [object Object].Connection.parseE (/home/sehrope/tmp/node-pg-test/node_modules/pg/lib/connection.js:539:11)
at [object Object].Connection.parseMessage (/home/sehrope/tmp/node-pg-test/node_modules/pg/lib/connection.js:366:17)
at Socket.<anonymous> (/home/sehrope/tmp/node-pg-test/node_modules/pg/lib/connection.js:105:22)
at Socket.emit (events.js:95:17)
at Socket.<anonymous> (_stream_readable.js:765:14)
at Socket.emit (events.js:92:17)
at emitReadable_ (_stream_readable.js:427:10)
at emitReadable (_stream_readable.js:423:5)
at readableAddChunk (_stream_readable.js:166:9)
at Socket.Readable.push (_stream_readable.js:128:10)
at TCP.onread (net.js:529:21)
Sure enough, stopping the remote PostgreSQL server while the program is running kills the node process. I would have expected that further use of the pooled connection (say, running a query) would return an error, and a properly written application would then evict that connection from the pool. Eventually things would fix themselves once the server came back online. Instead, the uncaught exception kills the node process, since the error is internal to the driver and there's no handler for it.
I've got a reproducible test case for this in a PR: https://github.com/brianc/node-postgres/pull/824
It's a bit easier than manually bouncing the server, as we can use pg_terminate_backend(...) to kill the connection from a separate connection.
There's a similar test case for this already (see connection-pool/error-tests.js). The difference is that the existing test leaves the pooled connection checked out while it terminates the backend. The new test (which reproduces this issue) returns the connection to the pool prior to terminating it.
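In outline, the failing scenario looks like this (a hedged sketch against the pg v4 API, not the actual PR test; it assumes client.processID is populated by the driver):

// check out a pooled client, return it to the pool,
// then terminate its backend from a standalone connection
var pg = require('pg');

pg.connect(process.env.DATABASE_URL, function (err, client, done) {
  if (err) throw err;
  var pid = client.processID; // backend pid of the pooled client
  done(); // return the client to the pool, still connected and idle

  // use a standalone client so we don't check the idle client back out
  var killer = new pg.Client(process.env.DATABASE_URL);
  killer.connect(function (err) {
    if (err) throw err;
    killer.query('SELECT pg_terminate_backend($1)', [pid], function (err) {
      killer.end();
      // without a pg.on('error') handler, the idle pooled client's
      // error is unhandled and the node process crashes
    });
  });
});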
I would think the issue is fairly obvious.
You must be using a connection pool to get connections. Ideally, the library would be able to detect a special message from the server and clear the current connection pool, but that's not as simple as it sounds.
If you want your application to be smart enough to detect server restarts, you need to catch that error and reset the connection pool, so that the next connection is actually created.
To confirm this theory: if your app waits out the idle timeout (poolIdleTimeout) before re-using the connection, you will see the problem go away by itself.
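For reference, a minimal way to experiment with that timeout in the pg v4 era (assuming pg.defaults.poolIdleTimeout, which existed at the time; the value here is illustrative):

var pg = require('pg');

// idle pooled connections are closed after this many milliseconds
// (the default at the time was 30000)
pg.defaults.poolIdleTimeout = 1000;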
Yes, that's definitely the issue. The question now is how to handle it gracefully. Either the connection should be evicted from the pool (probably preferred), or the error should be saved and presented to the next user of that connection from the pool (so it can be gracefully handled).
The best I can suggest right now: call pg.end() when you get a server-disconnect error. That will effectively shut down everything, which is pretty much what you need. And if the server is up and running, you will be able to acquire a new connection right after that.
The best I can suggest right now: call pg.end() when you get a server-disconnect error.
It should only be for the pool that received the error. Calling pg.end() would close all connections for all pools. Also, it's possible to get this error for a single connection if it's manually terminated by calling pg_terminate_backend(). It's not necessarily a server shutdown that would impact all connections in the pool.
Yes, you are right. I created a feature request where I mentioned the issue with having multiple servers connected. Like I said, this is the best answer you can get right now, because node-postgres doesn't really have any good provision for handling server shutdowns and restarts.
The pg object itself emits an error anytime a client which is connected and idle in the pool receives an error. If you can, try handling the error at that level.
// somewhere in app bootstrapping code:
pg.on('error', function (err, client) {
  // do whatever you need to do to handle errors from idle clients
})
If you come up with a general-purpose solution to handle this situation in all cases, I'm definitely interested in pulling it in. The problem is, dealing with errors is hard and every app has a different error handling scenario, so the best thing to do from library code like this is usually to emit the error as quickly as possible and allow someone to deal with it.
The pg object itself emits an error anytime a client which is connected and idle in the pool receives an error. If you can, try handling the error at that level.
That should be perfect. I just took a peek at the pool code, and it includes both the error and the errant client, so it should be pretty easy to auto-evict it.
Thanks guys... With the pg.on(...) function and some extra error handling I was able to get it to reconnect successfully.
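For posterity, the shape of that handling is roughly the following (a hedged sketch against the pg v4 API, not @rfshelley's actual code; whether the pool evicts the client for you depends on the version, per the discussion above):

var pg = require('pg');

pg.on('error', function (err, client) {
  // an idle pooled client lost its connection (e.g. server shutdown);
  // swallowing the error here keeps the process alive so later
  // pg.connect() calls can try to establish a fresh connection
  console.error('idle client error:', err.message);
});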
Great @rfshelley glad you got it figured out!
For posterity: if using the pgPool, put the error handler there instead of the pg object. Here's a wrapper that is resilient through most transient db issues (I've been using it on multi-AZ failover on AWS RDS):
const pg = require("pg");
const promiseRetry = require("promise-retry");

class Db {
  constructor(pgPool) {
    this.pgPool = pgPool;
    this.pgPool.on("error", err => {
      // just need to catch this error, don't need to do anything
    });
  }

  static create() {
    const poolArgs = {
      user: process.env.PGUSER || "postgres",
      database: process.env.PGDATABASE || "postgres",
      host: process.env.PGHOST || "localhost",
      password: process.env.PGPASSWORD || ""
    };
    const pgPool = new pg.Pool(poolArgs);
    return new this(pgPool);
  }

  runQuery(query, args = []) {
    return promiseRetry({ retries: 15, randomize: true }, (retry, attempt) => {
      return this.pgPool
        .query(query, args)
        .then(queryResult => queryResult.rows)
        .catch(err => {
          if (err.code === "ECONNREFUSED") {
            console.log(`retrying query, attempt #${attempt}`);
            return retry(err);
          } else {
            return Promise.reject(err);
          }
        });
    });
  }
}

module.exports = Db;
Note: that's 15 retries with exponential backoff, which is way more than enough time for RDS issues (you probably only need 1 retry).
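A hypothetical usage example of the wrapper above (file name and query are illustrative):

const Db = require("./db");

const db = Db.create();
db.runQuery("SELECT now() AS ts")
  .then(rows => console.log("server time:", rows[0].ts))
  .catch(err => console.error("query failed after all retries:", err));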
Added this to the docs milestone so I can document error handling better.
Hi @brianc,
I'm really confused by the "terminating connection due to administrator command" error. I've got an error handler on the pool:
pool.on("error", err => {
// handle the error
})
and from the pool I'm creating a new connection as follows:
terminateAllConnections = async (callback) => {
  if (pool) {
    let res, db
    try {
      db = await pool.connect()
      db.query("SELECT pg_terminate_backend(pg_stat_activity.pid) \
        FROM pg_stat_activity \
        WHERE datname = $1::text \
        AND pid <> pg_backend_pid();", [pDatabase])
    }
    catch (err) {
      console.log(err)
      db.release(err)
    }
    finally {
      db.release()
      callback()
    }
  }
}
When the query runs and fails with "terminating connection due to administrator command", the error handler is not called.
However (to test), when I deliberately trigger an error with db.emit('error', new Error('OMG!')) instead of running a query, it successfully reaches the error handler; running the query does not, and it exits the node application.
What might be wrong here?
@ashking How are you running the query that gets cancelled? Can you reproduce the problem in a single script?
The promise and callback APIs for query are not exactly symmetrical, so exceptions can leak through even when you think you've handled them all. Still tracking it down, but the bug appears to be here:
Query.prototype.handleError = function (err, connection) {
The end of this function in query.js always emits an error if the promise API is being used (since no callback is set), which happens when the query is not created directly but built via calls to Client or Pool, so no handler is set (or is easy to set).
@DanielJoyce A query's callback is still set when the promise API is being used. See https://github.com/brianc/node-postgres/blob/e4578d2c7bca60cfb02ec004daa22c8ff810140e/lib/client.js#L475
It wouldn't be set when creating a Query object manually.
Yeah I just noticed. So that is not the problem.
I'm seeing the error when I get a client from a pool and do a client.release(true) while the client is running a long-running query on the server. This causes the pool to kill the client connection, which causes Postgres to terminate the worker (the only way to reliably kill a long-running query).
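A minimal sketch of that pattern (the query, timing, and logging are illustrative, not the original code):

const { Pool } = require("pg");
const pool = new Pool();

async function abortLongQuery() {
  const client = await pool.connect();

  // start a long-running query; the promise is deliberately not awaited yet
  const pending = client.query("SELECT pg_sleep(60)");

  // destroy the client while the query is in flight; passing true removes
  // it from the pool and closes the socket, so Postgres terminates the backend
  setTimeout(() => client.release(true), 1000);

  try {
    await pending; // rejects once the connection is torn down
  } catch (err) {
    console.error("query failed:", err.message);
  }
}

In the report that follows, though, a second copy of the error still escapes as an unhandled rejection despite handling like this.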
But this then seems to trigger:
{ Error: Connection terminated
at Connection.con.once (/home/danieljoyce/git/tesla/projects/tesla-data-point/node_modules/pg/lib/client.js:251:9)
at Object.onceWrapper (events.js:277:13)
at Connection.emit (events.js:194:15)
at Connection.EventEmitter.emit (domain.js:441:20)
at Socket.<anonymous> (/home/danieljoyce/git/tesla/projects/tesla-data-point/node_modules/pg/lib/connection.js:77:10)
at Socket.emit (events.js:189:13)
at Socket.EventEmitter.emit (domain.js:441:20)
at TCP._handle.close (net.js:600:12)
level: 'error',
[Symbol(level)]: 'error',
[Symbol(message)]: '{"level":"error","message":"Connection terminated"}' }
which bubbles up and kills node (unhandled promise rejection).
And no amount of on('error') handlers on query, client, or pool, or try/catch blocks, seems to allow me to catch and handle it.
It seems the Connection terminated error is being raised twice. I can catch it once, but it's emitted somewhere else in a way I can't catch. Maybe the connection creation in the pool is somehow reusing a stale socket?
In fact, I catch it at a lower level and wrap it in a DatabaseError, but it still seems to emit this error as part of an unhandled promise rejection.
Using pg 7.12.0 with pg-pool and manual connection management. I think this is happening somewhere in the client. It could be something on my end, but I am catching the error and wrapping/handling it, yet Node is reporting the raw error from node-pg as not being handled.
Pool and client error handling are documented between https://node-postgres.com/api/pool/ and https://node-postgres.com/api/client/, but a recent pg-pool minor version makes some additional error handling necessary in a bit of an unexpected way. We should open a new issue to discuss different APIs. Unexpected errors with error handling that's thought to be right on more recent pg versions (@DanielJoyce) that aren't the same as the original issue here should also be new issues if they still apply.
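For reference, the pattern those docs describe boils down to handling errors in both places (a sketch; the query and log messages are illustrative):

const { Pool } = require("pg");
const pool = new Pool();

// errors on clients sitting idle in the pool surface here;
// without this handler they crash the process
pool.on("error", (err, client) => {
  console.error("idle client error:", err.message);
});

async function run() {
  const client = await pool.connect();
  try {
    // errors on a checked-out client surface on the query promise
    const res = await client.query("SELECT 1 AS x");
    return res.rows;
  } finally {
    client.release();
  }
}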
I've connected to the database, then explicitly stopped the DB, and on the next call to query the DB I get the error:
This occurred from a simple Client.query() call on a pooled connection. The thrown exception kills the server and requires me to restart it. Is there a better way to gracefully handle downed databases with a reconnect policy, or at least catch the exception?