brianc / node-postgres

PostgreSQL client for node.js.
https://node-postgres.com
MIT License

7.18.2: "Connection terminated unexpectedly" when using client.query with a pool when pool has been idle for 10 minutes (running in AWS Lambda) #2112

Open jcollum opened 4 years ago

jcollum commented 4 years ago

The code is running in a Lambda. When we run the function below, it runs fine. But if we run it, then wait 10 minutes, then run it again, we get an error every time. We are querying Aurora Postgres on AWS.

Code:

const {Pool} = require('pg');

// Global Connection, can be re-used!!
const pgPool = new Pool({
    user: process.env.PG_USER,
    host: process.env.PG_HOST,
    database: process.env.PG_DATABASE,
    password: process.env.PG_PASSWORD,
    port: process.env.PG_PORT,
    max: process.env.MAX_CLIENTS
});

pgPool.on('error', (err, client) => {
    console.error('Unexpected error in Postgres connection pool', err);
});

async function performQuery(event) {

    let queryString = null;
    let args = null;

    try {
        // some setup code here...  
        const client = await pgPool.connect();
        try {
            const res = await client.query(queryString, args);
            return res.rows;
        } finally {
            client.release();
        }
    } catch (err) {
        console.error('Problem executing export query:');
        console.error(err); // <-- this line is in the log below  
        throw err;
    }
}

This is what I see in the cloudwatch logs:

{
    "errorType": "Error",
    "errorMessage": "Connection terminated unexpectedly",
    "stack": [
    "Error: Connection terminated unexpectedly",
    "    at Connection.<anonymous> (/var/task/node_modules/pg/lib/client.js:255:9)",
    "    at Object.onceWrapper (events.js:312:28)",
    "    at Connection.emit (events.js:223:5)",
    "    at Connection.EventEmitter.emit (domain.js:475:20)",
    "    at Socket.<anonymous> (/var/task/node_modules/pg/lib/connection.js:78:10)",
    "    at Socket.emit (events.js:223:5)",
    "    at Socket.EventEmitter.emit (domain.js:475:20)",
    "    at TCP.<anonymous> (net.js:664:12)"
]
}

I've tried a few variations on this, but the constants are the 10-minute wait and the use of the Pool. To me this code is almost identical to the code in https://node-postgres.com/features/pooling.

So far it looks like the problem has been solved by using a Client instead:

const client = new Client({
    user: process.env.PG_USER,
    host: process.env.PG_HOST,
    database: process.env.PG_DATABASE,
    password: process.env.PG_PASSWORD,
    port: process.env.PG_PORT
});

await client.connect();
const res = await client.query(queryString, args);
await client.end();
return res.rows;
jamesdixon commented 4 years ago

@jcollum this is also happening for us and a number of others (see this thread https://github.com/knex/knex/issues/3636).

We're seeing this in Node 10 / 12 in the Lambda environment. Only solution at the moment was to run a custom Node 8 runtime for our Lambda functions.

briandamaged commented 4 years ago

@jamesdixon : Ah -- I was just about to point you to this thread! lol

592da commented 4 years ago

Dealing with the same issue in production, working with Serverless over AWS Lambdas, Node v12 (via TypeORM).

@briandamaged @jamesdixon any solution? I did not manage to set the runtime to Node 8...

please help

brianc commented 4 years ago

It looks like it's related to the lambda environment or something and not pg. According to this it's hitting both pg & mysql... which makes it somewhat unlikely to be caused by this library. I'm not sure there's much I can do w/o having steps to reproduce, unfortunately. Looking at @jcollum's example it looks like it's somehow related to the lambda going idle and killing the open connections to the db. Weirdly, if you only run the lambda once every 10 minutes, that should be well outside the idle timeout and the pool should have closed all its connections. Anyway, I can make lots of assumptions about what's happening, but w/o steps to reproduce I'm not gonna be able to provide much concrete support. I'll follow along, and if more info comes up I'll look into it. Also: PRs are always welcome for issues like this if you figure out what it is! 😉

jamesdixon commented 4 years ago

Thanks for the input, @brianc!

According to this it's hitting both pg & mysql....which makes it somewhat unlikely to be caused by this library

That was the original thought but later on in the thread you referenced, one of the Knex maintainers pointed out that the MySQL error was actually different and may not be related.

The one thing we do know is that the issue isn't present in Node 8 but rears its ugly head in Node 10/12. Are you aware of any major changes in those versions related to this library that might be exacerbated in a Lambda environment?

briandamaged commented 4 years ago

@jamesdixon : Howdy! Just to clarify: what I meant was that Knex did not appear to be directly related to the pg-specific control path. However, I suspect that the same lower-level issue is affecting both "pg" and "mysql".

My guess is that either the server or the network is aggressively closing the connections. This is then causing the client-side libraries to "get confused" when they discover that the connection is already closed. (At least, that appears to be the case for "pg". The Client class checks the this._ending flag to see whether or not it was expecting the connection to be closed)
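
Roughly, the check being described behaves like this (a simplified sketch of the idea, not the actual pg source):

// Simplified sketch: when the socket ends, the client only surfaces an error
// if it wasn't already closing the connection on purpose.
connection.on('end', () => {
  if (!client._ending) {
    // The server or network closed the socket unexpectedly.
    client.emit('error', new Error('Connection terminated unexpectedly'));
  }
});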

jamesdixon commented 4 years ago

Got it! Thanks for the clarification!


jcollum commented 4 years ago

I fixed it by not using a pool:

try {
    // trimmed code here

    const client = new Client({
        user: process.env.PG_USER,
        host: process.env.PG_HOST,
        database: process.env.PG_DATABASE,
        password: process.env.PG_PASSWORD,
        port: process.env.PG_PORT
    });

    await client.connect();
    const res = await client.query(queryString, args);
    await client.end();
    return res.rows;

} catch (err) {
    console.error('Problem executing export query:');
    console.error(err);
    throw err;
}

Minor performance hit there.

briandamaged commented 4 years ago

@jcollum : My guess is that the issue isn't actually "fixed", but rather "avoided". As in: whenever your code needs to perform a query, it is:

  1. Establishing a new connection (via the freshly-created Client instance)
  2. Performing the query
  3. Ending the connection cleanly. (by invoking client.end() )

BTW: it looks like there is a connection leak in that snippet. It should probably do this:

try {
    // trimmed code here

    const client = new Client({
        user: process.env.PG_USER,
        host: process.env.PG_HOST,
        database: process.env.PG_DATABASE,
        password: process.env.PG_PASSWORD,
        port: process.env.PG_PORT
    });

    await client.connect();
    try {
        const res = await client.query(queryString, args);
        return res.rows;
    } finally {
        await client.end();
    }

} catch (err) {
    console.error('Problem executing export query:');
    console.error(err);
    throw err;
}

This will ensure that the connection will be closed even if an error occurs.

brianc commented 4 years ago

Yah opening a new client every time is something you can do, but the latency overhead of opening a client is approx 20x the overhead of sending a query... so ideally you'd use a pool. You don't have to, though; particularly if your lambda is going cold, pooling likely isn't buying you much. There might be a way to simulate this in a test by creating a pool, reaching into the internals, and severing the streams manually behind the scenes. Anyone care to make a first-pass crack at that in a gist or something? If it reproduces the issue I can spin it into a test, make it pass, and ship a fix!

jcollum commented 4 years ago

@jcollum : My guess is that the issue isn't actually "fixed", but rather "avoided". As in: whenever your code needs to perform a query, it is:

Well, yes, it's not a fix, it's a workaround. Thanks for catching that potential leak.

briandamaged commented 4 years ago

@brianc : Thinking about this a bit more... I suspect that both "pg" and "knex" are encountering the same conceptual issue in their Pool implementations. Specifically: they are not re-checking the validity of the Client instances before yielding them from the pool. Here's what I'm seeing on the "knex" side, at least:

https://github.com/knex/knex/blob/8c07192ade0dde137f52b891d97944733f50713a/lib/client.js#L333-L335

Instead, this function should probably delegate to a pg-specific function that confirms that the Client instance is still valid.

I'll check the "pg" codebase in a few mins to see if I can spot a similar behavior there.
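
For illustration, the kind of validity check I have in mind might look roughly like this (a hypothetical sketch, not an existing knex or pg API; the property names are based on the internals referenced in this thread and may differ between pg versions):

// Hypothetical sketch of a pg-specific validity check a pool could run
// before handing out a client.
function isClientUsable(client) {
  const stream = client.connection && client.connection.stream;
  return Boolean(stream && !stream.destroyed && stream.writable);
}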

briandamaged commented 4 years ago

@brianc : So, it looks like the "pg" pool has some logic for removing idle clients; however, I suspect that it does not get invoked unless the underlying Client instance actually emits an "error". For ex:

https://github.com/brianc/node-postgres/blob/1c8b6b93cfa108a0ad0e0940f1bb26ecd101087b/packages/pg-pool/index.js#L216

However, it looks like the Client doesn't necessarily emit the "error" event when it detects that the connection has been closed:

https://github.com/brianc/node-postgres/blob/1c8b6b93cfa108a0ad0e0940f1bb26ecd101087b/packages/pg/lib/client.js#L252-L279

So, if this analysis is correct, then it means there is a way for the Pool to contain Client instances that have already been disconnected. (Or rather: their underlying connection instance has already been ended)

briandamaged commented 4 years ago

@brianc : Here's a snippet that seems to confirm my suspicion:

const {Pool} = require('pg');

function createPool() {
  return new Pool({
      user: process.env.PG_USER,
      host: process.env.PG_HOST,
      database: process.env.PG_DATABASE,
      password: process.env.PG_PASSWORD,
      port: process.env.PG_PORT,
      max: process.env.MAX_CLIENTS
  });
}

async function run() {
  const pool = createPool();

  const c1 = await pool.connect();
  // Place the Client back into the pool...
  await c1.release();

  // Now intentionally end the lower-level Connection while Client is already in the pool
  c1.connection.end();

  // Attempt to obtain the connection again...
  const c2 = await pool.connect();
  try {
    // Explodes
    const res = await c2.query("select * from accounts", null);
    console.log("Yay");
  } catch(err) {
    console.log("Initial error")
    console.log("----------------")
    console.log(err);
  } finally {
    c2.release();
  }

  // Attempt to obtain the connection again...
  const c3 = await pool.connect();
  try {
    // Surprisingly, it succeeds this time.  This is because the pool was already
    // "fixed" thanks to the 'error' event that it overheard.
    const res = await c3.query("select * from accounts", null);
    console.log("Yay");
  } catch(err) {
    console.log("Second error")
    console.log("----------------")
    console.log(err);
  } finally {
    c3.release();
  }
}

run();

Sample output:

Initial error
----------------
Error: write after end
    at writeAfterEnd (_stream_writable.js:236:12)
    at Socket.Writable.write (_stream_writable.js:287:5)
    at Socket.write (net.js:711:40)
    at Connection.query (/Users/brianlauber/Documents/src/knex/node_modules/pg/lib/connection.js:234:15)
    at Query.submit (/Users/brianlauber/Documents/src/knex/node_modules/pg/lib/query.js:164:16)
    at Client._pulseQueryQueue (/Users/brianlauber/Documents/src/knex/node_modules/pg/lib/client.js:446:43)
    at Client.query (/Users/brianlauber/Documents/src/knex/node_modules/pg/lib/client.js:541:8)
    at run (/Users/brianlauber/Documents/src/knex/oops.js:43:26)
    at <anonymous>
    at process._tickCallback (internal/process/next_tick.js:189:7)
(node:57118) UnhandledPromiseRejectionWarning: Error: write after end
    at writeAfterEnd (_stream_writable.js:236:12)
    at Socket.Writable.write (_stream_writable.js:287:5)
    at Socket.write (net.js:711:40)
    at Connection.query (/Users/brianlauber/Documents/src/knex/node_modules/pg/lib/connection.js:234:15)
    at Query.submit (/Users/brianlauber/Documents/src/knex/node_modules/pg/lib/query.js:164:16)
    at Client._pulseQueryQueue (/Users/brianlauber/Documents/src/knex/node_modules/pg/lib/client.js:446:43)
    at Client.query (/Users/brianlauber/Documents/src/knex/node_modules/pg/lib/client.js:541:8)
    at run (/Users/brianlauber/Documents/src/knex/oops.js:43:26)
    at <anonymous>
    at process._tickCallback (internal/process/next_tick.js:189:7)
(node:57118) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 1)
(node:57118) [DEP0018] DeprecationWarning: Unhandled promise rejections are deprecated. In the future, promise rejections that are not handled will terminate the Node.js process with a non-zero exit code.
Yay
brianc commented 4 years ago

That's perfect, I'll get it looked at soon!

briandamaged commented 4 years ago

FYI: It looks like you can trigger the issue one layer deeper as well. Ex:

// Added the `.stream` portion here
c1.connection.stream.end();
jcollum commented 4 years ago

I'm impressed with how quickly this is being looked at

jamesdixon commented 4 years ago

@jcollum agreed.

Shout out to @briandamaged for digging into this. I'm so thankful for people smarter than I am 🤣

brianc commented 4 years ago

I'm impressed with how quickly this is being looked at

I've been trying to work on pg at least 5 days a week this year. I'll be looking at this first thing tomorrow morning. I'm not 100% sure that making the pool evict closed clients will fix the actual specific issue in lambdas, but... it's good behavior to have either way.

jcollum commented 4 years ago

Thanks for being so on top of this.


jamesdixon commented 4 years ago

@brianc thanks for building such an important library. Your efforts are truly appreciated. Cheers!

brianc commented 4 years ago

:heart: thanks. Looking forward to making lots of folks' node apps faster in the next few months when I do some work on perf!


briandamaged commented 4 years ago

@brianc : Here's a variation on the earlier script that provides some more evidence around the issue:

const {Pool} = require('pg');

function delay(t) {
  return new Promise(function(resolve) {
    setTimeout(resolve, t);
  });
}

function createPool() {
  return new Pool({
      user: process.env.PG_USER,
      host: process.env.PG_HOST,
      database: process.env.PG_DATABASE,
      password: process.env.PG_PASSWORD,
      port: process.env.PG_PORT,
      max: process.env.MAX_CLIENTS
  });
}

async function run() {
  const pool = createPool();

  async function doStuff(label) {
    const client = await pool.connect();
    try {
      console.log(`${label}: START`)
      const res = await client.query("select 1", null);
      console.log(`${label}: SUCCEEDED`);
    } catch(err) {
      console.log(`${label}: ERROR  ->  ${err}`)
    } finally {
      client.release();
    }
  }

  const c1 = await pool.connect();
  await c1.query("select 1", null);
  await c1.release();

  // !!!
  // FYI: If you comment out this event handler, then the script
  //      will fail 100% of the time.  (Unhandled 'error' event)
  pool.on('error', function() {
    console.log("Pool handled error cleanly!")
  })

  // c1 is in the pool.  We're intentionally ending the
  // lower-level connection.
  c1.connection.stream.end();

  // !!!
  // FYI: With a 1 second delay, everything appears to work fine.
  //      However, if you reduce this delay too much, then you'll
  //      observe the errors that were reported.
  console.log("Delaying...");
  await delay(1000);
  console.log("Finished delay");

  await doStuff("ONE");
  await doStuff("TWO");
}

run();

There are 2 key things you can play around with here:

  1. Removing the pool.on('error', ...) expression (which will cause the script to explode)
  2. Adjusting / removing the delay. (Errors will begin to occur when the delay is set too low)

Even with a delay of 1 second, the problem seems to disappear. So, now I'm unsure if this script is accurately recreating the scenario described by this issue.

sehrope commented 4 years ago

Wrapping the pool clients would solve a lot of these problems. Right now pool clients are the raw client with additional functions to release the connection back to the pool. That means that code that directly manipulates the connection, ex: c1.connection.stream.end(), can break things internally. Ditto for continuing to use a client after you've told the pool that you're done with it. With a more restricted interface, say just a .query(...) and .release(...), it's harder to shoot yourself in the foot (though with JS it's never completely impossible).
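
For illustration, a restricted wrapper along those lines might look roughly like this (just a sketch, not the actual pg-pool implementation):

// Sketch: callers only get query() and release(), so they can't reach into
// client.connection and break the pool's bookkeeping.
class PooledClient {
  constructor(client, done) {
    this._client = client; // raw pg Client checked out from the pool
    this._done = done;     // release callback provided by the pool
  }

  query(text, values) {
    return this._client.query(text, values);
  }

  release(err) {
    this._done(err); // passing an Error tells the pool to evict the client
  }
}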

You also need to make sure you call .release(err: Error) to evict broken clients:

  async function doStuff(label) {
    const client = await pool.connect();
    try {
      console.log(`${label}: START`)
      const res = await client.query("select 1", null);
      console.log(`${label}: SUCCEEDED`);
    } catch(err) {
      console.log(`${label}: ERROR  ->  ${err}`)
    } finally {
      client.release();
    }
  }

Calling .release() (without a truthy error) instructs the pool that you're done with the client and it's okay to return it to the pool for reuse. With proper eviction you'll still have parts of your code potentially receiving broken clients, but at least they'll eventually be evicted from the pool.

briandamaged commented 4 years ago

@sehrope : Right. The purpose of the code snippet was to recreate a scenario where the low-level socket was being closed/ended outside of the Pool's managed lifecycle. So, you're correct: normally, you would not want to do that.

As for the .release(..) part: Javascript does not support try / catch / else syntax. So, expecting the developer to call .release(..) differently depending upon whether or not an exception was raised seems like a usability issue. Ex: it's not possible to do this:

  async function doStuff(label) {
    const client = await pool.connect();
    try {
      console.log(`${label}: START`)
      const res = await client.query("select 1", null);
      console.log(`${label}: SUCCEEDED`);
    } catch(err) {
      console.log(`${label}: ERROR  ->  ${err}`)
      client.release(err);
    } else {
      client.release();
    }

    // Other logic that needs to run after doing stuff w/ `client`
    // ....
  }

(Okay, fine. It's technically possible, but it forces the developer to create/check didErrorOccur variables)

In either case: the snippet's error is actually triggered by the first call to client.query(..). So, although the detail you pointed out about client.release(..) appears to be true, I'm not sure if it really impacts this snippet in any way. (Side note: the README.md for pg-pool doesn't seem to mention this detail about the call to .release(..), but the code itself looks like it aligns w/ what you're saying. https://github.com/brianc/node-postgres/tree/master/packages/pg-pool )

The main concept is: it's possible for Clients that are already in the Pool to lose their Connections. It's the Pool's responsibility to double-check the validity of a Client before handing it off to a caller. Based upon the issue reported here, it looks like there might be a corner case where this is not occurring.

(Or, alternatively: perhaps several people missed this detail about .release(..), and everybody has been inserting disconnected Clients back into the pool???)

sehrope commented 4 years ago

As for the .release(..) part: Javascript does not support try / catch / else syntax. So, expecting the developer to call .release(..) differently depending upon whether or not an exception was raised seems like a usability issue.

Yes, it's a bit tricky, and hopefully when the pool interface gets improved it'll be easier to use. That "pass an error to evict" logic for the pool has been there for all the years I've used this driver. All my usage of the driver has a local wrapper that handles that logic so it's in one place in each app, but you're right that everybody kind of has to handle it themselves.

(Okay, fine. It's technically possible, but it forces the developer to create/check didErrorOccur variables)

Yes that's the logic to which I'm referring. Though rather than checking if an error occurred the usual approach is to have different release(...) calls based on whether the task resolved or rejected:

const { Pool } = require('pg');
const pool = new Pool({ /* ... */ });

const withClient = async (task) => {
    const client = await pool.connect();
    try {
        const result = await task(client);
        // If we get here then the task was successful so assume the client is fine
        client.release();
        return result;
    } catch (err) {
        // If we get here then the task failed so assume the client is broken
        client.release(err);
        throw err;
    }
};

That logic is a bit overzealous as it'll evict clients on potentially salvageable errors (ex: after a successful ROLLBACK we'd expect the client to be usable again), but it's a decent default. If you expect to have lots of constraint violations that could thrash your pool, you can make it a bit more intelligent to deal with that situation or have a transaction wrapper handle that.
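
For example, a transaction wrapper along those lines might look like this (a minimal sketch reusing the pool from the snippet above; it only evicts the client when the ROLLBACK itself fails):

const withTransaction = async (task) => {
    const client = await pool.connect();
    try {
        await client.query('BEGIN');
        const result = await task(client);
        await client.query('COMMIT');
        client.release(); // clean release: the client is still healthy
        return result;
    } catch (err) {
        try {
            await client.query('ROLLBACK');
            client.release(); // ROLLBACK succeeded, so keep the client in the pool
        } catch (rollbackErr) {
            client.release(rollbackErr); // connection is really broken: evict it
        }
        throw err;
    }
};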

In either case: the snippet's error is actually triggered by the first call to client.query(..). So, although the detail you pointed out about client.release(..) appears to be true, I'm not sure if it really impacts this snippet in any way.

It's important to evict broken connections or your pool will never heal and you'll keep getting the same broken connections. Evicting them will force your pool to (eventually) create new working connections.

(Side note: the README.md for pg-pool doesn't seem to mention this detail about the call to .release(..), but the code itself looks like it aligns w/ what you're saying. https://github.com/brianc/node-postgres/tree/master/packages/pg-pool )

Yes we should fix that. The docs site (https://node-postgres.com/api/pool#releaseCallback) has the right usage but the README must be really out of date.

The main concept is: it's possible for Clients that are already in the Pool to lose their Connections. It's the Pool's responsibility to double-check the validity of a Client before handing it off to a caller. Based upon the issue reported here, it looks like there might be a corner case where this is not occurring.

The end usage always needs to deal with broken connections as even if the pool tests the connection, it's a race condition between the pool testing the connection and the end usage actually doing some work with the connection:

  1. Client asks to check out connection
  2. Pool tests connection and verifies it's okay
  3. Connection is killed
  4. Client receives connection from pool

(Or, alternatively: perhaps several people missed this detail about .release(..), and everybody has been inserting disconnected Clients back into the pool???)

That's possible. For something like Lambda, which can freeze your app and effectively kill any open sockets in the background, I'd suggest not using a pool at all if that's an option. Or you can add your own wrapper that repeatedly tries to fetch and test the connection prior to giving it to the rest of your code. Also, setting the idleTimeoutMillis on the pool to something lower than the Lambda freeze-after-inactive seconds should limit the problem a bit, as you'll be less likely to have any clients in the pool when the Lambda gets frozen.
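
As a concrete example of that last point, a Lambda-oriented pool configuration might look something like this (the values and the DATABASE_URL env var are illustrative assumptions, not recommendations from this thread):

const { Pool } = require('pg');

// Sketch: keep idle clients only briefly so they're unlikely to still be in
// the pool when Lambda freezes the execution context.
const pool = new Pool({
    connectionString: process.env.DATABASE_URL, // assumed env var
    max: 1,                         // a Lambda container handles one invocation at a time
    idleTimeoutMillis: 5000,        // drop idle clients well before a likely freeze
    connectionTimeoutMillis: 10000  // fail fast if a new connection can't be established
});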

briandamaged commented 4 years ago

@sehrope : Agreed -- there is always a race condition between the handoff and the usage. But, based upon what has been reported, it seems that the Client/Connection has already become invalid prior to the handoff.

There is a generic pooling library named Tarn that mitigates this issue by re-validating a resource as part of the handoff. If it discovers that the resource has become invalid, it will:

  1. Immediately remove the invalid resource from the pool
  2. Attempt to provide the caller w/ another resource. (ie: grab another one from the pool if possible, or instantiate a new one)

Perhaps a similar strategy can be put in place within pg-pool?

michaelseibt commented 4 years ago

Cross-posting from knex/knex:

There's a good chance that it's a piece of network infrastructure that is discarding the connection silently. The hope is that there is some way to detect this condition. If so, then the pool implementations can verify that the connections are still valid before handing them off to the caller.

I think this is the gist of it. AWS freezes a Lambda for later usage, which includes memory pointers (variables and such) that have been created in a global-ish scope (this post on Medium summarizes it). I would assume this includes pooled clients, which never get a chance to react to closing Sockets.

From the AWS documentation:

After a Lambda function is executed, AWS Lambda maintains the execution context for some time in anticipation of another Lambda function invocation. In effect, the service freezes the execution context after a Lambda function completes, and thaws the context for reuse, if AWS Lambda chooses to reuse the context when the Lambda function is invoked again. This execution context reuse approach has the following implications: Objects declared outside of the function's handler method remain initialized, providing additional optimization when the function is invoked again. For example, if your Lambda function establishes a database connection, instead of reestablishing the connection, the original connection is used in subsequent invocations. We suggest adding logic in your code to check if a connection exists before creating one.

vitaly-t commented 4 years ago

People praise AWS Lambda without knowing that it is the only hosting environment in the world that implements an aggressive connection-recycle policy.

Nobody else, it seems, is greedy enough to inconvenience their clients so badly: dropping live connections instead of doing what everyone else does, which is extending the IO capacity. It's just dumb corporate greed, backed by self-assured market dominance, maximizing profit by reducing cost without increasing capacity.

That's why issues like this one keep polluting the Internet. That's why I do not use AWS Lambda.

jcollum commented 4 years ago

If you are using AWS Lambda for long-running processes, you're using it wrong.


michaelseibt commented 4 years ago

Adding a tiny bit more information to the thread from another perspective: We're using AWS RDS, which on its own end complains about the lost connection, too: LOG: could not receive data from client: Connection reset by peer

michaelseibt commented 4 years ago

We've just moved the database connection inside the Lambda handler function, and removed it from the state which is going to be re-used for other Lambda invocations. This might be a tiny performance hit, since we need to create a new DB connection for each Lambda invocation, but it mitigates the issue for now.

I'll stick around and try to support you guys nevertheless ✌️

micapam commented 4 years ago

Thoughts on the following? Assume pool is already instantiated and maxRetries is set to a positive integer. I'm trying to combine an auto-reconnection wrapper (as per discussion on #1789) with a small arbitrary query to ensure the client actually works before passing it off to the client code.

  for (let numRetries = 0; ; numRetries++) {
    let client

    try {
      client = await pool.connect()

      try {
        await client.query('select 1') // Verify connection actually works
      } catch (queryError) {
        client.release(queryError)
        console.warn('Test query failed', queryError)
        throw new Error('TEST_QUERY_FAILED')
      }

      if (numRetries > 0) console.debug('DB connected')
      return client
    } catch (e) {
      if (!e.toString().match(/(ECONNREFUSED|TEST_QUERY_FAILED)/)) throw e
      if (numRetries > maxRetries) throw e

      console.warn(`Error ${e} while connecting to DB. Retrying (${numRetries})`)
      await new Promise(resolve => setTimeout(resolve, 1000))
    }
  }
micapam commented 4 years ago

Regarding the above, in case anyone tries this approach: it turns out that (at least in my case) the sample query did ameliorate the problem of invalid pooled clients, but did not resolve the 'Connection terminated unexpectedly' problem discussed in this issue. To fix (or at least work around) that, I created an additional function to wrap query execution and retry if a real query fails; that was the point at which the 'Connection terminated unexpectedly' error occurred.

Here is the code:

const maxRetries = 3
const logClientError = error => console.error('DB client error', error)

export default async (sql, params = null) => {
  for (let numRetries = 0; ; numRetries++) {
    const client = await getPgClient({ getSecret })
    client.on('error', logClientError)

    try {
      const result = await client.query(sql, params)
      client.off('error', logClientError) // avoid stacking listeners when the pooled client is reused
      client.release()
      return result
    } catch (error) {
      if (client) {
        client.release(error)
        client.off('error', logClientError)
      }

      // TODO maybe check that the error is (at least potentially) recoverable?
      if (numRetries > maxRetries) throw error

      console.warn(`For query '${sql}', error: ${error}. Retrying ${numRetries}`)
      await new Promise(resolve => setTimeout(resolve, 1000))
    }
  }
}
ahungry commented 4 years ago

Is this still being worked on?

rendyfebry commented 4 years ago

Hi everyone, I'm coming from knex/knex#3636

I experienced this behaviour with a Postgres + NLB setup. If I understand correctly, this is not only a Lambda problem but a lost-connection problem in general. NLB has this broken behaviour: it will just time out after being idle for 350s, SILENTLY, WITHOUT telling the client. At this point knex/pg doesn't know the connection is already closed, so when the next query comes in, knex/pg will try to use the previous connection, realize the connection is already broken, and throw the connection terminated unexpectedly error. This blog sums up this behaviour clearly.

If I'm not mistaken, PR #2121 was put on hold because @brianc isn't familiar with Lambda and the issue was also hard to reproduce. But in my case, since the NLB consistently times out after 350s, it should be easier to reproduce.

Steps to reproduce:

  1. Put NLB in front of your Postgres server
  2. Connect your script to the NLB instead of to the Postgres server directly
  3. Make one successful request/query, to warm up the pool
  4. Wait for 350+ seconds
  5. Make a request/query again; you will see the connection terminated unexpectedly error.

Oh, and in my case psql faces the same problem too, which should not be a problem for an interactive tool like psql. But for knex/pg, my expectation is that it retries internally when it realizes the connection is already broken.

nl-brett-stime commented 4 years ago

@rendyfebry That seems plausible. One thing we might try is pre-emptively creating a new connection if we detect idleness ourselves. E.g., given a Lambda function:

let dbConn;
let lastInvocationTimestamp = 0;

// Max idle is actually 350s but allow 30s for completion of each invocation
const maxIdleBeforeResettingConnectionMs = 320000;

async function handler(event, context) {
  // Keep globals between invocations
  context.callbackWaitsForEmptyEventLoop = false;

  if (Date.now() - lastInvocationTimestamp > maxIdleBeforeResettingConnectionMs) {
    dbConn = new DbConnection(); // placeholder for however the connection is created
  }

  const result = await dbConn.fetchSomeStuff();

  lastInvocationTimestamp = Date.now();
  return result;
}

We'll try to remember to report our findings but node-postgres might want to provide some similar logic with something like maxIdleBeforeResettingConnectionMs as an optional parameter.

miguelmtzinf commented 3 years ago

Any insight about this? I am facing the same problem: my app throws an error when a connection sits idle for a long time (using a connection pool, no AWS lambda).

brianc commented 3 years ago

Any insight about this? I am facing the same problem: my app throws an error when a connection sits idle for a long time (using a connection pool, no AWS lambda).

If your app isn't in lambda then it's not related to this issue. If you can write a self-contained script to reproduce the problem, please feel free to open a new issue. Note: in node, keeping connections open for long periods of time to backends will often result in your app throwing an error if/when the backend disconnects unexpectedly. You need to add

pool.on('error', () => {
  console.log('something bad happened')
  // re-initialize pool here
})

or some other kind of error handling.

Instead of that, I'd suggest making sure your idleTimeoutMillis is set to something reasonable like 30 seconds. Then, after 30 seconds of your app sitting idle, the clients will disconnect.

miguelmtzinf commented 3 years ago


Hey @brianc, thanks for your quick reply and amazing work.

I am not using AWS lambda but I am experiencing this error too, so I would be glad to open a new issue if needed.

In my node backend application I am using a connection pool with the intention of keeping connections alive while the app is running. My idea is to create the connection pool at startup, to save time (establishing connections with the db) while receiving requests. I set idleTimeoutMillis to 0 so connections are never released/removed. As a workaround, I am recreating the pool in the error handler (as you mentioned). Does this make sense to you?

PS: I am also seeing that when the pool is created, it does not check connectivity... is this normal?

Thanks in advance

julrich1 commented 3 years ago

I'm seeing the same behavior in a Kubernetes pod behind a load balancer. After being left sitting for a few minutes, it will always throw a connection error on the next query. It doesn't seem like this problem is limited to Lambda only, as @miguelmartinezinf mentioned.

miguelmtzinf commented 3 years ago

I'm seeing the same behavior in a Kubernetes pod behind a load balancer. After being left sitting for a few minutes, it will always throw a connection error on the next query. It doesn't seem like this problem is limited to Lambda only, as @miguelmartinezinf mentioned.

Exactly, it's happening to me under a k8s env too (AKS).

adamprescott commented 3 years ago

I've been seeing the same behaviour too within an AWS Elastic Beanstalk environment. Our current work-around is to configure the health-check URL on the load balancer to run every 15 seconds. The health-check endpoint runs select 1 to keep the connection active, almost like a keep-alive. It's not ideal, but it works for now.
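
For example, the health-check endpoint might look something like this (assuming an Express app and the pg pool from earlier in the thread; the route name is an assumption):

// Sketch: the load balancer hits this route every ~15 seconds, which touches
// a pooled connection and keeps it from sitting idle.
app.get('/health', async (req, res) => {
    try {
        await pool.query('select 1');
        res.status(200).send('ok');
    } catch (err) {
        res.status(500).send('database unreachable');
    }
});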

timothyerwin commented 3 years ago

@brianc @adamprescott

I just started seeing this issue on Elastic Beanstalk too....is there any fix yet?

julrich1 commented 3 years ago


This is exactly what's happening in my case (the NLB behaviour @rendyfebry described above), and no amount of configuring the idleTimeout seems to fix the problem. Even setting it to a small value of 30 seconds or so doesn't properly detect a severed connection, and pg still appears to try to reuse it.

julrich1 commented 3 years ago

Using these values with Knex seems to solve the issue. Specifically setting the pool minimum to 0.

pool: { min: 0, max: 10, acquireTimeoutMillis: 30000, idleTimeoutMillis: 30000 }

cjnoname commented 3 years ago

I got the same issue, can anyone save me? :)

Tharuja commented 3 years ago

I got the below issue when connecting to my PostgreSQL database on AWS RDS: ERROR Unhandled Promise Rejection {"errorType":"Runtime.UnhandledPromiseRejection","errorMessage":"Error: Connection terminated","reason":{"errorType":"Error","errorMessage":"Connection terminated","stack":["Error: Connection terminated","

The code I tried was simply client.connect() followed by a query. So, I tried the below to check whether it connects correctly:

client.connect(err => {
  if (err) {
    console.error('connection error', err.stack)
  } else {
    console.log('connected')
  }
})

Then it said there was no such database even though I had created it. Following these answers, I used "postgres" as my database name instead of the AWS DB instance name. Then all the problems were solved. 😄

bdbrownell commented 3 years ago

I believe this issue is also happening to me on GCP (Google Cloud Functions), which I think is similar to AWS Lambda functions. Just wanted to comment to note that it is happening in other serverless/cloud environments, in case others are running into this issue from a Google Cloud project.

This snippet is what I will be trying. I missed this in the documentation, but I have a feeling it will alleviate the problem.

pool.on('error', () => {
  console.log('something bad happened')
  // re-initialize pool here
})

or some other kind of error handling.
michaelseibt commented 3 years ago

Cross-posting from TypeORM: https://github.com/typeorm/typeorm/issues/5112#issuecomment-611960564

Call this method from inside the event handler function.

const reconnectToDatabase = async () => {
  const connection = getConnection();
  const driver = connection.driver as any;
  for (const client of driver.master._clients) {
    try {
      await client.query('SELECT 1');
    } catch (error) {
      console.info('Reconnecting ...');
      await getConnection().driver.disconnect();
      await getConnection().driver.connect();
      break;
    }
  }
}

Please mind removing type annotations for plain node usage. This is targeting a TypeORM setup, but the concept should be working for node-pg as well. If someone could port this, highly appreciated :-)

We've been flying with this for a while now, and it has remediated the issue (though, obviously, not fixed it fully). The database connection is established outside the main handler function, allowing it to be re-used. This little snippet just checks whether everything is still up and running. We might add a little TTL on that check, so it only fires every few minutes.
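
A rough, untested sketch of how a plain node-postgres port might look (it assumes the createPool() factory shown earlier in this thread and would be called at the start of the handler):

// Sketch: probe the pool with a cheap query; if it fails, tear the pool down
// and build a fresh one before running real queries.
let pool = createPool();

async function ensureHealthyPool() {
  try {
    await pool.query('SELECT 1');
  } catch (error) {
    console.info('Reconnecting ...');
    await pool.end().catch(() => {}); // best-effort teardown of the broken pool
    pool = createPool();
  }
}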