brianc / node-postgres

PostgreSQL client for node.js.
https://node-postgres.com

7.18.2: "Connection terminated unexpectedly" when using client.query with a pool when pool has been idle for 10 minutes (running in AWS Lambda) #2112

Open jcollum opened 4 years ago

jcollum commented 4 years ago

The code is running in a Lambda. When we run the function below, it runs fine. But if we run it, wait 10 minutes, and then run it again, we get an error every time. We are querying Aurora Postgres on AWS.

Code:

const { Pool } = require('pg');

// Global connection pool, can be re-used across invocations!!
const pgPool = new Pool({
    user: process.env.PG_USER,
    host: process.env.PG_HOST,
    database: process.env.PG_DATABASE,
    password: process.env.PG_PASSWORD,
    port: Number(process.env.PG_PORT),      // env vars are strings;
    max: Number(process.env.MAX_CLIENTS)    // Pool expects numbers here
});

pgPool.on('error', (err, client) => {
    console.error('Unexpected error in Postgres connection pool', err);
});

async function performQuery(event) {

    let queryString = null;
    let args = null;

    try {
        // some setup code here...  
        const client = await pgPool.connect();
        try {
            const res = await client.query(queryString, args);
            return res.rows;
        } finally {
            client.release();
        }
    } catch (err) {
        console.error('Problem executing export query:');
        console.error(err); // <-- this line is in the log below  
        throw err;
    }
}

This is what I see in the cloudwatch logs:

{
    "errorType": "Error",
    "errorMessage": "Connection terminated unexpectedly",
    "stack": [
    "Error: Connection terminated unexpectedly",
    "    at Connection.<anonymous> (/var/task/node_modules/pg/lib/client.js:255:9)",
    "    at Object.onceWrapper (events.js:312:28)",
    "    at Connection.emit (events.js:223:5)",
    "    at Connection.EventEmitter.emit (domain.js:475:20)",
    "    at Socket.<anonymous> (/var/task/node_modules/pg/lib/connection.js:78:10)",
    "    at Socket.emit (events.js:223:5)",
    "    at Socket.EventEmitter.emit (domain.js:475:20)",
    "    at TCP.<anonymous> (net.js:664:12)"
]
}

I've tried a few variations on this, but the constants are the 10-minute wait and the use of the Pool. To me this code is almost identical to the code at https://node-postgres.com/features/pooling.

So far it looks like the problem has been solved by using a Client instead:

const client = new Client({
    user: process.env.PG_USER,
    host: process.env.PG_HOST,
    database: process.env.PG_DATABASE,
    password: process.env.PG_PASSWORD,
    port: Number(process.env.PG_PORT)
});

await client.connect();
const res = await client.query(queryString, args);
await client.end();
return res.rows;
rols2015 commented 3 years ago

please check: https://github.com/brianc/node-postgres/issues/2439#issuecomment-757691278

rick-seamless commented 2 years ago

My issue was that the Aurora reader I was connecting to was rebooting because it had fallen too far behind the writer. Figured I'd post in case someone else was running into this.

balbatross commented 2 years ago

@brianc thank you for all your hard work, this to me is another case of big tech offloading its problems to the open source community.

Running into the same issue on EKS, and seeing how many people are being affected makes me feel a little less crazy.

Gonna make my first sponsorship today to help support you, if I can track down a worthwhile solution I'll open a PR.

Arohanui (much love)

balbatross commented 2 years ago

For anyone else still struggling with this I think I may have found a resolution.

It seems to be largely an infrastructure-provider issue, as others have mentioned: AWS commonly sets a 350-second timeout on idle TCP connections, and Azure and GCP have different timeouts but similar issues.

From a cursory search ELB and NAT Gateway were the culprits for me.

The keepAlive flag implemented in node-pg follows the right setup from what I can see, but it could be that the OS level and Node aren't playing nice, since the default keepalive timer on Linux seems to be 7200 seconds.

I ended up resolving it by setting up pgbouncer and adding a keepalive timer of 300 seconds to the config:

pgbouncer.ini

tcp_keepalive = 1
tcp_keepidle = 300

Thank you all for helping me through the day
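(For reference, node-postgres itself exposes keepalive settings on the client/pool config, so part of this can be done without a pgbouncer hop. A minimal sketch, assuming the ~350-second idle timeout described above; the 30-second initial delay is an illustrative value, not a tested recommendation:)

const { Pool } = require('pg');

const pool = new Pool({
    // ask Node to send TCP keepalive probes, so the socket sees
    // traffic well before the provider's ~350 s idle timeout fires
    keepAlive: true,
    keepAliveInitialDelayMillis: 30000
});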

sharadpattanshetti commented 2 years ago

We added the sysctl settings below as well as the pool config, but we still get the same error:

net.ipv4.tcp_keepalive_time = 60
net.ipv4.tcp_keepalive_intvl = 10
net.ipv4.tcp_keepalive_probes = 6

database:
  client: pg
  connection:
    host:
    user:
    password:
  pool:
    min: 0
    max: 30
    acquireTimeoutMillis: 30000
    idleTimeoutMillis: 30000
Dakuan commented 2 years ago

still a thing

ameer2468 commented 1 year ago

Any solutions? My RESTful API randomly disconnects because of this... otherwise I'll just set up a cronjob to call the /health endpoint every minute or something.

boromisp commented 1 year ago

> Any solutions? My RESTful API randomly disconnects because of this... otherwise I'll just set up a cronjob to call the /health endpoint every minute or something.

pg.Pool is not a good fit for AWS Lambda. If your database and networking are set up so that TCP connections can survive long enough without keepalives, it might work, but you are probably better off creating fresh connections in every Lambda execution. You are also losing out on horizontal scaling this way.

I think the typical setup for serverless APIs connecting to a traditional RDBMS is a separate connection pool (e.g. pgbouncer).
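For the Lambda case, here is a minimal sketch of the fresh-connection-per-invocation approach described above (the handler shape and the SELECT are illustrative; connection details are assumed to come from the standard PG* environment variables):

const { Client } = require('pg');

exports.handler = async (event) => {
    // one short-lived connection per invocation instead of a shared Pool
    const client = new Client();
    await client.connect();
    try {
        const res = await client.query('SELECT 1');
        return res.rows;
    } finally {
        // always close, so nothing sits idle between invocations
        await client.end();
    }
};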

bdbrownell commented 1 year ago

> Any solutions? My RESTful API randomly disconnects because of this... otherwise I'll just set up a cronjob to call the /health endpoint every minute or something.

> pg.Pool is not a good fit for AWS Lambda. If your database and networking are set up so that TCP connections can survive long enough without keepalives, it might work, but you are probably better off creating fresh connections in every Lambda execution. You are also losing out on horizontal scaling this way.

> I think the typical setup for serverless APIs connecting to a traditional RDBMS is a separate connection pool (e.g. pgbouncer).

I have had similar experiences with this on Google Cloud using Cloud Functions, posted previously in here. This is actually the same conclusion my team has come to as well. The way Cloud Functions are handled, at least for first-generation functions on GCP, is that a single request is handled by a single instance. A connection pool can never do its job in this situation, regardless of the timeout; it's one of the shortcomings of the FaaS architecture.

ameer2468 commented 1 year ago

> Any solutions? My RESTful API randomly disconnects because of this... otherwise I'll just set up a cronjob to call the /health endpoint every minute or something.

> pg.Pool is not a good fit for AWS Lambda. If your database and networking are set up so that TCP connections can survive long enough without keepalives, it might work, but you are probably better off creating fresh connections in every Lambda execution. You are also losing out on horizontal scaling this way.

> I think the typical setup for serverless APIs connecting to a traditional RDBMS is a separate connection pool (e.g. pgbouncer).

Hey, thanks for the reply. I'm not on AWS/AWS Lambda. My RESTful API is based on Fastify; here's a link just in case: https://www.fastify.io/

Also, I'm using a Client, not a Pool.

Edit: Will be doing this instead: https://github.com/fastify/fastify-postgres

mepc36 commented 1 year ago

> For anyone else still struggling with this I think I may have found a resolution.
>
> It seems to be largely an infrastructure-provider issue, as others have mentioned: AWS commonly sets a 350-second timeout on idle TCP connections, and Azure and GCP have different timeouts but similar issues.
>
> From a cursory search ELB and NAT Gateway were the culprits for me.
>
> The keepAlive flag implemented in node-pg follows the right setup from what I can see, but it could be that the OS level and Node aren't playing nice, since the default keepalive timer on Linux seems to be 7200 seconds.
>
> I ended up resolving it by setting up pgbouncer and adding a keepalive timer of 300 seconds to the config:
>
> pgbouncer.ini
>
> tcp_keepalive = 1
> tcp_keepidle = 300
>
> Thank you all for helping me through the day

@balbatross Do you mind explaining in more detail how you set up pgbouncer, and why that fixed your problem? Links to docs are fine. I Googled but found nothing. Thanks!

bdbrownell commented 1 year ago

@mepc36 pgbouncer solves this by moving the pooling out of node-postgres and into pgbouncer itself. It sits between the Node application and the database and uses its own internal pooling to keep track of connections, which is transparent to your application. I don't think the configuration you quoted really has anything to do with why this solves the issue; pgbouncer is the solution to the problem. You would set up pgbouncer on a VM (Compute Engine on GCP) and your application would connect to that instead of Postgres directly.
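A minimal pgbouncer.ini sketch for that setup (every host, port, and size here is an assumption to adapt; see the pgbouncer docs for the full option list):

[databases]
mydb = host=your-db-host port=5432 dbname=mydb

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
; transaction pooling releases server connections between transactions,
; which suits many short-lived clients
pool_mode = transaction
max_client_conn = 200
default_pool_size = 20

The application then connects to port 6432 instead of 5432, and pgbouncer owns the real server connections.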

p.s. Nice to see a fellow developer from Philadelphia :grinning:

moltar commented 1 year ago

Quick question to everyone who is still experiencing this.

Do you use Aurora Serverless v1 by any chance?

The reason I'm asking is that I think this happens when the cluster experiences a scaling event: it drops all connections.

This doesn't happen on v2 by the way.

moltar commented 1 year ago

I think I got to the bottom of this, at least for one particular case.

This is only applicable to Aurora Serverless v1, and maybe v2. I will test v2 later and report back.

Here's what happens:

A scaling event is triggered.

About two minutes after the scaling event is triggered (cluster status = scaling-capacity), RDS runs this query:

-- 1: long string, looks like base64 encoded, but not
-- 2: The value of `SecondsBeforeTimeout` from `ScalingConfigurationInfo`
-- 3: Not sure what this is, seems to be always `2`

select * from aurora_serverless_prepare_seamless_scaling('long string', 60000, 2)

This query is blocking until all user queries are finished running.

As soon as the last query ends, it will drop all of the active connections within milliseconds.

I tested this by running select pg_sleep(60). As soon as the query returned, TablePlus would throw:

Query 1 ERROR: SSL connection has been closed unexpectedly

So what actually happens in the Lambda environment?

It's a race condition!

  1. Cluster entered a scaling-capacity state.
  2. But the aurora_serverless_prepare_seamless_scaling query has not started yet. Remember - it runs in ~ 2 minutes.
  3. Lambda gets invoked and an instance of Pool is created
  4. Maybe some database activity happens for a while (> 2m), let's say it's a semi-long job Lambda, not a web request.
  5. The function aurora_serverless_prepare_seamless_scaling is now patiently waiting for its moment to pounce 🐈
  6. All of the transactions and queries are finished for whatever reason (e.g. maybe they run in a loop: fetch data -> save data), and there is a few millisecond window of opportunity.
  7. But the Pool object is still alive and well, and Lambda is still running, just not querying the database.
  8. aurora_serverless_prepare_seamless_scaling triggers scaling
  9. All connections drop. 💥
Error: Connection terminated unexpectedly
    at Connection.<anonymous> (/node_modules/.pnpm/pg@8.11.2/node_modules/pg/lib/client.js:132:73)
    at Object.onceWrapper (node:events:627:28)
    at Connection.emit (node:events:513:28)
    at Socket.<anonymous> (/node_modules/.pnpm/pg@8.11.2/node_modules/pg/lib/connection.js:63:12)
    at Socket.emit (node:events:513:28)
    at TCP.<anonymous> (node:net:301:12)

I have corroborated this with a massive amount of Log Insight queries and a CloudWatch dashboard.

[Screenshots: CloudWatch Log Insights queries and dashboard]

bdbrownell commented 1 year ago

@moltar I was experiencing this issue over in Google Cloud Functions, the Google equivalent of AWS Lambda. I presume GCP drops idle connections after some timeout in the same way AWS does. It turns out there is no way for GCP Cloud Functions to handle concurrent requests, so it doesn't actually make sense to start a pool, as the function only handles one query per invocation. Version 2 has changed that and allows concurrent requests. In a function-as-a-service environment, IMO, the best way to handle this would be an external pool service like pgbouncer.

Tadeuchi commented 1 year ago

Hi, the connection issue still seems to appear in the latest version, 8.11.3.

I encountered something very similar to what was described above and managed to create a simple code snippet that reproduces the problem. Maybe this will help with resolving the issue.

I'm just using pg.Pool to connect to a local Postgres and insert a simple JSON object. When the object contains a BigInt, the JSON.stringify call inside the pg lib fails, but a connection has already been checked out. Here is what I'm doing.

A simple table to store the jsonb:

create table test (
    value jsonb
);

JS inserting the objects:

const { Pool } = require('pg');
const pgPool = new Pool(); // connection config comes from the PG* env vars

async function insertUsingPool() {
    // following query works fine
    await pgPool.query(
        `INSERT INTO test (value) VALUES ($1)`,
        [{'test' : 1}]
    ).catch((e) => { console.error(e) });

    // this fails with error: TypeError: Do not know how to serialize a BigInt
    await pgPool.query(
        `INSERT INTO test (value) VALUES ($1)`,
        [{'test' : 2n}]
    ).catch((e) => { console.error(e) });

    // this fails with error: Error: Connection terminated unexpectedly
    await pgPool.query(
        `INSERT INTO test (value) VALUES ($1)`,
        [{'test' : 3}]
    ).catch((e) => { console.error(e) });
}

async function insertUsingClientPool() {
    const pgPoolClient = await pgPool.connect();

    // following query works fine
    await pgPoolClient.query(
        `INSERT INTO test (value) VALUES ($1)`,
        [{'test' : 4}]
    ).catch((e) => { console.error(e) });

    // this throws exception and hangs... TypeError: Do not know how to serialize a BigInt
    await pgPoolClient.query(
        `INSERT INTO test (value) VALUES ($1)`,
        [{'test' : 5n}]
    ).catch((e) => { console.error(e) });

    // not even executed...
    await pgPoolClient.query(
        `INSERT INTO test (value) VALUES ($1)`,
        [{'test' : 6}]
    ).catch((e) => { console.error(e) });
}

After running the insertUsingClientPool function, the test table contains only two rows, {"test": 1} and {"test": 4}, and the process just hangs. pg_stat_activity shows an active connection with the insert query still ongoing.

I don't know, maybe I'm doing something stupid and not using the pg lib correctly, but maybe the JSON.stringify should be called before the connection is checked out?
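One possible workaround sketch while this is open: stringify the JSON parameter yourself before calling query, so a serialization failure happens before a connection is tied up. The BigInt replacer below is illustrative, not part of pg:

const toJson = (value) =>
    JSON.stringify(value, (_key, v) => (typeof v === 'bigint' ? v.toString() : v));

// inside an async function: the pre-stringified text goes over the wire
// as a plain string parameter and is cast to jsonb by Postgres
await pgPool.query(
    `INSERT INTO test (value) VALUES ($1::jsonb)`,
    [toJson({'test': 5n})]
).catch((e) => { console.error(e) });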

cjnoname commented 7 months ago

Any updates?

devodii commented 5 months ago

> Any solutions? My RESTful API randomly disconnects because of this... otherwise I'll just set up a cronjob to call the /health endpoint every minute or something.

How can we do this, buddy?