brianc / node-postgres

PostgreSQL client for node.js.
https://node-postgres.com
MIT License

Curious behavior: `idleTimeoutMillis` works locally, but not on "serverless" environments #2718

filipedeschamps opened this issue 2 years ago (status: Open)

filipedeschamps commented 2 years ago

I was hesitant to open this issue since it might involve other things, not just pg, but I'll give it a try:

So, everything works as expected on localhost: after I release a client back to the Pool with client.release() and idleTimeoutMillis reaches its limit, the Pool correctly emits the remove event (signaling that it correctly destroyed the client and the real connection to the Database). I can also use my localhost to connect to the same Database I'm using in Production (RDS Postgres) and it works as expected.

But on environments like Vercel (which uses AWS Lambdas under the hood), for some very curious reason, idleTimeoutMillis does not seem to work as expected. It looks like after a Client is released, it stays in the Pool as idle forever: the client/connection is never destroyed after idleTimeoutMillis is reached, leaving a lot of idle connections hanging on RDS.

To make things stranger, if I force a client to close using client.release(true), the remove event from the Pool is emitted and the connection is closed as expected... but forcing this on every client defeats the purpose of using a Pool in the first place.

I don't know if the event loop behaves differently in this kind of environment and the internal pg timers never get to run, or something like that.

import { Pool } from 'pg';
import retry from 'async-retry';
// ...

const configurations = {
  // ...
  connectionTimeoutMillis: 5000,
  idleTimeoutMillis: 5000,
  max: 1,
  allowExitOnIdle: true // I've also tried with false
};

const pool = new Pool(configurations);

async function query(query, params) {
  let client;

  try {
    client = await tryToGetNewClientFromPool();
    return await client.query(query, params);
  } catch (error) {
    throw new ServiceError({
      // ...
    });
  } finally {
    if (client) {
      client.release();
    }
  }
}

// If all Database connections are maxed out, this function will
// keep retrying until it gets one Client from Pool and returns
// the Promise with it.
async function tryToGetNewClientFromPool() {
  const clientFromPool = await retry(newClientFromPool, {
    retries: 15,
    minTimeout: 0,
    factor: 2,
  });

  return clientFromPool;

  async function newClientFromPool() {
    return await pool.connect();
  }
}

// ...
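For reference, a minimal sketch of how the remove event (and the other lifecycle events pg.Pool emits) can be observed on the pool instance above, to confirm whether idle clients are actually evicted:

pool.on('connect', () => console.log('pool: new client/connection created'));
pool.on('acquire', () => console.log('pool: client checked out'));
pool.on('remove', () => console.log('pool: idle client destroyed, connection closed'));
pool.on('error', (error) => console.error('pool: error on idle client', error.message));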
sehrope commented 2 years ago

It's the Lambda environment's freeze / thaw behavior causing this issue. You can't use a long-lived connection pool in a Lambda, as there is no guarantee that the Lambda will keep running to clear expired connections or keep the underlying TCP socket alive.

See this thread and the nested linked issue for more details: https://github.com/brianc/node-postgres/issues/2243

filipedeschamps commented 2 years ago

Hi @sehrope thank you! I read the #2243 issue before creating mine, and my app is not suffering from "connection timeout".

But you pointed out the freeze / thaw behavior again, and I'm stress-testing my app to keep all lambdas warm, for example:

ab -n 1000 -c 100 -l https://xxx

And at some point (near the 900th request) everything hangs, because there are no more available connections, and tryToGetNewClientFromPool() keeps retrying until the lambda times out.

But then I tried creating the Pool with an idle timeout of 1:

idleTimeoutMillis: 1,

It was way slower, but in the end all 1000 requests were served and no connections were left hanging (tbh, 5 of them were, I don't know why).

So now I'm wondering: if the freeze is the problem, how quickly does it happen to a lambda?

filipedeschamps commented 2 years ago

Additional note:

  idleTimeoutMillis: 100,

This consumes all connections again, without ever recovering/closing them.

Maybe some lambdas aren't being frozen because they became idle, but because they were forcibly frozen to make room for other new lambdas in a high load scenario?

[edit] My urge to use Pools is because the difference between creating a new connection every time and reusing one from the Pool is 80ms versus 1.5ms.
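For reference, per-query latencies like the ones quoted in this thread can be collected with something like this (a sketch, not my exact instrumentation):

import { performance } from 'perf_hooks';

async function timedQuery(pool, text, values) {
  const start = performance.now();
  const result = await pool.query(text, values);
  console.log(`query took ${(performance.now() - start).toFixed(3)} ms`);
  return result;
}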

sehrope commented 2 years ago

I think that's possible as the docs say it could be frozen or bounced at any time:

https://docs.aws.amazon.com/lambda/latest/dg/runtimes-context.html

The last line also says not to assume you have connections open across requests:

When you write your function code, do not assume that Lambda automatically reuses the execution environment for subsequent function invocations. Other factors may dictate a need for Lambda to create a new execution environment, which can lead to unexpected results, such as database connection failures.

If there are multiple copies of the lambda running, or if AWS spins up new ones, it's possible that the server's max connection limit is reached and a newly created lambda cannot open any new connections until the old ones on other lambdas are closed. And those might not actually be gracefully closed if the lambda is frozen, so it'd hang until the server finally drops them due to TCP errors.
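If you want to see how close you are to that limit while the test runs, something like this works (a sketch; client is any connected pg client, and these are standard Postgres settings/views):

const { rows } = await client.query(`
  SELECT (SELECT count(*) FROM pg_stat_activity) AS open_connections,
         current_setting('max_connections')::int AS max_connections,
         current_setting('superuser_reserved_connections')::int AS reserved_connections;
`);
console.log(rows[0]);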

filipedeschamps commented 2 years ago

...cannot open any new connections until the old ones on other lambdas are closed. And those might not actually be gracefully closed if the lambda is frozen, so it'd hang until the server finally drops them due to TCP errors.

Yes. What really bugs me is that even with idleTimeoutMillis: 10 they aren't gracefully closed. Those lambdas weren't in an idle state; they were being shut down. If only I could see this information somewhere...

filipedeschamps commented 2 years ago

So @sehrope, just to map out all scenarios and double-check that I have the right mental models:

Considering

1 request from User to App, in which the App makes 3 round trips (queries) to the Database (AWS RDS's tiniest tier, with 82 max connections), with the App hosted on Vercel (AWS Lambdas under the hood).

Scenario 1 (slowest | safer)

For every single query, create a new client to open a new connection, run the query, and right afterward destroy the client and close the connection.

      "latency": {
        "first_query": 60.39779100008309,
        "second_query": 62.02611799980514,
        "third_query": 87.15444500022568
      },

✅ No connection left hanging under low load. ✅ No connection left hanging under high load:

$ ab -n 1000 -c 100 -l https://xxx
# > Time taken for tests:   16.601 seconds

Scenario 2 (slower | safe)

For 1 request, open just one connection and reuse it for the subsequent queries in the request, then close it at the end. It's safe, but less safe than Scenario 1 because you need to close the Database connections yourself, in all code branches, for example:

  1. Successful request branch
  2. Not found request branch
  3. Errored request branch

But it pays off, I guess:

      "latency": {
        "first_query": 58.97044100007042,
        "second_query": 2.284815000137314,
        "third_query": 0.9723079998511821
      },

✅ No connection left hanging under low load. ✅ No connection left hanging under high load. ✅ 2.5x faster than Scenario 1:

$ ab -n 1000 -c 100 -l https://xxx
# > Time taken for tests:   6.699 seconds
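The shape of Scenario 2 in code is roughly this (untested sketch: the Next.js-style handler and the queries are made up, the point is that the single client is closed in finally on every branch):

import { Client } from 'pg';

export default async function handler(request, response) {
  const client = new Client({ /* connection settings */ });
  await client.connect();

  try {
    const first = await client.query('SELECT now() AS first_query');
    const second = await client.query('SELECT now() AS second_query'); // same connection reused
    const third = await client.query('SELECT now() AS third_query');
    response.status(200).json({ rows: [first.rows, second.rows, third.rows] });
  } catch (error) {
    response.status(500).json({ error: error.message });
  } finally {
    await client.end(); // every branch ends up here, so the connection always gets closed
  }
}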

Scenario 3 (fast | unsafe)

For 1 request, use a Pool declared in the global scope to manage all connected Clients (create and reuse). You don't need to close Clients yourself; the Pool does it once idleTimeoutMillis is reached. BUT, in a Lambda environment, you can't guarantee the code will still be running for the Pool to run its internal timers and gracefully close the connections. This problem is not apparent under low load, but under high load it is.

First request:

      "latency": {
        "first_query": 61.41757299995515,
        "second_query": 3.38494899997022,
        "third_query": 1.2718850000528619
      },

Second request made before reaching idleTimeoutMillis (notice the first query):

      "latency": {
        "first_query": 1.6486390000209212,
        "second_query": 1.6974199999822304,
        "third_query": 1.25238800002262
      },

✅ No connection left hanging under low load. ❌ All connections left hanging under high load, test could not be completed:

Completed 100 requests
Completed 200 requests
Completed 300 requests
Completed 400 requests
Completed 500 requests
Completed 600 requests
Completed 700 requests
Completed 800 requests
Completed 900 requests
apr_pollset_poll: The timeout specified has expired (70007)
Total of 988 requests completed

Maybe you could kind of sync idleTimeoutMillis with the new idle_session_timeout available in Postgres 14, but it's still risky.
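One way to pair the two would be something like this (untested sketch; idle_session_timeout is the real Postgres 14 setting, applying it per connection from the pool's connect event is just one possible way to do it):

const pool = new Pool({
  // ... connection settings ...
  idleTimeoutMillis: 5000,
  max: 1,
});

// Every new physical connection asks Postgres to kill its own session if it sits
// idle longer than the pool's timeout plus a safety margin (values in milliseconds).
pool.on('connect', (client) => {
  client.query("SET idle_session_timeout = '7000'").catch((error) => {
    console.error('could not set idle_session_timeout', error.message);
  });
});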

Scenario 4 (faster | unsafer)

Always keep an open connection in the global scope. It's faster because there's zero overhead (important to note that the Pool overhead is minimal, and it can manage multiple Clients), but it is one level less safe, because it has all the problems of Scenario 3, plus Postgres closing your connection from its side, just like mentioned in issue #2243.

First request:

      "latency": {
        "first_query": 56.0580700000282377,
        "second_query": 1.2427840000018477,
        "third_query": 0.8340720001142472
      },

Second request:

      "latency": {
        "first_query": 0.8882809999631718,
        "second_query": 0.9511439999332651,
        "third_query": 0.5543670000042766
      },

✅ No connection left hanging under low load. ❌ All connections left hanging under high load, test could not be completed:

Completed 100 requests
Completed 200 requests
Completed 300 requests
Completed 400 requests
Completed 500 requests
Completed 600 requests
Completed 700 requests
Completed 800 requests
Completed 900 requests
apr_pollset_poll: The timeout specified has expired (70007)
Total of 984 requests completed
filipedeschamps commented 2 years ago

Scenario 2 is not completely safe under load

In a load of 5000 requests (50 concurrent), I received 22 errors with message: 'Client was closed and is not queryable'.

Anyway, differences between Scenario 2 and Scenario 4 under this load:

Scenario 2

Concurrency Level:      50
Time taken for tests:   32.306 seconds
Complete requests:      5000

Scenario 4

Concurrency Level:      50
Time taken for tests:   22.546 seconds
Complete requests:      5000
boromisp commented 2 years ago

Alternatively you could create a new pool for every request, and clean it up every time. This makes the error handling somewhat easier.

Did you handle errors on client.connect() in scenario 2? If the connection timed out (connectionTimeoutMillis) under high load, you will get these errors when you try to use the client.

filipedeschamps commented 2 years ago

Alternatively you could create a new pool for every request, and clean it up every time. This makes the error handling somewhat easier.

True! I will combine this strategy with idle_session_timeout from Postgres 14 because if the Pool can't clean itself from a frozen lambda, Postgres from its side can.

Did you handle errors on client.connect() in scenario 2? If the connection timed out (connectionTimeoutMillis) under high load, you will get these errors when you try to use the client.

I didn't, and it's a great point, but are you sure a "waiting to open" connection will throw this message? I thought it was because a request called client.end(), flagging the client with _ending = true, and then a subsequent request found this still-existing object and tried to use it:

https://github.com/brianc/node-postgres/blob/684cd09bcecbf5ad5f451fdf608a3e9a9444524e/packages/pg/lib/client.js#L568-L573

https://github.com/brianc/node-postgres/blob/4b229275cfe41ca17b7d69bd39f91ada0068a5d0/packages/pg/test/integration/client/error-handling-tests.js#L60-L78

What I could do is handle conditions on this.activeQuery and !this._queryable, but then I'd start messing with private properties.

sehrope commented 2 years ago

In a load of 5000 requests (50 concurrent), I received 22 errors with message: 'Client was closed and is not queryable'.

That sounds like a bug in either the pool itself or your pool connection management. Regardless of load, the pool should not close connections that are checked out, and it should not hand out connections it knows it closed.

I didn't, and it's a great point, but are you sure a "waiting to open" connection will throw this message? I thought it was because a request called client.end(), flagging the client with _ending = true, and then a subsequent request found this still-existing object and tried to use it:

Are you calling client.end() on a pool connection? You should be calling client.release(true) (or passing anything "truthy", like an Error) to inform the pool that the connection is bad and should be evicted.

Does your code use pool.query(...) or do you manually manage connections via pool.connect(...) (i.e. for transaction management)?
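To be clear about the distinction (a sketch, assuming pool is your pg.Pool):

const client = await pool.connect();
try {
  await client.query('SELECT 1');
  client.release();      // hand the connection back to the pool for reuse
} catch (error) {
  client.release(error); // truthy argument: the pool destroys this connection instead of reusing it
  throw error;
}
// client.end() is the plain pg.Client shutdown; calling it on a pooled client
// bypasses the pool, which then doesn't know the connection is gone.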

filipedeschamps commented 2 years ago

@sehrope in Scenario 2 I wasn't using a Pool. I was manually creating one client, assigning it to the global scope, and for every new request checking whether it already existed: if so, reusing it, otherwise creating a new one, and only at the end of that request closing the client and resetting its variable to null. So, great chances of a race condition, I guess.

sehrope commented 2 years ago

Oh yes, that'd definitely break under load. The clients have an internal query command queue, so concurrent usage of a client kind of works until it doesn't (i.e. anything that requires connection state management, like a transaction). In practice it's a terrible idea to rely on the queue behavior, and sharing clients concurrently is a big no-no.
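Here's a hypothetical illustration of how that bites you (made-up table and config, one client shared between two concurrent "requests"):

import { Client } from 'pg';

const config = {}; // connection settings omitted
const shared = new Client(config);
await shared.connect();

async function requestA() {
  await shared.query('BEGIN');
  await shared.query("INSERT INTO audit_log (msg) VALUES ('from A')");
  await shared.query('ROLLBACK'); // also throws away whatever B queued in between
}

async function requestB() {
  // Queued on the same connection, so it can execute inside A's transaction
  // and silently disappear with A's rollback.
  await shared.query("INSERT INTO audit_log (msg) VALUES ('from B')");
}

await Promise.all([requestA(), requestB()]);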

How about having your request handler get or create a pool at the start of the request and, if it's the only active request upon completion, destroy the pool? Something like this (not tested, just a skeleton!):

import pg from 'pg';

let pool = null;
let poolUsageCount = 0;

// App code should use this to get the pool rather than referencing the variable directly:
export const getPool = () => {
    if (!pool) {
        throw new Error('Called getPool() outside of an active request');
    }
    return pool;
}

export const handler = async function(event, context) {
    if (!pool) {
        pool = new pg.Pool();
    }
    // Count every request, not just the one that created the pool:
    poolUsageCount += 1;
    try {
        return await myRealHandler(event, context);
    } finally {
        poolUsageCount -= 1;
        if (poolUsageCount === 0) {
            await pool.end();
            pool = null;
        }
    }
}

Under load there would be enough concurrent active requests that the pool does not get destroyed. It'd be reused by concurrent requests and any open connections held in the pool would be reused as well.

But if the request count ever drops low enough that there is nothing running, it would ensure the pool is destroyed and recreated in case the lambda is frozen / thawed before the next request.

You might have to tweak it a bit more to have the top-level request handler destroy the pool before sending back the response, to ensure it all happens before AWS thinks it's okay to scrap the lambda.

filipedeschamps commented 2 years ago

@sehrope great!! I'm avoiding having the controller manage the pool, but I'm not discarding this strategy, thank you 🤝

Meanwhile, I started my tests with idle_session_timeout and it really does what it promises: Postgres ends any idle connection automatically. So I've reimplemented Pool sharing with:

Pool:

idleTimeoutMillis: 5000

Postgres:

idle_session_timeout = 7000

While everything looks great for requests made within 7000ms, after passing 7000ms the next request blew up in my face:

terMark":16384,"buffer":{"head":null,"tail":null,"length":0},"length":0,"pipes":[],"flowing":true,"ended":false,"endEmitted":false,"reading":true,"sync":false,"needReadable":true,"emittedReadable":false,"readableListening":false,"resumeScheduled":false,"errorEmitted":false,"emitClose":false,"autoDestroy":false,"destroyed":true,"errored":null,"closed":true,"closeEmitted":false,"defaultEncoding":"utf8","awaitDrainWriters":null,"multiAwaitDrain":false,"readingMore":false,"dataEmitted":true,"decoder":null,"encoding":null},"_events":{"error":[null,null],"close":[null,null]},"_eventsCount":4,"_writableState":{"objectMode":false,"highWaterMark":16384,"finalCalled":false,"needDrain":false,"ending":false,"ended":false,"finished":false,"destroyed":true,"decodeStrings":false,"defaultEncoding":"utf8","length":0,"writing":false,"corked":0,"sync":false,"bufferProcessing":false,"writecb":null,"writelen":0,"afterWriteTickInfo":null,"buffered":[],"bufferedIndex":0,"allBuffers":true,"allNoop":true,"pendingcb":0,"prefinished":false,"errorEmitted":false,"emitClose":false,"autoDestroy":false,"errored":null,"closed":true,"closeEmitted":false,"writable":true},"allowHalfOpen":false,"_sockname":null,"_pendingData":null,"_pendingEncoding":"","server":null,"_server":null},"_host":null,"_readableState":{"objectMode":false,"highWaterMark":16384,"buffer":{"head":null,"tail":null,"length":0},"length":0,"pipes":[],"flowing":true,"ended":false,"endEmitted":false,"reading":false,"sync":false,"needReadable":true,"emittedReadable":false,"readableListening":false,"resumeScheduled":false,"errorEmitted":false,"emitClose":false,"autoDestroy":false,"destroyed":true,"errored":null,"closed":true,"closeEmitted":false,"defaultEncoding":"utf8","awaitDrainWriters":null,"multiAwaitDrain":false,"readingMore":false,"dataEmitted":true,"decoder":null,"encoding":null},"_writableState":{"objectMode":false,"highWaterMark":16384,"finalCalled":false,"needDrain":false,"ending":false,"ended":false,"finished":false,"destroyed":true,"decodeStrings":false,"defaultEncoding":"utf8","length":5,"writing":true,"corked":0,"sync":false,"bufferProcessing":false,"writelen":5,"afterWriteTickInfo":null,"buffered":[],"bufferedIndex":0,"allBuffers":true,"allNoop":true,"pendingcb":1,"prefinished":false,"errorEmitted":false,"emitClose":false,"autoDestroy":false,"errored":null,"closed":true,"closeEmitted":false},"allowHalfOpen":false,"_sockname":null,"_pendingData":null,"_pendingEncoding":"","_server":null,"ssl":null,"_requestCert":true,"_rejectUnauthorized":false},"_keepAlive":false,"_keepAliveInitialDelayMillis":0,"lastBuffer":false,"parsedStatements":{},"ssl":{"rejectUnauthorized":false},"_ending":true,"_emitMessage":false,"_connecting":true},"queryQueue":[],"binary":false,"processID":17622,"secretKey":543429530,"ssl":{"rejectUnauthorized":false},"_connectionTimeoutMillis":0,"_connectionCallback":null,"activeQuery":null,"readyForQuery":true,"hasExecuted":true,"_poolUseCount":57},"stack":["error: terminating connection due to idle-session timeout","    at Parser.parseErrorMessage (/var/task/node_modules/pg-protocol/dist/parser.js:287:98)","    at Parser.handlePacket (/var/task/node_modules/pg-protocol/dist/parser.js:126:29)","    at Parser.parse (/var/task/node_modules/pg-protocol/dist/parser.js:39:38)","    at TLSSocket.<anonymous> (/var/task/node_modules/pg-protocol/dist/index.js:11:42)","    at TLSSocket.emit (events.js:400:28)","    at addChunk (internal/streams/readable.js:293:12)","    at readableAddChunk (internal/streams/readable.js:267:9)","    at 
TLSSocket.Readable.push (internal/streams/readable.js:206:10)","    at TLSWrap.onStreamRead (internal/stream_base_commons.js:188:23)"]}
[ERROR] [1646924053876] LAMBDA_RUNTIME Failed to post handler success response. Http response code: 400.
END RequestId: 6265b8a0-62d2-45bb-afee-0bdc371917e0
REPORT RequestId: 6265b8a0-62d2-45bb-afee-0bdc371917e0  Duration: 6.02 ms   Billed Duration: 7 ms   Memory Size: 1024 MB    Max Memory Used: 54 MB  
RequestId: 6265b8a0-62d2-45bb-afee-0bdc371917e0 Error: Runtime exited with error: exit status 129
Runtime.ExitError

The most important part:

"stack":["error: terminating connection due to idle-session timeout","
    at Parser.parseErrorMessage (/var/task/node_modules/pg-protocol/dist/parser.js:287:98)","
    at Parser.handlePacket (/var/task/node_modules/pg-protocol/dist/parser.js:126:29)","

It's clearer to me now that in a Lambda the Pool stops its internal work as soon as the request is returned, so it's unable to disconnect the client after 5000ms; it probably tries to do it right when the next request hits the Lambda, but the Postgres backend has already killed the connection at that point.
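Side note: errors raised on idle clients (like this backend termination) are re-emitted by the Pool as an 'error' event, and without a listener they can crash the Node process. A minimal guard looks like this (sketch, attached to the shared pool):

pool.on('error', (error) => {
  // An idle client's connection was terminated (e.g. by idle_session_timeout).
  // The pool has already discarded it; log instead of letting the process crash.
  console.error('idle client error', error.message);
});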

boromisp commented 2 years ago

Btw, there is no concurrent client sharing in this case (and the poolUsageCount would always be 0 at the end of @sehrope's handler), because the AWS Lambda will only send a single request at a time to each instance.

I didn't and it's a great point, but are you sure a "waiting to open" connection will throw this message? I thought it was due to the fact a request called client.end(), flagging the client to _closing = true and then a subsequent request found this still existing object and tried to use it:

My mistake, the connection timeout sets the client.connection._ending flag, not the client._ending: https://github.com/brianc/node-postgres/blob/21ccd4f1b6e66774bbf24aecfccdbfe7c9b49238/packages/pg/lib/client.js#L103-L107 In this case, I don't think the connection timeout would result in that exact error message. If you don't use a pool, that error should only be triggered by using a client after calling client.end(). The connection timeout might result in Connection terminated unexpectedly errors instead.

sehrope commented 2 years ago

@boromisp Oh dang, you're right. Lambda only sends a single request at a time to an instance.

@filipedeschamps Forget my idea of keeping it open if it's being used concurrently. That would only work if the requests were actually concurrent.

filipedeschamps commented 2 years ago

I think I found an almost-sweet spot. It's not the fastest strategy because you take a hit on the first query (to open the connection), but the Pool reuses the client for every subsequent query in that request and closes the client right after. So the pool is using:

idleTimeoutMillis: 1
max: 1

With idle_session_timeout = 7000 set on Postgres to kill anything that got out of control.

First query result

      "latency": {
        "first_query": 13.636365999933332,
        "second_query": 2.0373400000389665,
        "third_query": 0.6963150000665337
      },

Load results 1000 requests

Concurrency Level:      100
Time taken for tests:   7.362 seconds
Complete requests:      1000
Failed requests:        0

Load results 5000 requests

Concurrency Level:      100
Time taken for tests:   29.019 seconds
Complete requests:      5000
Failed requests:        0

No connections left hanging. But I'm confident it will not work for every load, because of this:

Pool:

idleTimeoutMillis: 100
Concurrency Level:      100
Time taken for tests:   51.840 seconds
Complete requests:      5000
Failed requests:        0
Non-2xx responses:      18

It's clear to me that the Pool's internal timers completely freeze right after the request is returned.

Next, I'm going to try setting Postgres to idle_session_timeout = 65000, because the Lambda's default timeout is 60000, and after that it doesn't matter what it could do... connections that can no longer be reached will be left hanging idle.

filipedeschamps commented 2 years ago

To my surprise, somehow even after 60 seconds, the Lambda keeps the socket connection open.

Pool:

idleTimeoutMillis: 10000

Postgres:

idle_session_timeout = 61000

The first request works as expected, but sending a second request 2 minutes later makes the app crash with "error: terminating connection due to idle-session timeout".

[edit] It also crashed after 5 minutes.

filipedeschamps commented 2 years ago

For now, my final choice will be:

Pool idleTimeoutMillis: 1

Postgres idle_session_timeout = 0

Load of 5,000 requests (100 concurrent). Time taken for tests: 30.627 seconds

Percentage of the requests served within a certain time (ms)
  50%    540
  66%    567
  75%    593
  80%    609
  90%    681
  95%    789
  98%   1881
  99%   2157
 100%   4318 (longest request)

Connections left hanging: 11

One request example

      "latency": {
        "first_query": 19.316103999968618,
        "second_query": 2.637129999930039,
        "third_query": 1.5327200000174344
      },

In comparison, if I open one new connection for every query of the request:

Load of 5,000 requests (100 concurrent). Time taken for tests: 78.632 seconds

Percentage of the requests served within a certain time (ms)
  50%   1509
  66%   1547
  75%   1580
  80%   1608
  90%   1699
  95%   1796
  98%   2600
  99%   3009

Connections left hanging: 0

One request example

      "latency": {
        "first_query": 19.859559000004083,
        "second_query": 17.381792000029236,
        "third_query": 18.042578999884427
      },
filipedeschamps commented 2 years ago

The final solution (already working in production) is to use the Pool as much as you can, but keep checking "opened connections" versus "available connections", and if they're too close, start to end() the Pool in the finally block instead of just release()ing the client.

My repository is still private, but here's the full solution:

// database.js - exports a Singleton with `query()` and `getNewClient()` methods

import { Pool, Client } from 'pg';
import retry from 'async-retry';
import { ServiceError } from 'errors/index.js';

const configurations = {
  user: process.env.POSTGRES_USER,
  host: process.env.POSTGRES_HOST,
  database: process.env.POSTGRES_DB,
  password: process.env.POSTGRES_PASSWORD,
  port: process.env.POSTGRES_PORT,
  connectionTimeoutMillis: 5000,
  idleTimeoutMillis: 30000,
  max: 1,
  ssl: {
    rejectUnauthorized: false,
  },
  allowExitOnIdle: true,
};

// https://github.com/filipedeschamps/tabnews.com.br/issues/84
if (['test', 'development'].includes(process.env.NODE_ENV) || process.env.CI) {
  delete configurations.ssl;
}

const cache = {
  pool: null,
  maxConnections: null,
  reservedConnections: null,
  openedConnections: null,
  openedConnectionsLastUpdate: null,
};

async function query(query, params) {
  let client;

  try {
    client = await tryToGetNewClientFromPool();
    return await client.query(query, params);
  } catch (error) {
    const errorObject = new ServiceError({
      message: error.message,
      context: {
        query: query.text,
      },
      errorUniqueCode: 'INFRA:DATABASE:QUERY',
      stack: new Error().stack,
    });
    console.error(errorObject);
    throw errorObject;
  } finally {
    if (client) {
      const tooManyConnections = await checkForTooManyConnections(client);

      if (tooManyConnections) {
        client.release();
        await cache.pool.end();
        cache.pool = null;
      } else {
        client.release();
      }
    }
  }
}

async function tryToGetNewClientFromPool() {
  const clientFromPool = await retry(newClientFromPool, {
    retries: 50,
    minTimeout: 0,
    factor: 2,
  });

  return clientFromPool;

  async function newClientFromPool() {
    if (!cache.pool) {
      cache.pool = new Pool(configurations);
    }

    return await cache.pool.connect();
  }
}

async function checkForTooManyConnections(client) {
  const currentTime = new Date().getTime();
  const openedConnectionsMaxAge = 10000;
  const maxConnectionsTolerance = 0.9;

  if (cache.maxConnections === null || cache.reservedConnections === null) {
    const [maxConnections, reservedConnections] = await getConnectionLimits();
    cache.maxConnections = maxConnections;
    cache.reservedConnections = reservedConnections;
  }

  if (
    cache.openedConnections === null ||
    cache.openedConnectionsLastUpdate === null ||
    currentTime - cache.openedConnectionsLastUpdate > openedConnectionsMaxAge
  ) {
    const openedConnections = await getOpenedConnections();
    cache.openedConnections = openedConnections;
    cache.openedConnectionsLastUpdate = currentTime;
  }

  if (cache.openedConnections > (cache.maxConnections - cache.reservedConnections) * maxConnectionsTolerance) {
    return true;
  }

  return false;

  async function getConnectionLimits() {
    const [maxConnectionsResult, reservedConnectionResult] = await client.query(
      'SHOW max_connections; SHOW superuser_reserved_connections;'
    );
    return [
      maxConnectionsResult.rows[0].max_connections,
      reservedConnectionResult.rows[0].superuser_reserved_connections,
    ];
  }

  async function getOpenedConnections() {
    const openConnectionsResult = await client.query(
      'SELECT numbackends as opened_connections FROM pg_stat_database where datname = $1',
      [process.env.POSTGRES_DB]
    );
    return openConnectionsResult.rows[0].opened_connections;
  }
}

async function getNewClient() {
  try {
    const client = await tryToGetNewClient();
    return client;
  } catch (error) {
    const errorObject = new ServiceError({
      message: error.message,
      errorUniqueCode: 'INFRA:DATABASE:GET_NEW_CONNECTED_CLIENT',
      stack: new Error().stack,
    });
    console.error(errorObject);
    throw errorObject;
  }
}

async function tryToGetNewClient() {
  const client = await retry(newClient, {
    retries: 50,
    minTimeout: 0,
    factor: 2,
  });

  return client;

  // You need to close the client when you are done with it
  // using the client.end() method.
  async function newClient() {
    const client = new Client(configurations);
    await client.connect();
    return client;
  }
}

export default Object.freeze({
  query,
  getNewClient,
});
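Usage then looks like this (a sketch with a made-up query; the import path depends on your project layout):

import database from 'infra/database.js';

const results = await database.query({
  text: 'SELECT $1::int AS answer;',
  values: [42],
});

console.log(results.rows[0].answer); // 42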
filipedeschamps commented 2 years ago

Final results

First strategy

Open and close connection for every single query:

Time taken for tests:   78.256 seconds

Last strategy

Manage Pool based on opened vs available connections:

Time taken for tests:   20.968 seconds
stefanwaldhauser commented 6 months ago

@filipedeschamps Thank you so much for investigating this issue. I had the same problem and am quite happy with the following solution:

const MAX_SIGNED_32_BIT_INTEGER = 2147483647;
...
{
     idleTimeoutMillis: MAX_SIGNED_32_BIT_INTEGER,
     reapIntervalMillis: MAX_SIGNED_32_BIT_INTEGER,
}
...

This configuration effectively deactivates the reap cycle in tarn.js, meaning that idle connections are never cleaned up by the application, regardless of whether the lambda process is hot or cold. Given the operational characteristics of lambdas, application-side cleanup does not work reliably, so disabling it seems sensible.

Additionally, I set idle_session_timeout in PostgreSQL to 30,000 milliseconds (30 seconds) to clean up idle connections on the database side (see the PostgreSQL documentation).

I also implemented graceful shutdown listeners in the lambda to ensure all remaining connections are cleaned up before the process is terminated, although this step seems optional since the server would clean these up after 30 seconds anyway. More details can be found in this Stack Overflow discussion.
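The listener itself is roughly this (a sketch; pool is your pool instance, and whether SIGTERM is actually delivered depends on the Lambda setup, as covered in the linked discussion):

process.on('SIGTERM', async () => {
  try {
    await pool.end(); // close remaining connections before the sandbox is terminated
  } finally {
    process.exit(0);
  }
});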

Another good option is to look into RDS Proxy for an external DB connection pool.

stefanwaldhauser commented 5 months ago

Instead of deactivating the application-side DB connection pool completely, I went with setting the database idle timeout slightly higher than the pool's. The full approach is described here: https://blog.stefanwaldhauser.me/posts/lambda_db_connection_leak/

In an AWS Lambda setup using application-side database connection pooling, there's a risk of connection leaks because frozen lambda processes don't clean up idle database connections. To address this, configure the PostgreSQL idle_session_timeout parameter slightly higher than the application-side pool's idleTimeoutMillis. Also, when the lambda process is unfrozen and another execution occurs, manually check whether your application-side pool contains any idle connections, to ensure you are not using a database connection that the database terminated while the process was frozen.
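The last part looks roughly like this (a sketch, not the exact code from the post; the timestamp bookkeeping is an assumption):

import { Pool } from 'pg';

const DB_IDLE_SESSION_TIMEOUT_MS = 30000; // must match idle_session_timeout on the server
const config = { idleTimeoutMillis: 25000 }; // other connection settings omitted

let pool = new Pool(config);
let lastUsedAt = Date.now();

export async function getPool() {
  // If the process may have been frozen past the database's idle timeout, any idle
  // clients left in the pool were likely terminated by the server already, so drop
  // the whole pool instead of handing out dead connections.
  if (pool.idleCount > 0 && Date.now() - lastUsedAt > DB_IDLE_SESSION_TIMEOUT_MS) {
    await pool.end().catch(() => {});
    pool = new Pool(config);
  }
  lastUsedAt = Date.now();
  return pool;
}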

filipedeschamps commented 5 months ago

to ensure you are not using a database connection that has been terminated by the database while the process was frozen

Yes! I've had a hard time with this, since the pool doesn't automatically handle this case for you, unfortunately 🤝