brianc / node-postgres

PostgreSQL client for node.js.
https://node-postgres.com
MIT License

"timeout exceeded when trying to connect" spike after upgrading to 8.2.1 #2262

Open mriedem opened 4 years ago

mriedem commented 4 years ago

Yesterday we upgraded from pg@8.0.3 (with pg-pool@3.1.1) to pg@8.2.1 (with pg-pool@3.2.1), specifically from our package-lock.json:

    "pg": {
      "version": "8.2.1",
      "resolved": "https://registry.npmjs.org/pg/-/pg-8.2.1.tgz",
      "integrity": "sha512-DKzffhpkWRr9jx7vKxA+ur79KG+SKw+PdjMb1IRhMiKI9zqYUGczwFprqy+5Veh/DCcFs1Y6V8lRLN5I1DlleQ==",
      "requires": {
        "buffer-writer": "2.0.0",
        "packet-reader": "1.0.0",
        "pg-connection-string": "^2.2.3",
        "pg-pool": "^3.2.1",
        "pg-protocol": "^1.2.4",
        "pg-types": "^2.1.0",
        "pgpass": "1.x",
        "semver": "4.3.2"
      },
      "dependencies": {
        "pg-connection-string": {
          "version": "2.2.3",
          "resolved": "https://registry.npmjs.org/pg-connection-string/-/pg-connection-string-2.2.3.tgz",
          "integrity": "sha512-I/KCSQGmOrZx6sMHXkOs2MjddrYcqpza3Dtsy0AjIgBr/bZiPJRK9WhABXN1Uy1UDazRbi9gZEzO2sAhL5EqiQ=="
        },
        "semver": {
          "version": "4.3.2",
          "resolved": "https://registry.npmjs.org/semver/-/semver-4.3.2.tgz",
          "integrity": "sha1-x6BxWKgL7dBSNVt3DYLWZA+AO+c="
        }
      }
    }

We started seeing a spike in "timeout exceeded when trying to connect" errors with this stack trace:

Error: timeout exceeded when trying to connect
    at Timeout._onTimeout (/usr/src/app/node_modules/pg-pool/index.js:188:27)
    at listOnTimeout (internal/timers.js:549:17)
    at processTimers (internal/timers.js:492:7)

This is a pretty basic express app with a postgres 12 backend running on node 12.

We report metrics on the connection pool max/total/idle/waiting count values and there is an obvious spike in the wait count from the time the 8.2.1 upgrade was deployed (around 9am CT yesterday) and then the drop when we reverted that change (about 6am CT today):

[graph: pool waiting count spike after the 8.2.1 deploy, dropping after the revert]

That corresponds with our API request/response/error rates (again, just a simple express app over a pg db):

[graph: API request/response/error rates over the same period]

We're not sure how to debug this. These are the relevant values we're using related to the Pool config:

We have a staging environment where this showed up as well, but we didn't have an alert set up for it (we do now). So if there is something we can do to help debug this and provide information back, we can probably do that in our staging environment.

jsatsvat commented 4 years ago

We're having the same issue - it's causing us some headaches. Just looking at 8.2.2 to see if that fixes it.

mriedem commented 4 years ago

Looking at the changes in 8.2.2 I'm not sure those are going to resolve my issue since I'm not using bytea type columns:

https://github.com/brianc/node-postgres/compare/pg@8.2.1...pg@8.2.2

charmander commented 4 years ago

@mriedem The fix applies to any long message.

brianc commented 4 years ago

hmm sorry this is happening - I'm assuming this isn't happening all the time or in a reproducible way, just sometimes at random? Those issues are hard to debug. I'm gonna check my app logs & see if I can see any of this - I usually run the latest on our systems and they're under pretty heavy load all day.

mriedem commented 4 years ago

I'm assuming this isn't happening all the time or in a reproducible way, just sometimes at random?

It's definitely reproducible / persistent to the point that we had to revert the package upgrade in our app because of the regression. If you have a debug patch or something in mind, we could apply it in our staging environment; I'm sure we'd hit the issue there and get more data. I can't say there is a specific query that fails 100% of the time though.

I'm not sure if it helps, but there were quite a few of the errors happening on queries to fetch data from tables with json columns, so I'm not sure if that would be impacted by #2240? But we had other failures on queries that didn't involve json columns, e.g. a simple count query: SELECT COUNT(job_id) FROM jobs WHERE queue_name = $1 AND execution_time IS NULL;. Those might have been on the same pod (db connection pool), though, so if the problem affects all connections in the same pool, that might explain the failures in other basic queries.

mriedem commented 4 years ago

@jsatsvat did moving to 8.2.2 resolve the issue for you?

jsatsvat commented 4 years ago

did moving to 8.2.2 resolve the issue for you?

No, it didn't unfortunately.

there were quite a few of the errors happening on queries to fetch data from tables with json columns, so I'm not sure if that would be impacted by #2240

FYI - We don't have any JSON columns as far as I am aware.

brianc commented 4 years ago

It's definitely reproducible / persistent to the point that we had to revert the package upgrade in our app because of the regression.

yeah I mean is there a way this can be reproduced in an isolated script so I could run it locally and analyze what's going on?

brianc commented 4 years ago

Are y'all using an ORM (knex, typeorm, etc) or using pg directly?

mriedem commented 4 years ago

Are y'all using an ORM (knex, typeorm, etc) or using pg directly?

We're not using an ORM, just pg directly.

brianc commented 4 years ago

okay - and I'm guessing it happens somewhat "at random", meaning your app has been running for a while & then you get a timeout? Once you get a timeout do you get a bunch, or is it just one every once in a while?

jsatsvat commented 4 years ago

We're using slonik.

And yes, at least for us, it happens randomly all of a sudden, with a bunch of timeouts at once.

brianc commented 4 years ago

dang - do you have numbers around how long it takes before it happens? What's the recovery you do? restart the app?

jsatsvat commented 4 years ago

For now we've drastically increased the pool size (previously set to 100) which temporarily takes some pressure off it; but as a quick recovery measure we do indeed restart the affected service.

As it seems to happen randomly, I'm not sure how helpful numbers will be, since we can't really determine a particular interval or pattern.

mriedem commented 4 years ago

okay - and I'm guessing it happens somewhat "at random", meaning your app has been running for a while & then you get a timeout? Once you get a timeout do you get a bunch, or is it just one every once in a while?

I would have to dig into this. In our production cluster we have the app running in 20 replicas, and each has a pool configured for 150 connections. Our readiness probe hits an API endpoint which runs a select now(); query to make sure that pod's connection pool is OK, because if the pool is full we want that pod to go out of rotation for traffic until it can drain its connection requests. The pod will only crash and restart automatically if an uncaught error slips through.

I think I can say that when it hits we get a bunch of timeouts, which would probably explain why the waiting count per pod (in the graph in the issue description) spikes: presumably something is blocking in the pool, so other requests wait until a timeout occurs.
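
For reference, the readiness endpoint mentioned above looks roughly like this (a simplified sketch - the route path, port, and pool settings are placeholders, not our exact code):

const express = require('express')
const { Pool } = require('pg')

const app = express()
// Placeholder pool config; our real app sets ssl.ca and other options too.
const pool = new Pool({ max: 150, connectionTimeoutMillis: 5000 })

app.get('/readiness', async (req, res) => {
  try {
    // If the pool is saturated this times out, the probe fails, and the pod
    // drops out of rotation until the pool drains.
    await pool.query('select now();')
    res.status(200).send('ok')
  } catch (err) {
    res.status(503).send('pool unavailable')
  }
})

app.listen(3000)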

brianc commented 4 years ago

K, thanks for the info - this is helpful. One other question... are you all using SSL for the connection? I'm wondering if there's a weird edge case in the startup / SSL request packet handling.

brianc commented 4 years ago

Sorry...more questions. I'm trying to repro locally...no luck yet. What cloud are y'all running in? I wonder if it's networking related to k8s or the like.

mriedem commented 4 years ago

Yeah we're setting ssl.ca in the options that get passed to create the pool.

We're in IBM Cloud, using the Kubernetes service there to run the app and IBM Cloud Databases for PostgreSQL for the PG backend.

brianc commented 4 years ago

Alright...I let a script run all night that was connecting like this: local script -> Cloud SQL proxy -> Cloud SQL instance in gcloud. It has a 10 connection pool, and every 1-2 seconds it checks out a client, runs a query, and returns it. The pool is set to an idleTimeoutMillis of 100 - so basically every time the script waits for 1-2 seconds the pool is cleaned out & there's a reconnection attempt. If it ever takes more than 3 seconds for a client to connect, query, and be returned to the pool, the script exits. It ran all night without problems....so maybe thinking the connection logic has a problem is a red herring?

Looking at the code, timeout exceeded when trying to connect is actually a bad error message - that error happens when the pool is saturated & cannot give you an existing client in time. I'm going to change that error message to be more descriptive. You've logged the waiting count on your pool and it skyrockets. That leads me to believe one of three things is going on...

1) Somewhere your code is checking out a client and not checking it back in. Eventually this starves the pool of available clients, the waiting count skyrockets, and yeah...bad things. Doing nothing but upgrading from pg@8.0 to pg@8.2 is unlikely to have caused this, since theoretically your app code didn't change & all errors are handled the same...but it's worth considering. The best way to diagnose this would be for me to add a pool.emit('release', client) call whenever a client is released back to the pool. Not having this event is just an oversight on my part. The pool already emits a pool.on('acquire', (client) => {}) event. With a matching release event you could do something like this to find orphaned clients:

// Start a timer each time a client is checked out of the pool...
pool.on('acquire', (client) => {
  client.orphanTimeoutId = setTimeout(() => {
    console.error('client was not returned within 5 seconds', client)
    process.exit(-1)  // do whatever you need to do here to trip alarms
  }, 5000)
})

// ...and cancel it when the client comes back. Any timer that fires points at
// a checkout site that never released its client.
pool.on('release', (client) => {
  clearTimeout(client.orphanTimeoutId)
})

2) Something changed between 8.0 and 8.2 which makes a client enter some kind of exceptionally bad run time (we fixed one of those things with #2241, there might be more?). The best way to debug this would be to log all query execution times and use statement_timeout in your connection options, set to something reasonably low like 3 seconds (or 2x whatever your longest query ever takes), and then monitor the logs for errors due to queries timing out (see the config sketch after this list). That could isolate it down to a particular query, at which point, given that query & estimated data volumes returned from it, I could likely build a reproduction and fix on my end.

3) Something changed between 8.0 and 8.2 which makes a client get "stuck" thinking it's executing a query when it's not. This isn't likely...but it would be worth investigating. The easiest way to do this is if you have a single file where all queries "funnel through" in your app; otherwise you'll need to monkeypatch Client.prototype.query and wrap it with a timing function. Time the execution and log to the console (or crash the app entirely if it's in staging or something) if the duration between calling client.query and the query finishing exceeds some threshold. This is different than statement_timeout in that maybe all the queries finish fast, but for some reason something gets "stuck" in the parser or client and it never reads the readyForQuery message properly.
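
For (2), here's a rough sketch of the kind of setup I mean (the connection string and timeout values below are just placeholders - tune them to your workload):

const { Pool } = require('pg')

const pool = new Pool({
  connectionString: process.env.DATABASE_URL,
  // statement_timeout is passed through to postgres and cancels any statement
  // that runs longer than this many milliseconds.
  statement_timeout: 3000,
})

// Log every query's execution time; queries killed by statement_timeout fail
// with "canceling statement due to statement timeout", which points you at the culprit.
async function timedQuery(text, values) {
  const start = Date.now()
  try {
    return await pool.query(text, values)
  } finally {
    console.log('query finished', { text, duration_ms: Date.now() - start })
  }
}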

Sorry this is hard to track down....I'll go ahead and add the pool.on('release') event and change the error message when the pool is starved for available clients so it doesn't say "timeout when trying to connect", because it's actually not trying to connect - it's starved.

What do you think?

mriedem commented 4 years ago

Thanks for the detailed feedback and ideas of things to try. We can try (1) when that's released. Doing (2) would also be pretty easy but I'd need to tinker a bit with figuring out our slowest query times to set a proper timeout. Thankfully we do wrap query with a timer function to log query execution times so we should have that information in our logs. And because we have that wrapper function on query we should be able to try (3) as well.

I don't think I'll probably be able to get to this today but we should be able to try some of this early next week. Thanks again.

PsyTae commented 4 years ago

Looks like we are experiencing this same issue. We are using a pool and pool.query() for all our db queries, so if you would like additional info we can provide, we can try anything you suggest as well.

PsyTae commented 4 years ago

Just updated our code to 8.3.0 and, in the meantime, raised our max connections from 50 to 150. We are using Amazon AWS. Over the last 2 hours we have noticed that connections are steadily rising and not being released back to the pool for the next query to use. [RDS connections graph]

mriedem commented 4 years ago

@brianc re: your earlier comment on things to try for (3), we have a wrapper on Pool.query that logs how long the query function took. For example, this is a log statement from a query in our staging environment that took longer than 45 seconds for whatever reason:

Jul 1 12:01:05 api-queue-764c7789b9-7hlpk api-queue 20 {"name":"pg-connect","hostname":"api-queue-764c7789b9-7hlpk","pid":20,"level":20,"time_taken_ms":48866,"args":["SELECT * FROM tasks\n WHERE timestamp_dequeued IS NULL AND queue_name = $1\n ORDER BY timestamp_queued LIMIT 1 FOR UPDATE SKIP LOCKED;",["status-devices-staging"]],"msg":"","time":"2020-07-01T17:01:05.837Z","v":0}

That was from July 1, when we upgraded to pg@8.2.1, and excludes any logs from "timeout exceeded when trying to connect" failures.

I'm not sure about (2) though. Sorry if this is a dumb question but I'm not sure how to do this:

The best way to debug this would be to log all query execution times

I was thinking maybe there would be a property on the pg.Result object to record how long a query took but I don't see anything like that. Am I missing something obvious?

PsyTae commented 4 years ago

In my connection file:

const mysql = require('mysql'); // mysql
const pg = require('pg'); // postgres
const log = false;
//--- My SQL Connection -----------------------------------------------------------------------------------------------------------------

const mysqlConfig = {
    connectionLimit: 10,
    waitForConnections: true,
    host: process.env.MYSQL_HOST,
    port: process.env.MYSQL_PORT,
    user: process.env.MYSQL_USER,
    password: process.env.MYSQL_PASS,
    database: process.env.MYSQL_DATABASE
};

const postgisConfig = {
    host: process.env.POSTGIS_HOST,
    user: process.env.POSTGIS_USER,
    password: process.env.POSTGIS_PASS,
    database: process.env.POSTGIS_DATABASE,
    port: process.env.POSTGIS_PORT,
    max: 300,
    idleTimeoutMillis: 30 * 1000,
    connectionTimeoutMillis: 2 * 1000
};

const mysqlConn = mysql.createPool(mysqlConfig);
const postGisConn = new pg.Pool(postgisConfig);

postGisConn.myQuery = (text, values, callback) => {
    if (typeof values === 'function') {
        callback = values;
        values = undefined;
    }
    callback = typeof callback === 'function' ? callback : () => {};
    const start = Date.now();
    return postGisConn.query(text, values, (err, res) => {
        const duration = Date.now() - start;
        if (err) return callback(err);
        if (!err && (log || duration > (1 * 1000))) console.log('executed query', { text, duration, rows: res.rowCount });
        callback(err, res);
    });
};

mysqlConn.on('error', err => {
    console.error('Mysql Error:', err);
});

postGisConn.on('error', (err, client) => {
    console.error('PostGis Error', err);
});

module.exports.mysql = mysqlConn;
module.exports.postGis = postGisConn;

I created a wrapper for our pool.query function to calculate durations and then updated every pool.query call in the rest of the project to use pool.myQuery. Doing this, we were able to find that we did have an extremely long-running query that was hitting the DB very frequently. It turns out there was a trigger on that table we had to disable to get the long-running query taken care of.

This caused our entire connection pool to be consumed very quickly, with clients not released in time for other queries to use. Hopefully this code helps @mriedem find a way to log query execution times.

mriedem commented 4 years ago

We have something similar: everything goes through a function that wraps Pool.query and logs how long the original query function took.

I've deployed some code for that today in our staging environment which sets statement_timeout to 45 seconds and also logs a warning (which we can set up an alert on) if a query takes longer than 45 seconds. Checking our staging logs, we have some queries that routinely take about 30 seconds, so with that in place we can try upgrading to pg@8.2.1 again in our staging environment and see what it triggers.

mriedem commented 4 years ago

We're starting to play around with log_min_duration_statement, which defaults to 100ms in the IBM-managed Postgres DB instance, and we definitely have queries like this taking over 1 second:

SELECT * FROM tasks WHERE timestamp_dequeued IS NULL AND queue_name = $1 ORDER BY timestamp_queued LIMIT 1 FOR UPDATE SKIP LOCKED;

This is the definition of that table:

                                               Table "public.tasks"
       Column       |            Type             | Collation | Nullable |                Default                 
--------------------+-----------------------------+-----------+----------+----------------------------------------
 task_id            | integer                     |           | not null | nextval('tasks_task_id_seq'::regclass)
 task_content       | json                        |           | not null | 
 queue_name         | text                        |           | not null | 
 timestamp_queued   | timestamp without time zone |           | not null | timezone('utc'::text, now())
 timestamp_dequeued | timestamp without time zone |           |          | 
Indexes:
    "tasks_pkey" PRIMARY KEY, btree (task_id)
    "task_queuename" btree (queue_name)
    "task_ts_dequeued" btree (timestamp_dequeued)
    "tasks_qn_dequeued_queued" btree (queue_name, timestamp_dequeued, timestamp_queued)

In our production DB there are currently 2250262 records in that table and the task_content json blob in some of them can be relatively large (20MB).

mriedem commented 4 years ago

In our production DB there are currently 2250262 records in that table and the task_content json blob in some of them can be relatively large (20MB).

We're now actively trimming this table and running vacuum analyze in a cron job to keep the DB size down. We still need to upgrade to pg@8.3.0 in our staging environment and see if we hit the regression (my guess is it might not be as noticeable now that we're routinely cleaning up old records).

EdenAzulay commented 4 years ago

Hi, I'm having the same issue when using version 8.3.3. Do you know if there are any recent changes in the last releases that might cause the same issue?

brianc commented 4 years ago

@EdenAzulay

I'm having the same issue when using version 8.3.3. Do you know if there are any recent changes in the last releases that might cause the same issue?

What issue are you having? Can you describe it a bit more? Did you upgrade from an older version without the issue? I don't know of any changes which may have caused the issue. There have been several reports on this thread, but nothing conclusive yet.

@PsyTae

Looks like we are experiencing this same issue. We are using a pool and pool.query() for all our db queries, so if you would like additional info we can provide, we can try anything you suggest as well.

Thanks for the info and it sounds like you cleared up your own issue? Did the issue only show up w/ pg@8 and not pg@7?

As an aside: I have a separate pool in our app for long running queries (we have a few that take several seconds) with longer connection timeouts and so on. It can help.
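
Something like this, roughly (a sketch - the pool sizes and timeouts here are made-up numbers, not our actual config):

const { Pool } = require('pg')

// Default pool for normal, fast queries - fail fast if it can't hand out a client.
const pool = new Pool({
  max: 20,
  connectionTimeoutMillis: 2000,
  statement_timeout: 5000,
})

// Separate, smaller pool for the handful of known-slow queries, with more
// generous timeouts so they don't starve the main pool.
const slowPool = new Pool({
  max: 5,
  connectionTimeoutMillis: 10000,
  statement_timeout: 60000,
})

module.exports = { pool, slowPool }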

briangonzalez commented 3 years ago

@PsyTae @mriedem

Did either of you get to the bottom of this? We're seeing something similar.

mriedem commented 3 years ago

Did either of you get to the bottom of this? We're seeing something similar.

@briangonzalez Not yet, no. We're still on pg@8.0.3 since I've been dragging my feet on doing the upgrade. We have instrumented some stuff in our code for logging warnings on slow queries and such, to hopefully help us identify / alert on any performance regression when we do upgrade.

nilansht commented 3 years ago

@brianc Is there some kind of easy timeout for queries, so that the pool stays available for waiting clients if queries take a long time to run? For now my connections keep building up more and more, then out of the blue they stop being released, and finally the application hangs. This is happening in production for me.

mriedem commented 3 years ago

@brianc Is there some kind of easy timeout for queries, so that the pool stays available for waiting clients if queries take a long time to run? For now my connections keep building up more and more, then out of the blue they stop being released, and finally the application hangs. This is happening in production for me.

Have you tried setting statement_timeout or query_timeout? See the Client constructor config, which you can pass through the Pool constructor. statement_timeout is passed through to Postgres.

https://node-postgres.com/api/client
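
For example, something roughly like this (the numbers are just illustrative):

const { Pool } = require('pg')

const pool = new Pool({
  // statement_timeout: postgres cancels any statement running longer than this (ms).
  statement_timeout: 30000,
  // query_timeout: the client itself errors out if a query call takes longer than this (ms).
  query_timeout: 35000,
})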

heitordobeis commented 2 years ago

Hi @mriedem, I have the same problem today - did you find any solution?

airhorns commented 2 years ago

I think this issue is definitely still a thing; we're seeing the occasional "timeout exceeded when trying to connect" error under load in production when the pool is seemingly not under much stress and still pretty far from its max limit. Here's an example from our tracing system where we count the number of times we connect with pool.connect; each time we connect we also log the current pool stats (idleCount, totalCount, waitingCount). Our max is 75, so the pool should definitely be able to allocate a new connection: [tracing screenshot]
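
Roughly how that instrumentation looks (a simplified sketch - the pool settings and log call are stand-ins for our tracing code):

const { Pool } = require('pg')

const pool = new Pool({ max: 75, connectionTimeoutMillis: 5000 })

async function getClient() {
  // Snapshot the pool stats every time we try to connect; in our app this goes
  // to the tracing system rather than the console.
  console.log('pool stats', {
    idleCount: pool.idleCount,
    totalCount: pool.totalCount,
    waitingCount: pool.waitingCount,
  })
  // This is the call that occasionally rejects with "timeout exceeded when
  // trying to connect"; the caller is responsible for client.release().
  return pool.connect()
}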

My first suspicion for a strange timeout like this would be Node event loop blocking, where the process is stuck doing something for the full 5s, but the event loop utilization never goes above 0.3:

[event loop utilization graph]

So I'm inclined to think there's some very subtle bug in the pool logic that causes the timeout to be hit under load despite the pool not being full. It works almost all the time for us; the failures are rare and seem to be triggered by load.

oriarazi commented 2 years ago

@mriedem any news on that? We're using pg@8.5.1 with pg-pool@3.3.0 and facing the same issue under load in production. @airhorns did you find any solution?

karo-dc commented 1 year ago

@mriedem did you solve that? I use "pg": "8.9.0" - same issue here.

sscots commented 1 year ago

Having the same issue on 8.8.

alexandresobolevski commented 10 months ago

+1

koplenov commented 7 months ago

+1 *popcorn_time*

airhorns commented 7 months ago

Are y'all able to prove that you aren't being hit by long event loop blocks during connection acquisition?
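
If it helps, here's a minimal sketch of one way to check that with Node's built-in perf_hooks (the interval and resolution are arbitrary):

const { monitorEventLoopDelay } = require('perf_hooks')

const histogram = monitorEventLoopDelay({ resolution: 20 })
histogram.enable()

setInterval(() => {
  // Values are in nanoseconds; a p99 anywhere near your connectionTimeoutMillis
  // around the time of the timeouts would point at event loop blocking.
  console.log('event loop delay p99 (ms):', histogram.percentile(99) / 1e6)
  histogram.reset()
}, 10000)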

robaca commented 7 months ago

Getting the same error with version 8.11.3, and we are sure there is no event loop lag (we measure it and would log it). Our service is also running on GCP/GKE, and we are using the @google-cloud/cloud-sql-connector together with the Pool to connect to a Cloud SQL for PostgreSQL instance. It's only affecting a small fraction of queries.

brianc commented 7 months ago

Dang! sorry about this! Do you have any code or way to reproduce this at all? I'd love to help fix this for you.