brianc / node-pg-pool

A connection pool for node-postgres
MIT License
180 stars 64 forks source link

Connection terminated due to connection timeout / Connection terminated #116

Open dustin-H opened 5 years ago

dustin-H commented 5 years ago

Hi there,

I got some issues with this pool. Maybe this is related to https://github.com/brianc/node-postgres/issues/1790 and https://github.com/brianc/node-postgres/issues/1611.

First of all this is my environment:

I'm using pg-pool with this config:

{
   /* normal credentials with ssl */
   min: 10,
   max: 46,
   idleTimeoutMillis: 30000,
   connectionTimeoutMillis:  3000
}

I get about 10 times a day this error: Connection terminated due to connection timeout.

This leads to a failure of some query and thus to an error for the end-user.

I already tried to debug this in cooperation with my Database-Provider, however this doesn't seem to be related to that. The ping to my db is about 0.23ms in the same data-center.

After having a look at the code, enabling detailed logging and hours of debugging I think I found the issue.

I think this is related to ending clients on idle timeouts: https://github.com/brianc/node-pg-pool/blob/master/index.js#L41

Here the idle timeout is triggered, however it does not check if the client is currently used. I know the timeout is cleared in https://github.com/brianc/node-pg-pool/blob/master/index.js#L138, but there seems to be an edge case where a client gets destroyed, which has just been given to the application.

For me that happened once in about 2000 requests.

To prove my intention, I set idleTimeoutMillis to 0, which disables the removal of idle clients. This solved the issue for me. I never got this error again.

I cannot really say how this can happen nor how to fix it, but I'm pretty sure this is related to the idle-timeout.

Hope this helps debugging it.

Have a nice day, Dustin

jfromaniello commented 5 years ago

I've the same problem even with idleTimeoutMillis: 0 in 7.8.1.

I am not sure why the connection get closed, the query fails with this stack trace:

Error: Connection terminated due to connection timeout
    at Connection.con.once (/usr/local/lib/node_modules/myapp/node_modules/pg/lib/client.js:200:9)
    at Connection.g (events.js:292:16)
    at emitNone (events.js:86:13)
    at Connection.emit (events.js:185:7)
    at Socket.<anonymous> (/usr/local/lib/node_modules/myapp/node_modules/pg/lib/connection.js:76:10)
    at emitOne (events.js:96:13)
    at Socket.emit (events.js:188:7)
    at TCP._handle.close [as _onclose] (net.js:509:12

I checked my server config and I don't have idle_in_transaction_session_timeout.

UmanShahzad commented 5 years ago

Same issue here, it's actually error spamming us hard. I hypothesize this has to do with the interplay between killed idle connections and the connection timeout - idle connections die, then a stampede of connections come in, and on the DB side, for whatever reason, it can't handle the stampede of new connections all at once. Maybe something to do with cloud provider limitations on the VM hosting the DB, or whatever, but it's probably a good idea to just not kill idle workers if you can afford to have many connections open in the first place.

We're going to probably try setting idleTimeoutMillis: 0 as well and will report back.

OTATA commented 5 years ago

Same problem here too. We set idleTimeoutMillis to 0 and error persists :/

danielsmeyer commented 5 years ago

I am also working through this error. Seems like connections are aborting instead of waiting?

cobolbaby commented 4 years ago

+1

ericpearson commented 4 years ago

+1 is this issue ever going to be fixed?

solagna commented 4 years ago

+1 same problem

thakursaurabh1998 commented 4 years ago

+1

coderholic commented 4 years ago

We have tried setting idleTimeoutMillis: 0 and still see this error. Any updates on this?

coderholic commented 4 years ago

We have ~40 web instances and are seeing ~250 errors within a minute from a single instance a couple of times a day (max connections is set to 3), and different instances at different times, which rules out it being a real DB connection issue, and means it must be a code issue. We've also set the timeout to 20s, and still get it regularly.

It is possible that the issue is related to https://github.com/brianc/node-pg-pool/pull/131, or is there something else wrong here?

charmander commented 4 years ago

@coderholic Unless it’s an unhandled error you expect to be handled by pool.query(), no, it’s not related to #131. It doesn’t sound like the same issue as this issue, either, given that setting idleTimeoutMillis: 0 doesn’t make it go away.

We have ~40 web instances and are seeing ~250 errors within a minute from a single instance a couple of times a day (max connections is set to 3), and different instances at different times, which rules out it being a real DB connection issue, and means it must be a code issue.

Not sure that that rules a DB connection issue out.

Which error messages exactly are you seeing?

coderholic commented 4 years ago

The full error we see is:

Error: Connecting to '<redacted>' failed: timeout exceeded when trying to connect
  at Timeout.setTimeout [as _onTimeout] (/app/node_modules/pg-pool/index.js:180:27)
  at ontimeout (timers.js:436:11)
  at tryOnTimeout (timers.js:300:5)
  at listOnTimeout (timers.js:263:5)
  at Timer.processTimers (timers.js:223:10)

If there was an issue with the DB I'd expect to see these errors on all instances, rather than just one. We're doing a fairly consistent 150 queries/second to the DB, so it's not like it's mostly idle and the other instances aren't hitting the DB at the point there could be an issue.

I'm also not sure why we see ~250 errors every time this happens. Each instance has 3 max connections, so it seems like it's retrying a lot and throwing the same error again and again. We have a 20s timeout set though, so shouldn't we see this error a max of 3 times per 20s, instead of 250 times within a few seconds?

charmander commented 4 years ago

@coderholic Okay, that different error message is caused by a timeout acquiring a client when there’s no room left in the pool to connect new ones. I don’t know what specifics could account for the difference in your instances, but seeing lots of errors is easy to explain. One way: 3 queries tie up the pool for 21⅔ seconds, the first 1⅔ seconds of queries after them time out and produce 250 errors, then everything else continues as normal.

coderholic commented 4 years ago

@charmander thanks for the additional details. If I'm understanding correctly, this error is about queued queries timing out while waiting for a connection to the DB to be returned to the pool, it's not actually about a pool connection timing out when connecting to the DB? And is there an error thrown for every queued query? If that's the case then it sounds like one fix would be to increase the pool size?

coderholic commented 4 years ago

Also, I'm fairly sure we don't have any queries that take more than 1s (we have slow logging turned on for anything over 1s, and we don't have any queries from these instances that trigger the log). Is there something else, other than a slow query, that could result in a connection not getting returned to the pool for >20s?

charmander commented 4 years ago

If I'm understanding correctly, this error is about queued queries timing out while waiting for a connection to the DB to be returned to the pool, it's not actually about a pool connection timing out when connecting to the DB?

Right.

And is there an error thrown for every queued query?

For every one that times out, yes.

If that's the case then it sounds like one fix would be to increase the pool size?

Yes, if your PostgreSQL server(s) can handle it.

Also, I'm fairly sure we don't have any queries that take more than 1s (we have slow logging turned on for anything over 1s, and we don't have any queries from these instances that trigger the log). Is there something else, other than a slow query, that could result in a connection not getting returned to the pool for >20s?

It doesn’t have to be one connection not getting returned to the pool for >20s, the query queue just has to fill up faster than it empties for at least that long. (It also doesn’t have to be the query execution itself that’s slow – try timing queries client-side, especially on the problematic client.)