coopernurse / node-pool

Generic resource pooling for node.js

Massive RSS which is never released #219

Open wubzz opened 6 years ago

wubzz commented 6 years ago

In 3.1.7 the lib seems to build up a lot of memory over time, eventually accumulating enough RSS (which is never released) to crash the app. I originally reported this in https://github.com/tgriesser/knex/issues/2383 since I was not 100% sure whether the issue was in generic-pool or not, but now I am sure. Maybe related to #197 as well?

Even though I have no test app to reproduce the issue, I thought I should at least report the problem.

Edit: In my case it required a lot of traffic/SQL queries to reproduce in production. I'm not comfortable with further testing in production.

sandfox commented 6 years ago

What sort of numbers is "a lot of traffic"? And what version of Node.js are you using?

wubzz commented 6 years ago

I'm running Node 8.9.2. My app hit the issue over a weekend, when traffic is slightly lower; I would still say at least 50-150 requests/min, with each request running anywhere between 0 and 10 queries. It adds up to a lot of queries. After that weekend I fixed it by rolling back the knex/node-pool versions, so it has not happened since.

In addition to this, each tenant has its own database. This in turn means multiple instances of node-pool since there are multiple knex clients.
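Roughly, that setup looks like the sketch below (illustrative only; the tenant lookup, connection details, and pool numbers are placeholders, not taken from my actual app):

const knex = require('knex');

// Sketch of the per-tenant setup: one knex client per tenant database,
// so each tenant also gets its own generic-pool instance.
const clients = new Map();

function clientForTenant(tenantId) {
  if (!clients.has(tenantId)) {
    clients.set(tenantId, knex({
      client: 'pg',
      connection: { host: 'db-host', database: `tenant_${tenantId}` }, // placeholders
      pool: { min: 0, max: 5, idleTimeoutMillis: 5000, evictionRunIntervalMillis: 1000 },
    }));
  }
  return clients.get(tenantId);
}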

Another user of knex reported the following:

They ran a SELECT using knex repeatedly and monitored the heap usage after each query. Knex.js was burning anywhere from 700 KB to 2 MB of heap per query, and the app crashed after a few hundred queries when it hit the Node heap limit at around 1.5 GB.

So perhaps it can be reproduced by spamming queries at a much larger scale.
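If it helps, the kind of loop that other user describes would look roughly like this (a sketch only; the connection string and table name are placeholders):

// Repro sketch: run the same query repeatedly and log process memory
// after each one to see whether RSS/heap grow without bound.
const knex = require('knex')({
  client: 'pg',
  connection: process.env.DATABASE_URL, // placeholder
  pool: { min: 0, max: 5 },
});

async function hammer(iterations) {
  for (let i = 0; i < iterations; i++) {
    await knex.select('*').from('some_table').limit(1); // placeholder table
    const { rss, heapUsed } = process.memoryUsage();
    console.log(`#${i} rss=${(rss / 1e6).toFixed(1)}MB heapUsed=${(heapUsed / 1e6).toFixed(1)}MB`);
  }
}

hammer(10000).catch(console.error).then(() => knex.destroy());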

I realize you're not getting a whole lot of information, and I apologize for that.

sandfox commented 6 years ago

This might be a "fun" one to debug, and it probably all hinges on what's happening at runtime :-P A couple more questions that have occurred to me...

What's the pool config you're using? What version of generic-pool were you using before you upgraded to 3.1.7? Do you have any metrics from the pool, such as pool.size, pool.available, pool.borrowed, or pool.pending?
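For reference, those counters are exposed as getters on the pool instance, so something roughly like this (the helper name and interval are just illustrative) would be enough to capture them:

// Sketch: periodically dump the generic-pool counters mentioned above.
function logPoolMetrics(pool, intervalMs) {
  return setInterval(() => {
    console.log({
      size: pool.size,           // total resources the pool has created
      available: pool.available, // idle resources sitting in the pool
      borrowed: pool.borrowed,   // resources currently checked out
      pending: pool.pending,     // acquire() calls still waiting
    });
  }, intervalMs || 5000);
}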

wubzz commented 6 years ago

Config:

connectionOptions.pool = {
    max: dbCfg.poolMax, //Usually between 3-5 depending on cluster mode
    min: 0, //Allow the pool to drain completely when idle
    idleTimeoutMillis: 5000, //Idle connections become eviction candidates after 5s
    evictionRunIntervalMillis: 1000, //Run the evictor every second
    Promise, //Promise implementation handed through to generic-pool
};
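For comparison, the same options used directly against generic-pool 3.x would look roughly like this (a sketch; the factory below is a throwaway placeholder, not the real pg one that knex supplies):

const genericPool = require('generic-pool');

const factory = {
  create: () => Promise.resolve({}), // placeholder resource
  destroy: () => Promise.resolve(),  // placeholder cleanup
};

const pool = genericPool.createPool(factory, {
  max: 5,                          // dbCfg.poolMax in the config above
  min: 0,
  idleTimeoutMillis: 5000,         // idle resources become eviction candidates after 5s
  evictionRunIntervalMillis: 1000, // evictor wakes up every second
  Promise,                         // promise implementation for the pool to use
});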

Prior to 3.1.7 we were using version 2.4.2. Unfortunately there are no metrics available; nothing was really being logged by default in production.

sandfox commented 6 years ago

Ah I see, yes, there is a pretty huge jump between those versions and just about everything changed. FWIW the latest release (3.4.0) fixed some internal bugs which may or may not have any bearing on your problem. I don't know for sure, but it's certainly possible one of those bugs could have been responsible for the huge RSS consumption by way of not releasing objects from a queue/list somewhere. Your transaction rate alone shouldn't be enough to cause any problems; I know the library is definitely being used for things in the region of 1K+ in-flight requests and hundreds of txn/s, but I suspect the cause of this lies in the specific traffic flow and things like the number of in-flight/waiting resource requests.

I'll try to find sometime in the next day or two to see if I can easily reproduce this with some synthetic data.

max: dbCfg.poolMax, //Usually between 3-5 depending on cluster mode

What does "cluster mode" mean here?

wubzz commented 6 years ago

The cluster part is simply a dynamic limit on how many connections the pool is allowed to create, depending on whether the app is running in cluster mode or not.

Without cluster: max 10
With cluster: Math.ceil(10 / number of forks)

This ensures that when scaling the app I keep the total number of connections under Postgres' default limit of 100 connections. It's a WIP solution... :P
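Roughly, the cap is computed like this (a sketch; os.cpus() stands in for however the fork count is actually decided in my setup):

const cluster = require('cluster');
const os = require('os');

// Split a base connection budget across cluster workers so the total
// stays under Postgres' default limit of 100 connections.
const BASE_POOL_MAX = 10;
const forks = os.cpus().length; // illustrative fork count

const poolMax = cluster.isWorker
  ? Math.ceil(BASE_POOL_MAX / forks) // each fork gets a share of the budget
  : BASE_POOL_MAX;                   // no cluster: keep the full budget

// poolMax is then used as dbCfg.poolMax in the pool config above.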