Consider a configuration parameter for 'working-set' connection pool size

YogoGit commented 9 years ago

Thanks for developing this library. We use this in our Play applications, and it's made a HUGE positive impact on the performance of the application.

After months of our applications working fine, we started experiencing issues that I believe are related to the issue fixed in 2.3.7. This issue isn't about that bug, but during my investigation, I went through all of the HikariCP documentation, as well as looked at the details of the connection pool (using the Codehale metrics), and it became obvious that the current configuration parameters are not ideal for our application.

I've noticed that for the application in question, during most of day, we tend to use ~5-7 connections 'all-the-time' (including both idle and active connections), with occasional spikes to 10-15 occasionally during the week, and rare spikes to 50+ when we get a news story and our site gets beat up bad. We also have multiple instances of the application running at the same time, so this spike could involve 100-150 connections across the cluster.

Because of this usage pattern, I can't determine out a good way to configure HikariCP. Because of the rare spikes, I don't dare limit the maximum to less than 50 (it's set to 75 currently, and we've not reached the limit). However, setting up HikariCP to have 75 connections active all the time is a huge waste of resources (mostly memory), since this would require our DB to have 150-225 (mostly idle) connections running for this one application to handle the very rare times that we get a spike, which has a measured negative effect on performance. Also, I don't want to clamp the upper limit to a much lower number since when we had configured a lower maximum our application fell over instead of degrading when the limit was hit. However, we also still want to keep a upper-limit on things 'just in case' we experience a runaway process, DoS, or application bug since we would rather not have a single rogue application take out the entire DB server farm (which is shared among many applications).

We were setting minimumIdle to 5, but what ends up happening with that setting is that it continually creates new connections every time a connection is handed out, even though that new connection is probably not going to be used about half the time.

What I think would work better would be to have the ability to set the size of the working-set, such that we would be able to indicate we would like 7 connections in the pool at all times, which includes both active and idle connections. This ensures that we don't create new connections everytime a connection is allocated, but still leaves to ensure that we keep a set of ready-to-go connections when a connection times out (maximum TTL). If this were possible, it would take care of 95% of our application needs. The application would have the ability to manage the connection in the working set (including refilling the set once connections time-out), but still allow us the headroom of allowing for a 'limited' maximum spike that will eventually get cleaned up and brought back to the 'working-set' number.

I'm open to ideas. For the time being, we're living with minimumIdle set to 5, which causes the working set to fluctuate between 5 and 10, which is acceptable, but not optimal.

Another potentially interesting datapoint is that we set maximumLifetime to 10 mins (for security and clustering reason), which may also have an affect on the behavior of the pool.

brettwooldridge commented 9 years ago

@YogoGit Before we could make any recommendations we'd need a bit more data/metrics. Understanding the stack, and measurements taken at different points, is key to performing this kind of tuning.

You said "DB server farm" ... for the application in question are we talking about one DB server, a load-balanced cluster, ...?

Database server(s) info:

How many? (with respect to the application in question)
CPU cores, single server or total across all servers if load-balancing?
Memory?

Front-end server(s) info:

How many?
min/max/avg request/sec? (eg. over a 3-5 day period)
min/max/avg response time? (eg. over a 3-5 day period)
Less relevant, but avg. CPU cores/server?

HikariCP configuration? (I understand maybe 80% of the relevant settings are above)

Lastly, can you share some of the Codehale metrics you have? For example, wait-time and usage-time during both "normal" load (~5-7 connections), and "busy" times (~10-15 connections)? Knowing these would shed a lot of light on what I might suggest. Of course, knowing those metrics for the "rare spikes" would be interesting too.

YogoGit commented 9 years ago

Let's see, where do I begin. Everything is in the cloud, so the # of CPUs and cores are mostly irrelevant, since everything is virtual, so it's all emulated/faked anyway. None of the front-end servers are memory or CPU limited in any way, and while the DB server could always use more memory, it tends to have a load average of slightly less than 40% during the day (I know, a complete arbitrary number), with short-lived bumps in the morning and after lunch around the top of the hour when folks login to the application. (The DB server is a dedicated virtual server that only runs MySQL, with 8 Xeon CPUS and 16GB of memory, with 14GB allocated to a tuned MySQL instance, for whatever value that provides.)

In my experience, our environment has very little affect on the pooling performance, as it tends to be dominated by the application behavior (web applications). Regardless, here is the data as best as I can provide.

For purposes of the application, we have a single DB for ~10 different applications. We have replication and mirroring going on in the cluster, but the application is agnostic to those changes. I probably shouldn't have even mentioned the term cluster, as it just clouds the issue.
Front-end servers. We have a few of them (5 currently), and one each server we run instances of our 22 different applications in Docker containers. So, each server has a mix of applications running on it, based on our knowledge of 'busy' vs. 'boring' applications, and the very busiest application may have multiple instances of them running on different servers (one of which is the application in question).
We have a load balancer in front of 3 dedicated NGINX servers, which farm requests out to the front-end servers, so the front-end servers deal with either external page requests and/or Rest WS requests from the other front-end servers.
DB load varies, but it's not too bad. We run between 1.5-3K queries/sec throughout the day, which is measured across ALL the apps. The DB has ~22 different applications hitting it (not including mirroring and replication), and of those, 12 of them use HikariCP with a minimumIdle settting of 5. This tends to leave our DB with 80-105 active connections during the day (with random spikes to 160-240, which I'm hoping are cured by version 2.3.7).
Transaction time is harder to judge (Codehale below), but looking at NewRelic, it appears that on average each DB transaction is ~65 ms. Again, not too bad considering we have some pretty nasty queries and normalized data, and we have some transactions that can take as long as 3 seconds to complete (they are rare though).

DB/HikariCP settings:

HikariCP Version: 2.3.2 (in the process of upgrading now to 2.3.7)
MySQL Driver 5.1.34
minimumIdle = 5
maximumPoolSize = 75
connectionTimeout = 7s
maxLifetime = 10m
connectionInitSql = 'SELECT 1' Everything else is using the defaults, including idleTimeout

Codehale data (throwing some random values out I grabbed during the day). These are random. Will take a bit more time to pick up 'normal' vs. 'busy' times and followup with another message, as correlating things will take a bit more work as the data for load and such is on a different system with the Codehale logs. Note, the rare spikes haven't occurred since we just enabled Codehale metrics earlier this week, so I can't give you those (yet), and it may be more than a month between spikes.

type=TIMER, name=HikariPool-2.pool.Wait, count=4411072, min=0.0016029999999999998, max=0.8736189999999999, mean=0.011255767431160671, stddev=0.08706364048207735, median=0.001777, p75=0.0020369999999999997, p95=0.00244, p98=0.0026869999999999997, p99=0.794254, p999=0.8736189999999999, mean_rate=48.96545811872003, m1=83.64623764803856, m5=100.54372187887556, m15=106.02019035494982, rate_unit=events/second, duration_unit=milliseconds      
type=TIMER, name=HikariPool-2.pool.Wait, count=4406907, min=0.0016129999999999999, max=0.8604109999999999, mean=0.007689056250542513, stddev=0.06829777539927599, median=0.001789, p75=0.002046, p95=0.0024419999999999997, p98=0.002632, p99=0.004213, p999=0.847111, mean_rate=48.94639077107378, m1=84.82444448873724, m5=103.69073005868988, m15=107.31806263774833, rate_unit=events/second, duration_unit=milliseconds
type=TIMER, name=HikariPool-2.pool.Wait, count=4379533, min=0.001593, max=2.179563, mean=0.01897283644235109, stddev=0.13838340261027937, median=0.001808, p75=0.0021089999999999998, p95=0.002523, p98=0.01217, p99=0.828869, p999=2.179563, mean_rate=48.77779529925719, m1=118.37363711711554, m5=105.0839990779641, m15=107.32719433386445, rate_unit=events/second, duration_unit=milliseconds

type=HISTOGRAM, name=HikariPool-2.pool.Usage, count=4409771, min=1, max=709, mean=23.729489435250294, stddev=56.30743107633765, median=16.0, p75=22.0, p95=47.0, p98=155.0, p99=360.0, p999=709.0
type=HISTOGRAM, name=HikariPool-2.pool.Usage, count=4406906, min=1, max=1000, mean=24.312712040593706, stddev=50.35508511999317, median=16.0, p75=26.0, p95=68.0, p98=202.0, p99=294.0, p999=419.0
type=HISTOGRAM, name=HikariPool-2.pool.Usage, count=4380989, min=1, max=1213, mean=26.081748607995728, stddev=62.41842354992076, median=17.0, p75=21.0, p95=72.0, p98=158.0, p99=339.0, p999=776.0

brettwooldridge commented 9 years ago

If you've got NewRelic, we recommend hooking up the Dropwizard (Codehale) Metrics ... this is one option:

https://github.com/palominolabs/metrics-new-relic

There are also some recipes for wiring up dropwizard via JMX to NewRelic floating around.

YogoGit commented 9 years ago

Our account doesn't allow for custom metrics. :(

brettwooldridge commented 9 years ago

I wanted to circle back on the issue to provide you with some additional information. We've found that for a lot of applications, tuning the MySQL can have a huge impact on performance. You might try some of the items discussed here. As well as digging through the Connector/J properties.

nateww commented 9 years ago

Thanks Brett. For us, things work really well 98% of the time. With our applications (roughly 8 different applications), we generally run about 10-12 active connections, with each 'session' lasting between 10-80ms on average. Unfortunately, when our DB gets a bit bogged down due to either a significant increase in load (ie; marketing has done their job), or some developer has added a really slow query to the system, the # active connections ramps up significantly (like from 15 to roughly 75 or 100). When this occurs, the database is already behind, and the default behavior of HikariCP is to add another 60-85 additional connections to the pool to keep the minimum idle connections number stead, making what is already a bad situation worse.

In our case, I could care less about the # of idle connections. I would like a 'working set size' of about a dozen (meaning active + idle, which is where the applications tends to sit), and a maximum # of connections to avoid things going rogue on us.

Right now, we set minimum idle to ensure that we have enough connections 'available' to handle most things effeciently (because we know the size of our working set). I suppose we could set minimumIdle to zero, but then we know that during the day, it causes the system to get much more 'bursty' because it keeps throwing away connections instead of keeping them around, but I see no other way of ensuring that we keep around a 'cache/pool' of connections.

Your recommendation to make it a fix-sized pool by setting both minimumIdle and maximumPoolSize sounds like it work most of the time. Unfortunately, when we get the occasional (once/twice week), we need to allow for a larger maximumPoolSize, as we've found that it's much more time-consuming to block the application while waiting for a connection than it is to spin up a new connection. This is primarily due to the fact that the connections that tend to cause the increase in active connections are the occasional long-running queries which tend to cluster at certain times of the week/month/etc...

Regardless, thanks for keeping an eye on this ticket, and the issue I point out is an optimization for our situation, as we are very happy with the performance on HikariCP.

nateww commented 9 years ago

Ps. Thanks again for the pointer to the Codehale metrics. It was invaluable in helping us debug a performance issue in our MySQL setup. In the past you had us setup logging which showed us a bug in HikariCP that caused an explosion in connections (fixed), but recently we had some issues with a bad query coupled with an application that accidentally allowed the borrower to repeat the bad query multiples times. In this case, we could identify that the # of active connections was increastng, allowing us to focus on the connection consumers and not a pooling issue.

brettwooldridge commented 9 years ago

You might try taking one of the app servers and configuring it as you mentioned, setting minimumIdle=0. If there is fairly constant activity, within the idleTimeout, that will keep the connections active in the pool at approximately the correct level, but will prevent the pool from refilling aggressively during spikes. Another variation would be minimumIdle=1.

If you are running >1k transactions/sec throughout the day, you can definitely be more aggressive with the idleTimeout, using a value like 30 seconds, without actually causing any significant connection churn due to transactions keeping connections alive. The benefit would be that the rampdown from a spike would be much quicker.

EDIT: naturally, you want to closely examine the metrics of how this app server runs during normal load and especially the problematic spike loads.

lakshmisri commented 9 years ago

I have a question. I am using this for my connection pooling and the version am using is 2.3.6. Also I am using Java 7. So is that the right version to use?

brettwooldridge commented 9 years ago

@lakshmisri If you're on Java 7, you can use 2.4.1 (just released yesterday).

lakshmisri commented 9 years ago

@brettwooldridge: Thank you so much for your quick response. I was getting too many open connection issue. So that will be resolved right. This is the property I am using jdbcUrl=jdbc:mysql://localhost/myDB username=root password=**** maximumPoolSize=20

hope this is enough

lakshmisri commented 9 years ago

If it was java 6, what version to use ???

Regards, Lakshmisri Gopalan IDEAS2IT Technologies

On Tue, Aug 25, 2015 at 11:28 AM, Brett Wooldridge <notifications@github.com

wrote:

If you're on Java 7, you can use 2.4.1 (just released yesterday).

— Reply to this email directly or view it on GitHub https://github.com/brettwooldridge/HikariCP/issues/320#issuecomment-134489418 .

brettwooldridge commented 9 years ago

@lakshmisri If it was Java 6, you would want HikariCP-java6 version 2.3.9.

fgaule commented 8 years ago

There is an open ticket in NewRelic to request support for Hikari. It needs a few +1 to move from Incubator to Production. https://discuss.newrelic.com/t/measure-database-when-using-custom-client/36134

brettwooldridge / HikariCP

Consider a configuration parameter for 'working-set' connection pool size #320