brettwooldridge / HikariCP

光 HikariCP・A solid, high-performance, JDBC connection pool at last.

If a pending connection appears, all other metrics are set to 0 #897

Open jeremie-H opened 7 years ago

jeremie-H commented 7 years ago

In production, our HikariCP pool is under heavy load, and when the maximum seems to be reached, the metrics returned by Dropwizard are odd: only 1 pending connection is reported, and all other metrics are set to 0. Is the problem with Dropwizard or HikariCP?

Environment

HikariCP version: 2.4.3
JDK version     : 1.8.0_65
Database        : Oracle 12.1.0.2.0

Normal case:

  datasourcePRODUCTION.pool.ActiveConnections:  { value: 1 }
  datasourcePRODUCTION.pool.IdleConnections:    { value: 10 }
  datasourcePRODUCTION.pool.PendingConnections: { value: 0 }
  datasourcePRODUCTION.pool.TotalConnections:   { value: 10 }

Pending case:

  datasourcePRODUCTION.pool.ActiveConnections:  { value: 0 }
  datasourcePRODUCTION.pool.IdleConnections:    { value: 0 }
  datasourcePRODUCTION.pool.PendingConnections: { value: 1 }
  datasourcePRODUCTION.pool.TotalConnections:   { value: 0 }
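
For context, here is a minimal sketch of how gauges like these are wired and read once a Dropwizard MetricRegistry is handed to HikariCP. It is not from the original report: the class name, JDBC URL, and credentials are placeholders, and it assumes a reachable database with its driver on the classpath.

  import com.codahale.metrics.Gauge;
  import com.codahale.metrics.MetricRegistry;
  import com.zaxxer.hikari.HikariConfig;
  import com.zaxxer.hikari.HikariDataSource;

  public class PoolGaugesSketch {
      public static void main(String[] args) {
          MetricRegistry registry = new MetricRegistry();

          HikariConfig config = new HikariConfig();
          config.setPoolName("datasourcePRODUCTION");                // matches the metric prefix above
          config.setJdbcUrl("jdbc:oracle:thin:@//db-host:1521/SVC"); // placeholder URL
          config.setUsername("app_user");                            // placeholder credentials
          config.setPassword("app_password");
          config.setMetricRegistry(registry);                        // HikariCP registers its gauges here

          try (HikariDataSource ds = new HikariDataSource(config)) {
              // Gauges are named "<poolName>.pool.<Metric>", as quoted above.
              Gauge<?> active  = registry.getGauges().get("datasourcePRODUCTION.pool.ActiveConnections");
              Gauge<?> pending = registry.getGauges().get("datasourcePRODUCTION.pool.PendingConnections");
              System.out.println("active=" + active.getValue() + ", pending=" + pending.getValue());
          }
      }
  }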

brettwooldridge commented 7 years ago

Can you please try on the latest library version?

AlexanderMann commented 7 years ago

Not sure if this is the same, but I'm getting a similar behaviour.

DB                 : PostgreSQL (AWS RDS)
Env                : clojure:lein-2.7.1 -> openjdk:8
HikariCP version   : [hikari-cp "1.7.5"] [com.zaxxer/HikariCP "2.5.1"]
Dropwizard version : [io.dropwizard.metrics/metrics-core "3.2.3"]
PSQL JDBC version  : [org.postgresql/postgresql "9.4.1212"]

I have a reporter spitting out stats every 15 seconds, but the values I get for my active connections are virtually zilch. However, I know that the number of requests I'm serving (successfully) is in the thousands per minute...so I assume it's actually talking to the db (more so since I have timing and call stats for our db roundtrips...)
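
For reference, a minimal sketch of a reporter that flushes every 15 seconds, using metrics-core's ConsoleReporter as a stand-in for whatever reporter is actually configured here:

  import com.codahale.metrics.ConsoleReporter;
  import com.codahale.metrics.MetricRegistry;

  import java.util.concurrent.TimeUnit;

  public class ReporterSketch {
      public static void main(String[] args) {
          // The same registry that was handed to HikariCP via setMetricRegistry.
          MetricRegistry registry = new MetricRegistry();

          // Print every registered metric (including the pool gauges) every 15 seconds.
          ConsoleReporter reporter = ConsoleReporter.forRegistry(registry)
                  .convertRatesTo(TimeUnit.SECONDS)
                  .convertDurationsTo(TimeUnit.MILLISECONDS)
                  .build();
          reporter.start(15, TimeUnit.SECONDS);
      }
  }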

Two separate services running in production show the following graphs:

[screenshots: connection metrics graphs, 2017-08-04 15:41 and 15:44]

AlexanderMann commented 7 years ago

Going to try and bump all the things and see if that fixes it.

AlexanderMann commented 7 years ago

Confirmed that the HikariCP version is [com.zaxxer/HikariCP "2.6.3"].

Still seeing:

[screenshots: connection metrics graphs, 2017-08-04 16:56]

brettwooldridge commented 7 years ago

If you run with debug logging, do the statistics logged every 30 seconds agree? Debug logging is not particularly chatty, so you should be able to use it even under load.
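
A minimal sketch of turning on that logging programmatically, assuming Logback is the SLF4J backend (otherwise the equivalent logger-level setting in your logging config does the same thing):

  import ch.qos.logback.classic.Level;
  import ch.qos.logback.classic.Logger;

  import org.slf4j.LoggerFactory;

  public class HikariDebugLogging {
      public static void main(String[] args) {
          // HikariCP's housekeeper logs a "Pool stats" line at DEBUG roughly every 30 seconds.
          Logger hikariLogger = (Logger) LoggerFactory.getLogger("com.zaxxer.hikari");
          hikariLogger.setLevel(Level.DEBUG);
      }
  }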

AlexanderMann commented 7 years ago

Hey @brettwooldridge,

Still working on getting you the logs etc., but thought you might find this interesting/helpful. Last night we had a production "blip" where Usage and Wait times skyrocketed and we saw our pods get rotated due to:

java.sql.SQLTransientConnectionException: HikariPool-1 - Connection is not available, request timed out after 30000ms.

We had 55 instances of that error over about a 30-second period. This occurred on 5 of our 7 pods with loosely the same distribution (ranging from 7 at minimum to 17 at maximum). Behaviour during this was that our Wait and Usage times skyrocketed (eventually coming back down). More interestingly, the TimeoutRate has stayed solid at 23 since then. Not entirely sure why, or what's up, etc.

Pool's config is:

  {:auto-commit        true
   :read-only          false
   :connection-timeout 30000
   :validation-timeout 5000
   :idle-timeout       600000
   :max-lifetime       1800000
   :minimum-idle       10
   :maximum-pool-size  10
   :register-mbeans    false}
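
For readers more familiar with the Java API, the same settings expressed directly against HikariConfig (the class name and JDBC URL are placeholders):

  import com.zaxxer.hikari.HikariConfig;
  import com.zaxxer.hikari.HikariDataSource;

  public class PoolConfigSketch {
      public static void main(String[] args) {
          HikariConfig config = new HikariConfig();
          config.setJdbcUrl("jdbc:postgresql://db-host:5432/app"); // placeholder URL
          config.setAutoCommit(true);
          config.setReadOnly(false);
          config.setConnectionTimeout(30_000);   // ms
          config.setValidationTimeout(5_000);    // ms
          config.setIdleTimeout(600_000);        // ms
          config.setMaxLifetime(1_800_000);      // ms
          config.setMinimumIdle(10);
          config.setMaximumPoolSize(10);
          config.setRegisterMbeans(false);

          HikariDataSource ds = new HikariDataSource(config);
      }
  }

Note that minimum-idle equals maximum-pool-size here, so this is effectively a fixed-size pool of 10 connections.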

Now, all of this matters because, during this period I was able to get this distribution graph from my connections:

[screenshot: connection Usage/Wait distribution graph, 2017-08-05 09:02]

I have tests set up to capture the logging you were after, so please stay tuned! I'm wondering now if my problem is just that I'm missing the counters at the moments they hold the values I'm after. Maybe a histogram metric is more what I was expecting?

brettwooldridge commented 7 years ago

Here's a little insight into how the statistics are collected. Whenever you call Dropwizard to collect stats, this method is called. The logic is: if it has been more than one second since the last call, refresh the stats; otherwise, return the previous stats. So basically, stats are refreshed on demand at a maximum rate of once per second.
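
Illustratively, that refresh-at-most-once-per-second behaviour looks something like the sketch below. The class and method names here are hypothetical, not HikariCP's internals:

  public class ThrottledPoolStats {
      private static final long REFRESH_INTERVAL_MS = 1_000;

      private volatile long lastRefreshMs;
      private volatile int active;
      private volatile int idle;
      private volatile int pending;
      private volatile int total;

      /** Hypothetical source of the raw pool counts. */
      public interface PoolSnapshotSource {
          int activeConnections();
          int idleConnections();
          int pendingThreads();
          int totalConnections();
      }

      /** Refresh the cached snapshot only if the last refresh is more than a second old. */
      public synchronized void maybeRefresh(PoolSnapshotSource pool) {
          long now = System.currentTimeMillis();
          if (now - lastRefreshMs < REFRESH_INTERVAL_MS) {
              return; // serve the previous snapshot
          }
          lastRefreshMs = now;
          active  = pool.activeConnections();
          idle    = pool.idleConnections();
          pending = pool.pendingThreads();
          total   = pool.totalConnections();
      }

      public int getActive()  { return active; }
      public int getIdle()    { return idle; }
      public int getPending() { return pending; }
      public int getTotal()   { return total; }
  }

A caller that polls less often than once per second always sees freshly sampled values; a caller that polls more often just re-reads the previous snapshot, which can make rapidly changing counts look stale.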

brettwooldridge commented 7 years ago

Re: the above, the stats are a snapshot of the pool state at the instant the metrics are requested, and as far as I can tell they are as accurate as possible given the dynamic nature of the pool. Note, however, that there is no synchronization, so it is possible for active + idle to differ from total.

AlexanderMann commented 7 years ago

@brettwooldridge yeah, I think the above is accurate. I pulled the logs you were asking for but nothing funky stood out.

I think what I'm after (as far as "active" and "pending" counts go) is probably best captured by the Usage.Count and Wait.Count metrics. If you have any other recommendations I'm happy to hear them, but please feel free to close this on my part.
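
For anyone reading along, a small sketch of pulling those cumulative counts out of the registry. The "<poolName>.pool.<Metric>" names follow the convention quoted earlier in this thread, but the exact names and metric types may vary by HikariCP version:

  import com.codahale.metrics.Counting;
  import com.codahale.metrics.MetricRegistry;

  public class UsageWaitCounts {
      // Prints the cumulative Usage/Wait counts for the named pool.
      public static void printCounts(MetricRegistry registry, String poolName) {
          Counting usage = (Counting) registry.getMetrics().get(poolName + ".pool.Usage");
          Counting wait  = (Counting) registry.getMetrics().get(poolName + ".pool.Wait");
          System.out.println("Usage.Count=" + usage.getCount() + ", Wait.Count=" + wait.getCount());
      }
  }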

Also, goes without saying, but Hikari is awesome, and I really appreciate you as the maintainer being so responsive and reasonably opinionated on the operation of db pooling etc.

pratyushbansal commented 4 years ago

@AlexanderMann I know I am really late, but if possible could you please share the root-cause analysis of your problem? Thanks.

AlexanderMann commented 4 years ago

Oof, @pratyushbansal, this was quite a while ago and at a past employer... as best I can remember, we switched things over to stop relying on the active vs. pending connection counts. The better metric was what @brettwooldridge recommended:

Usage.Count and Wait.Count

I think...it's been so long now though.

The incident above where I got the weirdness in production is also something I can't remember.