jeremie-H opened 7 years ago
Can you please try on the latest library version?
Not sure if this is the same, but I'm getting a similar behaviour.
DB: Postgresql AWS RDS
Env: clojure:lein-2.7.1
-> openjdk:8
HikariCP Version: [hikari-cp "1.7.5"]
[com.zaxxer/HikariCP "2.5.1"]
Dropwizard Version: [io.dropwizard.metrics/metrics-core "3.2.3"]
PSQL JDBC Version: [org.postgresql/postgresql "9.4.1212"]
I have a reporter spitting stats every 15 seconds, but the values I get for my active connections are virtually zilch. However, I know that the number of requests I'm serving (successfully) is in the thousands per minute, so I assume that it's actually talking to the db (more so since I have timing and call stats for our db roundtrips...)
Two separate services running in production show the following graphs:
Going to try and bump all the things and see if that fixes.
Confirmed that HikariCP version is: [com.zaxxer/HikariCP "2.6.3"]
Still seeing:
If you run with debug logging do the statistics logged every 30 seconds agree? Debug logging is not particularly chatty, so you should be able to use it even under load.
Hey @brettwooldridge,
Still working on getting you the logs etc., but thought you might find this interesting/helpful. Last night we had a production "blip" where Usage and Wait times skyrocketed and we saw our pods get rotated due to:
java.sql.SQLTransientConnectionException: HikariPool-1 - Connection is not available, request timed out after 30000ms.
We had 55 instances of that error over about a 30 second period. This occurred over 5 of our 7 pods in loosely the same distribution (ranging from 7 at min to 17 at max). Behaviour during this was that our Wait and Usage times skyrocketed (eventually coming back down). More interestingly, the TimeoutRate has stayed solid at 23 since then. Not entirely sure why, or what's up, etc.
Pool's config is:
{:auto-commit true
:read-only false
:connection-timeout 30000
:validation-timeout 5000
:idle-timeout 600000
:max-lifetime 1800000
:minimum-idle 10
:maximum-pool-size 10
:register-mbeans false}
Now, all of this matters because, during this period I was able to get this distribution graph from my connections:
I have tests set up to conduct the logging you were after, so please stay tuned! I'm wondering now if my problem is just that I'm missing the counters when they have the values I'm after. Maybe a histogram metric is more what I was expecting?
Here's a little insight into how the statistics are collected. Whenever you call dropwizard to collect stats, this method is called. The logic is, if it has been more than one second since the last call, refresh the stats, otherwise return the previous stats. So, basically stats are refreshed on demand at a maximum rate of once per second.
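The refresh-on-demand logic just described can be sketched like this (a minimal Python illustration, not HikariCP's actual code; the `PoolStats` class and `pool_reader` callback are invented names for the sake of the example):

```python
import time

class PoolStats:
    """Sketch of refresh-on-demand stats: recompute at most once per
    second, otherwise serve the previously cached snapshot."""

    REFRESH_INTERVAL = 1.0  # seconds

    def __init__(self, pool_reader):
        # pool_reader() returns the live pool counts, e.g. {"active": n, ...}
        self._read_pool = pool_reader
        self._last_refresh = None
        self._cached = None

    def get(self, now=None):
        now = time.monotonic() if now is None else now
        if self._last_refresh is None or now - self._last_refresh > self.REFRESH_INTERVAL:
            self._cached = self._read_pool()   # more than 1s elapsed: refresh
            self._last_refresh = now
        return self._cached                    # otherwise: previous stats
```

The consequence for a scraper is that polling faster than once per second just returns the same cached snapshot, so a 15-second reporter can easily sample only the quiet moments between bursts of activity.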
Re: above, the stats are a snapshot of the pool state at the instant that the metrics are requested, and as far as I can tell are as accurate as possible given the dynamic nature of the pool. But note, there is no synchronization, so it is possible for active + idle != total.
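A toy illustration of that caveat (plain Python with an invented `ToyPool` class, not HikariCP code): because each gauge is read independently, a checkout that lands between two reads makes the combined snapshot internally inconsistent.

```python
class ToyPool:
    """Toy stand-in for a 10-connection pool."""
    def __init__(self):
        self.active = 0
        self.idle = 10

    def checkout(self):
        # A borrower takes a connection: idle -> active.
        self.idle -= 1
        self.active += 1

pool = ToyPool()
# The metrics registry reads each gauge separately, with no lock held:
active = pool.active              # reads 0
pool.checkout()                   # ...a checkout lands between the reads...
idle = pool.idle                  # reads 9
total = pool.active + pool.idle   # the total gauge still sees 10
print(active + idle, total)       # 9 10 -- the snapshot disagrees with itself
```

This is harmless for monitoring purposes, but it explains why a dashboard can briefly show gauge values that don't add up.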
@brettwooldridge yeah, I think the above is accurate. I pulled the logs you were asking for but nothing funky stood out.
I think what I'm probably after (as far as "active" and "pending" counts) is best obtained from the Usage.Count and Wait.Count metrics. Any other recommendations, I'm happy to hear them, but please feel free to close this on my part.
Also, goes without saying, but Hikari is awesome, and I really appreciate you as the maintainer being so responsive and reasonably opinionated on the operation of db pooling etc.
@AlexanderMann I know I am really late, but if possible could you please share the root cause analysis of your problem. Thanks
Oof, @pratyushbansal this was quite a while ago and at an employer past...best I can remember we switched things over to not use the active vs pending connections. The better metric was what @brettwooldridge recommended of using the:
Usage.Count and Wait.Count
I think...it's been so long now though.
The incident above where I got the weirdness in production is also something I can't remember.
In production, our Hikari pool is under heavy load, and when the maximum pool size seems to be reached, the metrics returned by Dropwizard are weird: only 1 pending connection, and all other metrics are set to 0. Is the problem with Dropwizard or HikariCP?
Environment
Normal case:
datasourcePRODUCTION.pool.ActiveConnections: { value: 1 }
datasourcePRODUCTION.pool.IdleConnections: { value: 10 }
datasourcePRODUCTION.pool.PendingConnections: { value: 0 }
datasourcePRODUCTION.pool.TotalConnections: { value: 10 }
Pending case:
datasourcePRODUCTION.pool.ActiveConnections: { value: 0 }
datasourcePRODUCTION.pool.IdleConnections: { value: 0 }
datasourcePRODUCTION.pool.PendingConnections: { value: 1 }
datasourcePRODUCTION.pool.TotalConnections: { value: 0 }