Closed nateww closed 9 years ago
Here is the breakdown just before and just after:
And, just before (getConnection() doesn't even show up).
A few questions. What version of HikariCP? Can you share your configuration? And are you using Codahale metrics?
We're using the HikariCP Play Plugin, which uses HikariCP 2.0.1. My configuration is in Play format, but I think it will make sense.
# Partitions as defined in the Play pooling plugin aren't used,
# but are emulated by assuming a single partition/pool in the
# updated Plugin.
db.default.partitionCount=1
db.default.minConnectionsPerPartition=5
db.default.maxConnectionsPerPartition=50
# How long when trying to get a connection before we error out?
db.default.connectionTimeout=7s
# Throw away regardless of use after this time
db.default.maxConnectionAge=120s
# Ensure the newly created connection is valid before using it
db.default.initSQL="SELECT 1"
I'm not sure what Codahale metrics are, so I'm guessing I'm not using them.
I assume you are using this plugin. db.default.initSQL is not what you want; you'll want db.default.connectionTestStatement. If your TTL for connections is 2 minutes, then because HikariCP applies a retirement variation of up to +30 seconds on top of maxConnectionAge, you'll want to try db.default.maxConnectionAgeInSeconds=90.
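Put together, the suggested adjustments to the configuration above might look like this (a sketch only; property names are the ones mentioned in this thread, not verified against any particular plugin version):

```properties
db.default.partitionCount=1
db.default.minConnectionsPerPartition=5
db.default.maxConnectionsPerPartition=50
# How long when trying to get a connection before we error out?
db.default.connectionTimeout=7s
# 90s leaves headroom for HikariCP's up-to-+30s retirement
# variation under a 2-minute network TTL
db.default.maxConnectionAgeInSeconds=90
# Only needed for pre-JDBC4 drivers; omit if your driver supports isValid()
# db.default.connectionTestStatement="SELECT 1"
```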
I am not that familiar with Play, but you might see if you can update the HikariCP dependency to version 2.2.5 rather than 2.0.1.
You might also want to open an issue over at the play-hikaricp plugin project to upgrade their dependency.
Also, the documentation on the project page is more accurate than the web page.
Finally, I would recommend using db.default.hikaricp.file ("That is the preferred way to configure HikariCP because you have full access to all properties...") rather than the "ordinary Play way", as I cannot vouch 100% for their conversion of Play-style properties into HikariCP properties.
Another thing, turn on DEBUG level logging for the 'com.zaxxer.hikari' package ... it's not too noisy ... and it will log the entire pool configuration at startup so you can verify that the properties you set through the play plugin were translated properly to HikariCP properties.
Thanks Brett. Yes, I'm using the edulify plugin.
In our environment, the TTL is 3 mins, so we're using 2 mins to avoid any issues with the network killing off the connection in the middle, so this should be 'safe' (and we've verified that it's safe with BoneCP).
FWIW, I was the last person to touch both the documentation you reference above as well as the Play-style property configuration code, so I'm pretty confident it's doing the right thing with regard to mapping the properties. However, since I last touched both, it appears that the author has not updated to a newer version of HikariCP (the latest released version is using 2.0.1).
Re: using db.default.hikaricp.file: at this point, we'd prefer to stick with the ordinary 'Play' way, as it allows us to easily switch between the two DB providers, at least until we can verify that it's not an issue with the pooling implementation.
However, I thought that the connectionTestStatement wouldn't be necessary for JDBC4 connections, which is why I didn't add a line for it. Do I need the connectionTestStatement? From HikariCP's homepage:
connectionTestQuery: If your driver supports JDBC4 we strongly recommend not setting this property.
So, in summary, I'll ping the plugin author to see if they can release a new version of the plugin with an updated HikariCP dependency.
However, do you think the behavior we experienced is something that may have been fixed in versions of the pooling code since 2.0.1?
Regarding the debug output, I'm pretty confident that the settings are being applied correctly, as I debugged the code when I wrote it (before I provided the PR to the edulify folks).
Thanks!
You definitely don't need connectionTestStatement, but looking at your configuration, a connectionInitSql of SELECT 1 seemed like a mistake, as if you meant connectionTestStatement instead.
Are you using edulify 1.5.0 or 1.4.1? I re-wrote and contributed much of the configuration mapping in 1.5.0.
If you look at the 1.4.1 code that maps Play properties, it completely ignores connectionTimeout amongst many other options. maxConnectionAge also appears to me to be mishandled, just glancing at it.
In the 1.5.0 code that I contributed, it maps many more properties, hopefully correctly.
Still, if "120s" as a literal value for maxConnectionAge is being used, the code is not going to handle it properly and you will wind up with the default 30-minute timeout (unless Play's Configuration is doing some magic, which it could be):
private def maxLifetime(config: Configuration) = {
  var maxLife = config.getLong("maxConnectionAgeInMinutes").getOrElse(30L)
  maxLife = config.getLong("maxConnectionAgeInSeconds").getOrElse(maxLife * 60)
  maxLife = config.getMilliseconds("maxConnectionAge").getOrElse(maxLife * 1000)
  maxLife.toString
}
Of course, "120" without the "s" should work (in edulify 1.5.0).
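The fallback chain in that Scala snippet can be sketched in plain Java to make the precedence explicit (a hypothetical `resolveMaxLifetime` helper; the nullable `Long` arguments stand in for the plugin's `Configuration` lookups, with `null` meaning the key is absent):

```java
public class MaxLifetimeMapper {
    /**
     * Mirrors the Scala fallback chain: minutes default to 30,
     * seconds (if present) override minutes, and an explicit
     * milliseconds value overrides both.
     * Returns the lifetime in milliseconds.
     */
    public static long resolveMaxLifetime(Long minutes, Long seconds, Long millis) {
        long maxLife = (minutes != null) ? minutes : 30L;       // maxConnectionAgeInMinutes
        maxLife = (seconds != null) ? seconds : maxLife * 60;   // maxConnectionAgeInSeconds
        maxLife = (millis != null) ? millis : maxLife * 1000;   // maxConnectionAge (ms)
        return maxLife;
    }

    public static void main(String[] args) {
        // All keys absent: falls back to 30 minutes.
        System.out.println(resolveMaxLifetime(null, null, null));     // 1800000
        // "120s", resolved by Play to 120000 ms, wins outright.
        System.out.println(resolveMaxLifetime(null, null, 120_000L)); // 120000
    }
}
```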
I don't know of any bug in HikariCP that could inherently cause the observed performance issue; it seems very strange.
I'll await your response before diving deeper.
The code was extended even more (by me) prior to 1.5.0, and it should do the right thing. Among the changes I made I ensured that it logged the values it uses regardless of which method was used to configure the plugin.
Also, since Play uses HOCON format, it converts the value (120s) to 120 * 1000 milliseconds, which the last getter picks up; otherwise it uses the defaults, which work out to be 30 minutes.
I've created a PullRequest for the edulify developer to try and update the plugin to use HikariCP 2.2.5, so we'll see how that goes.
Thanks again for your quick response!
If Play converts 120s into 120 * 1000 behind the scenes, won't the edulify 1.5.0 code:
maxLife = config.getMilliseconds("maxConnectionAge").getOrElse(maxLife * 1000)
then convert it into 120000 * 1000? When I contributed that code for 1.5.0, I wasn't expecting that any magic was going on...
UPDATE: oh, never mind, I see: config.getMilliseconds("maxConnectionAge") will return 120000 (hopefully), so the getOrElse() will never fire.
BTW, what are you using to monitor the application (graphs shown above)?
We're using NewRelic, which has a number of features that allows us to obtain information on our application from a variety of sources (including the database) to get a better overall picture of the behavior of the application.
The first question of importance is, did your application experience SQLExceptions for timeouts?
This is some pretty deep analysis, if you want to follow along. This is the core of getConnection() in 2.0.1 (you can skim it because it is repeated below):
163 final long start = System.currentTimeMillis();
164 final MetricsContext context = (isRecordMetrics ? metricsTracker.recordConnectionRequest(start) : MetricsTracker.NO_CONTEXT);
165 long timeout = connectionTimeout;
...
168 do {
169 IHikariConnectionProxy connection = connectionBag.borrow(timeout, TimeUnit.MILLISECONDS);
170 if (connection == null) {
171 break; // We timed out... break and throw exception
172 }
173
174 final long now = System.currentTimeMillis();
175 connection.unclose(now);
176
177 if (now > connection.getExpirationTime() || (now - connection.getLastAccess() > 1000L && !isConnectionAlive(connection, timeout))) {
178 closeConnection(connection); // Throw away the dead connection and try again
179 timeout = connectionTimeout - elapsedTimeMs(start);
180 continue;
181 }
182
183 if (isIsolateInternalQueries) {
184 IHikariConnectionProxy newProxy = ProxyFactory.getProxyConnection(connection);
185 connectionBag.remove(connection);
186 connectionBag.add(newProxy);
187 connection = newProxy;
188 }
189
190 if (leakDetectionThreshold != 0) {
191 connection.captureStack(leakDetectionThreshold, houseKeepingExecutorService);
192 }
193
194 return connection;
195 }
196 while (timeout > 0L);
...
206 throw new SQLException(...)
We can ignore lines 164 and 183-192 completely, as those features are not enabled, which leaves:
168 do {
169 IHikariConnectionProxy connection = connectionBag.borrow(timeout, TimeUnit.MILLISECONDS);
170 if (connection == null) {
171 break; // We timed out... break and throw exception
172 }
173
174 final long now = System.currentTimeMillis();
175 connection.unclose(now);
176
177 if (now > connection.getExpirationTime() || (now - connection.getLastAccess() > 1000L && !isConnectionAlive(connection, timeout))) {
178 closeConnection(connection); // Throw away the dead connection and try again
179 timeout = connectionTimeout - elapsedTimeMs(start);
180 continue;
181 }
...
194 return connection;
195 }
196 while (timeout > 0L);
...
206 throw new SQLException(...)
For the time being, I'm going to assume that connectionBag.borrow(timeout, ...) honors the timeout correctly, but we'll come back to it, as it may be the key.
If a Connection is obtained from the bag, in terms of things that can block, we get down to the first if statement (line 177). There are two calls of importance here: isConnectionAlive() and, inside of the conditional, closeConnection().
isConnectionAlive(Connection connection, long timeoutMs) in your case boils down pretty much to:
434 timeoutMs = timeoutEnabled ? Math.max(1000L, timeoutMs) : 0L;
435
436 if (isJdbc4ConnectionTest) {
437 return connection.isValid((int) TimeUnit.MILLISECONDS.toSeconds(timeoutMs));
438 }
Making a lot of assumptions here, but I am assuming that the MySQL isValid() honors its timeout (?). If the Connection is valid within your 7-second timeout, isConnectionAlive() would return true back to getConnection(), the conditional would be false (not entered), and the Connection would then be returned immediately.
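One subtlety in lines 434-437 worth spelling out (my reading, not confirmed against the driver): isValid() takes whole seconds, so the millisecond timeout is floored by the conversion, and the Math.max(1000L, ...) clamp is what prevents a sub-second remaining budget from becoming isValid(0), which drivers treat as "no timeout". A sketch:

```java
import java.util.concurrent.TimeUnit;

public class ValidTimeoutDemo {
    /** Mirrors lines 434-437: clamp to at least 1s, then floor to whole seconds. */
    public static int secondsForIsValid(long timeoutMs) {
        timeoutMs = Math.max(1000L, timeoutMs);
        return (int) TimeUnit.MILLISECONDS.toSeconds(timeoutMs);
    }

    public static void main(String[] args) {
        System.out.println(secondsForIsValid(7000L)); // 7  -- the configured timeout
        System.out.println(secondsForIsValid(500L));  // 1  -- clamped, not floored to 0
    }
}
```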
If isValid() returns false, or if the Connection from the bag is past its maxLifetime, the conditional would be entered. Now we come to closeConnection().
closeConnection(IHikariConnectionProxy connectionProxy) boils down to:
279 try {
...
284 connectionProxy.realClose();
...
289 finally {
290 connectionBag.remove(connectionProxy);
291 }
Line 290 cannot block. But line 284, a real close call against the MySQL Connection can block indefinitely.
Let's assume that line 284 does not block for any appreciable time. The connection is closed, we return back into the conditional in getConnection(), and timeout is recalculated as 7 seconds minus the time since getConnection() was entered: timeout = 7000 - elapsedTimeMs(start). Then continue is called, and the timeout is checked for > 0. If less than 7000ms have transpired, we loop again; if the timeout went negative, the loop falls through and a SQLException is thrown.
If closing the real connection does block for some extended period, getConnection() will be "stuck" for possibly longer than your connectionTimeout, but after recalculating the timeout upon returning to getConnection(), a SQLException is guaranteed to be thrown.
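The overall shape of that loop, retrying with whatever budget remains, can be sketched independently of HikariCP (the hypothetical tryBorrow argument stands in for connectionBag.borrow(); names are mine, not HikariCP's):

```java
import java.util.function.LongFunction;

public class TimeoutLoopSketch {
    /**
     * Retries an operation, passing it the remaining time budget each
     * iteration. Returns the result, or null when the budget is
     * exhausted. The key property: even a slow iteration can only
     * overshoot once, because the recalculated timeout then goes
     * negative and the do-while falls through.
     */
    public static <T> T borrowWithBudget(long timeoutMs, LongFunction<T> tryBorrow) {
        final long start = System.currentTimeMillis();
        long timeout = timeoutMs;
        do {
            T result = tryBorrow.apply(timeout);
            if (result != null) {
                return result;
            }
            timeout = timeoutMs - (System.currentTimeMillis() - start);
        } while (timeout > 0L);
        return null; // this is where getConnection() throws its SQLException
    }

    public static void main(String[] args) {
        // A borrow that always fails: the loop must still terminate.
        System.out.println(borrowWithBudget(50L, remaining -> null));  // null
        // A borrow that succeeds immediately.
        System.out.println(borrowWithBudget(1000L, remaining -> "ok")); // ok
    }
}
```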
If your application never received SQLExceptions for timeouts during this period, I have no choice but to conclude there is probably a bug in HikariCP's ConcurrentBag.borrow().
If your application did receive SQLExceptions for timeouts during this period, I am more inclined to suspect a slow/blocking connection.close() or a connection.isValid() that does not properly honor timeouts.
If it is the latter (you did receive SQLExceptions), HikariCP 2.2.5 may provide some relief in the form of the closeConnection() call now handing off the real connection.close() to a background thread (so it does not block).
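That mitigation, handing a potentially blocking close() to a dedicated thread, looks roughly like this (a sketch of the technique, not HikariCP's actual code; AutoCloseable is used here so the idea stands alone, where the real pool closes java.sql.Connection):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class AsyncCloser {
    private final ExecutorService closer = Executors.newSingleThreadExecutor(r -> {
        Thread t = new Thread(r, "connection closer");
        t.setDaemon(true); // never keep the JVM alive just to close connections
        return t;
    });

    /** Returns immediately; a stuck close() can no longer stall getConnection(). */
    public void quietlyClose(AutoCloseable connection) {
        closer.execute(() -> {
            try {
                connection.close();
            } catch (Exception e) {
                // nothing useful to do for a connection we are discarding
            }
        });
    }
}
```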
Whether I need to dig into ConcurrentBag depends largely on whether or not your application experienced SQLExceptions for timeouts.
I will say that it is strange to me that even during the "aberrant period" getConnection() never exceeded exactly 30 seconds. This does make me a bit suspicious that the edulify plugin somehow didn't pick up your 7-second setting, and the default, which is exactly 30000ms, was used. Not conclusive by any means, but a bit odd nonetheless.
From the logs during the event: (I just grabbed one of the many entries)
Caused by: java.sql.SQLException: Timeout of 7000ms encountered waiting for connection.
at com.zaxxer.hikari.pool.HikariPool.getConnection(HikariPool.java:207)
at com.zaxxer.hikari.HikariDataSource.getConnection(HikariDataSource.java:93)
(Nice job of logging so the timeout is known!)
Not so fast on the congratulations. In 2.0.1 the log looked like this:
throw new SQLException(String.format("Timeout of %dms encountered waiting for connection.", configuration.getConnectionTimeout()), lastConnectionFailure.getAndSet(null));
Because configuration.getConnectionTimeout() was used, even a longer timeout would log as whatever timeout you had configured. But at least it confirms that the connectionTimeout was set properly.
In 2.2.5 the log looks like this:
throw new SQLException(String.format("Timeout after %dms of waiting for a connection.", elapsedTimeMs(start)), lastConnectionFailure.getAndSet(null));
So the actual time taken in the getConnection() call is logged. If at all possible, I recommend upgrading to 2.2.5, even if you have to compile the edulify plugin yourself.
In 2.2.5 you would see logs like (for a 5s timeout):
HikariDataSource Exception getting connection: Timeout after 5003ms of waiting for a connection.
HikariDataSource Exception getting connection: Timeout after 5001ms of waiting for a connection.
DANG. But we know two things: (1) a SQLException was thrown, and as you point out, (2) the timeout is configured. :)
Hopefully I'll get a response on my 2.2.5 plugin issue...
Is it possible that there was a clock (time) change made on this machine? Just curious, because we use System.currentTimeMillis() in a few places and it is vulnerable to time changes (causing miscalculation of elapsed times, expirations, etc.). We are replacing with System.nanoTime() in the 2.3.0 release, which does not suffer from this problem.
Except on OS X where they currently use the same clock source.
@chrisvest do you have a source on that? One thing I have found is that benchmark performance after the switch from currentTimeMillis() to nanoTime() is cut in half, meaning 2x slower than prior to the change.
This is going to require some research. Comments in this thread indicate that performance can be up to 10x worse. But HikariCP really needs a monotonic clock source, and a high-performance one, or Bad Things™ can happen.
@brettwooldridge I have it from the HotSpot source code. Turns out they have a fix lined up for a future release: https://bugs.openjdk.java.net/browse/JDK-8040140 ... oh, and it's been backported to Java8 u20 and u25. Nice.
@chrisvest thanks! I'm still a bit troubled, I'm running 1.8.0_25-b17, but the times aren't looking good. I heard that u40 should be out soon.
I was considering backing out the nanoTime() change because of the performance hit, but I'm going to do some runs on Linux to see what the "hit" is.
@brettwooldridge It should work fairly well on Linux, but of course nothing is free: http://shipilev.net/blog/2014/nanotrusting-nanotime/
@chrisvest I think I read that article a few months back. It's certainly not that anything in HikariCP needs nanosecond accuracy, but it definitely needs monotonicity. This is going to be a head scratcher.
I'm thinking of sticking with currentTimeMillis(), with the housekeeping thread guarding against backward time-flow. Backward flow of even a few seconds is no big deal, but if the clock jumps back by more than a minute, the best thing may be to mark all connections evicted.
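A minimal sketch of that housekeeping guard (my illustration of the idea, not HikariCP code; the class name and one-minute threshold are assumptions taken from the sentence above):

```java
public class ClockJumpGuard {
    /** Backward jumps larger than this invalidate expiration bookkeeping. */
    public static final long TOLERANCE_MS = 60_000L;

    private long previousNow;

    public ClockJumpGuard(long startMs) {
        this.previousNow = startMs;
    }

    /**
     * Called from the periodic housekeeping task with the current
     * wall-clock time. Returns true when the clock has flowed backward
     * by more than the tolerance, in which case the safest response is
     * to evict all pooled connections.
     */
    public boolean shouldEvictAll(long nowMs) {
        boolean jumped = nowMs < previousNow - TOLERANCE_MS;
        previousNow = nowMs;
        return jumped;
    }
}
```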
@brettwooldridge For what it's worth, I use nanoTime in Stormpot and assume it to be correct on the platforms that matter for production usages. I haven't looked into the other platforms, but I've only heard it discussed as a problem for OS X, e.g. on the concurrency-interest list.
@brettwooldridge Our servers are all clock-synced using NTP, which should mean their clocks are all the same and shouldn't change much. That being said, I suppose it's possible that NTP identified some weird time issue that required a large clock step (vs. the normal NTP solution of making time run slightly slower/faster). However, that is very unlikely in this case.
@nateww Probably of bigger concern are things like shifts between standard time and daylight-saving time in the U.S. and other countries, where clocks jump back one hour on a designated day. If there happened to be a thread in getConnection() or any of several other methods that depend on System.currentTimeMillis(), the connection might not time out for 1 hour + connectionTimeout.
Another change in 2.3.0 is here where, in theory, even though time has passed, the delta may be below the resolution of System.nanoTime() and come back as zero. If that ever occurs, the new code ensures minor but forward progress in time. (For example, the resolution of nanoTime() on OS X appears to be several hundred nanoseconds.)
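That forward-progress guard amounts to a one-line clamp (again my sketch of the described change, not the actual 2.3.0 code):

```java
public class ElapsedSketch {
    /**
     * Elapsed time with guaranteed forward progress: even if two
     * nanoTime() reads land inside the clock's resolution and compute
     * a delta of zero, report at least one unit so that timeout
     * arithmetic downstream cannot stall at "no time has passed".
     */
    public static long elapsedWithProgress(long startNanos, long nowNanos) {
        return Math.max(1L, nowNanos - startNanos);
    }
}
```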
@brettwooldridge I thought that Java handled the ST vs. DST jump properly by not 'jumping' the time. But, I'd have to investigate to make sure.
I was pretty sure that I verified this, but that was ~15 years and quite a few jobs ago. At that time we (intentionally) standardized on using the value returned from System.currentTimeMillis() for all timekeeping to avoid having to worry about timezones and changing between daylight and standard time. Note, the time can jump when the actual clock is changed (stepped), and can lose time with leap seconds and such, but I thought that Java handled timezone and DST swaps correctly. But don't bet the bank on it, as I haven't checked in a very, very long time.
@brettwooldridge Maybe we hit this or a similar bug tonight:
java.sql.SQLException: Timeout after 2000ms of waiting for a connection.
at com.zaxxer.hikari.pool.BaseHikariPool.getConnection(BaseHikariPool.java:202) ~[HikariCP-java6-2.3.0-20141127.142859-2.jar:na]
at com.zaxxer.hikari.HikariDataSource.getConnection(HikariDataSource.java:91) ~[HikariCP-java6-2.3.0-20141127.142859-2.jar:na]
The server on which the error happens has an active ntp daemon, and the problem did not resolve by itself. The pool had to be shut down manually (app instance restart in this case). A connection to the same database, from the same server, but another app instance worked flawlessly. The app instance which had the error also had many other pools to other databases that worked without problems. It seems that just one pool was affected.
We are running Hikari as a datasource provider for eclipselink.
Our configuration (simplified code snippet):
HikariConfig dsConfig = new HikariConfig();
dsConfig.setPoolName(serviceName + " [" + databaseName + "]");
dsConfig.setDataSourceClassName("org.postgresql.ds.PGSimpleDataSource");
dsConfig.addDataSourceProperty("serverName", host);
dsConfig.addDataSourceProperty("portNumber", port);
dsConfig.addDataSourceProperty("databaseName", databaseName);
dsConfig.addDataSourceProperty("user", user);
dsConfig.addDataSourceProperty("password", password);
dsConfig.addDataSourceProperty("tcpKeepAlive", true);
dsConfig.addDataSourceProperty("loginTimeout", 10);
dsConfig.addDataSourceProperty("socketTimeout", 10);
dsConfig.setUsername(user);
dsConfig.setPassword(password);
dsConfig.setMaximumPoolSize(30);
dsConfig.setMinimumIdle(0);
dsConfig.setIdleTimeout(TimeUnit.MINUTES.toMillis(10));
dsConfig.setConnectionTimeout(TimeUnit.SECONDS.toMillis(2));
dsConfig.setThreadFactory(new ThreadFactory() {
    @Override
    public Thread newThread(final Runnable r) {
        return new Thread(new Runnable() {
            @Override
            public void run() {
                // some app stuff here
                r.run();
            }
        });
    }
});
if (isDebugEnabled()) {
    dsConfig.setLeakDetectionThreshold(TimeUnit.SECONDS.toMillis(10));
    dsConfig.setRegisterMbeans(true);
}
OFFTOPIC: Since we switched to Hikari we are experiencing some problems with our Java VM (64-bit "1.7.0_65", OpenJDK). The VM itself leaks system memory over time, while heap and permgen space are healthy. This leads to slow systems when the memory footprint of all instances outgrows system RAM. Do you have any suggestions or tips on whether we should investigate Hikari or simply blame the VM? One app instance can have thousands of Hikari pools. Min idle is always 0, so the pool is kept, but with zero connections, until the next request. Typically a few hundred pools are active/filled with connections.
@dfex55 Related to the OFFTOPIC: I'm assuming that when you say system memory you are referring to direct memory usage. Question: do you start the jvm with -XX:+DisableExplicitGC? What is the request rate of getConnection? Can it happen that for some pools getConnection is called less than every 2 or 5 seconds?
Couple of things. We have left this bug open because it is troubling. A timeout is one thing, especially with a short connectionTimeout of 2 seconds and a minimumIdle of 0 (which requires on-demand creation), but non-recovery is something else.
Can you enable DEBUG logging? HikariCP is not "noisy"... typically only logging pool stats once every 30 seconds. But if there are unexpected exceptions, it may log them. Then again, if you are running thousands of pools... well, let's just say it was not a design-time consideration, but we still wouldn't expect this symptom.
Are you sure it is a VM leak? Because HikariCP uses ThreadLocals, which the VM implements "off-heap", I wonder if there is a leak there. It surely seems that would have been discovered/reported elsewhere. I am certainly not one to pawn off responsibility, so I definitely won't point the finger at the VM. What VM options are you running with?
I've just published HikariCP 2.3.0-rc3 to the sonatype staging repo, if you're interested in testing it out.
BTW, what hardware/OS are you running on? The more details you can provide about your environment (database, driver version, etc.) the more it will help to correlate if someone else reports this issue.
@brettwooldridge We just upgraded to the latest Play plugin late last week, which now uses HikariCP 2.2.5. At the moment, we'd like to let it run for some time to see if the issue still exists in the newer plugin; plus, the debugging output should be better.
In addition, we were able to reproduce similar behavior in our application by reducing the pool size to a very low number of sessions and requesting more sessions than available (maxConnections), but the logging output was slightly different, so we're not sure if it's the same bug.
Because of the above, we bumped up the # of connections as well, so we have two changes in place, higher max connections, and an upgrade from 2.0.1 -> 2.2.5 of HikariCP.
Ok. I think 2.2.5 in various ways is more robust than 2.0.1. Fingers crossed.
First: I created a new bug (https://github.com/brettwooldridge/HikariCP/issues/220) for what we are experiencing: from time to time we have errors on our connection/query/socket or whatever. With the latest Hikari those broken connections remain in the pool, blocking one "connection slot". If max pool size is reached, getConnection will not return a new connection, but a timeout exception. That is different from the described bug here, as it will not happen with the mentioned version of Hikari.
Offtopic: Thank you for your suggestions. To avoid spamming this bug, I will dig deeper into the problem and (perhaps) return to Hikari. Just for the record:
@nateww HikariCP 2.3.0-rc4 has been published to the sonatype staging repository. Any chance you can give it a try? This bug is pretty much the gate on the release now.
@brettwooldridge I won't be able to do it this week (traveling), but I'll see what I can do for next week.
@nateww any update on this?
We're going to go ahead and release HikariCP 2.3.0. If this issue persists, it will be addressed in a future release. Leaving this bug open for now.
@nateww If there is no further progress on this, we're going to go ahead and close this issue as resolved.
Sorry for not responding to this. We haven't seen any issues since upgrading to the Play plugin that uses 2.2.5, so I consider it closed.
Thanks!
This is a shot in the dark. We recently (2 weeks ago) switched from the default pooling provider in Play 2.3 (BoneCP) to HikariCP.
Earlier today we had a huge spike in the application, and analysis revealed that getConnection() of HikariCP was taking ~7 seconds per invocation.
Any ideas on what might cause this? Analysis of the database shows the database running normal up to this point, and then once the spike occurred, the load average (and queries) went DOWN during this spike in behavior as no connections were being made, and thus no operations were being done.
[ Due to network configuration with the firewall, we have a very short TTL for connections of 2 mins ]