brettwooldridge / HikariCP

光 HikariCP・A solid, high-performance, JDBC connection pool at last.
Apache License 2.0

Very slow calling getConnection #215

Closed nateww closed 9 years ago

nateww commented 9 years ago

This is a shot in the dark. We recently (2 weeks ago) switched from the default pooling provider in Play 2.3 (BoneCP) to HikariCP.

Earlier today we had a huge spike in the application, and analysis revealed that getConnection() of HikariCP was taking ~7 seconds per invocation.

[Screenshot: 2014-12-09 at 3:05:09 PM]

Any ideas on what might cause this? Analysis of the database shows the database running normal up to this point, and then once the spike occurred, the load average (and queries) went DOWN during this spike in behavior as no connections were being made, and thus no operations were being done.

[ Due to network configuration with the firewall, we have a very short TTL for connections of 2 mins ]

nateww commented 9 years ago

Here is the breakdown just before and just after: [Screenshot: 2014-12-09 at 3:13:40 PM]

nateww commented 9 years ago

And, just before (getConnection() doesn't even show up).

[Screenshot: 2014-12-09 at 3:15:45 PM]

brettwooldridge commented 9 years ago

A few questions. What version of HikariCP? Can you share your configuration? And are you using Codahale metrics?

nateww commented 9 years ago

We're using the HikariCP Play Plugin, which uses HikariCP 2.0.1. My configuration is in Play format, but I think it will make sense.

# Partitions as defined in the Play pooling plugin aren't used,
# but are emulated by assuming a single partition/pool in the
# updated Plugin.
db.default.partitionCount=1
db.default.minConnectionsPerPartition=5
db.default.maxConnectionsPerPartition=50
# How long when trying to get a connection before we error out?
db.default.connectionTimeout=7s
# Throw away regardless of use after this time
db.default.maxConnectionAge=120s
# Ensure the newly created connection is valid before using it
db.default.initSQL="SELECT 1"

I'm not sure what Codahale metrics are, so I'm guessing I'm not using them.

brettwooldridge commented 9 years ago

I assume you are using this plugin. db.default.initSQL is not what you want; you'll want db.default.connectionTestStatement. If your TTL for connections is 2 minutes, then because HikariCP applies a variation of up to +30 seconds to maxConnectionAge retirement, you'll want to try db.default.maxConnectionAgeInSeconds=90.
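Applied to the Play-style config above, the two suggested changes would look something like this (a sketch; the property names are the ones mentioned above, and their exact spelling depends on the plugin version):

```properties
# Validation statement (instead of initSQL)
db.default.connectionTestStatement="SELECT 1"
# Stay well under the 2-minute firewall TTL, allowing for
# HikariCP's up-to-+30s retirement variation
db.default.maxConnectionAgeInSeconds=90
```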

I am not that familiar with Play, but you might see if you can update the HikariCP dependency to version 2.2.5 rather than 2.0.1.

You might also want to open an issue on the play-hikaricp plugin project to upgrade their dependency.

brettwooldridge commented 9 years ago

Also, the documentation on the project page is more accurate than the web page.

Finally, I would recommend using the db.default.hikaricp.file ("That is the preferred way to configure HikariCP because you have full access to all properties...") rather than the "ordinary Play way", as I cannot vouch 100% for their conversion of Play-style properties into HikariCP properties.

brettwooldridge commented 9 years ago

Another thing, turn on DEBUG level logging for the 'com.zaxxer.hikari' package ... it's not too noisy ... and it will log the entire pool configuration at startup so you can verify that the properties you set through the play plugin were translated properly to HikariCP properties.
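For a Play application using the default Logback setup, that might look like (a sketch; adjust to your own logging configuration):

```xml
<logger name="com.zaxxer.hikari" level="DEBUG" />
```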

nateww commented 9 years ago

Thanks Brett. Yes, I'm using the edulify plugin.

In our environment, the TTL is 3 mins, so we're using 2 mins to avoid any issues with the network killing off the connection in the middle, so this should be 'safe' (and we've verified that it's safe with BoneCP).

FWIW, I was the last person to touch both the documentation you reference above and the Play-style property configuration code, so I'm pretty confident it's doing the right thing with regard to mapping the properties. However, since I last touched them, it appears the author has not updated to a newer version of HikariCP (the latest released version still uses 2.0.1).

Re: using db.default.hikaricp.file, at this point we'd prefer to stick with the ordinary 'Play' way, as it allows us to easily switch between the two DB providers, at least until we can verify that it's not an issue with the pooling implementation.

However, I thought that connectionTestStatement wouldn't be necessary for JDBC4 connections, which is why I didn't add a line for it. Do I need connectionTestStatement? From HikariCP's homepage:

connectionTestQuery
If your driver supports JDBC4 we strongly recommend not setting this property. 

So, in summary, I'll ping the plugin author to see if they can release a new version of the plugin with an updated HikariCP dependency.

However, do you think the behavior we experienced is something that may have been fixed in versions of the pooling code since 2.0.1?

Regarding the debug output, I'm pretty confident that the settings are being applied correctly, as I debugged the code when I wrote it (before I provided the PR to the edulify folks).

Thanks!

brettwooldridge commented 9 years ago

You definitely don't need connectionTestStatement, but looking at your configuration a connectionInitSql of SELECT 1 seemed like a mistake, like you meant connectionTestStatement instead.

Are you using edulify 1.5.0 or 1.4.1? I re-wrote and contributed much of the configuration mapping in 1.5.0.

If you look at the 1.4.1 code that maps Play properties, it completely ignores connectionTimeout amongst many other options. maxConnectionAge also appears to be mishandled to me, just glancing at it.

In the 1.5.0 code that I contributed, it maps many more properties, hopefully correctly.

Still, if "120s" as a literal value for maxConnectionAge is being used, the code is not going to handle it properly and you will wind up with the default 30-minute timeout (unless Play's Configuration is doing some magic, which it could be):

private def maxLifetime(config: Configuration) = {
    var maxLife = config.getLong("maxConnectionAgeInMinutes").getOrElse(30L)
    maxLife     = config.getLong("maxConnectionAgeInSeconds").getOrElse(maxLife * 60)
    maxLife     = config.getMilliseconds("maxConnectionAge").getOrElse(maxLife * 1000)
    maxLife.toString
}

Of course, "120" without the "s" should work (in edulify 1.5.0).
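As a plain-Java illustration (a hypothetical helper, not the plugin's actual code) of why a literal "120s" cannot survive a numeric parse and silently falls back to the default:

```java
// Hypothetical sketch: a mapper that only does a plain numeric parse
// rejects "120s" and falls back to its default.
class MaxLifeParse {
    static long parseOrDefaultMs(String value, long defaultMinutes) {
        try {
            return Long.parseLong(value) * 1000L;   // "120" -> 120000 ms
        } catch (NumberFormatException e) {
            return defaultMinutes * 60_000L;        // "120s" -> 30-minute default
        }
    }
}
```

Play's Configuration.getMilliseconds() is what rescues the "120s" form, since it understands duration suffixes.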

I don't know of any bug in HikariCP that could inherently cause the observed performance issue; it seems very strange.

I'll await your response before diving deeper.

nateww commented 9 years ago

The code was extended even more (by me) prior to 1.5.0, and it should do the right thing. Among the changes I made I ensured that it logged the values it uses regardless of which method was used to configure the plugin.

Also, since Play uses HOCON format, it converts the value (120s) to 120 * 1000 in that last getMilliseconds call; otherwise it uses the defaults, which work out to 30 minutes.

I've created a PullRequest for the edulify developer to try and update the plugin to use HikariCP 2.2.5, so we'll see how that goes.

Thanks again for your quick response!

brettwooldridge commented 9 years ago

If Play converts 120s into 120 * 1000 behind the scenes, won't the edulify 1.5.0 code:

maxLife     = config.getMilliseconds("maxConnectionAge").getOrElse(maxLife * 1000)

then convert it into 120000 * 1000? When I contributed that code for 1.5.0, I wasn't expecting that any magic was going on...

UPDATE: oh nevermind, I see, config.getMilliseconds("maxConnectionAge") will return 120000 (hopefully) so the getOrElse() will never fire.

brettwooldridge commented 9 years ago

BTW, what are you using to monitor the application (graphs shown above)?

nateww commented 9 years ago

We're using NewRelic, which has a number of features that allow us to obtain information on our application from a variety of sources (including the database) to get a better overall picture of its behavior.

brettwooldridge commented 9 years ago

The first question of importance is, did your application experience SQLExceptions for timeouts?

This is some pretty deep analysis, if you want to follow along. This is the core of getConnection() in 2.0.1 (you can skim it because it is repeated below):

163   final long start = System.currentTimeMillis();
164   final MetricsContext context = (isRecordMetrics ? metricsTracker.recordConnectionRequest(start) : MetricsTracker.NO_CONTEXT);
165   long timeout = connectionTimeout;
...
168      do {
169         IHikariConnectionProxy connection = connectionBag.borrow(timeout, TimeUnit.MILLISECONDS);
170         if (connection == null) {
171            break; // We timed out... break and throw exception
172         }
173
174         final long now = System.currentTimeMillis();
175         connection.unclose(now);
176
177         if (now > connection.getExpirationTime() || (now - connection.getLastAccess() > 1000L && !isConnectionAlive(connection, timeout))) {
178            closeConnection(connection); // Throw away the dead connection and try again
179            timeout = connectionTimeout - elapsedTimeMs(start);
180            continue;
181         }
182
183         if (isIsolateInternalQueries) {
184             IHikariConnectionProxy newProxy = ProxyFactory.getProxyConnection(connection);
185             connectionBag.remove(connection);
186             connectionBag.add(newProxy);
187             connection = newProxy;
188         }
189
190         if (leakDetectionThreshold != 0) {
191            connection.captureStack(leakDetectionThreshold, houseKeepingExecutorService);
192         }
193
194         return connection;
195      }
196      while (timeout > 0L);
...
206   throw new SQLException(...)

We can ignore 164 and 183-192 completely, as they are not enabled ...

168      do {
169         IHikariConnectionProxy connection = connectionBag.borrow(timeout, TimeUnit.MILLISECONDS);
170         if (connection == null) {
171            break; // We timed out... break and throw exception
172         }
173
174         final long now = System.currentTimeMillis();
175         connection.unclose(now);
176
177         if (now > connection.getExpirationTime() || (now - connection.getLastAccess() > 1000L && !isConnectionAlive(connection, timeout))) {
178            closeConnection(connection); // Throw away the dead connection and try again
179            timeout = connectionTimeout - elapsedTimeMs(start);
180            continue;
181         }
...
194         return connection;
195      }
196      while (timeout > 0L);
...
206   throw new SQLException(...)

For the time being, I'm going to assume that connectionBag.borrow(timeout, ...) honors the timeout correctly, but we'll come back to it, as it may be the key.

If a Connection is obtained from the bag, in terms of things that can block, we get down to the first if statement (line 177). There are two calls of importance here: isConnectionAlive() and, inside the conditional, closeConnection().

isConnectionAlive(Connection connection, long timeoutMs) in your case boils down pretty much to:

434 timeoutMs = timeoutEnabled ? Math.max(1000L, timeoutMs) : 0L;
435
436 if (isJdbc4ConnectionTest) {
437     return connection.isValid((int) TimeUnit.MILLISECONDS.toSeconds(timeoutMs));
438 }

Making a lot of assumptions here, but I am assuming that the MySQL isValid() honors its timeout (?). If the Connection is valid within your 7 second timeout, isAlive() would return true back to getConnection() and this would result in the conditional being false (not entered) and therefore the Connection would then be returned immediately.
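For concreteness, the arithmetic in lines 434/437 with a 7-second budget (a pure-arithmetic sketch using the same clamp and conversion):

```java
import java.util.concurrent.TimeUnit;

// The millisecond budget is clamped to at least 1000ms (when a timeout is
// enabled), then truncated to whole seconds for the JDBC4 isValid(int) call.
class ValidTimeout {
    static int isValidSeconds(long timeoutMs) {
        long clamped = Math.max(1000L, timeoutMs);
        return (int) TimeUnit.MILLISECONDS.toSeconds(clamped);
    }
}
```

So a 7000ms budget becomes isValid(7), and anything under a second is rounded up to a 1-second validation timeout.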

If isValid() returns false, or if the Connection from the bag is past its maxLifetime, the conditional would be entered. Now we come to closeConnection().

closeConnection(IHikariConnectionProxy connectionProxy) boils down to:

279 try {
...
284   connectionProxy.realClose();
...
289 finally {
290   connectionBag.remove(connectionProxy);
291 }

Line 290 cannot block. But line 284, a real close() call against the MySQL Connection, can block indefinitely.

Let's assume that 284 does not block for any appreciable time. The connection is closed, we return back into the conditional in getConnection(), and timeout is recalculated as 7 seconds minus the time since getConnection() was entered: timeout = 7000 - elapsedTimeMs(start);. Then continue is called, and the timeout is checked for > 0. If less than 7000ms have transpired, we loop again; if the timeout went negative, the loop falls through and a SQLException is thrown.

If closing the real connection does block for some extended period, getConnection() will be "stuck" for possibly longer than your connectionTimeout, but after recalculating the timeout upon returning to getConnection() a SQLException is guaranteed to be thrown.
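The budget accounting can be sketched in isolation (hypothetical names; the real loop is quoted above):

```java
import java.util.function.LongUnaryOperator;

// Sketch of the retry loop's budget accounting: each failed attempt
// consumes wall-clock time, the remaining timeout is recomputed, and the
// loop exits once the budget goes non-positive (the pool then throws).
class TimeoutLoop {
    static int attemptsWithinBudget(long timeoutMs, LongUnaryOperator attemptCostMs) {
        long elapsed = 0L;
        long remaining = timeoutMs;
        int attempts = 0;
        while (remaining > 0L) {
            attempts++;
            elapsed += attemptCostMs.applyAsLong(attempts); // borrow + validate + close
            remaining = timeoutMs - elapsed;                // timeout = connectionTimeout - elapsedTimeMs(start)
        }
        return attempts; // at this point the real pool throws SQLException
    }
}
```

With a 7000ms budget and failed attempts costing 2000ms each, four attempts fit before the budget goes negative and the exception is thrown.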

Analysis

If your application never received SQLExceptions for timeouts during this period, I have no choice but to conclude there is probably a bug in HikariCP's ConcurrentBag.borrow().

If your application did receive SQLExceptions for timeouts during this period, I am more inclined to suspect a slow/blocking connection.close() or a connection.isValid() that does not properly honor timeouts.

If it is the latter (you did receive SQLExceptions), HikariCP 2.2.5 may provide some relief in the form of the closeConnection() call now handing off the real connection.close() to a background thread (so it does not block).
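The 2.2.5 change can be sketched as handing the potentially blocking close to an executor (a hypothetical sketch, not HikariCP's actual implementation):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Sketch: instead of closing inline (where a dead socket can stall
// getConnection() indefinitely), submit the real close() to a background
// thread so the caller's timeout accounting is unaffected.
class AsyncCloser {
    private final ExecutorService closer = Executors.newSingleThreadExecutor();

    void closeQuietly(AutoCloseable realConnection) {
        closer.submit(() -> {
            try {
                realConnection.close(); // may block, but no longer blocks the caller
            } catch (Exception ignored) {
                // a failed close on an already-broken connection is ignorable
            }
        });
    }

    void shutdown() {
        closer.shutdown();
        try {
            closer.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```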

Whether I need to dig into ConcurrentBag depends largely on whether or not your application experienced SQLExceptions for timeouts.

brettwooldridge commented 9 years ago

I will say that it is strange to me that even during the "aberrant period" getConnection() never exceeded exactly 30 seconds. This does make me a bit suspicious that the edulify plugin somehow didn't pick up your 7 second setting, and the default -- which is exactly 30000ms -- was used. Not conclusive by any means, but a bit odd nonetheless.

nateww commented 9 years ago

From the logs during the event: (I just grabbed one of the many entries)

Caused by: java.sql.SQLException: Timeout of 7000ms encountered waiting for connection.
  at com.zaxxer.hikari.pool.HikariPool.getConnection(HikariPool.java:207)
  at com.zaxxer.hikari.HikariDataSource.getConnection(HikariDataSource.java:93)

(Nice job of logging so the timeout is known!)

brettwooldridge commented 9 years ago

Not so fast on the congratulations. In 2.0.1 the log looked like this:

throw new SQLException(String.format("Timeout of %dms encountered waiting for connection.", configuration.getConnectionTimeout()), lastConnectionFailure.getAndSet(null));

Because configuration.getConnectionTimeout() was used, even a longer timeout would log as whatever timeout you had configured. But at least it confirms that the connectionTimeout was set properly.

In 2.2.5 the log looks like this:

throw new SQLException(String.format("Timeout after %dms of waiting for a connection.", elapsedTimeMs(start)), lastConnectionFailure.getAndSet(null));

So the actual time taken in the getConnection() call is logged. If at all possible, I recommend upgrading to 2.2.5, even if you have to compile the edulify plugin yourself.

brettwooldridge commented 9 years ago

In 2.2.5 you would see logs like (for a 5s timeout):

HikariDataSource Exception getting connection: Timeout after 5003ms of waiting for a connection.
HikariDataSource Exception getting connection: Timeout after 5001ms of waiting for a connection.

nateww commented 9 years ago

DANG. But we know two things: (1) a SQLException was thrown, and, as you point out, (2) the timeout is configured. :)

Hopefully I'll get a response on my 2.2.5 plugin issue...

brettwooldridge commented 9 years ago

Is it possible that there was a clock (time) change made on this machine? Just curious, because we use System.currentTimeMillis() in a few places and it is vulnerable to time changes (causing miscalculation of elapsed times, expirations, etc.). We are replacing with System.nanoTime() in the 2.3.0 release, which does not suffer from this problem.

chrisvest commented 9 years ago

We are replacing with System.nanoTime() in the 2.3.0 release, which does not suffer from this problem.

Except on OS X where they currently use the same clock source.

brettwooldridge commented 9 years ago

@chrisvest do you have a source on that? One thing I have found is that benchmark performance after the switch from currentTimeMillis() to nanoTime() is cut in half. Meaning 2x slower than prior to the change.

This is going to require some research. Comments in this thread indicate that performance can be up to 10x worse. But HikariCP really needs a monotonic clock source -- and a high-performance one -- or Bad Things™ can happen.

chrisvest commented 9 years ago

@brettwooldridge I have it from the HotSpot source code. Turns out they have a fix lined up for a future release: https://bugs.openjdk.java.net/browse/JDK-8040140 ... oh, and it's been backported to Java8 u20 and u25. Nice.

brettwooldridge commented 9 years ago

@chrisvest thanks! I'm still a bit troubled, I'm running 1.8.0_25-b17, but the times aren't looking good. I heard that u40 should be out soon.

I was considering backing out the nanoTime() change because of the performance hit, but I'm going to do some runs on Linux to see what the "hit" is.

chrisvest commented 9 years ago

@brettwooldridge It should work fairly well on Linux, but of course nothing is free: http://shipilev.net/blog/2014/nanotrusting-nanotime/

brettwooldridge commented 9 years ago

@chrisvest I think I read that article a few months back. It's certainly not that anything in HikariCP needs nanosecond accuracy, but it definitely needs monotonicity. This is going to be a head scratcher.

I'm thinking of sticking with currentTimeMillis(), with the housekeeping thread guarding against backward time-flow. Backward flow of even a few seconds is no big deal, but if the clock jumps back by more than a minute, the best thing may be to mark all connections evicted.
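That guard could be sketched as a clamp on backward wall-clock flow (hypothetical, not HikariCP's actual housekeeping code):

```java
// Hypothetical guard: remember the last wall-clock reading and clamp
// backward jumps, so elapsed-time math never goes negative even if the
// system clock is stepped back (e.g. by NTP or a manual change).
class ClampedClock {
    private long last = Long.MIN_VALUE;

    synchronized long currentTime(long rawMillis) {
        if (rawMillis < last) {
            rawMillis = last; // clock jumped backward: hold position
        }
        last = rawMillis;
        return rawMillis;
    }
}
```

Holding position only masks small regressions; a large backward jump would still need the eviction policy described above.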

chrisvest commented 9 years ago

@brettwooldridge For what it's worth, I use nanoTime in Stormpot and assume it to be correct on the platforms that matter for production usages. I haven't looked into the other platforms, but I've only heard it discussed as a problem for OS X, e.g. on the concurrency-interest list.

nateww commented 9 years ago

@brettwooldridge Our servers are all clock-synced using NTP, which should mean their clocks are all the same and shouldn't change much. That said, I suppose it's possible that NTP identified some weird time issue where the clock was so far out of sync that a large clock step was required (vs. the normal NTP approach of making time run slightly slower/faster). However, that is very unlikely in this case.

brettwooldridge commented 9 years ago

@nateww probably of bigger concern are things like shifts between standard time and daylight-saving time in the U.S. and other countries, where clocks jump back one hour on a designated day. If there happened to be a thread in getConnection() or any of several other methods that depend on System.currentTimeMillis(), the connection might not time out for 1 hour + connectionTimeout.

Another change in 2.3.0 is here: in theory, even though time has passed, it may be below the resolution of System.nanoTime(), which might yield a time delta of zero. If that ever occurs, the new code ensures small but forward progress in time. (For example, the resolution of nanoTime() on OS X appears to be several hundred nanoseconds.)
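The zero-delta case amounts to a one-line clamp (hypothetical names):

```java
// If two nanoTime() readings land within the clock's resolution, the raw
// delta is 0; clamping to at least 1 guarantees forward progress in time.
class ForwardProgress {
    static long elapsedAtLeastOne(long startNanos, long nowNanos) {
        return Math.max(1L, nowNanos - startNanos);
    }
}
```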

nateww commented 9 years ago

@brettwooldridge I thought that Java handled the ST vs. DST jump properly by not 'jumping' the time. But, I'd have to investigate to make sure.

I was pretty sure that I verified this, but that was ~15 years and quite a few jobs ago. At that time we (intentionally) standardized on using the value returned from System.currentTimeMillis() for all timekeeping, to avoid having to worry about timezones and changing between daylight and standard time. Note, the time can jump when the actual clock is stepped, and can lose time with leap seconds and such, but I thought that Java handled timezone and DST swaps correctly. But don't bet the bank on it, as I haven't checked in a very, very long time.

dfex55 commented 9 years ago

@brettwooldridge Maybe we hit this or a similar bug tonight:

java.sql.SQLException: Timeout after 2000ms of waiting for a connection.
        at com.zaxxer.hikari.pool.BaseHikariPool.getConnection(BaseHikariPool.java:202) ~[HikariCP-java6-2.3.0-20141127.142859-2.jar:na]
        at com.zaxxer.hikari.HikariDataSource.getConnection(HikariDataSource.java:91) ~[HikariCP-java6-2.3.0-20141127.142859-2.jar:na]

The server on which the error happens has an active NTP daemon, and the problem did not resolve by itself. The pool had to be shut down manually (app instance restart in this case). A connection to the same database, from the same server, but another app instance worked flawlessly. The app instance which had the error also had many other pools to other databases that worked without problems. It seems that just one pool was affected.

We are running Hikari as a datasource provider for eclipselink.

Our configuration (simplified code snippet):

    HikariConfig dsConfig = new HikariConfig();
    dsConfig.setPoolName(serviceName + " [" + databaseName + "]");

    dsConfig.setDataSourceClassName("org.postgresql.ds.PGSimpleDataSource");
    dsConfig.addDataSourceProperty("serverName", host);
    dsConfig.addDataSourceProperty("portNumber", port);
    dsConfig.addDataSourceProperty("databaseName", databaseName);
    dsConfig.addDataSourceProperty("user", user);
    dsConfig.addDataSourceProperty("password", password);
    dsConfig.addDataSourceProperty("tcpKeepAlive", true);
    dsConfig.addDataSourceProperty("loginTimeout", 10);
    dsConfig.addDataSourceProperty("socketTimeout", 10);

    dsConfig.setUsername(user);
    dsConfig.setPassword(password);

    dsConfig.setMaximumPoolSize(30);
    dsConfig.setMinimumIdle(0);
    dsConfig.setIdleTimeout(TimeUnit.MINUTES.toMillis(10));
    dsConfig.setConnectionTimeout(TimeUnit.SECONDS.toMillis(2));

    dsConfig.setThreadFactory(new ThreadFactory() {

      @Override
      public Thread newThread(final Runnable r) {
        return new Thread(new Runnable() {

          @Override
          public void run() {
            // some app stuff here
            r.run();
          }
        });
      }
    });

    if (isDebugEnabled()) {
      dsConfig.setLeakDetectionThreshold(TimeUnit.SECONDS.toMillis(10));
      dsConfig.setRegisterMbeans(true);
    }

OFFTOPIC: Since we switched to Hikari we are experiencing some problems with our Java VM (64-bit "1.7.0_65", OpenJDK). The VM itself leaks system memory over time, while heap and permgen space is healthy. This leads to slow systems when the memory footprint of all instances outgrows system RAM. Do you have any suggestions or tips on whether we should investigate Hikari or simply blame the VM? One app instance can have thousands of Hikari pools. Min idle is always 0, so the pool is kept, but with zero connections, until the next request. Typically a few hundred pools are active/filled with connections.

mihai-chezan commented 9 years ago

@dfex55 Related to the OFFTOPIC: I'm assuming that when you say system memory you are referring to direct memory usage. Question: do you start the jvm with -XX:+DisableExplicitGC? What is the request rate of getConnection? Can it happen that for some pools getConnection is called less than every 2 or 5 seconds?

brettwooldridge commented 9 years ago

Couple of things. We have left this bug open because it is troubling. A timeout is one thing, especially with a short connectionTimeout of 2 seconds and a minimumIdle of 0 (which requires on-demand creation), but non-recovery is something else.

Can you enable DEBUG logging? HikariCP is not "noisy" ... typically only logging pool stats once every 30 seconds. But if there are unexpected exceptions, it may log them. Then again, if you are running thousands of pools... well, let's just say it was not a design-time consideration, but we still wouldn't expect this symptom.

Are you sure it is a VM leak? Because HikariCP uses ThreadLocals, which the VM implements "off-heap", I wonder if there is a leak there. It surely seems that would have been discovered/reported elsewhere. I am certainly not one to pawn off responsibility, so I definitely won't point the finger at the VM. What VM options are you running with?

brettwooldridge commented 9 years ago

I've just published HikariCP 2.3.0-rc3 to the sonatype staging repo, if you're interested in testing it out.

BTW, what hardware/OS are you running on? The more details you can provide about your environment (database, driver version, etc.) the more it will help to correlate if someone else reports this issue.

nateww commented 9 years ago

@brettwooldridge We just upgraded to the latest Play plugin late last week, which now uses HikariCP 2.2.5. At the moment, we'd like to let it run for some time to see if the issue still exists with the newer plugin; plus, the debugging output should be better.

In addition, we were able to reproduce similar behavior in our application by reducing the pool size to a very low number of sessions and requesting more sessions than available (maxConnections), but the logging output was slightly different, so we're not sure if it's the same bug.

Because of the above, we bumped up the # of connections as well, so we have two changes in place: higher max connections, and an upgrade from 2.0.1 -> 2.2.5 of HikariCP.

brettwooldridge commented 9 years ago

Ok. I think 2.2.5 in various ways is more robust than 2.0.1. Fingers crossed.

dfex55 commented 9 years ago

first: I created a new bug (https://github.com/brettwooldridge/HikariCP/issues/220) for what we are experiencing: from time to time we have errors on our connection/query/socket or whatever. With the latest Hikari those broken connections remain in the pool, blocking one "connection slot". If max pool size is reached, getConnection will not return a new connection, but a timeout exception. That is different from the bug described here, as it will not happen with the mentioned version of Hikari.

offtopic: Thank you for your suggestions. To avoid spamming this bug, I will dig deeper into the problem and (perhaps) return to hikari .. just for the record:

brettwooldridge commented 9 years ago

@nateww HikariCP 2.3.0-rc4 has been published to the sonatype staging repository. Any chance you can give it a try? This bug is pretty much the gate on the release now.

nateww commented 9 years ago

@brettwooldridge I won't be able to do it this week (traveling), but I'll see what I can do for next week.

brettwooldridge commented 9 years ago

@nateww any update on this?

brettwooldridge commented 9 years ago

We're going to go ahead and release HikariCP 2.3.0. If this issue persists, it will be addressed in a future release. Leaving this bug open for now.

brettwooldridge commented 9 years ago

@nateww If there is no further progress on this, we're going to go ahead and close this issue as resolved.

nateww commented 9 years ago

Sorry for not responding to this. We haven't seen any issues since upgrading to the Play plugin that uses 2.2.5, so I consider it closed.

Thanks!