brettwooldridge / HikariCP

光 HikariCP・A solid, high-performance, JDBC connection pool at last.
Apache License 2.0

java.lang.InterruptedException thrown by waitUntilSequenceExceeded #703

Closed noahlz closed 8 years ago

noahlz commented 8 years ago

Last night we had a network outage in our datacenter that caused some databases to be temporarily unreachable (no route to host). After the issue was resolved, we saw a flurry of the following errors:

Caused by: java.sql.SQLException: qai - Interrupted during connection acquisition
    at com.zaxxer.hikari.pool.HikariPool.getConnection(HikariPool.java:182)
    at com.zaxxer.hikari.pool.HikariPool.getConnection(HikariPool.java:147)
    at com.novus.jdbc.hikari.HikariQueryExecutor.connection(HikariQueryExecutor.scala:8)
    at com.novus.jdbc.QueryExecutor$class.com$novus$jdbc$QueryExecutor$$executeQuery(QueryExecutor.scala:88)
    at com.novus.jdbc.QueryExecutor$class.execute(QueryExecutor.scala:59)
    at com.novus.jdbc.hikari.HikariQueryExecutor.execute(HikariQueryExecutor.scala:7)
    at com.novus.jdbc.QueryExecutor$class.selectOne(QueryExecutor.scala:120)
    at com.novus.jdbc.hikari.HikariQueryExecutor.selectOne(HikariQueryExecutor.scala:7)
    at com.novus.analytics.core.util.RetryingStatementExecutor$$anonfun$selectOne$2.apply(RetryingStatementExecutor.scala:17)
    at com.novus.analytics.core.util.RetryingStatementExecutor$$anonfun$selectOne$2.apply(RetryingStatementExecutor.scala:17)
    at com.novus.analytics.core.util.RetryBehavior$.iWrap$1(RetryBehavior.scala:192)
    ... 233 more
Caused by: java.lang.InterruptedException
    at java.util.concurrent.locks.AbstractQueuedLongSynchronizer.doAcquireSharedNanos(AbstractQueuedLongSynchronizer.java:815)
    at java.util.concurrent.locks.AbstractQueuedLongSynchronizer.tryAcquireSharedNanos(AbstractQueuedLongSynchronizer.java:1103)
    at com.zaxxer.hikari.util.QueuedSequenceSynchronizer.waitUntilSequenceExceeded(QueuedSequenceSynchronizer.java:92)
    at com.zaxxer.hikari.util.ConcurrentBag.borrow(ConcurrentBag.java:169)
    at com.zaxxer.hikari.pool.HikariPool.getConnection(HikariPool.java:165)
    ... 243 more

It appears the pool did not recover properly after the networking issue was corrected: we saw queries not returning (probably because the Hikari queue was full), and then this occurred (possibly when we sent a SIGTERM (-15) to restart the app).

I don't have specific reproduction steps, just thought you should know and triage - is Hikari behaving properly? An InterruptedException in production - that's a first for me!

brettwooldridge commented 8 years ago

The InterruptedException was almost certainly caused by the SIGTERM. HikariCP only puts a network timeout around its own operations -- connection acquisitions and alive checks -- and only where the driver supports setNetworkTimeout(). What kind of database is this and what driver version?
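The mechanism can be sketched with plain java.util.concurrent, no HikariCP required: a thread parked in a timed acquire (as ConcurrentBag.borrow parks in waitUntilSequenceExceeded) throws InterruptedException the moment it is interrupted, which is what a JVM shutdown sequence triggered by SIGTERM can do to threads blocked in getConnection(). The class and names below are illustrative, not HikariCP internals.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

public class InterruptDemo {
    // Returns true if the waiting thread observed an InterruptedException.
    public static boolean wasInterrupted() throws Exception {
        Semaphore noPermits = new Semaphore(0); // stand-in for a pool with no free connections
        final boolean[] interrupted = {false};
        Thread waiter = new Thread(() -> {
            try {
                // Blocks, like getConnection() waiting for a connection to become available
                noPermits.tryAcquire(30, TimeUnit.SECONDS);
            } catch (InterruptedException e) {
                interrupted[0] = true; // the wait ends with an exception, not a timeout
            }
        });
        waiter.start();
        Thread.sleep(100);  // let the waiter park
        waiter.interrupt(); // what a SIGTERM-driven shutdown effectively does to blocked threads
        waiter.join();
        return interrupted[0];
    }

    public static void main(String[] args) throws Exception {
        System.out.println(wasInterrupted()); // prints "true"
    }
}
```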

If your driver supports it, it is advisable to set a global network timeout to ensure that a network outage does not cause application queries to hang indefinitely in the TCP stack. For example, the MySQL driver has socketTimeout. However, if this is set to, say, 2 minutes, you will also need to set HikariCP's idleTimeout to the same or a shorter value.
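As a rough sketch of that pairing (the URL and values are illustrative; the socketTimeout property name and its milliseconds unit are from MySQL Connector/J -- check your own driver's documentation for the equivalent):

```java
import java.util.concurrent.TimeUnit;
import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;

HikariConfig config = new HikariConfig();
config.setJdbcUrl("jdbc:mysql://db.example.com/app");

// Driver-level socket timeout: caps how long a read can hang in the TCP stack
config.addDataSourceProperty("socketTimeout", "120000"); // 2 minutes, in ms for MySQL

// Pool-level idle timeout: same as or shorter than the socket timeout,
// so the pool retires idle connections before the network layer kills them
config.setIdleTimeout(TimeUnit.MINUTES.toMillis(2));

HikariDataSource ds = new HikariDataSource(config);
```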

noahlz commented 8 years ago

SQL Server 2008 jTDS v1.3.1

Thanks! I'll close this.

brettwooldridge commented 8 years ago

@noahlz This article might help too, seems like jTDS does have a socketTimeout property:

http://www.cubrid.org/blog/dev-platform/understanding-jdbc-internals-and-timeout-configuration/

noahlz commented 8 years ago

Thanks! I know that article well because I referenced it when overhauling our JDBC layer to use HikariCP earlier this year :) I'll double-check our timeout settings - it could just be that they are set in the minutes range (we have some long-running queries) and had not timed out by the time the SIGTERM arrived.

noahlz commented 8 years ago

Ok, I see in our code that we are not using setIdleTimeout when we create our HikariConfig object. Further, I see that the default for this is 10 minutes.

By happy coincidence, our SQL Server connections have a network socket timeout of 10 minutes as well (or longer, 60 minutes, for some components), but for our MySQL connections it's much shorter - a network socket timeout of around 2 minutes.

Sounds like we should set the idle timeout to be the same as the socket timeout in all cases?

Just one other question: what is the benefit of idleTimeout in the case where we have connections "hung" due to network issues (or non-responsive SQL Server connections)? It sounds like the connections get "removed" from the pool, but what is the impact of that on the application?

brettwooldridge commented 8 years ago

@noahlz Just to be clear, idleTimeout will have no effect on hung connections.

However, if a global socket-level timeout is set to reduce TCP-level disruptions, for example with MySQL's socketTimeout property, and that timeout is shorter than idleTimeout, it can cause delays when obtaining a connection from the pool. An idle connection conducts no traffic, so after socketTimeout it will be closed preemptively by the TCP stack. If idleTimeout is longer than socketTimeout, the pool will still think the connection is valid and will try to test it when getConnection() is called. Only then will the pool realize that the connection is dead and initiate the creation of a new connection.
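That ordering constraint can be captured in a small check. This is a hypothetical helper, not part of HikariCP; with the values from this thread (MySQL socketTimeout of 2 minutes, HikariCP's default idleTimeout of 10 minutes) it flags exactly the mismatch described above.

```java
public class TimeoutCheck {
    // True when the pool will retire an idle connection before the
    // driver's socket timeout can kill it underneath the pool.
    public static boolean idleTimeoutSafe(long idleTimeoutMs, long socketTimeoutMs) {
        return idleTimeoutMs <= socketTimeoutMs;
    }

    public static void main(String[] args) {
        long socketTimeout = 2 * 60 * 1000;  // MySQL socketTimeout: 2 minutes
        long idleTimeout = 10 * 60 * 1000;   // HikariCP default idleTimeout: 10 minutes
        System.out.println(idleTimeoutSafe(idleTimeout, socketTimeout)); // prints "false"
    }
}
```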

Ideally, in the absence of any network issues, the pool should close connections before either the network layer or the database itself does so preemptively. This minimizes delays and ensures that the pool does not contain dead connections that need to be tested, retired, and replaced at the point at which they are requested by the application.

It should also be noted that idleTimeout has no effect when minimumIdle is not set. When minimumIdle is not set, HikariCP operates as a fixed-size pool with maximumPoolSize connections. In this case, which is the default behavior, only the maxLifetime value is applicable.
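A minimal fixed-size configuration along those lines might look like this (values are illustrative, not recommendations from the thread):

```java
import java.util.concurrent.TimeUnit;
import com.zaxxer.hikari.HikariConfig;

HikariConfig config = new HikariConfig();
config.setMaximumPoolSize(10); // fixed size: minimumIdle defaults to maximumPoolSize

// With no explicit minimumIdle, only maxLifetime governs connection retirement;
// set it below any server-side or network-level connection lifetime limit
config.setMaxLifetime(TimeUnit.MINUTES.toMillis(30));

// idleTimeout would be ignored in this configuration
```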

It's funny that you reopened this issue, because I was just starting to work on an article about tuning the TCP stack of Linux and Windows to minimize downtime after a network interruption.

noahlz commented 8 years ago

I re-opened merely to indicate that I was hoping for more feedback, but I will close now - this is great information, thanks!

brettwooldridge commented 8 years ago

@noahlz I'll link the article I am creating here when it is finished. Probably sometime later this week.