Closed noahlz closed 8 years ago
The InterruptedException was almost certainly caused by the SIGTERM. HikariCP only puts a network timeout around its own operations -- connection acquisitions and alive checks -- and only where the driver supports setNetworkTimeout(). What kind of database is this, and what driver version?
If your driver supports it, it is advisable to set a global network timeout so that a network outage does not cause application queries to hang indefinitely in the TCP stack. For example, the MySQL driver has a socketTimeout property. However, if this is set, for example to 2 minutes, you will also need to set the HikariCP idleTimeout to the same value or shorter.
SQL Server 2008 jTDS v1.3.1
Thanks! I'll close this.
@noahlz This article might help too, seems like jTDS does have a socketTimeout
property:
http://www.cubrid.org/blog/dev-platform/understanding-jdbc-internals-and-timeout-configuration/
Thanks! I know that article well because I referenced it when overhauling our JDBC layer to use HikariCP earlier this year :) I'll double-check our timeout settings. It could just be that they are set into the minutes (we have some long-running queries) and had not timed out when the SIGTERM arrived.
Ok, I see in our code that we are not calling setIdleTimeout when we create our HikariConfig object. Further, I see that the default for this setting is 10 minutes.
By happy coincidence, our SQL Server connections also have a network socket timeout of 10 minutes (or longer, 60 minutes, for some components), but for our MySQL connections it is much shorter: a network socket timeout of roughly 2 minutes.
Sounds like we should set the idle timeout to be the same as the socket timeout in all cases?
Just one other question: what is the benefit of idleTimeout in the case where we have connections "hung" due to network issues (or unresponsive SQL Server connections)? It sounds like such connections get "removed" from the pool, but what is the impact of that on the application?
@noahlz Just to be clear, idleTimeout will have no effect on hung connections.
However, if a global socket-level timeout is set in order to reduce TCP-level disruptions, for example with MySQL's socketTimeout property, and that timeout is shorter than idleTimeout, it can cause delays when obtaining a connection from the pool. An idle connection conducts no traffic, so after socketTimeout elapses it will be closed preemptively by the TCP stack. If idleTimeout is longer than socketTimeout, the pool will still consider the connection valid and will try to test it when getConnection() is called. Only then will the pool realize that the connection is dead, and will initiate the creation of a replacement.
Ideally, in the absence of any network issues, the pool should close connections before either the network layer or the database itself does so preemptively. This minimizes delays and ensures that the pool does not contain dead connections that need to be tested, retired, and replaced at the point at which they are requested by the application.
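As a hedged sketch of that ordering, assuming a driver-level socketTimeout of 2 minutes and a server-side idle timeout measured in hours (for example MySQL's wait_timeout, 8 hours by default), the pool's timeouts might be arranged so HikariCP always retires connections first:

```properties
# Hypothetical settings -- the ordering is the point, not the exact values.
# Retire idle connections before the driver's 2-minute socketTimeout (120000 ms)
# can close them at the socket level: 90 seconds.
idleTimeout=90000
# Retire every connection well before any server-side idle timeout
# (e.g. MySQL's wait_timeout, 8 hours by default) closes it first: 30 minutes.
maxLifetime=1800000
```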
It should also be noted that idleTimeout has no effect when minimumIdle is not set. When minimumIdle is not set, HikariCP operates as a fixed-size pool of maximumPoolSize connections. In this case, which is the default behavior, only the maxLifetime value is applicable.
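A minimal sketch of that default, fixed-size configuration, with placeholder connection details for a jTDS/SQL Server pool:

```properties
# Hypothetical hikari.properties for a fixed-size pool.
# minimumIdle is deliberately NOT set, so the pool holds a constant
# maximumPoolSize connections and idleTimeout is ignored.
jdbcUrl=jdbc:jtds:sqlserver://db.example.com:1433/appdb
username=app
password=secret
maximumPoolSize=10
# maxLifetime is the only retirement setting that applies here: 30 minutes.
maxLifetime=1800000
```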
It's funny that you reopened this issue, because I was just starting to work on an article about tuning the TCP stack of Linux and Windows to minimize downtime after a network interruption.
I re-opened merely to indicate that I was hoping for more feedback, but I will close now. This is great information, thanks!
@noahlz I'll link the article I am creating here when it is finished. Probably sometime later this week.
Last night we had a network outage in our datacenter that caused some databases to be temporarily unreachable (no route to host). After the issue was resolved, we saw a flurry of the following errors:
It appears the pool did not recover properly after the networking issue was corrected. We saw queries not returning (probably because the Hikari queue was full), and then this occurred (possibly when we sent a SIGTERM, signal 15, to restart the app).
I don't have specific reproduction steps; I just thought you should know and triage. Is Hikari behaving properly? An InterruptedException in production -- that's a first for me!