brettwooldridge / HikariCP

光 HikariCP・A solid, high-performance, JDBC connection pool at last.
Apache License 2.0
20.04k stars 2.94k forks source link

No backoff if PoolEntryCreator is interrupted #2265

Open PaoloVoronini opened 4 days ago

PaoloVoronini commented 4 days ago

https://github.com/brettwooldridge/HikariCP/blob/ffdb8106bccb285d1775af1e8b040bfce4588dc1/src/main/java/com/zaxxer/hikari/pool/HikariPool.java#L758C19-L758C31

Method UtilityElf.quietlySleep() retains the interrupted flag As a result PoolEntryCreator does not sleep at all - i.e. no backoff. Possibly the interrupted flag also prevents creating new pool entry.

while (shouldContinueCreating()) {
   final var poolEntry = createPoolEntry();
   if (poolEntry != null) {
      added = true;
      connectionBag.add(poolEntry);
      logger.debug("{} - Added connection {}", poolName, poolEntry.connection);
      quietlySleep(30L);
      break;
   } else {  // failed to get connection from db, sleep and retry
      if (loggingPrefix != null && backoffMs % 50 == 0)
         logger.debug("{} - Connection add failed, sleeping with backoff: {}ms", poolName, backoffMs);

      //THE PROBLEM IS HERE
      //the thread does not sleep because it is interrupted
      quietlySleep(backoffMs); 

      backoffMs = Math.min(SECONDS.toMillis(5), backoffMs * 2);
   }
}

@brettwooldridge my suggestion would be to clear the interrupted flag in the snippet above. i.e.

quietlySleep(backoffMs);
//clear interrupted state
Thread.interrupted();

PS: this might be related to #2208 and other "database outage" issues mentioned in it

brettwooldridge commented 4 days ago

What would cause the adder thread to be interrupted besides shutdown?

PaoloVoronini commented 4 days ago

thanks Brett for the quick response!

unfortunately, i was not able to figure out the root cause reliably. similar to other issues i mentioned, we observe this upon database outages this is the last piece of the stacktrace (please find below) so far i can only speculate that possibly the ojdbc sets the interrupted status of the thread for some reason

PS: just in case, i should mention that i wrote a little method to support the statement "thread does not sleep"

public static void main(String[] args) {
    Thread t = new Thread(() -> {
        while (true) {
            try {
                System.out.println("sleeping");
                Thread.sleep(3000);
            } catch (InterruptedException e) {
                System.out.println("interrupted");
                Thread.currentThread().interrupt();
            }
        }
    });
    t.start();
    t.interrupt();
}

End of the stacktrace:

Caused by: java.io.InterruptedIOException: Socket read interrupted
    at oracle.net.nt.TimeoutSocketChannel.initializeSocketChannel(TimeoutSocketChannel.java:278) ~[ojdbc11-23.4.0.24.05.jar:23.4.0.24.05]
    at oracle.net.nt.TimeoutSocketChannel.connect(TimeoutSocketChannel.java:235) ~[ojdbc11-23.4.0.24.05.jar:23.4.0.24.05]
    at oracle.net.nt.TimeoutSocketChannel.<init>(TimeoutSocketChannel.java:202) ~[ojdbc11-23.4.0.24.05.jar:23.4.0.24.05]
    at oracle.net.nt.TcpNTAdapter.establishSocket(TcpNTAdapter.java:335) ~[ojdbc11-23.4.0.24.05.jar:23.4.0.24.05]
    at oracle.net.nt.TcpsNTAdapter.establishSocket(TcpsNTAdapter.java:248) ~[ojdbc11-23.4.0.24.05.jar:23.4.0.24.05]
    at oracle.net.nt.TcpNTAdapter.connect(TcpNTAdapter.java:224) ~[ojdbc11-23.4.0.24.05.jar:23.4.0.24.05]
    at oracle.net.nt.TcpsNTAdapter.connect(TcpsNTAdapter.java:227) ~[ojdbc11-23.4.0.24.05.jar:23.4.0.24.05]
    at oracle.net.nt.ConnOption.connect(ConnOption.java:332) ~[ojdbc11-23.4.0.24.05.jar:23.4.0.24.05]
    at oracle.net.nt.ConnStrategy.executeConnOption(ConnStrategy.java:1199) ~[ojdbc11-23.4.0.24.05.jar:23.4.0.24.05]
    at oracle.net.nt.ConnStrategy.execute(ConnStrategy.java:742) ~[ojdbc11-23.4.0.24.05.jar:23.4.0.24.05]
    at oracle.net.resolver.AddrResolution.resolveAndExecute(AddrResolution.java:708) ~[ojdbc11-23.4.0.24.05.jar:23.4.0.24.05]
    at oracle.net.ns.NSProtocol.establishConnection(NSProtocol.java:959) ~[ojdbc11-23.4.0.24.05.jar:23.4.0.24.05]
    at oracle.net.ns.NSProtocol.connect(NSProtocol.java:328) ~[ojdbc11-23.4.0.24.05.jar:23.4.0.24.05]
    at oracle.jdbc.driver.T4CConnection.connectNetworkSessionProtocol(T4CConnection.java:3411) ~[ojdbc11-23.4.0.24.05.jar:23.4.0.24.05]
    at oracle.jdbc.driver.T4CConnection.logon(T4CConnection.java:1016) ~[ojdbc11-23.4.0.24.05.jar:23.4.0.24.05]
    ... 14 more 
brettwooldridge commented 4 days ago

Thanks. Still trying to figure this out. java.io.InterruptedIOException extends java.io.IOException which extends java.lang.Exception. None of them extends java.lang.InterruptedException. The above exception should be caught and handled here. Still not a thread interrupt...

Is this running in Google GCD or something? Are you supplying an externally managed ScheduledExecutorService to the pool?

PaoloVoronini commented 3 days ago

1) we have observed this both with a component running in Google Cloud and on-premise 2) no, we do not supply ScheduledExecutorService 3) why have i not shared the full stacktrace? sorry! please find it below now

PS: yes, it does not make sense to me, neither. first, the exception is not InterruptedException, but i could explain it (just a possibility) with original exception being something like ClosedByInterruptException, being caught somewhere (in ojbc or java.nio) and suppressed without clearing the interrupted state of the thread - this would explain the interrupt status without InterruptedException. second, still someone should call "interrupt" on our thread in the first place. for this i see no explanation at all.

2024-10-24 09:34:52,386 INFO ? [UAT connection adder] throwing
java.sql.SQLRecoverableException: ORA-17002: I/O error: , Authentication lapse 0 ms.
https://docs.oracle.com/error-help/db/ora-17002/
    at oracle.jdbc.driver.T4CConnection.handleLogonIOException(T4CConnection.java:1658) ~[ojdbc11-23.4.0.24.05.jar:23.4.0.24.05]
    at oracle.jdbc.driver.T4CConnection.handleLogonInterruptedIOException(T4CConnection.java:1519) ~[ojdbc11-23.4.0.24.05.jar:23.4.0.24.05]
    at oracle.jdbc.driver.T4CConnection.logon(T4CConnection.java:1132) ~[ojdbc11-23.4.0.24.05.jar:23.4.0.24.05]
    at oracle.jdbc.driver.PhysicalConnection.connect(PhysicalConnection.java:1178) [ojdbc11-23.4.0.24.05.jar:23.4.0.24.05]
    at oracle.jdbc.driver.T4CDriverExtension.getConnection(T4CDriverExtension.java:105) [ojdbc11-23.4.0.24.05.jar:23.4.0.24.05]
    at oracle.jdbc.driver.OracleDriver.connect(OracleDriver.java:886) [ojdbc11-23.4.0.24.05.jar:23.4.0.24.05]
    at oracle.jdbc.driver.OracleDriver.connect(OracleDriver.java:693) [ojdbc11-23.4.0.24.05.jar:23.4.0.24.05]
    at com.zaxxer.hikari.util.DriverDataSource.getConnection(DriverDataSource.java:138) [HikariCP-5.0.1.jar:?]
    at com.zaxxer.hikari.pool.PoolBase.newConnection(PoolBase.java:359) [HikariCP-5.0.1.jar:?]
    at com.zaxxer.hikari.pool.PoolBase.newPoolEntry(PoolBase.java:201) [HikariCP-5.0.1.jar:?]
    at com.zaxxer.hikari.pool.HikariPool.createPoolEntry(HikariPool.java:470) [HikariCP-5.0.1.jar:?]
    at com.zaxxer.hikari.pool.HikariPool$PoolEntryCreator.call(HikariPool.java:733) [HikariCP-5.0.1.jar:?]
    at com.zaxxer.hikari.pool.HikariPool$PoolEntryCreator.call(HikariPool.java:712) [HikariCP-5.0.1.jar:?]
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
    at java.base/java.lang.Thread.run(Thread.java:840) [?:?]
Caused by: java.io.IOException: , Authentication lapse 0 ms.
    at oracle.jdbc.driver.T4CConnection.handleLogonIOException(T4CConnection.java:1653) ~[ojdbc11-23.4.0.24.05.jar:23.4.0.24.05]
    ... 16 more
Caused by: java.io.InterruptedIOException:
    at oracle.net.nt.TcpNTAdapter.handleEstablishSocketException(TcpNTAdapter.java:401) ~[ojdbc11-23.4.0.24.05.jar:23.4.0.24.05]
    at oracle.net.nt.TcpNTAdapter.establishSocket(TcpNTAdapter.java:346) ~[ojdbc11-23.4.0.24.05.jar:23.4.0.24.05]
    at oracle.net.nt.TcpsNTAdapter.establishSocket(TcpsNTAdapter.java:248) ~[ojdbc11-23.4.0.24.05.jar:23.4.0.24.05]
    at oracle.net.nt.TcpNTAdapter.connect(TcpNTAdapter.java:224) ~[ojdbc11-23.4.0.24.05.jar:23.4.0.24.05]
    at oracle.net.nt.TcpsNTAdapter.connect(TcpsNTAdapter.java:227) ~[ojdbc11-23.4.0.24.05.jar:23.4.0.24.05]
    at oracle.net.nt.ConnOption.connect(ConnOption.java:332) ~[ojdbc11-23.4.0.24.05.jar:23.4.0.24.05]
    at oracle.net.nt.ConnStrategy.executeConnOption(ConnStrategy.java:1199) ~[ojdbc11-23.4.0.24.05.jar:23.4.0.24.05]
    at oracle.net.nt.ConnStrategy.execute(ConnStrategy.java:742) ~[ojdbc11-23.4.0.24.05.jar:23.4.0.24.05]
    at oracle.net.resolver.AddrResolution.resolveAndExecute(AddrResolution.java:708) ~[ojdbc11-23.4.0.24.05.jar:23.4.0.24.05]
    at oracle.net.ns.NSProtocol.establishConnection(NSProtocol.java:959) ~[ojdbc11-23.4.0.24.05.jar:23.4.0.24.05]
    at oracle.net.ns.NSProtocol.connect(NSProtocol.java:328) ~[ojdbc11-23.4.0.24.05.jar:23.4.0.24.05]
    at oracle.jdbc.driver.T4CConnection.connectNetworkSessionProtocol(T4CConnection.java:3411) ~[ojdbc11-23.4.0.24.05.jar:23.4.0.24.05]
    at oracle.jdbc.driver.T4CConnection.logon(T4CConnection.java:1016) ~[ojdbc11-23.4.0.24.05.jar:23.4.0.24.05]
    ... 14 more
Caused by: java.io.InterruptedIOException: Socket read interrupted
    at oracle.net.nt.TimeoutSocketChannel.initializeSocketChannel(TimeoutSocketChannel.java:278) ~[ojdbc11-23.4.0.24.05.jar:23.4.0.24.05]
    at oracle.net.nt.TimeoutSocketChannel.connect(TimeoutSocketChannel.java:235) ~[ojdbc11-23.4.0.24.05.jar:23.4.0.24.05]
    at oracle.net.nt.TimeoutSocketChannel.<init>(TimeoutSocketChannel.java:202) ~[ojdbc11-23.4.0.24.05.jar:23.4.0.24.05]
    at oracle.net.nt.TcpNTAdapter.establishSocket(TcpNTAdapter.java:335) ~[ojdbc11-23.4.0.24.05.jar:23.4.0.24.05]
    at oracle.net.nt.TcpsNTAdapter.establishSocket(TcpsNTAdapter.java:248) ~[ojdbc11-23.4.0.24.05.jar:23.4.0.24.05]
    at oracle.net.nt.TcpNTAdapter.connect(TcpNTAdapter.java:224) ~[ojdbc11-23.4.0.24.05.jar:23.4.0.24.05]
    at oracle.net.nt.TcpsNTAdapter.connect(TcpsNTAdapter.java:227) ~[ojdbc11-23.4.0.24.05.jar:23.4.0.24.05]
    at oracle.net.nt.ConnOption.connect(ConnOption.java:332) ~[ojdbc11-23.4.0.24.05.jar:23.4.0.24.05]
    at oracle.net.nt.ConnStrategy.executeConnOption(ConnStrategy.java:1199) ~[ojdbc11-23.4.0.24.05.jar:23.4.0.24.05]
    at oracle.net.nt.ConnStrategy.execute(ConnStrategy.java:742) ~[ojdbc11-23.4.0.24.05.jar:23.4.0.24.05]
    at oracle.net.resolver.AddrResolution.resolveAndExecute(AddrResolution.java:708) ~[ojdbc11-23.4.0.24.05.jar:23.4.0.24.05]
    at oracle.net.ns.NSProtocol.establishConnection(NSProtocol.java:959) ~[ojdbc11-23.4.0.24.05.jar:23.4.0.24.05]
    at oracle.net.ns.NSProtocol.connect(NSProtocol.java:328) ~[ojdbc11-23.4.0.24.05.jar:23.4.0.24.05]
    at oracle.jdbc.driver.T4CConnection.connectNetworkSessionProtocol(T4CConnection.java:3411) ~[ojdbc11-23.4.0.24.05.jar:23.4.0.24.05]
    at oracle.jdbc.driver.T4CConnection.logon(T4CConnection.java:1016) ~[ojdbc11-23.4.0.24.05.jar:23.4.0.24.05]
    ... 14 more 
brettwooldridge commented 2 days ago

java.sql.SQLRecoverableException: ORA-17002: I/O error: , Authentication lapse 0 ms.

This is a very specific and unusual error. Many threads point to an Oracle listener process falling over. But that is not conclusive.

I suggest looking at these Google results and this ChatGPT diagnosis.

It seems to me handling InterruptedException is a separate issue. My general reaction is that it is bad behavior to simply "clear and ignore" an InterruptedException. If anything, the entire pool should shutdown rather than continuing. I suspect addressing the underlying connection issue will also resolve the InterruptedException issue.

Also, you should ensure that you are running in HikariCP v6.2.1. Some threads point to improperly closed connections, and v6.0.0 contained a fix related to connection closing (double-close).