ankit-joinwal opened this issue 2 years ago
The problem that this issue causes is SEVERE: the application is unhealthy for more than 3 minutes.
Was going through https://github.com/brettwooldridge/HikariCP/blob/dev/CHANGES#L5.
Seems like this issue was supposed to be fixed in 5.0.0.
I tried with version 5.0.1, but the issue is still happening. CC @brettwooldridge
Found the root cause of the issue.
When HikariCP requests a connection from the driver here, it waits for the connection to be returned within the configured connectionTimeout.
However, when the MariaDB driver is used with AWS Aurora, the driver does not return (either success or failure) within connectionTimeout, due to the failover recovery logic in the driver: https://mariadb.com/kb/en/failover-and-high-availability-with-mariadb-connector-j-for-2x-driver/
As per failover recovery, the driver will by default retry 120 times to establish a socket with the old password (which has now been rotated). As a result, add-connection requests keep piling up in HikariCP's queue.
The workaround I have used is to reduce the failover retry count via URL properties, as below:
`retriesAllDown=5&socketTimeout=3000`
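For reference, here is a minimal sketch of wiring those URL properties into a HikariCP setup. The host, database name, and credentials are placeholders, not values from this issue:

```java
import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;

public class AuroraPoolSketch {
    public static void main(String[] args) {
        // Sketch only: host, database, and credentials below are placeholders.
        HikariConfig config = new HikariConfig();
        config.setJdbcUrl("jdbc:mariadb:aurora://my-cluster.example.rds.amazonaws.com/mydb"
                + "?retriesAllDown=5&socketTimeout=3000"); // cap driver-side failover retries and socket waits
        config.setUsername("app_user");
        config.setPassword("app_password");
        config.setConnectionTimeout(30_000); // HikariCP's own bound on waiting for a connection
        try (HikariDataSource ds = new HikariDataSource(config)) {
            // ds.getConnection() now fails fast once the driver gives up after 5 retries
        }
    }
}
```

With retriesAllDown=5 and socketTimeout=3000, the driver's worst-case stall is bounded far below the default of 120 retries, so a pool starved by a rotated password recovers much sooner.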
But I think there should be a protection in HikariCP for such cases, so that it does not blindly trust the underlying driver to respect the configured connectionTimeout.
CC @brettwooldridge
@ankit-joinwal Thank you for troubleshooting this. I would argue that HikariCP already has a guard against requests "piling up in the queue" by limiting the queue size. What else can HikariCP do to guard against this?
Certainly, it would be advantageous to add to the FAQ or other documentation about the above retriesAllDown
property of the MariaDB driver.
In the end, regardless of how HikariCP handles this, how does the driver eventually recover from this condition?
@brettwooldridge Would it make sense to also use a Future and set a timeout on the Future when submitting the task to get a connection here and here?
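To make this concrete, here is a minimal sketch of the idea (not HikariCP's actual code): the driver call is submitted to an executor, and the caller bounds its own wait with Future.get, regardless of whether the driver honors connectionTimeout:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class BoundedConnectSketch {
    // Bound how long the caller waits on Driver.connect(), rather than
    // trusting the driver to return within connectionTimeout.
    static Connection connectWithTimeout(String jdbcUrl, long timeoutMs) throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<Connection> future = executor.submit(() -> DriverManager.getConnection(jdbcUrl));
        try {
            return future.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // best effort: the driver thread may keep retrying underneath
            throw e;
        } finally {
            executor.shutdown();
        }
    }
}
```

Note the caveat: this bounds how long the caller waits, but the abandoned driver thread can keep retrying in the background until the driver itself gives up.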
> In the end, regardless of how HikariCP handles this, how does the driver eventually recover from this condition?
The MariaDB driver throws an exception after the retries on all DB hosts are exhausted, and expects the client to send the connection request again. This retry behaviour of the MariaDB driver exists specifically to handle AWS Aurora primary failover.
Because HikariCP uses a thread pool of size 1 for adding connections, the blocked thread is then freed up and the piled-up queue requests make the connection attempt again. In our case, we are using aws-secretsmanager-jdbc to fetch user secrets, so the next time HikariCP attempts to make a connection, the user credentials are refreshed and the connection requests succeed. Here is the driver code for reference: https://github.com/mariadb-corporation/mariadb-connector-j/blob/maintenance/2.x/src/main/java/org/mariadb/jdbc/internal/protocol/AuroraProtocol.java#L125-L302
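For context, the aws-secretsmanager-jdbc wiring described above looks roughly like the sketch below. The secret name and host are placeholders, and the driver class name and jdbc-secretsmanager URL scheme should be verified against that library's README:

```java
import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;

public class SecretsManagerPoolSketch {
    public static void main(String[] args) {
        // Sketch only: "rds/my-app-secret" and the host are placeholders.
        HikariConfig config = new HikariConfig();
        config.setDriverClassName("com.amazonaws.secretsmanager.sql.AWSSecretsManagerMariaDBDriver");
        config.setJdbcUrl("jdbc-secretsmanager:mariadb://db-host.example:3306/mydb");
        // The username is the Secrets Manager secret ID; the wrapper driver resolves
        // the real username/password from the secret, refreshing on auth failures.
        config.setUsername("rds/my-app-secret");
        HikariDataSource ds = new HikariDataSource(config);
    }
}
```

This is why the retry after the driver's failure loop eventually succeeds: by then the wrapper fetches the rotated credentials from Secrets Manager.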
But the recovery time in this whole process is ~3 minutes, which is breaking our app's SLA.
Hope this clarifies.
I have documented the issue here, as it may help others too: https://ankit-joinwal.medium.com/a-must-read-before-using-hikaricp-and-mariadb-driver-with-aws-aurora-f3c4f19cc73b
According to this Benchmark:

Issue: In my case, once the pool hit zero available connections, HikariCP only added a connection after 3 minutes 52 seconds.

Expected: Refilling should not take 3 minutes. In the Benchmark, refilling was done in 800μs.
Steps to reproduce
HikariCP Config
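A hypothetical reconstruction of the config, consistent with the timeline below (a pool that fills to 2 connections, with HikariCP's default 30s connectionTimeout), might look like:

```java
import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;

public class ReproConfigSketch {
    public static void main(String[] args) {
        // Hypothetical reconstruction -- the original config block was not preserved.
        // Pool size of 2 is inferred from the startup behaviour described below.
        HikariConfig config = new HikariConfig();
        config.setJdbcUrl("jdbc:mariadb:aurora://my-cluster.example/mydb"); // placeholder URL
        config.setMinimumIdle(2);
        config.setMaximumPoolSize(2);
        config.setConnectionTimeout(30_000); // HikariCP default
        HikariDataSource ds = new HikariDataSource(config);
    }
}
```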
- Upon application startup at 2022-09-10 23:55:14.399, the pool was filled with 2 connections.
- At around 2022-09-10 23:56:14.410, the database password was rotated.
- At 2022-09-10 23:57:14.416, the pool did not have any available connections.
- At 2022-09-10 23:58:00.630, an HTTP request was made to fetch a record from the database, which resulted in an error.
- At 2022-09-11 00:00:16.370, after more than 3 minutes, connections were finally added asynchronously to the pool.