ankit-joinwal opened this issue 2 years ago
The problem that this issue causes is SEVERE: the application is unhealthy for more than 3 minutes.
Was going through https://github.com/brettwooldridge/HikariCP/blob/dev/CHANGES#L5.
Seems like this issue was supposed to be fixed in 5.0.0.
I tried with version 5.0.1, but the issue is still happening. CC @brettwooldridge
Found the root cause of the issue.
When HikariCP requests a connection from the driver here, it waits for the connection to be returned within the configured connectionTimeout.
However, when the MariaDB driver is used with AWS Aurora, the driver does not return (either success or failure) within connectionTimeout, due to the failover recovery logic in the driver: https://mariadb.com/kb/en/failover-and-high-availability-with-mariadb-connector-j-for-2x-driver/
As per failover recovery, the driver will by default retry 120 times to establish a socket with the old password (which has now been rotated). As a result, add-connection requests keep piling up in HikariCP's queue.
The workaround I have used is to reduce the failover retry count via URL properties, as below:
`retriesAllDown=5&socketTimeout=3000`
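For reference, here is a minimal sketch of wiring those URL properties into a HikariCP setup. The host, database name, and credentials are placeholders, not values from this issue:

```java
import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;

public class AuroraPoolSketch {
    public static void main(String[] args) {
        // Sketch only: host, database, and credentials below are placeholders.
        HikariConfig config = new HikariConfig();
        config.setJdbcUrl("jdbc:mariadb:aurora://my-cluster.example.rds.amazonaws.com/mydb"
                + "?retriesAllDown=5&socketTimeout=3000"); // cap driver-side failover retries and socket waits
        config.setUsername("app_user");
        config.setPassword("app_password");
        config.setConnectionTimeout(30_000); // HikariCP's own bound on waiting for a connection
        try (HikariDataSource ds = new HikariDataSource(config)) {
            // ds.getConnection() now fails fast once the driver gives up after 5 retries
        }
    }
}
```

With retriesAllDown=5 and socketTimeout=3000, the driver's worst-case stall is bounded far below the default of 120 retries, so a pool starved by a rotated password recovers much sooner.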
But I think there should be a protection in HikariCP for such cases, so that it does not blindly trust the underlying driver to respect the configured connectionTimeout.
CC @brettwooldridge
@ankit-joinwal Thank you for troubleshooting this. I would argue that HikariCP already has a guard against requests "piling up in the queue" by limiting the queue size. What else can HikariCP do to guard against this?
Certainly, it would be advantageous to add to the FAQ or other documentation about the above retriesAllDown
property of the MariaDB driver.
In the end, regardless of how HikariCP handles this, how does the driver eventually recover from this condition?
@brettwooldridge Would it make sense to also use a Future and set a timeout on the Future when submitting the task to get a connection here and here?
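To make this concrete, here is a minimal sketch of the idea (not HikariCP's actual code): the driver call is submitted to an executor, and the caller bounds its own wait with Future.get, regardless of whether the driver honors connectionTimeout:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class BoundedConnectSketch {
    // Bound how long the caller waits on Driver.connect(), rather than
    // trusting the driver to return within connectionTimeout.
    static Connection connectWithTimeout(String jdbcUrl, long timeoutMs) throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<Connection> future = executor.submit(() -> DriverManager.getConnection(jdbcUrl));
        try {
            return future.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // best effort: the driver thread may keep retrying underneath
            throw e;
        } finally {
            executor.shutdown();
        }
    }
}
```

Note the caveat: this bounds how long the caller waits, but the abandoned driver thread can keep retrying in the background until the driver itself gives up.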
> In the end, regardless of how HikariCP handles this, how does the driver eventually recover from this condition?
The MariaDB driver throws an exception after the retries on all DB hosts are exhausted, and expects the client to send the connection request again. This retry behaviour of the MariaDB driver exists specifically to handle AWS Aurora primary failover.
Because HikariCP uses a thread pool of size 1 for adding connections, the blocked thread is then freed up and the piled-up queue requests make the connection attempt again. In our case, we are using aws-secretsmanager-jdbc to fetch user secrets, so the next time HikariCP attempts to make a connection, the user credentials are refreshed and the connection requests succeed. Here is the driver code for reference: https://github.com/mariadb-corporation/mariadb-connector-j/blob/maintenance/2.x/src/main/java/org/mariadb/jdbc/internal/protocol/AuroraProtocol.java#L125-L302
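For context, the aws-secretsmanager-jdbc wiring described above looks roughly like the sketch below. The secret name and host are placeholders, and the driver class name and jdbc-secretsmanager URL scheme should be verified against that library's README:

```java
import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;

public class SecretsManagerPoolSketch {
    public static void main(String[] args) {
        // Sketch only: "rds/my-app-secret" and the host are placeholders.
        HikariConfig config = new HikariConfig();
        config.setDriverClassName("com.amazonaws.secretsmanager.sql.AWSSecretsManagerMariaDBDriver");
        config.setJdbcUrl("jdbc-secretsmanager:mariadb://db-host.example:3306/mydb");
        // The username is the Secrets Manager secret ID; the wrapper driver resolves
        // the real username/password from the secret, refreshing on auth failures.
        config.setUsername("rds/my-app-secret");
        HikariDataSource ds = new HikariDataSource(config);
    }
}
```

This is why the retry after the driver's failure loop eventually succeeds: by then the wrapper fetches the rotated credentials from Secrets Manager.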
But the recovery time in this whole process is ~3 minutes, which is breaking our app's SLA.
Hope this clarifies.
I have documented the issue here, as it may help others too: https://ankit-joinwal.medium.com/a-must-read-before-using-hikaricp-and-mariadb-driver-with-aws-aurora-f3c4f19cc73b
According to this Benchmark:

Issue: In my case, once the pool hit zero available connections, HikariCP only added a connection after 3 minutes 52 seconds.

Expected: Refilling should not take 3 minutes. In the Benchmark, refilling was done in 800μs.
Steps to reproduce
HikariCP Config
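A hypothetical reconstruction of the config, consistent with the timeline below (a pool that fills to 2 connections, with HikariCP's default 30s connectionTimeout), might look like:

```java
import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;

public class ReproConfigSketch {
    public static void main(String[] args) {
        // Hypothetical reconstruction -- the original config block was not preserved.
        // Pool size of 2 is inferred from the startup behaviour described below.
        HikariConfig config = new HikariConfig();
        config.setJdbcUrl("jdbc:mariadb:aurora://my-cluster.example/mydb"); // placeholder URL
        config.setMinimumIdle(2);
        config.setMaximumPoolSize(2);
        config.setConnectionTimeout(30_000); // HikariCP default
        HikariDataSource ds = new HikariDataSource(config);
    }
}
```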
- Upon application startup at 2022-09-10 23:55:14.399, the pool was filled with 2 connections.
- At around 2022-09-10 23:56:14.410, the database password was rotated.
- At 2022-09-10 23:57:14.416, the pool did not have any available connections.
- At 2022-09-10 23:58:00.630, an HTTP request was made to fetch a record from the database, which resulted in an error.
- At 2022-09-11 00:00:16.370, after more than 3 minutes, connections were finally added asynchronously to the pool.