roookeee opened this issue 10 months ago (status: Open)
@roookeee very interesting write-up. Did you think about setting an infinite lifetime (`maxLifetime=0`) to further confirm that the overhead comes from the provisioning of connections?

It is not uncommon to use a fixed-size pool (`minimumIdle=maximumPoolSize` in the case of Hikari). I am wondering whether such setups would run into any inconvenience with an infinite lifetime, as it should indirectly address the provisioning overhead.
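For reference, a minimal sketch of that configuration, assuming a plain `HikariDataSource` (JDBC URL and credentials are placeholders):

```java
import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;

public class FixedPoolSketch {
    public static void main(String[] args) {
        HikariConfig config = new HikariConfig();
        config.setJdbcUrl("jdbc:postgresql://db.example.com:5432/app"); // placeholder
        config.setUsername("app");     // placeholder
        config.setPassword("secret");  // placeholder

        // Fixed-size pool: minimumIdle == maximumPoolSize
        config.setMaximumPoolSize(15);
        config.setMinimumIdle(15);

        // maxLifetime = 0 disables lifetime-based retirement, i.e. the pool
        // never rotates connections on its own.
        config.setMaxLifetime(0);

        try (HikariDataSource ds = new HikariDataSource(config)) {
            // use ds.getConnection() as usual
        }
    }
}
```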
> Did you think about setting an infinite lifetime (`maxLifetime=0`) to further confirm that the overhead comes from the provisioning of connections?

We set the `maxLifetime` to 6 hours, so the issue appears less often but is still present. An infinite lifetime is sadly not possible because of firewall / provider rules etc.

> It is not uncommon to use a fixed-size pool (`minimumIdle=maximumPoolSize` in the case of Hikari)

I can't quite follow here, we are already using a standard `minimumIdle=maximumPoolSize` setup.
EDIT: We can live with these spikes every 6h, but I am still interested in why this is happening. How is Hikari re-creating connections? I would hope it creates a new connection first and destroys an expired connection only after the new one has been added.
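To illustrate the ordering I would hope for, here is a rough sketch with stand-in types (this is explicitly not HikariCP's actual internals):

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch of a "create before retire" rotation -- NOT
// HikariCP's actual implementation. Conn is a stand-in for a pooled
// JDBC connection.
class RotationSketch {
    static final class Conn { }

    final BlockingQueue<Conn> idle = new LinkedBlockingQueue<>();

    Conn open() {
        // stand-in for a real (and potentially slow) JDBC connect
        return new Conn();
    }

    void rotate(Conn expired) {
        Conn replacement = open(); // 1. create the replacement first
        idle.offer(replacement);   // 2. make it available to borrowers
        idle.remove(expired);      // 3. only then retire the expired connection
        // a real pool would also close `expired` here
    }
}
```

If the expired connection were instead closed before its replacement is ready, a borrower could hit a momentarily smaller pool and pay the full connection-setup cost inline, which would match the spikes we see.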
Facing the same issue here. What is the solution? Please share.
An improvement would be https://github.com/brettwooldridge/HikariCP/pull/2035
### Context

We are using HikariCP in a GraphQL Spring Boot application inside of Kubernetes. We are connecting to an AWS Aurora Serverless V2 instance in the same AZ. Each service uses its own dedicated AWS Aurora Serverless V2 database.

Our services use a fixed connection pool of 15 connections per pod; we currently run 4 pods = 60 connections (see the configuration sketch below).

The included images show averaged graphs across all running pods.
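For clarity, a minimal sketch of the pool setup described above, assuming plain HikariCP configuration (URL and credentials are placeholders; the 30-minute `maxLifetime` is Hikari's default and matches the spike interval described below):

```java
import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;
import javax.sql.DataSource;

class PoolSetupSketch {
    static DataSource dataSource() {
        HikariConfig config = new HikariConfig();
        config.setJdbcUrl("jdbc:postgresql://aurora.example.com:5432/service_db"); // placeholder
        config.setUsername("app");     // placeholder
        config.setPassword("secret");  // placeholder
        config.setMaximumPoolSize(15); // fixed-size pool:
        config.setMinimumIdle(15);     // minimumIdle == maximumPoolSize
        config.setMaxLifetime(30 * 60 * 1000L); // 30 minutes, Hikari's default
        return new HikariDataSource(config);
    }
}
```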
### Problem

We noticed that some of our services incur a large latency increase every 30 minutes. After cross-referencing the recurring 30-minute events we found strong evidence that Hikari's re-creation of connections after `maxLifetime` expires introduces said latency spikes: the metric `hikaricp_connections_creation_seconds` changes exactly when the latency issues arise. This also explains why the 30-minute window is "moving": the 30-minute lifetime is relative to the application start, not to a fixed point in time (e.g. every 30 minutes exactly at 00:00, 00:30, 01:00 etc.):

![image](https://github.com/brettwooldridge/HikariCP/assets/1199562/5827903d-3979-4610-8fb1-fe3d0bb02148)

As shown, latency increases 10x for at least 2-3 collection intervals (Prometheus scrapes us every 15-30 seconds). Other, higher-load services don't seem to experience the same issues, but that is somewhat expected as the 99th and 50th percentiles are statistically "smoother" when lots of load is present:
![image](https://github.com/brettwooldridge/HikariCP/assets/1199562/67123bbc-2506-409c-a211-20d51297e722)
I would have expected HikariCP to re-create connections in the background, which should introduce negligible overhead, not something in the area of 20-30ms for a period of 30 seconds. The low-load service executes trivial queries with low query-planning overhead (the query executes within 1-3ms with an uncached query plan), which makes the overhead seem excessive even when considering that a new connection has certain server-side overhead (no caches are warm, no query plans are cached yet).
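To put rough numbers on the suspected per-connection setup cost, a small standalone probe like the following could compare a cold connect-plus-first-query against a warm re-run of the same query (URL and credentials are placeholders; it assumes a PostgreSQL-compatible endpoint with the JDBC driver on the classpath):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

class ConnectionCostProbe {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:postgresql://aurora.example.com:5432/service_db"; // placeholder

        long t0 = System.nanoTime();
        try (Connection conn = DriverManager.getConnection(url, "app", "secret"); // placeholders
             Statement st = conn.createStatement()) {
            // cold path: fresh TCP/TLS/auth handshake plus first trivial query
            try (ResultSet rs = st.executeQuery("SELECT 1")) { rs.next(); }
            long coldMs = (System.nanoTime() - t0) / 1_000_000;

            // warm path: same trivial query on the already-open connection
            long t1 = System.nanoTime();
            try (ResultSet rs = st.executeQuery("SELECT 1")) { rs.next(); }
            long warmMs = (System.nanoTime() - t1) / 1_000_000;

            System.out.printf("cold connect + query: %d ms, warm query: %d ms%n", coldMs, warmMs);
        }
    }
}
```

The gap between the two numbers approximates what a request pays when it has to wait for a freshly created connection instead of reusing a warm one.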
Is there any way to get rid of these latency spikes when a low-load service gets its connections rotated because of `maxLifetime`? It's messing with our SLAs, and it's quite unfortunate that a low-load service is affected more than higher-load services.

The high-load service experiences the same spikes in a low-load environment, which increases the likelihood that this is related to connection re-creation in low-load environments.