Re-creating connections after maxLifetime creates large latency issues

roookeee commented 10 months ago

Context: We are using HikariCP in a GraphQL Spring Boot application running in Kubernetes. We connect to an AWS Aurora Serverless V2 instance in the same AZ.

Each pod uses a fixed-size connection pool of 15 connections. We currently run 4 pods, for 60 connections in total. Each service uses its own dedicated AWS Aurora Serverless V2 database.

The included images show averaged graphs across all running pods.

Problem: We noticed that some of our services incur a large latency increase every 30 minutes. After cross-referencing every event that recurs at a 30-minute interval, we found strong evidence that HikariCP's re-creation of connections after maxLifetime expires causes these latency spikes. The metric hikaricp_connections_creation_seconds changes exactly when the latency issues arise. This also explains why the 30-minute window is "moving": the 30-minute lifetime is relative to the application start, not to a fixed point in time (e.g. exactly at 00:00, 00:30, 01:00, etc.).
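To illustrate the "moving" window, here is a minimal sketch that computes when a connection opened at pod start, and each of its replacements, would hit maxLifetime. `RotationSchedule` and the pod start time are made up for illustration; the sketch ignores any jitter HikariCP applies and the time needed to open the replacement.

```java
import java.time.Duration;
import java.time.LocalTime;
import java.util.ArrayList;
import java.util.List;

// Illustrative helper only, not part of HikariCP.
class RotationSchedule {

    // Returns the first `count` times at which a connection opened at
    // `createdAt` (and then each replacement) would reach maxLifetime.
    // Assumes no jitter and instantaneous replacement.
    static List<LocalTime> expiryTimes(LocalTime createdAt, Duration maxLifetime, int count) {
        List<LocalTime> times = new ArrayList<>();
        LocalTime next = createdAt;
        for (int i = 0; i < count; i++) {
            next = next.plus(maxLifetime);
            times.add(next);
        }
        return times;
    }

    public static void main(String[] args) {
        // Hypothetical pod start time; the 30-minute lifetime is the one from this report.
        LocalTime podStart = LocalTime.of(10, 7, 13);
        System.out.println(expiryTimes(podStart, Duration.ofMinutes(30), 3));
        // prints [10:37:13, 11:07:13, 11:37:13]
    }
}
```

A pod restarted at a different time gets a different offset, which is why the spikes do not line up with the wall clock.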

As shown, the latency increases 10x for at least 2-3 collection intervals (Prometheus scrapes us every 15-30 seconds). Other, higher-load services don't seem to experience the same issue, but that is somewhat expected, as the 99th and 50th percentiles are statistically "smoother" under heavy load.

I would have expected HikariCP to re-create the connections in the background, which should add negligible overhead, not something in the range of 20-30ms for a period of 30 seconds. The low-load service executes trivial queries with low query planning overhead (a query executes within 1-3ms with an uncached query plan), which makes the overhead seem excessive even considering that a new connection has some server-side overhead (no caches are warm, no query plans are cached yet).

Is there any way to get rid of these latency spikes when a low-load service gets its connections rotated because of maxLifetime? It's messing with our SLAs, and it's quite unfortunate that a low-load service is affected more than higher-load services.

The high-load service experiences the same spikes when it runs in a low-load environment, which makes it more likely that the issue is tied to connection re-creation under low load.

budzow commented 9 months ago

@roookeee very interesting write-up. Did you think about setting an infinite lifetime (maxLifetime=0) to further confirm that the overhead comes from the provisioning of connections?

It is not uncommon to use a fixed-size pool (minimumIdle=maximumPoolSize in the case of Hikari). I am wondering whether an infinite lifetime would cause any inconvenience in such a setup, as it should indirectly remove the overhead related to provisioning.

roookeee commented 9 months ago

Did you think about setting an infinite lifetime (maxLifetime=0) to further confirm that the overhead comes from the provisioning of connections?

We set maxLifetime to 6 hours, so the issue appears less often but is still present. An infinite lifetime is sadly not possible because of firewall / provider rules etc.
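One possible reason the spike spans several scrape intervals instead of a single instant: to my knowledge, HikariCP shortens each connection's lifetime by a random amount of up to roughly 2.5% of maxLifetime, so that a pool filled at startup does not expire all at once. Treat that constant and the 10-second threshold below as assumptions to verify against the source of your HikariCP version; `LifetimeJitter` is a made-up name. Under that assumption, the sketch computes how wide the expiry window would be for the two lifetimes mentioned in this thread.

```java
import java.time.Duration;

// Illustrative helper only, not part of HikariCP.
class LifetimeJitter {

    // Assumption: the random reduction is at most maxLifetime / 40 (2.5%)
    // and is only applied to lifetimes longer than 10 seconds.
    static Duration maxJitter(Duration maxLifetime) {
        long ms = maxLifetime.toMillis();
        return Duration.ofMillis(ms > 10_000 ? ms / 40 : 0);
    }

    public static void main(String[] args) {
        System.out.println(maxJitter(Duration.ofMinutes(30))); // prints PT45S
        System.out.println(maxJitter(Duration.ofHours(6)));    // prints PT9M
    }
}
```

If the assumption holds, all 15 connections of a pod would be rotated within about 45 seconds at a 30-minute lifetime, and within about 9 minutes at 6 hours.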

It is not uncommon to use a fixed-size pool (minimumIdle=maximumPoolSize in the case of Hikari)

I can't quite follow here. We are already using a standard minimumIdle=maximumPoolSize setup.

EDIT: We can live with these spikes every 6h, but I am still interested in why this is happening. How does Hikari re-create connections? I would hope it creates a new connection first and only destroys the expired connection after the new one has been added.
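The ordering hoped for above (create the replacement first, retire the expired connection afterwards) can be sketched with a toy pool. This is not HikariCP's implementation and says nothing about what HikariCP actually does; `ToyPool`, `rotate`, and `lowWaterMark` are hypothetical names used only to show the ordering.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.Supplier;

// Toy model only: NOT HikariCP's implementation.
class ToyPool<C> {
    private final Deque<C> idle = new ArrayDeque<>();
    private int lowWaterMark; // smallest pool size seen after any mutation

    ToyPool(int size, Supplier<C> factory) {
        for (int i = 0; i < size; i++) {
            idle.add(factory.get());
        }
        lowWaterMark = idle.size();
    }

    // Create the replacement first, then retire the expired connection,
    // so the number of usable connections never drops below the target.
    void rotate(C expired, Supplier<C> factory) {
        C replacement = factory.get(); // slow part, done while `expired` is still usable
        synchronized (this) {
            idle.addLast(replacement);
            lowWaterMark = Math.min(lowWaterMark, idle.size());
            idle.remove(expired);
            lowWaterMark = Math.min(lowWaterMark, idle.size());
        }
    }

    synchronized int size() { return idle.size(); }

    synchronized boolean contains(C c) { return idle.contains(c); }

    synchronized int lowWaterMark() { return lowWaterMark; }
}
```

With this ordering the pool briefly holds one connection more than its target size, so the database must tolerate that extra connection during rotation.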

bantify commented 1 month ago

Facing the same issue here. What is the solution? Please share.

svendiedrichsen commented 1 month ago

An improvement would be https://github.com/brettwooldridge/HikariCP/pull/2035

brettwooldridge / HikariCP

Re-creating connections after maxLifetime creates large latency issues #2099