Mass extinction of connections impacting AWS Aurora reader endpoint load balancing

gedl commented 6 years ago

Environment

HikariCP version: 3.2.0
JDK version     : 1.8.0_162
Database        : Aurora MySQL 5.6.10a
Driver version  : mysql-connector-java 8.0.12

Extra info: connection pool size: 10, max idle unset.

Having set maxLifetime to 30m (the default, 1800000 ms), I would expect the behaviour described in #480 to cause connections to be recycled "out of phase", avoiding mass extinction of the pool. What I am observing (check here) is that the AWS Aurora reader endpoint is dispatching lots of connections to the same read replica (note that on the X-axis the interval between the upward and downward jumps of the Y-axis is "exactly" 30 minutes). This graph represents the >40th generation of connections. The result is that Aurora, arguably because it somehow caches the number of connections on each replica, assigns a whole pool to one replica, eventually leaving one replica (in our case we have 3) with nearly all the connections of all application servers, and the other 2 replicas almost idle for a generation's lifetime.

I would expect the changes in #480 to gradually scatter the recycling of the pool, up to a maximum of 18s variance after some generations. Admittedly the X-axis of the graph is not granular enough to tell exactly how far apart the connections are when they reach the Aurora cluster, but they do not seem to be spread in any material way.
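To make the expectation above concrete, here is a rough simulation of how far apart retirements drift when each lifetime is shortened by a small random amount. It is only a sketch under stated assumptions: the class `LifetimeJitterSim` and its method are hypothetical, the jitter is assumed to be uniform, and the 18 s cap is simply the figure quoted above rather than a value taken from HikariCP's source.

```java
import java.util.Arrays;
import java.util.Random;

/** Hypothetical simulation for illustration; this is not HikariCP code. */
final class LifetimeJitterSim {

    /**
     * Returns the gap (ms) between the earliest and latest retirement in a pool
     * after the given number of generations. Assumes every connection starts at
     * t = 0 and each lifetime is maxLifetimeMs minus a uniform random jitter in
     * [0, maxJitterMs).
     */
    static long spreadAfter(int poolSize, int generations, long maxLifetimeMs,
                            long maxJitterMs, Random rnd) {
        long[] retireAt = new long[poolSize];
        for (int g = 0; g < generations; g++) {
            for (int i = 0; i < poolSize; i++) {
                long jitter = (long) (rnd.nextDouble() * maxJitterMs);
                retireAt[i] += maxLifetimeMs - jitter;
            }
        }
        long latest = Arrays.stream(retireAt).max().getAsLong();
        long earliest = Arrays.stream(retireAt).min().getAsLong();
        return latest - earliest;
    }

    public static void main(String[] args) {
        Random rnd = new Random(42);
        long maxLifetimeMs = 30 * 60 * 1000L; // 30 minutes, as in this report
        long maxJitterMs = 18_000L;           // assumed cap: the 18 s quoted above
        for (int g : new int[] {1, 10, 40}) {
            long spread = spreadAfter(10, g, maxLifetimeMs, maxJitterMs, rnd);
            System.out.printf("generation %d: spread %.1f s%n", g, spread / 1000.0);
        }
    }
}
```

Under this model the spread can never exceed generations × maxJitterMs, and because the jitters accumulate like a random walk it typically grows only with the square root of the number of generations. A pool can therefore still retire within a narrow slice of the 30 minute window after many generations.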

I have 3 questions:
1 - Has anyone observed this phenomenon with this or a similar setup?
2 - What is my best logging option on the Hikari side to observe the application-side lifecycle of each generation of connections?
3 - Is there any way to directly set the amount of variance desired to avoid mass extinction?

Thank you very much.

goughy000 commented 6 years ago

Any reason you are using the mysql connector rather than mariadb? AWS suggests usage of the mariadb connector in their documentation, and the driver also contains extra functionality to handle Aurora clusters more effectively compared with the MySQL connector.

If you make the switch, ensure you activate the functionality in the jdbc url and use the cluster endpoint address: jdbc:mysql:aurora://cluster.cluster-xxxx.eu-west-1.rds.amazonaws.com/db

Important: you must use the *.amazonaws.com address; if you wrap it in a custom CNAME, it won't work effectively in the mariadb driver. See here

More info: https://mariadb.com/kb/en/library/failover-and-high-availability-with-mariadb-connector-j/#specifics-for-amazon-aurora

brettwooldridge commented 6 years ago

@gedl have you tried the above suggestion? Any update on this issue?

gedl commented 6 years ago

Hey, sorry for the delay.

Inspired by the MariaDB driver, we ended up implementing a generic JDBC driver that works on top of any Aurora cluster, fully supporting mysql and psql, and presumably future Aurora flavours of other JDBC-accessible RDBMSs.

We've also made it open source: https://github.com/DiceTechnology/dice-fairlink

It works via AWS Aurora SDK and therefore does not rely on amazonaws.com sub-domains.

Should I close this case, or do you want to pursue my point nr 3 ?

brettwooldridge commented 6 years ago

First of all, awesome. Just awesome. 👏

I love to see open source contributions like this.

Let’s leave this open for the time being. If retirements from the pool are not well distributed enough, I think we need a better algorithm. A deterministic one would also be better than our relying on a pseudo random distribution to avoid extinction events.
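As a rough illustration of what a deterministic schedule could look like, the sketch below spaces retirements evenly by slot index instead of relying on random variance. Everything in it is hypothetical (the class `DeterministicStagger`, the `windowMs` parameter, the idea of shortening only the first lifetime); it is not HikariCP's algorithm, just one possible shape for the idea.

```java
/** Hypothetical sketch of a deterministic retirement schedule; not HikariCP code. */
final class DeterministicStagger {

    /**
     * Time (ms since pool start) at which the connection in the given slot
     * retires for the given generation (1-based). Only the first lifetime is
     * shortened, by an offset evenly spaced over windowMs; later lifetimes use
     * the full maxLifetimeMs, so the spacing between slots stays constant.
     */
    static long retireAt(int slot, int generation, int poolSize,
                         long maxLifetimeMs, long windowMs) {
        if (poolSize <= 0 || slot < 0 || slot >= poolSize || generation < 1) {
            throw new IllegalArgumentException("slot or generation out of range");
        }
        if (windowMs < 0 || windowMs > maxLifetimeMs) {
            throw new IllegalArgumentException("windowMs must be within [0, maxLifetimeMs]");
        }
        long offset = (slot * windowMs) / poolSize;
        return generation * maxLifetimeMs - offset;
    }

    public static void main(String[] args) {
        long maxLifetimeMs = 30 * 60 * 1000L; // 30 minutes
        long windowMs = maxLifetimeMs;        // spread over the whole lifetime
        for (int slot = 0; slot < 10; slot++) {
            long t = retireAt(slot, 1, 10, maxLifetimeMs, windowMs);
            System.out.printf("slot %d first retires at %d s%n", slot, t / 1000);
        }
    }
}
```

With a pool of 10 and a window equal to the 30 minute lifetime, one connection retires every 3 minutes in every generation, so the whole pool never reconnects at once.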

Again, really impressed and inspired by your team’s initiative in taking the bull by the horns re: Aurora.

gedl commented 5 years ago

Only noticed your response now.

We were surprised by the lack of solutions for what seems to be a common problem with the usage of such a popular combination (HikariCP + RDS/Aurora) and thought this could be useful.

We've taken so much from open source and would hate to see people wasting time with these "details" instead of making their products great, so open sourcing it was the only acceptable thing to do.

It has been working well in production since, even though I'd like to see it spread the connections even better. There are a couple of edge cases related to the arithmetic (connection pool size not divisible by the number of replicas, etc.), but it's much better than before.
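The divisibility edge case is easy to see with a small calculation. The sketch below assumes a plain round-robin assignment of connections to replicas; that is an assumption made for illustration, since this thread does not describe dice-fairlink's internals, and the class `ReplicaSpread` is hypothetical.

```java
import java.util.Arrays;

/** Hypothetical illustration of round-robin arithmetic; not dice-fairlink code. */
final class ReplicaSpread {

    /** Counts how many of poolSize connections land on each replica under round-robin. */
    static int[] connectionsPerReplica(int poolSize, int replicas) {
        if (replicas <= 0 || poolSize < 0) {
            throw new IllegalArgumentException("need replicas > 0 and poolSize >= 0");
        }
        int[] counts = new int[replicas];
        for (int c = 0; c < poolSize; c++) {
            counts[c % replicas]++; // connection c goes to replica c mod replicas
        }
        return counts;
    }

    public static void main(String[] args) {
        // Pool of 10 over 3 replicas, the sizes mentioned in this thread.
        System.out.println(Arrays.toString(connectionsPerReplica(10, 3))); // [4, 3, 3]
    }
}
```

If every application server starts its rotation at the same replica, that replica receives the extra connection from each server, so the imbalance grows with the number of servers.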

gedl commented 5 years ago

Because this thread is still open, I think it's relevant to note that dice-fairlink versions 1.x.x had a scalability problem: they would be rate limited by the RDS API if many client applications were deployed in the same AWS account (they would all hit the RDS API at roughly the same time).

Versions 2.x.x have worked around these undocumented limits imposed by AWS.
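For readers hitting the same "everyone polls at the same moment" problem, one generic mitigation is to add random jitter to each client's polling interval. The sketch below shows only that general technique; it is not a description of how dice-fairlink 2.x works around the limits, and the class `JitteredPolling` and its parameters are hypothetical.

```java
import java.util.Random;

/** Hypothetical sketch of jittered polling; not dice-fairlink code. */
final class JitteredPolling {

    /**
     * Returns a delay in [base - base*fraction, base + base*fraction) so that
     * clients started together drift apart instead of polling in lockstep.
     */
    static long nextDelayMs(long baseIntervalMs, double jitterFraction, Random rnd) {
        if (baseIntervalMs <= 0 || jitterFraction < 0 || jitterFraction > 1) {
            throw new IllegalArgumentException("need base > 0 and 0 <= fraction <= 1");
        }
        long maxJitter = (long) (baseIntervalMs * jitterFraction);
        // rnd.nextDouble() * 2 - 1 is uniform in [-1, 1)
        long offset = (long) ((rnd.nextDouble() * 2 - 1) * maxJitter);
        return baseIntervalMs + offset;
    }

    public static void main(String[] args) {
        Random rnd = new Random();
        for (int i = 0; i < 5; i++) {
            // 30 s base interval with up to 20% jitter (illustrative values)
            System.out.println(nextDelayMs(30_000L, 0.2, rnd) + " ms");
        }
    }
}
```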

brettwooldridge / HikariCP

Mass extinction of connections impacting AWS Aurora reader endpoint load balancing #1247
