Failover is delayed by waiting for a topology with more than one instance

danielbaniel commented 3 weeks ago

Describe the bug

The bug lies in the code here: https://github.com/aws/aws-advanced-jdbc-wrapper/blob/d9a563b9613d2c7075c6a1ff4d5b16af0f615324/wrapper/src/main/java/software/amazon/jdbc/plugin/failover/ClusterAwareWriterFailoverHandler.java#L408

I don't know the history of this check, but it's problematic in a few situations.

Take a two-instance cluster with instance Foo and instance Bar. Let's say Foo is the writer. Foo crashes and Bar gets promoted to writer. When Bar becomes available, the driver gets stuck in this loop until Foo comes back up as a reader and brings the topology size to two (which may not happen in any bounded time, depending on other problems). However, as soon as the driver is connected to Bar it has a writer connection and can complete the failover, so all the additional downtime is unnecessary.

Expected Behavior

I expect the driver to restore availability to clients looking for a writer as soon as it connects to a new writer, regardless of how many readers are in the rest of the topology or how healthy they are.

What plugins are used? What other connection properties were set?

aurora-mysql

Current Behavior

When connected to a two-instance Aurora MySQL cluster, calling the failover-db-cluster API triggers a driver failover that won't complete until both instances have restarted (the reader gets promoted and restarts as a writer, and the old writer restarts as a reader). It should complete as soon as the new writer is up.

Reproduction Steps

Create a two-instance MySQL cluster. Connect and send queries with the driver. Trigger failover with the API. Wait for the FailoverSuccessSQLException. Note that it arrives later than the time at which the new writer comes up. You can get that time from the DB CloudWatch logs, for example.

Possible Solution

No response

Additional Information/Context

No response

The AWS Advanced JDBC Driver version used

latest

JDK version used

11

Operating System and version

osx

ucjonathan commented 2 weeks ago

@danielbaniel I don't use MySQL, but since you pointed out the exact line of problematic code, I believe that statement should be changed to:

if (topology.size() == 1 && getWriter(topology) == null) {

If we have a topology of size 1 and there is no writer, then log that message; otherwise, connect to that writer.
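To make the difference concrete, here is a minimal, self-contained sketch comparing the two conditions. It does not reproduce the driver's code: `Host`, `findWriter`, and both `shouldKeepWaiting...` methods are hypothetical stand-ins, and the size-only variant is only an assumption inferred from the behavior described in this issue.

```java
import java.util.List;
import java.util.Optional;

public class WriterFailoverSketch {

    // Hypothetical, simplified stand-in for a host entry in the cluster topology.
    public static final class Host {
        public final String name;
        public final boolean isWriter;

        public Host(String name, boolean isWriter) {
            this.name = name;
            this.isWriter = isWriter;
        }
    }

    // Hypothetical helper: returns the first host marked as writer, if any.
    public static Optional<Host> findWriter(List<Host> topology) {
        return topology.stream().filter(h -> h.isWriter).findFirst();
    }

    // Assumed current behavior: keep waiting whenever only one host is visible.
    public static boolean shouldKeepWaitingSizeOnly(List<Host> topology) {
        return topology.size() == 1;
    }

    // Suggested behavior: keep waiting only if the single visible host is not a writer.
    public static boolean shouldKeepWaitingWriterAware(List<Host> topology) {
        return topology.size() == 1 && findWriter(topology).isEmpty();
    }

    public static void main(String[] args) {
        // Foo has crashed; only Bar, the newly promoted writer, is in the topology.
        List<Host> onlyNewWriter = List.of(new Host("Bar", true));

        System.out.println("size-only check keeps waiting: "
            + shouldKeepWaitingSizeOnly(onlyNewWriter));
        System.out.println("writer-aware check keeps waiting: "
            + shouldKeepWaitingWriterAware(onlyNewWriter));
    }
}
```

With only Bar in the topology, the size-only condition keeps the loop waiting, while the writer-aware condition lets the failover complete on Bar right away.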

danielbaniel commented 2 weeks ago

Hey @ucjonathan, this issue isn't MySQL-specific and applies to pg too. I filled in the issue incorrectly: I only specified the aurora-mysql plugin in the description, but it affects both.

In either case, however, your suggested fix seems appropriate. As soon as the driver is connected to a writer it should go ahead and serve requests; there is no reason to wait for other instances.

I expect it will apply to MAZ clusters too, not just Aurora. Whatever the context, as soon as you have a writer there's no need to wait for another instance to be up if you're looking for a writer endpoint.
