eclipse-vertx / vert.x

Vert.x is a tool-kit for building reactive applications on the JVM
http://vertx.io

HA verticle gets deployed twice after split-brain partition #5235

Open apanasevich opened 3 weeks ago

apanasevich commented 3 weeks ago

Version

4.5.7

Context

I ran a three-node Vert.x cluster with the HA feature enabled and the quorum size set to 2.

    new VertxOptions()
        .setQuorumSize(2)
        .setHAEnabled(true);

Only one node (say Node1) deployed a single verticle (say SingletonVerticle) programmatically, with the HA option set to true.

            new DeploymentOptions()
                .setInstances(1)
                .setHa(true)

If Node1 leaves the cluster because of a split-brain network partition, then SingletonVerticle gets redeployed on one of the other two nodes and undeployed on Node1, as expected.

But when the network fault is over and Node1 rejoins the cluster, HAManager redeploys SingletonVerticle on Node1 too. This results in two instances of SingletonVerticle in the cluster instead of the single instance declared in DeploymentOptions.

As far as I can see, the problem is that HAManager keeps a local Queue<Runnable> toDeployOnQuorum of verticles waiting to be deployed, along with a Map<String, String> clusterMap of deployments, and it doesn't check whether verticles from the queue are already deployed on other nodes.
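To illustrate the missing check, here is a minimal sketch in plain Java. It is not the actual HAManager code: the class QuorumRedeploySketch, its methods, and the simplified node-to-verticles map are all hypothetical stand-ins (the real clusterMap is a Map<String, String>). It also assumes that a verticle can be identified by name alone, which may be too naive for real deployments.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

// Simplified model for illustration only; this is NOT the real HAManager.
// Every class, field, and method name here is hypothetical.
final class QuorumRedeploySketch {

    // A deployment parked until quorum is regained.
    private static final class Pending {
        final String verticleName;
        final Runnable deploy;

        Pending(String verticleName, Runnable deploy) {
            this.verticleName = verticleName;
            this.deploy = deploy;
        }
    }

    private final String localNodeId;
    // Stand-in for the cluster-wide map: node id -> verticles deployed on that node.
    private final Map<String, Set<String>> deploymentsByNode = new HashMap<>();
    // Stand-in for the local queue of deployments waiting for quorum.
    private final Queue<Pending> toDeployOnQuorum = new ArrayDeque<>();

    QuorumRedeploySketch(String localNodeId) {
        this.localNodeId = localNodeId;
    }

    // Records that a node (local or remote) currently runs the given verticle.
    void recordDeployment(String nodeId, String verticleName) {
        deploymentsByNode.computeIfAbsent(nodeId, k -> new HashSet<>()).add(verticleName);
    }

    // Parks a deployment until quorum is available again.
    void enqueue(String verticleName, Runnable deploy) {
        toDeployOnQuorum.add(new Pending(verticleName, deploy));
    }

    // True if any node other than the local one already runs the verticle.
    boolean isDeployedElsewhere(String verticleName) {
        for (Map.Entry<String, Set<String>> e : deploymentsByNode.entrySet()) {
            if (!e.getKey().equals(localNodeId) && e.getValue().contains(verticleName)) {
                return true;
            }
        }
        return false;
    }

    // Drains the queue once quorum is regained, skipping verticles that
    // failed over to another node while this node was partitioned.
    void onQuorumAttained() {
        Pending p;
        while ((p = toDeployOnQuorum.poll()) != null) {
            if (isDeployedElsewhere(p.verticleName)) {
                continue; // already running on another node: do not deploy a second copy
            }
            p.deploy.run();
            recordDeployment(localNodeId, p.verticleName);
        }
    }

    public static void main(String[] args) {
        QuorumRedeploySketch node1 = new QuorumRedeploySketch("Node1");
        node1.enqueue("SingletonVerticle",
            () -> System.out.println("deploying SingletonVerticle on Node1"));
        // While Node1 was partitioned, Node2 took over the verticle.
        node1.recordDeployment("Node2", "SingletonVerticle");
        node1.onQuorumAttained(); // prints nothing: the queued deployment is skipped
    }
}
```

The point of the sketch is only the isDeployedElsewhere guard in onQuorumAttained; how the real HAManager would obtain that information from the cluster is left open.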

tsegismont commented 3 weeks ago

Thanks for reporting this. Would you like to contribute a fix? I can help you get started.

apanasevich commented 2 weeks ago

I'll think about how to fix the issue. What interests me more than the fix is how to test a split-brain case. It looks like it would require adding @TestOnly methods like simulateKill, which you've already made.

tsegismont commented 2 weeks ago

There's a limitation in the test suite as it is right now: it uses embedded instances of Vert.x or Hazelcast, making it difficult to test the split-brain scenario (we'd need some sort of proxy between nodes that we can use to simulate network partitions).
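As a rough idea of what such a proxy could look like, here is a minimal sketch of a toggleable TCP forwarder in plain Java. It is not part of the Vert.x test suite; the class PartitionProxy and its methods are hypothetical, and it assumes the cluster members could be configured to reach each other through the proxy port rather than directly.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper, not an existing Vert.x or Hazelcast class.
// Forwards TCP traffic to a target and can cut the link on demand.
final class PartitionProxy implements AutoCloseable {
    private final ServerSocket listener;
    private final String targetHost;
    private final int targetPort;
    private final Set<Socket> open = ConcurrentHashMap.newKeySet();
    private volatile boolean partitioned;

    PartitionProxy(String targetHost, int targetPort) throws IOException {
        this.targetHost = targetHost;
        this.targetPort = targetPort;
        // Port 0 lets the OS pick a free port on the loopback interface.
        this.listener = new ServerSocket(0, 50, InetAddress.getLoopbackAddress());
        Thread t = new Thread(this::acceptLoop, "partition-proxy-accept");
        t.setDaemon(true);
        t.start();
    }

    int port() {
        return listener.getLocalPort();
    }

    // true: drop existing connections and refuse new ones; false: forward again.
    void setPartitioned(boolean value) {
        partitioned = value;
        if (value) {
            open.forEach(PartitionProxy::closeQuietly);
        }
    }

    private void acceptLoop() {
        while (!listener.isClosed()) {
            Socket client = null;
            try {
                client = listener.accept();
                if (partitioned) {
                    closeQuietly(client); // simulate the broken link
                    continue;
                }
                Socket upstream = new Socket(targetHost, targetPort);
                open.add(client);
                open.add(upstream);
                pump(client, upstream);
                pump(upstream, client);
                if (partitioned) { // partition started while connecting
                    closeQuietly(client);
                    closeQuietly(upstream);
                }
            } catch (IOException e) {
                closeQuietly(client); // listener closed or target unreachable
            }
        }
    }

    // Copies bytes one way on a daemon thread until either side closes.
    private void pump(Socket from, Socket to) {
        Thread t = new Thread(() -> {
            try {
                InputStream in = from.getInputStream();
                OutputStream out = to.getOutputStream();
                byte[] buf = new byte[8192];
                int n;
                while ((n = in.read(buf)) != -1) {
                    out.write(buf, 0, n);
                    out.flush();
                }
            } catch (IOException ignored) {
                // connection dropped, possibly by setPartitioned(true)
            } finally {
                closeQuietly(from);
                closeQuietly(to);
                open.remove(from);
                open.remove(to);
            }
        }, "partition-proxy-pump");
        t.setDaemon(true);
        t.start();
    }

    private static void closeQuietly(Socket s) {
        if (s == null) return;
        try { s.close(); } catch (IOException ignored) { }
    }

    @Override
    public void close() {
        try { listener.close(); } catch (IOException ignored) { }
        open.forEach(PartitionProxy::closeQuietly);
    }
}
```

A test could then call setPartitioned(true) to isolate one node, wait for failover, and call setPartitioned(false) to let the node rejoin. Whether embedded Hazelcast members can be wired through such a proxy is an open question in this sketch.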