Open apanasevich opened 3 weeks ago
Thanks for reporting this. Would you like to contribute a fix? I can help you get started.
I'll think about how to fix the issue. What I'm more interested in than the fix is how to test a split-brain case. It looks like that would need @TestOnly methods like simulateKill, which you've already made.
There's a limitation to the test suite as it is right now: it uses embedded instances of Vert.x or Hazelcast, making it difficult to test the split brain scenario (we'd need some sort of proxy between nodes that we can use to simulate network partitions).
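To make the proxy idea concrete, here is a minimal sketch of a toggleable TCP forwarder written with plain JDK sockets. PartitionProxy and its partition()/heal() methods are hypothetical names, not part of Vert.x or Hazelcast, and how cluster members would be pointed at the proxy port instead of at each other is assumed and not shown.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical test helper: forwards TCP traffic to one target node and can
// drop all traffic on demand to imitate a network partition.
class PartitionProxy implements AutoCloseable {
  private final ServerSocket listener;
  private final String targetHost;
  private final int targetPort;
  private final Set<Socket> openSockets = ConcurrentHashMap.newKeySet();
  private volatile boolean partitioned;

  PartitionProxy(String targetHost, int targetPort) throws IOException {
    this.targetHost = targetHost;
    this.targetPort = targetPort;
    // Port 0 lets the OS pick a free port on the loopback interface.
    this.listener = new ServerSocket(0, 50, InetAddress.getLoopbackAddress());
    Thread acceptor = new Thread(this::acceptLoop, "partition-proxy-accept");
    acceptor.setDaemon(true);
    acceptor.start();
  }

  int port() { return listener.getLocalPort(); }

  // Start the partition: cut existing connections and refuse new ones.
  void partition() {
    partitioned = true;
    openSockets.forEach(PartitionProxy::closeQuietly);
    openSockets.clear();
  }

  // End the partition: new connections are forwarded again.
  void heal() { partitioned = false; }

  private void acceptLoop() {
    while (!listener.isClosed()) {
      Socket client = null;
      try {
        client = listener.accept();
        if (partitioned) { closeQuietly(client); continue; }
        Socket upstream = new Socket(targetHost, targetPort);
        openSockets.add(client);
        openSockets.add(upstream);
        if (partitioned) { // partition() raced with this accept
          closeQuietly(client);
          closeQuietly(upstream);
          continue;
        }
        pump(client, upstream);
        pump(upstream, client);
      } catch (IOException e) {
        // Listener closed or target unreachable: drop the client, if any.
        if (client != null) closeQuietly(client);
      }
    }
  }

  // Copies bytes one way until either side closes, then closes both sockets.
  private void pump(Socket from, Socket to) {
    Thread t = new Thread(() -> {
      byte[] buf = new byte[8192];
      try {
        InputStream in = from.getInputStream();
        OutputStream out = to.getOutputStream();
        int n;
        while ((n = in.read(buf)) != -1) { out.write(buf, 0, n); out.flush(); }
      } catch (IOException ignored) {
        // Socket closed by partition() or by a peer.
      } finally {
        closeQuietly(from);
        closeQuietly(to);
        openSockets.remove(from);
        openSockets.remove(to);
      }
    });
    t.setDaemon(true);
    t.start();
  }

  private static void closeQuietly(Socket s) {
    try { s.close(); } catch (IOException ignored) { }
  }

  @Override
  public void close() throws IOException {
    listener.close();
    partition();
  }
}
```

In a test, each node would reach its peers through a proxy port rather than the peer's real port; calling partition() then isolates that peer, and heal() lets it rejoin. The sketch closes both directions as soon as one side closes, so it does not model half-open connections or packet loss.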
Version
4.5.7
Context
I ran a three-node Vert.x cluster with the HA feature enabled and the quorum size set to 2.
Only one node (let's say Node1) deployed a single verticle (say SingletonVerticle) programmatically, with the HA option set to true.
If Node1 leaves the cluster because of a split-brain network partition, then SingletonVerticle gets redeployed on one of the other two nodes and undeployed on Node1, as expected.
But when the network fault is over and Node1 rejoins the cluster, HAManager redeploys SingletonVerticle on Node1 too. This results in two instances of SingletonVerticle in the cluster instead of the count declared in DeploymentOptions.
As far as I can see, the problem is that HAManager keeps a local Queue<Runnable> toDeployOnQuorum of verticles waiting to be deployed, along with a Map<String, String> clusterMap of deployments, and it doesn't check whether the verticles in the queue are already deployed on other nodes.
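The missing check can be illustrated with a simplified, self-contained model. This is not the real HAManager code: HaModel, PendingDeployment and onQuorumAttained are hypothetical names, and the map layout used here (verticle name to the ID of the node running it) is an assumption made for the sketch and may differ from what clusterMap actually stores.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

// Hypothetical wrapper so that a queued deployment remembers which verticle it is for.
class PendingDeployment {
  final String verticleName;
  final Runnable deployAction;

  PendingDeployment(String verticleName, Runnable deployAction) {
    this.verticleName = verticleName;
    this.deployAction = deployAction;
  }
}

// Simplified stand-in for the HA bookkeeping described in this issue.
class HaModel {
  private final String nodeId;
  // Assumed layout: verticle name -> ID of the node currently running it.
  private final Map<String, String> clusterMap;
  // Deployments postponed until this node has quorum again.
  private final Queue<PendingDeployment> toDeployOnQuorum = new ArrayDeque<>();

  HaModel(String nodeId, Map<String, String> clusterMap) {
    this.nodeId = nodeId;
    this.clusterMap = clusterMap;
  }

  void queueForQuorum(String verticleName, Runnable deployAction) {
    toDeployOnQuorum.add(new PendingDeployment(verticleName, deployAction));
  }

  // Drains the queue once quorum is back and returns how many deployments ran.
  int onQuorumAttained() {
    int deployed = 0;
    PendingDeployment pending;
    while ((pending = toDeployOnQuorum.poll()) != null) {
      String owner = clusterMap.get(pending.verticleName);
      if (owner != null && !owner.equals(nodeId)) {
        // Another node took over during the partition: skip the local redeploy.
        continue;
      }
      pending.deployAction.run();
      clusterMap.put(pending.verticleName, nodeId);
      deployed++;
    }
    return deployed;
  }
}
```

With this check, if Node2 registered SingletonVerticle while Node1 was partitioned, Node1 drops its queued deployment when it rejoins instead of starting a second instance. The sketch only handles a single instance per verticle and ignores the instance count in DeploymentOptions.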