Closed aksabg closed 5 months ago
Hi @aksabg
This is the GitHub repository of the Vert.x core library. Please send future reports to the vertx-hazelcast repository.
In the event of a split-brain, it is possible that subscriptions become inconsistent.
Please check out these recommendations: https://vertx.io/docs/vertx-hazelcast/java/#_recommendations
In summary, make sure you shut down nodes gracefully and one after the other. Also, add new nodes gradually.
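To illustrate what "one after the other" means in terms of ordering, here is a minimal sketch using only the JDK. It is not Vert.x or Hazelcast API: the class `RollingShutdown`, the method `stopSequentially`, and the pod names are hypothetical, and the asynchronous stop action is a stand-in for whatever actually stops a node. In a Kubernetes deployment this ordering is normally enforced by the rollout settings rather than by application code.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;

// Hypothetical helper, for illustration only (not part of Vert.x or Hazelcast).
public class RollingShutdown {

  /**
   * Stops the given nodes strictly one after the other: the stop action for
   * the next node is invoked only after the previous node's future completed.
   * If a stop action fails, the chain fails and later nodes are not stopped.
   */
  public static <T> CompletableFuture<Void> stopSequentially(
      List<T> nodes, Function<T, CompletableFuture<Void>> stop) {
    CompletableFuture<Void> chain = CompletableFuture.completedFuture(null);
    for (T node : nodes) {
      chain = chain.thenCompose(ignored -> stop.apply(node));
    }
    return chain;
  }

  public static void main(String[] args) {
    List<String> stopped = Collections.synchronizedList(new ArrayList<>());

    // Stand-in stop action: records the node name asynchronously.
    stopSequentially(
        List.of("pod-a", "pod-b", "pod-c"),
        name -> CompletableFuture.runAsync(() -> stopped.add(name)))
      .join();

    System.out.println(stopped); // prints [pod-a, pod-b, pod-c]
  }
}
```

The point of chaining with `thenCompose` rather than starting all stop actions at once is that the cluster only ever has to absorb one departure at a time, which is the intent of the recommendation above.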
Questions
It seems that Vert.x caches non-existent addresses (previously shut down) when using the clustered event bus. Some event bus send calls fail with a timeout exception because the listener on the address no longer exists.
Version
Vert.x 4.5.5, hazelcast-kubernetes 3.2.3
Context
We are using clustered Vert.x with Hazelcast Kubernetes discovery. We have multiple Kubernetes pods, and each contains one verticle.
Periodically we update the underlying virtual machines as follows: we start a new virtual machine, shut down one pod (multiple replicas) on the old machine, start it on the new one, and repeat the process until all pods have been migrated to the new machine. We then delete the old machine.
It appears that a split brain occurred somewhere in this process. Hazelcast seems to have recovered, but Vert.x apparently did not: it looks like Vert.x is trying to send event bus messages to addresses that no longer exist in the cluster. The only way we have been able to resolve the problem is to shut down all verticles and then start them again.
Potentially relevant log statements
We cannot reproduce the issue consistently; it only happens sometimes.
Any ideas on what might be going on?