EventBus breaks on bad network connection

ruslansennov commented 2 years ago

Questions

EventBus breaks on a bad network connection. I have observed this when using Hazelcast or Infinispan.

What happens in the case of Infinispan:

Imagine three nodes "A", "B", and "C", and suppose the network on node "A" fails at some point.
After 20 seconds, the ConnectionHolder on node "A" removes nodes "B" and "C" from "_vertx.subs", and the ConnectionHolder on nodes "B" and "C" removes node "A".
When the network comes back, all nodes rejoin each other, and the "_vertx.subs" caches are merged and contain only "B" and "C".
Consumers on node "A" stop receiving messages from "B" and "C".

I see the HazelcastClusterManager.republishOwnSubs() method, but it is only called when members are removed from the cluster. I don't see anything similar for Infinispan.
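
To make the sequence above concrete, here is a toy model in plain Java. It is only a sketch under stated assumptions: the class and its methods are hypothetical, the "majority view wins" merge policy is an assumption about how the caches end up containing only "B" and "C", and the `republishOwnSubs` function here merely borrows the name of the Hazelcast method and is not its implementation. No Vert.x or cluster manager code is involved.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy model of a subscription registry (address -> subscribed node IDs).
// Hypothetical names throughout; this is not Vert.x or cluster manager code.
final class SubsMergeModel {

    // Deep-copies a view so each simulated node mutates only its own state.
    static Map<String, Set<String>> copy(Map<String, Set<String>> view) {
        Map<String, Set<String>> out = new HashMap<>();
        view.forEach((address, nodes) -> out.put(address, new TreeSet<>(nodes)));
        return out;
    }

    // A node drops every peer it considers dead from its view of the registry.
    static Map<String, Set<String>> dropNodes(Map<String, Set<String>> view, Set<String> dead) {
        Map<String, Set<String>> out = copy(view);
        out.values().forEach(nodes -> nodes.removeAll(dead));
        return out;
    }

    // ASSUMPTION: on merge, the majority partition's view simply wins.
    static Map<String, Set<String>> mergeMajorityWins(Map<String, Set<String>> majority,
                                                      Map<String, Set<String>> minority) {
        return copy(majority);
    }

    // Hypothetical repair step: after the merge, a node re-registers its local consumers.
    static Map<String, Set<String>> republishOwnSubs(Map<String, Set<String>> merged,
                                                     String node, Set<String> localAddresses) {
        Map<String, Set<String>> out = copy(merged);
        for (String address : localAddresses) {
            out.computeIfAbsent(address, k -> new TreeSet<>()).add(node);
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> initial = new HashMap<>();
        initial.put("news", new TreeSet<>(Set.of("A", "B", "C")));

        // Network on "A" fails: each side removes the peers it can no longer reach.
        Map<String, Set<String>> viewOnA = dropNodes(initial, Set.of("B", "C"));
        Map<String, Set<String>> viewOnBC = dropNodes(initial, Set.of("A"));

        // Network returns: the views are merged and "A" is missing from the result.
        Map<String, Set<String>> merged = mergeMajorityWins(viewOnBC, viewOnA);
        System.out.println("after merge:     " + merged);

        // If "A" republished its own subscriptions after the merge, it would reappear.
        Map<String, Set<String>> repaired = republishOwnSubs(merged, "A", Set.of("news"));
        System.out.println("after republish: " + repaired);
    }
}
```

In this model the merged registry is `{news=[B, C]}`, so nothing is routed to "A" any more, and it becomes `{news=[A, B, C]}` only once "A" republishes. That is why a republish step that fires only on member removal, and not after a merge, would leave "A" silent.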

Version

4.3, 4.2

bkoripalli commented 1 year ago

I am also facing a similar issue. Can we expect a fix in 4.3.5? My understanding is that the cause may be the following: we previously passed --cluster-port and --cluster-host as command-line parameters, but according to the 3.x to 4.x migration guide those options have been removed. I am now trying to set them programmatically in a subclass of RunCommand, but the eventBusOptions JSON still shows the port as 0.

Here is the code snippet setting up clusterHost and clusterPort.


    protected Vertx startVertx() {
        EventBusOptions eventBusOptions = super.getEventBusOptions();
        // Read the cluster host and port from system properties.
        String clusterHost = System.getProperty("vertx.options.clusterHost");
        String clusterPort = System.getProperty("vertx.options.clusterPort");
        // Use the same host and port for the bind address and the public address.
        eventBusOptions.setPort(Integer.parseInt(clusterPort))
                       .setClusterPublicPort(Integer.parseInt(clusterPort))
                       .setHost(clusterHost)
                       .setClusterPublicHost(clusterHost);
        vertx = super.startVertx();
        return vertx;
    }

Could you please help me set up the eventBusOptions port and host?
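
One thing that can be checked independently of Vert.x is whether the two system properties are actually present and well-formed, since `Integer.parseInt` throws a `NumberFormatException` when the property is missing. The following is a minimal stdlib-only sketch; the class and method names are hypothetical, only the property names come from the snippet above, and it does not address how RunCommand consumes the options.

```java
import java.util.Optional;
import java.util.Properties;

// Hypothetical helper: reads the two properties used in the snippet above
// and returns an empty Optional instead of throwing when a value is unusable.
final class ClusterAddressProps {

    static final String HOST_KEY = "vertx.options.clusterHost";
    static final String PORT_KEY = "vertx.options.clusterPort";

    // Returns the trimmed host, or empty if the property is missing or blank.
    static Optional<String> host(Properties props) {
        String raw = props.getProperty(HOST_KEY);
        if (raw == null || raw.isBlank()) {
            return Optional.empty();
        }
        return Optional.of(raw.trim());
    }

    // Returns the port if it parses as an integer in 1..65535, otherwise empty.
    // Port 0 is rejected here because it asks the OS to pick an arbitrary port.
    static Optional<Integer> port(Properties props) {
        String raw = props.getProperty(PORT_KEY);
        if (raw == null || raw.isBlank()) {
            return Optional.empty();
        }
        try {
            int port = Integer.parseInt(raw.trim());
            return (port >= 1 && port <= 65535) ? Optional.of(port) : Optional.empty();
        } catch (NumberFormatException e) {
            return Optional.empty();
        }
    }
}
```

Calling `ClusterAddressProps.port(System.getProperties())` at startup and logging when the result is empty would at least distinguish "the property never reached the JVM" from "the value was set but not applied".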

tsegismont commented 1 year ago

It's possible

On Thu, Oct 27, 2022 at 09:05, bkoripalli @.***> wrote:

I am also facing similar issue. Can we expect this in 4.3.5?

bkoripalli commented 1 year ago

@tsegismont my apps run in Kubernetes, and I use the Hazelcast Kubernetes plugin for DNS lookup. When all pods first come up, event bus messages work fine. After I delete a pod manually or perform a rolling update, event bus messages fail for verticle-to-verticle communication involving that service (the new pod). For example, communication from verticle A to verticle B sometimes works and sometimes fails with a timeout error. When it fails, I get a ConnectionHolder WARN message about trying to connect to the deleted pod's IP. Could you please help me with this?

c.h.t.TransactionManagerService - [10.200.15.245]:5701 [nimbus-v3] [4.2.4] Committing/rolling-back live transactions of [10.200.79.84]:5701, UUID: c43ee6d9-c3ba-4ed0-b18f-13d1e7caa078
jvm 1    | 2022-11-04 18:50:31.331+0000 [] [Thread-27] WARN  c.h.s.i.o.OperationService - [10.200.15.245]:5701 [nimbus-v3] [4.2.4] Not Member! target: [10.200.79.84]:5701, partitionId: -1, operation: com.hazelcast.spi.impl.operationservice.impl.operations.PartitionIteratingOperation, service: hz:impl:multiMapService
jvm 1    | 2022-11-04 18:50:31.453+0000 [] [hz.keen_elgamal.priority-generic-operation.thread-0] WARN  c.h.i.p.InternalPartitionService - [10.200.15.245]:5701 [nimbus-v3] [4.2.4] Following unknown addresses are found in partition table sent from master[[10.200.79.83]:5701]. (Probably they have recently joined or left the cluster.) {
jvm 1    |      [10.200.79.84]:5701 - c43ee6d9-c3ba-4ed0-b18f-13d1e7caa078
jvm 1    | }
jvm 1    | 2022-11-04 18:50:32.292+0000 [] [hz.keen_elgamal.cached.thread-6] INFO  c.h.i.server.tcp.TcpServerConnector - [10.200.15.245]:5701 [nimbus-v3] [4.2.4] Could not connect to: /10.200.79.84:5701. Reason: IOException[No route to host to address /10.200.79.84:5701]
jvm 1    | 2022-11-04 18:52:20.997+0000 [] [vert.x-eventloop-thread-1] WARN  i.v.c.e.i.clustered.ConnectionHolder - Connecting to server feef0b88-fb61-4e27-b101-98b450e404d0 failed io.netty.channel.AbstractChannel$AnnotatedNoRouteToHostException: No route to host: /10.200.79.84:5151 -> Caused by: java.net.NoRouteToHostException: No route to host ->        at java.base/sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) ->       at java.base/sun.nio.ch.SocketChannelImpl.finishConnect(Unknown Source) ->      at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:337) ->  at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:334) ->     at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:710) ->
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:658) ->     at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:584) ->     at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:496) ->      at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997) ->     at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ->        at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ->    at java.base/java.lang.Thread.run(Unknown Source)

Vert.x 4.3.2, Hazelcast 4.2.4, hazelcast-kubernetes 2.2.3

romain-marie commented 1 year ago

Hello, I'm also affected. The cluster is broken (bus messages not received, records lost, ...) after a node has encountered a network failure or the server has been under heavy load for a few seconds (tested with Vert.x 4.3.7 and Infinispan). After some new tests, the problem does not seem to occur with Hazelcast for me.

eclipse-vertx / vert.x

EventBus breaks on bad network connection #4394