eclipse-vertx / vert.x

Vert.x is a tool-kit for building reactive applications on the JVM
http://vertx.io

Previously clustered instances incorrectly cached #5202

Closed aksabg closed 5 months ago

aksabg commented 6 months ago

Questions

It seems that Vert.x caches addresses of previously shut-down nodes when using the clustered event bus. Some event bus send calls fail with a timeout exception because no listener exists on the address anymore.
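A minimal sketch of the failure mode we observe (the address name is taken from our logs; the class name is hypothetical). Note that on a non-clustered Vert.x instance a request to an address without a consumer fails fast with NO_HANDLERS; in our clustered setup with a stale subscription entry, the same call instead hangs until the send timeout expires:

```java
import io.vertx.core.Vertx;
import io.vertx.core.eventbus.DeliveryOptions;
import io.vertx.core.eventbus.ReplyException;

public class TimeoutSketch {
  public static void main(String[] args) {
    Vertx vertx = Vertx.vertx();
    vertx.eventBus()
      .request("status/check/NotifyService", "ping",
               new DeliveryOptions().setSendTimeout(30_000))
      .onSuccess(reply -> System.out.println("Reply: " + reply.body()))
      .onFailure(err -> {
        // In the clustered case with a stale subscription, this is where
        // the "Timed out after waiting 30000(ms) for a reply" surfaces.
        if (err instanceof ReplyException) {
          System.out.println("Request failed: " + err.getMessage());
        }
        vertx.close();
      });
  }
}
```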

Version

Vert.x 4.5.5, hazelcast-kubernetes 3.2.3

Context

We are using clustered Vert.x with Hazelcast Kubernetes discovery. We have multiple Kubernetes pods, each containing one Verticle.
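For reference, our bootstrap looks roughly like this sketch (class and verticle names are illustrative; the Hazelcast Kubernetes discovery itself is configured in the Hazelcast XML config, not shown here):

```java
import io.vertx.core.Vertx;
import io.vertx.core.VertxOptions;
import io.vertx.spi.cluster.hazelcast.HazelcastClusterManager;

public class ClusteredMain {
  public static void main(String[] args) {
    VertxOptions options = new VertxOptions()
        .setClusterManager(new HazelcastClusterManager());
    // Each pod starts one clustered Vert.x instance and deploys its verticle.
    Vertx.clusteredVertx(options)
      .onSuccess(vertx -> vertx.deployVerticle(new MyVerticle()))
      .onFailure(Throwable::printStackTrace);
  }
}
```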

Periodically we update the underlying virtual machines: we start a new virtual machine, shut down one pod (of multiple replicas) on the old machine, start it on the new one, repeat until all pods are migrated, and then delete the old machine.

It appears that somewhere in this process a split brain occurred. Hazelcast was able to recover, but Vert.x apparently was not: it keeps trying to send event bus messages to addresses that no longer exist in the cluster. The only way we can solve this problem is to shut down all Verticles and start them again.

Potentially relevant log statements


{
  "time": "2024-05-12T10:13:04.811512654Z",
  "level": "ERROR",
  "class": "com.myorg.MyClass",
  "message": "Received unknown error code. Error code received is -1, message received is Timed out after waiting 30000(ms) for a reply. address: __vertx.reply.d9179ccb-e25f-4e4e-9189-ff04368e4abb, repliedAddress: status/check/NotifyService."
}

{
  "time": "2024-05-12T10:06:34.812700633Z",
  "level": "WARN",
  "class": "io.vertx.core.eventbus.impl.clustered.ConnectionHolder",
  "requestId": "req-PvHlXydYJAhS7Run6wd4",
  "message": "Connecting to server d9f36bb4-029f-44d6-8f7c-304be8285b22 failed",
  "stacktrace": "io.vertx.core.impl.NoStackTraceThrowable: Not a member of the cluster\n"
}

{
  "time": "2024-05-12T10:14:34.810948660Z",
  "level": "WARN",
  "class": "io.vertx.core.eventbus.impl.clustered.ConnectionHolder",
  "requestId": "req-PvHlXydYJAhS7Run6wd4",
  "message": "Connecting to server 43e35345-2b5a-4b89-bb59-f5bf67f01780 failed",
  "stacktrace": "io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: /172.21.128.0:36519\nCaused by: java.net.ConnectException: Connection refused\n\tat java.base/sun.nio.ch.Net.pollConnect(Native Method)\n\tat java.base/sun.nio.ch.Net.pollConnectNow(Unknown Source)\n\tat java.base/sun.nio.ch.SocketChannelImpl.finishConnect(Unknown Source)\n\tat io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:337)\n\tat io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:335)\n\tat io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:776)\n\tat io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:724)\n\tat io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:650)\n\tat io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562)\n\tat io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)\n\tat io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)\n\tat io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)\n\tat java.base/java.lang.Thread.run(Unknown Source)\n"
}

We cannot reproduce the issue consistently; it only happens sometimes.

Any ideas on what might be going on?

tsegismont commented 5 months ago

Hi @aksabg

This is the GitHub repository of the Vert.x core library; please send future reports to vertx-hazelcast.

In the event of a split-brain, it is possible that subscriptions become inconsistent.

Please check out these recommendations: https://vertx.io/docs/vertx-hazelcast/java/#_recommendations

In summary, make sure you shut down nodes gracefully and one after the other, and also add new nodes gradually.
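The "one node at a time" part of that recommendation could be sketched as follows (a hypothetical helper, not Vert.x API): close each node and wait for the close to complete, so the cluster manager can remove its subscriptions, before touching the next node.

```java
import io.vertx.core.Future;
import io.vertx.core.Vertx;
import java.util.List;

public class RollingShutdown {
  // Chains node.close() calls so each shutdown finishes before the next starts.
  static Future<Void> closeSequentially(List<Vertx> nodes) {
    Future<Void> chain = Future.succeededFuture();
    for (Vertx node : nodes) {
      chain = chain.compose(v -> node.close());
    }
    return chain;
  }
}
```

In a Kubernetes deployment this corresponds to calling `vertx.close()` from the pod's shutdown hook and letting each replica terminate fully before rolling the next one.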