Closed: ashprojects closed this issue 4 months ago
For more info:
2024-07-06 12:54:44.573 [Broker-4] [zb-actors-0] [] WARN io.camunda.zeebe.topology.gossip.ClusterTopologyGossiper - Failed to sync with 6
java.util.concurrent.CompletionException: io.atomix.cluster.messaging.MessagingException$RemoteHandlerFailure: Remote handler failed to handle message, cause: Failed to handle message, host camunda-zeebe-4.camunda-zeebe.camunda.svc:26502 is not a known cluster member
    at java.base/java.util.concurrent.CompletableFuture.encodeThrowable(Unknown Source) ~[?:?]
    at java.base/java.util.concurrent.CompletableFuture.completeThrowable(Unknown Source) ~[?:?]
    at java.base/java.util.concurrent.CompletableFuture$UniApply.tryFire(Unknown Source) ~[?:?]
    at java.base/java.util.concurrent.CompletableFuture.postComplete(Unknown Source) ~[?:?]
    at java.base/java.util.concurrent.CompletableFuture.completeExceptionally(Unknown Source) ~[?:?]
    at io.atomix.cluster.messaging.impl.NettyMessagingService.lambda$executeOnPooledConnection$25(NettyMessagingService.java:626) ~[zeebe-atomix-cluster-8.5.2.jar:8.5.2]
    at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:31) ~[guava-33.1.0-jre.jar:?]
    at io.atomix.cluster.messaging.impl.NettyMessagingService.lambda$executeOnPooledConnection$26(NettyMessagingService.java:624) ~[zeebe-atomix-cluster-8.5.2.jar:8.5.2]
    at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(Unknown Source) ~[?:?]
    at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(Unknown Source) ~[?:?]
    at java.base/java.util.concurrent.CompletableFuture.postComplete(Unknown Source) ~[?:?]
    at java.base/java.util.concurrent.CompletableFuture.completeExceptionally(Unknown Source) ~[?:?]
    at io.atomix.cluster.messaging.impl.AbstractClientConnection.dispatch(AbstractClientConnection.java:48) ~[zeebe-atomix-cluster-8.5.2.jar:8.5.2]
    at io.atomix.cluster.messaging.impl.AbstractClientConnection.dispatch(AbstractClientConnection.java:29) ~[zeebe-atomix-cluster-8.5.2.jar:8.5.2]
    at io.atomix.cluster.messaging.impl.NettyMessagingService$MessageDispatcher.channelRead0(NettyMessagingService.java:1109) ~[zeebe-atomix-cluster-8.5.2.jar:8.5.2]
    at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99) ~[netty-transport-4.1.110.Final.jar:4.1.110.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444) ~[netty-transport-4.1.110.Final.jar:4.1.110.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) ~[netty-transport-4.1.110.Final.jar:4.1.110.Final]
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412) ~[netty-transport-4.1.110.Final.jar:4.1.110.Final]
    at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:346) ~[netty-codec-4.1.110.Final.jar:4.1.110.Final]
    at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:318) ~[netty-codec-4.1.110.Final.jar:4.1.110.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444) ~[netty-transport-4.1.110.Final.jar:4.1.110.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) ~[netty-transport-4.1.110.Final.jar:4.1.110.Final]
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412) ~[netty-transport-4.1.110.Final.jar:4.1.110.Final]
    at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1407) ~[netty-transport-4.1.110.Final.jar:4.1.110.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:440) ~[netty-transport-4.1.110.Final.jar:4.1.110.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) ~[netty-transport-4.1.110.Final.jar:4.1.110.Final]
    at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:918) ~[netty-transport-4.1.110.Final.jar:4.1.110.Final]
    at io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:799) ~[netty-transport-classes-epoll-4.1.110.Final.jar:4.1.110.Final]
    at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:501) ~[netty-transport-classes-epoll-4.1.110.Final.jar:4.1.110.Final]
    at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:399) ~[netty-transport-classes-epoll-4.1.110.Final.jar:4.1.110.Final]
    at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:994) ~[netty-common-4.1.110.Final.jar:4.1.110.Final]
    at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[netty-common-4.1.110.Final.jar:4.1.110.Final]
    at java.base/java.lang.Thread.run(Unknown Source) ~[?:?]
Caused by: io.atomix.cluster.messaging.MessagingException$RemoteHandlerFailure: Remote handler failed to handle message, cause: Failed to handle message, host camunda-zeebe-4.camunda-zeebe.camunda.svc:26502 is not a known cluster member
    ... 22 more
[Update] I had to manually analyse the topology and restart the brokers leading the throttling partitions one by one. This is happening twice a day, and in production at that. Something is definitely not right here.
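For anyone needing to do the same, the topology analysis boils down to mapping each partition to its current leader so you know which brokers to restart. A minimal sketch with the Zeebe Java client; the gateway address below is a placeholder for this cluster, not something from the original report:

import io.camunda.zeebe.client.ZeebeClient;
import io.camunda.zeebe.client.api.response.BrokerInfo;
import io.camunda.zeebe.client.api.response.PartitionInfo;
import io.camunda.zeebe.client.api.response.Topology;

public final class TopologyCheck {

  public static void main(final String[] args) throws Exception {
    // Placeholder gateway address; adjust to your own cluster.
    try (final ZeebeClient client =
        ZeebeClient.newClientBuilder()
            .gatewayAddress("camunda-zeebe-gateway.camunda.svc:26500")
            .usePlaintext()
            .build()) {

      final Topology topology = client.newTopologyRequest().send().join();

      // Print the leader and reported health for every partition.
      for (final BrokerInfo broker : topology.getBrokers()) {
        for (final PartitionInfo partition : broker.getPartitions()) {
          if (partition.isLeader()) {
            System.out.printf(
                "partition %d is led by broker %d (%s), health=%s%n",
                partition.getPartitionId(),
                broker.getNodeId(),
                broker.getAddress(),
                partition.getHealth());
          }
        }
      }
    }
  }
}

Running this before and after a restart makes it easy to confirm that leadership actually moved off the throttling broker.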
I noticed a pattern: whenever we run a backup, we end up in this state. 100% backpressure is observed as soon as the backup is scheduled, and the backup itself eventually fails. Could it be that when a backup is taken while the system is under load, the electionTimeout of 2500ms is reached and we end up in this state?
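Since the state seems to coincide with scheduled backups, it may help to pull the backup status directly and compare it against the stuck partitions. A minimal sketch, assuming the backup management endpoint on the gateway's management port (9600 by default) as described in the Zeebe backup docs; the host and backup id are placeholders:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public final class BackupStatusCheck {

  public static void main(final String[] args) throws Exception {
    // Placeholder management host and backup id; 9600 is the default management port.
    final String managementBase = "http://camunda-zeebe-gateway.camunda.svc:9600";
    final String backupId = args.length > 0 ? args[0] : "1";

    final HttpClient http = HttpClient.newHttpClient();
    final HttpRequest request =
        HttpRequest.newBuilder()
            .uri(URI.create(managementBase + "/actuator/backups/" + backupId))
            .GET()
            .build();

    // Print the raw status response for inspection.
    final HttpResponse<String> response =
        http.send(request, HttpResponse.BodyHandlers.ofString());
    System.out.println(response.statusCode());
    System.out.println(response.body());
  }
}

If the response lists per-partition state as documented, it would show whether the failing backup lines up with exactly the partitions that get stuck at 100% backpressure.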
The ticket was incorrectly opened for Camunda 7. The user has already reported it for Camunda 8: https://github.com/camunda/camunda/issues/20126
Environment (Required on creation)
Zeebe: 8.5.2
Total partitions: 16
Nodes: 8
Each Zeebe node is a 16 GB / 4-core pod
Description (Required on creation; please attach any relevant screenshots, stacktraces, log files, etc. to the ticket)
We have noticed that some partitions permanently start firing backpressure at 100% even though the load is limited.
All partitions show as healthy, but the backpressure percentage is 100 for some of them.
From the metrics I observe the following:
Jobs activated per second is also 0
PVC / CPU / memory usage is normal
Some stack traces from one of the brokers are included above.
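To make "100% backpressure" concrete from the client side: the gateway rejects commands for an overloaded partition with gRPC RESOURCE_EXHAUSTED even while the partition still reports healthy. A minimal probe sketch with the Zeebe Java client; the gateway address and the BPMN process id "probe-process" are placeholders, not part of the original report:

import io.camunda.zeebe.client.ZeebeClient;
import io.camunda.zeebe.client.api.command.ClientStatusException;
import io.grpc.Status.Code;

public final class BackpressureProbe {

  public static void main(final String[] args) throws Exception {
    // Placeholder gateway address and process id.
    try (final ZeebeClient client =
        ZeebeClient.newClientBuilder()
            .gatewayAddress("camunda-zeebe-gateway.camunda.svc:26500")
            .usePlaintext()
            .build()) {

      int rejected = 0;
      for (int i = 0; i < 100; i++) {
        try {
          client
              .newCreateInstanceCommand()
              .bpmnProcessId("probe-process")
              .latestVersion()
              .send()
              .join();
        } catch (final ClientStatusException e) {
          // Backpressure surfaces to clients as gRPC RESOURCE_EXHAUSTED.
          if (e.getStatusCode() == Code.RESOURCE_EXHAUSTED) {
            rejected++;
          }
        }
        Thread.sleep(100);
      }
      System.out.printf("rejected %d of 100 create-instance commands%n", rejected);
    }
  }
}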
Steps to reproduce (Required on creation)
Not really sure
Observed Behavior (Required on creation)
Partitions are stuck at 100% backpressure and the system is not responding.
Expected behavior (Required on creation)
Backpressure should be released automatically and the partitions should start accepting requests again.
Root Cause (Required on prioritization)
Solution Ideas
Hints
Links
Attached are logs from when this happened. Note that the log times are in UTC, while the diagrams use IST (UTC + 5:30). Sorry for the inconsistency.
logs-insights-results (4).csv
Breakdown
Dev2QA handover