camunda / camunda-bpm-platform

Flexible framework for workflow and decision automation with BPMN and DMN. Integration with Quarkus, Spring, Spring Boot, CDI.
https://camunda.com/
Apache License 2.0

Partitions in Zeebe are stuck at 100% backpressure forever #4482

Closed · ashprojects closed this issue 4 months ago

ashprojects commented 4 months ago

Environment (Required on creation)

Zeebe: 8.5.2, total partitions: 16, nodes: 8. Each Zeebe node is a 16 GB / 4-core pod.

Description (Required on creation; please attach any relevant screenshots, stacktraces, log files, etc. to the ticket)

We have noticed that some partitions permanently report 100% backpressure, even though the load on the cluster is limited.

[screenshot]

All partitions are reported as healthy, but the backpressure percentage is 100 for some of them.

[screenshot]

Some metric observations show the following:

[screenshot]
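As a cross-check on the dashboard numbers, the sketch below (not part of the original report) scrapes a broker's Prometheus endpoint and prints the request counters that the backpressure panel is usually derived from. The broker host, the management port 9600, and the exact metric names are assumptions that may differ per deployment.

```java
// Minimal sketch: pull a broker's Prometheus metrics and print the request-drop counters
// commonly used to compute the backpressure ratio. Host, port, and metric names are assumptions.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class BackpressureMetricsProbe {
  public static void main(String[] args) throws Exception {
    // Hypothetical broker host; pass the real pod DNS name as the first argument.
    String broker = args.length > 0 ? args[0] : "camunda-zeebe-0.camunda-zeebe.camunda.svc";
    HttpClient http = HttpClient.newHttpClient();
    HttpRequest request = HttpRequest.newBuilder()
        .uri(URI.create("http://" + broker + ":9600/actuator/prometheus"))
        .GET()
        .build();
    String body = http.send(request, HttpResponse.BodyHandlers.ofString()).body();
    // Keep only the counters the backpressure ratio (dropped / received) is derived from.
    body.lines()
        .filter(line -> line.startsWith("zeebe_dropped_request_count_total")
            || line.startsWith("zeebe_received_request_count_total"))
        .forEach(System.out::println);
  }
}
```

The dashboard's backpressure percentage is typically the rate of dropped requests over the rate of received requests per partition, so a partition pinned at 100% should show both counters increasing at the same rate.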

Jobs activated per second is also 0.

[screenshot]
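For context, "jobs activated per second" is what job workers see. Below is a minimal Zeebe Java client worker sketch; the gateway address and job type are placeholders, not values from this deployment. When the leader of a partition is rejecting everything under backpressure, such a worker receives few or no activated jobs, which matches the flat-zero panel above.

```java
// Minimal sketch of a plain Zeebe Java client worker (assumed gateway address and job type).
import io.camunda.zeebe.client.ZeebeClient;

public class NoopWorker {
  public static void main(String[] args) {
    try (ZeebeClient client = ZeebeClient.newClientBuilder()
        .gatewayAddress("camunda-zeebe-gateway:26500") // hypothetical gateway address
        .usePlaintext()
        .build()) {
      client.newWorker()
          .jobType("example-job")           // hypothetical job type
          .handler((jobClient, job) ->      // complete immediately, just to observe activation
              jobClient.newCompleteCommand(job.getKey()).send().join())
          .open();
      // Keep the worker alive so the activation rate can be observed on the dashboard.
      try { Thread.currentThread().join(); } catch (InterruptedException ignored) { }
    }
  }
}
```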

PVC / CPU / memory usage is normal.

[screenshots]

Some stack traces from one of the brokers

```
2024-07-06 12:40:52.846 [Broker-0] [raft-server-0-7] [raft-server-7] INFO
      io.atomix.raft.roles.FollowerRole - RaftServer{raft-partition-partition-7}{role=FOLLOWER} - No heartbeat from a known leader since 1987854ms
2024-07-06 12:40:52.846 [Broker-0] [raft-server-0-7] [raft-server-7] INFO
      io.atomix.raft.roles.FollowerRole - RaftServer{raft-partition-partition-7}{role=FOLLOWER} - Sending poll requests to all active members: [DefaultRaftMember{id=6, type=ACTIVE, updated=2024-07-05T07:26:31.484Z}, DefaultRaftMember{id=7
2024-07-06 12:40:55.347 [Broker-0] [raft-server-0-7] [raft-server-7] WARN
      io.atomix.raft.roles.FollowerRole - RaftServer{raft-partition-partition-7}{role=FOLLOWER} - Poll request to 6 failed: java.util.concurrent.TimeoutException: Request raft-partition-partition-7-poll to camunda-zeebe-6.camunda-zeebe.ca
2024-07-06 12:40:55.347 [Broker-0] [raft-server-0-7] [raft-server-7] WARN
      io.atomix.raft.roles.FollowerRole - RaftServer{raft-partition-partition-7}{role=FOLLOWER} - Poll request to 7 failed: java.util.concurrent.TimeoutException: Request raft-partition-partition-7-poll to camunda-zeebe-7.camunda-zeebe.ca
2024-07-06 12:41:02.847 [Broker-0] [raft-server-0-7] [raft-server-7] INFO
      io.atomix.raft.roles.FollowerRole - RaftServer{raft-partition-partition-7}{role=FOLLOWER} - No heartbeat from a known leader since 1997855ms
2024-07-06 12:41:02.848 [Broker-0] [raft-server-0-7] [raft-server-7] INFO
      io.atomix.raft.roles.FollowerRole - RaftServer{raft-partition-partition-7}{role=FOLLOWER} - Sending poll requests to all active members: [DefaultRaftMember{id=6, type=ACTIVE, updated=2024-07-05T07:26:31.484Z}, DefaultRaftMember{id=7
2024-07-06 12:41:05.348 [Broker-0] [raft-server-0-7] [raft-server-7] WARN
      io.atomix.raft.roles.FollowerRole - RaftServer{raft-partition-partition-7}{role=FOLLOWER} - Poll request to 6 failed: java.util.concurrent.TimeoutException: Request raft-partition-partition-7-poll to camunda-zeebe-6.camunda-zeebe.ca
2024-07-06 12:41:05.349 [Broker-0] [raft-server-0-7] [raft-server-7] WARN
      io.atomix.raft.roles.FollowerRole - RaftServer{raft-partition-partition-7}{role=FOLLOWER} - Poll request to 7 failed: java.util.concurrent.TimeoutException: Request raft-partition-partition-7-poll to camunda-zeebe-7.camunda-zeebe.ca
2024-07-06 12:41:06.344 [Broker-0] [zb-actors-2] [InterPartitionCommandReceiverActor-8] WARN
```

Steps to reproduce (Required on creation)

Not really sure

Observed Behavior (Required on creation)

Partitions are stuck at 100% backpressure and the system is not responding.

Expected behavior (Required on creation)

Backpressure should be released automatically, and the partitions should start accepting requests again.

Root Cause (Required on prioritization)

Solution Ideas

Hints

Links

Attached are logs from when this happened. Note that the log timestamps are in UTC, while the dashboard screenshots use IST (UTC+5:30). Sorry for the inconsistency.

logs-insights-results (4).csv

Breakdown

### Pull Requests

Dev2QA handover

ashprojects commented 4 months ago

For more info:

  1. We have the Elasticsearch exporter enabled.
  2. We have deployed this on AWS EKS, and all Zeebe pods run on different nodes.
  3. This happened a few days ago; I restarted the cluster and it recovered, but now it has happened again.
  4. We have around 10M process instances, each of them waiting for a message correlation. According to the business use case, those messages are released later to continue the process (see the sketch after the stack trace below).
  5. Some exceptions observed when this happens:
```
2024-07-06 12:54:44.573 [Broker-4] [zb-actors-0] [] WARN
      io.camunda.zeebe.topology.gossip.ClusterTopologyGossiper - Failed to sync with 6
java.util.concurrent.CompletionException: io.atomix.cluster.messaging.MessagingException$RemoteHandlerFailure: Remote handler failed to handle message, cause: Failed to handle message, host camunda-zeebe-4.camunda-zeebe.camunda.svc:26502
    at java.base/java.util.concurrent.CompletableFuture.encodeThrowable(Unknown Source) ~[?:?]
    at java.base/java.util.concurrent.CompletableFuture.completeThrowable(Unknown Source) ~[?:?]
    at java.base/java.util.concurrent.CompletableFuture$UniApply.tryFire(Unknown Source) ~[?:?]
    at java.base/java.util.concurrent.CompletableFuture.postComplete(Unknown Source) ~[?:?]
    at java.base/java.util.concurrent.CompletableFuture.completeExceptionally(Unknown Source) ~[?:?]
    at io.atomix.cluster.messaging.impl.NettyMessagingService.lambda$executeOnPooledConnection$25(NettyMessagingService.java:626) ~[zeebe-atomix-cluster-8.5.2.jar:8.5.2]
    at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:31) ~[guava-33.1.0-jre.jar:?]
    at io.atomix.cluster.messaging.impl.NettyMessagingService.lambda$executeOnPooledConnection$26(NettyMessagingService.java:624) ~[zeebe-atomix-cluster-8.5.2.jar:8.5.2]
    at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(Unknown Source) ~[?:?]
    at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(Unknown Source) ~[?:?]
    at java.base/java.util.concurrent.CompletableFuture.postComplete(Unknown Source) ~[?:?]
    at java.base/java.util.concurrent.CompletableFuture.completeExceptionally(Unknown Source) ~[?:?]
    at io.atomix.cluster.messaging.impl.AbstractClientConnection.dispatch(AbstractClientConnection.java:48) ~[zeebe-atomix-cluster-8.5.2.jar:8.5.2]
    at io.atomix.cluster.messaging.impl.AbstractClientConnection.dispatch(AbstractClientConnection.java:29) ~[zeebe-atomix-cluster-8.5.2.jar:8.5.2]
    at io.atomix.cluster.messaging.impl.NettyMessagingService$MessageDispatcher.channelRead0(NettyMessagingService.java:1109) ~[zeebe-atomix-cluster-8.5.2.jar:8.5.2]
    at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99) ~[netty-transport-4.1.110.Final.jar:4.1.110.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444) ~[netty-transport-4.1.110.Final.jar:4.1.110.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) ~[netty-transport-4.1.110.Final.jar:4.1.110.Final]
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412) ~[netty-transport-4.1.110.Final.jar:4.1.110.Final]
    at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:346) ~[netty-codec-4.1.110.Final.jar:4.1.110.Final]
    at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:318) ~[netty-codec-4.1.110.Final.jar:4.1.110.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444) ~[netty-transport-4.1.110.Final.jar:4.1.110.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) ~[netty-transport-4.1.110.Final.jar:4.1.110.Final]
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412) ~[netty-transport-4.1.110.Final.jar:4.1.110.Final]
    at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1407) ~[netty-transport-4.1.110.Final.jar:4.1.110.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:440) ~[netty-transport-4.1.110.Final.jar:4.1.110.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) ~[netty-transport-4.1.110.Final.jar:4.1.110.Final]
    at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:918) ~[netty-transport-4.1.110.Final.jar:4.1.110.Final]
    at io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:799) ~[netty-transport-classes-epoll-4.1.110.Final.jar:4.1.110.Final]
    at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:501) ~[netty-transport-classes-epoll-4.1.110.Final.jar:4.1.110.Final]
    at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:399) ~[netty-transport-classes-epoll-4.1.110.Final.jar:4.1.110.Final]
    at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:994) ~[netty-common-4.1.110.Final.jar:4.1.110.Final]
    at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[netty-common-4.1.110.Final.jar:4.1.110.Final]
    at java.base/java.lang.Thread.run(Unknown Source) ~[?:?]
Caused by: io.atomix.cluster.messaging.MessagingException$RemoteHandlerFailure: Remote handler failed to handle message, cause: Failed to handle message, host camunda-zeebe-4.camunda-zeebe.camunda.svc:26502 is not a known cluster member
    ... 22 more
```
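Regarding point 4 in the list above: here is a minimal sketch (with an assumed message name and correlation key, not the real process model) of how such a waiting instance is normally released, by publishing a message that correlates on a key. When the owning partition sits at 100% backpressure, these publish commands are typically rejected or time out, so the instances stay blocked.

```java
// Minimal sketch of releasing a waiting instance via message correlation.
// Gateway address, message name, and correlation key are placeholders.
import io.camunda.zeebe.client.ZeebeClient;
import java.time.Duration;

public class ReleaseMessage {
  public static void main(String[] args) {
    try (ZeebeClient client = ZeebeClient.newClientBuilder()
        .gatewayAddress("camunda-zeebe-gateway:26500") // hypothetical gateway address
        .usePlaintext()
        .build()) {
      client.newPublishMessageCommand()
          .messageName("payment-received")   // hypothetical message name
          .correlationKey("order-4711")      // hypothetical correlation key
          .timeToLive(Duration.ofHours(1))   // buffer the message if the instance is not ready yet
          .send()
          .join();
    }
  }
}
```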
ashprojects commented 4 months ago

[Update] I had to manually analyse the topology and restart the brokers leading the throttled partitions one by one. This is happening twice a day, in production. Something is definitely not right here.
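For anyone repeating that manual analysis, here is a small sketch using the Zeebe Java client's topology request to list which broker currently leads each partition, together with the health the client reports; the gateway address is a placeholder, and the partition health field is only exposed by recent client versions.

```java
// Minimal sketch: list the current leader (and reported health) of every partition.
import io.camunda.zeebe.client.ZeebeClient;
import io.camunda.zeebe.client.api.response.PartitionBrokerRole;
import io.camunda.zeebe.client.api.response.Topology;

public class PartitionLeaders {
  public static void main(String[] args) {
    try (ZeebeClient client = ZeebeClient.newClientBuilder()
        .gatewayAddress("camunda-zeebe-gateway:26500") // hypothetical gateway address
        .usePlaintext()
        .build()) {
      Topology topology = client.newTopologyRequest().send().join();
      topology.getBrokers().forEach(broker ->
          broker.getPartitions().stream()
              .filter(p -> p.getRole() == PartitionBrokerRole.LEADER)
              .forEach(p -> System.out.printf("partition %d led by broker %d (%s), health=%s%n",
                  p.getPartitionId(), broker.getNodeId(), broker.getAddress(), p.getHealth())));
    }
  }
}
```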

I noticed a pattern: whenever we run a backup, we end up in this state. The 100% backpressure appears when a backup is scheduled, and that backup eventually fails on its own. Could it be that when a backup is taken while there is load on the system, the electionTimeout of 2500 ms is exceeded and we end up in this state?
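To check that correlation, the backup status can be queried per partition through the management actuator. The sketch below assumes the monitoring port 9600, the /actuator/backups endpoint as documented for Zeebe 8.x, and a placeholder backup id; adjust all three to the actual deployment.

```java
// Minimal sketch (assumed endpoint and port): query the per-partition status of a backup.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class BackupStatusCheck {
  public static void main(String[] args) throws Exception {
    String broker = "camunda-zeebe-0.camunda-zeebe.camunda.svc"; // hypothetical broker host
    long backupId = Long.parseLong(args.length > 0 ? args[0] : "20240706"); // hypothetical backup id
    HttpClient http = HttpClient.newHttpClient();
    HttpRequest status = HttpRequest.newBuilder()
        .uri(URI.create("http://" + broker + ":9600/actuator/backups/" + backupId))
        .GET()
        .build();
    // The response lists the backup state per partition, so a partition whose backup never
    // completes (or fails) should stand out here.
    System.out.println(http.send(status, HttpResponse.BodyHandlers.ofString()).body());
  }
}
```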

yanavasileva commented 4 months ago

The ticket was incorrectly opened for Camunda 7. The user already reported the ticket for Camunda 8: https://github.com/camunda/camunda/issues/20126