camunda / camunda-bpm-platform

Flexible framework for workflow and decision automation with BPMN and DMN. Integration with Quarkus, Spring, Spring Boot, CDI.
https://camunda.com/
Apache License 2.0

Partitions in Zeebe are stuck at 100% backpressure forever #4482

Closed · ashprojects closed this issue 4 months ago

ashprojects commented 4 months ago

Environment (Required on creation)

Zeebe: 8.5.2, total partitions: 16, nodes: 8. Each Zeebe node is a 16 GB / 4-core pod.

Description (Required on creation; please attach any relevant screenshots, stacktraces, log files, etc. to the ticket)

We have noticed that some partitions permanently report 100% backpressure, even though the load on the cluster is limited.

[screenshot]

All partitions are reported as healthy, but the backpressure percentage is 100 for some of them.

[screenshot]

Some metric observations show the following:

[screenshot]
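As a cross-check on the dashboard numbers, the sketch below (not part of the original report) scrapes a broker's Prometheus endpoint and prints the request counters that the backpressure panel is usually derived from. The broker host, the management port 9600, and the exact metric names are assumptions that may differ per deployment.

```java
// Minimal sketch: pull a broker's Prometheus metrics and print the request-drop counters
// commonly used to compute the backpressure ratio. Host, port, and metric names are assumptions.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class BackpressureMetricsProbe {
  public static void main(String[] args) throws Exception {
    // Hypothetical broker host; pass the real pod DNS name as the first argument.
    String broker = args.length > 0 ? args[0] : "camunda-zeebe-0.camunda-zeebe.camunda.svc";
    HttpClient http = HttpClient.newHttpClient();
    HttpRequest request = HttpRequest.newBuilder()
        .uri(URI.create("http://" + broker + ":9600/actuator/prometheus"))
        .GET()
        .build();
    String body = http.send(request, HttpResponse.BodyHandlers.ofString()).body();
    // Keep only the counters the backpressure ratio (dropped / received) is derived from.
    body.lines()
        .filter(line -> line.startsWith("zeebe_dropped_request_count_total")
            || line.startsWith("zeebe_received_request_count_total"))
        .forEach(System.out::println);
  }
}
```

The dashboard's backpressure percentage is typically the rate of dropped requests over the rate of received requests per partition, so a partition pinned at 100% should show both counters increasing at the same rate.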

Jobs activated per second is also 0.

[screenshot]
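For context, "jobs activated per second" is what job workers see. Below is a minimal Zeebe Java client worker sketch; the gateway address and job type are placeholders, not values from this deployment. When the leader of a partition is rejecting everything under backpressure, such a worker receives few or no activated jobs, which matches the flat-zero panel above.

```java
// Minimal sketch of a plain Zeebe Java client worker (assumed gateway address and job type).
import io.camunda.zeebe.client.ZeebeClient;

public class NoopWorker {
  public static void main(String[] args) {
    try (ZeebeClient client = ZeebeClient.newClientBuilder()
        .gatewayAddress("camunda-zeebe-gateway:26500") // hypothetical gateway address
        .usePlaintext()
        .build()) {
      client.newWorker()
          .jobType("example-job")           // hypothetical job type
          .handler((jobClient, job) ->      // complete immediately, just to observe activation
              jobClient.newCompleteCommand(job.getKey()).send().join())
          .open();
      // Keep the worker alive so the activation rate can be observed on the dashboard.
      try { Thread.currentThread().join(); } catch (InterruptedException ignored) { }
    }
  }
}
```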

PVC / CPU / memory usage is normal.

[screenshots]

Some stack traces from one of the brokers

```
2024-07-06 12:40:52.846 [Broker-0] [raft-server-0-7] [raft-server-7] INFO
      io.atomix.raft.roles.FollowerRole - RaftServer{raft-partition-partition-7}{role=FOLLOWER} - No heartbeat from a known leader since 1987854ms
2024-07-06 12:40:52.846 [Broker-0] [raft-server-0-7] [raft-server-7] INFO
      io.atomix.raft.roles.FollowerRole - RaftServer{raft-partition-partition-7}{role=FOLLOWER} - Sending poll requests to all active members: [DefaultRaftMember{id=6, type=ACTIVE, updated=2024-07-05T07:26:31.484Z}, DefaultRaftMember{id=7
2024-07-06 12:40:55.347 [Broker-0] [raft-server-0-7] [raft-server-7] WARN
      io.atomix.raft.roles.FollowerRole - RaftServer{raft-partition-partition-7}{role=FOLLOWER} - Poll request to 6 failed: java.util.concurrent.TimeoutException: Request raft-partition-partition-7-poll to camunda-zeebe-6.camunda-zeebe.ca
2024-07-06 12:40:55.347 [Broker-0] [raft-server-0-7] [raft-server-7] WARN
      io.atomix.raft.roles.FollowerRole - RaftServer{raft-partition-partition-7}{role=FOLLOWER} - Poll request to 7 failed: java.util.concurrent.TimeoutException: Request raft-partition-partition-7-poll to camunda-zeebe-7.camunda-zeebe.ca
2024-07-06 12:41:02.847 [Broker-0] [raft-server-0-7] [raft-server-7] INFO
      io.atomix.raft.roles.FollowerRole - RaftServer{raft-partition-partition-7}{role=FOLLOWER} - No heartbeat from a known leader since 1997855ms
2024-07-06 12:41:02.848 [Broker-0] [raft-server-0-7] [raft-server-7] INFO
      io.atomix.raft.roles.FollowerRole - RaftServer{raft-partition-partition-7}{role=FOLLOWER} - Sending poll requests to all active members: [DefaultRaftMember{id=6, type=ACTIVE, updated=2024-07-05T07:26:31.484Z}, DefaultRaftMember{id=7
2024-07-06 12:41:05.348 [Broker-0] [raft-server-0-7] [raft-server-7] WARN
      io.atomix.raft.roles.FollowerRole - RaftServer{raft-partition-partition-7}{role=FOLLOWER} - Poll request to 6 failed: java.util.concurrent.TimeoutException: Request raft-partition-partition-7-poll to camunda-zeebe-6.camunda-zeebe.ca
2024-07-06 12:41:05.349 [Broker-0] [raft-server-0-7] [raft-server-7] WARN
      io.atomix.raft.roles.FollowerRole - RaftServer{raft-partition-partition-7}{role=FOLLOWER} - Poll request to 7 failed: java.util.concurrent.TimeoutException: Request raft-partition-partition-7-poll to camunda-zeebe-7.camunda-zeebe.ca
2024-07-06 12:41:06.344 [Broker-0] [zb-actors-2] [InterPartitionCommandReceiverActor-8] WARN
```

Steps to reproduce (Required on creation)

Not really sure

Observed Behavior (Required on creation)

Partitions are stuck at 100% backpressure and the system is not responding.

Expected behavior (Required on creation)

Backpressure should be released automatically, and the partitions should start accepting requests again.

Root Cause (Required on prioritization)

Solution Ideas

Hints

Links

Attached are logs from when this happened. Note that the log timestamps are in UTC, while the dashboard screenshots use IST (UTC+5:30). Sorry for the inconsistency.

logs-insights-results (4).csv

Breakdown

### Pull Requests

Dev2QA handover

ashprojects commented 4 months ago

For more info:

  1. We have the Elasticsearch exporter enabled.
  2. We have deployed this on AWS EKS, and all Zeebe pods run on different nodes.
  3. This happened a few days ago; I restarted the cluster and it recovered, but now it has happened again.
  4. We have around 10M process instances, each of them waiting for a message correlation. According to the business use case, those messages are released later to continue the process (see the sketch after the stack trace below).
  5. Some exceptions observed when this happens:
```
2024-07-06 12:54:44.573 [Broker-4] [zb-actors-0] [] WARN
      io.camunda.zeebe.topology.gossip.ClusterTopologyGossiper - Failed to sync with 6
java.util.concurrent.CompletionException: io.atomix.cluster.messaging.MessagingException$RemoteHandlerFailure: Remote handler failed to handle message, cause: Failed to handle message, host camunda-zeebe-4.camunda-zeebe.camunda.svc:26502
    at java.base/java.util.concurrent.CompletableFuture.encodeThrowable(Unknown Source) ~[?:?]
    at java.base/java.util.concurrent.CompletableFuture.completeThrowable(Unknown Source) ~[?:?]
    at java.base/java.util.concurrent.CompletableFuture$UniApply.tryFire(Unknown Source) ~[?:?]
    at java.base/java.util.concurrent.CompletableFuture.postComplete(Unknown Source) ~[?:?]
    at java.base/java.util.concurrent.CompletableFuture.completeExceptionally(Unknown Source) ~[?:?]
    at io.atomix.cluster.messaging.impl.NettyMessagingService.lambda$executeOnPooledConnection$25(NettyMessagingService.java:626) ~[zeebe-atomix-cluster-8.5.2.jar:8.5.2]
    at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:31) ~[guava-33.1.0-jre.jar:?]
    at io.atomix.cluster.messaging.impl.NettyMessagingService.lambda$executeOnPooledConnection$26(NettyMessagingService.java:624) ~[zeebe-atomix-cluster-8.5.2.jar:8.5.2]
    at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(Unknown Source) ~[?:?]
    at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(Unknown Source) ~[?:?]
    at java.base/java.util.concurrent.CompletableFuture.postComplete(Unknown Source) ~[?:?]
    at java.base/java.util.concurrent.CompletableFuture.completeExceptionally(Unknown Source) ~[?:?]
    at io.atomix.cluster.messaging.impl.AbstractClientConnection.dispatch(AbstractClientConnection.java:48) ~[zeebe-atomix-cluster-8.5.2.jar:8.5.2]
    at io.atomix.cluster.messaging.impl.AbstractClientConnection.dispatch(AbstractClientConnection.java:29) ~[zeebe-atomix-cluster-8.5.2.jar:8.5.2]
    at io.atomix.cluster.messaging.impl.NettyMessagingService$MessageDispatcher.channelRead0(NettyMessagingService.java:1109) ~[zeebe-atomix-cluster-8.5.2.jar:8.5.2]
    at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99) ~[netty-transport-4.1.110.Final.jar:4.1.110.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444) ~[netty-transport-4.1.110.Final.jar:4.1.110.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) ~[netty-transport-4.1.110.Final.jar:4.1.110.Final]
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412) ~[netty-transport-4.1.110.Final.jar:4.1.110.Final]
    at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:346) ~[netty-codec-4.1.110.Final.jar:4.1.110.Final]
    at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:318) ~[netty-codec-4.1.110.Final.jar:4.1.110.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444) ~[netty-transport-4.1.110.Final.jar:4.1.110.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) ~[netty-transport-4.1.110.Final.jar:4.1.110.Final]
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412) ~[netty-transport-4.1.110.Final.jar:4.1.110.Final]
    at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1407) ~[netty-transport-4.1.110.Final.jar:4.1.110.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:440) ~[netty-transport-4.1.110.Final.jar:4.1.110.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) ~[netty-transport-4.1.110.Final.jar:4.1.110.Final]
    at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:918) ~[netty-transport-4.1.110.Final.jar:4.1.110.Final]
    at io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:799) ~[netty-transport-classes-epoll-4.1.110.Final.jar:4.1.110.Final]
    at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:501) ~[netty-transport-classes-epoll-4.1.110.Final.jar:4.1.110.Final]
    at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:399) ~[netty-transport-classes-epoll-4.1.110.Final.jar:4.1.110.Final]
    at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:994) ~[netty-common-4.1.110.Final.jar:4.1.110.Final]
    at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[netty-common-4.1.110.Final.jar:4.1.110.Final]
    at java.base/java.lang.Thread.run(Unknown Source) ~[?:?]
Caused by: io.atomix.cluster.messaging.MessagingException$RemoteHandlerFailure: Remote handler failed to handle message, cause: Failed to handle message, host camunda-zeebe-4.camunda-zeebe.camunda.svc:26502 is not a known cluster member
    ... 22 more
```
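Regarding point 4 in the list above: here is a minimal sketch (with an assumed message name and correlation key, not the real process model) of how such a waiting instance is normally released, by publishing a message that correlates on a key. When the owning partition sits at 100% backpressure, these publish commands are typically rejected or time out, so the instances stay blocked.

```java
// Minimal sketch of releasing a waiting instance via message correlation.
// Gateway address, message name, and correlation key are placeholders.
import io.camunda.zeebe.client.ZeebeClient;
import java.time.Duration;

public class ReleaseMessage {
  public static void main(String[] args) {
    try (ZeebeClient client = ZeebeClient.newClientBuilder()
        .gatewayAddress("camunda-zeebe-gateway:26500") // hypothetical gateway address
        .usePlaintext()
        .build()) {
      client.newPublishMessageCommand()
          .messageName("payment-received")   // hypothetical message name
          .correlationKey("order-4711")      // hypothetical correlation key
          .timeToLive(Duration.ofHours(1))   // buffer the message if the instance is not ready yet
          .send()
          .join();
    }
  }
}
```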
ashprojects commented 4 months ago

[Update] I had to manually analyse the topology and restart the brokers leading the throttled partitions one by one. This is happening twice a day, in production. Something is definitely not right here.
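For anyone repeating that manual analysis, here is a small sketch using the Zeebe Java client's topology request to list which broker currently leads each partition, together with the health the client reports; the gateway address is a placeholder, and the partition health field is only exposed by recent client versions.

```java
// Minimal sketch: list the current leader (and reported health) of every partition.
import io.camunda.zeebe.client.ZeebeClient;
import io.camunda.zeebe.client.api.response.PartitionBrokerRole;
import io.camunda.zeebe.client.api.response.Topology;

public class PartitionLeaders {
  public static void main(String[] args) {
    try (ZeebeClient client = ZeebeClient.newClientBuilder()
        .gatewayAddress("camunda-zeebe-gateway:26500") // hypothetical gateway address
        .usePlaintext()
        .build()) {
      Topology topology = client.newTopologyRequest().send().join();
      topology.getBrokers().forEach(broker ->
          broker.getPartitions().stream()
              .filter(p -> p.getRole() == PartitionBrokerRole.LEADER)
              .forEach(p -> System.out.printf("partition %d led by broker %d (%s), health=%s%n",
                  p.getPartitionId(), broker.getNodeId(), broker.getAddress(), p.getHealth())));
    }
  }
}
```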

I noticed a pattern: whenever we run a backup, we end up in this state. The 100% backpressure appears when a backup is scheduled, and that backup eventually fails on its own. Could it be that when a backup is taken while there is load on the system, the electionTimeout of 2500 ms is exceeded and we end up in this state?
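To check that correlation, the backup status can be queried per partition through the management actuator. The sketch below assumes the monitoring port 9600, the /actuator/backups endpoint as documented for Zeebe 8.x, and a placeholder backup id; adjust all three to the actual deployment.

```java
// Minimal sketch (assumed endpoint and port): query the per-partition status of a backup.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class BackupStatusCheck {
  public static void main(String[] args) throws Exception {
    String broker = "camunda-zeebe-0.camunda-zeebe.camunda.svc"; // hypothetical broker host
    long backupId = Long.parseLong(args.length > 0 ? args[0] : "20240706"); // hypothetical backup id
    HttpClient http = HttpClient.newHttpClient();
    HttpRequest status = HttpRequest.newBuilder()
        .uri(URI.create("http://" + broker + ":9600/actuator/backups/" + backupId))
        .GET()
        .build();
    // The response lists the backup state per partition, so a partition whose backup never
    // completes (or fails) should stand out here.
    System.out.println(http.send(status, HttpResponse.BodyHandlers.ofString()).body());
  }
}
```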

yanavasileva commented 4 months ago

The ticket was incorrectly opened for Camunda 7. The user already reported the ticket for Camunda 8: https://github.com/camunda/camunda/issues/20126