camunda / camunda

Process Orchestration Framework
https://camunda.com/platform/

Zeebe fails on start with Failed to handle message, host is not a known cluster member #17180

Open amitonlentra opened 7 months ago

amitonlentra commented 7 months ago

Describe the bug

We are upgrading Zeebe from 8.2.12 to 8.4.0 (8.4.5 failed as well), but the Zeebe brokers error out on startup with `java.util.concurrent.CompletionException: io.atomix.cluster.messaging.MessagingException$RemoteHandlerFailure: Remote handler failed to handle message, cause: Failed to handle message, host dev-zeebe-0.dev-zeebe.default.svc:26502 is not a known cluster member`.

The Helm chart version for 8.4.0 was 9.0.2. We also tried starting up 8.4.5 and got the same error, and we saw the same issue in the logs for Zeebe broker 0 when trying the latest version (8.5.0-alpha2) locally on a laptop with the Helm command below:

```
helm install dev camunda/camunda-platform --set identity.enabled=false --set optimize.enabled=false --set tasklist.enabled=false --set operate.enabled=false --set connectors.enabled=false --set zeebe.affinity.podAntiAffinity=null --set zeebe-gateway.affinity.podAntiAffinity=null --set global.identity.auth.enabled=false
```

The installation fails both on our AWS setup (EC2 instances) and locally on a laptop.
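When it fails like this, one quick way to check what the gateway reports as cluster topology — a sketch, assuming `zbctl` is installed and the gateway Service is named `dev-zeebe-gateway` in the `default` namespace (the name is inferred from the release name `dev`; verify with `kubectl get svc`):

```
# forward the gateway's gRPC port to the local machine
kubectl port-forward svc/dev-zeebe-gateway 26500:26500 &
# print brokers, partitions, and their roles as the gateway sees them
zbctl status --address localhost:26500 --insecure
```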

To Reproduce

Install Zeebe with the above command, or with 8.4.0 and the values.yaml below, and check the logs for broker 0. In our setup we are trying to install 9 brokers with this values.yaml file.

```yaml
global:
  identity:
    auth:
      enabled: false
  image:
    tag: 8.4.0

identity:
  enabled: false

optimize:
  enabled: false

tasklist:
  enabled: false

operate:
  enabled: false

elasticsearch:
  enabled: true
  image:
    repository: bitnami/elasticsearch
    tag: 8.3.2
  master:
    replicaCount: 1
    resources:
      requests:
        cpu: 1
        memory: 2Gi
      limits:
        cpu: 1
        memory: 2Gi

connectors:
  enabled: false

zeebe:
  clusterSize: 3
  partitionCount: 3
  replicationFactor: 1
  cpuThreadCount: 4
  ioThreadCount: 4
  logLevel: info
  retention:
    enabled: true
    minimumAge: 10d
  affinity:
    podAntiAffinity: null
  env:
    - name: ZEEBE_BROKER_EXECUTION_METRICS_EXPORTER_ENABLED
      value: "true"
  pvcSize: 128Gi
  resources:
    requests:
      cpu: 1
      memory: 512Mi
    limits:
      cpu: 1
      memory: 512Mi

zeebe-gateway:
  replicas: 2
  affinity:
    podAntiAffinity: null
  env:
    - name: ZEEBE_GATEWAY_THREADS_MANAGEMENTTHREADS
      value: "4"
    - name: ZEEBE_GATEWAY_MONITORING_ENABLED
      value: "true"
  resources:
    requests:
      cpu: 1
      memory: 512Mi
    limits:
      cpu: 1
      memory: 512Mi
```
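To follow broker 0's logs during startup — assuming the release is named `dev`, so the broker pod is `dev-zeebe-0` in the `default` namespace (matching the hostnames in the stacktrace below):

```
# stream the broker 0 logs while the pod starts
kubectl logs -f dev-zeebe-0
```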

Log/Stacktrace

The stacktrace below shows broker 0 failing to sync with broker 2.

Full Stacktrace

```
2024-03-28 08:36:56.153 [] [atomix-cluster-heartbeat-sender] [] INFO  io.atomix.cluster.protocol.swim - 0 - Member added Member{id=2, address=dev-zeebe-2.dev-zeebe.default.svc:26502, properties={}}
2024-03-28 08:36:56.184 [Broker-0] [zb-actors-1] [] WARN  io.camunda.zeebe.topology.gossip.ClusterTopologyGossiper - Failed to sync with 2
java.util.concurrent.CompletionException: io.atomix.cluster.messaging.MessagingException$RemoteHandlerFailure: Remote handler failed to handle message, cause: Failed to handle message, host dev-zeebe-0.dev-zeebe.default.svc:26502 is not a known cluster member
	at java.base/java.util.concurrent.CompletableFuture.encodeThrowable(Unknown Source) ~[?:?]
	at java.base/java.util.concurrent.CompletableFuture.completeThrowable(Unknown Source) ~[?:?]
	at java.base/java.util.concurrent.CompletableFuture$UniApply.tryFire(Unknown Source) ~[?:?]
	at java.base/java.util.concurrent.CompletableFuture.postComplete(Unknown Source) ~[?:?]
	at java.base/java.util.concurrent.CompletableFuture.completeExceptionally(Unknown Source) ~[?:?]
	at io.atomix.cluster.messaging.impl.NettyMessagingService.lambda$executeOnPooledConnection$25(NettyMessagingService.java:626) ~[zeebe-atomix-cluster-8.4.0.jar:8.4.0]
	at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:31) ~[guava-33.0.0-jre.jar:?]
	at io.atomix.cluster.messaging.impl.NettyMessagingService.lambda$executeOnPooledConnection$26(NettyMessagingService.java:624) ~[zeebe-atomix-cluster-8.4.0.jar:8.4.0]
	at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(Unknown Source) ~[?:?]
	at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(Unknown Source) ~[?:?]
	at java.base/java.util.concurrent.CompletableFuture.postComplete(Unknown Source) ~[?:?]
	at java.base/java.util.concurrent.CompletableFuture.completeExceptionally(Unknown Source) ~[?:?]
	at io.atomix.cluster.messaging.impl.AbstractClientConnection.dispatch(AbstractClientConnection.java:49) ~[zeebe-atomix-cluster-8.4.0.jar:8.4.0]
	at io.atomix.cluster.messaging.impl.AbstractClientConnection.dispatch(AbstractClientConnection.java:30) ~[zeebe-atomix-cluster-8.4.0.jar:8.4.0]
	at io.atomix.cluster.messaging.impl.NettyMessagingService$MessageDispatcher.channelRead0(NettyMessagingService.java:1109) ~[zeebe-atomix-cluster-8.4.0.jar:8.4.0]
	at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99) ~[netty-transport-4.1.104.Final.jar:4.1.104.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444) ~[netty-transport-4.1.104.Final.jar:4.1.104.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) ~[netty-transport-4.1.104.Final.jar:4.1.104.Final]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412) ~[netty-transport-4.1.104.Final.jar:4.1.104.Final]
	at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:346) ~[netty-codec-4.1.104.Final.jar:4.1.104.Final]
	at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:318) ~[netty-codec-4.1.104.Final.jar:4.1.104.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444) ~[netty-transport-4.1.104.Final.jar:4.1.104.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) ~[netty-transport-4.1.104.Final.jar:4.1.104.Final]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412) ~[netty-transport-4.1.104.Final.jar:4.1.104.Final]
	at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410) ~[netty-transport-4.1.104.Final.jar:4.1.104.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:440) ~[netty-transport-4.1.104.Final.jar:4.1.104.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) ~[netty-transport-4.1.104.Final.jar:4.1.104.Final]
	at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919) ~[netty-transport-4.1.104.Final.jar:4.1.104.Final]
	at io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:800) ~[netty-transport-classes-epoll-4.1.104.Final.jar:4.1.104.Final]
	at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:509) ~[netty-transport-classes-epoll-4.1.104.Final.jar:4.1.104.Final]
	at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:407) ~[netty-transport-classes-epoll-4.1.104.Final.jar:4.1.104.Final]
	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997) ~[netty-common-4.1.104.Final.jar:4.1.104.Final]
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[netty-common-4.1.104.Final.jar:4.1.104.Final]
	at java.base/java.lang.Thread.run(Unknown Source) ~[?:?]
Caused by: io.atomix.cluster.messaging.MessagingException$RemoteHandlerFailure: Remote handler failed to handle message, cause: Failed to handle message, host dev-zeebe-0.dev-zeebe.default.svc:26502 is not a known cluster member
	... 22 more
```


amitonlentra commented 7 months ago

This has been a blocking issue for us since last week. On the other hand, it also sounds like it could be a configuration issue, since it is startup that fails. @npepinpe @Zelldon @deepthidevaki @oleschoenburg - could one of you please spare a few minutes to tell us whether this is a real issue or whether something basic is missing in our Helm values?

Thanks.

npepinpe commented 7 months ago

Did you update directly from 8.2.x to 8.4.x, without first updating to 8.3.x? As stated in the docs, you can skip patch versions, but you cannot skip minor versions during an update, as there may be interim migrations required.
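For illustration, a stepwise update goes through each minor version in turn — a sketch, assuming a release named `dev`; the 8.3 chart version is left as a placeholder, so look up a concrete chart version that ships Camunda 8.3 before running this:

```
# first 8.2.x -> 8.3.x (any interim migrations run here); pick a real 8.3 chart version
helm upgrade dev camunda/camunda-platform --version <8.3-chart-version> -f values.yaml
# then 8.3.x -> 8.4.x; chart 9.0.2 corresponds to 8.4.0 per the report above
helm upgrade dev camunda/camunda-platform --version 9.0.2 -f values.yaml
```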

amitonlentra commented 7 months ago

@npepinpe - thanks for replying. This error occurs even locally, and on the local setup we installed 8.4.0/8.4.5 directly through Helm, with no upgrade involved. While troubleshooting, I realised that this is logged as a warning. Is it something to worry about, or a temporary issue? A single workflow execution works fine; we haven't had a chance to do a load test yet.

Ruivalim commented 6 months ago

Is there any solution for this?

amitonlentra commented 6 months ago

@Ruivalim - what kind of solution are you looking for? Are you seeing any side effects of the errors that are logged as warnings?

linonetwo commented 4 months ago

I'm doing a fresh install of v8.5.4.

I get the same error on a clean install. I also disabled identity/console/optimize/tasklist to make a minimal install.

It says `io.atomix.cluster.messaging.MessagingException$RemoteHandlerFailure: Remote handler failed to handle message, cause: Failed to handle message, host workflow-zeebe-0.workflow-zeebe.camunda-workflow.svc:26502 is not a known cluster member`

even though the service exists at workflow-zeebe.camunda-workflow:26502 (TCP) and the pod workflow-zeebe-0 also exists.

(Two screenshots, taken 2024-07-02 16:40 and 16:43, not reproduced here.)

And I can reach it:

```
curl workflow-zeebe-0.workflow-zeebe.camunda-workflow.svc:26502
curl: (52) Empty reply from server
```
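The exception class (`NoSuchMemberException` below) suggests the rejection comes from the atomix cluster membership view rather than from Kubernetes networking, so DNS resolving is not enough on its own. To double-check the Kubernetes side anyway — a sketch using the names from the logs above:

```
kubectl -n camunda-workflow get svc workflow-zeebe          # broker service
kubectl -n camunda-workflow get endpoints workflow-zeebe    # should list the broker pod IPs
kubectl -n camunda-workflow get pod workflow-zeebe-0        # broker 0 itself
```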

Sometimes I get this log on Zeebe:

```
Partition-1 failed, marking it as unhealthy: Partition-1{status=UNHEALTHY, issue=HealthIssue[message=null, throwable=null, cause=ZeebePartition-1{status=UNHEALTHY, issue=HealthIssue[message=Services not installed, throwable=null, cause=null]}]}
```

And on zeebe-gateway:

```
2024-07-02 11:03:12.521 [] [netty-messaging-event-epoll-client-0] [] WARN
      io.atomix.cluster.messaging.impl.NettyMessagingService - Unexpected error while handling message stream-recreate from workflow-zeebe-0.workflow-zeebe.camunda-workflow.svc:26502
io.atomix.cluster.messaging.MessagingException$NoSuchMemberException: Failed to handle message, host workflow-zeebe-0.workflow-zeebe.camunda-workflow.svc:26502 is not a known cluster member
```
linonetwo commented 4 months ago

It seems to work after adding these ports to the zeebe-gateway Service, where they didn't exist before:

```yaml
    - name: internal
      port: 26502
      protocol: TCP
      targetPort: 26502
    - name: command
      port: 26501
      protocol: TCP
      targetPort: 26501
```

After that, the broker logs:

```
2024-07-02 09:43:46.298 [Broker-0] [zb-actors-1] [HealthCheckService] INFO
      io.camunda.zeebe.broker.system - Partition-1 recovered, marking it as healthy
```
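If you'd rather not hand-edit the Service, the same ports can be appended with a JSON patch — a sketch, assuming the gateway Service is named `workflow-zeebe-gateway` in the `camunda-workflow` namespace (a guess from the naming above; verify with `kubectl get svc`). Note this is transient: a later `helm upgrade` regenerates the Service and drops the patch, so the durable fix belongs in the chart values or templates.

```
# append the internal (26502) and command (26501) ports to the gateway Service
kubectl -n camunda-workflow patch svc workflow-zeebe-gateway --type=json -p='[
  {"op": "add", "path": "/spec/ports/-", "value": {"name": "internal", "port": 26502, "protocol": "TCP", "targetPort": 26502}},
  {"op": "add", "path": "/spec/ports/-", "value": {"name": "command", "port": 26501, "protocol": "TCP", "targetPort": 26501}}
]'
```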

But Operate still has an error in its log:

```
2024-07-02 10:38:24.412  WARN 7 --- [-worker-ELG-1-2] i.c.z.c.i.ZeebeCallCredentials           : The request's security level does not guarantee that the credentials will be confidential.
Error occurred when requesting partition ids from Zeebe client: null
```

It worked after disabling auth completely:

```yaml
global:
  identity:
    auth:
      enabled: false
```
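To apply that — a sketch, assuming the release is named `workflow` (matching the pod names above) and the setting is saved in `values.yaml`:

```
helm upgrade workflow camunda/camunda-platform -f values.yaml
```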