camunda / zeebe

Distributed Workflow Engine for Microservices Orchestration
https://zeebe.io

`ScaleDownBrokersTest` is flaky #18190

Open · korthout opened this issue 2 weeks ago

korthout commented 2 weeks ago

Summary

Try to answer the following as best as possible

Failures

Outline known failure cases, e.g. a failed assertion and its stacktrace obtained from Jenkins

Example assertion failure
⚠️ Unfinished test runs
io.camunda.zeebe.it.clustering.dynamic.ScaleDownBrokersTest

Hypotheses

List any hypotheses if you have one; can be omitted

The broker ran out of disk space and the test could not complete.

Logs

If possible, provide more context here, e.g. standard output logs, link to build, etc.

16:37:07.418 [Broker-0] [DiskSpaceUsageMonitorActor] [zb-actors-0] WARN  io.camunda.zeebe.broker.system - Out of disk space. Current available 0 bytes. Minimum needed 134217728 bytes.
16:37:07.418 [Broker-0] [CommandApiRequestHandler] [zb-actors-0] DEBUG io.camunda.zeebe.broker.transport - Broker is out of disk space. All client requests will be rejected
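
As a rough sketch of the out-of-disk hypothesis above (the class name, the use of the temp directory, and the guard itself are assumptions for illustration, not part of ScaleDownBrokersTest), the test setup could skip the run when the work directory's file store has less usable space than the 128 MiB minimum reported by the disk space monitor, so an out-of-disk CI agent surfaces as "skipped" rather than a hang:

```java
import static org.junit.jupiter.api.Assumptions.assumeTrue;

import java.io.IOException;
import java.nio.file.FileStore;
import java.nio.file.Files;
import java.nio.file.Path;
import org.junit.jupiter.api.BeforeEach;

// Hypothetical guard, not part of the actual test: abort early when the CI agent
// has less free disk space than the broker minimum (134217728 bytes, i.e. 128 MiB,
// taken from the DiskSpaceUsageMonitorActor log line above).
class DiskSpaceGuardSketch {

  private static final long MIN_FREE_BYTES = 134_217_728L;

  @BeforeEach
  void assumeEnoughDiskSpace() throws IOException {
    // Assumption: the brokers write their data under the JVM temp directory.
    Path workDir = Path.of(System.getProperty("java.io.tmpdir"));
    FileStore store = Files.getFileStore(workDir);
    assumeTrue(
        store.getUsableSpace() >= MIN_FREE_BYTES,
        "Skipping: only " + store.getUsableSpace() + " bytes free on " + workDir);
  }
}
```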
deepthidevaki commented 3 days ago

From the logs, it looks like the tests passed, but the run got stuck while shutting down the broker. The last log line from Broker-0 is Shutdown API Messaging Service, but there are 4 remaining steps in the shutdown sequence.

16:37:09.693 [Broker-0] [Startup] [zb-actors-0] INFO  io.camunda.zeebe.broker.system - Shutdown Command API
16:37:09.693 [Broker-0] [Startup] [zb-actors-0] INFO  io.camunda.zeebe.broker.system - Shutdown Broker Transport
16:37:09.693 [Broker-0] [Startup] [zb-actors-0] INFO  io.camunda.zeebe.broker.system - Shutdown API Messaging Service

After this, we only see logs from the gateway successfully gossiping with Broker-0, indicating that Broker-0 is not shut down.

16:37:09.969 [] [] [atomix-cluster-heartbeat-sender] DEBUG io.atomix.cluster.protocol.swim.sync - gateway-0 - Start synchronizing membership with Member{id=0, address=0.0.0.0:1987, properties={brokerInfo=EADJAAAABAAAAAAAAwAAAAMAAAABAAAAAAABCgAAAGNvbW1hbmRBcGkMAAAAMC4wLjAuMDoxOTg2BQAADAAADgAAADguNi4wLVNOQVBTSE9UBQADAQAAAAACAAAAAAMAAAAA}, version=8.6.0-SNAPSHOT, timestamp=1714495011104, state=ALIVE, incarnationNumber=1714495011111}
16:37:09.969 [] [] [atomix-cluster-heartbeat-sender] DEBUG io.atomix.cluster.protocol.swim.sync - gateway-0 - Finished synchronizing membership with Member{id=0, address=0.0.0.0:1987, properties={brokerInfo=EADJAAAABAAAAAAAAwAAAAMAAAABAAAAAAABCgAAAGNvbW1hbmRBcGkMAAAAMC4wLjAuMDoxOTg2BQAADAAADgAAADguNi4wLVNOQVBTSE9UBQADAQAAAAACAAAAAAMAAAAA}, version=8.6.0-SNAPSHOT, timestamp=1714495011104, state=ALIVE, incarnationNumber=1714495011111}, received: '[Member{id=gateway-0, address=0.0.0.0:2002, properties={event-service-topics-subscribed=KIIDAGpvYnNBdmFpbGFibOU=}, version=8.6.0-SNAPSHOT, timestamp=1714495010935, state=ALIVE, incarnationNumber=1714495010935}, Member{id=0, address=0.0.0.0:1987, properties={brokerInfo=EADJAAAABAAAAAAAAwAAAAMAAAABAAAAAAABCgAAAGNvbW1hbmRBcGkMAAAAMC4wLjAuMDoxOTg2BQAADAAADgAAADguNi4wLVNOQVBTSE9UBQADAQAAAAACAAAAAAMAAAAA}, version=8.6.0-SNAPSHOT, timestamp=1714495011104, state=ALIVE, incarnationNumber=1714495011111}]'

So it looks like the shutdown is stuck while closing the NettyMessagingService.
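
To illustrate why the remaining steps never log anything, here is a minimal sketch of a sequential shutdown pipeline (the step names mirror the log, but the code is a stand-in, not Zeebe's actual shutdown implementation): if one step's close future never completes, join() blocks forever and everything after it is silently skipped, which matches a log that stops right after "Shutdown API Messaging Service".

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.Supplier;

// Sketch of a sequential shutdown: each step logs its name, then blocks until its
// close future completes before moving on to the next step.
final class ShutdownSequenceSketch {

  record Step(String name, Supplier<CompletableFuture<Void>> closer) {}

  static void shutdown(List<Step> steps) {
    for (Step step : steps) {
      System.out.println("Shutdown " + step.name());
      step.closer().get().join(); // blocks until this step's close future completes
    }
  }

  public static void main(String[] args) {
    shutdown(
        List.of(
            new Step("Command API", () -> CompletableFuture.completedFuture(null)),
            new Step("Broker Transport", () -> CompletableFuture.completedFuture(null)),
            // Stands in for the hanging close: this future is never completed.
            new Step("API Messaging Service", CompletableFuture::new),
            new Step("Gateway", () -> CompletableFuture.completedFuture(null))));
    System.out.println("Broker shut down"); // never reached in this sketch
  }
}
```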

I'm not sure why the disk space monitor is reporting that it is out of disk space, but that should not affect the shutdown process.

deepthidevaki commented 3 days ago

We would expect to see the following log:

Stopped messaging service bound to ..

But we don't see it, so it is safe to assume that the shutdown is stuck closing the NettyMessagingService.
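
If the hang can be reproduced, one way to confirm where the close is blocked would be to bound the wait on the stop future with a timeout and dump all thread stacks when it fires. This is a hedged sketch using only JDK APIs; stopFuture is a placeholder for whatever future the messaging service's stop call returns, not a reference to Zeebe internals.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical diagnostic: wait a bounded time for a close future and, if it does
// not complete, print every thread's stack trace to see what the shutdown is
// blocked on.
final class CloseTimeoutDiagnostics {

  static void awaitOrDumpThreads(CompletableFuture<?> stopFuture)
      throws ExecutionException, InterruptedException {
    try {
      stopFuture.get(30, TimeUnit.SECONDS);
    } catch (TimeoutException e) {
      System.err.println("Close did not complete within 30s; dumping threads:");
      for (Map.Entry<Thread, StackTraceElement[]> entry :
          Thread.getAllStackTraces().entrySet()) {
        System.err.println("Thread: " + entry.getKey().getName());
        for (StackTraceElement frame : entry.getValue()) {
          System.err.println("    at " + frame);
        }
      }
    }
  }
}
```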