camunda / zeebe

Distributed Workflow Engine for Microservices Orchestration
https://zeebe.io

`ScaleDownBrokersTest` is flaky #18190

Open · korthout opened this issue 2 weeks ago

korthout commented 2 weeks ago

Summary

Try to answer the following as best as possible

Failures

Outline known failure cases, e.g. a failed assertion and its stacktrace obtained from Jenkins

Example assertion failure
⚠️ Unfinished test runs
io.camunda.zeebe.it.clustering.dynamic.ScaleDownBrokersTest

Hypotheses

List any hypotheses if you have one; can be omitted

The broker ran out of disk space and the test could not complete.

Logs

If possible, provide more context here, e.g. standard output logs, link to build, etc.

16:37:07.418 [Broker-0] [DiskSpaceUsageMonitorActor] [zb-actors-0] WARN  io.camunda.zeebe.broker.system - Out of disk space. Current available 0 bytes. Minimum needed 134217728 bytes.
16:37:07.418 [Broker-0] [CommandApiRequestHandler] [zb-actors-0] DEBUG io.camunda.zeebe.broker.transport - Broker is out of disk space. All client requests will be rejected
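
As a rough sketch of the out-of-disk hypothesis above (the class name, the use of the temp directory, and the guard itself are assumptions for illustration, not part of ScaleDownBrokersTest), the test setup could skip the run when the work directory's file store has less usable space than the 128 MiB minimum reported by the disk space monitor, so an out-of-disk CI agent surfaces as "skipped" rather than a hang:

```java
import static org.junit.jupiter.api.Assumptions.assumeTrue;

import java.io.IOException;
import java.nio.file.FileStore;
import java.nio.file.Files;
import java.nio.file.Path;
import org.junit.jupiter.api.BeforeEach;

// Hypothetical guard, not part of the actual test: abort early when the CI agent
// has less free disk space than the broker minimum (134217728 bytes, i.e. 128 MiB,
// taken from the DiskSpaceUsageMonitorActor log line above).
class DiskSpaceGuardSketch {

  private static final long MIN_FREE_BYTES = 134_217_728L;

  @BeforeEach
  void assumeEnoughDiskSpace() throws IOException {
    // Assumption: the brokers write their data under the JVM temp directory.
    Path workDir = Path.of(System.getProperty("java.io.tmpdir"));
    FileStore store = Files.getFileStore(workDir);
    assumeTrue(
        store.getUsableSpace() >= MIN_FREE_BYTES,
        "Skipping: only " + store.getUsableSpace() + " bytes free on " + workDir);
  }
}
```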
deepthidevaki commented 3 days ago

From the logs, it looks like the tests passed, but the run got stuck while shutting down the broker. The last log line from Broker-0 is Shutdown API Messaging Service, but there are 4 remaining steps in the shutdown sequence.

16:37:09.693 [Broker-0] [Startup] [zb-actors-0] INFO  io.camunda.zeebe.broker.system - Shutdown Command API
16:37:09.693 [Broker-0] [Startup] [zb-actors-0] INFO  io.camunda.zeebe.broker.system - Shutdown Broker Transport
16:37:09.693 [Broker-0] [Startup] [zb-actors-0] INFO  io.camunda.zeebe.broker.system - Shutdown API Messaging Service

After this, we only see logs from the gateway successfully gossiping with Broker-0, indicating that Broker-0 is not shut down.

16:37:09.969 [] [] [atomix-cluster-heartbeat-sender] DEBUG io.atomix.cluster.protocol.swim.sync - gateway-0 - Start synchronizing membership with Member{id=0, address=0.0.0.0:1987, properties={brokerInfo=EADJAAAABAAAAAAAAwAAAAMAAAABAAAAAAABCgAAAGNvbW1hbmRBcGkMAAAAMC4wLjAuMDoxOTg2BQAADAAADgAAADguNi4wLVNOQVBTSE9UBQADAQAAAAACAAAAAAMAAAAA}, version=8.6.0-SNAPSHOT, timestamp=1714495011104, state=ALIVE, incarnationNumber=1714495011111}
16:37:09.969 [] [] [atomix-cluster-heartbeat-sender] DEBUG io.atomix.cluster.protocol.swim.sync - gateway-0 - Finished synchronizing membership with Member{id=0, address=0.0.0.0:1987, properties={brokerInfo=EADJAAAABAAAAAAAAwAAAAMAAAABAAAAAAABCgAAAGNvbW1hbmRBcGkMAAAAMC4wLjAuMDoxOTg2BQAADAAADgAAADguNi4wLVNOQVBTSE9UBQADAQAAAAACAAAAAAMAAAAA}, version=8.6.0-SNAPSHOT, timestamp=1714495011104, state=ALIVE, incarnationNumber=1714495011111}, received: '[Member{id=gateway-0, address=0.0.0.0:2002, properties={event-service-topics-subscribed=KIIDAGpvYnNBdmFpbGFibOU=}, version=8.6.0-SNAPSHOT, timestamp=1714495010935, state=ALIVE, incarnationNumber=1714495010935}, Member{id=0, address=0.0.0.0:1987, properties={brokerInfo=EADJAAAABAAAAAAAAwAAAAMAAAABAAAAAAABCgAAAGNvbW1hbmRBcGkMAAAAMC4wLjAuMDoxOTg2BQAADAAADgAAADguNi4wLVNOQVBTSE9UBQADAQAAAAACAAAAAAMAAAAA}, version=8.6.0-SNAPSHOT, timestamp=1714495011104, state=ALIVE, incarnationNumber=1714495011111}]'

So it looks like the shutdown is stuck while closing the NettyMessagingService.
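
To illustrate why the remaining steps never log anything, here is a minimal sketch of a sequential shutdown pipeline (the step names mirror the log, but the code is a stand-in, not Zeebe's actual shutdown implementation): if one step's close future never completes, join() blocks forever and everything after it is silently skipped, which matches a log that stops right after "Shutdown API Messaging Service".

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.Supplier;

// Sketch of a sequential shutdown: each step logs its name, then blocks until its
// close future completes before moving on to the next step.
final class ShutdownSequenceSketch {

  record Step(String name, Supplier<CompletableFuture<Void>> closer) {}

  static void shutdown(List<Step> steps) {
    for (Step step : steps) {
      System.out.println("Shutdown " + step.name());
      step.closer().get().join(); // blocks until this step's close future completes
    }
  }

  public static void main(String[] args) {
    shutdown(
        List.of(
            new Step("Command API", () -> CompletableFuture.completedFuture(null)),
            new Step("Broker Transport", () -> CompletableFuture.completedFuture(null)),
            // Stands in for the hanging close: this future is never completed.
            new Step("API Messaging Service", CompletableFuture::new),
            new Step("Gateway", () -> CompletableFuture.completedFuture(null))));
    System.out.println("Broker shut down"); // never reached in this sketch
  }
}
```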

I'm not sure why the disk space monitor is reporting that it is out of disk space, but that should not affect the shutdown process.

deepthidevaki commented 3 days ago

We would expect to see the following log:

Stopped messaging service bound to ..

But we don't see it, so it is safe to assume that the shutdown is stuck closing the NettyMessagingService.
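
If the hang can be reproduced, one way to confirm where the close is blocked would be to bound the wait on the stop future with a timeout and dump all thread stacks when it fires. This is a hedged sketch using only JDK APIs; stopFuture is a placeholder for whatever future the messaging service's stop call returns, not a reference to Zeebe internals.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical diagnostic: wait a bounded time for a close future and, if it does
// not complete, print every thread's stack trace to see what the shutdown is
// blocked on.
final class CloseTimeoutDiagnostics {

  static void awaitOrDumpThreads(CompletableFuture<?> stopFuture)
      throws ExecutionException, InterruptedException {
    try {
      stopFuture.get(30, TimeUnit.SECONDS);
    } catch (TimeoutException e) {
      System.err.println("Close did not complete within 30s; dumping threads:");
      for (Map.Entry<Thread, StackTraceElement[]> entry :
          Thread.getAllStackTraces().entrySet()) {
        System.err.println("Thread: " + entry.getKey().getName());
        for (StackTraceElement frame : entry.getValue()) {
          System.err.println("    at " + frame);
        }
      }
    }
  }
}
```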