bifromqio / bifromq

A Multi-Tenancy MQTT broker adopting Serverless architecture
https://bifromq.io
Apache License 2.0
619 stars 61 forks source link

mqtt 集群压测导致内存溢出然后宕机 #90

Closed masterOcean closed 3 months ago

masterOcean commented 3 months ago

系统内存溢出导致所有节点都宕机 集群3个节点(32C,64G)(20,54,124 三台),23w客户端,每隔5S发 40K body 压测,经过1d 15h 后节点 124 报错然后宕机,error.log

2024-06-23 05:55:49.530 ERROR   [basekv-client-executor-3] --- c.b.b.baserpc.AbstractResponsePipeline  [AbstractResponsePipeline.java:93] Request handling with error in pipeline@205972377
java.util.concurrent.CompletionException: java.lang.ArrayIndexOutOfBoundsException: Index -2 out of bounds for length 5
    at java.base/java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:315)
    at java.base/java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:320)
    at java.base/java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:936)
    at java.base/java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:911)
    at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510)
    at java.base/java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:2147)
    at com.baidu.bifromq.dist.server.scheduler.DistCallScheduler$DistWorkerCallBatcher$BatchDistCall.lambda$execute$21(DistCallScheduler.java:233)
    at java.base/java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:934)
    at java.base/java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:911)
    at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510)
    at java.base/java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:2147)
    at com.baidu.bifromq.baserpc.ManagedRequestPipeline$RequestTask.finish(ManagedRequestPipeline.java:433)
    at com.baidu.bifromq.baserpc.ManagedRequestPipeline$ResponseObserver.onNext(ManagedRequestPipeline.java:472)
    at io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onMessage(ClientCalls.java:474)
    at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1MessagesAvailable.runInternal(ClientCallImpl.java:675)
    at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1MessagesAvailable.runInContext(ClientCallImpl.java:660)
    at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
    at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133)
    at io.micrometer.core.instrument.internal.TimedRunnable.run(TimedRunnable.java:49)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at java.base/java.lang.Thread.run(Thread.java:842)
Caused by: java.lang.ArrayIndexOutOfBoundsException: Index -2 out of bounds for length 5
    at com.baidu.bifromq.dist.server.RunningAverage.log(RunningAverage.java:29)
    at com.baidu.bifromq.dist.server.DistResponsePipeline.lambda$handleRequest$1(DistResponsePipeline.java:98)
    at java.base/java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:934)
    ... 19 common frames omitted

随之出现堆内存溢出,gc 日志

[2024-06-23T05:55:45.250+0800][5888][gc] GC(14028) Garbage Collection (Allocation Rate) 12046M(42%)->2672M(9%)
[2024-06-23T05:55:58.390+0800][5888][gc] GC(14029) Garbage Collection (Allocation Rate) 12930M(45%)->1808M(6%)
[2024-06-23T05:56:19.354+0800][5888][gc] GC(14030) Garbage Collection (Proactive) 15498M(54%)->16020M(56%)
[2024-06-23T05:56:25.267+0800][5888][gc] GC(14031) Garbage Collection (Allocation Rate) 19278M(67%)->20070M(70%)
[2024-06-23T05:56:28.742+0800][5888][gc] GC(14032) Garbage Collection (Allocation Rate) 21576M(75%)->22608M(79%)
[2024-06-23T05:56:29.783+0800][5888][gc] GC(14033) Garbage Collection (Allocation Rate) 22670M(79%)->24064M(84%)
[2024-06-23T05:56:30.439+0800][5888][gc] GC(14034) Garbage Collection (Allocation Rate) 24222M(84%)->8768M(31%)
[2024-06-23T05:56:37.055+0800][5888][gc] GC(14035) Garbage Collection (Allocation Rate) 12050M(42%)->6164M(21%)
[2024-06-23T05:56:46.477+0800][5888][gc] GC(14036) Garbage Collection (Allocation Rate) 11210M(39%)->12542M(44%)
[2024-06-23T05:56:51.639+0800][5888][gc] GC(14037) Garbage Collection (Allocation Rate) 14532M(51%)->16076M(56%)
[2024-06-23T05:56:56.412+0800][5888][gc] GC(14038) Garbage Collection (Allocation Rate) 17584M(61%)->19366M(68%)
[2024-06-23T05:56:58.342+0800][5888][gc] GC(14039) Garbage Collection (Allocation Rate) 19404M(68%)->20662M(72%)
[2024-06-23T05:56:59.909+0800][5888][gc] GC(14040) Garbage Collection (Allocation Rate) 20690M(72%)->21826M(76%)
[2024-06-23T05:57:01.286+0800][5888][gc] GC(14041) Garbage Collection (Allocation Rate) 21870M(76%)->22812M(80%)

之后其他两台节点频繁打印 can not connect 24后也宕机。warn.log 如下

2024-06-23 05:57:08.313  WARN   [cluster-tcp-transport] --- c.b.b.b.transport.TCPTransport          [TCPTransport.java:247] Failed to connect to address: /10.89.144.124:8899
java.util.concurrent.ExecutionException: io.netty.channel.AbstractChannel$AnnotatedConnectException: finishConnect(..) failed: Connection refused: /10.89.144.124:8899
    at io.netty.util.concurrent.DefaultPromise.get(DefaultPromise.java:351)
    at com.baidu.bifromq.basecluster.transport.TCPTransport.lambda$getChannel$8(TCPTransport.java:321)
    at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:590)
    at io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:583)
    at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:559)
    at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:492)

BifroMQ

To Reproduce 压测客户端,23w client, 每隔5S 发送 body 40k qos =0 的消息, PUB Client :

Expected behavior

Logs 124 最先出问题 error.log 如下:全部都是如此

2024-06-23 05:55:49.530 ERROR   [basekv-client-executor-3] --- c.b.b.baserpc.AbstractResponsePipeline  [AbstractResponsePipeline.java:93] Request handling with error in pipeline@205972377
java.util.concurrent.CompletionException: java.lang.ArrayIndexOutOfBoundsException: Index -2 out of bounds for length 5
    at java.base/java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:315)
    at java.base/java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:320)
    at java.base/java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:936)
    at java.base/java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:911)
    at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510)
    at java.base/java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:2147)
    at com.baidu.bifromq.dist.server.scheduler.DistCallScheduler$DistWorkerCallBatcher$BatchDistCall.lambda$execute$21(DistCallScheduler.java:233)
    at java.base/java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:934)
    at java.base/java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:911)
    at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510)
    at java.base/java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:2147)
    at com.baidu.bifromq.baserpc.ManagedRequestPipeline$RequestTask.finish(ManagedRequestPipeline.java:433)
    at com.baidu.bifromq.baserpc.ManagedRequestPipeline$ResponseObserver.onNext(ManagedRequestPipeline.java:472)
    at io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onMessage(ClientCalls.java:474)
    at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1MessagesAvailable.runInternal(ClientCallImpl.java:675)
    at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1MessagesAvailable.runInContext(ClientCallImpl.java:660)
    at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
    at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133)
    at io.micrometer.core.instrument.internal.TimedRunnable.run(TimedRunnable.java:49)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at java.base/java.lang.Thread.run(Thread.java:842)
Caused by: java.lang.ArrayIndexOutOfBoundsException: Index -2 out of bounds for length 5
    at com.baidu.bifromq.dist.server.RunningAverage.log(RunningAverage.java:29)
    at com.baidu.bifromq.dist.server.DistResponsePipeline.lambda$handleRequest$1(DistResponsePipeline.java:98)
    at java.base/java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:934)
    ... 19 common frames omitted
2024-06-23 05:55:49.530 ERROR   [basekv-client-executor-2] --- c.b.b.baserpc.AbstractResponsePipeline  [AbstractResponsePipeline.java:93] Request handling with error in pipeline@663109504
java.util.concurrent.CompletionException: java.lang.ArrayIndexOutOfBoundsException: Index -1 out of bounds for length 5
    at java.base/java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:315)
    at java.base/java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:320)
    at java.base/java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:936)
    at java.base/java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:911)
    at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510)
    at java.base/java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:2147)
    at com.baidu.bifromq.dist.server.scheduler.DistCallScheduler$DistWorkerCallBatcher$BatchDistCall.lambda$execute$21(DistCallScheduler.java:233)
    at java.base/java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:934)
    at java.base/java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:911)
    at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510)
    at java.base/java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:2147)
    at com.baidu.bifromq.baserpc.ManagedRequestPipeline$RequestTask.finish(ManagedRequestPipeline.java:433)
    at com.baidu.bifromq.baserpc.ManagedRequestPipeline$ResponseObserver.onNext(ManagedRequestPipeline.java:472)
    at io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onMessage(ClientCalls.java:474)
    at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1MessagesAvailable.runInternal(ClientCallImpl.java:675)
    at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1MessagesAvailable.runInContext(ClientCallImpl.java:660)
    at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
    at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133)
    at io.micrometer.core.instrument.internal.TimedRunnable.run(TimedRunnable.java:49)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at java.base/java.lang.Thread.run(Thread.java:842)
Caused by: java.lang.ArrayIndexOutOfBoundsException: Index -1 out of bounds for length 5
    at com.baidu.bifromq.dist.server.RunningAverage.log(RunningAverage.java:29)
    at com.baidu.bifromq.dist.server.DistResponsePipeline.lambda$handleRequest$1(DistResponsePipeline.java:98)
    at java.base/java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:934)
    ... 19 common frames omitted
2024-06-23 05:55:49.530 ERROR   [basekv-client-executor-1] --- c.b.b.baserpc.AbstractResponsePipeline  [AbstractResponsePipeline.java:93] Request handling with error in pipeline@537959386
java.util.concurrent.CompletionException: java.lang.ArrayIndexOutOfBoundsException: Index -4 out of bounds for length 5
    at java.base/java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:315)
    at java.base/java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:320)
    at java.base/java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:936)
    at java.base/java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:911)
    at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510)
    at java.base/java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:2147)
    at com.baidu.bifromq.dist.server.scheduler.DistCallScheduler$DistWorkerCallBatcher$BatchDistCall.lambda$execute$21(DistCallScheduler.java:233)
    at java.base/java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:934)
    at java.base/java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:911)
    at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510)
    at java.base/java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:2147)
    at com.baidu.bifromq.baserpc.ManagedRequestPipeline$RequestTask.finish(ManagedRequestPipeline.java:433)
    at com.baidu.bifromq.baserpc.ManagedRequestPipeline$ResponseObserver.onNext(ManagedRequestPipeline.java:472)
    at io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onMessage(ClientCalls.java:474)
    at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1MessagesAvailable.runInternal(ClientCallImpl.java:675)
    at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1MessagesAvailable.runInContext(ClientCallImpl.java:660)
    at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
    at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133)
    at io.micrometer.core.instrument.internal.TimedRunnable.run(TimedRunnable.java:49)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at java.base/java.lang.Thread.run(Thread.java:842)
Caused by: java.lang.ArrayIndexOutOfBoundsException: Index -4 out of bounds for length 5
    at com.baidu.bifromq.dist.server.RunningAverage.log(RunningAverage.java:29)
    at com.baidu.bifromq.dist.server.DistResponsePipeline.lambda$handleRequest$1(DistResponsePipeline.java:98)
    at java.base/java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:934)
    ... 19 common frames omitted

gc 日志:

[2024-06-23T05:55:20.815+0800][5888][gc] GC(14025) Garbage Collection (Allocation Rate) 11356M(40%)->3736M(13%)
[2024-06-23T05:55:26.425+0800][5888][gc] GC(14026) Garbage Collection (Allocation Rate) 7222M(25%)->3638M(13%)
[2024-06-23T05:55:36.003+0800][5888][gc] GC(14027) Garbage Collection (Allocation Rate) 11362M(40%)->3690M(13%)
[2024-06-23T05:55:45.250+0800][5888][gc] GC(14028) Garbage Collection (Allocation Rate) 12046M(42%)->2672M(9%)
[2024-06-23T05:55:58.390+0800][5888][gc] GC(14029) Garbage Collection (Allocation Rate) 12930M(45%)->1808M(6%)
[2024-06-23T05:56:19.354+0800][5888][gc] GC(14030) Garbage Collection (Proactive) 15498M(54%)->16020M(56%)
[2024-06-23T05:56:25.267+0800][5888][gc] GC(14031) Garbage Collection (Allocation Rate) 19278M(67%)->20070M(70%)
[2024-06-23T05:56:28.742+0800][5888][gc] GC(14032) Garbage Collection (Allocation Rate) 21576M(75%)->22608M(79%)
[2024-06-23T05:56:29.783+0800][5888][gc] GC(14033) Garbage Collection (Allocation Rate) 22670M(79%)->24064M(84%)
[2024-06-23T05:56:30.439+0800][5888][gc] GC(14034) Garbage Collection (Allocation Rate) 24222M(84%)->8768M(31%)
[2024-06-23T05:56:37.055+0800][5888][gc] GC(14035) Garbage Collection (Allocation Rate) 12050M(42%)->6164M(21%)
[2024-06-23T05:56:46.477+0800][5888][gc] GC(14036) Garbage Collection (Allocation Rate) 11210M(39%)->12542M(44%)
[2024-06-23T05:56:51.639+0800][5888][gc] GC(14037) Garbage Collection (Allocation Rate) 14532M(51%)->16076M(56%)
[2024-06-23T05:56:56.412+0800][5888][gc] GC(14038) Garbage Collection (Allocation Rate) 17584M(61%)->19366M(68%)
[2024-06-23T05:56:58.342+0800][5888][gc] GC(14039) Garbage Collection (Allocation Rate) 19404M(68%)->20662M(72%)
[2024-06-23T05:56:59.909+0800][5888][gc] GC(14040) Garbage Collection (Allocation Rate) 20690M(72%)->21826M(76%)
[2024-06-23T05:57:01.286+0800][5888][gc] GC(14041) Garbage Collection (Allocation Rate) 21870M(76%)->22812M(80%)

54 节点: error.log

2024-06-23 05:58:10.205 ERROR   [basekv-client-executor-3] --- c.b.b.baserpc.AbstractResponsePipeline  [AbstractResponsePipeline.java:93] Request handling with error in pipeline@261676605
java.util.concurrent.CompletionException: java.lang.ArrayIndexOutOfBoundsException: Index -1 out of bounds for length 5
    at java.base/java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:315)
    at java.base/java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:320)
    at java.base/java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:936)
    at java.base/java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:911)
    at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510)
    at java.base/java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:2147)
    at com.baidu.bifromq.dist.server.scheduler.DistCallScheduler$DistWorkerCallBatcher$BatchDistCall.lambda$execute$21(DistCallScheduler.java:233)
    at java.base/java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:934)
    at java.base/java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:911)
    at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510)
    at java.base/java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:2147)
    at com.baidu.bifromq.baserpc.ManagedRequestPipeline$RequestTask.finish(ManagedRequestPipeline.java:433)
    at com.baidu.bifromq.baserpc.ManagedRequestPipeline$ResponseObserver.onNext(ManagedRequestPipeline.java:472)
    at io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onMessage(ClientCalls.java:474)
    at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1MessagesAvailable.runInternal(ClientCallImpl.java:675)
    at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1MessagesAvailable.runInContext(ClientCallImpl.java:660)
    at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
    at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133)
    at io.micrometer.core.instrument.internal.TimedRunnable.run(TimedRunnable.java:49)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at java.base/java.lang.Thread.run(Thread.java:842)
Caused by: java.lang.ArrayIndexOutOfBoundsException: Index -1 out of bounds for length 5
    at com.baidu.bifromq.dist.server.RunningAverage.log(RunningAverage.java:29)
    at com.baidu.bifromq.dist.server.DistResponsePipeline.lambda$handleRequest$1(DistResponsePipeline.java:98)
    at java.base/java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:934)
    ... 19 common frames omitted
2024-06-23 05:58:10.205 ERROR   [basekv-client-executor-1] --- c.b.b.baserpc.AbstractResponsePipeline  [AbstractResponsePipeline.java:93] Request handling with error in pipeline@1633905364
java.util.concurrent.CompletionException: java.lang.ArrayIndexOutOfBoundsException: Index -2 out of bounds for length 5
    at java.base/java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:315)
    at java.base/java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:320)
    at java.base/java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:936)
    at java.base/java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:911)
    at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510)
    at java.base/java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:2147)
    at com.baidu.bifromq.dist.server.scheduler.DistCallScheduler$DistWorkerCallBatcher$BatchDistCall.lambda$execute$21(DistCallScheduler.java:233)
    at java.base/java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:934)
    at java.base/java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:911)
    at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510)
    at java.base/java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:2147)
    at com.baidu.bifromq.baserpc.ManagedRequestPipeline$RequestTask.finish(ManagedRequestPipeline.java:433)
    at com.baidu.bifromq.baserpc.ManagedRequestPipeline$ResponseObserver.onNext(ManagedRequestPipeline.java:472)
    at io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onMessage(ClientCalls.java:474)
    at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1MessagesAvailable.runInternal(ClientCallImpl.java:675)
    at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1MessagesAvailable.runInContext(ClientCallImpl.java:660)
    at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
    at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133)
    at io.micrometer.core.instrument.internal.TimedRunnable.run(TimedRunnable.java:49)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at java.base/java.lang.Thread.run(Thread.java:842)
Caused by: java.lang.ArrayIndexOutOfBoundsException: Index -2 out of bounds for length 5
    at com.baidu.bifromq.dist.server.RunningAverage.log(RunningAverage.java:29)
    at com.baidu.bifromq.dist.server.DistResponsePipeline.lambda$handleRequest$1(DistResponsePipeline.java:98)
    at java.base/java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:934)
    ... 19 common frames omitted

warn.log

2024-06-23 05:57:08.023  WARN   [bg-task-executor-3] --- c.b.b.b.b.impl.RangeLeaderBalancer      [RangeLeaderBalancer.java:76] Failed to balance rangeLeaders for there is KVRange with more than one leader!
2024-06-23 05:57:08.263  WARN   [cluster-tcp-transport] --- c.b.b.b.transport.TCPTransport          [TCPTransport.java:98] Outbound channel failure: remote=10.89.144.124/10.89.144.124:8899, cause=recvAddress(..) failed: Connection reset by peer
2024-06-23 05:57:08.265  WARN   [cluster-tcp-transport-3] --- c.b.b.b.transport.TCPTransport          [TCPTransport.java:98] Outbound channel failure: remote=10.89.144.124/10.89.144.124:8899, cause=recvAddress(..) failed: Connection reset by peer
2024-06-23 05:57:08.266  WARN   [cluster-tcp-transport-3] --- c.b.b.b.transport.TCPTransport          [TCPTransport.java:98] Outbound channel failure: remote=10.89.144.124/10.89.144.124:8899, cause=recvAddress(..) failed: Connection reset by peer
2024-06-23 05:57:08.266  WARN   [cluster-tcp-transport-1] --- c.b.b.b.transport.TCPTransport          [TCPTransport.java:98] Outbound channel failure: remote=10.89.144.124/10.89.144.124:8899, cause=recvAddress(..) failed: Connection reset by peer
2024-06-23 05:57:08.313  WARN   [cluster-tcp-transport] --- c.b.b.b.transport.TCPTransport          [TCPTransport.java:247] Failed to connect to address: /10.89.144.124:8899
java.util.concurrent.ExecutionException: io.netty.channel.AbstractChannel$AnnotatedConnectException: finishConnect(..) failed: Connection refused: /10.89.144.124:8899
    at io.netty.util.concurrent.DefaultPromise.get(DefaultPromise.java:351)
    at com.baidu.bifromq.basecluster.transport.TCPTransport.lambda$getChannel$8(TCPTransport.java:321)
    at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:590)
    at io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:583)
    at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:559)
    at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:492)
    at io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:636)
    at io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:629)
    at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:118)
    at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.fulfillConnectPromise(AbstractEpollChannel.java:674)
    at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.finishConnect(AbstractEpollChannel.java:693)
    at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.epollOutReady(AbstractEpollChannel.java:567)
    at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:489)
    at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:397)
    at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
    at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
    at java.base/java.lang.Thread.run(Thread.java:842)
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: finishConnect(..) failed: Connection refused: /10.89.144.124:8899
Caused by: java.net.ConnectException: finishConnect(..) failed: Connection refused
    at io.netty.channel.unix.Errors.newConnectException0(Errors.java:166)
    at io.netty.channel.unix.Errors.handleConnectErrno(Errors.java:131)
    at io.netty.channel.unix.Socket.finishConnect(Socket.java:359)
    at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.doFinishConnect(AbstractEpollChannel.java:710)
    at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.finishConnect(AbstractEpollChannel.java:687)
    at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.epollOutReady(AbstractEpollChannel.java:567)
    at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:489)
    at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:397)
    at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
    at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
    at java.base/java.lang.Thread.run(Thread.java:842)

gc 日志:

[2024-06-23T05:56:01.433+0800][14135][gc] GC(13847) Garbage Collection (Allocation Rate) 11640M(41%)->3640M(13%)
[2024-06-23T05:56:11.441+0800][14135][gc] GC(13848) Garbage Collection (Allocation Rate) 11744M(41%)->3648M(13%)
[2024-06-23T05:56:21.413+0800][14135][gc] GC(13849) Garbage Collection (Allocation Rate) 11744M(41%)->3616M(13%)
[2024-06-23T05:56:30.547+0800][14135][gc] GC(13850) Garbage Collection (Allocation Rate) 11878M(41%)->2640M(9%)
[2024-06-23T05:56:39.857+0800][14135][gc] GC(13851) Garbage Collection (Allocation Rate) 9912M(35%)->3746M(13%)
[2024-06-23T05:56:46.522+0800][14135][gc] GC(13852) Garbage Collection (Allocation Rate) 8354M(29%)->3616M(13%)
[2024-06-23T05:56:56.419+0800][14135][gc] GC(13853) Garbage Collection (Allocation Rate) 11620M(41%)->3648M(13%)
[2024-06-23T05:57:06.546+0800][14135][gc] GC(13854) Garbage Collection (Allocation Rate) 11862M(41%)->3662M(13%)
[2024-06-23T05:57:16.492+0800][14135][gc] GC(13855) Garbage Collection (Allocation Rate) 12214M(43%)->3692M(13%)
[2024-06-23T05:57:21.770+0800][14135][gc] GC(13856) Garbage Collection (Allocation Rate) 6790M(24%)->4104M(14%)
[2024-06-23T05:57:30.498+0800][14135][gc] GC(13857) Garbage Collection (Allocation Rate) 11850M(41%)->2742M(10%)
[2024-06-23T05:57:39.932+0800][14135][gc] GC(13858) Garbage Collection (Allocation Rate) 10106M(35%)->3722M(13%)
[2024-06-23T05:57:46.601+0800][14135][gc] GC(13859) Garbage Collection (Allocation Rate) 8302M(29%)->3612M(13%)
[2024-06-23T05:57:55.559+0800][14135][gc] GC(13860) Garbage Collection (Allocation Rate) 11642M(41%)->2740M(10%)
[2024-06-23T05:58:05.006+0800][14135][gc] GC(13861) Garbage Collection (Allocation Rate) 10108M(35%)->3702M(13%)
[2024-06-23T05:58:11.637+0800][14135][gc] GC(13862) Garbage Collection (Allocation Rate) 8354M(29%)->2996M(10%)
[2024-06-23T05:58:29.446+0800][14135][gc] GC(13863) Garbage Collection (Proactive) 14854M(52%)->8874M(31%)
[2024-06-23T05:58:36.792+0800][14135][gc] GC(13864) Garbage Collection (Allocation Rate) 12380M(43%)->13808M(48%)
[2024-06-23T05:58:41.729+0800][14135][gc] GC(13865) Garbage Collection (Allocation Rate) 15582M(54%)->17214M(60%)
[2024-06-23T05:58:44.409+0800][14135][gc] GC(13866) Garbage Collection (Allocation Rate) 17242M(60%)->19128M(67%)
[2024-06-23T05:58:46.013+0800][14135][gc] GC(13867) Garbage Collection (Allocation Rate) 19172M(67%)->20230M(71%)
[2024-06-23T05:58:49.387+0800][14135][gc] GC(13868) Garbage Collection (Allocation Rate) 21424M(75%)->22678M(79%)
[2024-06-23T05:58:50.350+0800][14135][gc] GC(13869) Garbage Collection (Allocation Rate) 22858M(80%)->23862M(83%)
[2024-06-23T05:58:51.066+0800][14135][gc] GC(13870) Garbage Collection (Allocation Rate) 23930M(83%)->14566M(51%)
[2024-06-23T05:58:55.727+0800][14135][gc] GC(13871) Garbage Collection (Allocation Rate) 16828M(59%)->4834M(17%)
[2024-06-23T05:59:15.831+0800][14135][gc] GC(13872) Garbage Collection (Allocation Rate) 17578M(61%)->18548M(65%)
[2024-06-23T05:59:20.766+0800][14135][gc] GC(13873) Garbage Collection (Allocation Rate) 20874M(73%)->22000M(77%)
[2024-06-23T05:59:22.187+0800][14135][gc] GC(13874) Garbage Collection (Allocation Rate) 22010M(77%)->22954M(80%)
[2024-06-23T05:59:23.129+0800][14135][gc] GC(13875) Garbage Collection (System.gc()) 22954M(80%)->22960M(80%)

Configurations

OS(please complete the following information):

JVM:

Performance Related

Additional context Add any other context about the problem here.

mefly2012 commented 3 months ago

内部压力工具,5台压力机一共发起23w客户端,每个客户端clean session=true,pingTimeout=60s,keepAlive=60s,username=test,password=123456,每个客户端还订阅3个topic比如abc/{clientid}/topic1,abc/{clientid}/topic2,abc/{clientid}/topic3,这些客户端随机连接到服务端3个节点,每个连接每隔5s钟publish一个内容为40KB大小的数据随机到三个topic中的某一个比如test_topic1或者test_topic2或者test_topic3。