bifromqio / bifromq

A Multi-Tenancy MQTT broker adopting Serverless architecture
https://bifromq.io
Apache License 2.0

After prolonged load testing, threads wal-raft-executor-112680774442680320_0 and basekv-range-mutator show high CPU usage that never drops #94

Open masterOcean opened 2 months ago

masterOcean commented 2 months ago

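After a long-running load test, the thread wal-raft-executor-112680774442680320_0 shows high CPU usage that never comes back down.

Setup: a 3-node cluster (32 cores, 64 GB each; the nodes are referred to as 20, 54 and 124) with 350,000 clients, each publishing a 40K body every 10 s. The test sleeps for about 3 minutes every 10-12 hours.

After roughly 2 days, the wal-raft-executor-112680774442680320_0 thread on node 20 and the wal-raft-executor-112680774434029568_0 thread on node 54 show high CPU usage that never comes down. The basekv-range-mutator thread is also high on CPU and does not recover.

During this time the cluster keeps working normally: warn.log and error.log contain no errors, and the GC log looks normal. The range named in the thread can be found in the balancer logs.

Node 20 CPU screenshot: image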

Node 20, retain.store-fd6e1d50-7308-4146-84fd-5fa62de36212.log

2024-06-30 20:23:07.191  INFO [bg-task-executor-7] --- [KVRangeBalanceController.java:181] Balancer command[ReplicaCntBalancer,ChangeConfigCommand{toStore=fd6e1d50-7308-4146-84fd-5fa62de36212, kvRangeId=112680774442680320_0, expectedVer=2784, voters=[fd6e1d50-7308-4146-84fd-5fa62de36212, 7a67e104-2788-4608-a121-7b80e9dc001e, 542e442a-3748-4ec4-b6db-eda13ad225e6], learner=[]}] result: true
2024-06-30 22:08:53.690  INFO [bg-task-executor-2] --- [KVRangeBalanceController.java:169] Balancer[ReplicaCntBalancer] run command: ChangeConfigCommand{toStore=fd6e1d50-7308-4146-84fd-5fa62de36212, kvRangeId=112680774442680320_0, expectedVer=2788, voters=[fd6e1d50-7308-4146-84fd-5fa62de36212, 7a67e104-2788-4608-a121-7b80e9dc001e], learner=[]}
2024-07-01 09:55:06.882  INFO [bg-task-executor-3] --- [KVRangeBalanceController.java:181] Balancer command[ReplicaCntBalancer,ChangeConfigCommand{toStore=fd6e1d50-7308-4146-84fd-5fa62de36212, kvRangeId=112680774442680320_0, expectedVer=2844, voters=[fd6e1d50-7308-4146-84fd-5fa62de36212, 7a67e104-2788-4608-a121-7b80e9dc001e], learner=[]}] result: true
2024-07-02 17:55:13.775  INFO [bg-task-executor] --- [KVRangeBalanceController.java:181] Balancer command[ReplicaCntBalancer,ChangeConfigCommand{toStore=fd6e1d50-7308-4146-84fd-5fa62de36212, kvRangeId=112680774442680320_0, expectedVer=2856, voters=[fd6e1d50-7308-4146-84fd-5fa62de36212, 7a67e104-2788-4608-a121-7b80e9dc001e, 542e442a-3748-4ec4-b6db-eda13ad225e6], learner=[]}] result: true
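
To see how the range configuration moves over time, balancer lines like the ones above can be reduced to a (timestamp, expectedVer, voter count) history. The following is a minimal sketch that assumes the log format is exactly as shown in this issue; the helper names `parse_balancer_line` and `voter_history` are made up for illustration and are not part of BifroMQ.

```python
import re
from datetime import datetime

# Assumption: every relevant line looks like the ChangeConfigCommand lines quoted in this issue.
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3})\s+INFO.*?"
    r"ChangeConfigCommand\{toStore=(?P<store>[0-9a-f-]+), "
    r"kvRangeId=(?P<range>\w+), expectedVer=(?P<ver>\d+), "
    r"voters=\[(?P<voters>[^\]]*)\]"
)


def parse_balancer_line(line):
    """Extract timestamp, range id, expected version and voters, or None if no match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    voters = [v.strip() for v in m.group("voters").split(",") if v.strip()]
    return {
        "time": datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S.%f"),
        "range": m.group("range"),
        "ver": int(m.group("ver")),
        "voters": voters,
    }


def voter_history(lines):
    """Return (timestamp, expectedVer, voter count) for each parsable line."""
    events = filter(None, map(parse_balancer_line, lines))
    return [(e["time"], e["ver"], len(e["voters"])) for e in events]


# Two lines copied verbatim from the node 20 log above.
SAMPLE = [
    "2024-06-30 20:23:07.191  INFO [bg-task-executor-7] --- [KVRangeBalanceController.java:181] Balancer command[ReplicaCntBalancer,ChangeConfigCommand{toStore=fd6e1d50-7308-4146-84fd-5fa62de36212, kvRangeId=112680774442680320_0, expectedVer=2784, voters=[fd6e1d50-7308-4146-84fd-5fa62de36212, 7a67e104-2788-4608-a121-7b80e9dc001e, 542e442a-3748-4ec4-b6db-eda13ad225e6], learner=[]}] result: true",
    "2024-06-30 22:08:53.690  INFO [bg-task-executor-2] --- [KVRangeBalanceController.java:169] Balancer[ReplicaCntBalancer] run command: ChangeConfigCommand{toStore=fd6e1d50-7308-4146-84fd-5fa62de36212, kvRangeId=112680774442680320_0, expectedVer=2788, voters=[fd6e1d50-7308-4146-84fd-5fa62de36212, 7a67e104-2788-4608-a121-7b80e9dc001e]}, learner=[]}",
]

for when, ver, n_voters in voter_history(SAMPLE):
    print(when.isoformat(sep=" "), "expectedVer =", ver, "voters =", n_voters)
```

Run against the full log file, a history like this makes it easier to spot a voter set that keeps flipping between 2 and 3 members while expectedVer keeps climbing.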

Node 54 CPU screenshot: image

Node 54, inbox.store-0a40673e-7e57-47d6-8fa9-e69a2305152e.log

2024-07-03 19:27:35.164  INFO [bg-task-executor] --- [KVRangeBalanceController.java:169] Balancer[ReplicaCntBalancer] run command: ChangeConfigCommand{toStore=0a40673e-7e57-47d6-8fa9-e69a2305152e, kvRangeId=112680774434029568_0, expectedVer=3640, voters=[62837868-8a27-4d5c-9bc3-1a155fc63a66, e8a84d42-8292-489e-a241-9ce716d14e07, 0a40673e-7e57-47d6-8fa9-e69a2305152e], learner=[]}

BifroMQ

To Reproduce

Load-test client: 350,000 clients, each sending a QoS 0 message with a 40K body every 8.5 s, sleeping for more than 3 minutes every 10-12 hours. PUB Client:
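
The client code itself does not appear in this thread. As a rough sense of scale only, the offered load implied by these parameters can be estimated as below. This is back-of-the-envelope arithmetic under assumptions not stated in the issue: that all clients publish, that "40K" means 40 KiB of payload, and that MQTT and TCP overhead is ignored. Both publish intervals mentioned in the report (10 s and 8.5 s) are evaluated.

```python
def offered_load(clients, payload_bytes, interval_s):
    """Return (messages per second, payload bytes per second) for a fleet of publishers."""
    msgs_per_s = clients / interval_s
    bytes_per_s = msgs_per_s * payload_bytes
    return msgs_per_s, bytes_per_s


CLIENTS = 350_000          # client count from the report
PAYLOAD = 40 * 1024        # assumption: "40K" is 40 KiB

for interval in (10.0, 8.5):
    rate, bandwidth = offered_load(CLIENTS, PAYLOAD, interval)
    print(f"interval={interval}s: {rate:,.0f} msg/s, {bandwidth / 1e9:.2f} GB/s payload")
```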

Expected behavior

Logs

Configurations

OS(please complete the following information):

JVM:

Performance Related

Additional context

popduke commented 1 month ago

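I could not reproduce the behavior you describe with the reproduction information you gave. Two suggestions: 1) add complete, reliable reproduction steps to the issue description, or 2) if the problem persists after you stop the load test and restart, provide the complete data directories of all three nodes for diagnosis.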

masterOcean commented 1 month ago

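> I could not reproduce the behavior you describe with the reproduction information you gave. Two suggestions: 1) add complete, reliable reproduction steps to the issue description, or 2) if the problem persists after you stop the load test and restart, provide the complete data directories of all three nodes for diagnosis.

Data link: https://pan.baidu.com/s/1K2gkC2vtzGz2ykbsFPYSAA?pwd=y5cg (extraction code: y5cg)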

popduke commented 1 month ago

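According to the relevant metric in your data (basekv_meta_ver), the ranges of the inbox store and the retain store have gone through several thousand configuration version changes, and the replicas are not progressing in step with each other. The threads consuming CPU are most likely the leader repeatedly trying to bring the replicas in sync. In this situation you should check whether the communication quality between the nodes is the problem. In addition, 3.2.1 includes some stability improvements in the storage engine, so I recommend testing the same scenario on that version.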
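
One simple way to start checking link quality between nodes is to time repeated TCP connects from each node to its peers and look at the spread and the failure count. The sketch below is a generic probe, not a BifroMQ tool; the function names are made up for illustration, and the host and port in the usage comment are placeholders, since the thread does not say which ports the cluster uses for inter-node traffic.

```python
import socket
import statistics
import time


def tcp_connect_times(host, port, samples=20, timeout=1.0):
    """Time `samples` TCP connects; return (list of durations in ms, failure count)."""
    times_ms = []
    failures = 0
    for _ in range(samples):
        start = time.perf_counter()
        try:
            with socket.create_connection((host, port), timeout=timeout):
                times_ms.append((time.perf_counter() - start) * 1000.0)
        except OSError:
            # Covers refused connections, timeouts and name resolution errors.
            failures += 1
    return times_ms, failures


def describe(times_ms, failures):
    """Summarize a probe run as a one-line string."""
    if not times_ms:
        return f"no successful connects, {failures} failures"
    return (
        f"median={statistics.median(times_ms):.2f} ms, "
        f"max={max(times_ms):.2f} ms, failures={failures}"
    )


# Usage (placeholders: replace with a real peer address and an open port):
#   times, failed = tcp_connect_times("peer-node.example", 12345)
#   print(describe(times, failed))
```

A high maximum relative to the median, or any failures at all on a local network, would point toward the network rather than the broker. Tools such as ping or iperf give a fuller picture of packet loss and throughput.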