bifromqio / bifromq

A Multi-Tenancy MQTT broker adopting Serverless architecture
https://bifromq.io
Apache License 2.0

BifroMQ 3.2.2 OOM in cluster mode ([Gossiper.java:137] Too many elapsed periods, sweep gossips in advance, currentPeriod=1712447, elapsedPeriod=130) #99

Closed. xiaonannet closed this issue 1 month ago.

xiaonannet commented 2 months ago

Describe the bug

After building the cluster, every node logged a successful startup. An auth plugin is integrated (the same plugin runs fine in other environments), and functionality was verified: clients could connect normally. After running for about a day, all 3 nodes went down. Suspiciously, every node keeps writing the following to warn.log:

2024-08-20 08:00:22.238 WARN [agent-host-scheduler] --- c.b.b.basecluster.messenger.Gossiper [Gossiper.java:137] Too many elapsed periods, sweep gossips in advance, currentPeriod=1712447, elapsedPeriod=130
2024-08-20 08:01:18.799 WARN [agent-host-scheduler] --- c.b.b.basecluster.messenger.Gossiper [Gossiper.java:137] Too many elapsed periods, sweep gossips in advance, currentPeriod=1712578, elapsedPeriod=282

Environment

Reproducible Steps

Screenshots attached for master, node1, and node2.

docker run -d -m 10G \
  -e JVM_HEAP_OPTS='-Xms2G -Xmx4G -XX:MetaspaceSize=128m -XX:MaxMetaspaceSize=500m -XX:MaxDirectMemorySize=1G' \
  --name bifromq \
  --network host \
  -v /data01/mqttsnet/bifromq/conf/standalone.yml:/home/bifromq/conf/standalone.yml \
  bifromq/bifromq:3.2.2

Running directly in host network mode; all the ports used by each node have been opened in the firewall.
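For reference, a minimal sketch of the cluster section of the mounted standalone.yml (the field names under clusterConfig are assumptions based on the config manual at https://bifromq.io/docs/admin_guide/configuration/config_file_manual/; the seed addresses match the AgentHost log quoted later in this thread):

clusterConfig:
  env: Test        # assumed label; must be identical on every node of the cluster
  port: 18899      # cluster gossip port, matching the seed list below
  seedEndpoints: "192.168.0.38:18899,192.168.0.28:18899,192.168.0.48:18899"   # assumed field name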

Then watch the logs: info.log shows the startup logs for each node and all 3 nodes come up successfully; by this point warn.log should already contain the Gossiper warnings.

Memory usage keeps climbing until the nodes finally die.

Publicly Accessible Diagnostic Data (if Reproducible Steps are not available)

bifromq-logs.zip

Docker daemon log:

efbd9ab1d07ab7d3e6f62acc3d77225073a537f3f0f0b5695f579d62826ad712-json.log


xiaonannet commented 2 months ago
(screenshot)
popduke commented 2 months ago

Could you add your Reproduce Steps as the issue template asks? To pinpoint the problem accurately, the reproduction environment should not include any custom plugins.

Also, your logs show each node only has 2 cores; such a constrained setup will affect GC.

2024-08-20 11:51:17.024 INFO [main] --- c.b.bifromq.starter.StandaloneStarter [StandaloneStarter.java:626] Available Processors: 2
2024-08-20 11:51:17.034 INFO [main] --- c.b.bifromq.starter.StandaloneStarter [StandaloneStarter.java:628] JVM arguments: -Xms1G -Xmx4G -XX:MetaspaceSize=64m -XX:MaxMetaspaceSize=512m -XX:MaxDirectMemorySize=512M

xiaonannet commented 2 months ago

> Could you add your Reproduce Steps as the issue template asks? To pinpoint the problem accurately, the reproduction environment should not include any custom plugins.
>
> Also, your logs show each node only has 2 cores; such a constrained setup will affect GC.
>
> 2024-08-20 11:51:17.024 INFO [main] --- c.b.bifromq.starter.StandaloneStarter [StandaloneStarter.java:626] Available Processors: 2
> 2024-08-20 11:51:17.034 INFO [main] --- c.b.bifromq.starter.StandaloneStarter [StandaloneStarter.java:628] JVM arguments: -Xms1G -Xmx4G -XX:MetaspaceSize=64m -XX:MaxMetaspaceSize=512m -XX:MaxDirectMemorySize=512M

Added. 2G per node should be enough in the initial state, since there are no connections or business operations at all; the cluster is idling. The question now is why the cluster keeps producing these Gossiper errors.

xiaonannet commented 2 months ago

Switched the Docker image back to 3.0.2 and startup is normal. With the exact same steps on 3.2.2, memory keeps growing and the non-seed nodes refuse connections while the seed node still accepts them. (screenshots)

popduke commented 1 month ago

I cannot reproduce your problem under the conditions you described: a 3.2.2 cluster of 2C4G nodes on docker host network keeps running normally even under load. Your logs show the cluster addresses do not match your config file, and the memory profile shows a large amount of memory held by RPC objects, which indicates there is business traffic.

2024-08-20 11:54:28.618 INFO [ForkJoinPool.commonPool-worker-1] --- c.b.bifromq.starter.StandaloneStarter [StandaloneStarter.java:688] AgentHost joined seedEndpoint: [/192.168.0.38:18899, /192.168.0.28:18899, /192.168.0.48:18899]

Suggestion: switch to a different cluster.env and try to reproduce the problem on a clean cluster built from physical machines.
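A minimal sketch of that suggestion in standalone.yml, assuming env sits under clusterConfig alongside the port setting mentioned later in this thread (the value is a placeholder):

clusterConfig:
  env: cleanTest    # placeholder; any label different from the old cluster's keeps the new nodes from gossiping with it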

flow989 commented 1 month ago

Installed 3.2.2 (ssl) on well-provisioned physical hosts, 3-node cluster; even completely idle it still uses extremely high memory. From logs/gc-2024-08-26_18-40-24.log:

[2024-08-27T10:24:12.363+0800][2302][gc] GC(1174) Garbage Collection (Allocation Stall) 22408M(100%)->22320M(100%)

Full logs: logs_____node42.tgz; standalone.yml: standalone.tgz

flow989 commented 1 month ago

Directory layout:

../bifromq-3.2.2
├── bin
│   ├── bifromq-start.sh
│   ├── bifromq-stop.sh
│   └── standalone.sh
├── conf
│   ├── logback.xml
│   ├── ssl
│   ├── standalone.yml
│   └── standalone.yml_orig
├── data
├── lib
├── logs
│   ├── basekv
│   ├── debug.log
│   ├── error.2024-08-27_00.log
│   ├── error.2024-08-27_02.log
│   ├── error.2024-08-27_05.log
│   ├── error.log
│   ├── event.log
│   ├── gc-2024-08-26_18-40-24.log
│   ├── info.log
│   ├── plugin.log
│   ├── stdout.log
│   ├── trace.log
│   ├── warn.2024-08-27_03.log
│   ├── warn.2024-08-27_04.log
│   ├── warn.2024-08-27_05.log
│   ├── warn.2024-08-27_06.log
│   ├── warn.2024-08-27_07.log
│   ├── warn.2024-08-27_08.log
│   ├── warn.2024-08-27_09.log
│   └── warn.log
└── plugins
    ├── bifromq-plugin-3.2.2-SNAPSHOT.jar
    └── README.md

flow989 commented 1 month ago

ps -eo rss,pmem,pcpu,vsize,args | sort -k 1 -r -n | less

69830812 99.9 213 1115864288 ...jdk-17.0.3.1/bin/java -Xms11203m -Xmx22407m -XX:MetaspaceSize=128m -XX:MaxMetaspaceSize=500m -XX:MaxDirectMemorySize=6713219480 -server -XX:MaxInlineLevel=15 -Djava.awt.headless=true -XX:+UnlockExperimentalVMOptions -XX:+UnlockDiagnosticVMOptions -XX:+UseZGC -XX:ZAllocationSpikeTolerance=5 -Xlog:async -Xlog:gc:file=/ws/app/bifromq-3.2.2/bin/../logs/gc-%t.log:time,tid,tags:filecount=5,filesize=50m -XX:+CrashOnOutOfMemoryError -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/ws/app/bifromq-3.2.2/bin/../logs -Dio.netty.tryReflectionSetAccessible=true -Dio.netty.allocator.cacheTrimIntervalMillis=5000 --add-opens java.base/java.nio=ALL-UNNAMED -cp /ws/app/bifromq-3.2.2/bin/../lib/* -DLOG_DIR=/ws/app/bifromq-3.2.2/bin/../logs -DCONF_DIR=/ws/app/bifromq-3.2.2/bin/../conf -DDATA_DIR=/ws/app/bifromq-3.2.2/bin/../data -DBIND_ADDR=172.17.0.1 -Dlogback.configurationFile=/ws/app/bifromq-3.2.2/bin/../conf/logback.xml -Dpf4j.pluginsDir=/ws/app/bifromq-3.2.2/bin/../plugins com.baidu.bifromq.starter.StandaloneStarter -c /ws/app/bifromq-3.2.2/bin/../conf/standalone.yml

popduke commented 1 month ago

The effective configuration printed in info.log shows the internal rpc server discovered an address different from the cluster configuration you passed in; I'm not sure whether that address is reachable between the cluster nodes.

rpcServerConfig:
  host: "172.17.0.1"
  port: 0
  workerThreads: null
  sslConfig: null
baseKVClientConfig:
  workerThreads: 2
  sslConfig: null
baseKVServerConfig:
  host: "172.17.0.1"
  port: 0
  workerThreads: 4
  sslConfig: null

You can raise the log level to debug or trace and see what comes out, or remote debug to trace the path of the memory growth.

flow989 commented 1 month ago

> The effective configuration printed in info.log shows the internal rpc server discovered an address different from the cluster configuration you passed in; I'm not sure whether that address is reachable between the cluster nodes.
>
> rpcServerConfig:
>   host: "172.17.0.1"
>   port: 0
>   workerThreads: null
>   sslConfig: null
> baseKVClientConfig:
>   workerThreads: 2
>   sslConfig: null
> baseKVServerConfig:
>   host: "172.17.0.1"
>   port: 0
>   workerThreads: 4
>   sslConfig: null
>
> You can raise the log level to debug or trace and see what comes out, or remote debug to trace the path of the memory growth.

standalone.yml was posted above; I cannot find host: "172.17.0.1" anywhere in it. Can you help me figure out which part of the config is wrong?

popduke commented 1 month ago

If this address is not set, it is discovered by rule; the documentation explains it: https://bifromq.io/docs/admin_guide/configuration/config_file_manual/ https://bifromq.io/docs/admin_guide/configuration/configs_print/

flow989 commented 1 month ago

For a 3-node cluster deployment (TLS, both bare-host and docker modes), could you provide a complete reference configuration? Just reading the descriptions of the individual options, I can't be sure mine is correct.

popduke commented 1 month ago

> Installed 3.2.2 (ssl) on well-provisioned physical hosts, 3-node cluster; even completely idle it still uses extremely high memory. From logs/gc-2024-08-26_18-40-24.log: [2024-08-27T10:24:12.363+0800][2302][gc] GC(1174) Garbage Collection (Allocation Stall) 22408M(100%)->22320M(100%). Full logs: logs_____node42.tgz; standalone.yml: standalone.tgz

Your environment has multiple NetworkInterfaces, so you need to explicitly set these two internal server addresses in standalone.yml; otherwise the discovered address is the wrong docker bridge address 172.17.0.1:

rpcServerConfig:
  host: 192.168.0.42
baseKVServerConfig:
  host: 192.168.0.42
flow989 commented 1 month ago

> > Installed 3.2.2 (ssl) on well-provisioned physical hosts, 3-node cluster; even completely idle it still uses extremely high memory. From logs/gc-2024-08-26_18-40-24.log: [2024-08-27T10:24:12.363+0800][2302][gc] GC(1174) Garbage Collection (Allocation Stall) 22408M(100%)->22320M(100%). Full logs: logs_____node42.tgz; standalone.yml: standalone.tgz
>
> Your environment has multiple NetworkInterfaces, so you need to explicitly set these two internal server addresses in standalone.yml; otherwise the discovered address is the wrong docker bridge address 172.17.0.1:
>
> rpcServerConfig:
>   host: 192.168.0.42
> baseKVServerConfig:
>   host: 192.168.0.42

My cluster is deployed on 3 physical hosts; I'll add these settings to standalone.yml and keep monitoring.

flow989 commented 1 month ago

> > Installed 3.2.2 (ssl) on well-provisioned physical hosts, 3-node cluster; even completely idle it still uses extremely high memory. From logs/gc-2024-08-26_18-40-24.log: [2024-08-27T10:24:12.363+0800][2302][gc] GC(1174) Garbage Collection (Allocation Stall) 22408M(100%)->22320M(100%). Full logs: logs_____node42.tgz; standalone.yml: standalone.tgz
>
> Your environment has multiple NetworkInterfaces, so you need to explicitly set these two internal server addresses in standalone.yml; otherwise the discovered address is the wrong docker bridge address 172.17.0.1:
>
> rpcServerConfig:
>   host: 192.168.0.42
> baseKVServerConfig:
>   host: 192.168.0.42
>
> My cluster is deployed on 3 physical hosts; I'll add these settings to standalone.yml and keep monitoring.

Added the config but it's the same as before. info.log still shows -DBIND_ADDR=172.17.0.1, and warn.log still logs:

2024-08-28 19:03:02.321 WARN [agent-host-scheduler] --- c.b.b.basecluster.messenger.Gossiper [Gossiper.java:143] Some gossips are too old to spread, confirm gossips in advance, currentPeriod=51561, elapsedPeriod=12

Can you provide a configuration file that you have verified works?

popduke commented 1 month ago

The contents of standalone.yml take priority over -DBIND_ADDR=172.17.0.1. Check the final effective full YAML printed in info.log and confirm whether these two addresses are correct; if they are, is the memory problem you described resolved? The configuration file depends on your deployment environment; our tests simply use the default config, so there is no such thing as a "complete" reference config.

flow989 commented 1 month ago

> The contents of standalone.yml take priority over -DBIND_ADDR=172.17.0.1. Check the final effective full YAML printed in info.log and confirm whether these two addresses are correct; if they are, is the memory problem you described resolved? The configuration file depends on your deployment environment; our tests simply use the default config, so there is no such thing as a "complete" reference config.

I packaged this node's logs directory: logs.tgz. info.log prints 172.... only at the very beginning and not afterwards; warn.log keeps printing [Gossiper.java:137] Too many elapsed periods, sweep...; and gc-2024-08-28_16-11-10.log shows:

[2024-08-28T22:44:04.995+0800][20353][gc] GC(594) Garbage Collection (Allocation Rate) 20006M(89%)->20186M(90%)

(screenshots)

popduke commented 1 month ago

Set the "com.baidu.bifromq" logger to debug level and upload the complete logs of all three nodes.

flow989 commented 1 month ago

Yesterday one of the running machines produced a file: 69G Aug 29 07:33 core.5611

flow989 commented 1 month ago

Just enabled debug, and the debug log ballooned to tens of GB very quickly, spamming:

2024-08-29 09:55:11.942 DEBUG [basekv-client-executor] --- c.b.b.baserpc.ManagedRequestPipeline [ManagedRequestPipeline.java:489] ReqPipeline@1985172618 of linearizedQuery internal stream@1245593290 error: state=Normal
io.grpc.StatusRuntimeException: UNAVAILABLE: server is unreachable now
    at io.grpc.Status.asRuntimeException(Status.java:539)
    at io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onClose(ClientCalls.java:487)
    at io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:576)
    at io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:70)
    at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:757)
    at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:736)
    at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
    at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133)
    at io.micrometer.core.instrument.internal.TimedRunnable.run(TimedRunnable.java:49)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at java.base/java.lang.Thread.run(Thread.java:833)
2024-08-29 09:55:11.942 DEBUG [basekv-client-executor-1] --- c.b.b.baserpc.ManagedRequestPipeline [ManagedRequestPipeline.java:127] ReqPipeline@596737142 of linearizedQuery targeting to server[668d7250-20ac-4a24-9dce-b03c75c68636/1128733310]

The firewall is enabled on all 3 machines, but clusterConfig.port: 28899 is open and mutually reachable.

popduke commented 1 month ago

> > The effective configuration printed in info.log shows the internal rpc server discovered an address different from the cluster configuration you passed in; I'm not sure whether that address is reachable between the cluster nodes.
> >
> > rpcServerConfig:
> >   host: "172.17.0.1"
> >   port: 0
> >   workerThreads: null
> >   sslConfig: null
> > baseKVClientConfig:
> >   workerThreads: 2
> >   sslConfig: null
> > baseKVServerConfig:
> >   host: "172.17.0.1"
> >   port: 0
> >   workerThreads: 4
> >   sslConfig: null
> >
> > You can raise the log level to debug or trace and see what comes out, or remote debug to trace the path of the memory growth.
>
> standalone.yml was posted above; I cannot find host: "172.17.0.1" anywhere in it. Can you help me figure out which part of the config is wrong?

What you opened is the cluster communication port, not the rpc server ports. The documentation states it clearly: port: 0 picks a random available port. Can you confirm that all random ports on that IP are allowed through? https://bifromq.io/docs/admin_guide/configuration/config_file_manual/

flow989 commented 1 month ago

> not the rpc server ports. The documentation states it clearly: port: 0 picks a random available port.

I already set clusterConfig.port: 28899 in conf/standalone.yml; there are still other random ports besides that? So the firewall has to be disabled entirely? (screenshot)

flow989 commented 1 month ago

So rpcServerConfig.port can be set to a fixed port, and a firewall allow rule added for it?
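A minimal sketch of that idea, assuming both internal server ports accept fixed values the same way their host fields do (the field paths come from the effective config quoted above; the port numbers are placeholders):

rpcServerConfig:
  host: 192.168.0.42
  port: 18900       # placeholder fixed port instead of 0 (random), so a firewall allow rule can target it
baseKVServerConfig:
  host: 192.168.0.42
  port: 18901       # placeholder fixed port for the internal basekv server

With fixed ports, each node needs only two extra firewall allow rules instead of opening all ephemeral ports.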

popduke commented 1 month ago

Could you read the documentation carefully before replying? If you'd rather not read the docs, consider paid one-on-one consulting.

flow989 commented 1 month ago

baseKVServerConfig.port isn't mentioned in the official documentation at all.