alibaba / nacos

an easy-to-use dynamic service discovery, configuration and service management platform for building cloud native applications.
https://nacos.io
Apache License 2.0

[必现] nacos启动2小时之后,CPU占用突然达到90%以上 Two hours after nacos started, the CPU usage suddenly reached over 90% #10080

Closed coder-xiaomo closed 1 year ago

coder-xiaomo commented 1 year ago

翻阅之前的issues,发现存在类似的issue如下,但所提及的解决方案未能解决问题,故另开此issue。

Looking through previous issues, I found similar ones (listed below), but the solutions mentioned there did not resolve the problem, so I am opening this new issue.

相关issue: Related issue: #9071 #3903

Describe the bug 问题描述

nacos standalone 模式下启动2小时之后,CPU占用突然达到90%以上

In nacos standalone mode, CPU usage suddenly rises above 90% two hours after startup.

Expected behavior 预期行为

nacos正常运行,CPU处在正常水平。

nacos runs normally and CPU usage stays at a normal level.

Actual behavior 实际行为

启动2小时后,nacos占用了90%以上CPU。

Two hours after startup, nacos occupies more than 90% of the CPU.

How to Reproduce 如何复现

复现步骤:

  1. 使用 nacos 2.1.2 版本,所有配置均为默认配置
  2. 使用 startup.cmd -m standalone 命令启动 nacos
  3. 不做任何操作,等待2小时
  4. CPU直接从 1%以下 升到 90%以上

Steps to reproduce the behavior:

  1. Use nacos 2.1.2 with all default configuration
  2. Start nacos with the startup.cmd -m standalone command
  3. Do not perform any operation and wait for 2 hours
  4. CPU usage jumps directly from below 1% to above 90%

Desktop 使用环境


Additional context 更多信息

是否必现:100% 必现,而且是启动正好2小时后,1秒都不差,非常准时

Reproducible 100% of the time, and it happens exactly two hours after startup, accurate to the second.


Looking through the logs, I found a large number of entries like the following (they have been printed almost continuously since startup):

protocol-raft.log

The first record in this log is at 2023-03-12 12:20:40; by 2023-03-12 15:27:28 the file had already grown to 34.3 MB, consisting almost entirely of ERROR Failed to join the cluster, retry... messages.

完整日志 log file:

https://file-sharing-1302260381.cos.ap-nanjing.myqcloud.com/github/issue-attachment/nacos-20230312/protocol-raft.log

摘录部分日志如下: Excerpts:

2023-03-12 15:27:23,900 ERROR Fail to refresh leader for group : naming_service_metadata, status is : Status[UNKNOWN<-1>: Unknown leader]

2023-03-12 15:27:23,901 ERROR Fail to refresh route configuration for group : naming_service_metadata, status is : Status[UNKNOWN<-1>: Fail to get leader of group naming_service_metadata]

2023-03-12 15:27:24,610 ERROR Failed to join the cluster, retry...

java.lang.IllegalStateException: Fail to get leader of group naming_instance_metadata, Unknown leader
    at com.alipay.sofa.jraft.core.CliServiceImpl.getPeers(CliServiceImpl.java:631)
    at com.alipay.sofa.jraft.core.CliServiceImpl.getPeers(CliServiceImpl.java:524)
    at com.alibaba.nacos.core.distributed.raft.JRaftServer.registerSelfToCluster(JRaftServer.java:353)
    at com.alibaba.nacos.core.distributed.raft.JRaftServer.lambda$createMultiRaftGroup$0(JRaftServer.java:264)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:834)
2023-03-12 15:27:24,798 ERROR Failed to join the cluster, retry...

java.lang.IllegalStateException: Fail to get leader of group naming_persistent_service_v2, Unknown leader
    at com.alipay.sofa.jraft.core.CliServiceImpl.getPeers(CliServiceImpl.java:631)
    at com.alipay.sofa.jraft.core.CliServiceImpl.getPeers(CliServiceImpl.java:524)
    at com.alibaba.nacos.core.distributed.raft.JRaftServer.registerSelfToCluster(JRaftServer.java:353)
    at com.alibaba.nacos.core.distributed.raft.JRaftServer.lambda$createMultiRaftGroup$0(JRaftServer.java:264)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:834)
2023-03-12 15:27:24,798 ERROR Failed to join the cluster, retry...

java.lang.IllegalStateException: Fail to get leader of group naming_service_metadata, Unknown leader
    at com.alipay.sofa.jraft.core.CliServiceImpl.getPeers(CliServiceImpl.java:631)
    at com.alipay.sofa.jraft.core.CliServiceImpl.getPeers(CliServiceImpl.java:524)
    at com.alibaba.nacos.core.distributed.raft.JRaftServer.registerSelfToCluster(JRaftServer.java:353)
    at com.alibaba.nacos.core.distributed.raft.JRaftServer.lambda$createMultiRaftGroup$0(JRaftServer.java:264)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:834)
2023-03-12 15:27:25,706 ERROR Failed to join the cluster, retry...

alipay-jraft.log

This log file is also over 1 MB and consists entirely of entries like the following:

完整日志 log file:

https://file-sharing-1302260381.cos.ap-nanjing.myqcloud.com/github/issue-attachment/nacos-20230312/alipay-jraft.log

摘录部分日志如下: Excerpts:

2023-03-12 15:21:36,020 INFO Node <naming_persistent_service_v2/192.168.2.10:7848> term 1 start preVote.

2023-03-12 15:21:36,020 WARN Node <naming_persistent_service_v2/192.168.2.10:7848> can't do preVote as it is not in conf <ConfigurationEntry [id=LogId [index=1, term=1], conf=192.168.2.27:7848, oldConf=]>.

2023-03-12 15:21:38,528 INFO Node <naming_service_metadata/192.168.2.10:7848> term 1 start preVote.

2023-03-12 15:21:38,528 WARN Node <naming_service_metadata/192.168.2.10:7848> can't do preVote as it is not in conf <ConfigurationEntry [id=LogId [index=1, term=1], conf=192.168.2.27:7848, oldConf=]>.

2023-03-12 15:21:39,658 INFO Node <naming_instance_metadata/192.168.2.10:7848> term 1 start preVote.

2023-03-12 15:21:39,658 WARN Node <naming_instance_metadata/192.168.2.10:7848> can't do preVote as it is not in conf <ConfigurationEntry [id=LogId [index=1, term=1], conf=192.168.2.27:7848, oldConf=]>.

2023-03-12 15:21:41,627 INFO Node <naming_persistent_service_v2/192.168.2.10:7848> term 1 start preVote.

2023-03-12 15:21:41,627 WARN Node <naming_persistent_service_v2/192.168.2.10:7848> can't do preVote as it is not in conf <ConfigurationEntry [id=LogId [index=1, term=1], conf=192.168.2.27:7848, oldConf=]>.

2023-03-12 15:21:43,745 INFO Node <naming_service_metadata/192.168.2.10:7848> term 1 start preVote.

2023-03-12 15:21:43,745 WARN Node <naming_service_metadata/192.168.2.10:7848> can't do preVote as it is not in conf <ConfigurationEntry [id=LogId [index=1, term=1], conf=192.168.2.27:7848, oldConf=]>.

jstack 快照 (jstack snapshot) jstack -l <PID> >> jstackLog.out

https://file-sharing-1302260381.cos.ap-nanjing.myqcloud.com/github/issue-attachment/nacos-20230312/jstackLog.out

Threads with the highest CPU usage in the jstack output:

pool-2-thread-1 pool-3-thread-1 pool-7-thread-1 pool-8-thread-1

There are 25 I/O dispatcher threads in total (from I/O dispatcher 1 to I/O dispatcher 25).

Most of them look the same; only a few representative ones are shown here:

"pool-2-thread-1" #44 prio=5 os_prio=0 cpu=372125.00ms elapsed=9004.37s tid=0x000001cb92cbb000 nid=0x5aa4 runnable  [0x000000a0794ff000]
   java.lang.Thread.State: RUNNABLE
    at sun.nio.ch.WindowsSelectorImpl.resetWakeupSocket0(java.base@11/Native Method)
    at sun.nio.ch.WindowsSelectorImpl.resetWakeupSocket(java.base@11/WindowsSelectorImpl.java:489)
    - locked <0x00000000e1db2438> (a java.lang.Object)
    at sun.nio.ch.WindowsSelectorImpl.doSelect(java.base@11/WindowsSelectorImpl.java:182)
    at sun.nio.ch.SelectorImpl.lockAndDoSelect(java.base@11/SelectorImpl.java:124)
    - locked <0x00000000e1358b78> (a sun.nio.ch.Util$2)
    - locked <0x00000000e1358af0> (a sun.nio.ch.WindowsSelectorImpl)
    at sun.nio.ch.SelectorImpl.select(java.base@11/SelectorImpl.java:136)
    at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor.execute(AbstractMultiworkerIOReactor.java:343)
    at org.apache.http.impl.nio.conn.PoolingNHttpClientConnectionManager.execute(PoolingNHttpClientConnectionManager.java:194)
    at org.apache.http.impl.nio.client.CloseableHttpAsyncClientBase$1.run(CloseableHttpAsyncClientBase.java:64)
    at java.lang.Thread.run(java.base@11/Thread.java:834)

   Locked ownable synchronizers:
    - None

"I/O dispatcher 1" #45 prio=5 os_prio=0 cpu=370593.75ms elapsed=9004.35s tid=0x000001cb92cc9000 nid=0x5600 runnable  [0x000000a0795fe000]
   java.lang.Thread.State: RUNNABLE
    at sun.nio.ch.WindowsSelectorImpl.resetWakeupSocket0(java.base@11/Native Method)
    at sun.nio.ch.WindowsSelectorImpl.resetWakeupSocket(java.base@11/WindowsSelectorImpl.java:489)
    - locked <0x00000000e1df9d60> (a java.lang.Object)
    at sun.nio.ch.WindowsSelectorImpl.doSelect(java.base@11/WindowsSelectorImpl.java:182)
    at sun.nio.ch.SelectorImpl.lockAndDoSelect(java.base@11/SelectorImpl.java:124)
    - locked <0x00000000e13661b8> (a sun.nio.ch.Util$2)
    - locked <0x00000000e1366130> (a sun.nio.ch.WindowsSelectorImpl)
    at sun.nio.ch.SelectorImpl.select(java.base@11/SelectorImpl.java:136)
    at org.apache.http.impl.nio.reactor.AbstractIOReactor.execute(AbstractIOReactor.java:255)
    at org.apache.http.impl.nio.reactor.BaseIOReactor.execute(BaseIOReactor.java:104)
    at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor$Worker.run(AbstractMultiworkerIOReactor.java:591)
    at java.lang.Thread.run(java.base@11/Thread.java:834)

   Locked ownable synchronizers:
    - None
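For reference, per-thread CPU time can also be sampled programmatically instead of reading it off the cpu=... field in the jstack output. Below is a minimal sketch using the standard ThreadMXBean API; the class name and the top-10 cutoff are invented for illustration, and it has to run inside (or be attached via JMX to) the JVM being inspected:

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.Arrays;
import java.util.Comparator;

// Illustrative helper, not part of nacos: prints the threads of the current JVM
// ordered by accumulated CPU time, the same number jstack reports as cpu=...ms.
public class HotThreads {
    public static void main(String[] args) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        if (!mx.isThreadCpuTimeSupported()) {
            System.err.println("Per-thread CPU time is not supported on this JVM");
            return;
        }
        if (!mx.isThreadCpuTimeEnabled()) {
            mx.setThreadCpuTimeEnabled(true);
        }
        Arrays.stream(mx.getAllThreadIds())
              .boxed()
              .sorted(Comparator.comparingLong(mx::getThreadCpuTime).reversed())
              .limit(10)
              .forEach(id -> {
                  ThreadInfo info = mx.getThreadInfo(id);
                  if (info != null) {
                      System.out.printf("%-40s cpu=%dms state=%s%n",
                              info.getThreadName(),
                              mx.getThreadCpuTime(id) / 1_000_000,
                              info.getThreadState());
                  }
              });
    }
}

Running something like this inside the affected JVM should show the same pool-*-thread-* and I/O dispatcher threads at the top of the list.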

Nacos startup output:

D:\nacos\bin>startup.cmd -m standalone
"nacos is starting with standalone"

         ,--.
       ,--.'|
   ,--,:  : |                                           Nacos 2.1.2
,`--.'`|  ' :                       ,---.               Running in stand alone mode, All function modules
|   :  :  | |                      '   ,'\   .--.--.    Port: 8848
:   |   \ | :  ,--.--.     ,---.  /   /   | /  /    '   Pid: 9408
|   : '  '; | /       \   /     \.   ; ,. :|  :  /`./   Console: http://192.168.2.10:8848/nacos/index.html
'   ' ;.    ;.--.  .-. | /    / ''   | |: :|  :  ;_
|   | | \   | \__\/: . ..    ' / '   | .; : \  \    `.      https://nacos.io
'   : |  ; .' ," .--.; |'   ; :__|   :    |  `----.   \
|   | '`--'  /  /  ,.  |'   | '.'|\   \  /  /  /`--'  /
'   : |     ;  :   .'   \   :    : `----'  '--'.     /
;   |.'     |  ,     .-./\   \  /            `--'---'
'---'        `--`---'     `----'

2023-03-12 12:20:36,391 INFO Tomcat initialized with port(s): 8848 (http)

2023-03-12 12:20:36,503 INFO Root WebApplicationContext: initialization completed in 3529 ms

WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by com.alipay.sofa.jraft.util.internal.UnsafeUtil (jar:file:/D:/nacos/target/nacos-server.jar!/BOOT-INF/lib/jraft-core-1.3.8.jar!/) to field java.nio.Buffer.address
WARNING: Please consider reporting this to the maintainers of com.alipay.sofa.jraft.util.internal.UnsafeUtil
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
2023-03-12 12:20:41,923 INFO Adding welcome page: class path resource [static/index.html]

2023-03-12 12:20:42,397 WARN You are asking Spring Security to ignore Ant [pattern='/**']. This is not recommended -- please use permitAll via HttpSecurity#authorizeHttpRequests instead.

2023-03-12 12:20:42,397 INFO Will not secure Ant [pattern='/**']

2023-03-12 12:20:42,446 INFO Will secure any request with [org.springframework.security.web.context.request.async.WebAsyncManagerIntegrationFilter@709f0202, org.springframework.security.web.context.SecurityContextPersistenceFilter@4c19d42e, org.springframework.security.web.header.HeaderWriterFilter@4b862408, org.springframework.security.web.csrf.CsrfFilter@25a94b55, org.springframework.security.web.authentication.logout.LogoutFilter@1150d471, org.springframework.security.web.savedrequest.RequestCacheAwareFilter@65629ac6, org.springframework.security.web.servletapi.SecurityContextHolderAwareRequestFilter@10244722, org.springframework.security.web.authentication.AnonymousAuthenticationFilter@654e6a90, org.springframework.security.web.session.SessionManagementFilter@500051c5, org.springframework.security.web.access.ExceptionTranslationFilter@4d09cade]

2023-03-12 12:20:42,476 INFO Exposing 1 endpoint(s) beneath base path '/actuator'

2023-03-12 12:20:42,576 INFO Tomcat started on port(s): 8848 (http) with context path '/nacos'

2023-03-12 12:20:42,607 INFO Nacos started successfully in stand alone mode. use embedded storage

个人猜测

My guess:

coder-xiaomo commented 1 year ago

刚启动时,CPU占用正常(2023.03.12 12:21):

Right after startup, CPU usage is normal (2023.03.12 12:21):

Screenshot 20230312122111

2小时后,CPU爆满(2023.03.12 14:21):

Two hours later, CPU usage is maxed out (2023.03.12 14:21):

Screenshot 20230312142134

CPU被塞满:

The CPU is completely saturated:

Screenshot 20230312144355

VisualVM截图如下:

The VisualVM screenshots are as follows:

Screenshot 20230312155425

Screenshot 20230312154728

Screenshot 20230312160049

coder-xiaomo commented 1 year ago

Tested with the current latest version, nacos-server-2.2.0.1, and the same problem exists.

The only difference is that in 2.2.0.1 the protocol-raft.log file looks normal and does not contain the exception reported in 2.1.2: java.lang.IllegalStateException: Fail to get leader of group naming_instance_metadata, Unknown leader

image

image

sixsixsix516 commented 1 year ago

Are you also using the default nacos.home when starting in cluster mode? That would make cluster mode and standalone mode share the same home directory. Try setting a different nacos.home and start again:

-Dnacos.home=

KomachiSion commented 1 year ago

It's a JDK bug; the related issue you linked already has a long discussion about it.

Both gRPC and Tomcat use NIO, and the JDK's NIO selector has a bug that makes it loop endlessly without waiting for I/O.

You can try a different JDK version or switch to Linux.
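To illustrate the failure mode: a reactor thread normally blocks in select() until some channel is ready, but when the bug triggers, select() keeps returning immediately with nothing selected, so the loop spins at full CPU, which is exactly what the I/O dispatcher and pool-*-thread-* threads in the jstack output above are doing. Below is a minimal sketch, not nacos or jraft code, of such a loop with the usual defensive workaround of rebuilding the Selector after many consecutive empty wakeups (the class name and threshold are invented for illustration; Netty ships a similar auto-rebuild guard):

import java.io.IOException;
import java.nio.channels.SelectableChannel;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;

// Illustrative sketch only: a reactor loop that detects the "select() returns 0
// immediately" busy spin and replaces the broken Selector with a fresh one.
public class GuardedReactorLoop implements Runnable {

    private static final int SPIN_THRESHOLD = 512; // hypothetical threshold

    private Selector selector;
    private volatile boolean running = true;

    public GuardedReactorLoop() throws IOException {
        this.selector = Selector.open();
    }

    @Override
    public void run() {
        int emptyWakeups = 0;
        while (running) {
            try {
                long start = System.nanoTime();
                int ready = selector.select(1000); // should block for up to 1s
                long elapsedMs = (System.nanoTime() - start) / 1_000_000;

                if (ready == 0 && elapsedMs < 1) {
                    // Returned instantly with nothing ready: the spin that shows
                    // up as 90%+ CPU in the thread dumps above.
                    if (++emptyWakeups >= SPIN_THRESHOLD) {
                        selector = rebuildSelector(selector);
                        emptyWakeups = 0;
                    }
                    continue;
                }
                emptyWakeups = 0;
                selector.selectedKeys().clear(); // a real reactor would dispatch the keys here
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }

    // Re-register every valid channel on a fresh Selector and close the broken one.
    private static Selector rebuildSelector(Selector old) throws IOException {
        Selector fresh = Selector.open();
        for (SelectionKey key : old.keys()) {
            if (key.isValid()) {
                SelectableChannel channel = key.channel();
                channel.register(fresh, key.interestOps(), key.attachment());
            }
            key.cancel();
        }
        old.close();
        return fresh;
    }
}

In this thread the spinning loops live inside the JDK and Apache HttpCore rather than in nacos itself, so the practical fixes discussed below (a different JDK, Linux, or removing the software that triggers the bug) are what actually apply; the sketch is only meant to show why the threads burn CPU without doing any useful work.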

coder-xiaomo commented 1 year ago

@sixsixsix516 I just tested on a Windows machine that never had nacos installed: I unpacked the 2.2.0.1 archive and started it directly with startup.cmd -m standalone, without ever having started it in cluster mode, and the problem still appears, so it should not be caused by cluster startup.

coder-xiaomo commented 1 year ago

@KomachiSion Thanks for the explanation, that makes sense ❤️ Which JDK version is usually used on Windows? I recently tried OpenJDK 11, OpenJDK 8, and JDK 8, and all three show this problem. Also, does this problem only occur on Windows and not on Linux?

zhovy commented 1 year ago

Uninstall HTTPDebug and then try starting it again.

KomachiSion commented 1 year ago

@KomachiSion Thanks for the explanation, that makes sense ❤️ Which JDK version is usually used on Windows? I recently tried OpenJDK 11, OpenJDK 8, and JDK 8, and all three show this problem. Also, does this problem only occur on Windows and not on Linux?

Linux is not affected; only Windows is. You can search the issues and filter with label = os/windows; they should be easy to find.

coder-xiaomo commented 1 year ago

@zhovy I just uninstalled HTTPDebug and tested with two JDKs, OpenJDK 11 and Alibaba Dragonwell Extended Edition 11; the problem still occurs, so it does not seem to be caused by HTTPDebug.

coder-xiaomo commented 1 year ago

@zhovy I just uninstalled HTTPDebug and tested with two JDKs, OpenJDK 11 and Alibaba Dragonwell Extended Edition 11; the problem still occurs, so it does not seem to be caused by HTTPDebug.

@zhovy Sorry, my earlier test was not rigorous: I uninstalled it but did not reboot. After uninstalling HTTPDebug and restarting Windows, everything works normally.

coder-xiaomo commented 1 year ago

【做个总结 Conclusion】

翻之前的issue,发现这个问题好像被提到了挺多次,这里汇总一下related issue,方便遇到同样问题的人快速检索到: Looking over previous issues, I found that this problem seems to have been mentioned quite a few times. Here is a summary of the related issues so that anyone who runs into the same problem can find them quickly:

Related Issues: #9099 #9071 #3903 #3854 #3557 #1936 (#1987)

今天在 Ubuntu 22.04 LTS 桌面版用 OpenJDK 11 测试正常! 猜测 Linux 下其他 JDK 版本也应该没问题。 Tested today on Ubuntu 22.04 LTS Desktop with OpenJDK 11 and everything works fine! Presumably other JDK versions on Linux are also fine.

按照上面 @zhovy 的描述, 卸载 HTTPDebug(Pro) 后重启即可解决问题 As described by @zhovy above, uninstalling HTTPDebug(Pro) and restarting will fix the problem.

总结来说还是 Windows 系统下面安装 HTTPDebug(Pro) 触发了 JDK 的 Bug(或许更换其他版本 JDK 能够解决问题,但是这段时间来看我尝试了好几个市面上的 JDK 版本都不行)。 Linux 下面一切正常,如果大家的项目必须在 Windows 下运行,或许也可以尝试单独弄一台 Linux 服务器或虚拟机跑 nacos。 In summary, installing HTTPDebug(Pro) on Windows triggers a bug in the JDK (perhaps a different JDK version could avoid it, but I have tried several JDK builds over the past few days and none of them helped). Everything is fine on Linux, so if your project must run on Windows, you might try running nacos on a separate Linux server or virtual machine.

关联的 JDK Bug: Related JDK Bug: https://bugs.java.com/bugdatabase/view_bug.do?bug_id=JDK-8255627 https://issues.apache.org/jira/browse/HTTPCORE-650

其实最开始,我以为这个问题是 startup.cmd / startup.sh 中的内存参数配置的问题,但现在来看其实并不是因为分配的内存大小所导致的。 At first, I thought it was a memory parameter configuration problem in startup.cmd / startup.sh, but it turns out it's not due to the amount of memory allocated.

另外,除了系统的问题,或许可能还有一些错误使用导致的CPU跑满,比如与 Druid 集成遇到的问题: Also, apart from OS-level issues, CPU saturation can sometimes be caused by incorrect usage, for example this problem encountered when integrating with Druid: https://www.cnblogs.com/FlyAway2013/p/10818589.html

感谢大佬们的解答! Thank you for your answers!

hiasenddup commented 3 weeks ago

Does HTTPDebug really have that many bugs? Besides nacos, it also causes other applications to inexplicably use 90% of the CPU.