alibaba / nacos

an easy-to-use dynamic service discovery, configuration and service management platform for building cloud native applications.
https://nacos.io
Apache License 2.0
30.31k stars 12.85k forks

Nacos 2.1.0: services go offline after a configuration change #8464

Closed Johnson-Jia closed 2 years ago

Johnson-Jia commented 2 years ago

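Bug symptom: Dubbo providers disappear for no apparent reason.

Versions: Dubbo 2.7.14, dubbo-spring-boot-starter 2.7.14, nacos-server 2.1.0, nacos-client 2.1.0

Problem description: a Nacos cluster of 6 nodes backed by a MySQL database (Nacos serves as both the configuration center and the Dubbo registry), with 500+ Dubbo application instances.

After I modified the common properties configuration shared by all applications in the Nacos configuration center, the provider entries for the Dubbo applications' service interfaces disappeared one by one.

After the nacos-server nodes were restarted one at a time, the Dubbo provider interface instances returned to normal. After a further period of observation (20-50 minutes), they again disappeared one by one.

After all nacos-server nodes were killed and then all started again, everything returned to normal and the providers no longer disappeared.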

KomachiSion commented 2 years ago

Check whether the providers' connections have all been dropped, or whether there are any error logs.

Johnson-Jia commented 2 years ago

My guess is that this was triggered: "true disconnect, remove instances and subscribers", after which the services were cleaned up: "services are automatically cleaned".

Johnson-Jia commented 2 years ago

WARN [NamingServerHttpClientManager] Start destroying HTTP-Client
2022-05-25 23:36:43,718 INFO Client connection 1653492441739_1xx.xx.x.8_55154 disconnect, remove instances and subscribers
2022-05-25 23:36:43,718 INFO Client connection 1653492262578_1xx.xx.x.1_62171 disconnect, remove instances and subscribers
2022-05-25 23:36:43,719 INFO Client connection 1653492441522_1xx.xx.x.5_30379 disconnect, remove instances and subscribers
2022-05-25 23:36:43,719 INFO Client connection 1653492441558_1xx.xx.x.7_39498 disconnect, remove instances and subscribers

WARN [NamingServerHttpClientManager] Destruction of the end
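
When many of these entries appear at once, it can help to tally them per client address to see whether the disconnects are spread across all providers or concentrated on a few. The following is a minimal sketch using only the Java standard library. It assumes, based solely on the log lines above, that a connection ID has the shape <timestamp>_<client IP>_<client port>; the class and method names are hypothetical and not part of Nacos.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DisconnectLogCounter {
    // Assumption: connection IDs look like <epochMillis>_<clientIp>_<clientPort>,
    // inferred from the log excerpt in this thread, not from Nacos documentation.
    private static final Pattern DISCONNECT = Pattern.compile(
            "Client connection (\\d+)_([^_\\s]+)_(\\d+) disconnect, remove instances and subscribers");

    /** Counts disconnect events per client IP, keeping first-seen order. */
    public static Map<String, Integer> countByIp(List<String> logLines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : logLines) {
            Matcher m = DISCONNECT.matcher(line);
            if (m.find()) {
                counts.merge(m.group(2), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Sample lines with made-up addresses, in the same format as the excerpt.
        List<String> lines = Arrays.asList(
                "2022-05-25 23:36:43,718 INFO Client connection 1653492441739_10.0.0.8_55154 disconnect, remove instances and subscribers",
                "2022-05-25 23:36:43,719 INFO Client connection 1653492441522_10.0.0.8_30379 disconnect, remove instances and subscribers",
                "2022-05-25 23:36:43,719 INFO Client connection 1653492441558_10.0.0.7_39498 disconnect, remove instances and subscribers");
        countByIp(lines).forEach((ip, n) -> System.out.println(ip + " -> " + n));
    }
}
```

In practice the input would be the lines of the server's naming log rather than a hard-coded list.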

Johnson-Jia commented 2 years ago

Is this error relevant?

2022-05-25 23:36:45,001 ERROR [UPDATE-DOMAIN] Exception while taking item from LinkedBlockingDeque.

Johnson-Jia commented 2 years ago

2022-05-25 23:36:46,231 ERROR [NACOS-DISTRO] Error while handling notifying task

java.lang.InterruptedException: null
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2048)
    at java.util.concurrent.ArrayBlockingQueue.take(ArrayBlockingQueue.java:403)
    at com.alibaba.nacos.naming.consistency.ephemeral.distro.DistroConsistencyServiceImpl$Notifier.run(DistroConsistencyServiceImpl.java:412)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
2022-05-25 23:37:11,652 INFO distro notifier started

Johnson-Jia commented 2 years ago

2022-05-25 23:32:01,748 ERROR [PUSH-FAIL] 5000ms, Service{namespace='09abac76-bfa3-41a2-95bc-9f1730078ab9', group='DEFAULT_GROUP', name='providers:com.xx.xx.xx.SystemSettingService:1.0:', ephemeral=true, revision=116}, reason=Timeout After 5000 milliseconds,requestId =187749, target=1.xx.xx.224

2022-05-25 23:32:01,748 ERROR Reason detail:

java.util.concurrent.TimeoutException: Timeout After 5000 milliseconds,requestId =187749
    at com.alibaba.nacos.api.remote.DefaultRequestFuture$TimeoutHandler.run(DefaultRequestFuture.java:194)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

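Is the problem related to this error?
The Dubbo applications are deployed inside k8s containers and Nacos is deployed outside the containers. Does nacos-server actively connect to the Dubbo applications, time out, and then remove the services?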

Johnson-Jia commented 2 years ago

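A large number of Dubbo services were just released in production. After the release, half of the provider service interfaces disappeared. After all nacos-server nodes were killed and restarted, the number of provider service interfaces returned to normal. This was just observed in production.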

KomachiSion commented 2 years ago

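This clearly points to a possible network problem between your providers and nacos-server, or a problem in the providers themselves, that caused the connections to drop. Once a connection drops, the server removes the instances.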

Johnson-Jia commented 2 years ago

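> This clearly points to a possible network problem between your providers and nacos-server, or a problem in the providers themselves, that caused the connections to drop. Once a connection drops, the server removes the instances.

I have tracked down the errors above: the Dubbo services inside the k8s containers were redeployed, so the original IPs no longer existed and could not be reached. What I do not understand is why the newly started Dubbo services inside the k8s containers were also removed. See my previous comment for details: that evening, after the release finished, a large number of providers disappeared again, but killing all nacos-server nodes and then starting them all again fixed it.

In particular, providers get removed every time a large number of Dubbo services are released.

Operations performed during a release:
1. Modify the configuration in the Nacos configuration center.
2. Restart each Dubbo service. (The services are deployed in k8s containers and register with nacos-server under newly generated IPs; on the network side they are in the same internal subnet.)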

KomachiSion commented 2 years ago

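There are two possibilities:
1. After the configuration change, the provider hit a bug or exception and its connection to nacos-server dropped.
2. The nacos-server parameters are not configured sensibly, so that under heavy subscription or a large volume of registrations and deregistrations the cluster suffers many Full GCs or OOMs, which causes connections to drop.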

Johnson-Jia commented 2 years ago

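The first possibility is more likely. The second does not apply: the current nacos-server memory settings are -Xms10g -Xmx10g -Xmn6g, and I found no large number of Full GCs or OOMs in the Nacos logs.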

KomachiSion commented 2 years ago

> WARN [NamingServerHttpClientManager] Start destroying HTTP-Client
> 2022-05-25 23:36:43,718 INFO Client connection 1653492441739_1xx.xx.x.8_55154 disconnect, remove instances and subscribers
> 2022-05-25 23:36:43,718 INFO Client connection 1653492262578_1xx.xx.x.1_62171 disconnect, remove instances and subscribers
> 2022-05-25 23:36:43,719 INFO Client connection 1653492441522_1xx.xx.x.5_30379 disconnect, remove instances and subscribers
> 2022-05-25 23:36:43,719 INFO Client connection 1653492441558_1xx.xx.x.7_39498 disconnect, remove instances and subscribers
>
> WARN [NamingServerHttpClientManager] Destruction of the end

Your application has already invoked its shutdownHook.
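
For context, a JVM shutdown hook is a thread that the runtime starts when the process exits normally or receives a termination signal such as SIGTERM (it does not run after kill -9). Paired "start destroying" and "destruction finished" messages are the kind of output such a hook typically produces. The sketch below is a generic illustration with the standard `Runtime.addShutdownHook` API, not Nacos code; `ShutdownHookDemo`, `registerHook`, and the printed messages are made up for the example.

```java
public class ShutdownHookDemo {
    /** Registers a cleanup task that the JVM runs when the process shuts down. */
    public static Thread registerHook(Runnable cleanup) {
        Thread hook = new Thread(cleanup, "demo-shutdown-hook");
        Runtime.getRuntime().addShutdownHook(hook);
        return hook;
    }

    public static void main(String[] args) {
        registerHook(() -> {
            System.out.println("Start destroying resources");
            // A real application would close connections and thread pools here.
            System.out.println("Destruction finished");
        });
        System.out.println("main is returning; the JVM will now run the hook");
    }
}
```

Seeing such messages in a log therefore suggests that the process writing the log was being stopped at that moment.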

KomachiSion commented 2 years ago

> 2022-05-25 23:32:01,748 ERROR [PUSH-FAIL] 5000ms, Service{namespace='09abac76-bfa3-41a2-95bc-9f1730078ab9', group='DEFAULT_GROUP', name='providers:com.xx.xx.xx.SystemSettingService:1.0:', ephemeral=true, revision=116}, reason=Timeout After 5000 milliseconds,requestId =187749, target=1.xx.xx.224
>
> 2022-05-25 23:32:01,748 ERROR Reason detail:
>
> java.util.concurrent.TimeoutException: Timeout After 5000 milliseconds,requestId =187749
>     at com.alibaba.nacos.api.remote.DefaultRequestFuture$TimeoutHandler.run(DefaultRequestFuture.java:194)
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at java.lang.Thread.run(Thread.java:745)
>
> Is the problem related to this error? The Dubbo applications are deployed inside k8s containers and Nacos is deployed outside the containers. Does nacos-server actively connect to the Dubbo applications, time out, and then remove the services?

That is a data push. It shows that your Dubbo consumer is already in an abnormal state.
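
To make the PUSH-FAIL entry concrete: the server sends service data to a subscribed client and expects an acknowledgement, and the log shows that none arrived within 5000 ms. The sketch below mimics that pattern with a plain `CompletableFuture`. It is only an illustration under that assumption; the stack trace suggests Nacos uses a scheduled timeout handler rather than a blocking wait, and `PushAckTimeoutDemo` and `awaitAck` are hypothetical names.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class PushAckTimeoutDemo {
    /** Returns true only if the acknowledgement arrives within timeoutMillis. */
    public static boolean awaitAck(CompletableFuture<String> ack, long timeoutMillis) {
        try {
            ack.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            // No response in time: this is the situation a push-failure log reports.
            return false;
        } catch (ExecutionException e) {
            return false; // the client answered with an error
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt status
            return false;
        }
    }

    public static void main(String[] args) {
        CompletableFuture<String> healthyClient = CompletableFuture.completedFuture("ack");
        CompletableFuture<String> stuckClient = new CompletableFuture<>(); // never completes
        System.out.println("healthy client acked: " + awaitAck(healthyClient, 100));
        System.out.println("stuck client acked: " + awaitAck(stuckClient, 100));
    }
}
```

A subscriber that is overloaded, stopped, or no longer reachable at its old address behaves like the future that never completes.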

Johnson-Jia commented 2 years ago

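Personally, I still think there is a problem in nacos-server. Could you run a simulated test? With a large number (say 300+) of configurations in the configuration center (properties format) and a large number (say 300+) of applications registering Dubbo services, modify the common configuration file several times, restart some of the services, then call the relevant interfaces and observe the result.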

KomachiSion commented 2 years ago

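nacos-server has been through large-scale stress testing: 100,000 services, with 5,000 to 10,000 service changes per second, and pushes did not time out. A push timeout generally occurs when the client has a problem and cannot process the pushed data packets, so no response comes back. I still recommend investigating the client.

Of course, you can also check whether the server has Full GCs, CPU contention, or similar problems.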
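
One way to look at garbage-collection activity from Java is the standard `java.lang.management` API. The sketch below prints per-collector counts and accumulated times for the JVM it runs in. It is an assumption-laden illustration: inspecting a running nacos-server would require attaching over JMX or reading its GC logs instead, and `GcSnapshot` and `totalCollections` are hypothetical names.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcSnapshot {
    /** Sums the collection counts of all collectors in the current JVM. */
    public static long totalCollections() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount(); // -1 when the collector does not report it
            if (count > 0) {
                total += count;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            // Collector names depend on the JVM and the GC algorithm in use.
            System.out.printf("%s: count=%d, timeMs=%d%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
        System.out.println("total collections: " + totalCollections());
    }
}
```

Sampling these counters twice, some minutes apart, shows whether old-generation collections are frequent during the window in which providers disappear.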

KomachiSion commented 2 years ago

No more response from author, and community can't reproduce this problem.

lihuawei321 commented 10 months ago

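> WARN [NamingServerHttpClientManager] Start destroying HTTP-Client
> 2022-05-25 23:36:43,718 INFO Client connection 1653492441739_1xx.xx.x.8_55154 disconnect, remove instances and subscribers
> 2022-05-25 23:36:43,718 INFO Client connection 1653492262578_1xx.xx.x.1_62171 disconnect, remove instances and subscribers
> 2022-05-25 23:36:43,719 INFO Client connection 1653492441522_1xx.xx.x.5_30379 disconnect, remove instances and subscribers
> 2022-05-25 23:36:43,719 INFO Client connection 1653492441558_1xx.xx.x.7_39498 disconnect, remove instances and subscribers
>
> WARN [NamingServerHttpClientManager] Destruction of the end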

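I ran into this problem too, with a 2.x client and server. Does this mean that a large number of service instances are going offline? Does it have any impact?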