alibaba / nacos

an easy-to-use dynamic service discovery, configuration and service management platform for building cloud native applications.
https://nacos.io
Apache License 2.0
30.11k stars 12.81k forks source link

Java程序已停止,在Nacos的服务详情中仍存在 #4007

Closed supermareo closed 3 years ago

supermareo commented 3 years ago

Describe the bug 测试服务器使用了 单机版 NacosServer 1.2.1,本地电脑(Win10)上启动Java程序并注册到Nacos的服务列表中(见下图红框),测试完成后,停止该Java程序,后来发现Nacos控制台的服务详情中该服务仍然存在,且是上线状态。后升级NacosServer版本至最新 1.3.2, 问题仍存在。 image

首先怀疑是Java程序没有真正关掉,通过任务管理器与命令查看,确认本地电脑上并无任何Java进程在运行。为了确保本地Java程序确实被关闭,重启本地电脑。重启后查看NacosServer控制台,该服务仍然存在且是上线状态。

之后检查部署NacosServer的Centos服务器,过滤连接到NacosServer的IP信息。由下图可以看出,172.25.240.12(本地电脑的IP)并没有到NacosServer的连接。 image

Expected behavior Java程序停止后,服务详情中应该不再有该服务的信息

Acutally behavior Java程序停止后,该服务仍然在服务详情中,且无法删除。重启NacosServer后仍然存在。

How to Reproduce 该问题非必现,操作流程在上面的 Describe the bug 部分已描述。

Desktop (please complete the following information):

Additional context Add any other context about the problem here.

KomachiSion commented 3 years ago

I can't reproduce your problem.

Please check nacos naming-server and naming-event logs to check whether any error or exception.

And you can use curl -X PUT 'ip:port/nacos/v1/ns/operator/logs?logName=naming-event&logLevel=DEBUG' and check naming-event log to make sure no heartbeat from this service and instance.

supermareo commented 3 years ago

I can't reproduce your problem.

Please check nacos naming-server and naming-event logs to check whether any error or exception.

And you can use curl -X PUT 'ip:port/nacos/v1/ns/operator/logs?logName=naming-event&logLevel=DEBUG' and check naming-event log to make sure no heartbeat from this service and instance.

name-server.log

找到一条相关的WARN级别日志,没有Error与Exception相关日志

2020-10-16 16:14:23,128 WARN cluster: DEFAULT not found, ip: {"app":"unknown","clusterName":"DEFAULT","enabled":true,"ephemeral":true,"healthy":true,"instanceHeartBeatInterval":5000,"instanceHeartBeatTimeOut":15000,"instanceId":"172.25.240.12#10010#DEFAULT#DEFAULT_GROUP@@konke-iot-cloud-api","instanceIdGenerator":"","ip":"172.25.240.12","ipDeleteTimeout":30000,"lastBeat":1602836063105,"marked":false,"metadata":{"appVersion":"v1.3.0.RELEASE","appDesc":"设备云平台-API服务","preserved.register.source":"SPRING_CLOUD","appStartAt":"2020-10-16 11:56:30.695"},"port":10010,"serviceName":"DEFAULT_GROUP@@konke-iot-cloud-api","tenant":"","weight":1.0}, will create new cluster with default configuration.

name-event.log

没有异常日志

ip:port/nacos/v1/ns/operator/logs?logName=naming-event&logLevel=DEBUG

接口报错,见下图 image

name-server与naming-event今天的完整日志见附件

naming-server.log naming-event.log

KomachiSion commented 3 years ago

curl -X PUT 'ip:port/nacos/v1/ns/operator/log?logName=naming-event&logLevel=DEBUG'

Sorry I add more s in api.

Please use this api open debug log for naming-event and check whether receive heartbeat of this instance.

Other way to check whether receive heartbeat of this instance is use curl -X GET 'ip:port/nacos/v1/ns/catalog/instances?serviceName=&pageSize=10&pageNo=1&namespaceId= and check the lastBeat changes?

KomachiSion commented 3 years ago

@supermareo Any process?

supermareo commented 3 years ago

@KomachiSion 暂时复现不出来了,等下次出现的时候我再排查下

KomachiSion commented 3 years ago

那我先关闭issue了。 复现的时候记得打开日志或者用openAPI来确认心跳时间,最好能够知道如何复现问题。

supermareo commented 3 years ago

@KomachiSion 你好,这个问题又出现了,还是172.25.240.12这个ip,这段时间NacosServer没有重启过。目前我电脑的ip已经变更了,联系运维查看当前172.25.240.12这个ip被分配给了一台手机,所以可以确定并没有启动Java程序。

我按照你上面的指导,修改了日志的级别,curl -X PUT 'http://172.25.240.37:8848/nacos/v1/ns/operator/log?logName=naming-event&logLevel=DEBUG',过滤了一下日志(tail -f naming-event.log | grep 172.25.240.12),打印如下:

2020-10-27 15:46:44,722 DEBUG [CLIENT-BEAT] processing beat: {"load":0.0,"cpu":0.0,"rt":0.0,"qps":0.0,"mem":0.0,"port":10011,"ip":"172.25.240.12","cluster":"DEFAULT","weight":0.0,"ephemeral":true}
2020-10-27 15:46:44,722 DEBUG [CLIENT-BEAT] refresh beat: {"load":0.0,"cpu":0.0,"rt":0.0,"qps":0.0,"mem":0.0,"port":10011,"ip":"172.25.240.12","cluster":"DEFAULT","weight":0.0,"ephemeral":true}
2020-10-27 15:46:45,295 DEBUG [CLIENT-BEAT] processing beat: {"load":0.0,"cpu":0.0,"rt":0.0,"qps":0.0,"mem":0.0,"port":10010,"ip":"172.25.240.12","cluster":"DEFAULT","weight":0.0,"ephemeral":true}
2020-10-27 15:46:45,295 DEBUG [CLIENT-BEAT] refresh beat: {"load":0.0,"cpu":0.0,"rt":0.0,"qps":0.0,"mem":0.0,"port":10010,"ip":"172.25.240.12","cluster":"DEFAULT","weight":0.0,"ephemeral":true}
2020-10-27 15:46:49,734 DEBUG [CLIENT-BEAT] processing beat: {"load":0.0,"cpu":0.0,"rt":0.0,"qps":0.0,"mem":0.0,"port":10011,"ip":"172.25.240.12","cluster":"DEFAULT","weight":0.0,"ephemeral":true}
2020-10-27 15:46:49,734 DEBUG [CLIENT-BEAT] refresh beat: {"load":0.0,"cpu":0.0,"rt":0.0,"qps":0.0,"mem":0.0,"port":10011,"ip":"172.25.240.12","cluster":"DEFAULT","weight":0.0,"ephemeral":true}
2020-10-27 15:46:50,297 DEBUG [CLIENT-BEAT] processing beat: {"load":0.0,"cpu":0.0,"rt":0.0,"qps":0.0,"mem":0.0,"port":10010,"ip":"172.25.240.12","cluster":"DEFAULT","weight":0.0,"ephemeral":true}
2020-10-27 15:46:50,297 DEBUG [CLIENT-BEAT] refresh beat: {"load":0.0,"cpu":0.0,"rt":0.0,"qps":0.0,"mem":0.0,"port":10010,"ip":"172.25.240.12","cluster":"DEFAULT","weight":0.0,"ephemeral":true}

我不太清楚这段日志的含义,是否是指我在172.25.240.12这个ip上启动的服务在定时上报心跳。所以我按照我的思路,进行了一些排查。

  1. 首先是再次确认运行NacosServer的服务器是否与172.25.240.12这个ip有连接,确认并没有连接。 image

  2. 在NacosServer所在服务器上通过iptables禁用172.25.240.12这个ip image 再次查看日志,仍然和之前一样有心跳日志,并且界面上仍有这个服务。理论上这个时候两个服务之间的连接应该断开了。 image

不知道这些信息能不能帮助定位,如果有其它需要提供的内容,麻烦告诉我,谢谢。

mx-go commented 3 years ago

1.2.1遇到同样的问题

niejian commented 3 years ago

focus

yydzxz commented 8 months ago

同样的问题 日志: failed to delete ip automatically, ip: {"instanceId":"10.32.0.188#8080#DEFAULT#DEFAULT_GROUP@@saas-center-identity","ip":"10.32.0.188","port":8080,"weight":1.0,"healthy":false,"enabled":true,"ephemeral":true,"clusterName":"DEFAULT","serviceName":"DEFAULT_GROUP@@saas-center-identity","metadata":{"preserved.register.source":"SPRING_CLOUD"},"lastBeat":1706581703587,"marked":false,"app":"DEFAULT","instanceHeartBeatInterval":5000,"instanceHeartBeatTimeOut":15000,"ipDeleteTimeout":30000}, error: {}

yydzxz commented 8 months ago

nacos集群一共3个实例: nacos-server-thirdapp-nacos-server-ha-extdb-0 nacos-server-thirdapp-nacos-server-ha-extdb-1 nacos-server-thirdapp-nacos-server-ha-extdb-2

其中nacos-server-thirdapp-nacos-server-ha-extdb-2与另外两台实例断开了(但不是网络导致的,具体看下面)

nacos-server-thirdapp-nacos-server-ha-extdb-2的部分日志如下:

748511:2024-01-30 10:49:47,256 WARN [STATUS-SYNCHRONIZE] failed to request serviceStatus, remote server: nacos-server-thirdapp-nacos-server-ha-extdb-0.nacos-hs.uat.svc.cluster.local:8848
748534:2024-01-30 10:49:47,256 WARN [STATUS-SYNCHRONIZE] failed to request serviceStatus, remote server: nacos-server-thirdapp-nacos-server-ha-extdb-1.nacos-hs.uat.svc.cluster.local:8848
2024-01-30 10:49:48,846 ERROR [IP-DEAD] failed to delete ip automatically, ip: {"instanceId":"10.32.1.227#8080#DEFAULT#DEFAULT_GROUP@@saas-center-user","ip":"10.32.1.227","port":8080,"weight":1.0,"healthy":false,"enabled":false,"ephemeral":true,"clusterName":"DEFAULT","serviceName":"DEFAULT_GROUP@@saas-center-user","metadata":{"preserved.register.source":"SPRING_CLOUD"},"lastBeat":1706495234710,"marked":false,"app":"DEFAULT","instanceHeartBeatInterval":5000,"instanceHeartBeatTimeOut":15000,"ipDeleteTimeout":30000}, error: {}

java.lang.IllegalStateException: Request cannot be executed; I/O reactor status: STOPPED
        at org.apache.http.util.Asserts.check(Asserts.java:46)
        at org.apache.http.impl.nio.client.CloseableHttpAsyncClientBase.ensureRunning(CloseableHttpAsyncClientBase.java:90)
        at org.apache.http.impl.nio.client.InternalHttpAsyncClient.execute(InternalHttpAsyncClient.java:123)
        at org.apache.http.impl.nio.client.CloseableHttpAsyncClient.execute(CloseableHttpAsyncClient.java:75)
        at org.apache.http.impl.nio.client.CloseableHttpAsyncClient.execute(CloseableHttpAsyncClient.java:108)
        at org.apache.http.impl.nio.client.CloseableHttpAsyncClient.execute(CloseableHttpAsyncClient.java:92)
        at com.alibaba.nacos.common.http.client.request.DefaultAsyncHttpClientRequest.execute(DefaultAsyncHttpClientRequest.java:52)
        at com.alibaba.nacos.common.http.client.NacosAsyncRestTemplate.execute(NacosAsyncRestTemplate.java:364)
        at com.alibaba.nacos.common.http.client.NacosAsyncRestTemplate.delete(NacosAsyncRestTemplate.java:105)
        at com.alibaba.nacos.naming.misc.HttpClient.asyncHttpRequest(HttpClient.java:195)
        at com.alibaba.nacos.naming.misc.HttpClient.asyncHttpDelete(HttpClient.java:159)
        at com.alibaba.nacos.naming.healthcheck.ClientBeatCheckTask.deleteIp(ClientBeatCheckTask.java:144)
        at com.alibaba.nacos.naming.healthcheck.ClientBeatCheckTask.run(ClientBeatCheckTask.java:122)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)

nacos-server-thirdapp-nacos-server-ha-extdb-2 调用其他实例nacos/v1/ns/service/status接口的时候报错了, 但是抓取参数,通过curl调用是成功的. 最终找到错误原因如下:

Request cannot be executed; I/O reactor status: STOPPED的错误原因 https://blog.csdn.net/zhaopeng_yu/article/details/118763153

@KomachiSion

yydzxz commented 8 months ago

我猜测,nacos页面上, 服务健康状态一会儿true,一会儿false,也是这个导致的. 因为发现实例挂掉的nacos节点,无法把实例状态同步给其他nacos实例.