alibaba / nacos

an easy-to-use dynamic service discovery, configuration and service management platform for building cloud native applications.
https://nacos.io
Apache License 2.0
30.01k stars 12.8k forks

Standalone mode: service startup fails with JRaft error "No leader for raft group naming_persistent_service" #12504

Closed damonj closed 1 week ago

damonj commented 4 weeks ago

After upgrading from 2.4.0.1 to 2.4.1, the service reports the following error at startup:

Caused by: com.alibaba.nacos.api.exception.NacosException: failed to req API:/api//nacos/v1/ns/instance after all servers([...:6001]) tried: server is DOWNnow, detailed error message: Optional[No leader for raft group naming_persistent_service, please see logs alipay-jraft.log or naming-raft.log to see details.]
    at com.alibaba.nacos.client.naming.net.NamingProxy.reqAPI(NamingProxy.java:496)
    at com.alibaba.nacos.client.naming.net.NamingProxy.reqAPI(NamingProxy.java:401)
    at com.alibaba.nacos.client.naming.net.NamingProxy.reqAPI(NamingProxy.java:397)
    at com.alibaba.nacos.client.naming.net.NamingProxy.registerService(NamingProxy.java:212)
    at com.alibaba.nacos.client.naming.NacosNamingService.registerInstance(NacosNamingService.java:207)
    at com.alibaba.cloud.nacos.registry.NacosServiceRegistry.register(NacosServiceRegistry.java:64)

KomachiSion commented 3 weeks ago

Check the corresponding logs to see why no leader was elected.

KomachiSion commented 3 weeks ago

Has the local IP changed? Raft metadata is persisted, so if the previous IP is no longer reachable, leader election cannot complete.

karsonto commented 3 weeks ago

You could try deleting the nacos data folder under user.home and then starting again.
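A cautious variant of this step is to move the directory aside rather than delete it, so the old raft metadata can still be inspected or restored. The sketch below is only an illustration: the function name `move_data_dir_aside` is made up, and the location `~/nacos/data` mentioned in the note is an assumption based on the suggestion above. Check your own installation before running anything like it, and stop Nacos first.

```python
import shutil
import time
from pathlib import Path


def move_data_dir_aside(data_dir: Path) -> Path:
    """Rename data_dir to a timestamped backup next to it and return the new path.

    Hypothetical helper: it does not stop or restart Nacos.
    """
    if not data_dir.is_dir():
        raise FileNotFoundError(f"data directory not found: {data_dir}")
    stamp = time.strftime("%Y%m%d-%H%M%S")
    backup = data_dir.with_name(f"{data_dir.name}.bak-{stamp}")
    # shutil.move renames in place when source and target share a filesystem.
    shutil.move(str(data_dir), str(backup))
    return backup
```

For example, `move_data_dir_aside(Path.home() / "nacos" / "data")` would apply it to the assumed default location; the server recreates the directory on the next start.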

damonj commented 3 weeks ago

Check the corresponding logs to see why no leader was elected.

2024-08-16 21:54:56,140 INFO Initializes the Raft protocol, raft-config info : {"data":{},"members":["10.1.1.2:5004"],"selfMember":"10.1.1.2:5004"}

2024-08-16 21:54:57,459 INFO ========= The raft protocol is starting... =========

2024-08-16 21:54:58,832 INFO ========= The raft protocol start finished... =========

2024-08-16 21:55:03,130 INFO create raft group : naming_persistent_service

2024-08-16 21:55:04,452 INFO This Raft event changes : RaftEvent{groupId='naming_persistent_service', leader='10.1.1.2:5004', term=1, raftClusterInfo=[10.1.1.2:5004]}

2024-08-16 21:55:04,567 INFO create raft group : naming_persistent_service_v2

2024-08-16 21:55:04,891 INFO create raft group : naming_instance_metadata

2024-08-16 21:55:04,928 INFO This Raft event changes : RaftEvent{groupId='naming_persistent_service_v2', leader='10.1.1.2:5004', term=1, raftClusterInfo=[10.1.1.2:5004]}

2024-08-16 21:55:05,262 INFO This Raft event changes : RaftEvent{groupId='naming_instance_metadata', leader='10.1.1.2:5004', term=1, raftClusterInfo=[10.1.1.2:5004]}

2024-08-16 21:55:05,267 INFO create raft group : naming_service_metadata

2024-08-16 21:55:05,552 INFO This Raft event changes : RaftEvent{groupId='naming_service_metadata', leader='10.1.1.2:5004', term=1, raftClusterInfo=[10.1.1.2:5004]}

2024-08-16 21:55:05,780 ERROR Failed to join the cluster, retry...

java.lang.IllegalStateException: Fail to get leader of group naming_persistent_service
    at com.alipay.sofa.jraft.core.CliServiceImpl.getPeers(CliServiceImpl.java:605)
    at com.alipay.sofa.jraft.core.CliServiceImpl.getPeers(CliServiceImpl.java:498)
    at com.alibaba.nacos.core.distributed.raft.JRaftServer.registerSelfToCluster(JRaftServer.java:353)
    at com.alibaba.nacos.core.distributed.raft.JRaftServer.lambda$createMultiRaftGroup$0(JRaftServer.java:264)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

2024-08-16 21:57:59,048 INFO shutdown jraft server

KomachiSion commented 3 weeks ago

Failed to join the cluster, retry...

java.lang.IllegalStateException: Fail to get leader of group naming_persistent_service

The metadata contains 10.1.1.2:5004. Check your local IP; it is probably not this one.
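One way to make this check repeatable is to pull the member list out of the "raft-config info" log line posted earlier in the thread and compare `selfMember` with the addresses the host actually owns. This is a rough sketch: the helper names are invented for illustration, and `local_ipv4_addresses` is a best-effort lookup that can miss interfaces, so treat a mismatch as a hint rather than proof.

```python
from __future__ import annotations

import json
import socket


def parse_raft_config(log_line: str) -> dict:
    """Return the JSON payload that follows 'raft-config info :' in a log line."""
    _, marker, payload = log_line.partition("raft-config info :")
    if not marker:
        raise ValueError("line does not contain a raft-config payload")
    return json.loads(payload.strip())


def self_member_is_local(config: dict, local_ips: set[str]) -> bool:
    """True if the host part of selfMember is one of the given local addresses."""
    host, _, _port = config["selfMember"].rpartition(":")
    return host in local_ips


def local_ipv4_addresses() -> set[str]:
    """Best-effort set of IPv4 addresses that resolve for this hostname."""
    ips = {"127.0.0.1"}
    try:
        ips.update(socket.gethostbyname_ex(socket.gethostname())[2])
    except OSError:
        pass  # name resolution failed; keep loopback only
    return ips
```

Calling `self_member_is_local(parse_raft_config(line), local_ipv4_addresses())` on the startup log line gives a quick yes/no answer before digging further into alipay-jraft.log.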

damonj commented 3 weeks ago

Failed to join the cluster, retry...

java.lang.IllegalStateException: Fail to get leader of group naming_persistent_service

The metadata contains 10.1.1.2:5004. Check your local IP; it is probably not this one.

The IP is fine. Rolling back to 2.4.0.1 makes the error go away. However, in some other environments the upgrade to 2.4.1 succeeded.

KomachiSion commented 2 weeks ago

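The log is actually quite clear. At startup the node finds 10.1.1.2:5004 recorded as leader in the metadata, so it tries to add itself to the cluster. Joining the cluster has to be written into the metadata through the leader, but the join fails because no leader can be found, which contradicts the metadata that was just loaded.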

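From this we can conclude that, at that moment, the node could not connect to 10.1.1.2:5004 to fetch the latest group information and metadata, and no leader at that IP was renewing the heartbeat lease. So in the end no leader was found and the node did not join the cluster.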

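You can follow @karsonto's suggestion: remove the local data directory and retry. Also take another look at the alipay-jraft log; you may find that the new leader IP or port shown there differs from the earlier one.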

akinlau commented 2 weeks ago

I also upgraded, from 2.3.2 to 2.4.1, and alipay-jraft.log keeps reporting errors:

2024-08-28 12:38:44,979 INFO Node <naming_persistent_service_v2/192.168.1.2:8895> term 0 start preVote.

2024-08-28 12:38:44,980 WARN Node <naming_persistent_service_v2/192.168.1.2:8895> PreVote to 192.168.1.3:8895 error: Status[ENOENT<1012>: Peer id not found: 192.168.1.3:8895, group: naming_persistent_service_v2].

2024-08-28 12:38:44,980 WARN Node <naming_persistent_service_v2/192.168.1.2:8895> PreVote to 192.168.1.4:8895 error: Status[ENOENT<1012>: Peer id not found: 192.168.1.4:8895, group: naming_persistent_service_v2].
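When the log repeats like this, it can help to count which peers reject the PreVote and with which status. The snippet below is a small sketch written against the exact line shape shown above; the regular expression and the function name `summarize_prevote_errors` are illustrative assumptions and may need adjusting for other JRaft versions or log layouts.

```python
import re
from collections import Counter

# Matches lines such as:
#   ... WARN Node <group/self> PreVote to peer error: Status[ENOENT<1012>: message].
PREVOTE_ERROR = re.compile(
    r"Node <(?P<group>[^/]+)/(?P<node>[^>]+)> PreVote to (?P<peer>\S+) error: "
    r"Status\[(?P<code>\w+)<(?P<num>\d+)>: (?P<msg>.*)\]"
)


def summarize_prevote_errors(lines):
    """Count PreVote failures per (raft group, target peer, status code)."""
    counts = Counter()
    for line in lines:
        match = PREVOTE_ERROR.search(line)
        if match:
            counts[(match["group"], match["peer"], match["code"])] += 1
    return counts
```

Feeding it the lines of alipay-jraft.log (for example via `open(path, encoding="utf-8")`) yields one counter entry per rejecting peer, which makes it easier to see whether every peer fails or only some.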

I tried deleting everything under the data directory and restarting, with the same result. Querying the state through the API reports that the server is down:

curl -X GET 'http://192.168.1.2:9895/nacos/v1/ns/raft/state'

server is DOWNnow, detailed error message: Optional[No leader for raft group naming_persistent_service, please see logs alipay-jraft.log or naming-raft.log to see details.]
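The same check can be scripted, for instance to poll after a restart. This is a minimal sketch using only the standard library; the endpoint path is the one from the curl command above, while the function names and the assumption that a DOWN server may answer with an error status whose body still carries the message are mine.

```python
import re
import urllib.error
import urllib.request

NO_LEADER = re.compile(r"No leader for raft group (\w+)")


def group_without_leader(body: str):
    """Return the raft group named in a 'No leader' message, or None."""
    match = NO_LEADER.search(body)
    return match.group(1) if match else None


def fetch_raft_state(base_url: str, timeout: float = 5.0) -> str:
    """GET <base_url>/nacos/v1/ns/raft/state and return the response body as text."""
    url = base_url.rstrip("/") + "/nacos/v1/ns/raft/state"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        # Error statuses raise HTTPError; the body is still readable.
        return err.read().decode("utf-8", errors="replace")
```

`group_without_leader(fetch_raft_state("http://192.168.1.2:9895"))` would then return the group name while the error persists and `None` once the message is gone.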

Rolling back to 2.3.2 works fine.

wsldl123292 commented 2 weeks ago

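The IP is fine. Rolling back to 2.4.0.1 makes the error go away. However, in some other environments the upgrade to 2.4.1 succeeded.

Same result for me. I upgraded Nacos to 2.4.1 in three places; two succeeded and one hit this same error. Rolling back to 2.4.0 fixed it.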

KomachiSion commented 1 week ago

https://github.com/alibaba/nacos/pull/12573 refines the Server Status check: requests to interfaces that do not use raft are no longer intercepted outright, so that core functionality stays available.

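However, if jraft remains unable to elect a leader, the features that depend on raft will still be broken and unusable, and someone has to step in and fix the raft leader-election problem.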
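To make the intended behaviour concrete, here is a toy model of the admission rule described above. It is not the code from the PR: the `Request` type, the `needs_raft` flag, and the `admit` function are hypothetical names used only to illustrate the idea that a missing leader should block raft-dependent requests and nothing else.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    path: str
    needs_raft: bool  # hypothetical flag: does handling this request go through raft?


def admit(request: Request, raft_has_leader: bool) -> bool:
    """Admit everything when raft is healthy; otherwise only raft-free requests."""
    if raft_has_leader:
        return True
    return not request.needs_raft
```

Under this rule a request that does not touch raft is still served while no leader exists, whereas a raft-backed request keeps failing until election is repaired.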