alibaba / nacos

An easy-to-use dynamic service discovery, configuration and service management platform for building cloud native applications.
https://nacos.io
Apache License 2.0

The Raft Group [nacos_config] did not find the Leader node #9787

Closed: mygithubzh123 closed this issue 1 year ago

mygithubzh123 commented 1 year ago

Background:

Nacos v2.1.0, cluster mode with embedded storage

cluster.conf:

2023-01-03T13:11:31.917

192.168.10.110:47588
192.168.11.13:47588
192.168.11.249:47588
192.168.8.192:47588
192.168.8.63:47588 #just for mark: this is the failed node

application.properties: all defaults except for the properties below:

server.port=47588
nacos.core.protocol.raft.data.rpc_request_timeout_ms=10000
nacos.core.protocol.raft.data.election_timeout_ms=10000

Situation:

The cluster, which has 5 nodes, started up with 1 node failing. I then restarted the failed node via '/bin/start.sh -p embedded', but it still did not work. The related resources are identical across the 5 nodes, including nacos-server.jar, application.properties, cluster.conf, env variables...

Here are some key logs:

alipay-jraft.log:

2023-01-03 13:11:32,634 INFO SPI service [com.alipay.sofa.jraft.JRaftServiceFactory - com.alipay.sofa.jraft.core.DefaultJRaftServiceFactory] loading.
2023-01-03 13:11:32,758 INFO SPI service [com.alipay.sofa.jraft.rpc.RaftRpcFactory - com.alipay.sofa.jraft.rpc.impl.GrpcRaftRpcFactory] loading.
2023-01-03 13:11:33,270 INFO SPI service [com.alipay.sofa.jraft.util.JRaftSignalHandler - com.alipay.sofa.jraft.NodeDescribeSignalHandler] loading.
2023-01-03 13:11:33,271 INFO SPI service [com.alipay.sofa.jraft.util.JRaftSignalHandler - com.alipay.sofa.jraft.NodeMetricsSignalHandler] loading.
2023-01-03 13:11:33,271 INFO SPI service [com.alipay.sofa.jraft.util.JRaftSignalHandler - com.alipay.sofa.jraft.ThreadPoolMetricsSignalHandler] loading.
2023-01-03 13:11:33,284 INFO SPI service [com.alipay.sofa.jraft.util.timer.RaftTimerFactory - com.alipay.sofa.jraft.util.timer.DefaultRaftTimerFactory] loading.
2023-01-03 13:11:33,288 INFO The number of active nodes increment to 1.
2023-01-03 13:11:33,502 INFO Success to init RocksDBLogStorage, path=/msercm/91c2041571cdaa29c3bca37559e6b257/nacos/nacos/data/protocol/raft/naming_persistent_service/log.
2023-01-03 13:11:33,625 INFO Starts FSMCaller successfully.
2023-01-03 13:11:33,632 WARN No data for snapshot reader /msercm/91c2041571cdaa29c3bca37559e6b257/nacos/nacos/data/protocol/raft/naming_persistent_service/snapshot.
2023-01-03 13:11:33,658 INFO Node <naming_persistent_service/192.168.8.63:46588> init, term=0, lastLogId=LogId [index=0, term=0], conf=192.168.11.13:46588,192.168.11.249:46588,192.168.8.63:46588,192.168.8.192:46588,192.168.10.110:46588, oldConf=.
2023-01-03 13:11:33,660 INFO Node <naming_persistent_service/192.168.8.63:46588> stepDown, term=0, newTerm=0, wakeupCandidate=false.
2023-01-03 13:11:33,660 WARN RPC server is not started in RaftGroupService.
2023-01-03 13:11:33,660 INFO Start the RaftGroupService successfully.
2023-01-03 13:11:33,805 WARN Channel in TRANSIENT_FAILURE state: 192.168.11.13:46588.
2023-01-03 13:11:33,805 WARN Channel in SHUTDOWN state: 192.168.11.13:46588.
2023-01-03 13:11:34,522 ERROR Fail to connect 192.168.11.13:46588, remoting exception: java.util.concurrent.ExecutionException: io.grpc.StatusRuntimeException: UNAVAILABLE: io exception.
2023-01-03 13:11:34,524 WARN Channel in TRANSIENT_FAILURE state: 192.168.11.249:46588.
2023-01-03 13:11:34,524 WARN Channel in SHUTDOWN state: 192.168.11.249:46588.
2023-01-03 13:11:34,950 WARN Channel in TRANSIENT_FAILURE state: 192.168.8.63:46588.
2023-01-03 13:11:34,950 WARN Channel in SHUTDOWN state: 192.168.8.63:46588.
2023-01-03 13:11:35,228 WARN Channel in TRANSIENT_FAILURE state: 192.168.8.192:46588.
2023-01-03 13:11:35,228 WARN Channel in SHUTDOWN state: 192.168.8.192:46588.
2023-01-03 13:11:35,320 WARN Channel in TRANSIENT_FAILURE state: 192.168.10.110:46588.
2023-01-03 13:11:35,320 WARN Channel in SHUTDOWN state: 192.168.10.110:46588.
2023-01-03 13:11:36,553 ERROR Fail to connect 192.168.11.13:46588, remoting exception: java.util.concurrent.ExecutionException: io.grpc.StatusRuntimeException: UNAVAILABLE: io exception.
2023-01-03 13:11:40,116 INFO Node <naming_persistent_service/192.168.8.63:46588> received PreVoteRequest from 192.168.8.192:46588, term=1, currTerm=0, granted=true, requestLastLogId=LogId [index=0, term=0], lastLogId=LogId [index=0, term=0].
2023-01-03 13:11:40,186 WARN Node <naming_persistent_service/192.168.8.63:46588> is not in active state, currTerm=0.
2023-01-03 13:11:40,186 INFO Node <naming_persistent_service/192.168.8.63:46588> received RequestVoteRequest from 192.168.8.192:46588, term=1, currTerm=0.
2023-01-03 13:11:40,186 INFO Node <naming_persistent_service/192.168.8.63:46588> stepDown, term=0, newTerm=1, wakeupCandidate=false.
2023-01-03 13:11:40,250 INFO Save raft meta, path=/msercm/91c2041571cdaa29c3bca37559e6b257/nacos/nacos/data/protocol/raft/naming_persistent_service/meta-data, term=1, votedFor=0.0.0.0:0, cost time=64 ms
2023-01-03 13:11:40,250 INFO Node <naming_persistent_service/192.168.8.63:46588> stepDown, term=1, newTerm=1, wakeupCandidate=false.
2023-01-03 13:11:40,300 INFO Save raft meta, path=/msercm/91c2041571cdaa29c3bca37559e6b257/nacos/nacos/data/protocol/raft/naming_persistent_service/meta-data, term=1, votedFor=192.168.8.192:46588, cost time=51 ms
2023-01-03 13:11:40,330 INFO Node <naming_persistent_service/192.168.8.63:46588> stepDown, term=1, newTerm=1, wakeupCandidate=false.
2023-01-03 13:11:40,394 INFO -Djraft.recyclers.maxCapacityPerThread: 4096.
2023-01-03 13:11:42,471 INFO The number of active nodes increment to 2.
2023-01-03 13:11:42,488 INFO Success to init RocksDBLogStorage, path=/msercm/91c2041571cdaa29c3bca37559e6b257/nacos/nacos/data/protocol/raft/nacos_config/log.
2023-01-03 13:11:42,734 INFO Starts FSMCaller successfully.
2023-01-03 13:11:42,734 WARN No data for snapshot reader /msercm/91c2041571cdaa29c3bca37559e6b257/nacos/nacos/data/protocol/raft/nacos_config/snapshot.
2023-01-03 13:11:42,737 INFO Node <nacos_config/192.168.8.63:46588> init, term=0, lastLogId=LogId [index=0, term=0], conf=192.168.11.13:46588,192.168.11.249:46588,192.168.8.63:46588,192.168.8.192:46588,192.168.10.110:46588, oldConf=.
2023-01-03 13:11:42,737 INFO Node <nacos_config/192.168.8.63:46588> stepDown, term=0, newTerm=0, wakeupCandidate=false.
2023-01-03 13:11:42,737 WARN RPC server is not started in RaftGroupService.
2023-01-03 13:11:42,737 INFO Start the RaftGroupService successfully.
2023-01-03 13:11:49,591 INFO Node <nacos_config/192.168.8.63:46588> received PreVoteRequest from 192.168.8.192:46588, term=1, currTerm=0, granted=true, requestLastLogId=LogId [index=0, term=0], lastLogId=LogId [index=0, term=0].
2023-01-03 13:11:49,637 WARN Node <nacos_config/192.168.8.63:46588> is not in active state, currTerm=0.
2023-01-03 13:11:49,637 INFO Node <nacos_config/192.168.8.63:46588> received RequestVoteRequest from 192.168.8.192:46588, term=1, currTerm=0.
2023-01-03 13:11:49,637 INFO Node <nacos_config/192.168.8.63:46588> stepDown, term=0, newTerm=1, wakeupCandidate=false.
2023-01-03 13:11:49,653 INFO Save raft meta, path=/msercm/91c2041571cdaa29c3bca37559e6b257/nacos/nacos/data/protocol/raft/nacos_config/meta-data, term=1, votedFor=0.0.0.0:0, cost time=15 ms
2023-01-03 13:11:49,653 INFO Node <nacos_config/192.168.8.63:46588> stepDown, term=1, newTerm=1, wakeupCandidate=false.
2023-01-03 13:11:49,662 INFO Save raft meta, path=/msercm/91c2041571cdaa29c3bca37559e6b257/nacos/nacos/data/protocol/raft/nacos_config/meta-data, term=1, votedFor=192.168.8.192:46588, cost time=8 ms
2023-01-03 13:11:49,679 INFO Node <nacos_config/192.168.8.63:46588> stepDown, term=1, newTerm=1, wakeupCandidate=false.
2023-01-03 13:11:49,739 WARN Channel in TRANSIENT_FAILURE state: 192.168.8.192:46588.
2023-01-03 13:11:49,739 WARN Channel in SHUTDOWN state: 192.168.8.192:46588.
2023-01-03 13:11:49,742 INFO Peer 192.168.8.192:46588 is connected.
2023-01-03 13:11:49,776 INFO Node <naming_persistent_service/192.168.8.63:46588> shutdown, currTerm=1 state=STATE_FOLLOWER.
2023-01-03 13:11:49,776 INFO Node <naming_persistent_service/192.168.8.63:46588> stepDown, term=1, newTerm=1, wakeupCandidate=false.
2023-01-03 13:11:49,777 INFO onStopFollowing: LeaderChangeContext [leaderId=192.168.8.192:46588, term=1, status=Status[ESHUTDOWN<1007>: Raft node is going to quit.]].
2023-01-03 13:11:49,833 INFO Save raft meta, path=/msercm/91c2041571cdaa29c3bca37559e6b257/nacos/nacos/data/protocol/raft/naming_persistent_service/meta-data, term=1, votedFor=192.168.8.192:46588, cost time=27 ms
2023-01-03 13:11:49,833 INFO Shutting down FSMCaller...
2023-01-03 13:11:49,834 INFO onShutdown.
2023-01-03 13:11:49,834 INFO ThreadPool is terminated: JRaft-RPC-Processor, com.alipay.sofa.jraft.util.MetricThreadPoolExecutor@4983159f[Shutting down, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 0].
2023-01-03 13:11:49,834 INFO ThreadPool is terminated: JRaft-Node-ScheduleThreadPool, com.alipay.sofa.jraft.util.MetricScheduledThreadPoolExecutor@101639ae[Shutting down, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 0].
2023-01-03 13:11:49,834 INFO Destroy timer: RepeatedTimer{timeout=null, stopped=true, running=false, destroyed=true, invoking=false, timeoutMs=10000, name='JRaft-ElectionTimer-<naming_persistent_service/192.168.8.63:46588>'}.
2023-01-03 13:11:49,834 INFO Destroy timer: RepeatedTimer{timeout=null, stopped=true, running=false, destroyed=true, invoking=false, timeoutMs=10000, name='JRaft-VoteTimer-<naming_persistent_service/192.168.8.63:46588>'}.
2023-01-03 13:11:49,834 INFO Destroy timer: RepeatedTimer{timeout=null, stopped=true, running=false, destroyed=true, invoking=false, timeoutMs=5000, name='JRaft-StepDownTimer-<naming_persistent_service/192.168.8.63:46588>'}.
2023-01-03 13:11:49,834 INFO Destroy timer: RepeatedTimer{timeout=null, stopped=true, running=false, destroyed=true, invoking=false, timeoutMs=1800000, name='JRaft-SnapshotTimer-<naming_persistent_service/192.168.8.63:46588>'}.
2023-01-03 13:11:49,834 INFO Node <naming_persistent_service/192.168.8.63:46588> shutdown, currTerm=1 state=STATE_SHUTTING.
2023-01-03 13:11:49,834 INFO Stop the RaftGroupService successfully.
2023-01-03 13:11:49,834 INFO Node <nacos_config/192.168.8.63:46588> shutdown, currTerm=1 state=STATE_FOLLOWER.
2023-01-03 13:11:49,834 INFO Node <nacos_config/192.168.8.63:46588> stepDown, term=1, newTerm=1, wakeupCandidate=false.
2023-01-03 13:11:49,840 INFO onStopFollowing: LeaderChangeContext [leaderId=192.168.8.192:46588, term=1, status=Status[ESHUTDOWN<1007>: Raft node is going to quit.]].
2023-01-03 13:11:49,847 INFO The number of active nodes decrement to 1.
2023-01-03 13:11:49,876 INFO Save raft meta, path=/msercm/91c2041571cdaa29c3bca37559e6b257/nacos/nacos/data/protocol/raft/nacos_config/meta-data, term=1, votedFor=192.168.8.192:46588, cost time=29 ms
2023-01-03 13:11:49,876 INFO Shutting down FSMCaller...
2023-01-03 13:11:49,876 INFO onShutdown.
2023-01-03 13:11:49,876 INFO Shutdown managed channel: 192.168.8.192:46588, ManagedChannelOrphanWrapper{delegate=ManagedChannelImpl{logId=33, target=192.168.8.192:46588}}.
2023-01-03 13:11:49,883 INFO ThreadPool is terminated: JRaft-Node-ScheduleThreadPool, com.alipay.sofa.jraft.util.MetricScheduledThreadPoolExecutor@1d2bd371[Shutting down, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 0].
2023-01-03 13:11:49,883 INFO ThreadPool is terminated: JRaft-RPC-Processor, com.alipay.sofa.jraft.util.MetricThreadPoolExecutor@1fd74636[Shutting down, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 1].
2023-01-03 13:11:49,883 INFO ThreadPool is terminated: JRaft-Global-ElectionTimer, com.alipay.sofa.jraft.util.MetricScheduledThreadPoolExecutor@2dad66e3[Shutting down, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 1].
2023-01-03 13:11:49,883 INFO The number of active nodes decrement to 0.
2023-01-03 13:11:49,886 INFO Destroy timer: RepeatedTimer{timeout=null, stopped=true, running=false, destroyed=true, invoking=false, timeoutMs=10000, name='JRaft-ElectionTimer-<nacos_config/192.168.8.63:46588>'}.
2023-01-03 13:11:49,886 INFO ThreadPool is terminated: JRaft-Global-VoteTimer, com.alipay.sofa.jraft.util.MetricScheduledThreadPoolExecutor@44040454[Shutting down, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 0].
2023-01-03 13:11:49,886 INFO Destroy timer: RepeatedTimer{timeout=null, stopped=true, running=false, destroyed=true, invoking=false, timeoutMs=10000, name='JRaft-VoteTimer-<nacos_config/192.168.8.63:46588>'}.
2023-01-03 13:11:49,886 INFO ThreadPool is terminated: JRaft-Global-StepDownTimer, com.alipay.sofa.jraft.util.MetricScheduledThreadPoolExecutor@65fe9e33[Shutting down, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 0].
2023-01-03 13:11:49,886 INFO Destroy timer: RepeatedTimer{timeout=null, stopped=true, running=false, destroyed=true, invoking=false, timeoutMs=5000, name='JRaft-StepDownTimer-<nacos_config/192.168.8.63:46588>'}.
2023-01-03 13:11:49,886 INFO ThreadPool is terminated: JRaft-Global-SnapshotTimer, com.alipay.sofa.jraft.util.MetricScheduledThreadPoolExecutor@1ac31baf[Shutting down, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 0].
2023-01-03 13:11:49,886 INFO Destroy timer: RepeatedTimer{timeout=null, stopped=true, running=false, destroyed=true, invoking=false, timeoutMs=1800000, name='JRaft-SnapshotTimer-<nacos_config/192.168.8.63:46588>'}.
2023-01-03 13:11:49,887 INFO Node <nacos_config/192.168.8.63:46588> shutdown, currTerm=1 state=STATE_SHUTTING.
2023-01-03 13:11:49,887 INFO Stop the RaftGroupService successfully.
2023-01-03 13:11:49,887 INFO Shutdown managed channel: 192.168.11.249:46588, ManagedChannelOrphanWrapper{delegate=ManagedChannelImpl{logId=10, target=192.168.11.249:46588}}.
2023-01-03 13:11:49,887 INFO Shutdown managed channel: 192.168.8.192:46588, ManagedChannelOrphanWrapper{delegate=ManagedChannelImpl{logId=20, target=192.168.8.192:46588}}.
2023-01-03 13:11:49,888 INFO Shutdown managed channel: 192.168.11.13:46588, ManagedChannelOrphanWrapper{delegate=ManagedChannelImpl{logId=4, target=192.168.11.13:46588}}.
2023-01-03 13:11:49,889 INFO Shutdown managed channel: 192.168.8.63:46588, ManagedChannelOrphanWrapper{delegate=ManagedChannelImpl{logId=14, target=192.168.8.63:46588}}.
2023-01-03 13:11:49,890 INFO Connection disconnected: /192.168.8.63:52236
2023-01-03 13:11:49,891 INFO Shutdown managed channel: 192.168.10.110:46588, ManagedChannelOrphanWrapper{delegate=ManagedChannelImpl{logId=24, target=192.168.10.110:46588}}.
2023-01-03 13:11:49,899 INFO ThreadPool is terminated: JRaft-RPC-Processor, com.alipay.sofa.jraft.util.MetricThreadPoolExecutor@4782b825[Shutting down, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 62].

config-fatal.log:

2023-01-03 13:11:49,690 WARN isQuickStart:false

2023-01-03 13:11:49,765 ERROR An exception occurred during the query operation : java.util.concurrent.CompletionException: com.alibaba.nacos.consistency.exception.ConsistencyException: The Raft Group [nacos_config] did not find the Leader node

2023-01-03 13:11:49,765 ERROR Nacos Server did not start because dumpservice bean construction failure : com.alibaba.nacos.api.exception.runtime.NacosRuntimeException: errCode: 500, errMsg: java.util.concurrent.CompletionException: com.alibaba.nacos.consistency.exception.ConsistencyException: The Raft Group [nacos_config] did not find the Leader node

protocol-raft.log:

java.lang.NullPointerException: null
    at com.alipay.sofa.jraft.core.CliServiceImpl.getLeader(CliServiceImpl.java:469)
    at com.alipay.sofa.jraft.core.CliServiceImpl.getPeers(CliServiceImpl.java:621)
    at com.alipay.sofa.jraft.core.CliServiceImpl.getPeers(CliServiceImpl.java:515)
    at com.alibaba.nacos.core.distributed.raft.JRaftServer.registerSelfToCluster(JRaftServer.java:364)
    at com.alibaba.nacos.core.distributed.raft.JRaftServer.lambda$createMultiRaftGroup$0(JRaftServer.java:268)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
2023-01-03 14:54:43,746 ERROR Failed to join the cluster, retry...

start.out:

/usr/local/tools/jdk1.8.0_201/bin/java -Djava.ext.dirs=/usr/local/tools/jdk1.8.0_201/jre/lib/ext:/usr/local/tools/jdk1.8.0_201/lib/ext -DembeddedStorage=true -server -Xms962m -Xmx962m -Xmn481m -XX:MetaspaceSize=128m -XX:MaxMetaspaceSize=128m -XX:-OmitStackTraceInFastThrow -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/msercm/91c2041571cdaa29c3bca37559e6b257/nacos/nacos/logs/java_heapdump.hprof -XX:-UseLargePages -Dnacos.member.list= -Xloggc:/msercm/91c2041571cdaa29c3bca37559e6b257/nacos/nacos/logs/nacos_gc.log -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=100M -Dloader.path=/msercm/91c2041571cdaa29c3bca37559e6b257/nacos/nacos/plugins/health,/msercm/91c2041571cdaa29c3bca37559e6b257/nacos/nacos/plugins/cmdb,/msercm/91c2041571cdaa29c3bca37559e6b257/nacos/nacos/plugins/selector -Dnacos.home=/msercm/91c2041571cdaa29c3bca37559e6b257/nacos/nacos -jar /msercm/91c2041571cdaa29c3bca37559e6b257/nacos/nacos/target/nacos-server.jar --spring.config.additional-location=file:/msercm/91c2041571cdaa29c3bca37559e6b257/nacos/nacos/conf/ --logging.config=/msercm/91c2041571cdaa29c3bca37559e6b257/nacos/nacos/conf/nacos-logback.xml --server.max-http-header-size=524288

(ASCII-art startup banner omitted. It reports: Nacos 2.1.0, Running in cluster mode, All function modules, Port: 47588, Pid: 6597, Console: http://192.168.8.63:47588/nacos/index.html, https://nacos.io)

2023-01-03 13:11:24,181 INFO The server IP list of Nacos is [192.168.11.249:47588, 192.168.11.13:47588, 192.168.10.110:47588, 192.168.8.192:47588, 192.168.8.63:47588]

2023-01-03 13:11:25,184 INFO Nacos is starting...

2023-01-03 13:11:26,185 INFO Nacos is starting...

2023-01-03 13:11:27,189 INFO Nacos is starting...

2023-01-03 13:11:28,190 INFO Nacos is starting...

2023-01-03 13:11:29,193 INFO Nacos is starting...

2023-01-03 13:11:30,198 INFO Nacos is starting...

2023-01-03 13:11:31,208 INFO Nacos is starting...

2023-01-03 13:11:32,210 INFO Nacos is starting...

2023-01-03 13:11:33,215 INFO Nacos is starting...

2023-01-03 13:11:34,218 INFO Nacos is starting...

2023-01-03 13:11:35,219 INFO Nacos is starting...

2023-01-03 13:11:36,220 INFO Nacos is starting...

2023-01-03 13:11:37,221 INFO Nacos is starting...

2023-01-03 13:11:38,223 INFO Nacos is starting...

2023-01-03 13:11:39,223 INFO Nacos is starting...

2023-01-03 13:11:40,225 INFO Nacos is starting...

2023-01-03 13:11:41,225 INFO Nacos is starting...

2023-01-03 13:11:42,227 INFO Nacos is starting...

2023-01-03 13:11:43,230 INFO Nacos is starting...

2023-01-03 13:11:44,230 INFO Nacos is starting...

2023-01-03 13:11:45,231 INFO Nacos is starting...

2023-01-03 13:11:46,231 INFO Nacos is starting...

2023-01-03 13:11:47,233 INFO Nacos is starting...

2023-01-03 13:11:48,233 INFO Nacos is starting...

2023-01-03 13:11:49,234 INFO Nacos is starting...

2023-01-03 13:11:49,963 INFO Nacos Log files: /msercm/91c2041571cdaa29c3bca37559e6b257/nacos/nacos/logs

2023-01-03 13:11:49,964 INFO Nacos Log files: /msercm/91c2041571cdaa29c3bca37559e6b257/nacos/nacos/conf

2023-01-03 13:11:49,964 INFO Nacos Log files: /msercm/91c2041571cdaa29c3bca37559e6b257/nacos/nacos/data

2023-01-03 13:11:49,964 ERROR Startup errors :

org.springframework.beans.factory.UnsatisfiedDependencyException: Error creating bean with name 'communicationController' defined in URL [jar:file:/msercm/91c2041571cdaa29c3bca37559e6b257/nacos/nacos/target/nacos-server.jar!/BOOT-INF/lib/nacos-config-2.1.0.jar!/com/alibaba/nacos/config/server/controller/CommunicationController.class]: Unsatisfied dependency expressed through constructor parameter 0; nested exception is org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'embeddedDumpService': Invocation of init method failed; nested exception is ErrCode:500, ErrMsg:Nacos Server did not start because dumpservice bean construction failure : errCode: 500, errMsg: java.util.concurrent.CompletionException: com.alibaba.nacos.consistency.exception.ConsistencyException: The Raft Group [nacos_config] did not find the Leader node
    at org.springframework.beans.factory.support.ConstructorResolver.createArgumentArray(ConstructorResolver.java:800)
    at org.springframework.beans.factory.support.ConstructorResolver.autowireConstructor(ConstructorResolver.java:229)
    at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.autowireConstructor(AbstractAutowireCapableBeanFactory.java:1372)
    at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.createBeanInstance(AbstractAutowireCapableBeanFactory.java:1222)
    at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.doCreateBean(AbstractAutowireCapableBeanFactory.java:582)
    at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.createBean(AbstractAutowireCapableBeanFactory.java:542)
    at org.springframework.beans.factory.support.AbstractBeanFactory.lambda$doGetBean$0(AbstractBeanFactory.java:335)
    at org.springframework.beans.factory.support.DefaultSingletonBeanRegistry.getSingleton(DefaultSingletonBeanRegistry.java:234)
    at org.springframework.beans.factory.support.AbstractBeanFactory.doGetBean(AbstractBeanFactory.java:333)
    at org.springframework.beans.factory.support.AbstractBeanFactory.getBean(AbstractBeanFactory.java:208)
    at org.springframework.beans.factory.support.DefaultListableBeanFactory.preInstantiateSingletons(DefaultListableBeanFactory.java:953)
    at org.springframework.context.support.AbstractApplicationContext.finishBeanFactoryInitialization(AbstractApplicationContext.java:918)
    at org.springframework.context.support.AbstractApplicationContext.refresh(AbstractApplicationContext.java:583)
    at org.springframework.boot.web.servlet.context.ServletWebServerApplicationContext.refresh(ServletWebServerApplicationContext.java:145)
    at org.springframework.boot.SpringApplication.refresh(SpringApplication.java:740)
    at org.springframework.boot.SpringApplication.refreshContext(SpringApplication.java:415)
    at org.springframework.boot.SpringApplication.run(SpringApplication.java:303)
    at org.springframework.boot.SpringApplication.run(SpringApplication.java:1312)
    at org.springframework.boot.SpringApplication.run(SpringApplication.java:1301)
    at com.alibaba.nacos.Nacos.main(Nacos.java:35)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.springframework.boot.loader.MainMethodRunner.run(MainMethodRunner.java:49)
    at org.springframework.boot.loader.Launcher.launch(Launcher.java:108)
    at org.springframework.boot.loader.Launcher.launch(Launcher.java:58)
    at org.springframework.boot.loader.PropertiesLauncher.main(PropertiesLauncher.java:467)
Caused by: org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'embeddedDumpService': Invocation of init method failed; nested exception is ErrCode:500, ErrMsg:Nacos Server did not start because dumpservice bean construction failure : errCode: 500, errMsg: java.util.concurrent.CompletionException: com.alibaba.nacos.consistency.exception.ConsistencyException: The Raft Group [nacos_config] did not find the Leader node
    at org.springframework.beans.factory.annotation.InitDestroyAnnotationBeanPostProcessor.postProcessBeforeInitialization(InitDestroyAnnotationBeanPostProcessor.java:160)
    at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.applyBeanPostProcessorsBeforeInitialization(AbstractAutowireCapableBeanFactory.java:440)
    at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.initializeBean(AbstractAutowireCapableBeanFactory.java:1796)
    at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.doCreateBean(AbstractAutowireCapableBeanFactory.java:620)
    at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.createBean(AbstractAutowireCapableBeanFactory.java:542)
    at org.springframework.beans.factory.support.AbstractBeanFactory.lambda$doGetBean$0(AbstractBeanFactory.java:335)
    at org.springframework.beans.factory.support.DefaultSingletonBeanRegistry.getSingleton(DefaultSingletonBeanRegistry.java:234)
    at org.springframework.beans.factory.support.AbstractBeanFactory.doGetBean(AbstractBeanFactory.java:333)
    at org.springframework.beans.factory.support.AbstractBeanFactory.getBean(AbstractBeanFactory.java:208)
    at org.springframework.beans.factory.config.DependencyDescriptor.resolveCandidate(DependencyDescriptor.java:276)
    at org.springframework.beans.factory.support.DefaultListableBeanFactory.doResolveDependency(DefaultListableBeanFactory.java:1389)
    at org.springframework.beans.factory.support.DefaultListableBeanFactory.resolveDependency(DefaultListableBeanFactory.java:1309)
    at org.springframework.beans.factory.support.ConstructorResolver.resolveAutowiredArgument(ConstructorResolver.java:887)
    at org.springframework.beans.factory.support.ConstructorResolver.createArgumentArray(ConstructorResolver.java:791)
    ... 27 common frames omitted
Caused by: com.alibaba.nacos.api.exception.NacosException: Nacos Server did not start because dumpservice bean construction failure : errCode: 500, errMsg: java.util.concurrent.CompletionException: com.alibaba.nacos.consistency.exception.ConsistencyException: The Raft Group [nacos_config] did not find the Leader node
    at com.alibaba.nacos.config.server.service.dump.DumpService.dumpOperate(DumpService.java:225)
    at com.alibaba.nacos.config.server.service.dump.EmbeddedDumpService$1.lambda$update$0(EmbeddedDumpService.java:109)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: com.alibaba.nacos.api.exception.runtime.NacosRuntimeException: errCode: 500, errMsg: java.util.concurrent.CompletionException: com.alibaba.nacos.consistency.exception.ConsistencyException: The Raft Group [nacos_config] did not find the Leader node
    at com.alibaba.nacos.config.server.service.repository.embedded.DistributedDatabaseOperateImpl.queryOne(DistributedDatabaseOperateImpl.java:238)
    at com.alibaba.nacos.config.server.service.repository.embedded.EmbeddedStoragePersistServiceImpl.findConfigMaxId(EmbeddedStoragePersistServiceImpl.java:756)
    at com.alibaba.nacos.config.server.service.dump.processor.DumpAllProcessor.process(DumpAllProcessor.java:51)
    at com.alibaba.nacos.config.server.service.dump.DumpService.dumpConfigInfo(DumpService.java:282)
    at com.alibaba.nacos.config.server.service.dump.DumpService.dumpOperate(DumpService.java:195)
    ... 8 common frames omitted
2023-01-03 13:11:52,078 WARN [WatchFileCenter] start close

2023-01-03 13:11:52,078 WARN [WatchFileCenter] start to shutdown this watcher which is watch : /msercm/91c2041571cdaa29c3bca37559e6b257/nacos/nacos/data/loader

2023-01-03 13:11:52,078 WARN [WatchFileCenter] start to shutdown this watcher which is watch : /msercm/91c2041571cdaa29c3bca37559e6b257/nacos/nacos/conf

2023-01-03 13:11:52,079 WARN [WatchFileCenter] already closed

2023-01-03 13:11:52,079 WARN [NotifyCenter] Start destroying Publisher

2023-01-03 13:11:52,079 WARN [NotifyCenter] Destruction of the end

2023-01-03 13:11:52,079 ERROR Nacos failed to start, please see /msercm/91c2041571cdaa29c3bca37559e6b257/nacos/nacos/logs/nacos.log for more details.

=====================================================

I don't know why the cluster fails to start. How can I fix it?

KomachiSion commented 1 year ago

From alipay-jraft.log, I see the node can't connect to the other nodes, so it can't vote for a leader or find the leader.

A config cluster with derby storage uses jraft to sync config data, but jraft can't work without a leader, so the node can't start.

You should fix the network problem so that the node can connect to the other nodes.

mygithubzh123 commented 1 year ago

I've considered the possibility that the networking problem matters, but it seems not to.

The server port is 47588, so the raft port is 46588. From the failed node (192.168.8.63) I telnetted to port 46588 on each of the other 4 nodes separately, and all of them worked.

I've also thought about transient network problems, but those should not lead to repeated restart failures.

KomachiSion commented 1 year ago

From the logs, while the server is starting, jraft can't connect to the other nodes, and because jraft can't elect a leader, the config module can't get data. To protect data consistency, the server can't start and shuts itself down. That the network is connectable after this point doesn't help.

Two suggestions for this case:

  1. Make sure the connections to the other nodes work before the server starts (see the sketch after this list).
  2. Use external storage like MySQL; you can also check that it is connectable before starting.
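
To make suggestion 1 concrete, here is a minimal, hypothetical pre-start probe. It is not part of Nacos; the member list is copied from the cluster.conf in this issue, and it assumes the Nacos 2.x convention (also noted below by the reporter) that the jraft port is server.port - 1000:

import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.Arrays;
import java.util.List;

public class RaftPortProbe {

    // Nacos 2.x derives the jraft port by subtracting 1000 from the main server port.
    private static final int RAFT_PORT_OFFSET = 1000;

    public static void main(String[] args) {
        // Member list copied from the cluster.conf above (illustrative).
        List<String> members = Arrays.asList(
                "192.168.10.110:47588", "192.168.11.13:47588", "192.168.11.249:47588",
                "192.168.8.192:47588", "192.168.8.63:47588");
        for (String member : members) {
            String[] parts = member.split(":");
            String host = parts[0];
            int raftPort = Integer.parseInt(parts[1]) - RAFT_PORT_OFFSET;
            // Try to open a plain TCP connection to the peer's raft port with a 3s timeout.
            try (Socket socket = new Socket()) {
                socket.connect(new InetSocketAddress(host, raftPort), 3000);
                System.out.println(host + ":" + raftPort + " reachable");
            } catch (IOException e) {
                System.out.println(host + ":" + raftPort + " NOT reachable: " + e.getMessage());
            }
        }
    }
}

Running a check like this from each node before starting the server rules out firewall or routing problems on the raft port specifically, not just on the main server port.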
mygithubzh123 commented 1 year ago

Previously on

I have solved the networking problems, and the RPC connection no longer fails, but the "nacos_config group can't find the Leader" situation still occurred.

Problem (5-node cluster with embedded storage; one of its nodes fails to start)

The failed node restarted and then failed again, with the error occurring during "ReadIndex". So I used tcpdump and Wireshark to analyze the packets in and out during the node's startup, and I found that the node did find the leader from the other peer via the "GetLeaderRequest" call, then got all the peers from the leader via the "GetPeersRequest" call, and read the index from the leader via the "ReadIndexRequest" call (the networking cost was only several ms), but the logs then showed a read-index timeout error.

(screenshot of the captured packets omitted)
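
For context, the read path described here is JRaft's ReadIndex mechanism. Below is a minimal sketch of the pattern, assuming an already-started com.alipay.sofa.jraft.Node for the nacos_config group; this is not Nacos's exact code, but Nacos's JRaftServer follows the same shape: try a local read-index first, and fall back to reading through the Leader when it fails:

import com.alipay.sofa.jraft.Node;
import com.alipay.sofa.jraft.Status;
import com.alipay.sofa.jraft.closure.ReadIndexClosure;

public class ReadIndexSketch {

    // `node` is a started raft node for the group; `request` is the serialized query.
    static void linearizableRead(Node node, byte[] request) {
        node.readIndex(request, new ReadIndexClosure() {
            @Override
            public void run(Status status, long index, byte[] reqCtx) {
                if (status.isOk()) {
                    // The leader confirmed this replica's log has caught up to `index`,
                    // so the query may be served from the local state machine.
                } else {
                    // This is the branch behind the logged error
                    // "ReadIndex has error : read-index request timeout, go to Leader read":
                    // local reading is abandoned and the request is forwarded to the leader.
                }
            }
        });
    }
}

Note that the readIndex callback can time out even when the raw network round trip is fast, since it also depends on the leader answering the read-index RPC and on the local log catching up in time.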

Here are some logs:

alipay-jraft.log:

2023-01-09 13:52:38,061 INFO SPI service [com.alipay.sofa.jraft.JRaftServiceFactory - com.alipay.sofa.jraft.core.DefaultJRaftServiceFactory] loading.
2023-01-09 13:52:38,205 INFO SPI service [com.alipay.sofa.jraft.rpc.RaftRpcFactory - com.alipay.sofa.jraft.rpc.impl.GrpcRaftRpcFactory] loading.
2023-01-09 13:52:38,819 INFO SPI service [com.alipay.sofa.jraft.util.JRaftSignalHandler - com.alipay.sofa.jraft.NodeDescribeSignalHandler] loading.
2023-01-09 13:52:38,820 INFO SPI service [com.alipay.sofa.jraft.util.JRaftSignalHandler - com.alipay.sofa.jraft.NodeMetricsSignalHandler] loading.
2023-01-09 13:52:38,820 INFO SPI service [com.alipay.sofa.jraft.util.JRaftSignalHandler - com.alipay.sofa.jraft.ThreadPoolMetricsSignalHandler] loading.
2023-01-09 13:52:38,822 INFO SPI service [com.alipay.sofa.jraft.util.timer.RaftTimerFactory - com.alipay.sofa.jraft.util.timer.DefaultRaftTimerFactory] loading.
2023-01-09 13:52:38,842 INFO The number of active nodes increment to 1.
2023-01-09 13:52:39,020 INFO Success to init RocksDBLogStorage, path=/msercm/f27ac7a25b83547a7ddc53cf576c288f/nacos/nacos/data/protocol/raft/naming_persistent_service/log.
2023-01-09 13:52:39,114 INFO Starts FSMCaller successfully.
2023-01-09 13:52:39,121 WARN No data for snapshot reader /msercm/f27ac7a25b83547a7ddc53cf576c288f/nacos/nacos/data/protocol/raft/naming_persistent_service/snapshot.
2023-01-09 13:52:39,149 INFO Node <naming_persistent_service/192.168.11.37:46588> init, term=0, lastLogId=LogId [index=0, term=0], conf=192.168.10.40:46588,192.168.11.3:46588,192.168.9.172:46588,192.168.11.37:46588,192.168.8.218:46588, oldConf=.
2023-01-09 13:52:39,151 INFO Node <naming_persistent_service/192.168.11.37:46588> stepDown, term=0, newTerm=0, wakeupCandidate=false.
2023-01-09 13:52:39,151 WARN RPC server is not started in RaftGroupService.
2023-01-09 13:52:39,151 INFO Start the RaftGroupService successfully.
2023-01-09 13:52:39,381 WARN Channel in TRANSIENT_FAILURE state: 192.168.10.40:46588.
2023-01-09 13:52:39,381 WARN Channel in SHUTDOWN state: 192.168.10.40:46588.
2023-01-09 13:52:40,316 INFO Success to connect peer 192.168.10.40:46588 to get leader for group naming_persistent_service.
2023-01-09 13:52:40,398 WARN Channel in TRANSIENT_FAILURE state: 192.168.11.3:46588.
2023-01-09 13:52:40,398 WARN Channel in SHUTDOWN state: 192.168.11.3:46588.
2023-01-09 13:52:47,879 INFO The number of active nodes increment to 2.
2023-01-09 13:52:47,881 INFO Success to init RocksDBLogStorage, path=/msercm/f27ac7a25b83547a7ddc53cf576c288f/nacos/nacos/data/protocol/raft/nacos_config/log.
2023-01-09 13:52:47,957 INFO Starts FSMCaller successfully.
2023-01-09 13:52:47,957 WARN No data for snapshot reader /msercm/f27ac7a25b83547a7ddc53cf576c288f/nacos/nacos/data/protocol/raft/nacos_config/snapshot.
2023-01-09 13:52:47,963 INFO Node <nacos_config/192.168.11.37:46588> init, term=0, lastLogId=LogId [index=0, term=0], conf=192.168.10.40:46588,192.168.11.3:46588,192.168.9.172:46588,192.168.11.37:46588,192.168.8.218:46588, oldConf=.
2023-01-09 13:52:47,963 INFO Node <nacos_config/192.168.11.37:46588> stepDown, term=0, newTerm=0, wakeupCandidate=false.
2023-01-09 13:52:47,963 WARN RPC server is not started in RaftGroupService.
2023-01-09 13:52:47,963 INFO Start the RaftGroupService successfully.
2023-01-09 13:52:47,963 INFO Success to connect peer 192.168.10.40:46588 to get leader for group nacos_config.
2023-01-09 13:52:51,120 INFO Node <nacos_config/192.168.11.37:46588> stepDown, term=0, newTerm=1, wakeupCandidate=false.
2023-01-09 13:52:51,135 INFO Save raft meta, path=/msercm/f27ac7a25b83547a7ddc53cf576c288f/nacos/nacos/data/protocol/raft/nacos_config/meta-data, term=1, votedFor=0.0.0.0:0, cost time=14 ms
2023-01-09 13:52:51,212 WARN Channel in TRANSIENT_FAILURE state: 192.168.11.3:46588.
2023-01-09 13:52:51,212 WARN Channel in SHUTDOWN state: 192.168.11.3:46588.
2023-01-09 13:52:51,226 INFO Peer 192.168.11.3:46588 is connected.
2023-01-09 13:52:53,225 INFO Node <naming_persistent_service/192.168.11.37:46588> shutdown, currTerm=0 state=STATE_FOLLOWER.
2023-01-09 13:52:53,225 INFO Node <naming_persistent_service/192.168.11.37:46588> stepDown, term=0, newTerm=0, wakeupCandidate=false.
2023-01-09 13:52:53,249 INFO Save raft meta, path=/msercm/f27ac7a25b83547a7ddc53cf576c288f/nacos/nacos/data/protocol/raft/naming_persistent_service/meta-data, term=0, votedFor=0.0.0.0:0, cost time=17 ms
2023-01-09 13:52:53,249 INFO Shutting down FSMCaller...
2023-01-09 13:52:53,249 INFO onShutdown.
2023-01-09 13:52:53,250 INFO ThreadPool is terminated: JRaft-RPC-Processor, com.alipay.sofa.jraft.util.MetricThreadPoolExecutor@23bff419[Shutting down, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 0].
2023-01-09 13:52:53,250 INFO ThreadPool is terminated: JRaft-Node-ScheduleThreadPool, com.alipay.sofa.jraft.util.MetricScheduledThreadPoolExecutor@44e3a2b2[Shutting down, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 0].
2023-01-09 13:52:53,250 INFO Destroy timer: RepeatedTimer{timeout=null, stopped=true, running=false, destroyed=true, invoking=false, timeoutMs=450000, name='JRaft-ElectionTimer-<naming_persistent_service/192.168.11.37:46588>'}.
2023-01-09 13:52:53,250 INFO Destroy timer: RepeatedTimer{timeout=null, stopped=true, running=false, destroyed=true, invoking=false, timeoutMs=450000, name='JRaft-VoteTimer-<naming_persistent_service/192.168.11.37:46588>'}.
2023-01-09 13:52:53,250 INFO Destroy timer: RepeatedTimer{timeout=null, stopped=true, running=false, destroyed=true, invoking=false, timeoutMs=225000, name='JRaft-StepDownTimer-<naming_persistent_service/192.168.11.37:46588>'}.
2023-01-09 13:52:53,250 INFO Destroy timer: RepeatedTimer{timeout=null, stopped=true, running=false, destroyed=true, invoking=false, timeoutMs=1800000, name='JRaft-SnapshotTimer-<naming_persistent_service/192.168.11.37:46588>'}.
2023-01-09 13:52:53,250 INFO Node <naming_persistent_service/192.168.11.37:46588> shutdown, currTerm=0 state=STATE_SHUTTING.
2023-01-09 13:52:53,250 INFO Stop the RaftGroupService successfully.
2023-01-09 13:52:53,250 INFO Node <nacos_config/192.168.11.37:46588> shutdown, currTerm=1 state=STATE_FOLLOWER.
2023-01-09 13:52:53,250 INFO Node <nacos_config/192.168.11.37:46588> stepDown, term=1, newTerm=1, wakeupCandidate=false.
2023-01-09 13:52:53,253 INFO onStopFollowing: LeaderChangeContext [leaderId=192.168.11.3:46588, term=1, status=Status[ESHUTDOWN<1007>: Raft node is going to quit.]].
2023-01-09 13:52:53,256 INFO The number of active nodes decrement to 1.
2023-01-09 13:52:53,263 INFO Save raft meta, path=/msercm/f27ac7a25b83547a7ddc53cf576c288f/nacos/nacos/data/protocol/raft/nacos_config/meta-data, term=1, votedFor=0.0.0.0:0, cost time=8 ms
2023-01-09 13:52:53,263 INFO Shutting down FSMCaller...
2023-01-09 13:52:53,263 INFO onShutdown.
2023-01-09 13:52:53,263 INFO Shutdown managed channel: 192.168.11.3:46588, ManagedChannelOrphanWrapper{delegate=ManagedChannelImpl{logId=13, target=192.168.11.3:46588}}.
2023-01-09 13:52:53,272 INFO ThreadPool is terminated: JRaft-Node-ScheduleThreadPool, com.alipay.sofa.jraft.util.MetricScheduledThreadPoolExecutor@44040454[Shutting down, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 0].
2023-01-09 13:52:53,273 INFO ThreadPool is terminated: JRaft-RPC-Processor, com.alipay.sofa.jraft.util.MetricThreadPoolExecutor@1551b45a[Shutting down, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 1].
2023-01-09 13:52:53,275 INFO The number of active nodes decrement to 0.
2023-01-09 13:52:53,276 INFO ThreadPool is terminated: JRaft-Global-ElectionTimer, com.alipay.sofa.jraft.util.MetricScheduledThreadPoolExecutor@50057935[Shutting down, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 0].
2023-01-09 13:52:53,277 INFO Destroy timer: RepeatedTimer{timeout=null, stopped=true, running=false, destroyed=true, invoking=false, timeoutMs=450000, name='JRaft-ElectionTimer-<nacos_config/192.168.11.37:46588>'}.
2023-01-09 13:52:53,277 INFO ThreadPool is terminated: JRaft-Global-VoteTimer, com.alipay.sofa.jraft.util.MetricScheduledThreadPoolExecutor@65fe9e33[Shutting down, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 0].
2023-01-09 13:52:53,277 INFO Destroy timer: RepeatedTimer{timeout=null, stopped=true, running=false, destroyed=true, invoking=false, timeoutMs=450000, name='JRaft-VoteTimer-<nacos_config/192.168.11.37:46588>'}.
2023-01-09 13:52:53,277 INFO ThreadPool is terminated: JRaft-Global-StepDownTimer, com.alipay.sofa.jraft.util.MetricScheduledThreadPoolExecutor@18bc345[Shutting down, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 0].
2023-01-09 13:52:53,277 INFO Destroy timer: RepeatedTimer{timeout=null, stopped=true, running=false, destroyed=true, invoking=false, timeoutMs=225000, name='JRaft-StepDownTimer-<nacos_config/192.168.11.37:46588>'}.
2023-01-09 13:52:53,277 INFO ThreadPool is terminated: JRaft-Global-SnapshotTimer, com.alipay.sofa.jraft.util.MetricScheduledThreadPoolExecutor@65c5d0f4[Shutting down, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 0].
2023-01-09 13:52:53,277 INFO Destroy timer: RepeatedTimer{timeout=null, stopped=true, running=false, destroyed=true, invoking=false, timeoutMs=1800000, name='JRaft-SnapshotTimer-<nacos_config/192.168.11.37:46588>'}.
2023-01-09 13:52:53,277 INFO Node <nacos_config/192.168.11.37:46588> shutdown, currTerm=1 state=STATE_SHUTTING.
2023-01-09 13:52:53,277 INFO Stop the RaftGroupService successfully.
2023-01-09 13:52:53,277 INFO Shutdown managed channel: 192.168.10.40:46588, ManagedChannelOrphanWrapper{delegate=ManagedChannelImpl{logId=4, target=192.168.10.40:46588}}.
2023-01-09 13:52:53,279 INFO Shutdown managed channel: 192.168.11.3:46588, ManagedChannelOrphanWrapper{delegate=ManagedChannelImpl{logId=9, target=192.168.11.3:46588}}.
2023-01-09 13:52:53,285 INFO ThreadPool is terminated: JRaft-RPC-Processor, com.alipay.sofa.jraft.util.MetricThreadPoolExecutor@1e907aa2[Shutting down, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 4].

protocol-raft.log:

2023-01-09 13:52:38,003 INFO Initializes the Raft protocol, raft-config info : {"rpc_request_timeout_ms":"450000","election_timeout_ms":"450000"}
2023-01-09 13:52:38,369 INFO ========= The raft protocol is starting... =========
2023-01-09 13:52:38,370 INFO [JRaftServer start], electionTimeout : 450000, RpcConnectTimeoutMs: 450000
2023-01-09 13:52:38,794 INFO ========= The raft protocol start finished... =========
2023-01-09 13:52:38,795 INFO group naming_persistent_service add processer: com.alibaba.nacos.naming.consistency.persistent.impl.PersistentServiceProcessor
2023-01-09 13:52:38,795 INFO success Init Raft-File dir , path: /msercm/f27ac7a25b83547a7ddc53cf576c288f/nacos/nacos/data/protocol/raft
2023-01-09 13:52:38,799 INFO create raft group : naming_persistent_service
2023-01-09 13:52:39,155 INFO [JRaftServer createMultiRaftGroup], electionTimeout : 450000, RpcConnectTimeoutMs: 450000
2023-01-09 13:52:47,877 INFO group nacos_config add processer: com.alibaba.nacos.config.server.service.repository.embedded.DistributedDatabaseOperateImpl
2023-01-09 13:52:47,877 INFO success Init Raft-File dir , path: /msercm/f27ac7a25b83547a7ddc53cf576c288f/nacos/nacos/data/protocol/raft
2023-01-09 13:52:47,879 INFO create raft group : nacos_config
2023-01-09 13:52:47,966 INFO [***JRaftServer createMultiRaftGroup], electionTimeout : 450000, RpcConnectTimeoutMs: 450000
2023-01-09 13:52:51,142 INFO This Raft event changes : RaftEvent{groupId='nacos_config', leader='192.168.11.3:46588', term=1, raftClusterInfo=[192.168.10.40:46588, 192.168.11.3:46588, 192.168.9.172:46588, 192.168.11.37:46588, 192.168.8.218:46588]}
2023-01-09 13:52:53,194 ERROR ReadIndex has error : read-index request timeout, go to Leader read.
2023-01-09 13:52:53,216 INFO invokeToLeader by group : nacos_config
2023-01-09 13:52:53,217 ERROR >>>>>>>>> invokeToLeader by group failed! <<<<<<<<<<<< : nacos_config
2023-01-09 13:52:53,225 INFO shutdown jraft server
2023-01-09 13:52:53,225 INFO ========= The raft protocol is starting to close =========
2023-01-09 13:52:53,282 INFO ========= The raft protocol has been closed =========

Summary

I have tested several clusters with embedded storage of different sizes, such as 3, 5, 7, and 9 nodes, again and again, and I concluded that the "one node's nacos_config group can't find the Leader" situation occurs more frequently as the cluster size grows. The clusters I tested that used MySQL as external storage worked every time, but they are not my cup of tea.

So I wonder whether external storage is more recommended and more suitable than embedded storage, especially when the cluster size becomes large enough. Is there something I have to do when I need such a large cluster with embedded storage?

KomachiSion commented 1 year ago

Maybe these issues are related to your last question: #9800 #9792

mygithubzh123 commented 1 year ago

OK. It seems that switching the cluster storage from derby to MySQL is more practical than debugging and analyzing failures like this, although my case would prefer internal storage. Thank you!
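
For later readers: switching a Nacos 2.1.0 cluster from embedded derby to external MySQL is, roughly, the application.properties change sketched below. The JDBC URL and credentials are placeholders; the server is then started without the '-p embedded' flag, and the MySQL schema must first be initialized with the SQL script shipped in the distribution's conf directory:

### Use MySQL as external storage instead of embedded derby
spring.datasource.platform=mysql
db.num=1
db.url.0=jdbc:mysql://127.0.0.1:3306/nacos_config?characterEncoding=utf8&connectTimeout=1000&socketTimeout=3000&autoReconnect=true&useSSL=false
db.user.0=nacos
db.password.0=nacos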

JiangJianCong commented 2 months ago

(JiangJianCong quotes mygithubzh123's "Previously on" comment above in full.)

Hi, I want to figure out why this happens. Did you finally find out why it happened?