alibaba / nacos

an easy-to-use dynamic service discovery, configuration and service management platform for building cloud native applications.
https://nacos.io
Apache License 2.0
30.41k stars 12.87k forks

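Nacos 2.1.0 in cluster mode (3 nodes) with the embedded database: when simulating a failure by forcibly powering off a node and then restarting it, the node intermittently fails to start #11959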

Closed ZrBac closed 7 months ago

ZrBac commented 7 months ago

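Describe the bug With the embedded Derby database and a 3-node cluster, we simulated a failure-recovery scenario. After a node is powered off and restarted, individual nodes intermittently fail to start. The reported cause is a failure to load the derby_data file.

Expected behavior Relevant logs: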

nacos.log

Caused by: org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'distributedDatabaseOperateImpl' defined in URL [jar:file:/opt/CSE/apps/nacos/target/nacos-server.jar!/BOOT-INF/lib/nacos-config-2.1.0.jar!/com/alibaba/nacos/config/server/service/repository/embedded/DistributedDatabaseOperateImpl.class]: Bean instantiation via constructor failed; nested exception is org.springframework.beans.BeanInstantiationException: Failed to instantiate [com.alibaba.nacos.config.server.service.repository.embedded.DistributedDatabaseOperateImpl]: Constructor threw exception; nested exception is java.lang.IllegalStateException: Fail to init node, please see the logs to find the reason.
    at org.springframework.beans.factory.support.ConstructorResolver.instantiate(ConstructorResolver.java:304)
    at org.springframework.beans.factory.support.ConstructorResolver.autowireConstructor(ConstructorResolver.java:285)
    at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.autowireConstructor(AbstractAutowireCapableBeanFactory.java:1338)
    at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.createBeanInstance(AbstractAutowireCapableBeanFactory.java:1185)
    at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.doCreateBean(AbstractAutowireCapableBeanFactory.java:554)
    at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.createBean(AbstractAutowireCapableBeanFactory.java:514)
    at org.springframework.beans.factory.support.AbstractBeanFactory.lambda$doGetBean$0(AbstractBeanFactory.java:321)
    at org.springframework.beans.factory.support.DefaultSingletonBeanRegistry.getSingleton(DefaultSingletonBeanRegistry.java:234)
    at org.springframework.beans.factory.support.AbstractBeanFactory.doGetBean(AbstractBeanFactory.java:319)
    at org.springframework.beans.factory.support.AbstractBeanFactory.getBean(AbstractBeanFactory.java:199)
    at org.springframework.beans.factory.config.DependencyDescriptor.resolveCandidate(DependencyDescriptor.java:277)
    at org.springframework.beans.factory.support.DefaultListableBeanFactory.doResolveDependency(DefaultListableBeanFactory.java:1276)
    at org.springframework.beans.factory.support.DefaultListableBeanFactory.resolveDependency(DefaultListableBeanFactory.java:1196)
    at org.springframework.beans.factory.support.ConstructorResolver.resolveAutowiredArgument(ConstructorResolver.java:857)
    at org.springframework.beans.factory.support.ConstructorResolver.createArgumentArray(ConstructorResolver.java:760)
    ... 41 common frames omitted
Caused by: org.springframework.beans.BeanInstantiationException: Failed to instantiate [com.alibaba.nacos.config.server.service.repository.embedded.DistributedDatabaseOperateImpl]: Constructor threw exception; nested exception is java.lang.IllegalStateException: Fail to init node, please see the logs to find the reason.
    at org.springframework.beans.BeanUtils.instantiateClass(BeanUtils.java:187)
    at org.springframework.beans.factory.support.SimpleInstantiationStrategy.instantiate(SimpleInstantiationStrategy.java:117)
    at org.springframework.beans.factory.support.ConstructorResolver.instantiate(ConstructorResolver.java:300)
    ... 55 common frames omitted
Caused by: java.lang.IllegalStateException: Fail to init node, please see the logs to find the reason.
    at com.alipay.sofa.jraft.RaftServiceFactory.createAndInitRaftNode(RaftServiceFactory.java:48)
    at com.alipay.sofa.jraft.RaftGroupService.start(RaftGroupService.java:129)
    at com.alibaba.nacos.core.distributed.raft.JRaftServer.createMultiRaftGroup(JRaftServer.java:269)
    at com.alibaba.nacos.core.distributed.raft.JRaftProtocol.addRequestProcessors(JRaftProtocol.java:163)
    at com.alibaba.nacos.config.server.service.repository.embedded.DistributedDatabaseOperateImpl.init(DistributedDatabaseOperateImpl.java:208)
    at com.alibaba.nacos.config.server.service.repository.embedded.DistributedDatabaseOperateImpl.<init>(DistributedDatabaseOperateImpl.java:174)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at org.springframework.beans.BeanUtils.instantiateClass(BeanUtils.java:175)
    ... 57 common frames omitted

protocol-raft.log

2024-04-09 15:07:34,017 ERROR Fail to load snapshot from /opt/CSE/apps/nacos/data/protocol/raft/nacos_config/snapshot, FirstSnapshotLoadDone status is Status[UNKNOWN<-1>: StateMachine onSnapshotLoad failed].

2024-04-09 15:07:34,017 ERROR Encountered an error=Status[ESTATEMACHINE<10002>: StateMachine onSnapshotLoad failed] on StateMachine com.alibaba.nacos.core.distributed.raft.NacosStateMachine, it's highly recommended to implement this method as raft stops working since some error occurs, you should figure out the cause and repair or remove this node.

com.alipay.sofa.jraft.error.RaftException: StateMachine onSnapshotLoad failed
    at com.alipay.sofa.jraft.core.FSMCallerImpl.doSnapshotLoad(FSMCallerImpl.java:656)
    at com.alipay.sofa.jraft.core.FSMCallerImpl.runApplyTask(FSMCallerImpl.java:399)
    at com.alipay.sofa.jraft.core.FSMCallerImpl.access$100(FSMCallerImpl.java:73)
    at com.alipay.sofa.jraft.core.FSMCallerImpl$ApplyTaskHandler.onEvent(FSMCallerImpl.java:148)
    at com.alipay.sofa.jraft.core.FSMCallerImpl$ApplyTaskHandler.onEvent(FSMCallerImpl.java:142)
    at com.lmax.disruptor.BatchEventProcessor.run(BatchEventProcessor.java:137)
    at java.lang.Thread.run(Thread.java:748)

2024-04-09 15:07:34,017 ERROR Node <nacos_config/devuc-gamma-az2-1:7848> initSnapshotStorage failed.

2024-04-09 15:07:34,017 WARN Node <nacos_config/devuc-gamma-az2-1:7848> got error: {}.

com.alipay.sofa.jraft.error.RaftException: StateMachine onSnapshotLoad failed
    at com.alipay.sofa.jraft.core.FSMCallerImpl.doSnapshotLoad(FSMCallerImpl.java:656)
    at com.alipay.sofa.jraft.core.FSMCallerImpl.runApplyTask(FSMCallerImpl.java:399)
    at com.alipay.sofa.jraft.core.FSMCallerImpl.access$100(FSMCallerImpl.java:73)
    at com.alipay.sofa.jraft.core.FSMCallerImpl$ApplyTaskHandler.onEvent(FSMCallerImpl.java:148)
    at com.alipay.sofa.jraft.core.FSMCallerImpl$ApplyTaskHandler.onEvent(FSMCallerImpl.java:142)
    at com.lmax.disruptor.BatchEventProcessor.run(BatchEventProcessor.java:137)
    at java.lang.Thread.run(Thread.java:748)

2024-04-09 15:07:34,017 WARN FSMCaller already in error status, ignore new error.

com.alipay.sofa.jraft.error.RaftException: StateMachine onSnapshotLoad failed
    at com.alipay.sofa.jraft.core.FSMCallerImpl.doSnapshotLoad(FSMCallerImpl.java:656)
    at com.alipay.sofa.jraft.core.FSMCallerImpl.runApplyTask(FSMCallerImpl.java:399)
    at com.alipay.sofa.jraft.core.FSMCallerImpl.access$100(FSMCallerImpl.java:73)
    at com.alipay.sofa.jraft.core.FSMCallerImpl$ApplyTaskHandler.onEvent(FSMCallerImpl.java:148)
    at com.alipay.sofa.jraft.core.FSMCallerImpl$ApplyTaskHandler.onEvent(FSMCallerImpl.java:142)
    at com.lmax.disruptor.BatchEventProcessor.run(BatchEventProcessor.java:137)
    at java.lang.Thread.run(Thread.java:748)

2024-04-09 15:07:34,021 INFO Node <naming_persistent_service/devuc-gamma-az2-1:7848> shutdown, currTerm=5 state=STATE_FOLLOWER.

config-fatal.log

2024-04-11 15:11:06,959 ERROR Fail to load snapshot, path=/opt/test/apps/nacos/data/protocol/raft/nacos_config/snapshot/snapshot_4115, file list={derby_data.zip=LocalFileMeta{fileMeta={checkSum=5a8bc59229f45409}}}, {}.

java.util.zip.ZipException: invalid stored block lengths
    at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
    at java.util.zip.ZipInputStream.read(ZipInputStream.java:194)
    at java.io.FilterInputStream.read(FilterInputStream.java:107)
    at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1127)
    at org.apache.commons.io.IOUtils.copy(IOUtils.java:849)
    at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1104)
    at org.apache.commons.io.IOUtils.copy(IOUtils.java:825)
    at com.alibaba.nacos.sys.utils.DiskUtils.decompress(DiskUtils.java:433)
    at com.alibaba.nacos.config.server.service.repository.embedded.DerbySnapshotOperation.onSnapshotLoad(DerbySnapshotOperation.java:120)
    at com.alibaba.nacos.core.distributed.raft.NacosStateMachine$1.onSnapshotLoad(NacosStateMachine.java:308)
    at com.alibaba.nacos.core.distributed.raft.NacosStateMachine.onSnapshotLoad(NacosStateMachine.java:172)
    at com.alipay.sofa.jraft.core.FSMCallerImpl.doSnapshotLoad(FSMCallerImpl.java:654)
    at com.alipay.sofa.jraft.core.FSMCallerImpl.runApplyTask(FSMCallerImpl.java:399)
    at com.alipay.sofa.jraft.core.FSMCallerImpl.access$100(FSMCallerImpl.java:73)
    at com.alipay.sofa.jraft.core.FSMCallerImpl$ApplyTaskHandler.onEvent(FSMCallerImpl.java:148)
    at com.alipay.sofa.jraft.core.FSMCallerImpl$ApplyTaskHandler.onEvent(FSMCallerImpl.java:142)
    at com.lmax.disruptor.BatchEventProcessor.run(BatchEventProcessor.java:137)
    at java.lang.Thread.run(Thread.java:748)

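Judging from the logs, on the node that was restarted after the power-off, the derby_data.zip file cannot be decompressed and loaded.

Is there a good way to fix this or work around it?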

KomachiSion commented 7 months ago

Fail to load snapshot, path=/opt/test/apps/nacos/data/protocol/raft/nacos_config/snapshot/snapshot_4115, file list={derby_data.zip=LocalFileMeta{fileMeta={checkSum=5a8bc59229f45409}}}, {}.

java.util.zip.ZipException: invalid stored block lengths

Judging from the error, your snapshot file is corrupted: during decompression, zip reports that the block lengths are invalid.
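To confirm this kind of corruption outside of Nacos, a snapshot archive can be read through end to end with the JDK's own zip classes. The following is a minimal sketch, not Nacos code: the class name `SnapshotZipCheck` and the method `isReadable` are hypothetical, and the only thing it assumes about the snapshot is that `derby_data.zip` is an ordinary zip file, as the stack trace suggests.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.zip.ZipInputStream;

// Hypothetical helper, not part of Nacos or JRaft.
final class SnapshotZipCheck {

    /**
     * Returns true if the file contains at least one zip entry and every
     * entry can be inflated to its end. Any I/O or zip format error
     * (including java.util.zip.ZipException) yields false.
     */
    static boolean isReadable(Path zip) {
        byte[] buffer = new byte[8192];
        try (InputStream in = Files.newInputStream(zip);
             ZipInputStream zin = new ZipInputStream(in)) {
            int entries = 0;
            while (zin.getNextEntry() != null) {
                // Drain the entry; the bytes themselves are discarded.
                while (zin.read(buffer) != -1) {
                    // nothing to do
                }
                entries++;
            }
            return entries > 0;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // args[0] is the path to a snapshot archive such as derby_data.zip.
        Path zip = Paths.get(args[0]);
        System.out.println(zip + (isReadable(zip) ? " is readable" : " is corrupted or not a zip"));
    }
}
```

A file that fails this check would be expected to fail in `DiskUtils.decompress` as well, although the exact exception message may differ.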

ZrBac commented 7 months ago

In this file-corruption scenario, is there a way to avoid the problem or to recover the node?

KomachiSion commented 7 months ago

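In a multi-node cluster, simply delete the data files of the affected group. After startup, the node will fetch the snapshot from the leader again and write it to disk.

In standalone mode there is no way to recover. You can only delete the files and start the node, which will lose the data.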
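For the cluster case, the step above can be sketched as follows. This is a cautious variant rather than the exact procedure described: instead of deleting the group's data, it renames the directory so the corrupted files stay available for inspection. The class `RaftGroupDataReset` and its method are hypothetical, the directory layout (`<raft root>/<group>`, for example `/opt/CSE/apps/nacos/data/protocol/raft/nacos_config`) is taken from the log paths in this issue, and the sketch assumes the faulty node has been stopped before it runs.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of Nacos. Run only while the faulty node is stopped.
final class RaftGroupDataReset {

    /**
     * Renames <raftRoot>/<group> to <raftRoot>/<group>.bak-<suffix>.
     * Returns the backup path, or null if the group directory does not exist.
     * Fails with an IOException if the backup path already exists.
     */
    static Path moveAside(Path raftRoot, String group, String suffix) throws IOException {
        Path groupDir = raftRoot.resolve(group);
        if (!Files.isDirectory(groupDir)) {
            return null;
        }
        Path backup = raftRoot.resolve(group + ".bak-" + suffix);
        // Same parent directory, so this is a plain rename on the same file system.
        return Files.move(groupDir, backup);
    }

    public static void main(String[] args) throws IOException {
        // args[0]: raft data root, e.g. .../nacos/data/protocol/raft
        // args[1]: raft group name, e.g. nacos_config
        String suffix = String.valueOf(System.currentTimeMillis());
        Path backup = moveAside(Paths.get(args[0]), args[1], suffix);
        System.out.println(backup == null ? "nothing to move" : "moved to " + backup);
    }
}
```

Once the restarted node has caught up with the leader, the backup directory can be removed by hand.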

little-cui commented 7 months ago

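Would it be possible to add a check that moves abnormal files aside when they are encountered, so that startup is not blocked? That way the cluster could still heal itself.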

KomachiSion commented 7 months ago

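No. By the definition of the Raft protocol, an exception during snapshot loading or log apply leaves the state machine's data inconsistent and abnormal, and the state machine can no longer work. A CP protocol may sacrifice availability (A), but it must guarantee data consistency (C). So when snapshot loading fails (for whatever reason) or log replay fails (for whatever reason) and the state machine cannot reach a consistent state, the node must stop, and recovery requires manual intervention.

In fact, the cluster as a whole should still be available at this point. There is just one faulty node (unless it is the only node).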
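The availability claim follows from Raft's majority rule: a group keeps working as long as more than half of its members are healthy. A small sketch of that arithmetic, with the hypothetical class name `RaftQuorum`:

```java
// Hypothetical helper illustrating Raft's majority rule; not Nacos or JRaft code.
final class RaftQuorum {

    /** Smallest number of nodes that forms a majority of the group. */
    static int majority(int nodes) {
        return nodes / 2 + 1;
    }

    /** Number of nodes that may fail while a majority still remains. */
    static int tolerableFailures(int nodes) {
        return nodes - majority(nodes);
    }

    /** True if the group still has a majority after the given number of failures. */
    static boolean available(int nodes, int failed) {
        return nodes - failed >= majority(nodes);
    }

    public static void main(String[] args) {
        // The 3-node cluster from this issue with one node down.
        System.out.println("majority(3) = " + majority(3));
        System.out.println("tolerableFailures(3) = " + tolerableFailures(3));
        System.out.println("available(3, 1) = " + available(3, 1));
    }
}
```

For the 3-node cluster in this issue the majority is 2, so one failed node is tolerated. A single-node deployment tolerates none, which matches the "unless it is the only node" caveat.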