Closed ZrBac closed 7 months ago
Fail to load snapshot, path=/opt/test/apps/nacos/data/protocol/raft/nacos_config/snapshot/snapshot_4115, file list={derby_data.zip=LocalFileMeta{fileMeta={checkSum=5a8bc59229f45409}}}, {}.
java.util.zip.ZipException: invalid stored block lengths
Judging from the error, your snapshot file is corrupted: while decompressing the zip, the JDK reports that a block length is invalid.
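To confirm the diagnosis before touching anything, the archive can be read end to end with the same JDK class that appears in the stack trace (`java.util.zip.ZipInputStream`). The following is only a minimal sketch: the class name `SnapshotZipCheck` and the method `isReadable` are made up for illustration and are not part of Nacos or JRaft.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.zip.ZipInputStream;

// Hypothetical helper, not a Nacos API: streams through every entry of a zip
// and reports whether the whole archive could be inflated.
final class SnapshotZipCheck {

    static boolean isReadable(Path zipFile) {
        byte[] buffer = new byte[8192];
        int entries = 0;
        try (InputStream in = Files.newInputStream(zipFile);
             ZipInputStream zip = new ZipInputStream(in)) {
            while (zip.getNextEntry() != null) {
                // Drain the entry; a damaged deflate stream or a truncated file
                // surfaces here as a ZipException or EOFException.
                while (zip.read(buffer) != -1) {
                    // data is discarded, only readability matters
                }
                entries++;
            }
        } catch (IOException e) {
            return false;
        }
        // Assumption: an archive with no entries is treated as unusable too.
        return entries > 0;
    }

    public static void main(String[] args) {
        Path zip = Paths.get(args[0]);
        System.out.println(zip + " readable: " + isReadable(zip));
    }
}
```

Run it against the `derby_data.zip` inside the snapshot directory named in the log. A `false` result is consistent with the corruption described above, but it does not tell you what caused it.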
For this kind of file corruption, is there a way to avoid it, or to recover the node?
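In a multi-node cluster, simply delete the data files of the affected group. After startup, the node fetches the snapshot from the leader again and writes it to disk.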
In standalone mode there is no way around it: you can only delete the files and start up again, which causes data loss.
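A cautious variant of the step above is to rename the group's data directory instead of deleting it, so the damaged files stay available for inspection. This is a sketch under assumptions: the helper `RaftGroupDataQuarantine` is hypothetical, the directory layout (`.../data/protocol/raft/nacos_config`) is taken from the paths in the logs, and the node must be stopped before running it.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper, not part of Nacos: moves a Raft group's data directory
// aside so that the node starts without local data for that group.
final class RaftGroupDataQuarantine {

    private static final DateTimeFormatter STAMP =
            DateTimeFormatter.ofPattern("yyyyMMddHHmmss");

    static Path moveAside(Path groupDir) throws IOException {
        Path dir = groupDir.toAbsolutePath();
        if (!Files.isDirectory(dir)) {
            throw new NoSuchFileException(dir.toString());
        }
        // A sibling name keeps the rename on the same file system, so the
        // directory is moved as a whole rather than copied entry by entry.
        String backupName = dir.getFileName() + ".corrupt-" + LocalDateTime.now().format(STAMP);
        return Files.move(dir, dir.resolveSibling(backupName));
    }

    public static void main(String[] args) throws IOException {
        // Example argument: /opt/test/apps/nacos/data/protocol/raft/nacos_config
        System.out.println("moved to " + moveAside(Paths.get(args[0])));
    }
}
```

In a cluster, the restarted node should then pull a fresh snapshot from the leader as described above. In standalone mode the renamed directory is the only remaining copy of the data, and it is the damaged one.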
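Could a check be added that moves abnormal files aside when they are detected, so that startup is not affected? That way a cluster deployment could still heal itself.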
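No. By the definition of the Raft protocol, an exception during snapshot or apply leaves the state machine's data inconsistent and abnormal, and the state machine is then in a non-working state. A CP protocol may sacrifice availability (A) but must guarantee data consistency (C). So when snapshot loading fails (for whatever reason) or log replay fails (for whatever reason) and the state machine cannot reach a consistent state, it must stop, and manual intervention is required to recover.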
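The fail-stop rule described here can be illustrated with a toy state machine. This is not JRaft's or Nacos's implementation; `FailStopStateMachine` and its methods are invented for the example and only show the idea that, once a snapshot load fails, every later operation is refused until someone repairs the node.

```java
import java.util.concurrent.Callable;

// Toy illustration of fail-stop behaviour; names are hypothetical.
final class FailStopStateMachine {

    private final StringBuilder state = new StringBuilder();
    private Exception fatalError;

    // Replaces the whole state with the snapshot contents, or halts on failure.
    void loadSnapshot(Callable<String> loader) {
        ensureHealthy();
        try {
            String data = loader.call();
            state.setLength(0);
            state.append(data);
        } catch (Exception e) {
            // Remember the first error: the state can no longer be trusted.
            fatalError = e;
            throw new IllegalStateException("snapshot load failed, node halted", e);
        }
    }

    // Applies one log entry on top of the current state.
    void apply(String entry) {
        ensureHealthy();
        state.append(entry);
    }

    boolean isHalted() {
        return fatalError != null;
    }

    String state() {
        return state.toString();
    }

    private void ensureHealthy() {
        if (fatalError != null) {
            throw new IllegalStateException("state machine is in error status", fatalError);
        }
    }
}
```

Skipping the bad snapshot and carrying on would let `apply` run against a state that no longer matches the rest of the group, which is exactly the inconsistency the reply rules out.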
In fact, the cluster should still be available at this point. It just has one faulty node (unless that node is the only one).
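The arithmetic behind that statement: Raft needs a majority of the voting members to elect a leader and commit entries, so a 3-node group keeps working with one node down, while a 1-node group does not. A small sketch (the class `RaftQuorum` is illustrative only):

```java
// Illustrative majority arithmetic for a Raft group of n voting members.
final class RaftQuorum {

    // Smallest number of members that forms a majority.
    static int majority(int members) {
        if (members < 1) {
            throw new IllegalArgumentException("members must be >= 1");
        }
        return members / 2 + 1;
    }

    // How many members may fail while a majority still exists.
    static int toleratedFailures(int members) {
        return members - majority(members);
    }

    // Whether the group can still make progress with the given healthy count.
    static boolean isAvailable(int members, int healthy) {
        return healthy >= majority(members);
    }

    public static void main(String[] args) {
        System.out.println(majority(3));          // 2
        System.out.println(toleratedFailures(3)); // 1
        System.out.println(isAvailable(3, 2));    // true
        System.out.println(isAvailable(1, 0));    // false
    }
}
```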
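Describe the bug Embedded Derby database, 3-node cluster. While simulating a failure-recovery scenario, nodes are shut down and restarted; intermittently, an individual node fails to start. The reported cause is a failure to load the derby_data file.
Expected behavior Relevant logs: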
nacos.log
Caused by: org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'distributedDatabaseOperateImpl' defined in URL [jar:file:/opt/CSE/apps/nacos/target/nacos-server.jar!/BOOT-INF/lib/nacos-config-2.1.0.jar!/com/alibaba/nacos/config/server/service/repository/embedded/DistributedDatabaseOperateImpl.class]: Bean instantiation via constructor failed; nested exception is org.springframework.beans.BeanInstantiationException: Failed to instantiate [com.alibaba.nacos.config.server.service.repository.embedded.DistributedDatabaseOperateImpl]: Constructor threw exception; nested exception is java.lang.IllegalStateException: Fail to init node, please see the logs to find the reason.
at org.springframework.beans.factory.support.ConstructorResolver.instantiate(ConstructorResolver.java:304)
at org.springframework.beans.factory.support.ConstructorResolver.autowireConstructor(ConstructorResolver.java:285)
at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.autowireConstructor(AbstractAutowireCapableBeanFactory.java:1338)
at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.createBeanInstance(AbstractAutowireCapableBeanFactory.java:1185)
at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.doCreateBean(AbstractAutowireCapableBeanFactory.java:554)
at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.createBean(AbstractAutowireCapableBeanFactory.java:514)
at org.springframework.beans.factory.support.AbstractBeanFactory.lambda$doGetBean$0(AbstractBeanFactory.java:321)
at org.springframework.beans.factory.support.DefaultSingletonBeanRegistry.getSingleton(DefaultSingletonBeanRegistry.java:234)
at org.springframework.beans.factory.support.AbstractBeanFactory.doGetBean(AbstractBeanFactory.java:319)
at org.springframework.beans.factory.support.AbstractBeanFactory.getBean(AbstractBeanFactory.java:199)
at org.springframework.beans.factory.config.DependencyDescriptor.resolveCandidate(DependencyDescriptor.java:277)
at org.springframework.beans.factory.support.DefaultListableBeanFactory.doResolveDependency(DefaultListableBeanFactory.java:1276)
at org.springframework.beans.factory.support.DefaultListableBeanFactory.resolveDependency(DefaultListableBeanFactory.java:1196)
at org.springframework.beans.factory.support.ConstructorResolver.resolveAutowiredArgument(ConstructorResolver.java:857)
at org.springframework.beans.factory.support.ConstructorResolver.createArgumentArray(ConstructorResolver.java:760)
... 41 common frames omitted
Caused by: org.springframework.beans.BeanInstantiationException: Failed to instantiate [com.alibaba.nacos.config.server.service.repository.embedded.DistributedDatabaseOperateImpl]: Constructor threw exception; nested exception is java.lang.IllegalStateException: Fail to init node, please see the logs to find the reason.
at org.springframework.beans.BeanUtils.instantiateClass(BeanUtils.java:187)
at org.springframework.beans.factory.support.SimpleInstantiationStrategy.instantiate(SimpleInstantiationStrategy.java:117)
at org.springframework.beans.factory.support.ConstructorResolver.instantiate(ConstructorResolver.java:300)
... 55 common frames omitted
Caused by: java.lang.IllegalStateException: Fail to init node, please see the logs to find the reason.
at com.alipay.sofa.jraft.RaftServiceFactory.createAndInitRaftNode(RaftServiceFactory.java:48)
at com.alipay.sofa.jraft.RaftGroupService.start(RaftGroupService.java:129)
at com.alibaba.nacos.core.distributed.raft.JRaftServer.createMultiRaftGroup(JRaftServer.java:269)
at com.alibaba.nacos.core.distributed.raft.JRaftProtocol.addRequestProcessors(JRaftProtocol.java:163)
at com.alibaba.nacos.config.server.service.repository.embedded.DistributedDatabaseOperateImpl.init(DistributedDatabaseOperateImpl.java:208)
at com.alibaba.nacos.config.server.service.repository.embedded.DistributedDatabaseOperateImpl.<init>(DistributedDatabaseOperateImpl.java:174)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.springframework.beans.BeanUtils.instantiateClass(BeanUtils.java:175)
... 57 common frames omitted
protocol-raft.log
2024-04-09 15:07:34,017 ERROR Fail to load snapshot from /opt/CSE/apps/nacos/data/protocol/raft/nacos_config/snapshot, FirstSnapshotLoadDone status is Status[UNKNOWN<-1>: StateMachine onSnapshotLoad failed].
2024-04-09 15:07:34,017 ERROR Encountered an error=Status[ESTATEMACHINE<10002>: StateMachine onSnapshotLoad failed] on StateMachine com.alibaba.nacos.core.distributed.raft.NacosStateMachine, it's highly recommended to implement this method as raft stops working since some error occurs, you should figure out the cause and repair or remove this node.
com.alipay.sofa.jraft.error.RaftException: StateMachine onSnapshotLoad failed
at com.alipay.sofa.jraft.core.FSMCallerImpl.doSnapshotLoad(FSMCallerImpl.java:656)
at com.alipay.sofa.jraft.core.FSMCallerImpl.runApplyTask(FSMCallerImpl.java:399)
at com.alipay.sofa.jraft.core.FSMCallerImpl.access$100(FSMCallerImpl.java:73)
at com.alipay.sofa.jraft.core.FSMCallerImpl$ApplyTaskHandler.onEvent(FSMCallerImpl.java:148)
at com.alipay.sofa.jraft.core.FSMCallerImpl$ApplyTaskHandler.onEvent(FSMCallerImpl.java:142)
at com.lmax.disruptor.BatchEventProcessor.run(BatchEventProcessor.java:137)
at java.lang.Thread.run(Thread.java:748)
2024-04-09 15:07:34,017 ERROR Node <nacos_config/devuc-gamma-az2-1:7848> initSnapshotStorage failed.
2024-04-09 15:07:34,017 WARN Node <nacos_config/devuc-gamma-az2-1:7848> got error: {}.
com.alipay.sofa.jraft.error.RaftException: StateMachine onSnapshotLoad failed
at com.alipay.sofa.jraft.core.FSMCallerImpl.doSnapshotLoad(FSMCallerImpl.java:656)
at com.alipay.sofa.jraft.core.FSMCallerImpl.runApplyTask(FSMCallerImpl.java:399)
at com.alipay.sofa.jraft.core.FSMCallerImpl.access$100(FSMCallerImpl.java:73)
at com.alipay.sofa.jraft.core.FSMCallerImpl$ApplyTaskHandler.onEvent(FSMCallerImpl.java:148)
at com.alipay.sofa.jraft.core.FSMCallerImpl$ApplyTaskHandler.onEvent(FSMCallerImpl.java:142)
at com.lmax.disruptor.BatchEventProcessor.run(BatchEventProcessor.java:137)
at java.lang.Thread.run(Thread.java:748)
2024-04-09 15:07:34,017 WARN FSMCaller already in error status, ignore new error.
com.alipay.sofa.jraft.error.RaftException: StateMachine onSnapshotLoad failed
at com.alipay.sofa.jraft.core.FSMCallerImpl.doSnapshotLoad(FSMCallerImpl.java:656)
at com.alipay.sofa.jraft.core.FSMCallerImpl.runApplyTask(FSMCallerImpl.java:399)
at com.alipay.sofa.jraft.core.FSMCallerImpl.access$100(FSMCallerImpl.java:73)
at com.alipay.sofa.jraft.core.FSMCallerImpl$ApplyTaskHandler.onEvent(FSMCallerImpl.java:148)
at com.alipay.sofa.jraft.core.FSMCallerImpl$ApplyTaskHandler.onEvent(FSMCallerImpl.java:142)
at com.lmax.disruptor.BatchEventProcessor.run(BatchEventProcessor.java:137)
at java.lang.Thread.run(Thread.java:748)
2024-04-09 15:07:34,021 INFO Node <naming_persistent_service/devuc-gamma-az2-1:7848> shutdown, currTerm=5 state=STATE_FOLLOWER.
config-fatal.log
2024-04-11 15:11:06,959 ERROR Fail to load snapshot, path=/opt/test/apps/nacos/data/protocol/raft/nacos_config/snapshot/snapshot_4115, file list={derby_data.zip=LocalFileMeta{fileMeta={checkSum=5a8bc59229f45409}}}, {}.
java.util.zip.ZipException: invalid stored block lengths
at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
at java.util.zip.ZipInputStream.read(ZipInputStream.java:194)
at java.io.FilterInputStream.read(FilterInputStream.java:107)
at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1127)
at org.apache.commons.io.IOUtils.copy(IOUtils.java:849)
at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1104)
at org.apache.commons.io.IOUtils.copy(IOUtils.java:825)
at com.alibaba.nacos.sys.utils.DiskUtils.decompress(DiskUtils.java:433)
at com.alibaba.nacos.config.server.service.repository.embedded.DerbySnapshotOperation.onSnapshotLoad(DerbySnapshotOperation.java:120)
at com.alibaba.nacos.core.distributed.raft.NacosStateMachine$1.onSnapshotLoad(NacosStateMachine.java:308)
at com.alibaba.nacos.core.distributed.raft.NacosStateMachine.onSnapshotLoad(NacosStateMachine.java:172)
at com.alipay.sofa.jraft.core.FSMCallerImpl.doSnapshotLoad(FSMCallerImpl.java:654)
at com.alipay.sofa.jraft.core.FSMCallerImpl.runApplyTask(FSMCallerImpl.java:399)
at com.alipay.sofa.jraft.core.FSMCallerImpl.access$100(FSMCallerImpl.java:73)
at com.alipay.sofa.jraft.core.FSMCallerImpl$ApplyTaskHandler.onEvent(FSMCallerImpl.java:148)
at com.alipay.sofa.jraft.core.FSMCallerImpl$ApplyTaskHandler.onEvent(FSMCallerImpl.java:142)
at com.lmax.disruptor.BatchEventProcessor.run(BatchEventProcessor.java:137)
at java.lang.Thread.run(Thread.java:748)
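From the logs, it appears that on the node restarted after shutdown, the derby_data.zip file cannot be decompressed and loaded properly.
Is there a good way to fix this or work around it?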