cita-cloud / controller

Apache License 2.0
8 stars 0 forks source link

备份恢复之后一段时间不出快 #189

Closed rink1969 closed 1 year ago

rink1969 commented 1 year ago

在节点849高度时做备份。

2022-11-14T00:03:08.406899871+00:00 INFO controller::chain - finalize_block height: 1050
2022-11-14T00:03:08.418200512+00:00 INFO controller::node_manager - update node: NodeAddress(b624507a66cc509e)
2022-11-14T00:03:08.418242371+00:00 INFO controller::controller - update global status node(NodeAddress(b624507a66cc509e)) height(1050)
2022-11-14T00:03:08.419183077+00:00 INFO controller::node_manager - update node: NodeAddress(f7f184b354658ab8)
2022-11-14T00:03:08.419934466+00:00 INFO controller::node_manager - update node: NodeAddress(4d6dd7ba39a7f0eb)
2022-11-14T00:03:08.422801518+00:00 INFO controller::node_manager - update node: NodeAddress(facdeb92ed4b447c)
2022-11-14T00:03:09.729300060+00:00 INFO cloud_util::signal - exit by signal: 15

节点到1050的时候,停掉节点,然后从备份恢复数据。然后重新启动节点:

2022-11-14T00:06:14.456532721+00:00 INFO controller - this is an old chain!
2022-11-14T00:06:14.457412741+00:00 INFO controller - current block number: 849, current block hash: 0x6c020c583258209041dbcc9187b9d3be01d61623c05cd32740d0782b0f98e11e
2022-11-14T00:06:14.458591599+00:00 INFO controller - lock_id: 1000 stored data
2022-11-14T00:06:14.459349543+00:00 INFO controller - lock_id: 1001 stored data
2022-11-14T00:06:14.464943316+00:00 INFO controller - lock_id: 1006 stored data
2022-11-14T00:06:14.466106271+00:00 INFO controller - load sys_config complete
2022-11-14T00:06:14.466153583+00:00 INFO controller::controller - node address: efbb0a6f99f2060a5b62dec66aaf3fc108d2dd3b
2022-11-14T00:06:14.468632516+00:00 INFO controller::chain - confirm executor status
2022-11-14T00:06:14.471630694+00:00 INFO controller::chain - exec_block(0): status: ReenterBlock, state_root: 0x56e81f171bcc55a6ff8345e692c0f86e5b48e01b996cadc001622fb5e363b421
2022-11-14T00:06:14.471663150+00:00 INFO controller::chain - executor is ready!
2022-11-14T00:06:14.528213598+00:00 INFO controller::controller - init_block_number: 849
2022-11-14T00:06:14.528418544+00:00 INFO controller - start controller grpc server!
2022-11-14T00:06:14.528440411+00:00 INFO controller - metrics on
2022-11-14T00:06:14.528543886+00:00 INFO cloud_util::metrics - exporting metrics to http://0.0.0.0:60004/metrics
2022-11-14T00:06:14.529589664+00:00 INFO controller - receive BroadCastCSI event task
2022-11-14T00:06:14.531097875+00:00 INFO controller::controller - reconfigure consensus success!
2022-11-14T00:06:14.535131835+00:00 INFO cloud_util::metrics - register histogram consensus_to_GetProposal succeeded
2022-11-14T00:06:14.535298937+00:00 INFO controller::chain - add_proposal: tx poll len 0
2022-11-14T00:06:14.536563262+00:00 INFO controller::chain - proposal 850 block_hash 0x079a41c99987274079f8a856154cf09dc05258978436477afd22a79d76eb03d4 prevhash 0x6c020c583258209041dbcc9187b9d3be01d61623c05cd32740d0782b0f98e11e
2022-11-14T00:06:14.558839600+00:00 INFO cloud_util::metrics - register histogram network_to_ProcessNetworkMsg succeeded
2022-11-14T00:06:14.567755265+00:00 INFO controller::node_manager - update node: NodeAddress(4d6dd7ba39a7f0eb)
2022-11-14T00:06:14.569179440+00:00 INFO controller::controller - update global status node(NodeAddress(4d6dd7ba39a7f0eb)) height(1086)
2022-11-14T00:06:14.572105511+00:00 INFO controller::node_manager - update node: NodeAddress(b624507a66cc509e)
2022-11-14T00:06:14.572105511+00:00 INFO controller::node_manager - update node: NodeAddress(b624507a66cc509e)
2022-11-14T00:06:14.668551084+00:00 WARN controller - rpc: process_network_msg(chain_status) failed: EarlyStatus
2022-11-14T00:06:14.728631880+00:00 INFO controller::node_manager - update node: NodeAddress(facdeb92ed4b447c)
2022-11-14T00:06:14.779005744+00:00 WARN controller - rpc: process_network_msg(chain_status) failed: EarlyStatus

节点高度回退到849,也发现了其他节点的高度比它高。但是没有开始同步区块。 等待高度5分钟没有增长,controller重启之后,又可以正常同步区块了。

2022-11-14T00:12:31.574407442+00:00 INFO controller - this is an old chain!
2022-11-14T00:12:31.575283934+00:00 INFO controller - current block number: 849, current block hash: 0x6c020c583258209041dbcc9187b9d3be01d61623c05cd32740d0782b0f98e11e
2022-11-14T00:12:31.576421909+00:00 INFO controller - lock_id: 1000 stored data
2022-11-14T00:12:31.577147637+00:00 INFO controller - lock_id: 1001 stored data
2022-11-14T00:12:31.582885113+00:00 INFO controller - lock_id: 1006 stored data
2022-11-14T00:12:31.584407473+00:00 INFO controller - load sys_config complete
2022-11-14T00:12:31.584452786+00:00 INFO controller::controller - node address: efbb0a6f99f2060a5b62dec66aaf3fc108d2dd3b
2022-11-14T00:12:31.586018163+00:00 INFO controller::chain - confirm executor status
2022-11-14T00:12:31.589264160+00:00 INFO controller::chain - exec_block(0): status: ReenterBlock, state_root: 0x56e81f171bcc55a6ff8345e692c0f86e5b48e01b996cadc001622fb5e363b421
2022-11-14T00:12:31.589298544+00:00 INFO controller::chain - executor is ready!
2022-11-14T00:12:31.647027107+00:00 INFO controller::controller - init_block_number: 849
2022-11-14T00:12:31.647204745+00:00 INFO controller - start controller grpc server!
2022-11-14T00:12:31.647229426+00:00 INFO controller - metrics on
2022-11-14T00:12:31.647330525+00:00 INFO cloud_util::metrics - exporting metrics to http://0.0.0.0:60004/metrics
2022-11-14T00:12:31.648412703+00:00 INFO controller - receive BroadCastCSI event task
2022-11-14T00:12:31.649605171+00:00 INFO controller::controller - reconfigure consensus success!
2022-11-14T00:12:38.897888596+00:00 INFO cloud_util::metrics - register histogram network_to_ProcessNetworkMsg succeeded
2022-11-14T00:12:38.905037854+00:00 INFO controller::node_manager - update node: NodeAddress(b624507a66cc509e)
2022-11-14T00:12:38.907175610+00:00 INFO controller::controller - update global status node(NodeAddress(b624507a66cc509e)) height(1171)
2022-11-14T00:12:38.907386816+00:00 INFO controller::protocol::sync_manager - SyncBlockRequest: start 850, end 859
rink1969 commented 1 year ago

根据代码分析,是因为在发起同步之前,节点出了一个proposal。 这样在try_sync_block中判断时,因为 self.candidates.is_empty() 条件不成立,所以判定为 ChainStep::OnlineStep,进而决定不开始同步:

ChainStep::OnlineStep => {
        controller_clone.set_sync_state(false).await;
        controller_clone.sync_manager.clear().await;
        return;
    }
rink1969 commented 1 year ago

重启之后恢复正常很容易理解,因为重启之后,proposal没有了。 但是重启之前,如果收到别的节点发来的更高的proposal或者chain status,应该会把本节点的candidates清理掉,然后开始同步。 所以现在问题就变成了,为什么没有触发清理candidates?

rink1969 commented 1 year ago
  1. 没有收到更高的proposal是因为节点高度跟别的节点差比较多。bft里有相关的机制,不会把proposal发给落后很多的节点。
  2. 没有收到别的节点的chain status。一是因为别的节点没有重启,在正常运行,不会主动发送chain status。二是本节点发送的chain status被别的节点接收到之后,判定为EarlyStatus,没有给响应。