cita-cloud / consensus_raft

The raft consensus component for CITA Cloud.
Apache License 2.0

New validator node waits a long time to join raft #95

Closed rink1969 closed 6 months ago

rink1969 commented 6 months ago

controller log

| Time | Log line |
| -- | -- |
| 2024-03-12 17:12:44.259 | 2024-03-12T17:12:44.259441663+08:00  INFO finalize_block: controller::core::chain: finalize block(163) success: pool len: 0, pool quota: 0. hash: 0xa4b369aecb5c0dde7d4d0b9649c18c78f1953be28bd09901a912d26cf047bb0b |
| 2024-03-12 17:12:44.259 | 2024-03-12T17:12:44.2594258+08:00  INFO finalize_block: controller::core::chain: update auditor and pool, tx_hash_list len 1 |
| 2024-03-12 17:12:44.258 | 2024-03-12T17:12:44.258841612+08:00  INFO finalize_block: controller::core::chain: store AllBlockData(163) success: hash: 0xa4b369aecb5c0dde7d4d0b9649c18c78f1953be28bd09901a912d26cf047bb0b |
| 2024-03-12 17:12:44.256 | 2024-03-12T17:12:44.256412026+08:00  INFO finalize_block: controller::core::chain: execute block(163) Success: state_root: 0xcfb38b8f72345a07b9d9d86dfd67a1823e8be216c0aecf8663dfd0b078bcad92. hash: 0xa4b369aecb5c0dde7d4d0b9649c18c78f1953be28bd09901a912d26cf047bb0b |
| 2024-03-12 17:12:44.253 | 2024-03-12T17:12:44.253604158+08:00  INFO controller::protocol::sync_manager: sync_manager get block: from origin: 6fb494bc455c1d79, heights: [163] |

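The new node was added to the validator list at height 163. It synced up to block 163 and executed it successfully, but for a long time it still did not take part in consensus.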

consensus_raft log

| Time | Log line |
| -- | -- |
| 2024-03-12 17:12:56.313 | Mar 12 09:12:56.313 INFO incoming config doesn't contain this node, wait for next one, module: consensus::peer:306 |
| 2024-03-12 17:12:53.311 | Mar 12 09:12:53.311 INFO incoming config doesn't contain this node, wait for next one, module: consensus::peer:306 |
| 2024-03-12 17:12:50.310 | Mar 12 09:12:50.310 INFO incoming config doesn't contain this node, wait for next one, module: consensus::peer:306 |
| 2024-03-12 17:12:47.309 | Mar 12 09:12:47.309 INFO incoming config doesn't contain this node, wait for next one, module: consensus::peer:306 |
| 2024-03-12 17:12:44.308 | Mar 12 09:12:44.308 INFO incoming config doesn't contain this node, wait for next one, module: consensus::peer:306 |
| 2024-03-12 17:12:41.308 | Mar 12 09:12:41.306 INFO incoming config doesn't contain this node, wait for next one, module: consensus::peer:306 |
| 2024-03-12 17:12:38.305 | Mar 12 09:12:38.304 INFO incoming config doesn't contain this node, wait for next one, module: consensus::peer:306 |
| 2024-03-12 17:12:35.303 | Mar 12 09:12:35.303 INFO incoming config doesn't contain this node, wait for next one, module: consensus::peer:306 |
| 2024-03-12 17:12:32.303 | Mar 12 09:12:32.302 INFO incoming config doesn't contain this node, wait for next one, module: consensus::peer:306 |
| 2024-03-12 17:12:29.301 | Mar 12 09:12:29.301 INFO incoming config doesn't contain this node, wait for next one, module: consensus::peer:306 |
| 2024-03-12 17:12:26.299 | Mar 12 09:12:26.299 INFO incoming config doesn't contain this node, wait for next one, module: consensus::peer:306 |
| 2024-03-12 17:12:23.298 | Mar 12 09:12:23.298 INFO incoming config doesn't contain this node, wait for next one, module: consensus::peer:328 |
| 2024-03-12 17:12:18.298 | Mar 12 09:12:18.298 INFO ping_controller.., module: consensus::peer:320 |

rink1969 commented 6 months ago

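Root cause: the node was newly created and synced blocks from genesis very quickly. After syncing and executing each height, it sent a config message to consensus, so more than 100 config messages piled up in the channel. However, consensus_raft handles config messages on a 3 s interval and processes only one per tick, so it took more than 300 seconds before the node noticed that it was already a validator and started taking part in consensus.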

Solution: consensus_raft still handles config messages on a 3 s interval, but on each tick it processes all of the messages that have piled up.
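The difference between the two strategies can be sketched roughly as below. This is only an illustration, not the actual consensus_raft code: the `Config` struct, the helper names (`drain_latest`, `ticks_until_joined`, `backlog`) and the use of `std::sync::mpsc` are all assumptions made to keep the example self-contained.

```rust
use std::sync::mpsc::{channel, Receiver};

/// Hypothetical stand-in for the config message sent by the controller.
#[derive(Debug, Clone, PartialEq)]
struct Config {
    height: u64,
    validators: Vec<String>,
}

/// Drain everything currently queued and keep only the newest config.
/// Returns `None` if nothing was waiting.
fn drain_latest(rx: &Receiver<Config>) -> Option<Config> {
    let mut latest = None;
    while let Ok(cfg) = rx.try_recv() {
        latest = Some(cfg);
    }
    latest
}

/// Count the ticks until the node first handles a config that lists it.
/// `batch == false`: one message per tick (behaviour described in the root cause).
/// `batch == true`: drain the whole queue on each tick (the proposed fix).
fn ticks_until_joined(rx: &Receiver<Config>, node: &str, batch: bool) -> Option<usize> {
    let mut ticks = 0;
    loop {
        ticks += 1;
        let cfg = if batch { drain_latest(rx) } else { rx.try_recv().ok() };
        match cfg {
            Some(c) if c.validators.iter().any(|v| v == node) => return Some(ticks),
            Some(_) => continue,
            None => return None, // queue is empty and the node was never listed
        }
    }
}

/// Build a queue of `n` configs in which only the last one lists `node`.
fn backlog(n: u64, node: &str) -> Receiver<Config> {
    let (tx, rx) = channel();
    for height in 1..=n {
        let mut validators = vec!["validator-0".to_string()];
        if height == n {
            validators.push(node.to_string());
        }
        tx.send(Config { height, validators }).unwrap();
    }
    rx
}
```

With a backlog of N messages where only the last one lists the node, `ticks_until_joined(&backlog(n, "new-node"), "new-node", false)` needs N ticks, while the same call with `batch` set to `true` needs one. At a 3 s tick interval that is the gap between roughly 3 × N seconds and 3 seconds.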

rink1969 commented 6 months ago

This is only one case of the problem: the node is not a validator at first and later becomes one.

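The other case is a node that starts out as a validator and is later removed from the validator set. If that node loses its data and syncs from scratch, it will at first believe it is a validator and start calling votes. Even after its blocks catch up to the latest height, where it is no longer a validator, it does not know this, because the other raft nodes no longer communicate with it. It therefore keeps trying to call votes, which affects the other validators.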