Test Failure: nodeos_startup_catchup_if_lr_test

heifner commented 4 months ago

https://github.com/AntelopeIO/spring/actions/runs/8831635726/job/24247940383

ERROR: Node14 has unlinkable blocks

heifner commented 4 months ago

Did eventually come out of it, but had 219 unlinkable block exceptions. The test expects less than 15.

heifner commented 4 months ago

This looks suspicious:

debug 2024-04-25T11:44:33.707 net-3     net_plugin.cpp:3788           handle_message       ] ["localhost:9880 - adfca38" - 6 127.0.0.1:9880] posting block 161 to dispatcher strand
debug 2024-04-25T11:44:33.707 net-0     net_plugin.cpp:3840           operator()           ] posting block 161 to app thread
debug 2024-04-25T11:44:33.709 net-2     net_plugin.cpp:3788           handle_message       ] ["localhost:9877 - b41fa1a" - 15 127.0.0.1:42822] posting block 163 to dispatcher strand
debug 2024-04-25T11:44:33.709 net-1     net_plugin.cpp:3840           operator()           ] posting block 163 to app thread

Looks like net-3 thread didn't get any CPU for about 3ms. These log statements that should happen right after each other. There is a mutex lock between them, which I assume is what is causing this large delay.

debug 2024-04-25T11:44:33.707 net-3     net_plugin.cpp:2532           sync_recv_block      ] ["localhost:9880 - adfca38" - 6 127.0.0.1:9880] calling sync_wait, block 162
debug 2024-04-25T11:44:33.710 net-3     net_plugin.cpp:3788           handle_message       ] ["localhost:9880 - adfca38" - 6 127.0.0.1:9880] posting block 162 to dispatcher strand

During those 3ms, the net_plugin has started syncing from a different peer and received 14 blocks. We continue to post from two different streams of blocks causing a huge number of unlinkable (out of order) blocks.

One possible solution is to keep an explicit ordered queue of incoming blocks while in LIB catchup. We could insert into this queue and pull from it as we process the blocks. If the next block is not in the queue we could wait for it to arrive. Currently there is an implicit queue for the app thread we post into. This explicit ordered queue would only be applicable during LIB catchup. After LIB catchup the current scheme of processing blocks as they come in would need to be used.

Seems low priority as the node does recover and continue syncing, the only real harm is a large number of unlinkable block log messages and a bit of extra processing determining that the block does not link which does not take much time.

heifner commented 4 months ago

Same failure: https://github.com/AntelopeIO/spring/actions/runs/8847642165/job/24296229867

debug 2024-04-26T11:53:34.168 net-0     net_plugin.cpp:3130           process_next_block_m ] ["localhost:9880 - 24f47bf" - 6 127.0.0.1:9880] received block 149, id bf2b2b53dc02a90e..., latency: 64168ms, head 1
debug 2024-04-26T11:53:34.168 net-0     net_plugin.cpp:2476           sync_recv_block      ] ["localhost:9880 - 24f47bf" - 6 127.0.0.1:9880] got block 149:bf2b2b53dc02a90e.. latency 0ms
debug 2024-04-26T11:53:34.168 net-0     net_plugin.cpp:2532           sync_recv_block      ] ["localhost:9880 - 24f47bf" - 6 127.0.0.1:9880] calling sync_wait, block 149
debug 2024-04-26T11:53:34.171 net-0     net_plugin.cpp:3788           handle_message       ] ["localhost:9880 - 24f47bf" - 6 127.0.0.1:9880] posting block 149 to dispatcher strand
debug 2024-04-26T11:53:34.171 net-3     net_plugin.cpp:3840           operator()           ] posting block 149 to app thread

3ms between receiving and posting on the main, meanwhile many other blocks have been posted to the main thread.

arhag commented 4 months ago

For this issue, we just want to make the test more reliable and less sensitive to the number of unlinkable blocks that occurred during the test.

Either we remove the check that the number of unlinkable blocks is less than the threshold, or we just increase the threshold to account for the fact that such unlinkable blocks during lib catchup are more likely with the changes we made to the implementation for how we process blocks.

A separate issue can track the low-priority optimization to reduce the likelihood of these unlinkable blocks occurring during lib catchup in the first place.

AntelopeIO / spring

Test Failure: nodeos_startup_catchup_if_lr_test #69