AntelopeIO / spring

C++ implementation of the Antelope protocol with Savanna consensus

Net: Reduce unlinkable block errors while LIB catchup syncing #81

Closed heifner closed 4 weeks ago

heifner commented 6 months ago

The test failure reported in https://github.com/AntelopeIO/spring/issues/69 describes a scenario in which, while syncing in LIB catchup, blocks can be posted to the main thread out of order, causing unlinkable block exceptions. From the issue:

Issue observed in test failure

debug 2024-04-25T11:44:33.707 net-3     net_plugin.cpp:3788           handle_message       ] ["localhost:9880 - adfca38" - 6 127.0.0.1:9880] posting block 161 to dispatcher strand
debug 2024-04-25T11:44:33.707 net-0     net_plugin.cpp:3840           operator()           ] posting block 161 to app thread
debug 2024-04-25T11:44:33.709 net-2     net_plugin.cpp:3788           handle_message       ] ["localhost:9877 - b41fa1a" - 15 127.0.0.1:42822] posting block 163 to dispatcher strand
debug 2024-04-25T11:44:33.709 net-1     net_plugin.cpp:3840           operator()           ] posting block 163 to app thread

The net-3 thread didn't get any CPU for about 3ms. The two log statements below should happen right after each other. There is a mutex lock between them, which I assume is what is causing this large delay.

debug 2024-04-25T11:44:33.707 net-3     net_plugin.cpp:2532           sync_recv_block      ] ["localhost:9880 - adfca38" - 6 127.0.0.1:9880] calling sync_wait, block 162
debug 2024-04-25T11:44:33.710 net-3     net_plugin.cpp:3788           handle_message       ] ["localhost:9880 - adfca38" - 6 127.0.0.1:9880] posting block 162 to dispatcher strand

During those 3ms, the net_plugin started syncing from a different peer and received 14 blocks. We then continue to post from two different streams of blocks, causing a huge number of unlinkable (out of order) blocks.

Possible fix:

One possible solution is to keep an explicit ordered queue of incoming blocks while in LIB catchup. We could insert into this queue and pull from it as we process the blocks. If the next block is not in the queue, we could wait for it to arrive. Currently the only queue is the implicit one of the app thread we post into. This explicit ordered queue would only be applicable during LIB catchup. After LIB catchup, the current scheme of processing blocks as they come in would still need to be used.
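
A minimal sketch of what such an ordered queue could look like, for discussion only. `block_stub` and `catchup_queue` are hypothetical names, not existing net_plugin types, and ordering is done purely by block number. A real implementation would also need locking, a size bound, and a check of the previous block id.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a received block; the real net_plugin types differ.
struct block_stub {
   uint32_t    num;   // block number
   std::string peer;  // peer the block came from (illustration only)
};

// Explicit ordered queue, intended only for use during LIB catchup.
// Blocks may be inserted in any order; they are released strictly by block number.
class catchup_queue {
public:
   explicit catchup_queue(uint32_t next_expected) : next_(next_expected) {}

   // Buffer an incoming block. Already-released blocks and duplicates are ignored.
   // Returns true if the block was stored.
   bool insert(block_stub b) {
      const uint32_t num = b.num;
      if (num < next_) return false;
      return pending_.emplace(num, std::move(b)).second;
   }

   // Release every block that is contiguous with the last released one.
   // Stops at the first gap, which is where we would wait for the missing block.
   std::vector<block_stub> drain_ready() {
      std::vector<block_stub> ready;
      for (auto it = pending_.find(next_); it != pending_.end(); it = pending_.find(next_)) {
         assert(it->first == next_);
         ready.push_back(std::move(it->second));
         pending_.erase(it);
         ++next_;
      }
      return ready;
   }

   uint32_t    next_expected() const { return next_; }
   std::size_t buffered() const { return pending_.size(); }

private:
   uint32_t                       next_;     // next block number to release
   std::map<uint32_t, block_stub> pending_;  // out-of-order blocks, keyed by number
};
```

With the block numbers from the log above: after inserting 161 and 163, `drain_ready()` releases only 161 and holds 163. Once 162 arrives, the next `drain_ready()` releases 162 and then 163, so the app thread would see them in order.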

Alternatively, we could keep track of the last block posted to the app thread during LIB catchup, sleep for a few milliseconds before attempting to post again, and then drop the block if it would not be linkable. This would be simpler than an explicit queue, but it drops blocks and requires requesting them again from the network.
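
A rough sketch of this second option, again with hypothetical names (`post_gate`, `try_post`) that do not exist in net_plugin. As a simplifying assumption, "linkable" is approximated as "block number is exactly one past the last posted block". The real check would compare against the previous block id.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstdint>
#include <thread>

// Hypothetical gate in front of the post to the app thread during LIB catchup.
class post_gate {
public:
   explicit post_gate(uint32_t last_posted) : last_posted_(last_posted) {}

   // Returns true if the caller may post block_num to the app thread.
   // If the block does not directly follow the last posted block, wait briefly
   // for the predecessor to be posted by another thread and try once more.
   // A false return means the caller drops the block; it must be re-requested.
   bool try_post(uint32_t block_num,
                 std::chrono::milliseconds wait = std::chrono::milliseconds(2)) {
      if (claim(block_num)) return true;
      std::this_thread::sleep_for(wait);
      return claim(block_num);
   }

   uint32_t last_posted() const { return last_posted_.load(); }

private:
   // Atomically advance last_posted_ from block_num - 1 to block_num.
   bool claim(uint32_t block_num) {
      assert(block_num > 0);
      uint32_t expected = block_num - 1;
      return last_posted_.compare_exchange_strong(expected, block_num);
   }

   std::atomic<uint32_t> last_posted_;
};
```

In the scenario from the log, with 161 already posted, `try_post(163)` would sleep and then return false unless 162 was posted in the meantime, so 163 is dropped rather than posted out of order. The compare-and-exchange also ensures that only one of two peers delivering the same block number gets to post it.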

Priority

This seems low priority, as the node does recover and continue syncing. The only real harm is a large number of unlinkable block log messages and a bit of extra processing to determine that a block does not link, which does not take much time.

heifner commented 4 weeks ago

Addressed via https://github.com/AntelopeIO/spring/pull/619