The test failure reported in https://github.com/AntelopeIO/spring/issues/69 describes a scenario in which, while syncing in LIB catchup, blocks can be posted to the main thread out of order, causing unlinkable block exceptions. From the issue:
The net-3 thread didn't get any CPU for about 3ms. These log statements should happen right after each other. There is a mutex lock between them, which I assume is what is causing this large delay.
During those 3ms, the net_plugin started syncing from a different peer and received 14 blocks. We continue to post from two different streams of blocks, causing a huge number of unlinkable (out of order) blocks.
Possible fix:
One possible solution is to keep an explicit ordered queue of incoming blocks while in LIB catchup. We could insert into this queue and pull from it as we process the blocks. If the next block is not in the queue we could wait for it to arrive. Currently there is an implicit queue for the app thread we post into. This explicit ordered queue would only be applicable during LIB catchup. After LIB catchup the current scheme of processing blocks as they come in would need to be used.
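The explicit ordered queue described above could be sketched roughly as follows. This is only an illustration, not Spring's actual code: `pending_block`, `catchup_queue`, and their members are hypothetical names, the payload is a placeholder string, and the sketch is not thread-safe (a real version shared between net threads would need a lock or a strand).

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <optional>
#include <string>
#include <utility>

// Hypothetical stand-in for a received block; not the Spring block type.
struct pending_block {
   uint32_t    block_num = 0;
   std::string payload; // placeholder for the serialized block
};

// Buffers blocks by block number and releases them strictly in order.
// Intended to be used only while in LIB catchup.
class catchup_queue {
public:
   explicit catchup_queue(uint32_t next_expected) : next_(next_expected) {}

   // Returns false if the block was already released or is already buffered.
   bool insert(pending_block b) {
      if (b.block_num < next_) return false;
      const uint32_t num = b.block_num;
      return pending_.emplace(num, std::move(b)).second;
   }

   // Returns the next in-order block, or std::nullopt if it has not arrived yet,
   // in which case the caller waits for it instead of posting a later block.
   std::optional<pending_block> pop_next() {
      auto it = pending_.find(next_);
      if (it == pending_.end()) return std::nullopt;
      pending_block b = std::move(it->second);
      pending_.erase(it);
      ++next_;
      return b;
   }

   uint32_t    next_expected() const { return next_; }
   std::size_t size() const { return pending_.size(); }

private:
   uint32_t                          next_;    // next block number to release
   std::map<uint32_t, pending_block> pending_; // buffered blocks keyed by number
};
```

With this shape, blocks arriving from two peers at once are merged into one ordered stream: duplicates are rejected on insert, and a gap simply stalls `pop_next()` until the missing block arrives.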
Alternatively, we could keep track of the last block posted to the app thread during LIB catchup. If an incoming block would not link to it, we could sleep for a few milliseconds before attempting to post again, and then drop the block if it still would not be linkable. This would be simpler than an explicit queue, but dropped blocks would have to be requested again from the network.
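A minimal sketch of this simpler alternative, under stated assumptions: `post_tracker`, `post_decision`, and `try_post` are hypothetical names, and the block number is used as a proxy for linkability (a real check would compare the incoming block's previous id against the id of the last posted block). The retry count and wait time are arbitrary illustrative values.

```cpp
#include <atomic>
#include <chrono>
#include <cstdint>
#include <thread>

enum class post_decision { post, drop };

// Tracks the last block number posted to the app thread during LIB catchup.
class post_tracker {
public:
   explicit post_tracker(uint32_t last_posted) : last_posted_(last_posted) {}

   // Returns post if block_num directly follows the last posted block
   // (waiting briefly for its predecessor if needed), otherwise drop.
   // Assumes block_num >= 1.
   post_decision try_post(uint32_t block_num,
                          int retries = 3,
                          std::chrono::milliseconds wait = std::chrono::milliseconds(2)) {
      for (int attempt = 0;; ++attempt) {
         uint32_t expected = block_num - 1; // reset each try; CAS overwrites it on failure
         if (last_posted_.compare_exchange_strong(expected, block_num))
            return post_decision::post;     // claimed the slot; caller posts the block
         if (attempt >= retries)
            return post_decision::drop;     // still not linkable; re-request from the network
         std::this_thread::sleep_for(wait);
      }
   }

   uint32_t last_posted() const { return last_posted_.load(); }

private:
   std::atomic<uint32_t> last_posted_;
};
```

The compare-and-swap means that only one thread can claim a given block number, so two peers delivering the same range cannot both post it. The cost is the one noted above: a dropped block has to be fetched again.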
Priority
Seems low priority, as the node does recover and continue syncing. The only real harm is a large number of unlinkable block log messages and a bit of extra processing to determine that a block does not link, which does not take much time.