Closed by liamsi 3 years ago
So after extensive investigation, the issue of DAS content resolution timing out turned out to be relatively simple to diagnose but not straightforward to fix.
TL;DR: Block production is much faster than DHT providing, leading to a constantly growing gap between a DASing node and the network.
The most valuable contribution to resolving the problem is likely https://github.com/lazyledger/lazyledger-core/issues/378, where we store the DAHeader in IPFS and rely on a single DataHash instead of multiple roots. Building on that, we should provide the DataHash synchronously to satisfy the first takeaway, and provide only the DataHash to satisfy the second. We may also keep asynchronously providing all the remaining inner nodes and leaves in the background, to contribute to subtree discoverability just in case. Optimistically, the providing operation should finish before it is the node's turn to propose and start providing again.
Fortunately, the recently discussed new DHT update also comes into play here. More info here. It contributes to the third takeaway by introducing a new DHT node type with a full routing table that can short-circuit long queries. However, I need to look more deeply into the implementation to understand all the features and possible tradeoffs before relying on it.
The proper solution would take too much time for the MVP, so we need to come up with something short-term and, ideally, not wasteful of our time:
Even with Bitswap providing disabled, we won't achieve ~30 secs to announce all DAH roots at the max block size (yet to be proven). Thus, the issue remains.
I can now confirm that even with manual synchronous providing and all IPFS/Bitswap async providing disabled, I still get a similar and impractical ~3 mins to announce 32 roots to the network.
For the MVP, we can also rely on rows only. This workaround halves the providing time (~1.5 min); I observed that in practice.
More info and explanation regarding the new DHT client mentioned in the proper solution, and how it can help with this specific case of long-lasting providing. To understand why it helps, we should first understand how both it and the regular client work.
Let's start with an explanation of how kDHT searching works. Anybody reading this should imagine a circle of dots in buckets (groups) of size k (the circle formation is out of scope here), where each dot is a network node storing some part of the global network's key-to-value mappings. When any dot in the circle wants to find/put some value for a key, it:

1. hashes the key and computes its XOR distance to the dots it already knows about;
2. queries the closest known dots, which respond either with the value/providers or with dots even closer to the key;
3. repeats the query against each newly learned, closer set of dots until no closer ones are found.

So the basic DHT client suffers from the requirement to make these multiple hops toward the closest dots, and those hops are the main reason providing/putting something on the DHT takes so long.
The new DHT client, instead of keeping only some portion of the key/value space, periodically crawls and syncs the whole network. This allows zero hops: it can do set or get ops with dots directly. Comparably to blockchain state syncing, this DHT client also requires some time to instantiate and download the whole network state. Luckily, we can already rely on practical results showing providing times of <3 secs. Furthermore, the new client helps in the case of disappearing and unreliable DHT peers, as it remembers what they were providing, preserving content discoverability. Keeping a copy of the full DHT network/routing table on the node is not cheap, but a proposer's interest aligns with fast providing and solid content discoverability, so that's a valid tradeoff.
Can we verify that after a node downloads data from a node that it has discovered via DHT, it will maintain a connection to that node using BitSwap?
Edited the opening comment to reflect this sub-task and hid both of our comments to keep this focused.
to understand if we would see these timeouts during consensus too
setting this to done
Thanks! Can you update your comment above to include a sentence or two about the result? Otherwise it is hard to see what the outcome of this was.
For our DHT case, we have the Provide and GetProviders operations. On the IPFS network, a Provide operation can take up to 3 mins, which was the main cause of the issue; GetProviders can take up to 1 min, but often takes less than 10 secs. For networks of fewer than 20 nodes, both operations should take less than 10 secs, as the bucket size is 20 and no hops are expected.
@liamsi, those timings are mostly unavoidable and apply in any case, so if used with consensus they would show up there as well. Luckily for us, we decided to go with the push approach.
Closing this, further work and info is now here: https://github.com/lazyledger/lazyledger-core/issues/395
Summary
We (me, and later @Wondertan, who confirmed my observation) observed the following behavior: when spinning up a lazyledger validator node on DigitalOcean and starting a light client locally, DAS on the light client times out.
We currently work around this by adding the full node's IPFS multiaddress to the light client's bootstrap nodes, but it is important: