celestiaorg / celestia-core


Investigate initial timeouts on DHT / DAS #377

Closed liamsi closed 3 years ago

liamsi commented 3 years ago

Summary

We observed the following behavior (first me, later confirmed by @Wondertan): when spinning up a lazyledger validator node on DigitalOcean and starting a light client locally, DAS for the light client times out.

We currently work around this by adding the full node's IPFS multiaddress to the light client's bootstrap nodes, but it is important:

Wondertan commented 3 years ago

After extensive investigation, the issue of DAS content resolution timing out turned out to be relatively trivial to diagnose but not straightforward to fix.

TL;DR: Block production is much faster than DHT providing, leading to a constantly growing gap between the DASing node and the network.

Investigation

Key Takeaways

Solutions

Proper

The most valuable contribution to resolving this is potentially https://github.com/lazyledger/lazyledger-core/issues/378, where we store the DAHeader in IPFS and rely on a single DataHash instead of multiple roots. Furthermore, we should start providing the DataHash synchronously to meet the first takeaway, and the DataHash only to satisfy the second. However, we may also consider keeping asynchronous providing for all the remaining nodes and leaves in the background, to contribute to subtree observability just in case. Optimistically, that providing operation should finish before it is the node's turn to propose and start providing again.
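A minimal sketch of that providing order, assuming a generic content-routing interface; `ContentRouter`, `ProvideBlock`, `dataHashCID`, and `innerNodeCIDs` are hypothetical names for illustration, not existing lazyledger-core identifiers:

```go
package provide

import (
	"context"
	"log"
	"time"
)

// ContentRouter is a hypothetical stand-in for the DHT providing operation.
type ContentRouter interface {
	// Provide announces to the network that this node can serve the given key.
	Provide(ctx context.Context, key string) error
}

// ProvideBlock announces the single DataHash synchronously, so the block is
// discoverable before the proposer moves on, and pushes the remaining inner
// nodes/leaves in the background purely to improve subtree observability.
func ProvideBlock(ctx context.Context, r ContentRouter, dataHashCID string, innerNodeCIDs []string) error {
	// Synchronous: the one key DASing light clients need to find the block.
	syncCtx, cancel := context.WithTimeout(ctx, time.Minute)
	defer cancel()
	if err := r.Provide(syncCtx, dataHashCID); err != nil {
		return err
	}

	// Asynchronous: ideally finishes before this node's next turn to propose.
	go func() {
		for _, c := range innerNodeCIDs {
			if err := r.Provide(ctx, c); err != nil {
				log.Printf("background provide of %s: %v", c, err)
			}
		}
	}()
	return nil
}
```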

Fortunately, the recently discussed new DHT update comes into play here as well. More info here. It also contributes to the third takeaway by introducing a new DHT node type with full routing tables that can short-circuit long queries. However, I need to look more deeply into the implementation to understand all the features and possible tradeoffs before relying on it.

Quicker for MVP

The proper solution would take too much time for the MVP. Thus we need to come up with something short-term and, ideally, not too time-consuming:

Wondertan commented 3 years ago

Even after testing with Bitswap providing disabled, we won't achieve ~30 secs to announce all DAH roots at max block size (yet to be proven). Thus, the issue remains.

I can now confirm that even with manual sync providing and all IPFS/Bitswap async providing disabled, I still get a similar and impractical ~3 mins to announce 32 roots to the network.

Wondertan commented 3 years ago

For the MVP case, we can also rely on rows only. This workaround halves the providing time (~1.5 min), which I observed in practice.

Wondertan commented 3 years ago

Some more info and explanation regarding the new DHT client mentioned in the proper solution, and how it can help solve this specific case of long-lasting providing. To understand why it helps, we should first understand how it and the regular client work.

Let's start with an explanation of how kDHT searching works. Imagine a circle of dots grouped into buckets of size k (how the circle is formed is out of scope here), where each dot is a network node storing some part of the global key-to-value mappings. When any dot in the circle wants to find or put a value for a key, it:

  1. gets the dots closest to the key from its own bucket,
  2. queries them for the closest dots in their buckets,
  3. repeats step 2 recursively until it finally reaches the dot holding the value, to get or set it.

So the basic DHT client suffers from having to make these multiple hops towards the closest dots, and those hops are the main reason it takes so much time to provide/put something on the DHT.
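To make the hop counting concrete, here is a self-contained toy sketch of that iterative lookup. IDs are small integers and closeness is XOR distance, as in Kademlia; all names here are illustrative and unrelated to the real go-libp2p-kad-dht code:

```go
package main

import (
	"fmt"
	"sort"
)

const k = 3 // bucket size: how many closest dots each node knows/returns

// node is one "dot" in the circle; peers are the dots in its buckets.
type node struct {
	id    uint16
	peers []uint16
}

// closest returns up to k of the given ids, ordered by XOR distance to target.
func closest(ids []uint16, target uint16) []uint16 {
	sorted := append([]uint16(nil), ids...)
	sort.Slice(sorted, func(i, j int) bool {
		return sorted[i]^target < sorted[j]^target
	})
	if len(sorted) > k {
		sorted = sorted[:k]
	}
	return sorted
}

// lookup walks towards target starting from start's own bucket (step 1),
// querying the closest dots for their closest dots (step 2) and repeating
// until no closer dot is found (step 3). It returns the number of hops made.
func lookup(network map[uint16]*node, start, target uint16) (hops int) {
	current := closest(network[start].peers, target)
	visited := map[uint16]bool{start: true}

	for {
		var next []uint16
		for _, id := range current {
			if visited[id] {
				continue
			}
			visited[id] = true
			hops++ // one network round trip per newly contacted dot
			next = append(next, closest(network[id].peers, target)...)
		}
		if len(next) == 0 {
			return hops
		}
		candidate := closest(append(next, current...), target)
		if candidate[0]^target >= current[0]^target {
			return hops // no progress towards the key: lookup converged
		}
		current = candidate
	}
}

func main() {
	// A tiny network where every dot only knows a few others, so reaching
	// a far-away key requires several hops.
	network := map[uint16]*node{
		1:   {1, []uint16{2, 4, 8}},
		2:   {2, []uint16{1, 4, 16}},
		4:   {4, []uint16{2, 8, 32}},
		8:   {8, []uint16{4, 16, 64}},
		16:  {16, []uint16{8, 32, 64}},
		32:  {32, []uint16{16, 64, 100}},
		64:  {64, []uint16{32, 100, 1}},
		100: {100, []uint16{64, 32, 16}},
	}
	fmt.Println("hops from dot 1 to key 100:", lookup(network, 1, 100))
}
```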

Instead of keeping only some portion of the key/value mappings, the new DHT client periodically crawls and syncs the whole network. This allows it to make 0 hops and to do set or get operations directly with the relevant dots. Comparable to blockchain state syncing, this DHT client also requires some time to instantiate and download the whole network state. Luckily, we can already rely on practical results showing providing times of <3 sec. Furthermore, the new client also handles the case of disappearing and unreliable DHT nodes, as it remembers what they were providing, preserving content discoverability. However, keeping a copy of the full DHT network/routing table on the node is not cheap; still, a proposer's interest aligns with fast providing and preserving solid content discoverability, so that's a valid tradeoff.
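Continuing the toy sketch above (my illustration, not project code): with a fully synced view of the network, choosing which dots to contact becomes a purely local operation, so the multi-hop walk disappears. `allIDs` stands in for the result of the periodic crawl:

```go
// 0 hops: with the whole network crawled and cached locally, the provider
// simply picks the k closest dots to the key from its own view and sends
// the provider records to them directly.
func lookupWithFullTable(allIDs []uint16, target uint16) []uint16 {
	return closest(allIDs, target)
}
```

The cost is the memory and crawl traffic needed to keep that local view fresh, which is exactly the tradeoff mentioned above.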

musalbas commented 3 years ago

Can we verify that, after a node downloads data from a peer it has discovered via the DHT, it will maintain a connection to that peer using Bitswap?

liamsi commented 3 years ago

Edited the opening comment to reflect this sub-task and hid both of our comments to keep this focused.

Wondertan commented 3 years ago

> to understand if we would see these timeouts during consensus too

Setting this to done.

liamsi commented 3 years ago

Thanks! Can you update your comment above to include a sentence or two about the result? Otherwise it is hard to see what the outcome of this was.

Wondertan commented 3 years ago

For our DHT case, we have the Provide and GetProviders operations. On the IPFS network, a Provide operation can take up to 3 mins, which was the main cause of this issue; GetProviders can take up to 1 min, but often takes less than 10 secs. For networks of fewer than 20 nodes, both operations should take less than 10 secs, as the bucket size is 20 and no hops are expected.
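As a rough illustration of how these bounds could be applied, here is a hedged sketch that wraps both operations in timeout contexts derived from the numbers above. The `Router` interface is a hypothetical stand-in mirroring the shape of these operations, not the actual libp2p API:

```go
package dastiming

import (
	"context"
	"time"
)

// Router is a hypothetical stand-in for the two DHT content-routing
// operations discussed above, not the real libp2p interface.
type Router interface {
	Provide(ctx context.Context, key string) error                  // announce that we hold the data
	GetProviders(ctx context.Context, key string) ([]string, error) // find who holds it
}

// Upper bounds observed on the public IPFS network (per the numbers above);
// networks smaller than the bucket size of 20 should stay well under 10s.
const (
	provideTimeout      = 3 * time.Minute
	getProvidersTimeout = 1 * time.Minute
)

// TimedProvide bounds Provide by its observed worst case and reports how
// long the announcement actually took.
func TimedProvide(ctx context.Context, r Router, key string) (time.Duration, error) {
	ctx, cancel := context.WithTimeout(ctx, provideTimeout)
	defer cancel()
	start := time.Now()
	err := r.Provide(ctx, key)
	return time.Since(start), err
}

// TimedGetProviders does the same for the lookup side.
func TimedGetProviders(ctx context.Context, r Router, key string) ([]string, time.Duration, error) {
	ctx, cancel := context.WithTimeout(ctx, getProvidersTimeout)
	defer cancel()
	start := time.Now()
	provs, err := r.GetProviders(ctx, key)
	return provs, time.Since(start), err
}
```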

@liamsi, those timings are mostly inevitable and apply in any case, so if used with consensus they would show up there as well. Fortunately for us, we decided to go with the push approach.

Closing this; further work and info is now here: https://github.com/lazyledger/lazyledger-core/issues/395