celestiaorg / celestia-node


Tree-based approach for partial storage nodes network sharding and peer discovery #1133

Open musalbas opened 3 years ago

musalbas commented 3 years ago

Partial storage nodes are nodes that only store some of the blocks in the blockchain, and can be queried by any other nodes (including light clients and other partial nodes) to download data from parts of the chain.

There are two main questions:

1. At what granularity should partial storage nodes store and advertise parts of the chain?
2. How do other nodes discover peers that are storing the blocks they want to download?

I propose a method of answering the above, with a scalable "tree-based" approach.

Granularity

Let's assume a network-wide constant MIN_GRANULARITY = 1000 blocks, where MIN_GRANULARITY is the minimum number of consecutive blocks you can advertise to the network that you are storing (which we call a "blockset"), and a constant BASE = 10. We call a range of blocksets a "blockrange" (e.g. blockrange 0-10K consists of blocksets 0-1K, 1K-2K, ..., 9K-10K). We can organise the blocksets into a directory structure, where each directory has BASE subdirectories (blockranges) or files (blocksets). Let's say there are 10 million blocks in the chain; the directory would look as follows:

0-10M/
├─ 0-1M/
│  ├─ 0-100K/
│  │  ├─ 0-10K/
│  │  │  ├─ 0-1K
│  │  │  ├─ 1K-2K
│  │  │  ├─ ...
│  │  │  ├─ 9K-10K
│  ├─ 100K-200K/
│  ├─ .../
│  ├─ 900K-1M/
├─ 1M-2M/
├─ .../
├─ 9M-10M/
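To make the tree concrete, here is a minimal Go sketch (function and variable names are illustrative, not an existing celestia-node API), assuming MIN_GRANULARITY = 1000 and BASE = 10 as above, that computes the chain of blockranges enclosing a given block height:

```go
package main

import "fmt"

const (
	minGranularity = 1000 // MIN_GRANULARITY: smallest advertisable blockset
	base           = 10   // BASE: fan-out of the directory tree
)

// enclosingRanges returns every blockrange/blockset containing the given
// block height, from the finest (blockset) up to the widest range that
// still fits within the chain height. Ranges are [start, end) block numbers.
func enclosingRanges(height, chainHeight uint64) [][2]uint64 {
	var ranges [][2]uint64
	for size := uint64(minGranularity); size <= chainHeight; size *= base {
		start := (height / size) * size
		ranges = append(ranges, [2]uint64{start, start + size})
	}
	return ranges
}

func main() {
	// Block 1500 in a 10M-block chain sits in 1K-2K, 0-10K, 0-100K, 0-1M, 0-10M.
	for _, r := range enclosingRanges(1500, 10_000_000) {
		fmt.Printf("%d-%d\n", r[0], r[1])
	}
}
```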

Peer discovery

Subnet-based

Each subdirectory (blockrange) or file (blockset) would be its own network topic. For example, a topic could be 0-10K (blockrange) or 0-1K (blockset). The network has the following interfaces:

- Advertise(topic): announce to the network that this node is serving the data in the given blockrange or blockset.
- GetPeers(topic): get peers that have advertised the given blockrange or blockset.
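As a rough sketch of how this interface could look in Go (names and signatures are assumptions for illustration, not an existing celestia-node API):

```go
package shardsketch

import (
	"context"

	"github.com/libp2p/go-libp2p/core/peer"
)

// Topic names a blockrange or blockset, e.g. "0-10K" or "1K-2K".
type Topic string

// ShardDiscovery is an illustrative interface for the subnet-based approach.
type ShardDiscovery interface {
	// Advertise announces that this node serves data for the given topic.
	Advertise(ctx context.Context, topic Topic) error
	// GetPeers returns known peers that have advertised the given topic.
	GetPeers(ctx context.Context, topic Topic) ([]peer.AddrInfo, error)
}
```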

The above operations might be expensive or time-consuming. Therefore, depending on how many blocks and blockranges there are in the network, partial storage nodes may only advertise blockranges down to a certain depth of the tree, and likewise clients querying the nodes might only try to get peers down to a certain depth. Let's assume a client-side variable GRANULARITY, where GRANULARITY >= MIN_GRANULARITY, on both partial storage nodes and client nodes.

When a partial storage node wants to call Advertise() on the blockranges that it's serving, it will only do so on blockranges whose granularity is no finer than GRANULARITY. For example, if a partial storage node is serving blocks 0-1M and GRANULARITY = 100,000, then it will call Advertise() on 0-1M, 0-100K, ..., 900K-1M, but not on 0-10K, ..., 9K-10K, etc.
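Continuing the sketch above (same base constant; the helper name is hypothetical), the set of blockranges a node serving [from, to) would advertise could be computed as:

```go
// advertisedTopics lists the blockranges a node serving blocks [from, to)
// would Advertise(), skipping anything finer than the chosen GRANULARITY.
// from and to are assumed to be multiples of MIN_GRANULARITY.
func advertisedTopics(from, to, granularity, chainHeight uint64) [][2]uint64 {
	var topics [][2]uint64
	for size := granularity; size <= chainHeight; size *= base {
		// Advertise only blockranges fully covered by [from, to).
		for start := (from / size) * size; start+size <= to; start += size {
			if start >= from {
				topics = append(topics, [2]uint64{start, start + size})
			}
		}
	}
	return topics
}

// advertisedTopics(0, 1_000_000, 100_000, 10_000_000) yields
// 0-100K, 100K-200K, ..., 900K-1M and 0-1M, but nothing finer or wider.
```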

Similarly, if a client wants to download data in block 1500 for example, the deepest blockrange it would try to GetPeers() for is 0-100K. One can also construct different algorithms to find peers, using a top-to-bottom approach. For example, the client can first call GetPeers() on blocks 0-10M, but if no node is storing 10M blocks, it could then try calling GetPeers() on blocks 0-1M, and so on.
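A hedged sketch of that top-to-bottom fallback, reusing the ShardDiscovery interface and constants from the sketches above (fmt, context, and libp2p's peer package are assumed to be imported; topic naming with raw block numbers rather than the "0-10K" shorthand is purely illustrative):

```go
// findPeersFor walks the tree top-down: start at the widest blockrange
// containing the block and narrow until peers are found or the client-side
// GRANULARITY floor is reached.
func findPeersFor(ctx context.Context, d ShardDiscovery, block, chainHeight, granularity uint64) ([]peer.AddrInfo, error) {
	// Widest range size that still fits within the chain height.
	size := uint64(minGranularity)
	for size*base <= chainHeight {
		size *= base
	}
	for ; size >= granularity; size /= base {
		start := (block / size) * size
		topic := Topic(fmt.Sprintf("%d-%d", start, start+size))
		if peers, err := d.GetPeers(ctx, topic); err == nil && len(peers) > 0 {
			return peers, nil
		}
	}
	return nil, fmt.Errorf("no peers found for block %d down to granularity %d", block, granularity)
}
```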

This would allow the network to self-adjust the acceptable data in each shard, depending on how big blocks are or how much storage resources partial nodes have.

Note: GRANULARITY is a client-side variable that can be adjusted automatically by the client itself based on its success on downloading blocks at different granularities. On the other hand, MIN_GRANULARITY and BASE are network-wide variables that have to be agreed network-wide as part of the p2p protocol.
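Purely as an illustration of what such self-adjustment could look like (this thread doesn't specify a concrete rule, so the thresholds and helper below are made up), a client could coarsen or refine its GRANULARITY based on recent lookup outcomes:

```go
// adjustGranularity: a hypothetical feedback rule. Coarsen after repeated
// failed lookups, refine after repeated successes, and never drop below
// the network-wide MIN_GRANULARITY.
func adjustGranularity(granularity, chainHeight uint64, failures, successes int) uint64 {
	switch {
	case failures >= 3 && granularity*base <= chainHeight:
		return granularity * base // coarser topics: fewer of them, peers easier to find
	case successes >= 3 && granularity/base >= minGranularity:
		return granularity / base // finer topics: more precise peer selection
	default:
		return granularity
	}
}
```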

Status message-based

An alternative to the subnet-based peer discovery approach is one where there's only one network of partial storage nodes, which have status messages that represent which blocks they have. Partial storage nodes would have the following interface:

- GetStatus(granularity): returns a bit field with one bit per blockrange of the given granularity, where a bit is set if the node stores that entire blockrange.

For example, if a GetStatus(1M) is called in a chain with 10M blocks, and the partial storage node is only storing blocks 1M-2M, the bit field would be as follows:

0100000000
 ^
 |
blockrange 1M-2M
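A minimal sketch of how such a bit field could be computed (the helper name and the use of a []bool instead of a packed bit field are illustrative only), assuming the node stores one contiguous range of blocks:

```go
// statusBits returns the bit field a node would report for GetStatus(granularity):
// one bit per blockrange of the given size, set when the node stores that entire
// range. storedFrom/storedTo describe the contiguous range the node stores.
func statusBits(granularity, chainHeight, storedFrom, storedTo uint64) []bool {
	n := chainHeight / granularity // number of blockranges at this granularity
	bits := make([]bool, n)
	for i := uint64(0); i < n; i++ {
		start, end := i*granularity, (i+1)*granularity
		bits[i] = storedFrom <= start && end <= storedTo
	}
	return bits
}

// statusBits(1_000_000, 10_000_000, 1_000_000, 2_000_000) sets only the second
// bit, i.e. 0100000000 as in the example above.
```
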
liamsi commented 3 years ago

A few high-level questions first and then some practical / implementation-specific ones:

(1) Why is the first approach called "Subnet-based"? Is this because the mentioned topics (e.g. 0-10K) are gossiped to a subset of the network? If yes, the assumption that this is gossip-based should be mentioned in the approach.

(2) Am I understanding correctly that the second approach is rather query-based? (You query for status of the nodes you are connected to and by doing so discover eventually the peer that has the data/range you are looking for)?

OK, on to the practical concerns:

Regarding (1), I don't think this works with gossip-sub (or any pub-sub-based gossiping library), as there would be an ever-growing amount of topics. My intuition is that a DHT might work better here.

On the other hand, MIN_GRANULARITY and BASE are network-wide variables that have to be agreed network-wide as part of the p2p protocol.

It's unclear to me what this agreement process could look like. Do you just mean it's a hardcoded constant in the software, or is there some actual agreement protocol involved?

In general, how is it guaranteed that not everyone stores the same 0-10K blocks and nothing else? How is it guaranteed that the network rather uniformly covers the whole chain?

liamsi commented 3 years ago

Looping in @adlerjohn, @Wondertan (for some practical libp2p considerations), and @mattdf as part of the specs/applied research team too.

musalbas commented 3 years ago

(1) Why is the first approach called "Subnet-based"? Is this because the mentioned topics (e.g. 0-10K) are gossiped to a subset of the network? If yes, the assumption that this is gossip-based should be mentioned in the approach.

Block data could be gossiped to nodes within a subnet, but this is orthogonal to the problem being addressed here, which is peer discovery for different network shards. The key requirement here is that different subnets have different peer lists. How the network shards actually receive or query that data is another topic, which could be via bitswap, graphsync, gossipsub, etc.

(2) Am I understanding correctly that the second approach is rather query-based? (You query for status of the nodes you are connected to and by doing so discover eventually the peer that has the data/range you are looking for)?

Yes. It assumes only one network with one "peer list".

Regarding (1), I don't think this works with gossip-sub (or any pub-sub-based gossiping library), as there would be an ever-growing amount of topics. My intuition is that a DHT might work better here.

That's the whole point of the GRANULARITY variable. If advertising to the subnet is too expensive, then the GRANULARITY should be quite high, to reduce the number of topics.
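To make the effect concrete, a back-of-the-envelope sketch (the numbers follow from the constants in the original post; nothing here is specified by the protocol):

```go
// topicCount returns how many distinct topics (blockranges and blocksets)
// exist at granularity g or coarser, for a chain of chainHeight blocks.
func topicCount(g, chainHeight uint64) uint64 {
	const base = 10 // BASE
	var n uint64
	for size := g; size <= chainHeight; size *= base {
		n += chainHeight / size
	}
	return n
}

// With 10M blocks:
//   topicCount(1_000, 10_000_000)   == 11_111 topics (GRANULARITY = MIN_GRANULARITY)
//   topicCount(100_000, 10_000_000) ==    111 topics (GRANULARITY = 100K)
```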

It's unclear to me how this agreement process could look like. Do you just mean it's a hardcoded constant in the software, or is there some actual agreement protocol involved?

It would be a hardcoded constant in the software, similar to how the network message formats are also hardcoded.

In general, how is it guaranteed that not everyone stores the same 0-10K blocks and nothing else? How is it guaranteed that the network rather uniformly covers the whole chain?

There's no guarantee of that, nor should we attempt to guarantee that formally as part of the specs, as that would be a fool's errand. The best we could do is (a) rely on off-chain analysis, so that if we see that it's difficult to get data for certain ranges, node operators take note of that and spin up partial storage nodes to provide the data (whether incentivized by professional node services or not), and (b) make it so that partial storage nodes, by default, store a random part of the chain, using a random seed generated during initialization.
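A minimal sketch of point (b), seeding the default selection at initialization (the helper and the exact policy are illustrative, not something specified here):

```go
package main

import (
	"fmt"
	"math/rand"
)

const minGranularity = 1000 // MIN_GRANULARITY from the original post

// randomBlockset deterministically derives, from an init-time seed, which
// blockset a fresh partial storage node stores by default.
func randomBlockset(seed int64, chainHeight uint64) (from, to uint64) {
	rng := rand.New(rand.NewSource(seed))
	numSets := int64(chainHeight / minGranularity)
	i := uint64(rng.Int63n(numSets))
	return i * minGranularity, (i + 1) * minGranularity
}

func main() {
	// The seed would be generated randomly once and persisted at node init.
	from, to := randomBlockset(42, 10_000_000)
	fmt.Printf("store blocks %d-%d by default\n", from, to)
}
```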

liamsi commented 3 years ago

Yes. It assumes only one network with one "peer list".

Practical concern: do we think the whole network's peer-list (including the mapping to the "topics") fits into RAM (of light clients)?

and (b) make it so that partial storage nodes by default, store a random part of the chain, using a random seed generated during initialization.

I disagree that this is just a fool's errand, and if we go with this approach it should be part of the spec too, even if it is optional and no guarantees are made. Assumptions like this should be a sentence in the spec, at the very least. The reason is that data availability sampling implies that the whole chain is covered.

It would be a hardcoded constant in the software, similar to how the network message formats are also hardcoded.

OK, that makes sense.

musalbas commented 3 years ago

Practical concern: do we think the whole network's peer-list (including the mapping to the "topics") fits into RAM (of light clients)?

There's no need for any node to have the entire peer list, it just needs to know a few peers for each topic it's interested in.

liamsi commented 3 years ago

Yeah, just trying to understand what the max peer-list size would be and if that could still be gossiped as a whole (e.g. instead of a dht).

liamsi commented 3 years ago

Speaking of fool's errands:

Randomized to spread load better. Skewed load would be limited in case of attack (manipulating sample count), but inconvenient. Sample counts are also only powers of 2, so there is little room for manipulation there.

https://github.com/protolambda/eth2-das/blob/master/spec/das_validator.md#mapping-samples-to-das-subnets

musalbas commented 3 years ago

Speaking of fool's errands:

Randomized to spread load better. Skewed load would be limited in case of attack (manipulating sample count), but inconvenient. Sample counts are also only powers of 2, so there is little room for manipulation there.

https://github.com/protolambda/eth2-das/blob/master/spec/das_validator.md#mapping-samples-to-das-subnets

That's about mapping samples to subnets, not mapping nodes to subnets. This does nothing to address the problem of making sure that each subnet has a uniform number of nodes.

And I'm not saying we shouldn't include some note about this in the specs, but I'm saying that making this a formal guarantee of the protocol should definitely not be in scope, as we would end up building something that looks more like Filecoin.

DAS assumes that at least one copy of the data is available, not that the data is stored uniformly by nodes. Yes, the latter helps with the former, but the latter is a much stronger requirement than is required, and thus is ultimately out of scope of the protocol's formally stated guarantees.

liamsi commented 3 years ago

DAS assumes that at least one copy of the data is available, not that the data is stored uniformly by nodes. Yes, the latter helps with the former, but the latter is a much stronger requirement than is required, and thus is ultimately out of scope of the protocol's formally stated guarantees.

Fair enough. But on the network level we have to think about this. If there was literally only one node that has the data, that would be trivial to DoS/eclipse.

Bidon15 commented 2 years ago

Grooming 01/11/2022:

Reopen discussion Post-mainnet