ethereum / portal-network-specs

Official repository for specifications for the Portal Network

Portal Data Swarms - Concept design for fast access to high-demand data #274

Closed pipermerriam closed 5 months ago

pipermerriam commented 9 months ago

Here I try to outline a new construct for the future of the portal network, a "swarm", aimed at delivering high-speed access to specific subsets of ethereum data that are in high demand.

Each swarm would be given a "topic", which for simplicity we'll represent as a 32-byte hash. Each topic would represent a specific data set that someone might want. Examples of this would be:

- the storage of a specific contract (e.g. the uniswap contracts)
- the most recent 256 headers
- the top few layers of the main account state trie for recent blocks
- specific events emitted by smart contracts
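As a minimal sketch of how such a topic might be derived, assuming topics are simply the hash of a human-readable label (the labels are made up for illustration, and sha256 stands in for what would more naturally be keccak256 in an Ethereum context):

```python
from hashlib import sha256


def swarm_topic(label: bytes) -> bytes:
    """Derive a 32-byte swarm topic from a human-readable label.

    sha256 is a stand-in here; keccak256 would be the more natural
    choice in an Ethereum context.
    """
    return sha256(label).digest()


# Hypothetical topic labels for the example data sets above.
RECENT_HEADERS_TOPIC = swarm_topic(b"portal-swarm/recent-headers")
UNISWAP_STORAGE_TOPIC = swarm_topic(b"portal-swarm/contract-storage/uniswap")
```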

For each topic, a swarm can be thought of in much the same way that we think of the portal network sub-protocols. Each would be an overlay DHT composed only of nodes that have joined the swarm (the same way that a portal node joins a sub-protocol: simply by sending PING/PONG messages). These overlay DHTs would likely have a steady stream of nodes joining and leaving the DHT/swarm. Content would likely be delivered mostly via gossip, though for some swarm topics, such as a specific contract's storage, FINDCONTENT would also be applicable. For other topics, such as listening for specific events emitted by smart contracts, it may make sense to be gossip-only.
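To make the join flow concrete, here is a rough sketch of a node entering a swarm overlay. All names are hypothetical and the real Portal wire messages carry more fields; the point is only that membership is established by PING/PONG scoped to the topic, just as with a sub-protocol:

```python
from dataclasses import dataclass, field


@dataclass
class SwarmOverlay:
    """A per-topic overlay routing table, joined the same way as a
    Portal sub-protocol: by exchanging PING/PONG with known members."""
    topic: bytes                       # 32-byte topic hash
    members: set = field(default_factory=set)

    def join(self, transport, bootnodes):
        # Send a PING scoped to this topic; any node that answers with
        # a PONG is a live swarm member and seeds our routing table.
        for node in bootnodes:
            if transport.ping(node, protocol=self.topic):  # hypothetical API
                self.members.add(node)

    def leave(self):
        # Leaving is implicit: stop answering PINGs for the topic and
        # peers will eventually evict us from their tables.
        self.members.clear()
```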

General Thoughts and Ideas

Nodes that want to serve data.

This seems viable for nodes that want to be the "server" for a specific subset of the state. For example, there may be parties interested in hosting all of the uniswap contract state data. Those nodes would persistently make themselves part of the swarm(s) for the relevant topics, and would primarily be the providers for data requests on those topics.

A more complex version of this might be nodes that are willing to generate access-lists or even witnesses for execution. Those that operate certain protocols might be willing to run stateful nodes that can quickly return an access list for a given set of transaction inputs, allowing lightweight nodes to gather the necessary state concurrently rather than sequentially.
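For example, a stateful swarm node could back such a service with an execution client's `eth_createAccessList` JSON-RPC endpoint, roughly like this (the RPC URL and transaction contents are placeholders):

```python
import json
from urllib.request import Request, urlopen


def create_access_list(rpc_url: str, tx: dict) -> dict:
    """Ask a stateful execution client which accounts and storage slots
    a transaction would touch, so a light node can then fetch that
    state from the relevant swarms concurrently."""
    payload = {
        "jsonrpc": "2.0",
        "id": 1,
        "method": "eth_createAccessList",
        "params": [tx, "latest"],
    }
    req = Request(
        rpc_url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urlopen(req) as resp:
        # Result has the shape {"accessList": [...], "gasUsed": "0x..."}
        return json.load(resp)["result"]
```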

Fast access to popular data sets of manageable size.

Some swarms may be constantly valuable. One example would be the first three (3) layers of the main account state trie for recent blocks. The data set is small (a few MB at most) and would be easy to replicate quickly across all nodes in the swarm. All state lookups end up traversing this section of the trie, so pretty much any node in the swarm could serve a full proof down to the 3rd layer of the trie in a single request.
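A back-of-the-envelope check of that size claim, assuming fully populated branch nodes of roughly 532 bytes each (17 RLP items of up to ~33 bytes; the constants here are my own rough upper bounds):

```python
def top_layer_nodes(layers: int) -> int:
    """Nodes in the top N layers of a hexary trie: 16^0 + ... + 16^(N-1)."""
    return sum(16 ** i for i in range(layers))


BRANCH_NODE_BYTES = 532  # rough upper bound for a full branch node

for layers in (3, 4):
    size = top_layer_nodes(layers) * BRANCH_NODE_BYTES
    print(f"{layers} layers: {top_layer_nodes(layers)} nodes, ~{size / 1024:.0f} KiB")

# 3 layers: 273 nodes, ~142 KiB
# 4 layers: 4369 nodes, ~2270 KiB  -- consistent with "a few MB at most"
```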

Another example might be the most recent 256 headers. It would be relatively trivial to store all of these headers and to support fetching them in bulk.
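A hypothetical message shape for that bulk fetch (these names are illustrative only and not part of the Portal wire protocol):

```python
from dataclasses import dataclass
from typing import List


@dataclass
class HeaderRangeRequest:
    start_number: int     # first block number requested
    count: int            # up to 256 headers


@dataclass
class HeaderRangeResponse:
    headers: List[bytes]  # RLP-encoded headers, oldest first


# At roughly 550 bytes per RLP-encoded header, 256 headers is only
# ~140 KiB, small enough for every swarm member to hold the full window.
```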

Discovery is a problem

One problem that needs to be solved to enable this type of functionality is discovery. In order to join a swarm, a node needs to be able to discover existing members. In the event that there are no other nodes yet, it needs to be able to advertise its interest in a specific swarm topic (so that others can use it as a bootnode to find other members).

The base discovery v5 protocol is intended to provide this functionality (topic advertisement) for us, but as of yet it hasn't been finished, released, or finalized.
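In the meantime, here is a rough sketch of the bootstrap logic a client could layer on top of whatever advertisement mechanism eventually ships (both discovery calls below are hypothetical placeholders, not an existing discv5 API):

```python
def join_or_seed_swarm(disc, overlay, topic: bytes):
    """Find existing members of a swarm, or advertise ourselves as a
    bootnode for it if we appear to be the first."""
    members = disc.topic_query(topic)   # hypothetical: look up topic members
    if members:
        overlay.join(disc.transport, members)
    # Advertise regardless, so later joiners can use us as a bootnode.
    disc.register_topic(topic)          # hypothetical: register our interest
```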

pipermerriam commented 9 months ago

I previously published another idea that overlaps heavily with this one, and that this swarm construct would enable.

One possible swarm might be a set of nodes that, when combined, provide full coverage of the DHT address space. Joining this swarm would likely mean maintaining a session with every other node in the swarm and collaboratively answering each other's requests for data. This approach could deliver much lower latency when retrieving data from the network, since nodes answer each other's requests directly rather than traversing the DHT.
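Concretely, if the swarm's members jointly cover the whole keyspace, routing a request becomes a single local computation rather than a DHT walk. A minimal sketch, assuming each member is responsible for the content ids XOR-closest to its own node id:

```python
from typing import Iterable


def responsible_member(members: Iterable[int], content_id: int) -> int:
    """Pick the swarm member whose node id is XOR-closest to the
    content id; with full keyspace coverage and open sessions to every
    member, the content is then a single round trip away."""
    return min(members, key=lambda node_id: node_id ^ content_id)


# Example: three members covering a toy 8-bit keyspace.
members = [0x10, 0x80, 0xF0]
assert responsible_member(members, 0x84) == 0x80
```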