ethereum / portal-network-specs

Official repository for specifications for the Portal Network

Support for nodes that wish to offer above average storage #283

Open pipermerriam opened 3 months ago

pipermerriam commented 3 months ago

What is the problem

Nodes on the network with storage sizes significantly larger than the average are somewhat problematic.

Our FINDCONTENT mechanism is statistically less likely to query any individual node on the network the further away it is from the content-id. A node that stores an above-average amount of data will effectively be wasting that storage beyond a certain distance, since very few queries will reach it.

How can this be fixed?

Clients should probably approach this problem by operating multiple nodes on the network concurrently. The exact specifics of how this would be implemented are up to individual client teams but here are some highlights.

Linkable Identities

If a client is operating multiple node-ids on the network, they should probably use this HD Identity scheme so that we can take accurate census data on the network. Being able to cryptographically tie these multiple identities together gives us better visibility into the health of the network.

Dispersed Identities

When operating multiple node-ids on the network, a client should probably mine each node-id to avoid excessive overlap between the different node-ids.
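A minimal sketch of what "mining" dispersed node-ids could look like, assuming the goal is one identity per top-bits prefix of the 256-bit keyspace. The helper names (`node_id_from_key`, `mine_dispersed_ids`) are hypothetical, and SHA-256 over random bytes is only a stand-in for however a client actually derives a node-id from its key material.

```python
import hashlib
import random

ID_BITS = 256


def node_id_from_key(key: bytes) -> int:
    # Assumption: stand-in derivation. Real clients derive the node-id
    # from the identity's public key, not from SHA-256 of raw bytes.
    return int.from_bytes(hashlib.sha256(key).digest(), "big")


def mine_dispersed_ids(prefix_bits: int, rng: random.Random) -> dict:
    """Mine one key for every possible top-`prefix_bits` prefix.

    Returns a mapping prefix -> key, so the 2**prefix_bits resulting
    node-ids are spread evenly across the keyspace.
    """
    wanted = 1 << prefix_bits
    found = {}
    while len(found) < wanted:
        key = rng.randbytes(32)  # requires Python 3.9+
        prefix = node_id_from_key(key) >> (ID_BITS - prefix_bits)
        found.setdefault(prefix, key)  # keep the first key found per prefix
    return found


if __name__ == "__main__":
    keys = mine_dispersed_ids(3, random.Random(0))
    for prefix, key in sorted(keys.items()):
        print(f"{prefix:03b} -> {node_id_from_key(key):064x}")
```

Any two ids mined this way differ somewhere in their top `prefix_bits` bits, so their xor distance is at least `2**(256 - prefix_bits)`, which bounds how much their advertised radii can overlap.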

Another approach might be to have a central identity that advertises a large radius and then to mine out additional node-ids that fall within the main zone of interest, each advertising a radius small enough that it doesn't exceed the bounds of its primary zone of interest. This approach would ensure that a node could respond to OFFER and FINDCONTENT messages from any of the identities that it operates, since all of the secondary identities would be effectively covered by its primary identity.
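One way to pick a secondary radius that is guaranteed to stay inside the primary zone is sketched below. It relies on the fact that xor distance obeys the triangle inequality (`a ^ b <= a + b` for non-negative integers), so a secondary radius of `primary_radius - distance(primary, secondary)` can never reach content outside the primary radius. The function name is hypothetical, and the bound is conservative rather than exact.

```python
from typing import Optional


def xor_distance(a: int, b: int) -> int:
    return a ^ b


def max_child_radius(primary_id: int, primary_radius: int,
                     child_id: int) -> Optional[int]:
    """Largest radius (by this conservative bound) the child may advertise.

    For any content x with distance(child, x) <= r:
        distance(primary, x) <= distance(primary, child) + r
    so choosing r = primary_radius - distance(primary, child) keeps all of
    the child's coverage inside the primary's coverage.
    Returns None if the child id itself lies outside the primary radius.
    """
    d = xor_distance(primary_id, child_id)
    if d > primary_radius:
        return None
    return primary_radius - d


if __name__ == "__main__":
    primary, radius = 0b1000_0000, 0b0011_1111
    child = 0b1001_0000
    print(max_child_radius(primary, radius, child))
```

Because xor is not ordinary subtraction, the child's true coverage may stop short of the primary's edge; the bound only guarantees it never exceeds it.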

Intelligent Dispatch

If a node is operating two node-ids A and B and it receives a FINDCONTENT request under A for content that is technically stored under B, we should explore whether the appropriate response would be to redirect the requester to B or to just serve the content from A, even though the content might be outside of A's radius.
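The two options can be sketched as a small dispatch function. This is only an illustration of the decision being discussed, not a proposed wire behaviour; `Identity`, `Policy`, and `dispatch` are hypothetical names, and the returned tuples stand in for whatever response a client would actually build.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional, Tuple


class Policy(Enum):
    SERVE_DIRECTLY = "serve"   # answer from the identity that was asked
    REDIRECT = "redirect"      # point the requester at the sibling identity


@dataclass(frozen=True)
class Identity:
    node_id: int
    radius: int

    def covers(self, content_id: int) -> bool:
        return (self.node_id ^ content_id) <= self.radius


def dispatch(receiver: Identity, siblings: List[Identity], content_id: int,
             policy: Policy) -> Tuple[str, Optional[Identity]]:
    """Decide how to answer a FINDCONTENT that arrived at `receiver`."""
    if receiver.covers(content_id):
        return ("serve", receiver)
    # Content is outside the receiver's radius; check sibling identities.
    owner = next((s for s in siblings if s.covers(content_id)), None)
    if owner is None:
        return ("not_stored", None)
    if policy is Policy.SERVE_DIRECTLY:
        return ("serve", receiver)
    return ("redirect", owner)


if __name__ == "__main__":
    a = Identity(node_id=0x10, radius=0x0F)
    b = Identity(node_id=0x80, radius=0x0F)
    print(dispatch(a, [b], 0x85, Policy.SERVE_DIRECTLY))
    print(dispatch(a, [b], 0x85, Policy.REDIRECT))
```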

This same concept applies to things like gossip as well.

Prototyping

This is likely to pose some non-trivial architecture changes in client design and will need to be prototyped by a client team to figure out some of the nuance.

pipermerriam commented 2 months ago

@ScottyPoi and I did some whiteboarding on this yesterday. Here are some notes.

Maximum Effective Radius

This probably deserves its own write-up.

There is a measurable mean/median radius size on the network. As a node's radius grows above this value, the likelihood that it will be targeted by a FINDCONTENT or OFFER for content near the edge of its radius decreases. I don't know the specific curve that this probability function takes, but I believe it is something like an exponential drop-off. Thus, there is some distance that I'll define as the "maximum effective radius" where the increased radius is no longer really providing the network with any additional benefit because the node isn't likely to actually serve any of the content.

I believe that this should be a CLI configuration option --maximum-effective-radius in clients and it would be a value akin to the gas limit where the clients themselves ship with a default value.
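A rough sketch of how such an option might be wired up, assuming the simplest interpretation: the radius an identity advertises is capped at the configured maximum effective radius. The flag spelling, the default value, and the `advertised_radius` helper are all placeholders for illustration; no client is known to ship this.

```python
import argparse

MAX_RADIUS = 2**256 - 1
# Hypothetical placeholder default; a real default would come from
# measurements of the network's mean/median radius.
DEFAULT_MAX_EFFECTIVE_RADIUS = MAX_RADIUS >> 4


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="example client options")
    parser.add_argument(
        "--maximum-effective-radius",
        type=lambda s: int(s, 0),  # accepts decimal or 0x-prefixed hex
        default=DEFAULT_MAX_EFFECTIVE_RADIUS,
        help="largest radius a single node-id will advertise",
    )
    return parser


def advertised_radius(storage_radius: int, max_effective_radius: int) -> int:
    """Cap the radius implied by local storage at the configured maximum."""
    return min(storage_radius, max_effective_radius)


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(hex(advertised_radius(MAX_RADIUS, args.maximum_effective_radius)))
```

Under a multi-node-id design, storage beyond the cap would presumably be exposed through additional node-ids rather than a larger radius on one identity.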

Healthy Routing Tables

In an environment where some nodes operate multiple node-ids on the network, we want to avoid the situation where a node gets represented multiple times in another node's routing table, albeit under different node ids.

If we proceed with a multi-node-id architecture, client routing table implementations will need to be aware of this and implement logic to avoid this. A routing table should probably only ever contain a single node-id from the set of nodes operated by a single client.
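A minimal sketch of that routing table rule, assuming each peer can present some verifiable "operator" identifier that links its node-ids together (for example, something derived from the linkable identity scheme mentioned earlier). The `operator_id` parameter and the `RoutingTable` class are hypothetical; bucket structure and eviction are left out.

```python
from typing import Dict, Optional


class RoutingTable:
    """Toy table that holds at most one node-id per known operator."""

    def __init__(self) -> None:
        self._operator_of: Dict[int, Optional[bytes]] = {}  # node_id -> operator
        self._node_of: Dict[bytes, int] = {}                # operator -> node_id

    def insert(self, node_id: int, operator_id: Optional[bytes] = None) -> bool:
        """Return True if the node was added, False if it was rejected."""
        if node_id in self._operator_of:
            return False
        if operator_id is not None:
            if operator_id in self._node_of:
                # Another node-id from the same operator is already present.
                return False
            self._node_of[operator_id] = node_id
        self._operator_of[node_id] = operator_id
        return True

    def remove(self, node_id: int) -> None:
        operator_id = self._operator_of.pop(node_id, None)
        if operator_id is not None:
            self._node_of.pop(operator_id, None)

    def __contains__(self, node_id: int) -> bool:
        return node_id in self._operator_of
```

Peers that present no operator link are treated as independent here; whether that is acceptable is one of the things a prototype would need to settle.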

Laying out multiple node ids

These models will all ignore the subtle nuances of how the xor distance metric doesn't allow for exact overlapping or adjacent radius coverage of the keyspace.

My initial thoughts on how a client might distribute its identities were simplistic.

[Image: Screenshot from 2024-04-04 11-39-24]

However, I think there is potentially a superior model.

[Image: Screenshot from 2024-04-04 11-57-15]

In this model we have a master node-id, which is the middle and largest circle. All of the other smaller circles are child node-ids and they are effectively ephemeral. This master node-id publishes a radius which is true to the effective overall storage that it offers the network. All of the other child node-ids that the client uses are simply there to provide coverage of the keyspace, with each node-id publishing a radius that is derived from the master radius such that its outermost edge aligns with the edge of the master node-id's keyspace coverage.

Under this model, the master node-id can be used in place of any of the child node-ids since its radius should always fully overlap any child node's radius. The client should also have an easier time managing its storage under this model since it only has to track a single radius value from its master identity, and all child identity radius values are just a function of this master value.

We probably need this for state network.

For us to launch state network we need to provide a large amount of base storage for the network. Something in the 100 TB range depending on how much redundancy we want. In order to do this we can either:

Disk space is cheaper than CPU time. The pragmatic choice for us in this situation is going to be fewer nodes with larger storage. That probably means nodes that offer 10/50/100GB of storage to the network. This is fine while we are the ones operating these nodes. The problem arises as soon as we start onboarding users who want the promise of small, easy-to-run nodes. At that point, if we've naively taken the route of deploying a network full of huge nodes, I think we get a weird network distribution between the mega nodes we operate and the tiny nodes that our users are operating. Specifically, there will be huge disparities in the radius values on our network, which isn't an optimal network topology.
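For a sense of scale, the node counts implied by those per-node sizes can be worked out directly. This is back-of-the-envelope arithmetic only, assuming the roughly 100 TB figure is the total to be provided and using decimal units (1 TB = 1000 GB).

```python
TOTAL_GB = 100 * 1000  # roughly 100 TB of base storage, in GB


def nodes_needed(total_gb: int, per_node_gb: int) -> int:
    """Number of nodes required, rounding up."""
    return -(-total_gb // per_node_gb)  # ceiling division


if __name__ == "__main__":
    for per_node_gb in (10, 50, 100):
        print(f"{per_node_gb} GB per node: {nodes_needed(TOTAL_GB, per_node_gb)} nodes")
```

That is 10,000 nodes at 10 GB each, 2,000 at 50 GB, and 1,000 at 100 GB.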

So for us to get state truly launched and onboard users, I think that we need this ability to keep our network topology happy.