BrendanBenshoof / ideas

Place to log ideas for potential development.

Co-morbidity based redundant p2p-clustering #2

Open BrendanBenshoof opened 8 years ago

BrendanBenshoof commented 8 years ago

In the design of a p2p system (specifically a DHT) there is a core tradeoff that is not readily apparent:

Efficiency of routing vs Robustness/Reliability

Traditionally, DHTs focus on the reliability end of this tradeoff. But as usage of such networks increases, we may want to trade some reliability for higher performance.

The simplest way to decrease latency and increase performance is to ensure that each node's peers have low latency to it (there are a lot of ways to do this; look at CDNs). The problem this induces is that the robustness of a DHT hinges on adjacent nodes failing independently: one node failing does not increase the odds of the next node failing (in conditional-probability terms, P(B fails | A fails) = P(B fails); see Bayes' rule). Once we are intentionally peering with nearer nodes, this independence disappears, and local disasters (weather, earthquakes, meteor or nuclear strikes) could cause the simultaneous failure of many contiguous nodes in the DHT.
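To put rough numbers on why losing independence matters, here is a minimal sketch. It assumes a key is replicated on `k` adjacent nodes, each failing on its own with probability `p`, and it models a local disaster as a single event with probability `d` that takes out all `k` replicas at once. The function names and the failure model are illustrative assumptions, not taken from any existing DHT.

```python
def loss_probability_independent(p: float, k: int) -> float:
    """Probability that all k replicas are lost when nodes fail independently."""
    return p ** k


def loss_probability_with_shared_disaster(p: float, k: int, d: float) -> float:
    """Probability that all k replicas are lost when a shared disaster
    (probability d) kills every replica at once; otherwise nodes still
    fail independently with probability p each."""
    return d + (1 - d) * p ** k


if __name__ == "__main__":
    p, k, d = 0.1, 3, 0.01  # hypothetical numbers, chosen only for illustration
    print("independent:", loss_probability_independent(p, k))
    print("co-located :", loss_probability_with_shared_disaster(p, k, d))
```

With these made-up numbers the independent case loses data with probability of roughly 0.001, while the co-located case is dominated by the disaster term and lands at roughly 0.011, about ten times worse, even though the per-node failure rate is unchanged.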

This idea is a proposed solution/mitigation. We should make locally redundant clusters that each store a meaningful subset/majority of the data stored on the global network, such that latency/performance is further improved and robustness is ALSO improved. As this is a trivial solution, the question of interest is: "how should we position these clusters?"

My proposed solution is that these clusters should maximize internal co-morbidity and minimize external co-morbidity (as with all multi-objective optimization, we would need a trade-off between these two objectives). This means that within a cluster, if there is a disaster-level failure, most if not all nodes are likely to fail together. Between clusters, however, a disaster-level failure in one cluster should make failure in any other cluster as unlikely as possible.
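One way to make this objective concrete is sketched below. It assumes we somehow have a symmetric matrix `comorbidity[i][j]` estimating how likely nodes `i` and `j` are to fail together in a disaster (how to estimate that is an open question and not addressed here). The function name `comorbidity_score`, the use of a simple mean, and the `weight` trade-off parameter are all hypothetical choices for illustration, not a settled design.

```python
from itertools import combinations


def comorbidity_score(comorbidity, labels, weight=1.0):
    """Score a clustering: mean co-morbidity of same-cluster pairs minus
    `weight` times the mean co-morbidity of cross-cluster pairs.
    Higher is better under the objective described above."""
    internal, external = [], []
    for i, j in combinations(range(len(labels)), 2):
        bucket = internal if labels[i] == labels[j] else external
        bucket.append(comorbidity[i][j])
    mean_internal = sum(internal) / len(internal) if internal else 0.0
    mean_external = sum(external) / len(external) if external else 0.0
    return mean_internal - weight * mean_external


if __name__ == "__main__":
    # Hypothetical 4-node example: nodes 0,1 share a region, nodes 2,3 share another.
    c = [
        [1.00, 0.90, 0.10, 0.05],
        [0.90, 1.00, 0.10, 0.05],
        [0.10, 0.10, 1.00, 0.80],
        [0.05, 0.05, 0.80, 1.00],
    ]
    by_region = comorbidity_score(c, [0, 0, 1, 1])
    mixed = comorbidity_score(c, [0, 1, 0, 1])
    print(by_region > mixed)
```

Grouping by region scores higher than the mixed assignment here, which is the behaviour we want. A single weighted sum is only one way to handle the trade-off between the two objectives; a Pareto-style comparison would be another.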

This is based on the painfully utilitarian premise that, if there is a disaster-level failure, the entire affected cluster should be assumed to have failed. Providing infrastructure in such disaster areas, while important, is not a focus of this solution.

Big problems: