flux-framework / flux-core

core services for the Flux resource management framework
GNU Lesser General Public License v3.0

[long-term] Investigate alternate distributed storage/routing protocols for KVS/collectives #370

Closed — trws closed this issue 2 years ago

trws commented 9 years ago

I've run into a few places where this protocol has been used to great effect lately, and the main uses seem to be to reach consensus on uniqueness of appends or IDs. This may be a good way to handle distributing the KVS root if and when it comes to that, and I've actually run across a C implementation here that might be usable for the purpose.

trws commented 8 years ago

In relation to the recent reliability discussion, I was reminded of this option by a conversation with @lipari in Austin a little while ago, and decided to look back at some of the DHT protocols I studied as an undergrad. It occurs to me that rather than building something on Raft, there is a pretty significant body of work on building much of what we now have in our broker routing protocol and KVS on top of the Pastry protocol (or Tapestry/Chimera, or Chord, or what have you, but Pastry and Chimera have gotten a lot of follow-up research over the years). They're a bit long in the tooth, but they are provably correct, handle addition and removal of clients without issue, and there is even some non-trivial research into treating the key-space as a hash-based file system and a scalable multicast distribution network. When we look at doing something about distributing rank 0, something like this, where libraries, algorithms, and implementations are readily available, could seriously cut down on the cognitive and development load. In fact, Chimera is a GPLv2 library in C; the only downside is that it does not use ZMQ sockets, so it would need a refit, or a way to access just the routing portion of the API without using the messaging itself.

Actually, even just using one of these as the routing strategy, rather than a manually wired tree, might be a neat place to start, leaving the rank-0 concept in place but replacing it with a node-id. Routing from any arbitrary 128-bit node-id to any other takes log_{2^b}(N) hops with high probability, where N is the number of nodes in the set and b is the digit size in bits, and the concept of "upstream" (toward a given node-id) would still apply without requiring any global tree structure to be maintained.
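To make the routing idea concrete, here is a toy sketch of Pastry-style prefix routing in Python (a simulation, not flux-core or Chimera code): IDs are 128-bit keys read as base-2^b digits, each node keeps one known peer per (prefix length, next digit) slot, and each forwarding step matches at least one more digit of the destination, which is where the log_{2^b}(N) bound comes from. The `make_id` hashing and global-membership table construction are simulation conveniences, not part of the real protocols.

```python
import hashlib

B = 4                    # bits per digit; routing base is 2**B = 16
DIGITS = 128 // B        # 128-bit IDs -> 32 base-16 digits

def make_id(seed):
    # Simulation convenience: hash a broker name/rank into a 128-bit key.
    return hashlib.md5(str(seed).encode()).hexdigest()  # 32 hex digits

def shared_prefix(a, b):
    """Number of leading base-16 digits two IDs have in common."""
    n = 0
    while n < DIGITS and a[n] == b[n]:
        n += 1
    return n

def build_tables(ids):
    """Pastry-style routing tables, built here with global knowledge:
    slot (l, d) holds one peer sharing exactly l leading digits with
    the owner and having digit d next."""
    tables = {}
    for me in ids:
        table = {}
        for other in ids:
            if other != me:
                l = shared_prefix(me, other)
                table.setdefault((l, other[l]), other)
        tables[me] = table
    return tables

def route(src, dst, tables):
    """Greedy prefix routing: each hop lands on a node matching at
    least one more digit of dst, so hop count is bounded by DIGITS
    and is typically ~log_16(N) for N random IDs."""
    cur, hops = src, 0
    while cur != dst:
        l = shared_prefix(cur, dst)
        # dst itself is always a candidate for this slot, so with
        # full-membership tables the lookup cannot fail.
        cur = tables[cur][(l, dst[l])]
        hops += 1
    return hops
```

With a few hundred random IDs, routing between any pair terminates in a handful of hops; a real DHT builds and repairs these tables incrementally as nodes join and leave, rather than from global membership.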

trws commented 8 years ago

@grondo's question about the number of active connections made me wonder a bit. The default for the algorithms I was looking at is a digit size of b = 4 bits (base 2^4 = 16), so routing is log_16(N): an average of 4 connections for 10,000 peers, or 5 for one million. If we were to run 10 brokers on every node of Sequoia and Vulcan combined, and put them all in the same overlay network, it would average just a bit over 6 connections per broker, and six hops to find a given target node ID or data element from any node (the average routing distance to a random key is identical from all nodes).
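As a quick sanity check on those figures (a sketch, not anything in flux-core), treating them as the ceil(log_{2^b}(N)) routing distance with b = 4; the 1,228,800 figure is an assumption standing in for "10 brokers per node on Sequoia and Vulcan combined", since the exact node counts are not stated in the thread:

```python
import math

def expected_hops(n_nodes, b=4):
    """Expected Pastry-style routing distance: ceil(log_{2^b}(N))."""
    return math.ceil(math.log(n_nodes, 2 ** b))

print(expected_hops(10_000))     # -> 4, the 10,000-peer figure
print(expected_hops(1_000_000))  # -> 5, the one-million-peer figure
# Assumed broker count for the combined-machine example:
print(expected_hops(1_228_800))  # -> 6
```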

dongahn commented 8 years ago

FYI, this is the SC13 paper that explored distributed KVS for resource managers: 2013_SC13-KVS.pdf

garlick commented 2 years ago

Closing old issue.