flux-framework / flux-core

core services for the Flux resource management framework
GNU Lesser General Public License v3.0

[idea] use binomial graph network instead of TBON for overlay #953

Closed: garlick closed this issue 3 years ago

garlick commented 7 years ago

One issue with the TBON is that nodes have non-uniform message routing responsibilities and no redundant links, so failure of any non-leaf node requires a self-healing action, and failures towards the root of the tree are very disruptive.

@dongahn long ago suggested we investigate binomial graph networks (BGN) as an overlay topology and more recently @morrone and @trws have suggested peer-to-peer key value stores like chord as a model for (or perhaps even a direct way to provide) resilient message routing.

This issue is a placeholder for a design activity starting with consideration of the BGN option.

Some interesting papers circa 2007 on BGNs in HPC:

The BGN topology has some nice scalability properties (low diameter, uniform node degree) and can continue functioning (avoiding bisection) with nodes down, as long as no single node loses all of its peers. In addition, each node can be thought of as the root of a binomial tree, so in theory some of the logarithmic scaling properties of our reductions and hierarchical caching could be preserved, with some redesign.
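For concreteness, here is a minimal sketch of the neighbor rule as I read it from the 2007 papers (each rank links to the ranks a power-of-two offset away in both directions, modulo n). Illustrative only, not flux-core code:

```c
/* Sketch of the BGN neighbor rule as described in the 2007 papers:
 * rank i links to (i +/- 2^k) mod n for each power of two < n.
 * Illustrative only, not flux-core code.
 */
#include <stdio.h>

/* Fill 'peers' with the BGN neighbors of 'rank'; return the count. */
static int bgn_peers (int rank, int n, int *peers, int max)
{
    int count = 0;
    for (int hop = 1; hop < n; hop *= 2) {
        int fwd = (rank + hop) % n;
        int bwd = ((rank - hop) % n + n) % n;  /* wrap negatives */
        if (count < max)
            peers[count++] = fwd;
        if (bwd != fwd && count < max)
            peers[count++] = bwd;
    }
    return count;
}

int main (void)
{
    int peers[64];
    int count = bgn_peers (0, 16, peers, 64);

    printf ("rank 0 of 16 has %d peers:", count);
    for (int i = 0; i < count; i++)
        printf (" %d", peers[i]);
    printf ("\n");    /* expect: 1 15 2 14 4 12 8 */
    return 0;
}
```

Note the degree grows only logarithmically with n, and the peer set of each rank is exactly the link structure of a binomial tree rooted at that rank.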

garlick commented 7 years ago

Some random ideas on how the BGN could be integrated into the existing design:

A stepwise way to prototype this could be to run both the TBON and the BGN initially, then gradually move functionality over to the BGN (rank-addressed routing, event distribution, then the harder stuff).

garlick commented 7 years ago

Some correspondence with the BGN paper authors, with some more references:

Hello Dr. Angskun,

I am working on a project to create a new resource manager called Flux at Lawrence Livermore National Laboratory and recently came across your 2007 papers on binomial graph networks. Our resource manager essentially has a runtime like an MPI implementation, and the BGN topology seems like the ideal way to introduce fault tolerance to our design.

I notice that OpenMPI seems to have adopted the BGN for ORTE. Do you know of other examples of BGN "in the wild", and also what if any issues were encountered with BGN with OpenMPI?

Also, in the intro to one of your papers I think you implied that DHT overlay networks have some undesirable characteristics for HPC. We have also been looking to the DHT world for solutions and I wondered if you might have some specific observations on their applicability in say MPI runtimes.

I just found your dissertation online so maybe I can answer some of these questions myself. Anyway, thank you in advance and sorry to trouble you out of nowhere with questions about work from 10 years ago!

Regards,

Jim Garlick, LLNL

Received a redirect to George Bosilca from Thara Angskun, then from George:

After the 2007 paper we have continued to use the BMG topology in many of our other projects, because as you noticed it offers both speed and reliability.

To complement your statement about ORTE, the BMG is the core topology under the ULFM source code, and it is actually used as the underlying topology for the software used in these papers [1], [2], and [3].

More recently, the BMG is also used as the basic overlay network for a branch of PMIx (branch that supports process fault and restart).

Thanks, George.

[1] Bouteiller, A., G. Bosilca, and J. Dongarra, "Plan B: Interruption of Ongoing MPI Operations to Support Failure Recovery", 22nd European MPI Users' Group Meeting, Bordeaux, France, ACM, September 2015.

[2] Herault, T., A. Bouteiller, G. Bosilca, M. Gamell, K. Teranishi, M. Parashar, and J. Dongarra, "Practical Scalable Consensus for Pseudo-Synchronous Distributed Systems: Formal Proof", Innovative Computing Laboratory Technical Report, no. ICL-UT-15-01, April 2015.

[3] Bosilca, G., A. Bouteiller, A. Guermouche, T. Herault, Y. Robert, P. Sens, and J. Dongarra, "Failure Detection and Propagation in HPC Systems", Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC'16), Salt Lake City, Utah, IEEE Press, pp. 27:1-27:11, November 2016.

morrone commented 7 years ago

new zmq socket for each peer, dealer-router message flow as now. Rank-addressed requests select a peer according to the BGN routing algorithm. Responses retrace the route by unwinding the address stack as now. (One dealer connected to many routers wouldn't work here, since the sender couldn't select which peer receives the message.)

Perhaps the reply could simply be sent through the normal network routing mechanism (eliminating the strict source-routed approach). That way the reply can also take advantage of the fault tolerance capabilities of the network without additional special-casing of the problem. In other words, a source-routed reply that hits a failure point needs to abort from the source-routing anyway for the message to reach its destination.

garlick commented 7 years ago

Great point.

I'd suggest simply not pushing intermediate hops onto the route stack when routing requests across the overlay. Then, when a response reaches a broker and pops off a route frame for a broker that is not a direct peer, routing tables are consulted to compute the next hop, which might differ from the one used to get there.

This is useful because the sending id is not part of the base message format, and in any case the sending id may not be directly reachable from the local broker (it might go through a connector module, an ssh proxy, or even come from another instance).

garlick commented 7 years ago

My response was a bit vague above. In our meeting @morrone asked for clarification and I did a poor job of responding. He also suggested using router-router sockets instead of dealer-router in the BGN, and that makes sense to me after thinking about it a bit. Here's a more detailed explanation of how the BGN could work using router-router, without changing the message format for requests and responses. (events are another topic - though there are published algorithms for reliable multi/broadcast over BGN so I don't expect it to be difficult to sort out)

Refresher from zmq guide:

When receiving messages a ZMQ_ROUTER socket shall prepend a message part containing the identity of the originating peer to the message before passing it to the application. Messages received are fair-queued from among all connected peers. When sending messages a ZMQ_ROUTER socket shall remove the first part of the message and use it to determine the identity of the peer the message shall be routed to.

I asserted above that the BGN "routing engine" would leave the route stack unmodified as messages are routed through the BGN. It could be message type agnostic, except that the destination rank comes from the message body in requests and the top of the route stack in responses. Preserving the route stack preserves the ability to route "off broker" at either end of the RPC.

Using only router sockets, BGN peers can be represented as one zmq socket with multiple endpoints (peers). The routing engine at each broker uses the final destination rank to select the next hop in the BGN. That peer's identity is pushed onto the front of the message prior to sending. The sending router socket internally pops it off and directs the message to the peer. The receiving router socket internally pushes the sending id, which can be popped off and discarded by the receiving peer (or used for liveness tracking or whatever). The routing engine again selects the next hop, and this repeats until the destination is reached. At the destination, the BGN would have contributed no routes to the route stack. Further routing may add some at that point.
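A minimal sketch of that forwarding step, assuming libzmq, a hypothetical bgn_next_hop() routing function (greedy selection of the peer closest to the destination), and identities that are simply the string form of the peer rank. Not flux-core code, just an illustration of the flow:

```c
/* Sketch of the router-router forwarding step (not flux-core code).
 * Assumes: peer identities are the string form of the peer's rank;
 * bgn_next_hop() is a hypothetical routing function that greedily
 * picks the connected peer closest to 'dst'.
 */
#include <zmq.h>
#include <stdio.h>
#include <string.h>

extern int bgn_next_hop (int self, int dst);    /* hypothetical */

/* Forward a message body toward rank 'dst'.  The route stack inside
 * the body is left untouched; only a zeromq identity frame naming
 * the next hop is prepended, which the sending ROUTER socket
 * consumes internally.  The receiving ROUTER socket prepends the
 * sender's identity for the peer to pop off and discard.
 */
static int forward (void *router, int self, int dst, zmq_msg_t *body)
{
    char id[16];
    int hop = bgn_next_hop (self, dst);

    snprintf (id, sizeof (id), "%d", hop);
    if (zmq_send (router, id, strlen (id), ZMQ_SNDMORE) < 0)
        return -1;
    return zmq_msg_send (body, router, 0) < 0 ? -1 : 0;
}
```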

It's a good idea to use router to router here as it will probably reduce the code needed to implement the BGN compared to one dealer socket per peer.

Hopefully the above makes sense?

(One thing that might be unclear: the sending rank of a request would be at the top of the route stack as it passes through the BGN, so when the response comes back to the BGN, that rank remains at the top and is used to select the next hop.)

garlick commented 7 years ago

One more reference:

Improving Scalability and Usability of Parallel Runtime Environments for High Availability and High Performance Systems, dissertation, Thara Angskun, The University of Tennessee, Knoxville, December 2007

garlick commented 7 years ago

I missed a huge benefit of router-router - this socket option:

ZMQ_ROUTER_MANDATORY: accept only routable messages on ROUTER sockets. Sets the ROUTER socket behavior when an unroutable message is encountered. A value of 0 is the default and discards the message silently when it cannot be routed or the peer's SNDHWM is reached. A value of 1 returns an EHOSTUNREACH error code if the message cannot be routed, or an EAGAIN error code if the SNDHWM is reached and ZMQ_DONTWAIT was used. Without ZMQ_DONTWAIT it will block until the SNDTIMEO is reached or a spot in the send queue opens up.

With this option set, we can get an immediate error on a socket if a message is rejected due to a disconnected peer, and can take appropriate action to reroute the message (without fear of duplication) and update the rest of the BGN about the down link. Dealer sockets normally block if there is no peer.
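A sketch of how that error path might look against the libzmq API (illustrative only, not flux-core code; the reroute policy is left to the caller):

```c
/* Sketch: with ZMQ_ROUTER_MANDATORY set, sending to an unknown or
 * disconnected peer fails with EHOSTUNREACH instead of silently
 * dropping the message, so the caller can select another BGN peer.
 * With ZMQ_DONTWAIT, a full send queue surfaces as EAGAIN.
 */
#include <zmq.h>
#include <errno.h>
#include <string.h>

static void enable_mandatory (void *router)
{
    int on = 1;
    zmq_setsockopt (router, ZMQ_ROUTER_MANDATORY, &on, sizeof (on));
}

/* Return 0 on success, 1 if the caller should try another peer,
 * or -1 on any other error.
 */
static int try_send (void *router, const char *peer_id, zmq_msg_t *body)
{
    if (zmq_send (router, peer_id, strlen (peer_id),
                  ZMQ_SNDMORE | ZMQ_DONTWAIT) < 0) {
        if (errno == EHOSTUNREACH || errno == EAGAIN)
            return 1;    /* peer down or its queue full: reroute */
        return -1;
    }
    return zmq_msg_send (body, router, ZMQ_DONTWAIT) < 0 ? -1 : 0;
}
```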

Further, it sounds like we could set fixed queue sizes (as suggested by #252) and try other peers if the queue for a particular peer is full (alternatively, credit based flow control might work here too).
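For example (a fragment, assuming the same router socket as in the sketch above), a fixed queue size is just the send high-water mark, and a full queue then surfaces as the EAGAIN case already handled there:

```c
/* Fixed per-peer queue depth via the send high-water mark; the 1000
 * here is an arbitrary value for the sketch.
 */
int hwm = 1000;
zmq_setsockopt (router, ZMQ_SNDHWM, &hwm, sizeof (hwm));
```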

garlick commented 7 years ago

Conversation with Ralph Castain:

Is the binomial graph network in ORTE used much, and does it deliver good scalability and fault tolerance as advertised? Has something else worked out better? I read the 2007 papers from the guys at UTK and got pretty excited about moving Flux over to that overlay topology from the TBON, and wanted to find some places where it is actually used in HPC, which led to OpenMPI, where it's a little hard (for me) to determine which modules are used all the time and which are not.

We don’t use the binomial routed component any more, and haven’t for years. It doesn’t scale adequately for large clusters as the fanout is too limited, so you get lots and lots of hops. Instead, we default to using either the radix component (with a default fanout of 64) or the Debruijn graph component (I think the depth defaults to 8). Both scale far better than the binomial one.

Debruijn has the advantage of already being fault tolerant since it is a mesh. The radix component isn’t fault tolerant in itself, though it is relatively easy to render it so. What I have done in the past for other projects is simply have the component failover to the next process up the tree - i.e., to the parent of the proc that failed. I added a little logic to randomize a tad, thus moving to the level of the parent but picking a member of that level at random to avoid creating bottlenecks. Then, once the parent returned, the network just “healed” itself by returning to the default arrangement.

Anyway, that worked very well for Cisco and Intel - I can provide some architecture info if you want to pursue that route. Personally, I found the doubled-over binomial design from UTK overly complex - just healing the routes worked fine.

garlick commented 7 years ago

More conversation with Ralph

That's really great information - if you have any non-NDA docs on either Debruijn or the self-healing radix to share, it would be most appreciated, as we need to decide in the next few weeks on the best architecture to tackle first to meet a resiliency milestone.

I don’t have a doc, but the self-healing radix is pretty simple. At startup, you do a collective operation to ensure that every process has contact info for everyone else, or at least has it for a few levels above themselves. Then, when you see a socket to your parent fail, you connect to your grandparent - since the radix tree is computable, every process knows its place in the tree and the ID of the procs above it, so this transition is easy to complete. For optimization, you can randomly select the grandparent from that level of the tree, thus ensuring that not every child of the failed parent attaches to the same grandparent.
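To illustrate the scheme Ralph describes (illustrative only, not ORTE code), here is a sketch of the failover computation for a k-ary radix tree where rank r's parent is (r - 1) / k, so every process can compute the tree from its rank alone:

```c
/* Sketch of self-healing radix failover (not ORTE code): on parent
 * failure, pick a random process at the parent's level (a child of
 * the grandparent) so orphans don't all pile onto one node.
 */
#include <stdlib.h>

static int radix_parent (int rank, int k)
{
    return rank == 0 ? -1 : (rank - 1) / k;
}

static int radix_failover (int rank, int k, int size)
{
    int parent = radix_parent (rank, k);
    int gp = radix_parent (parent, k);
    if (gp < 0)
        return -1;                    /* failed parent was the root */
    int first = gp * k + 1;           /* grandparent's first child */
    int last = first + k - 1;
    if (last > size - 1)
        last = size - 1;
    if (last - first < 1)
        return gp;                    /* no siblings: use grandparent */
    int pick;
    do
        pick = first + rand () % (last - first + 1);
    while (pick == parent);           /* exclude the failed parent */
    return pick;
}
```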

Note that this is a “local hop” resilience strategy. In other words, we didn’t “ack” receipt from the eventual destination all the way back to the originator. The approach instead is based on the idea that the message will eventually get to its destination if we simply ensure that each hop succeeds. Each intermediate daemon is responsible for ensuring that happens, but it leaves the door open to the case where a daemon fails after it receives a message, but before it is able to successfully complete its relay.

There are several options for further hardening against that problem. You could add an “ack” from the destination back to the originator that flows backwards across the route. You then wind up setting timeouts and windows (so you don’t ack every message) - all of which adds complexity, but still leaves vulnerabilities (e.g., the ack is just another message and therefore vulnerable to loss - you can go into an infinite loop if you start ack’ing the acks!).

Another option (the one I employed) was to ack at the grandparent level - i.e., my grandparent sends me an ack, thereby indicating that my parent successfully transferred the message. This gives two-point failure protection, which is generally considered adequate. Obviously, you can go as deep down this path as you like.
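A rough sketch of the bookkeeping that grandparent-ack scheme implies (names here are purely illustrative, not ORTE code): each relayer holds an entry per forwarded message until the ack from two hops downstream arrives, and a timed-out entry triggers a resend via a failover route.

```c
/* Sketch of grandparent-level ack bookkeeping (illustrative names).
 * A timed-out pending entry means the parent probably died
 * mid-relay, so the message should be resent on another route.
 */
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

struct pending {
    uint64_t seq;            /* id of the forwarded message         */
    int next_hop;            /* the parent we handed it to          */
    time_t sent_at;          /* compared against a retransmit timer */
    struct pending *next;
};

/* Remove the entry matching an ack received from the grandparent. */
static void ack_received (struct pending **list, uint64_t seq)
{
    for (struct pending **pp = list; *pp != NULL; pp = &(*pp)->next) {
        if ((*pp)->seq == seq) {
            struct pending *done = *pp;
            *pp = done->next;
            free (done);
            return;
        }
    }
}
```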

Debruijn was done by Nathan Hjelm at LANL - his email is hjelmn@lanl.gov. Basically, it works the same as the self-healing radix except that it uses a mesh - so if a failure occurs, there is always an alternative path that gets to the eventual destination. You still have to detect the failure and resend the message, so I’m not convinced it is any more resilient - but it is a different topology. Benefit of the radix tree is that you can align it across the network to minimize hops, while Debruijn is a purely mathematical construct, and so you will bounce across the switches.

What sort of faults is ORTE designed to tolerate? E.g. single node failure with limits or...? Do you panic if the overlay bisects?

ORTE can handle two-node failures, but the default is not to do so - since we generally are just supporting MPI, we default to simple termination if a failure occurs. UTK has stated an intention to commit their code as well, but it hasn’t happened yet - they told me last week that their ECP project requires them to do so by the end of this year. My failure recovery code has mostly been removed (kept on a branch here) for now, but can be readily returned if/when the community decides they want recovery again. Intel also implemented end-to-end QoS support, and now keeps that in a separate branch we can share if you like.

What's the testing story for fault handling in ORTE or elsewhere? This is going to be a challenge for us.

We had a very nice internal fault generator in our sensor framework - it generates random errors, and you can ask for single or multiple faults (basically, it either just kills itself or orders others up/down the tree to die as well). It would then leave the daemon dead for a random period of time, and then restart it so we could watch the tree recover. This has been removed from the OMPI master since they just want it to die, but I have it on a branch and can share it if you like.

garlick commented 3 years ago

Closing old (but interesting!) issue.

garlick commented 1 year ago

Support for a binomial tree topology in the overlay network was added in #4730 but this issue was apparently never linked.

morrone commented 1 year ago

Does #4730 implement BGN (with the resiliency implications), or just a single binomial tree?

garlick commented 1 year ago

Single tree.

morrone commented 1 year ago

OK, you might consider leaving this open for the rest of BGN, which is primarily about the self-healing properties that it offers. But my knowledge is out of date, so maybe you have alternate methods now for handling those resiliency issues.

garlick commented 1 year ago

It was closed in 2021 essentially as a WONTFIX and I don't think we have a reason to reopen it right now, but it's not lost (as suggested by this activity). Thanks for getting that comment in about BGN resiliency. Mine referencing the binomial PR is misleading without that clarification.