codes-org / codes

The Co-Design of Exascale Storage Architectures (CODES) simulation framework builds upon the ROSS parallel discrete event simulation engine to provide high-performance simulation utilities and models for building scalable distributed systems simulations

Link Failure and Routing Features #205

Closed nmcglo closed 1 year ago

nmcglo commented 4 years ago

This merge request adds link failure support to the Dragonfly Dally and Dragonfly Plus models. Along with it come new routing capabilities in each model that "smartly" route packets so that they travel only along legal paths (ones that don't violate VC assignment and deadlock avoidance rules) that contain no failed links.
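As a rough illustration of the "no failed links" condition, here is a minimal C++ sketch. It is not the code in this merge request: FailedLinkSet, path_avoids_failures(), and usable_paths() are hypothetical names, and the candidate paths are assumed to have already passed the VC assignment and deadlock avoidance checks.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <utility>
#include <vector>

// Hypothetical types for illustration; these are not the CODES data structures.
using RouterId = int;
using Link = std::pair<RouterId, RouterId>;

// Tracks failed links as unordered router pairs.
class FailedLinkSet {
public:
    void fail(RouterId a, RouterId b) {
        assert(a != b);  // a link joins two distinct routers in this sketch
        down_.insert(key(a, b));
    }

    bool is_failed(RouterId a, RouterId b) const {
        return down_.count(key(a, b)) > 0;
    }

private:
    // Store each link with the smaller router id first so (a, b) == (b, a).
    static Link key(RouterId a, RouterId b) {
        return a < b ? Link(a, b) : Link(b, a);
    }

    std::set<Link> down_;
};

// True when no hop along the path crosses a failed link.
bool path_avoids_failures(const std::vector<RouterId>& path,
                          const FailedLinkSet& failures) {
    for (std::size_t i = 0; i + 1 < path.size(); ++i) {
        if (failures.is_failed(path[i], path[i + 1])) {
            return false;
        }
    }
    return true;
}

// Keeps only the candidate paths with no failed link.
// Assumption: every path in legal_paths already satisfies the VC rules.
std::vector<std::vector<RouterId>> usable_paths(
        const std::vector<std::vector<RouterId>>& legal_paths,
        const FailedLinkSet& failures) {
    std::vector<std::vector<RouterId>> kept;
    for (const auto& path : legal_paths) {
        if (path_avoids_failures(path, failures)) {
            kept.push_back(path);
        }
    }
    return kept;
}
```

With the link between routers 1 and 2 marked as failed, a candidate path 0 -> 1 -> 2 would be dropped while 0 -> 1 -> 3 would be kept.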

This feature is entirely experimental. It is subject to drastic change, so any workflow built around it now may not work in the future. Also, although it has been tested to reduce the likelihood of incorrect behavior, some risk of that remains, particularly with unique or unusual network balancing/mapping.

Use at your own risk.

This merge request also restructures the Dragonfly Plus routing to make it more modular, similar to what was done for Dragonfly Dally. Each routing algorithm is now implemented separately and selected with a switch statement. The goal is that adjustments to one routing method cannot change how the others behave.
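The general shape of that kind of restructuring can be sketched as follows. This is only an illustration, not the Dragonfly Plus source: the DemoRouting enum, the DemoChoices struct, and the demo_* functions are hypothetical, and the adaptive rule shown (fall back to a nonminimal stop when the minimal one is congested) is a simplified stand-in.

```cpp
#include <cassert>
#include <stdexcept>
#include <vector>

// Hypothetical routing modes for illustration only.
enum class DemoRouting { Minimal, NonMinimal, ProgAdaptive };

// Hypothetical per-packet routing inputs.
struct DemoChoices {
    std::vector<int> minimal_stops;     // legal next stops on a minimal path
    std::vector<int> nonminimal_stops;  // legal next stops on a nonminimal path
    bool minimal_congested;             // simplified congestion signal
};

// Each algorithm lives in its own function, so editing one
// cannot silently change the behavior of another.
int demo_route_minimal(const DemoChoices& c) {
    assert(!c.minimal_stops.empty());
    return c.minimal_stops.front();
}

int demo_route_nonminimal(const DemoChoices& c) {
    assert(!c.nonminimal_stops.empty());
    return c.nonminimal_stops.front();
}

int demo_route_prog_adaptive(const DemoChoices& c) {
    // Simplified stand-in rule: divert only when the minimal route is congested.
    if (c.minimal_congested && !c.nonminimal_stops.empty()) {
        return demo_route_nonminimal(c);
    }
    return demo_route_minimal(c);
}

// The switch statement only dispatches; it contains no routing logic itself.
int demo_pick_next_stop(DemoRouting algo, const DemoChoices& c) {
    switch (algo) {
        case DemoRouting::Minimal:      return demo_route_minimal(c);
        case DemoRouting::NonMinimal:   return demo_route_nonminimal(c);
        case DemoRouting::ProgAdaptive: return demo_route_prog_adaptive(c);
    }
    throw std::invalid_argument("unknown routing algorithm");
}
```

Keeping the switch as a pure dispatcher means a new routing method is one new function plus one new case, with no edits to the existing branches.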

A major change to Dragonfly Plus's core routing helper functionality is also included. Previously, the get_legal_nonminimal_stops() function returned legal next stops that were the converse of the legal minimal stops. This was recommended during the original design, but it seems to be overly aggressive and led to poor load balancing. Relaxing this to allow any stop that could go to an intermediate group appears to yield significant improvements in routing.
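One way to read the difference between the two rules is sketched below. This is an interpretation for illustration, not the actual get_legal_nonminimal_stops() implementation: the CandidateStop struct, its flags, and both function names are hypothetical, and the input is assumed to contain only stops that are already legal.

```cpp
#include <cassert>
#include <vector>

// Hypothetical description of one legal next stop.
struct CandidateStop {
    int router_id;
    bool on_minimal_path;             // also appears among the legal minimal stops
    bool reaches_intermediate_group;  // can forward toward an intermediate group
};

// Stricter rule (as read from the description): exclude every stop
// that is also a legal minimal stop.
std::vector<int> nonminimal_stops_strict(const std::vector<CandidateStop>& legal) {
    std::vector<int> out;
    for (const auto& s : legal) {
        assert(s.router_id >= 0);  // router ids are non-negative in this sketch
        if (!s.on_minimal_path) {
            out.push_back(s.router_id);
        }
    }
    return out;
}

// Relaxed rule (as read from the description): accept any stop that can
// go to an intermediate group, even if it is also a minimal stop.
std::vector<int> nonminimal_stops_relaxed(const std::vector<CandidateStop>& legal) {
    std::vector<int> out;
    for (const auto& s : legal) {
        assert(s.router_id >= 0);
        if (s.reaches_intermediate_group) {
            out.push_back(s.router_id);
        }
    }
    return out;
}
```

In this toy model, a stop that sits on a minimal path but can also reach an intermediate group is rejected by the strict rule and accepted by the relaxed one, which gives the router more nonminimal choices to spread load across.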

More testing is necessary to ensure that the routing changes did not adversely affect previously expected behavior.

nmcglo commented 4 years ago

I want to do a bit more testing on this before merging.

nmcglo commented 3 years ago

A test of an 8K DFP network running SWM+ low Synthetic interference showed no notable difference in application performance with the refactored prog_adaptive.