codes-org / codes

The Co-Design of Exascale Storage Architectures (CODES) simulation framework builds upon the ROSS parallel discrete event simulation engine to provide high-performance simulation utilities and models for building scalable distributed systems simulations

Dfdally Refactoring: Routing #183

Closed nmcglo closed 4 years ago

nmcglo commented 4 years ago

This PR brings in work focused on fixing the adaptive routing algorithm used by Dragonfly Dally. The previous progressive adaptive routing (PAR) algorithm, now known as 'legacy', had become difficult to maintain and extend. This PR includes rewritten minimal, non-minimal (Valiant), and PAR algorithms. When this progress was discussed, the importance of the original (legacy) version of the PAR implementation was stressed.

Since then, I have added the legacy PAR algorithm back into the model (also included in this PR), but this time it uses the new Connection Manager class to make decisions. Because the Connection Manager class stores data differently from the original IntraGroupLinks and InterGroupLinks nested map structures, some behavior (possibly unintended to begin with) was not replicated. For example, in the original, a selected nonminimal channel could be the router itself, which is invalid. This would cause an error, except that get_output_port() would effectively pick an intra-group port based on the local group ID of the router itself. That, in turn, should eventually throw an error, since the correct destination router's local ID is almost never equal to the port number associated with that link. It ends up working out, though, because minimal routes are eventually required and the bug did not exist for those routes.
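To illustrate the invalid case described above, here is a minimal Python sketch of choosing a nonminimal intermediate router while ruling out the router itself. It is not the model's code: `select_intermediate`, `my_router_id`, and `group_router_ids` are hypothetical names, and the PR does not state that the rewritten or legacy algorithm guards against this case in this way.

```python
import random
from typing import List


def select_intermediate(my_router_id: int,
                        group_router_ids: List[int],
                        rng: random.Random) -> int:
    """Pick a nonminimal intermediate router, never the router itself.

    All names here are hypothetical; this only illustrates the validity
    rule that a router cannot be its own nonminimal next hop.
    """
    # Drop the current router from the candidate set before choosing.
    candidates = [r for r in group_router_ids if r != my_router_id]
    if not candidates:
        raise ValueError("no valid intermediate router available")
    return rng.choice(candidates)


rng = random.Random(0)
picks = [select_intermediate(2, [0, 1, 2, 3], rng) for _ in range(100)]
```

Filtering the candidates up front means an invalid selection cannot reach the port lookup at all, rather than being absorbed by a later fallback.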

This legacy PAR algorithm is provided 'as is'. There are potential bugs in the implementation, but no documentation exists that distinguishes intended from unintended behavior. As such, supporting this algorithm is challenging, and future work on it will be limited to making sure it still works with any future model changes.

This PR also includes work that fixes the synthetic workload generator within model-net-mpi-replay. Fixes include a data locality bug in determining, across all PEs, whether there is a synthetic rank that needs to be informed of a "completed workload" by a primary workload rank. Previously, this was tracked by a global static variable, is_synthetic. But depending on how LPs are mapped to PEs, it's possible for a PE to be filled with only primary workload ranks. In that case, is_synthetic would read false on that PE and none of its primary workload ranks would ever send notifications of their completion. The simulation then wouldn't stop until the ending timestamp was reached (read: a very long time), and because those ranks never "finished", the generated output files aren't valid. This was fixed via an MPI reduce between PEs to ensure that every PE knows whether at least one synthetic workload rank exists.
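The locality problem can be modeled in a few lines of plain Python. This is a sketch, not the model's code: it does not call MPI, the reduction is replaced by a logical OR over per-PE flags, and the `mapping` of PEs to rank kinds is a hypothetical example.

```python
from typing import Dict, List


def local_flags(mapping: Dict[int, List[str]]) -> Dict[int, bool]:
    """Per-PE view before the fix: True only if this PE hosts a synthetic rank."""
    return {pe: any(kind == "synthetic" for kind in kinds)
            for pe, kinds in mapping.items()}


def reduced_flags(mapping: Dict[int, List[str]]) -> Dict[int, bool]:
    """Stand-in for a logical-OR reduction whose result every PE receives."""
    any_synthetic = any(local_flags(mapping).values())
    return {pe: any_synthetic for pe in mapping}


# Hypothetical LP-to-PE mapping: PE 0 happens to hold only primary ranks.
mapping = {0: ["primary", "primary"], 1: ["primary", "synthetic"]}

before = local_flags(mapping)    # PE 0 believes no synthetic rank exists
after = reduced_flags(mapping)   # every PE agrees that one exists
```

With only the local view, the primary ranks on PE 0 would never send their completion notifications. After the reduction, every PE holds the same answer regardless of how LPs were mapped.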

This PR also allows the max_gen_data command-line argument for model-net-mpi-replay to work for all synthetic workloads, not just when QoS is turned on.

Finally, this PR also fixes the Dragonfly Dally specific synthetic workload generator. Previously, on Uniform Random, it would frequently send a number of messages close to, but less than, a multiple of the --num_messages= command-line argument. This was because a rank could select itself as a destination, and the dally model ignores those messages. This has been fixed.
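One common way to keep a rank from selecting itself under uniform random traffic is to draw from the other ranks only. The Python sketch below shows that idea. The PR does not say which approach it takes, and `pick_destination`, `self_rank`, and `num_ranks` are hypothetical names.

```python
import random


def pick_destination(self_rank: int, num_ranks: int, rng: random.Random) -> int:
    """Return a destination chosen uniformly from all ranks except self_rank."""
    if num_ranks < 2:
        raise ValueError("need at least two ranks to pick a destination")
    # Draw from num_ranks - 1 slots, then skip over the sender's own slot.
    dest = rng.randrange(num_ranks - 1)
    if dest >= self_rank:
        dest += 1
    return dest


rng = random.Random(0)
dests = [pick_destination(2, 4, rng) for _ in range(1000)]
```

Because no draw is discarded, every generated message has a valid destination, and the number of messages sent matches the number requested.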