codes-org / codes

The Co-Design of Exascale Storage Architectures (CODES) simulation framework builds upon the ROSS parallel discrete event simulation engine to provide high-performance simulation utilities and models for building scalable distributed systems simulations

Union online workload simulation #235

Open xwang149 opened 1 year ago

xwang149 commented 1 year ago

This document serves the following purposes:

= CODES updates

Code modifications are marked with comments that begin with "Xin:".

== Header file

Added parameters for collecting router traffic data, including:

== Makefile

Added a check for the Union installation to the autoconf configure script configure.ac. Added src/workload/methods/codes-conc-online-comm-wrkld.C to the code base in Makefile.am when CODES is compiled with Union.

== Union online workload

We added a pluggable workload module, "src/workload/methods/codes-conc-online-comm-wrkld.C", to the CODES workload generator. It holds the actual implementation of the Union communication events, so that messages from Union skeletons can be emitted as simulation events in CODES.
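The idea of a pluggable workload module can be sketched as a table of function pointers that the generator looks up by name. This is a minimal illustration only: every identifier below (`demo_workload_method`, `demo_op`, `demo_lookup`, the "demo-online" method) is hypothetical and is not the CODES or Union API, and the toy generator stands in for a real skeleton-driven one.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical operation record; not the CODES workload op structure. */
enum demo_op_type { DEMO_OP_SEND, DEMO_OP_RECV, DEMO_OP_END };

struct demo_op {
    enum demo_op_type type;
    int peer;       /* destination (send) or source (recv) rank */
    size_t bytes;   /* message size in bytes */
};

/* Hypothetical method table: each workload generator provides one. */
struct demo_workload_method {
    const char *name;
    int  (*load)(int rank);                          /* 0 on success, -1 on error */
    void (*get_next)(int rank, struct demo_op *op);  /* next operation for rank */
};

/* A toy "online" method that emits one send, one recv, then END per rank. */
#define DEMO_MAX_RANKS 8
static int demo_cursor[DEMO_MAX_RANKS];

static int online_load(int rank)
{
    if (rank < 0 || rank >= DEMO_MAX_RANKS)
        return -1;
    demo_cursor[rank] = 0;
    return 0;
}

static void online_get_next(int rank, struct demo_op *op)
{
    assert(rank >= 0 && rank < DEMO_MAX_RANKS);
    int step = demo_cursor[rank]++;
    if (step == 0) {
        op->type = DEMO_OP_SEND;
        op->peer = (rank + 1) % DEMO_MAX_RANKS;
        op->bytes = 1024;
    } else if (step == 1) {
        op->type = DEMO_OP_RECV;
        op->peer = (rank + DEMO_MAX_RANKS - 1) % DEMO_MAX_RANKS;
        op->bytes = 1024;
    } else {
        op->type = DEMO_OP_END;
        op->peer = -1;
        op->bytes = 0;
    }
}

static const struct demo_workload_method online_method = {
    "demo-online", online_load, online_get_next
};

/* NULL-terminated registry; a new generator is plugged in by adding an entry. */
static const struct demo_workload_method *demo_methods[] = {
    &online_method, NULL
};

/* Look a method up by name; returns NULL if it is not registered. */
const struct demo_workload_method *demo_lookup(const char *name)
{
    for (size_t i = 0; demo_methods[i] != NULL; i++)
        if (strcmp(demo_methods[i]->name, name) == 0)
            return demo_methods[i];
    return NULL;
}
```

A caller would obtain the table with `demo_lookup("demo-online")`, call `load` once per rank, and then call `get_next` repeatedly until it returns `DEMO_OP_END`, turning each returned operation into a simulation event.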

== Router status collection for dragonfly custom and dragonfly dally

Added supporting functions for collecting per-port router traffic data in the following network models:

== Updates in MPI replay

Added the Union online workload type to the MPI workload replay in src/network-workloads/model-net-mpi-replay.c

== Configurations

We added the following items to the CODES configuration file for collecting router traffic information during simulation.

An example configuration can be found at: https://github.com/SPEAR-IIT/Union/blob/master/test/df1d-72-adp.conf

= Installation tutorial

Please follow the README at https://github.com/SPEAR-IIT/Union to install Union and run the test simulations of Union online workloads.

= Completed Experiments

We have completed the following experiments with Union online workload simulation:

The above experiments were run on both the dragonfly custom and dragonfly dally network models, in both sequential and optimistic modes.

= Known Issues

The rendezvous protocol in MPI replay does not currently work with Union online workloads. The reverse function router_buf_update_rc() does not handle reversals that cross a window boundary when undoing the aggregated busy time on a port.
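To illustrate the second issue, here is a minimal sketch of one way per-window busy time could be reversed when a busy interval spans several windows. It reflects one possible reading of the problem and is not the CODES implementation: `demo_port_stats`, `demo_busy_forward`, `demo_busy_reverse`, and the fixed window count are all hypothetical. The point is that the reverse handler has to split the interval at the same window boundaries as the forward handler.

```c
#include <assert.h>

#define DEMO_NUM_WINDOWS 16

/* Hypothetical per-port statistics; not the CODES router state. */
struct demo_port_stats {
    double window_len;               /* length of one sampling window */
    double busy[DEMO_NUM_WINDOWS];   /* aggregated busy time per window */
};

/* Add (sign = +1) or remove (sign = -1) the busy interval [start, end),
 * splitting it at every window boundary it crosses. Any part of the
 * interval beyond the last window is ignored in both directions. */
static void demo_apply_busy(struct demo_port_stats *p,
                            double start, double end, double sign)
{
    assert(p->window_len > 0.0 && start >= 0.0 && end >= start);
    int w = (int)(start / p->window_len);
    double t = start;
    while (t < end && w < DEMO_NUM_WINDOWS) {
        double w_end = (w + 1) * p->window_len;
        double slice_end = end < w_end ? end : w_end;
        p->busy[w] += sign * (slice_end - t);
        t = slice_end;
        w++;
    }
}

/* Forward handler: record the busy interval. */
void demo_busy_forward(struct demo_port_stats *p, double start, double end)
{
    demo_apply_busy(p, start, end, +1.0);
}

/* Reverse handler: undo the same interval, window by window. */
void demo_busy_reverse(struct demo_port_stats *p, double start, double end)
{
    demo_apply_busy(p, start, end, -1.0);
}
```

With a window length of 10, a busy interval from 8 to 23 contributes 2, 10, and 3 time units to windows 0, 1, and 2; the reverse call removes exactly those three slices. Subtraction restores the earlier values exactly only when the slices are exactly representable in floating point, so a real model might prefer to save and restore the affected window values instead.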

xwang149 commented 1 year ago

Push the merged changes to the new branch kronos-development instead.