codes-org / codes

The Co-Design of Exascale Storage Architectures (CODES) simulation framework builds upon the ROSS parallel discrete event simulation engine to provide high-performance simulation utilities and models for building scalable distributed systems simulations.

How to collect ongoing events logging #214

Closed FBF1118 closed 3 years ago

FBF1118 commented 3 years ago

CODES is an exciting project. LP-IO is a simple API for writing data collectively from a ROSS simulation. It is best suited for relatively compact final network statistics from each LP, but I need to collect ongoing event logs, such as source-destination pairs, message path, message size, transfer delay, send time, and receive time. I read the related tutorials but didn't find a way to collect this. Is there a way to do it?

Any help would be much appreciated!

nmcglo commented 3 years ago

There has been previous work on collecting ROSS model data, but I'm uncertain of its current state of implementation in CODES. Dragonfly Dally has some implementation of this ROSS stat collection mechanism; the callback structure is formed on the line containing: st_model_types custom_dally_dragonfly_model_types[]... Again, however, I cannot personally vouch for the validity of that mechanism.

If you were to write your own code to write simulation/event data out to a file, there are some things to keep in mind. One concern is that if the simulation is executed optimistically, events may be rolled back after they have already written to the file. This leads to over-eager output of data, and the final file would contain invalid data interspersed among valid data.
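To make the rollback concern concrete, here is a minimal toy sketch in plain C. It does not use the ROSS API: `toy_log`, `toy_forward`, `toy_rollback`, `toy_commit`, and `toy_run` are hypothetical stand-ins, and the event ordering in `toy_run` is an assumed scenario in which event 3 is processed early, rolled back by a straggler, and then reprocessed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an optimistic engine; this is NOT the ROSS API. */
enum { MAX_EVENTS = 16 };

struct toy_log {
    int eager[MAX_EVENTS];      /* ids written as soon as an event is processed */
    size_t n_eager;
    int committed[MAX_EVENTS];  /* ids written only once an event is committed */
    size_t n_committed;
};

/* Forward handler: logs immediately, before we know whether the event survives. */
void toy_forward(struct toy_log *log, int event_id)
{
    assert(log->n_eager < MAX_EVENTS);
    log->eager[log->n_eager++] = event_id;
}

/* Rollback: model state would be undone here, but a line that has already
 * been written to the eager log (think: a file) is not retracted. */
void toy_rollback(struct toy_log *log, int event_id)
{
    (void)log;
    (void)event_id;
}

/* Commit: the event can no longer be rolled back, so logging here is safe. */
void toy_commit(struct toy_log *log, int event_id)
{
    assert(log->n_committed < MAX_EVENTS);
    log->committed[log->n_committed++] = event_id;
}

/* Assumed scenario: event 3 runs ahead, straggler 2 forces it to roll back,
 * then 2 and 3 are (re)processed and all three events are committed. */
void toy_run(struct toy_log *log)
{
    log->n_eager = 0;
    log->n_committed = 0;
    toy_forward(log, 1);
    toy_forward(log, 3);
    toy_rollback(log, 3);
    toy_forward(log, 2);
    toy_forward(log, 3);
    toy_commit(log, 1);
    toy_commit(log, 2);
    toy_commit(log, 3);
}
```

After `toy_run`, the eager log holds four entries (event 3 appears twice, once from the rolled-back execution), while the committed log holds exactly the three events that make up the final simulation history.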

ROSS has a function type that can be implemented, called a commit function, which is called on an event once it is committed to the simulation history. At that point the event can no longer be rolled back, so it is safe to output data from it.

This commit function, for example on a dragonfly-dally model, could contain a switch statement checking the event type. If the event type were, say, T_ARRIVE, then the event corresponds to a packet arriving at the destination terminal. From there we could safely collect data from the event structure and output it to a file.
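The shape of such a commit-time logger can be sketched as follows. This is a self-contained illustration, not the dragonfly-dally code: the `toy_msg` struct, its field names, the `T_GENERATE` and `T_SEND` enumerators, and the CSV column order are all assumptions made for the example; only the `T_ARRIVE` name comes from the discussion above. In a real model this logic would live in the commit handler registered with ROSS and would read the model's actual message struct.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical event types; only T_ARRIVE is named in the discussion. */
enum toy_event_type { T_GENERATE, T_SEND, T_ARRIVE };

/* Hypothetical message layout; the real model's message struct differs. */
struct toy_msg {
    enum toy_event_type type;
    unsigned long src_terminal;
    unsigned long dst_terminal;
    unsigned long size_bytes;
    double send_time;
    double recv_time;
};

/* Body of a commit-time logger: reads only the event, never LP state.
 * Writes one CSV row (src,dst,size,send,recv,delay) for arrival events.
 * Returns 1 if a row was written, 0 if the event type was skipped. */
int toy_commit_log(FILE *out, const struct toy_msg *m)
{
    assert(out != NULL && m != NULL);
    switch (m->type) {
    case T_ARRIVE:
        fprintf(out, "%lu,%lu,%lu,%.3f,%.3f,%.3f\n",
                m->src_terminal, m->dst_terminal, m->size_bytes,
                m->send_time, m->recv_time,
                m->recv_time - m->send_time);
        return 1;
    default:
        /* Other event types carry nothing we want to log here. */
        return 0;
    }
}
```

A caller would open one output file per process (or hand the rows to a collective writer) and pass each committed event to `toy_commit_log`; events of other types fall through the `default` branch without producing output.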

It's important to note, if I recall correctly, that it is not safe to read LP state inside a commit function: the time at which an event is committed is almost certainly not the time at which the event was scheduled to occur, so the LP state has likely changed between the event's timestamp and the time of commit.
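One possible way around this, offered as an assumption rather than something confirmed for CODES, is to copy any state value you want to log into the event itself while the forward handler runs, and have the commit function read only that copy. The sketch below uses hypothetical names (`snap_state`, `snap_msg`, `snap_forward`, `snap_commit_read`) and is not ROSS code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical LP state and message; names are illustrative only. */
struct snap_state {
    int queue_len;
};

struct snap_msg {
    int packet_id;
    int saved_queue_len;  /* copy of LP state taken when the event ran */
};

/* Forward handler: snapshot the value into the event, then mutate state. */
void snap_forward(struct snap_state *s, struct snap_msg *m)
{
    assert(s != NULL && m != NULL);
    m->saved_queue_len = s->queue_len;
    s->queue_len += 1;
}

/* Commit-time read: uses only the event's own copy, so it is unaffected
 * by whatever has happened to the LP state since the event ran. */
int snap_commit_read(const struct snap_msg *m)
{
    assert(m != NULL);
    return m->saved_queue_len;
}
```

If two events run against a state that starts with `queue_len == 2`, the live state ends at 4, but the commit-time reads still return 2 and 3, the values that held when each event was processed.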

nmcglo commented 3 years ago

Having a robust system for outputting various CODES simulation data is definitely a valuable feature but unfortunately I'm spread a little thin - working on defending my thesis in the next month or so - and am thus unable to estimate a timeline for when this feature is verified and documented.

If I had personally participated in the development of said ROSS stat collection feature and its integration into CODES, I might have a better idea but unfortunately I wasn't involved and would have to tinker and study the code to figure it out.

FBF1118 commented 3 years ago

Thank you for your detailed reply. I will try it.

nmcglo commented 3 years ago

Closing for now - feel free to open a CODES discussion for any future help.