codes-org / codes

The Co-Design of Exascale Storage Architectures (CODES) simulation framework builds upon the ROSS parallel discrete event simulation engine to provide high-performance simulation utilities and models for building scalable distributed systems simulations
Other
40 stars 16 forks source link

Modelnet MPI Replay: Problem with multiple nonsynthetic traces + synthetic background #208

Open nmcglo opened 3 years ago

nmcglo commented 3 years ago

Current perceived intention of this workload file allows for multiple traces to be replayed (or online workloads) in addition to synthetic background traffic that is set to continue injecting until the trace based workloads finish. But the implementation does not allow this - it can have either 1 trace + multiple synthetic or multiple trace + no synthetic.

To allow for constant synthetic injection until the trace has completed, as ranks from trace based workloads finish, they notify their neighboring ranks in their workload that they've finished. Once all have finished, they notify the background traffic that they are all done and that signals the synthetic background traffic to stop generating traffic.

There are multiple issues with how this is implemented. Specifically of note is the notify_neighbor() function. What this appears to attempt to do (which works if there is only one trace) is check "Am I the last rank in my workload, have I finished, and has my preceding neighbor finished? If so, then we're done, notify background traffic of this fact.

But that's not what is actually implemented. Instead what is implemented is "Is my local rank within my workload equal to the number of trace ranks-1? Have I finished, and has my preceding neighbor finished? If so, then we're done, notify every other workload besides mine of this fact and have them change their rank "is finished" state to "finished".

  1. The local rank within a workload will only ever be equal to the number of trace ranks - 1 if there is only one dumpi trace being replayed. If you had two dumpi traces being replayed each with 1000 ranks, the maximum local rank of any one rank would be 999 but the number of trace ranks -1 would be 1999.

  2. As long as problem 1 is not addressed, problem 2 will never occur but for sake of documentation, the intention is to "once all workloads are complete, notify the background traffic" but the notify_background_traffic() function that is called makes no distinction about what workload ranks it is sending to so long as they aren't from the workload that is sending the notification. So if there is another, uncompleted, trace running, they will also receive a notification that forces each rank to set their is_finished flag to true, even if they're not finished. This means that should they receive a notification from their neighbor that they've finished, then they'll forward on that message despite themselves being actually incomplete. This problem is important to note because should problem 1 be addressed improperly, this could become a major - silent - issue.

Moving down in notify_neighbor(), we get to where the 'within a workload notification chain' is started/forwarded. It checks: "am I finished? Am I local rank 0 or has my preceding neighbor finished? If so, send to the next rank in my workload to start/forward this notification chain"

  1. The conditional would be fine if the conditional relating to problem 1 is addressed, but as it stands the last rank in a workload will eventually reach this statement and it will eventually be true. They then poll the jobmap to give them the global ID of their forward neighbor in their workload. There is zero error checking in the jobmap functions to make sure that you're requesting a rank that exists in the requested workload. This then polls from the jobmap to a spot on the global IDs array for that job that is OOB. (perhaps this is the source of a mysterious segfault occurrence i've noticed in extremely long running simulations). This results in a looping effect where the notification chain is endless. Even if this doesn't cause semantic problems, this will be a major performance problem as it is a lot of unnecessary ROSS events.

Anyway, I believe that my work on congestion control will require this to be addressed. Assigning to myself. My goal is to make it so that multiple traces can be replayed with synthetic traffic as well.

nmcglo commented 3 years ago

I believe that I've come up with a solution. If possible, I will put it in it's own branch and pull request.

Ultimately what I've done is added a new protocol to efficiently notify the background ranks that the primary workload. When non synthetic workloads finish (this is determined by a slightly modified notify_neighbor() method), they notify the last rank in each non-synthetic job with a new event type CLI_OTHER_FINISH. Upon receipt of this event, each receiving rank (the last in each workload) will update and check a list of jobs that it knows have been completed. If the number of completed non synthetic jobs equals the number of non-synthetic jobs, then the last rank in the highest ordered non-synthetic job will then notify all ranks from all synthetic jobs that the primary jobs have completed - thus prompting them to stop generating data.

I've implemented this and done some small tests but it's still pretty messy so it's not committed yet. Will do so tomorrow, reference this issue and close it as it will be set to be addressed in a coming pull request.

nmcglo commented 3 years ago

Some preliminary testing of my unpushed branch has worked out well. I've written it so that if --max_gen_data is used, the simulation will continue until all workloads, including synthetic have stopped.

nmcglo commented 3 years ago

Addressed in the current develop branch, will close when released.