codes-org / codes

The Co-Design of Exascale Storage Architectures (CODES) simulation framework builds upon the ROSS parallel discrete event simulation engine to provide high-performance simulation utilities and models for building scalable distributed systems simulations.

Model Net MPI Replay Determinism #189

Closed nmcglo closed 4 years ago

nmcglo commented 4 years ago

I have narrowed down the nondeterminism seen when running trace files: it is primarily due to model-net-mpi-replay.c. I've made the dragonfly-dally.C model and its default synthetic workload generator completely deterministic in all execution modes (sequential, conservative, optimistic), but as soon as I apply a trace file to it instead, it loses determinism.

It's been suspected that this is due to the RC stack and possible misuse of it. However, the RC stack is also used by dragonfly-dally for some of its reverse computation, so it's not entirely flawed.
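For readers unfamiliar with the idea, here is a minimal sketch of how a reverse-computation stack is typically meant to be used: the forward handler saves whatever it is about to overwrite, and the reverse handler pops that entry and restores it. Everything here (`toy_lp_state`, `rc_entry`, `toy_forward`, `toy_reverse`) is hypothetical and made up for illustration; it is not the CODES rc-stack API and not the code in model-net-mpi-replay.c.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical toy RC stack entry; NOT the CODES rc-stack API. */
typedef struct rc_entry {
    int saved_value;          /* value that the forward handler overwrote */
    struct rc_entry *next;    /* next-older saved entry */
} rc_entry;

/* Hypothetical LP state: one mutable field plus its RC stack. */
typedef struct {
    int counter;              /* state mutated by events */
    rc_entry *rc_top;         /* most recently saved entry first */
} toy_lp_state;

/* Forward handler: save the value about to be overwritten, then overwrite it.
 * Returns 0 on success, -1 if the allocation fails. */
static int toy_forward(toy_lp_state *s, int new_value)
{
    rc_entry *e = malloc(sizeof *e);
    if (e == NULL)
        return -1;
    e->saved_value = s->counter;
    e->next = s->rc_top;
    s->rc_top = e;
    s->counter = new_value;
    return 0;
}

/* Reverse handler: pop the most recent entry and restore the saved value.
 * Returns 0 on success, -1 if there is nothing to roll back. */
static int toy_reverse(toy_lp_state *s)
{
    rc_entry *e = s->rc_top;
    if (e == NULL)
        return -1;
    s->counter = e->saved_value;
    s->rc_top = e->next;
    free(e);
    return 0;
}
```

The sketch only restores state correctly if every forward call is undone by exactly one reverse call, in last-in-first-out order. A push without a matching pop (or the other way round) is the kind of misuse that would leave the state different after a rollback.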

I'm going to attempt to tackle this and finally make CODES fully deterministic.

nmcglo commented 4 years ago

Update: I have tested the following on a 3k-node 1D Dragonfly using model-net-mpi-replay (in my refactoring branch): (1) 1000 synthetic ranks, and (2) MG1000 + 1000 synthetic ranks. Both gave deterministic results. Efficiency was incredibly high, though, so maybe the nondeterminism is confined to an event whose rollback is poorly implemented and that just didn't happen to be rolled back in these runs. Will continue testing.
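To illustrate why a high-efficiency run can hide this kind of problem, here is a small sketch under made-up assumptions: a toy statistics struct, one forward handler, and two reverse handlers, one correct and one that forgets to undo a field. All names (`toy_stats`, `send_forward`, `send_reverse_ok`, `send_reverse_buggy`, `toy_run`) are hypothetical and are not taken from CODES or ROSS.

```c
#include <stdbool.h>

/* Hypothetical per-LP statistics touched by a "send" event. */
typedef struct {
    long bytes_sent;
    long msgs_sent;
} toy_stats;

/* Forward handler: account for one message of nbytes. */
static void send_forward(toy_stats *s, long nbytes)
{
    s->bytes_sent += nbytes;
    s->msgs_sent += 1;
}

/* Correct reverse handler: undoes every change made by send_forward. */
static void send_reverse_ok(toy_stats *s, long nbytes)
{
    s->msgs_sent -= 1;
    s->bytes_sent -= nbytes;
}

/* Buggy reverse handler: forgets to undo msgs_sent. */
static void send_reverse_buggy(toy_stats *s, long nbytes)
{
    s->bytes_sent -= nbytes;
}

/* Process n send events of nbytes each. If rollback is true, the last event
 * is rolled back with the given reverse handler and then re-executed, which
 * is a crude stand-in for what an optimistic run does after a straggler. */
static toy_stats toy_run(int n, long nbytes, bool rollback,
                         void (*reverse)(toy_stats *, long))
{
    toy_stats s = { 0, 0 };
    for (int i = 0; i < n; i++)
        send_forward(&s, nbytes);
    if (rollback && n > 0) {
        reverse(&s, nbytes);
        send_forward(&s, nbytes);
    }
    return s;
}
```

With no rollback, both reverse handlers give identical final statistics because neither is ever called. Only when the event is actually rolled back does the buggy handler leave `msgs_sent` one too high, so a run with very few rollbacks may simply never exercise the faulty path.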

nmcglo commented 4 years ago

An online LAMMPS2048 + 2048 synthetic run on an 8k-node 1D Dragonfly also produced deterministic results, with an efficiency of 95%.

I guess this should be marked tentatively closed for now.