Closed: nmcglo closed this 4 years ago
Update: I have tested the following on a 3k-node 1D Dragonfly using model-net-mpi-replay (in my refactoring branch):

- 1000 synthetic ranks
- MG1000 + 1000 synthetic ranks

Both gave deterministic results. Efficiency was incredibly high, so maybe the nondeterminism comes from something that is poorly implemented in an event that just didn't happen to be rolled back in these runs. Will continue testing.
An online LAMMPS2048 + 2048 synthetic run on an 8k-node 1D Dragonfly also produced deterministic results, with an efficiency of 95%.
I guess this should be marked tentatively closed for now.
I have narrowed down the nondeterminism when running trace files to be primarily due to model-net-mpi-replay.c. I've made the dragonfly-dally.C model and its default synthetic workload generator completely deterministic in all execution modes (sequential, conservative, optimistic), but as soon as I apply a trace file to it instead, it loses determinism.
It's been suspected that this is due to the RC stack and some misuse of it. However, dragonfly-dally also uses the RC stack for some of its reverse computation, so the RC stack itself is not entirely flawed.
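To make the suspected failure mode concrete, here is a minimal sketch of how a rollback stack can be used correctly or misused. Everything in it (`lp_state`, `rc_entry`, `rc_stack`, `forward_event`, `reverse_event`, `replay_matches`) is a hypothetical stand-in written for illustration. It is not the CODES RC stack API or the model-net-mpi-replay code, and the toy LCG only stands in for a reversible RNG stream. The point is that a reverse handler that pops the stack but forgets to restore one piece of state makes the re-executed event diverge, which only shows up when that event actually gets rolled back.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical LP state; not the real CODES/ROSS structures. */
typedef struct {
    uint64_t rng;        /* toy LCG state standing in for an RNG stream */
    uint64_t bytes_sent; /* a counter the event handler updates */
} lp_state;

/* What one forward event saves so that it can be undone later. */
typedef struct {
    uint64_t rng_before;
    uint64_t size;
} rc_entry;

/* Hypothetical fixed-size rollback stack (illustration only). */
#define RC_MAX 16
typedef struct {
    rc_entry entries[RC_MAX];
    size_t top;
} rc_stack;

/* One step of a 64-bit LCG (Knuth's MMIX constants). */
static uint64_t lcg_next(uint64_t s) {
    return s * 6364136223846793005ULL + 1442695040888963407ULL;
}

/* Forward handler: draw a random message size, update the counter,
 * and push everything needed to undo this event. */
void forward_event(lp_state *s, rc_stack *rc) {
    assert(rc->top < RC_MAX);
    rc_entry e;
    e.rng_before = s->rng;
    s->rng = lcg_next(s->rng);
    e.size = (s->rng >> 33) % 1024 + 1;
    s->bytes_sent += e.size;
    rc->entries[rc->top++] = e;
}

/* Reverse handler: pop the saved entry and undo the forward event.
 * forget_rng != 0 simulates a buggy handler that skips the RNG restore. */
void reverse_event(lp_state *s, rc_stack *rc, int forget_rng) {
    assert(rc->top > 0);
    rc_entry e = rc->entries[--rc->top];
    s->bytes_sent -= e.size;
    if (!forget_rng)
        s->rng = e.rng_before;
}

/* Run forward, roll back, run forward again; return 1 if the
 * re-executed event reproduces the original state, else 0. */
int replay_matches(int forget_rng) {
    lp_state s = { 12345u, 0 };
    rc_stack rc;
    rc.top = 0;
    forward_event(&s, &rc);
    lp_state first = s;
    reverse_event(&s, &rc, forget_rng);
    forward_event(&s, &rc);
    return s.rng == first.rng && s.bytes_sent == first.bytes_sent;
}
```

With the complete reverse handler, `replay_matches(0)` returns 1. With the RNG restore skipped, `replay_matches(1)` returns 0, because the replayed event starts from a different RNG state. A run where this event is never rolled back would look deterministic either way, which fits the high-efficiency results above.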
I'm going to attempt to tackle this and finally make CODES fully deterministic.