Because of the essence of one-sided communication, the progress of different processes may vary a lot, especially under the heterogeneous environment. If simply write the code like
for e in range(epochs):
xxx
some_collective_ops
Then, the last collective ops will waste the advantage of one-sided communication. We need a better way to design the code or deal with this situation.
Thoughts: 1. Use barrier function every N iterations, which can be useful for unstable performance but not useful for heterogeneous situation.
Run for a very long time and relied on the early stopping technology, whichever node/agent achieve the stopping criteria, sending a stop signal to the others and use the model of that agent as the final result.
Because of the essence of one-sided communication, the progress of different processes may vary a lot, especially under the heterogeneous environment. If simply write the code like for e in range(epochs): xxx some_collective_ops
Then, the last collective ops will waste the advantage of one-sided communication. We need a better way to design the code or deal with this situation.