Bluefog-Lib / bluefog

Distributed and decentralized training framework for PyTorch over graph
https://bluefog-lib.github.io/bluefog/
Apache License 2.0
291 stars, 71 forks

How to deal with the case where one or more processes are much faster than the others #21

Open BichengYing opened 4 years ago

BichengYing commented 4 years ago

Because of the nature of one-sided communication, the progress of different processes may vary a lot, especially in a heterogeneous environment. If we simply write the code like

    for e in range(epochs): xxx some_collective_ops

then the final collective ops force the fast processes to wait for the slow ones, which wastes the advantage of one-sided communication. We need a better way to design the code or to deal with this situation.
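To make the cost concrete, here is a toy timing model in plain Python (no bluefog calls). It assumes the collective op is blocking and runs once per epoch, which is one possible reading of the loop above. The function name `finish_times` and the per-process step times are hypothetical, chosen only for illustration.

```python
def finish_times(step_times, epochs, sync_every_epoch):
    """Return the simulated time at which each process finishes.

    step_times: assumed seconds per epoch for each process (hypothetical).
    sync_every_epoch: True models a blocking collective op in every epoch.
    """
    if sync_every_epoch:
        # A blocking collective makes every process wait for the slowest
        # one, so all processes advance in lockstep.
        epoch_time = max(step_times)
        return [epochs * epoch_time for _ in step_times]
    # Without synchronization each process runs at its own pace.
    return [epochs * t for t in step_times]


# Two fast processes and one that is four times slower.
step_times = [1.0, 1.0, 4.0]
print(finish_times(step_times, epochs=10, sync_every_epoch=True))
print(finish_times(step_times, epochs=10, sync_every_epoch=False))
```

In this model the two fast processes are idle for three quarters of the run whenever the collective is present, which is the waste described above.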

BichengYing commented 4 years ago

Thoughts:

1. Use a barrier function every N iterations. This can help when performance is unstable, but it does not help in the heterogeneous situation.

2. Run for a very long time and rely on early stopping: whichever node/agent reaches the stopping criterion first sends a stop signal to the others, and that agent's model is used as the final result.
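The early-stopping idea can be sketched with plain Python threads standing in for processes. This is only a local simulation under stated assumptions, not bluefog code: `threading.Event` plays the role of the stop signal, the toy loss decay replaces a real training step, and `run_early_stopping`, `step_times`, and `threshold` are hypothetical names.

```python
import threading
import time


def run_early_stopping(step_times, threshold=0.1, max_iters=1000):
    """Simulate agents of different speeds; the first to converge stops all."""
    stop = threading.Event()   # stand-in for the stop signal
    lock = threading.Lock()    # protects the shared result
    result = {"winner": None, "loss": None}
    iters_done = [0] * len(step_times)

    def worker(rank):
        loss = 1.0
        for i in range(max_iters):
            if stop.is_set():          # another agent already finished
                break
            time.sleep(step_times[rank])  # simulated compute time
            loss *= 0.8                   # toy "training step"
            iters_done[rank] = i + 1
            if loss <= threshold:
                with lock:
                    # Only the first agent to converge is recorded.
                    if result["winner"] is None:
                        result["winner"] = rank
                        result["loss"] = loss
                stop.set()             # tell the others to stop
                break

    threads = [threading.Thread(target=worker, args=(r,))
               for r in range(len(step_times))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result, iters_done


# Two fast agents and one slow agent (seconds per iteration, assumed).
result, iters_done = run_early_stopping([0.001, 0.001, 0.02])
print(result, iters_done)
```

No agent waits on a per-iteration basis here; the slow agent simply does fewer iterations before it sees the stop signal. In a real distributed run the signal would have to travel over the network, and how to do that in bluefog is left open in this sketch.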