Bluefog-Lib / bluefog

Distributed and decentralized training framework for PyTorch over graph
https://bluefog-lib.github.io/bluefog/
Apache License 2.0

Tensorflow Parallel Execution #4

Closed BichengYing closed 4 years ago

BichengYing commented 4 years ago

Our implementation requires that the global collective ops are issued in the same order across all processes. If the order is not the same (for example, rank 0 issues MPI_Allreduce for the layer 1 weight while another rank issues MPI_Allreduce for the layer 1 bias), MPI will either abort or hang.
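
For illustration, here is a minimal mpi4py sketch of that ordering hazard (my own example, not BlueFog code; the buffer names are made up). Both ranks reduce the same two buffers, but in different orders, so the collectives get paired against the wrong counterparts and the run typically hangs or aborts:

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

weight_grad = np.ones(4, dtype=np.float64)  # stand-in for the layer 1 weight gradient
bias_grad = np.ones(2, dtype=np.float64)    # stand-in for the layer 1 bias gradient

if rank == 0:
    # Rank 0 reduces the weight first, then the bias.
    comm.Allreduce(MPI.IN_PLACE, weight_grad, op=MPI.SUM)
    comm.Allreduce(MPI.IN_PLACE, bias_grad, op=MPI.SUM)
else:
    # Other ranks reduce in the opposite order, so the first Allreduce here
    # is matched against rank 0's weight Allreduce (different buffer sizes):
    # MPI either hangs waiting for matching calls or aborts with an error.
    comm.Allreduce(MPI.IN_PLACE, bias_grad, op=MPI.SUM)
    comm.Allreduce(MPI.IN_PLACE, weight_grad, op=MPI.SUM)
```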

BichengYing commented 4 years ago

Maybe we can try a thread pool?
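
For illustration only, a hypothetical sketch of that idea (not BlueFog's API; it assumes mpi4py and an MPI library initialized with MPI_THREAD_MULTIPLE): each tensor's allreduce is dispatched to a worker thread so the main loop does not block, but nothing here enforces a consistent cross-rank ordering of the collectives.

```python
from concurrent.futures import ThreadPoolExecutor
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
pool = ThreadPoolExecutor(max_workers=4)

def allreduce_async(buf):
    # Runs the collective on some worker thread. Which thread issues which
    # Allreduce first is nondeterministic, so the ordering hazard above
    # is still present unless extra coordination is added.
    return pool.submit(comm.Allreduce, MPI.IN_PLACE, buf, MPI.SUM)

grads = [np.ones(4), np.ones(2)]           # stand-in gradient buffers
futures = [allreduce_async(g) for g in grads]
for f in futures:
    f.result()                             # wait for all reductions to finish
```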

BichengYing commented 4 years ago

The thread pool approach is not feasible since the win_ops conflict with multi-threading.

The only remaining approach is the Horovod-style one: use rank 0 as the master to coordinate with the other ranks.

BichengYing commented 4 years ago

This can be solved by adding a negotiation stage.
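
For illustration, a minimal sketch of such a negotiation stage (my own assumption of the mechanism, using mpi4py; not BlueFog's actual implementation): rank 0 gathers which tensors each rank has ready, keeps only those ready everywhere, and broadcasts a single agreed order so every rank issues its allreduces identically.

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

def negotiate(ready_names):
    # Gather each rank's locally-ready tensor names at the coordinator (rank 0).
    all_ready = comm.gather(set(ready_names), root=0)
    if rank == 0:
        common = set.intersection(*all_ready)  # ready on every rank
        order = sorted(common)                 # any deterministic order works
    else:
        order = None
    # Every rank receives the same list in the same order.
    return comm.bcast(order, root=0)

def allreduce_in_agreed_order(tensors):
    # tensors: dict mapping name -> numpy buffer, possibly populated in a
    # different order on each rank (e.g. by parallel executors).
    for name in negotiate(list(tensors)):
        comm.Allreduce(MPI.IN_PLACE, tensors[name], op=MPI.SUM)
```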