Bluefog-Lib / bluefog

Distributed and decentralized training framework for PyTorch over graph
https://bluefog-lib.github.io/bluefog/
Apache License 2.0

Proposal for local GPU communication merging #20

Open BichengYing opened 4 years ago

BichengYing commented 4 years ago

The current win_ops logic is:

win_create -> gradient/iterate update -> win_put -> win_sync
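For reference, a minimal sketch of this per-iteration flow, using the op names from this issue (win_create, win_put, win_sync). The learning rate, iteration count, and compute_gradient helper are placeholders, and exact signatures may differ across Bluefog versions:

```python
import torch
import bluefog.torch as bf

bf.init()

lr, num_iters = 0.01, 100           # placeholder hyperparameters
x = torch.randn(1000)               # local iterate
bf.win_create(x, name="x")          # window created once, before training

for _ in range(num_iters):
    grad = compute_gradient(x)      # placeholder: local gradient of the objective
    x = x - lr * grad               # gradient/iterate update
    bf.win_put(x, name="x")         # push the new iterate to neighbors' windows
    x = bf.win_sync(name="x")       # combine with the copies received from neighbors
```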

The processing across all nodes/agents is almost fully decoupled and independent.

We want to further optimize our communication for multi-machine cases. Communication between multiple GPUs within the same physical machine should be faster than communication between different machines, and we can also use techniques such as NCCL and RDMA to accelerate it. I suggest modifying the process to:

Local machine leader:  win_create -> gradient/iterate update -> Local Allreduce -> win_put -> win_sync
Local machine worker1: nothing    -> gradient/iterate update -> Local Allreduce -> nothing
Local machine worker2: nothing    -> gradient/iterate update -> Local Allreduce -> nothing
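A sketch of how this split could look in user code, assuming bf.local_rank() identifies one leader per machine and local_allreduce is a hypothetical intra-machine averaging op (e.g. backed by NCCL); only leaders touch the windows, and the other placeholders are the same as in the sketch above:

```python
import torch
import bluefog.torch as bf

bf.init()
is_leader = (bf.local_rank() == 0)   # one machine leader per physical node

lr, num_iters = 0.01, 100            # placeholder hyperparameters
x = torch.randn(1000)
if is_leader:
    bf.win_create(x, name="x")       # only leaders expose a window across machines

for _ in range(num_iters):
    grad = compute_gradient(x)       # placeholder: local gradient of the objective
    x = x - lr * grad                # gradient/iterate update on every GPU
    x = local_allreduce(x)           # hypothetical: average over the GPUs in this machine
    if is_leader:
        bf.win_put(x, name="x")      # inter-machine communication by leaders only
        x = bf.win_sync(name="x")    # workers do nothing in this phase
```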

BichengYing commented 3 years ago

The neighbor_allreduce version of this is done; it is based on machine id.