The processing across all nodes/agents is almost fully decoupled and independent.
We want to further optimize our communication for the multi-machine case. Communication between multiple GPUs within the same physical machine is typically much faster than communication between different machines, and techniques such as NCCL and RDMA can accelerate it further. I suggest modifying the processes into:
Local machine leader:
win_create -> gradient/iterate update -> Local Allreduce -> win_put -> win_sync
Local machine worker1:
nothing ----> gradient/iterate update -> Local Allreduce ---- nothing
Local machine worker2:
nothing ----> gradient/iterate update -> Local Allreduce ---- nothing
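The hierarchical flow above can be sketched in pure Python. This is only a simulation of the data movement, not the real MPI/NCCL API: the machine/rank layout, the `windows` dict (standing in for the remote memory windows created by `win_create`), and the function names are hypothetical stand-ins.

```python
# Simulation of the proposed hierarchical flow: every rank participates in a
# Local Allreduce, but only the machine leader (rank 0 on each machine) does
# the cross-machine win_put.

def local_allreduce(values):
    """Average the gradients of all ranks on one machine (intra-machine step,
    which would use fast NCCL/shared-memory paths in a real implementation)."""
    avg = sum(values) / len(values)
    return [avg] * len(values)

def hierarchical_step(machines, windows):
    """machines: list of per-machine gradient lists (one float per local rank);
    windows: dict mapping machine index -> {peer machine: value put by its leader}."""
    # Step 1: Local Allreduce on every machine (leader and workers alike).
    for m, grads in enumerate(machines):
        machines[m] = local_allreduce(grads)
    # Step 2: only the leader performs the inter-machine win_put.
    for m, grads in enumerate(machines):
        leader_value = grads[0]
        for peer in windows:
            if peer != m:
                windows[peer][m] = leader_value  # one put per machine pair
    return machines, windows

# Two machines, two ranks each.
machines = [[1.0, 3.0], [5.0, 7.0]]
windows = {0: {}, 1: {}}
machines, windows = hierarchical_step(machines, windows)
# All ranks on machine 0 now hold 2.0, all ranks on machine 1 hold 6.0,
# and each machine's window contains the other leader's averaged value.
```

Note that the number of cross-machine transfers per step drops from one per rank pair to one per machine pair, which is the point of the proposal.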
The current win_ops logic is:
win_create -> gradient/iterate update -> win_put -> win_sync
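For contrast, the current flat flow can be simulated the same way. Again this is a sketch with hypothetical names, not the real API: here every rank does its own cross-machine `win_put`, so the number of inter-machine transfers grows with the total rank count rather than the machine count.

```python
# Simulation of the current flat flow: after its local gradient/iterate
# update, every rank win_puts its own value into every peer's window.

def flat_step(grads, windows):
    """grads: one gradient per rank; windows: dict mapping rank -> {peer: value}.
    Returns the windows after all ranks have done their win_put."""
    n = len(grads)
    for i in range(n):
        for j in range(n):
            if j != i:
                windows[j][i] = grads[i]  # every rank pair crosses the network
    return windows

grads = [1.0, 2.0, 3.0]
windows = {0: {}, 1: {}, 2: {}}
windows = flat_step(grads, windows)
# Each window now holds one entry per peer rank: 3 * 2 = 6 transfers total.
```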