The data parallel training algorithm implemented here uses all_reduce to obtain the global weight updates (by summing and averaging them). Because all_reduce makes the result available on every rank, not only on the root rank (as reduce would), there is no need to broadcast the global weights after each update iteration. We do, however, need to broadcast the initial weights to all workers once, to ensure an identical starting point for training.
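A minimal sketch of this pattern is shown below, assuming PyTorch's torch.distributed API (an assumption; the original implementation may use a different communication library such as mpi4py or Horovod). The function and parameter names here are illustrative: initial weights are broadcast once from the root rank, and gradients are averaged with all_reduce each step, so every rank ends up with identical weights without any further broadcast.

```python
# Hypothetical sketch using torch.distributed; not the original code.
import torch.distributed as dist


def broadcast_initial_weights(model, root=0):
    # One-time broadcast so every rank starts from identical weights.
    for param in model.parameters():
        dist.broadcast(param.data, src=root)


def all_reduce_gradients(model):
    # Sum gradients across all ranks, then divide by the world size to
    # average them. all_reduce leaves the result on every rank, so no
    # extra broadcast is needed after the optimizer step.
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size
```

In a training loop, broadcast_initial_weights would be called once after model construction, and all_reduce_gradients after each backward pass, just before the optimizer step.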