Perform a single allreduce operation over all parameters, which significantly reduces overhead and gives much better performance, especially with many data-parallel replicas. In basic tests I ran, performance matched PyTorch's DistributedDataParallel implementation.
The current implementation hits a nice sweet spot of simplicity and performance. There are further opportunities for speedups (smartly grouping parameters into buckets and overlapping allreduce with the backward pass), but exploiting them is much more involved and would probably require pulling in Horovod or a similar framework.
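The core idea can be sketched as follows. This is an illustrative toy, not the actual implementation: plain Python lists stand in for gradient tensors, and `allreduce_mean` stands in for the single collective allreduce; all function names here are made up for the example.

```python
def flatten(grads):
    """Concatenate per-parameter gradient lists into one flat buffer,
    remembering each parameter's size so we can split it back later."""
    flat = []
    shapes = [len(g) for g in grads]
    for g in grads:
        flat.extend(g)
    return flat, shapes

def unflatten(flat, shapes):
    """Split the flat buffer back into per-parameter gradients."""
    out, i = [], 0
    for n in shapes:
        out.append(flat[i:i + n])
        i += n
    return out

def allreduce_mean(buffers):
    """Stand-in for one collective allreduce that averages the flat
    gradient buffers across all data-parallel replicas."""
    n = len(buffers)
    return [sum(vals) / n for vals in zip(*buffers)]

# Two replicas, each holding gradients for two parameters.
replica_a = [[1.0, 2.0], [3.0]]
replica_b = [[3.0, 4.0], [5.0]]

flat_a, shapes = flatten(replica_a)
flat_b, _ = flatten(replica_b)

# One allreduce over the whole flat buffer instead of one per parameter.
averaged = allreduce_mean([flat_a, flat_b])
synced = unflatten(averaged, shapes)
print(synced)  # [[2.0, 3.0], [4.0]]
```

In a real PyTorch setup the same shape bookkeeping applies, but the buffers are tensors and the averaging step is a single `torch.distributed.all_reduce` call on the flattened gradient tensor, which amortizes the per-call launch and communication overhead across all parameters.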