entity-neural-network / incubator

Collection of in-progress libraries for entity neural networks.
Apache License 2.0

Batch allreduce ops #220

Closed cswinter closed 2 years ago

cswinter commented 2 years ago

Perform a single allreduce operation over all parameters instead of a separate one per parameter. This significantly reduces overhead and gives much better performance, especially with many data-parallel replicas. In the basic tests I ran, performance matched the torch DistributedDataParallel implementation.
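To illustrate why batching helps, here is a toy, single-process simulation. It is not the code from this PR: `FakeAllReduce`, `allreduce_per_parameter`, and `allreduce_batched` are hypothetical names, gradients are plain Python lists, and the "collective" simply averages across replicas while counting how often it is called. The sketch only shows that flattening everything into one buffer gives the same result with a single call.

```python
from typing import List

# One flat list of gradient values per parameter tensor (illustrative stand-in for tensors).
Grads = List[List[float]]


class FakeAllReduce:
    """Hypothetical stand-in for a collective: averages across replicas and counts calls."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, buffers: List[List[float]]) -> List[float]:
        # buffers[r] is the buffer contributed by replica r; all have the same length.
        self.calls += 1
        n = len(buffers)
        return [sum(vals) / n for vals in zip(*buffers)]


def allreduce_per_parameter(replica_grads: List[Grads], allreduce: FakeAllReduce) -> Grads:
    """Unbatched baseline: one collective call per parameter tensor."""
    num_params = len(replica_grads[0])
    return [allreduce([g[p] for g in replica_grads]) for p in range(num_params)]


def allreduce_batched(replica_grads: List[Grads], allreduce: FakeAllReduce) -> Grads:
    """Batched variant: flatten all gradients, reduce once, split back by size."""
    sizes = [len(t) for t in replica_grads[0]]
    flat = [[x for t in g for x in t] for g in replica_grads]
    reduced = allreduce(flat)
    out, offset = [], 0
    for size in sizes:
        out.append(reduced[offset:offset + size])
        offset += size
    return out


# Two replicas, three parameter tensors of different sizes.
replicas = [
    [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]],
    [[3.0, 4.0], [5.0], [6.0, 7.0, 8.0]],
]
per_param, batched = FakeAllReduce(), FakeAllReduce()
a = allreduce_per_parameter(replicas, per_param)
b = allreduce_batched(replicas, batched)
print(a == b, per_param.calls, batched.calls)  # True 3 1
```

In a real setup each call carries fixed launch and synchronization cost, so going from one call per parameter to one call in total is presumably where the savings come from.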

The current implementation hits a nice sweet spot of simplicity and performance. There are more opportunities for speedups (smartly grouping parameters and running allreduce in parallel with the backward pass), but exploiting these is much more involved and would probably require pulling in Horovod or a similar framework.
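For context on what "smartly grouping parameters" could look like, below is a minimal sketch of greedy bucketing by element count. It is not part of this PR; `group_into_buckets` and the `cap` parameter are hypothetical, and a real scheme would also need to decide bucket order and launch each bucket's allreduce as its gradients become ready.

```python
from typing import List


def group_into_buckets(sizes: List[int], cap: int) -> List[List[int]]:
    """Greedily pack consecutive parameter indices into buckets of at most `cap` elements.

    A parameter larger than `cap` ends up alone in its own bucket.
    """
    buckets: List[List[int]] = []
    current: List[int] = []
    used = 0
    for index, size in enumerate(sizes):
        # Close the current bucket if adding this parameter would exceed the cap.
        if current and used + size > cap:
            buckets.append(current)
            current, used = [], 0
        current.append(index)
        used += size
    if current:
        buckets.append(current)
    return buckets


buckets = group_into_buckets([4, 4, 10, 2, 3], cap=8)
print(buckets)  # [[0, 1], [2], [3, 4]]
```

Each bucket would then be reduced with its own allreduce call, trading a few extra calls for the chance to overlap communication with the rest of the backward pass.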