As far as I know, NCCL is meant to be used within a single node. An MPI-based allreduce lets you span multiple nodes, using InfiniBand links between nodes and PCIe links within each node.
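For concreteness, here is a minimal sketch of what that multi-node setup might look like, assuming one MPI rank per GPU and a CUDA-aware MPI build (so device pointers can be passed straight to MPI calls). The buffer name and size are made up for illustration:

```c
/* Hypothetical sketch: one MPI rank per GPU, CUDA-aware MPI assumed. */
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int count = 1 << 20;               /* 1M floats = 4 MB */
    float *d_grad;
    cudaMalloc((void **)&d_grad, count * sizeof(float));
    /* ... fill d_grad with this rank's local gradients ... */

    /* With a CUDA-aware MPI, the library moves data over InfiniBand
     * between nodes and PCIe within a node; without CUDA awareness,
     * you would stage through host memory instead. */
    MPI_Allreduce(MPI_IN_PLACE, d_grad, count, MPI_FLOAT, MPI_SUM,
                  MPI_COMM_WORLD);

    cudaFree(d_grad);
    MPI_Finalize();
    return 0;
}
```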
I guess the debate is whether the GPU-aware part of the collective should be built on top of MPI or underneath it. In my experience, building on top of MPI (using MPI_Send/MPI_Recv) means the achievable bandwidth for small collectives is fairly low, because every bulk transfer pays a high startup latency.
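To illustrate that latency point, here is a rough sketch of a ring allreduce built on plain MPI point-to-point calls (host buffers, float sum, count assumed divisible by the rank count; not the actual implementation under discussion). Each of the 2*(P-1) steps pays a full message-startup latency, which dominates for small buffers:

```c
/* Sketch of a ring allreduce on top of MPI point-to-point calls. */
#include <mpi.h>
#include <stdlib.h>

void ring_allreduce_sum(float *buf, int count, MPI_Comm comm)
{
    int rank, nranks;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nranks);

    int chunk = count / nranks;          /* assume count % nranks == 0 */
    float *tmp = malloc(chunk * sizeof(float));
    int next = (rank + 1) % nranks;
    int prev = (rank - 1 + nranks) % nranks;

    /* Reduce-scatter: after nranks-1 steps, each rank owns one fully
     * reduced chunk. MPI_Sendrecv pairs the exchanges to avoid deadlock. */
    for (int step = 0; step < nranks - 1; ++step) {
        int send_idx = (rank - step + nranks) % nranks;
        int recv_idx = (rank - step - 1 + nranks) % nranks;
        /* Every step pays the full message-startup latency. */
        MPI_Sendrecv(buf + send_idx * chunk, chunk, MPI_FLOAT, next, 0,
                     tmp, chunk, MPI_FLOAT, prev, 0,
                     comm, MPI_STATUS_IGNORE);
        for (int i = 0; i < chunk; ++i)
            buf[recv_idx * chunk + i] += tmp[i];
    }

    /* Allgather: circulate the reduced chunks for nranks-1 more
     * latency-bound steps until every rank holds the full result. */
    for (int step = 0; step < nranks - 1; ++step) {
        int send_idx = (rank - step + 1 + nranks) % nranks;
        int recv_idx = (rank - step + nranks) % nranks;
        MPI_Sendrecv(buf + send_idx * chunk, chunk, MPI_FLOAT, next, 0,
                     buf + recv_idx * chunk, chunk, MPI_FLOAT, prev, 0,
                     comm, MPI_STATUS_IGNORE);
    }
    free(tmp);
}
```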
Does this match your experience?
Yes, this is not the way to build collectives for small data. It is only used for large matrices - at least 1024 x 1024 float32 (about 4 MB per buffer).
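A back-of-the-envelope alpha-beta cost model shows why a threshold around 1024 x 1024 floats is plausible; the latency and bandwidth numbers below are assumed illustrative values, not measurements from this implementation:

```c
/* Alpha-beta model of ring allreduce cost; alpha and beta are made-up
 * but plausible illustrative values, not measurements. */
#include <stdio.h>

int main(void)
{
    const double alpha = 20e-6;      /* assumed per-message latency, 20 us */
    const double beta  = 1.0 / 6e9;  /* assumed ~6 GB/s link bandwidth */
    const int    P     = 8;          /* number of ranks in the ring */

    /* Ring allreduce: 2*(P-1) steps, each moving n/P bytes. */
    double sizes[] = { 4e3, 4e6 };   /* 1K floats vs. a 1024x1024 matrix */
    for (int i = 0; i < 2; ++i) {
        double n = sizes[i];
        double t = 2.0 * (P - 1) * (alpha + (n / P) * beta);
        printf("n = %8.0f bytes: %.1f%% of the time is latency\n",
               n, 100.0 * 2 * (P - 1) * alpha / t);
    }
    return 0;
}
```

Under these assumptions, a 4 KB buffer spends over 99% of its time in per-message latency, while a 4 MB matrix spends under 20% there, so the large-matrix regime is the one where this approach pays off.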
What's the benefit of using this implementation as opposed to using NCCL?