baidu-research / baidu-allreduce

Apache License 2.0

Why not NCCL? #1

Closed cliffwoolley closed 7 years ago

cliffwoolley commented 7 years ago

What's the benefit of using this implementation as opposed to using NCCL?

cliffwoolley commented 7 years ago

https://github.com/NVIDIA/nccl http://images.nvidia.com/events/sc15/pdfs/NCCL-Woolley.pdf

gibiansky commented 7 years ago

As far as I know, NCCL is meant to be used on a single node. Using an MPI-based allreduce lets you span multiple nodes, using InfiniBand between nodes and PCIe within each node.
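Roughly, the idea is a ring allreduce built directly on MPI point-to-point calls. Here is a minimal sketch of that pattern (not the library's exact API; it assumes the element count divides evenly by the number of ranks and uses plain host buffers, whereas a CUDA-aware MPI could be driven with device pointers):

```cpp
// Sketch of a ring allreduce over MPI point-to-point calls.
// Assumptions: count % size == 0, host memory, float payload.
#include <mpi.h>
#include <vector>
#include <cstdio>

void ring_allreduce(float* data, int count, MPI_Comm comm) {
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    const int chunk = count / size;            // assumes count % size == 0
    const int next  = (rank + 1) % size;
    const int prev  = (rank - 1 + size) % size;
    std::vector<float> recv_buf(chunk);

    // Reduce-scatter: after size-1 steps, rank r owns the fully
    // reduced chunk (r + 1) % size.
    for (int step = 0; step < size - 1; ++step) {
        int send_chunk = (rank - step + size) % size;
        int recv_chunk = (rank - step - 1 + 2 * size) % size;
        MPI_Sendrecv(data + send_chunk * chunk, chunk, MPI_FLOAT, next, 0,
                     recv_buf.data(), chunk, MPI_FLOAT, prev, 0,
                     comm, MPI_STATUS_IGNORE);
        for (int i = 0; i < chunk; ++i)
            data[recv_chunk * chunk + i] += recv_buf[i];
    }

    // Allgather: circulate the reduced chunks so every rank ends up
    // with the complete reduced buffer.
    for (int step = 0; step < size - 1; ++step) {
        int send_chunk = (rank + 1 - step + 2 * size) % size;
        int recv_chunk = (rank - step + size) % size;
        MPI_Sendrecv(data + send_chunk * chunk, chunk, MPI_FLOAT, next, 0,
                     data + recv_chunk * chunk, chunk, MPI_FLOAT, prev, 0,
                     comm, MPI_STATUS_IGNORE);
    }
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    std::vector<float> data(8, float(rank + 1));   // toy payload
    ring_allreduce(data.data(), (int)data.size(), MPI_COMM_WORLD);

    if (rank == 0)
        std::printf("data[0] after allreduce = %f\n", data[0]);
    MPI_Finalize();
    return 0;
}
```

The point of the ring layout is that every link in the ring is busy on every step, so the inter-node fabric (InfiniBand) and the intra-node fabric (PCIe) are both saturated regardless of how many nodes are involved.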

cliffwoolley commented 7 years ago

I guess the debate is whether the GPU-aware part of the collective should be built on top of or underneath MPI. In my experience, building on top of MPI (using MPI_Send/MPI_Recv) means the achievable bandwidth for small collectives is fairly low, because of the latency incurred at the start of each bulk transfer.

Does this match your experience?
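A rough alpha-beta estimate makes the concern concrete. The latency `alpha`, bandwidth `B`, rank count `N`, and message sizes below are assumed example values, not measurements; the step structure (2(N-1) transfers of S/N bytes each) is that of a ring allreduce:

```cpp
// Back-of-the-envelope alpha-beta model for a ring allreduce, to show
// why small messages are latency-bound. alpha and B are assumed values.
#include <cstdio>

int main() {
    const double alpha = 5e-6;    // assumed per-transfer startup latency: 5 us
    const double B     = 10e9;    // assumed link bandwidth: 10 GB/s
    const int    N     = 8;       // number of ranks in the ring

    const double sizes[] = {4e3, 4e5, 4e6, 4e8};   // total message size S, bytes
    for (double S : sizes) {
        // Ring allreduce: 2*(N-1) steps, each moving S/N bytes.
        double t  = 2.0 * (N - 1) * (alpha + (S / N) / B);
        double bw = S / t;        // effective (algorithm) bandwidth
        std::printf("S = %8.0f B -> t = %9.1f us, effective BW = %6.2f GB/s\n",
                    S, t * 1e6, bw / 1e9);
    }
    return 0;
}
```

With these assumed numbers, a 4 KB collective never gets past a few tens of MB/s because the startup term dominates every step, while a multi-hundred-MB collective approaches the link bandwidth.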

shubho commented 7 years ago

Yes, this is not the way to build collectives for small data. It is only used for large matrices, at least 1024 x 1024 float32.
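For scale: a 1024 x 1024 float32 tensor is 4 MiB (1024 * 1024 * 4 bytes), so even split across 8 ranks each ring step still moves a 512 KiB chunk, which keeps the transfers in the bandwidth-dominated regime of the estimate above rather than the latency-dominated one.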