qdbkppkbdq opened this issue 5 months ago
DCCL started as a project where data lives in CPU memory with RDMA acceleration. It is now beginning to support GPU memory (through GPUDirect RDMA) for AllReduce, ReduceScatter, AllGather, and Send/Recv. Many optimizations remain to be done, including blocking/chunking, NVLink support, and IB SHARP support. So far, its performance for those operations on GPU cannot beat the highly optimized NCCL yet.
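For reference, this is the NCCL baseline being compared against. A minimal single-process, all-visible-GPUs AllReduce using NCCL's documented API looks like the sketch below (the buffer size and data type are arbitrary choices here, and error handling is elided for brevity):

```cpp
// Minimal single-process NCCL AllReduce across all visible GPUs.
#include <cuda_runtime.h>
#include <nccl.h>
#include <vector>
#include <cstdio>

int main() {
  int ndev = 0;
  cudaGetDeviceCount(&ndev);

  std::vector<ncclComm_t> comms(ndev);
  ncclCommInitAll(comms.data(), ndev, nullptr);  // one rank per local GPU

  const size_t count = 1 << 20;  // 1M floats per GPU (arbitrary)
  std::vector<float*> sendbuf(ndev), recvbuf(ndev);
  std::vector<cudaStream_t> streams(ndev);
  for (int i = 0; i < ndev; ++i) {
    cudaSetDevice(i);
    cudaMalloc(&sendbuf[i], count * sizeof(float));
    cudaMalloc(&recvbuf[i], count * sizeof(float));
    cudaStreamCreate(&streams[i]);
  }

  // Group the per-GPU calls so NCCL launches them as one collective.
  ncclGroupStart();
  for (int i = 0; i < ndev; ++i)
    ncclAllReduce(sendbuf[i], recvbuf[i], count, ncclFloat, ncclSum,
                  comms[i], streams[i]);
  ncclGroupEnd();

  for (int i = 0; i < ndev; ++i) {
    cudaSetDevice(i);
    cudaStreamSynchronize(streams[i]);
  }
  for (int i = 0; i < ndev; ++i) {
    cudaFree(sendbuf[i]);
    cudaFree(recvbuf[i]);
    ncclCommDestroy(comms[i]);
  }
  printf("AllReduce done on %d GPUs\n", ndev);
  return 0;
}
```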
Where DCCL has the upper hand: its broadcast is highly optimized with an RDMA-based protocol called RDMC [1]. We haven't extended this to GPU support yet. DCCL also supports a hybrid setup where data can live in host memory on some nodes and in GPU memory on others. For data in host memory, DCCL's AllReduce performance beats OpenMPI's.
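The hybrid host/GPU setup is the unusual part. As an illustration of what it requires (this is a hypothetical helper of mine, not DCCL's actual API), the local reduction step below uses the standard CUDA call `cudaPointerGetAttributes` to dispatch on whether a buffer lives in host or device memory:

```cpp
// Sketch of a hybrid host/GPU buffer: the same local reduction step
// works whether the caller hands us host or device memory.
// NOT DCCL's API -- just an illustration of the dispatch such a setup needs.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void add_kernel(float* dst, const float* src, size_t n) {
  size_t i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) dst[i] += src[i];
}

// Accumulate src into dst; both pointers must live in the same memory space.
void local_reduce(float* dst, const float* src, size_t n) {
  cudaPointerAttributes attr{};
  cudaError_t err = cudaPointerGetAttributes(&attr, dst);
  bool on_device = (err == cudaSuccess && attr.type == cudaMemoryTypeDevice);

  if (on_device) {
    int threads = 256;
    int blocks = (int)((n + threads - 1) / threads);
    add_kernel<<<blocks, threads>>>(dst, src, n);
    cudaDeviceSynchronize();
  } else {
    cudaGetLastError();  // clear sticky error from probing a raw host pointer (pre-CUDA 11)
    for (size_t i = 0; i < n; ++i) dst[i] += src[i];
  }
}

int main() {
  const size_t n = 8;
  float a[n], b[n];
  for (size_t i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }
  local_reduce(a, b, n);  // host path; a device pointer would take the kernel path
  printf("a[0] after host reduce: %f\n", a[0]);  // prints 3.0
  return 0;
}
```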
How does DCCL compare to NCCL in terms of performance? In which scenarios does DCCL have advantages or disadvantages over NCCL? Are there any benchmark results or use cases available? Additionally, does DCCL support the NVLink interconnect? I truly appreciate any guidance you can offer on this topic.
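On the NVLink question: per the status note above, NVLink support is still on DCCL's to-do list, while NCCL uses NVLink automatically where it is present. Independent of either library, a standard CUDA query can show which GPU pairs in a node are peer-to-peer reachable (over NVLink or PCIe); `nvidia-smi topo -m` then shows which of those links are actually NVLink:

```cpp
// Probe peer-to-peer reachability between all GPU pairs in a node.
// cudaDeviceCanAccessPeer reports P2P capability (NVLink or PCIe P2P).
#include <cuda_runtime.h>
#include <cstdio>

int main() {
  int ndev = 0;
  cudaGetDeviceCount(&ndev);
  for (int i = 0; i < ndev; ++i) {
    for (int j = 0; j < ndev; ++j) {
      if (i == j) continue;
      int can = 0;
      cudaDeviceCanAccessPeer(&can, i, j);
      printf("GPU %d -> GPU %d: P2P %s\n", i, j, can ? "yes" : "no");
    }
  }
  return 0;
}
```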