qdbkppkbdq opened this issue 5 months ago
DCCL started as a project where data lives in CPU memory with RDMA acceleration. It is now beginning to support GPU memory (through GPUDirect RDMA) for AllReduce, ReduceScatter, AllGather, and Send/Recv. Many optimizations remain to be done, including blocking/chunking, NVLink support, and IB SHARP support. So far, its performance for those operations on GPU cannot beat the highly optimized NCCL yet.
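For reference, this is the NCCL baseline being compared against. A minimal single-process, all-visible-GPUs AllReduce using NCCL's documented API looks like the sketch below (the buffer size and data type are arbitrary choices here, and error handling is elided for brevity):

```cpp
// Minimal single-process NCCL AllReduce across all visible GPUs.
#include <cuda_runtime.h>
#include <nccl.h>
#include <vector>
#include <cstdio>

int main() {
  int ndev = 0;
  cudaGetDeviceCount(&ndev);

  std::vector<ncclComm_t> comms(ndev);
  ncclCommInitAll(comms.data(), ndev, nullptr);  // one rank per local GPU

  const size_t count = 1 << 20;  // 1M floats per GPU (arbitrary)
  std::vector<float*> sendbuf(ndev), recvbuf(ndev);
  std::vector<cudaStream_t> streams(ndev);
  for (int i = 0; i < ndev; ++i) {
    cudaSetDevice(i);
    cudaMalloc(&sendbuf[i], count * sizeof(float));
    cudaMalloc(&recvbuf[i], count * sizeof(float));
    cudaStreamCreate(&streams[i]);
  }

  // Group the per-GPU calls so NCCL launches them as one collective.
  ncclGroupStart();
  for (int i = 0; i < ndev; ++i)
    ncclAllReduce(sendbuf[i], recvbuf[i], count, ncclFloat, ncclSum,
                  comms[i], streams[i]);
  ncclGroupEnd();

  for (int i = 0; i < ndev; ++i) {
    cudaSetDevice(i);
    cudaStreamSynchronize(streams[i]);
  }
  for (int i = 0; i < ndev; ++i) {
    cudaFree(sendbuf[i]);
    cudaFree(recvbuf[i]);
    ncclCommDestroy(comms[i]);
  }
  printf("AllReduce done on %d GPUs\n", ndev);
  return 0;
}
```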
Where DCCL has the upper hand: its broadcast is highly optimized with an RDMA-based protocol called RDMC [1]. We haven't extended this to GPU support yet. DCCL also supports a hybrid setup where data can live in host memory on some nodes and in GPU memory on others. For data in host memory, DCCL's AllReduce performance beats OpenMPI's.
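The hybrid host/GPU setup is the unusual part. As an illustration of what it requires (this is a hypothetical helper of mine, not DCCL's actual API), the local reduction step below uses the standard CUDA call `cudaPointerGetAttributes` to dispatch on whether a buffer lives in host or device memory:

```cpp
// Sketch of a hybrid host/GPU buffer: the same local reduction step
// works whether the caller hands us host or device memory.
// NOT DCCL's API -- just an illustration of the dispatch such a setup needs.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void add_kernel(float* dst, const float* src, size_t n) {
  size_t i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) dst[i] += src[i];
}

// Accumulate src into dst; both pointers must live in the same memory space.
void local_reduce(float* dst, const float* src, size_t n) {
  cudaPointerAttributes attr{};
  cudaError_t err = cudaPointerGetAttributes(&attr, dst);
  bool on_device = (err == cudaSuccess && attr.type == cudaMemoryTypeDevice);

  if (on_device) {
    int threads = 256;
    int blocks = (int)((n + threads - 1) / threads);
    add_kernel<<<blocks, threads>>>(dst, src, n);
    cudaDeviceSynchronize();
  } else {
    cudaGetLastError();  // clear sticky error from probing a raw host pointer (pre-CUDA 11)
    for (size_t i = 0; i < n; ++i) dst[i] += src[i];
  }
}

int main() {
  const size_t n = 8;
  float a[n], b[n];
  for (size_t i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }
  local_reduce(a, b, n);  // host path; a device pointer would take the kernel path
  printf("a[0] after host reduce: %f\n", a[0]);  // prints 3.0
  return 0;
}
```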
How does DCCL compare to NCCL in terms of performance? In which scenarios does DCCL have advantages or disadvantages over NCCL? Are there any benchmark results or use cases available? Additionally, does DCCL support the NVLink interconnect? I truly appreciate any guidance you can offer on this topic.
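On the NVLink question: per the status note above, NVLink support is still on DCCL's to-do list, while NCCL uses NVLink automatically where it is present. Independent of either library, a standard CUDA query can show which GPU pairs in a node are peer-to-peer reachable (over NVLink or PCIe); `nvidia-smi topo -m` then shows which of those links are actually NVLink:

```cpp
// Probe peer-to-peer reachability between all GPU pairs in a node.
// cudaDeviceCanAccessPeer reports P2P capability (NVLink or PCIe P2P).
#include <cuda_runtime.h>
#include <cstdio>

int main() {
  int ndev = 0;
  cudaGetDeviceCount(&ndev);
  for (int i = 0; i < ndev; ++i) {
    for (int j = 0; j < ndev; ++j) {
      if (i == j) continue;
      int can = 0;
      cudaDeviceCanAccessPeer(&can, i, j);
      printf("GPU %d -> GPU %d: P2P %s\n", i, j, can ? "yes" : "no");
    }
  }
  return 0;
}
```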