Potential integration with NCCL and GDR

tddg commented 1 month ago

Thank you for the effort creating and maintaining the Derecho codebase.

RDMC could be very useful for GPU-based data replication as well. However, RDMC in its current form does not support GPUDirect RDMA based replication.

Question: Is it feasible, engineering-wise, to integrate NCCL GDR into the RDMC / Derecho scheduling and cluster communication infrastructure? Any comments or suggestions would be greatly appreciated!

songweijia commented 1 month ago

@tddg Thank you for your interest! I want to mention our Derecho Collectives Communication Library (DCCL) project, which might fit your requirements better. You are right that "RDMC in its current form does not support GPUDirect RDMA-based replication." We are working on full GPU support in DCCL. Right now, we only support GPUDirect RDMA in Derecho's peer-to-peer OOB data transfer (the DCCL AllReduce/ReduceScatter/AllGather are all built on that). We do plan to upgrade the core of our data plane to support GPUDirect so that replication in GPU could work, not only for RDMC but also for SST Multicast. We might address the latter first. Please check DCCL out. I'm looking forward to talking with you in detail.
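To illustrate how collectives such as AllReduce can be layered on nothing more than peer-to-peer transfers, here is a minimal single-process sketch of the textbook ring algorithm (a reduce-scatter phase followed by an all-gather phase). This is not DCCL code and may not match how DCCL schedules its transfers. The names `p2p_send` and `ring_allreduce` are hypothetical, and the in-memory "mailboxes" only stand in for a real one-sided transfer between nodes.

```python
from typing import List, Optional, Tuple

Message = Tuple[int, List[int]]  # (chunk index, chunk payload)


def p2p_send(mailboxes: List[Optional[Message]], dst: int, msg: Message) -> None:
    """Hypothetical stand-in for a peer-to-peer transfer to rank `dst`."""
    mailboxes[dst] = msg


def ring_allreduce(buffers: List[List[int]]) -> List[List[int]]:
    """Sum-AllReduce over simulated ranks, using only neighbor-to-neighbor sends."""
    n = len(buffers)
    size = len(buffers[0])
    assert size % n == 0, "this sketch assumes the buffer splits evenly into n chunks"
    chunk = size // n
    data = [list(b) for b in buffers]  # work on copies, one buffer per rank

    def span(c: int) -> slice:
        return slice(c * chunk, (c + 1) * chunk)

    # Phase 1 (reduce-scatter): after n-1 steps, rank r holds the fully
    # reduced chunk (r + 1) % n.
    for step in range(n - 1):
        mailboxes: List[Optional[Message]] = [None] * n
        for r in range(n):
            c = (r - step) % n
            p2p_send(mailboxes, (r + 1) % n, (c, data[r][span(c)]))
        for r in range(n):
            c, payload = mailboxes[r]
            s = span(c)
            data[r][s] = [x + y for x, y in zip(data[r][s], payload)]

    # Phase 2 (all-gather): circulate the reduced chunks around the ring.
    for step in range(n - 1):
        mailboxes = [None] * n
        for r in range(n):
            c = (r + 1 - step) % n
            p2p_send(mailboxes, (r + 1) % n, (c, data[r][span(c)]))
        for r in range(n):
            c, payload = mailboxes[r]
            data[r][span(c)] = payload

    return data
```

In a real deployment each rank would run on its own node and every `p2p_send` would be an actual network transfer; the sketch only shows the communication pattern, where each rank talks solely to its ring neighbor.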

Btw, can I ask about your use case of data replication in GPUs?

KenBirman commented 1 month ago

@songweijia has been looking into this for nine months now and has an unreleased code base covering quite a lot of what you are asking about; it will be part of our fall release of the system. He should comment on the technical details, but in a nutshell he does support GPUDirect.

Right now he has experimented with his own version of the CCL library, called DCCL. It is quite fast, and can beat Open MPI for many host compute parallel tasks… but this is not the same as a deep integration of NCCL, because NVIDIA is able to leverage proprietary hardware functionality that DCCL has no obvious way to access (but perhaps there are APIs we have not encountered…).

Also, we have not yet looked at running all of Derecho (and RDMC) on GPU memory objects. There is no obvious reason this would be a problem.

So, is it feasible to have a Derecho CCL? Without question, and in fact it exists in alpha form, and you will be able to access it later this year. Can it be faster than OMPI or NCCL? Yes: DCCL is significantly faster than OMPI for many host computing tasks like AllReduce on buffers in host memory, with no GPU involvement. But the further we push towards trying to outperform NCCL, the more we run into this issue of NCCL seemingly leveraging undocumented functionality that may be quite specific to Mellanox hardware. Going down that path might require collaboration with Mellanox engineers.

Also, such work can involve costs and we have limited funding. Even in an open source university-centered effort, people cost money… hardware costs money…

tddg commented 1 month ago

Thank you very much for the prompt and detailed response! @songweijia @KenBirman

Our use case is around scaling out model serving and model inference.

While investigating NCCL, we ran into similar confusion: 1) the NCCL implementation is not well documented, and 2) NCCL seemingly leverages undocumented functionality.

Derecho-Project / derecho

Potential integration with NCCL and GDR #281