Closed RafalSiwek closed 4 months ago
Hi @RafalSiwek,
Sorry, but even though RCCL is a port of NCCL, there are many under-the-hood changes that prevent heterogeneous usage of RCCL and NCCL across a cluster.
1) No. This would not be supported.
2) UCX might work if you have different nodes with different hardware, but I'm not sure if anyone has tested this. You'd likely get a better response from the UCX team about this use case.
3) Not applicable, due to lack of support.
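Since mixing RCCL and NCCL in one job is not supported, one practical safeguard is a preflight check that refuses to launch a collective job unless every rank reports the same communication library. The following is only a minimal sketch: `RankInfo` and `check_homogeneous` are hypothetical names for illustration and are not part of RCCL, NCCL, or UCX. It assumes each rank's library name and version have already been collected out of band, for example by a launcher script.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RankInfo:
    """Hypothetical per-rank record, gathered out of band before the job starts."""
    rank: int
    library: str   # e.g. "nccl" or "rccl"
    version: str   # e.g. "2.20.5"


def check_homogeneous(ranks):
    """Return (ok, message); ok is True only if all ranks share one library and version."""
    # Collapse the per-rank reports into the set of distinct (library, version) stacks.
    stacks = {(r.library.lower(), r.version) for r in ranks}
    if len(stacks) <= 1:
        return True, "all ranks report the same collective library"
    found = ", ".join(f"{lib} {ver}" for lib, ver in sorted(stacks))
    return False, f"mixed collective libraries are not supported: {found}"


if __name__ == "__main__":
    # Matching version numbers are not enough: the library itself must match.
    cluster = [
        RankInfo(0, "nccl", "2.20.5"),
        RankInfo(1, "rccl", "2.20.5"),
    ]
    ok, message = check_homogeneous(cluster)
    print(ok, message)
```

The check compares the library name as well as the version, because identical version numbers on the two libraries do not make them interoperable.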
Description
Hello,
I am currently working on configuring an HPC MLOps cluster that integrates both AMD and NVIDIA GPUs. I came across information suggesting that RCCL can be compatible with certain NCCL versions, as mentioned in the RCCL changelog. I am trying to understand whether it is possible to run a collective communication job that leverages both RCCL and NCCL.
Setup Details
The expected and received byte values vary with different combinations of NCCL and RCCL versions (I tried the unreleased RCCL 2.20.5 with NCCL 2.20.5).
Questions
Thank you very much for your time and assistance.