ROCm / rccl

ROCm Communication Collectives Library (RCCL)
https://rocmdocs.amd.com/projects/rccl/en/latest/

Compatibility and Setup of RCCL with NCCL for Mixed GPU HPC MLOps Cluster #1220

Closed RafalSiwek closed 4 months ago

RafalSiwek commented 5 months ago

Description

Hello,

I am configuring an HPC MLOps cluster that integrates both AMD and NVIDIA GPUs. The RCCL changelog mentions compatibility with certain NCCL versions, so I would like to understand whether a single collective communication job can use both RCCL and NCCL.

Setup Details

Questions

  1. Can a collective communication job be run leveraging both RCCL and NCCL?
  2. Could you give me some insight into whether this approach is heading in a good direction, or whether leveraging UCX and UCC would be more effective for such a heterogeneous setup?
  3. Any guidance or resources on configuring and troubleshooting this setup would be immensely helpful.

Thank you very much for your time and assistance.

gilbertlee-amd commented 4 months ago

Hi @RafalSiwek,

Sorry - even though RCCL is a port of NCCL, there are a lot of under-the-hood changes that prevent heterogeneous usage of RCCL and NCCL across a cluster.

1) No. This would not be supported.
2) UCX might work if you have different nodes with different hardware, but I'm not sure if anyone has tested this. You'd likely get a better response from the UCX team about this use case.
3) Not applicable, due to lack of support.