ROCm / rccl

ROCm Communication Collectives Library (RCCL)
https://rocmdocs.amd.com/projects/rccl/en/latest/

[Issue]: RCCL collective call Alltoall is performing way worse than normal MPI Alltoall on Frontier. #1206

Open manver-iitk opened 5 months ago

manver-iitk commented 5 months ago

Problem Description

I ran my code on Frontier to test scaling on AMD GPUs. It scaled fine with MPI, but as soon as I replaced the MPI_Alltoall call with the RCCL alltoall, performance got much worse than with MPI. Why?
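
For reference, here is a minimal sketch of the kind of swap being described (not the reporter's actual code): one GPU/GCD per MPI rank, the RCCL communicator bootstrapped over MPI, RCCL's ncclAllToAll extension for the device-side exchange, and a GPU-aware MPI assumed for the baseline call. Buffer sizes, the device mapping, and the absence of error checking are illustrative only.

```c
#include <mpi.h>
#include <hip/hip_runtime.h>
#include <rccl/rccl.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    const int count = 1 << 20;               /* elements exchanged with each peer (illustrative) */
    hipSetDevice(rank % 8);                  /* hypothetical mapping: 8 GCDs per Frontier node */

    float *d_send, *d_recv;
    hipMalloc(&d_send, (size_t)count * nranks * sizeof(float));
    hipMalloc(&d_recv, (size_t)count * nranks * sizeof(float));

    /* MPI baseline: passing device buffers assumes a GPU-aware MPI. */
    MPI_Alltoall(d_send, count, MPI_FLOAT, d_recv, count, MPI_FLOAT, MPI_COMM_WORLD);

    /* RCCL path: bootstrap the communicator over MPI, then call the AllToAll extension. */
    ncclUniqueId id;
    if (rank == 0) ncclGetUniqueId(&id);
    MPI_Bcast(&id, sizeof(id), MPI_BYTE, 0, MPI_COMM_WORLD);
    ncclComm_t comm;
    ncclCommInitRank(&comm, nranks, id, rank);

    hipStream_t stream;
    hipStreamCreate(&stream);
    ncclAllToAll(d_send, d_recv, count, ncclFloat, comm, stream);
    hipStreamSynchronize(stream);            /* the collective only enqueues work on the stream */

    ncclCommDestroy(comm);
    hipStreamDestroy(stream);
    hipFree(d_send);
    hipFree(d_recv);
    MPI_Finalize();
    return 0;
}
```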

Operating System

SLES (Frontier)

CPU

AMD EPYC 7763 64-Core Processor

GPU

AMD Instinct MI250X

ROCm Version

ROCm 5.7.1

ROCm Component

rccl

Steps to Reproduce

No response

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

No response

edgargabriel commented 4 months ago

@manver-iitk a couple of questions:

corey-derochie-amd commented 2 months ago

Hello, @manver-iitk. Has this issue been resolved for you?

manver-iitk commented 1 month ago

Hello, @corey-derochie-amd, my issue still persists.

@edgargabriel I have also installed the aws-ofi-rccl plugin for inter-node communication, but the timings are still roughly 2x to 3x those of plain MPI. I'm running on 4 to 8 nodes.
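
One measurement detail that can skew this kind of comparison: the RCCL collective only enqueues work on a stream, so the stream has to be synchronized before the timer stops. A hypothetical timing fragment, reusing the names from the sketch in the problem description above:

```c
/* Timing fragment (requires <stdio.h> in addition to the earlier includes):
   synchronize the stream before stopping the clock, and barrier so one slow
   rank does not skew the other ranks' measurements. */
MPI_Barrier(MPI_COMM_WORLD);
double t0 = MPI_Wtime();
ncclAllToAll(d_send, d_recv, count, ncclFloat, comm, stream);
hipStreamSynchronize(stream);
MPI_Barrier(MPI_COMM_WORLD);
double t1 = MPI_Wtime();
if (rank == 0) printf("RCCL alltoall: %.3f ms\n", (t1 - t0) * 1e3);
```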

thananon commented 3 weeks ago

Hi, for alltoall RCCL uses a fan-out algorithm, which is very crude (everyone sends to and receives from everyone), whereas MPI implements it in a more algorithmic way. This is an area where we acknowledge NCCL/RCCL is lacking. Unfortunately, optimizing alltoall for multi-node is not high on our priority list.
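
For context, the fan-out exchange described here corresponds roughly to the point-to-point formulation below (a sketch reusing the names from the earlier example), where every rank posts a send and a receive for every peer in a single group; MPI libraries, by contrast, typically switch between algorithms such as Bruck or pairwise exchange depending on message size and scale.

```c
/* Fan-out alltoall written with RCCL point-to-point calls: every rank posts a
   send to and a receive from every peer inside one group. Reuses comm, stream,
   d_send, d_recv, count and nranks from the earlier sketch. */
ncclGroupStart();
for (int peer = 0; peer < nranks; ++peer) {
    ncclSend(d_send + (size_t)peer * count, count, ncclFloat, peer, comm, stream);
    ncclRecv(d_recv + (size_t)peer * count, count, ncclFloat, peer, comm, stream);
}
ncclGroupEnd();
hipStreamSynchronize(stream);
```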