manver-iitk opened this issue 5 months ago
@manver-iitk a couple of questions:
Hello, @manver-iitk. Has this issue been resolved for you?
Hello, @corey-derochie-amd, my issue still persists.
@edgargabriel I have also installed the aws_ofi_rccl plugin for inter-node communication, but the timings are still almost 2x to 3x those of plain MPI. I'm using 4 to 8 nodes.
Hi, for alltoall, RCCL uses a fan-out algorithm, which is very crude (every rank sends to and receives from every other rank), whereas MPI does this in a more algorithmic way. This is an area where we acknowledge NCCL/RCCL is lacking. Unfortunately, optimizing alltoall for multi-node is not high on our priority list.
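For readers unfamiliar with the fan-out pattern, here is a minimal sketch of what it amounts to, following the grouped point-to-point recipe from the NCCL/RCCL documentation. Buffer names, the float datatype, and the header path are placeholders; every rank simply posts one send and one receive per peer, with no algorithmic scheduling or message combining.

```cpp
// Fan-out alltoall sketch: nranks*(nranks-1) point-to-point transfers,
// all launched together in one group call.
#include <rccl/rccl.h>       // header path may differ per ROCm install
#include <hip/hip_runtime.h>

// sendbuf and recvbuf each hold nranks * count elements, laid out per peer.
void fanout_alltoall(const float* sendbuf, float* recvbuf, size_t count,
                     int nranks, ncclComm_t comm, hipStream_t stream) {
  ncclGroupStart();
  for (int peer = 0; peer < nranks; ++peer) {
    // Send my block destined for 'peer' and receive the block 'peer' sends me.
    ncclSend(sendbuf + peer * count, count, ncclFloat, peer, comm, stream);
    ncclRecv(recvbuf + peer * count, count, ncclFloat, peer, comm, stream);
  }
  ncclGroupEnd();  // all transfers are enqueued on the stream here
}
```

At scale this generates a flat all-pairs traffic pattern, whereas MPI libraries typically switch between algorithms (e.g. pairwise exchange or Bruck-style staging) based on message size and node count, which is why the gap grows with node count.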
Problem Description
I ran my code on Frontier to test scaling on AMD GPUs. It scaled fine with MPI, but as soon as I replaced the MPI_Alltoall call with an RCCL alltoall (nccl_Alltoall), it performed much worse than MPI. Why?
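For context, a minimal sketch of the two call sites being compared. Buffer names, counts, and the float datatype are placeholders rather than anything from the original code, and the RCCL path assumes the ncclAllToAll extension (not part of stock NCCL) with a (sendbuff, recvbuff, count, datatype, comm, stream) signature.

```cpp
#include <mpi.h>
#include <rccl/rccl.h>       // header path may differ per ROCm install
#include <hip/hip_runtime.h>

// GPU-aware MPI baseline: device pointers passed straight to MPI_Alltoall.
void exchange_mpi(const float* d_send, float* d_recv, int count_per_rank,
                  MPI_Comm comm) {
  MPI_Alltoall(d_send, count_per_rank, MPI_FLOAT,
               d_recv, count_per_rank, MPI_FLOAT, comm);
}

// RCCL replacement: the stream is synchronized afterwards so timings are
// comparable to the blocking MPI call above.
void exchange_rccl(const float* d_send, float* d_recv, size_t count_per_rank,
                   ncclComm_t comm, hipStream_t stream) {
  ncclAllToAll(d_send, d_recv, count_per_rank, ncclFloat, comm, stream);
  hipStreamSynchronize(stream);
}
```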
Operating System
SLES (Frontier)
CPU
AMD EPYC 7763 64-Core Processor
GPU
AMD Instinct MI250X
ROCm Version
ROCm 5.7.1
ROCm Component
rccl
Steps to Reproduce
No response
(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support
No response
Additional Information
No response