I ran my code on Frontier for a scaling study on AMD GPUs. It scaled fine with MPI, but as soon as I replace the MPI_Alltoall call with ncclAllToAll, it performs far worse than MPI. Why?
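For context, this is roughly the kind of substitution being described. Since older RCCL releases do not expose a dedicated all-to-all collective, the pattern documented by NCCL/RCCL is to express it as grouped point-to-point calls; the sketch below assumes the communicator, stream, and device buffers are already set up, and the function name `allToAll` is just illustrative:

```cpp
#include <rccl/rccl.h>
#include <hip/hip_runtime.h>

// Sketch: all-to-all exchange via grouped ncclSend/ncclRecv, the pattern
// the NCCL/RCCL documentation recommends. `count` is the per-peer element
// count; sendbuf/recvbuf are device buffers of nranks * count floats.
ncclResult_t allToAll(const float* sendbuf, float* recvbuf, size_t count,
                      int nranks, ncclComm_t comm, hipStream_t stream) {
  ncclGroupStart();  // fuse the sends/recvs so they progress concurrently
  for (int peer = 0; peer < nranks; ++peer) {
    ncclSend(sendbuf + peer * count, count, ncclFloat, peer, comm, stream);
    ncclRecv(recvbuf + peer * count, count, ncclFloat, peer, comm, stream);
  }
  return ncclGroupEnd();
}
```

Running this across nodes requires a working inter-node network transport, which is exactly where the question below about RCCL's configuration on Frontier comes in.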
Operating System
SLES (Frontier)
CPU
AMD EPYC 7763 64-Core Processor
GPU
AMD Instinct MI250X
ROCm Version
ROCm 5.7.1
ROCm Component
rccl
Steps to Reproduce
No response
(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support
No response
How many nodes did you use (1 node vs. more than 1 node)?
If more than 1 node: how did you configure RCCL on Frontier? Did you use the RCCL libfabric plugin for inter-node communication? If not, RCCL will end up using TCP sockets as far as I know (since Frontier does not support the verbs API), which might explain why RCCL is so much slower than MPI_Alltoall.
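For reference, a minimal sketch of the environment setup along the lines of what the public OLCF Frontier user guide describes for enabling the libfabric (aws-ofi-rccl) plugin. The plugin path is a placeholder and the interface names are assumptions; verify both against your site documentation:

```shell
# Sketch of a Frontier job-script fragment (placeholder paths; verify
# module and interface names against the OLCF Frontier user guide).

# Make the aws-ofi-rccl libfabric plugin visible so RCCL can dlopen it
# instead of falling back to TCP sockets.
export LD_LIBRARY_PATH=/path/to/aws-ofi-rccl/lib:$LD_LIBRARY_PATH

# Point RCCL at the Slingshot high-speed interfaces rather than ethernet.
export NCCL_SOCKET_IFNAME=hsn0,hsn1,hsn2,hsn3

# Turn on debug output once to confirm the plugin loaded; look for
# "NET/OFI Selected Provider is cxi" in the log rather than "NET/Socket".
export NCCL_DEBUG=INFO
```

If the `NCCL_DEBUG=INFO` log reports `NET/Socket` as the selected transport, the plugin was not found and all inter-node traffic is going over TCP, which would match the slowdown described above.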