ROCm / rccl

ROCm Communication Collectives Library (RCCL)
https://rocmdocs.amd.com/projects/rccl/en/latest/

[Issue]: Unexpected Behavior LL Protocol #1234

Open arkhadem opened 1 week ago

arkhadem commented 1 week ago

Problem Description

Hi Everyone,

I have found some very strange behavior in rccl-rocm-6.1.2 that I cannot explain with my limited knowledge of the LL implementation. It shows up in the AllGather test with the RING algorithm and the LL protocol.

In the LL implementation, each channel has 256 threads, and each thread sends/receives 8B of data per trip. So each trip of any LL primitive transfers 256 threads x 8B = 2KB of data. I found that if the data size is not divisible by 128B (16 threads), the latency is very high.

In the following experiment, I increase the data size by 16B in each step, meaning that 2 more threads transfer data. Every 8 steps (x 2 threads = 16 threads, or 1/4 of the warp size), the latency is low. Otherwise, the latency is much higher (a difference of ~150us). Does anyone understand why this is happening?
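To make the arithmetic above concrete, here is a minimal sketch of the model as I understand it. The constants come from the description in this post, not from the RCCL source; the helper `describe` and the 16 MiB starting size are illustrative only, and the sketch looks at a single channel and ignores how data is split across rings.

```python
# Sketch of the LL transfer model described in this post (single channel).
# Constants are taken from the issue text, not from the RCCL source.
THREADS_PER_CHANNEL = 256
BYTES_PER_THREAD = 8       # payload each thread moves per trip
FAST_GRANULARITY = 128     # bytes; sizes divisible by this were observed to be fast


def describe(size_bytes):
    """Return (full_trips, threads_in_last_trip, is_128B_aligned)."""
    bytes_per_trip = THREADS_PER_CHANNEL * BYTES_PER_THREAD  # 2048 B = 2KB
    full_trips, tail = divmod(size_bytes, bytes_per_trip)
    tail_threads = -(-tail // BYTES_PER_THREAD)  # ceiling division
    return full_trips, tail_threads, size_bytes % FAST_GRANULARITY == 0


base = 16 * 1024 * 1024  # hypothetical starting size in bytes
for step in range(17):
    size = base + 16 * step  # 16B per step = 2 more threads
    trips, tail_threads, aligned = describe(size)
    label = "expected fast" if aligned else "expected slow"
    print(f"step {step:2d}  size {size}  tail threads {tail_threads:3d}  {label}")
```

Under this model, only steps 0, 8, and 16 land on a 128B boundary, which matches the pattern of low-latency points in the measurement.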

[Image: 3MI300X]

Sincerely,

Operating System

Ubuntu 22.04.3 LTS (Jammy Jellyfish)

CPU

Intel(R) Xeon(R) Platinum 8480C

GPU

AMD Instinct MI300X

ROCm Version

ROCm 6.1.0

ROCm Component

No response

Steps to Reproduce

Using 3 fully-connected GPUs:

RCCL_MSCCL_ENABLE=0 NCCL_PROTO=LL NCCL_ALGO=RING NCCL_MIN_NRINGS=16 NCCL_MAX_NRINGS=16 LD_LIBRARY_PATH=rccl-rocm-6.1.2/build/release/:$LD_LIBRARY_PATH ./build/all_gather_perf -g 3 -b 50331648 -e 50334720 -i 48 -s 1
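For reference, the sizes swept by that command can be enumerated as follows. This is a minimal sketch, assuming the `-b`/`-e`/`-i` values are total bytes across all 3 ranks, so that each rank contributes one third (consistent with the 16B-per-step description above); `per_rank_sizes` is a hypothetical helper, not part of rccl-tests.

```python
# Enumerate the message sizes swept by the all_gather_perf command above.
# Assumption: -b/-e/-i are total bytes over all ranks, so per-rank size = total / 3.
N_RANKS = 3
BEGIN, END, STEP = 50331648, 50334720, 48  # values of -b, -e, -i


def per_rank_sizes(begin=BEGIN, end=END, step=STEP, n_ranks=N_RANKS):
    """Per-rank byte counts for every size in the sweep (end inclusive)."""
    return [total // n_ranks for total in range(begin, end + 1, step)]


sizes = per_rank_sizes()
aligned = [s for s in sizes if s % 128 == 0]  # sizes divisible by 128B
print(len(sizes), sizes[0], sizes[-1], len(aligned))
# prints: 65 16777216 16778240 9
```

So the sweep covers 65 sizes from 16 MiB to 16 MiB + 1KB per rank, of which 9 are divisible by 128B.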

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

No response

gilbertlee-amd commented 4 days ago

Hi @arkhadem,

Thanks for reporting this - I've created an internal ticket to look into this and will update this ticket when we have some information about it.