NVIDIA / nccl-tests

NCCL Tests
BSD 3-Clause "New" or "Revised" License

SendRecv Time #214

Open osayamenja opened 2 months ago

osayamenja commented 2 months ago

Is the time reported for sendrecv_perf $RTT$ (Round-Trip Time) or $\frac{RTT}{2}$?

That is, since the test does both a send and a receive, the time could be $\frac{RTT}{2}$ if the sends from both GPUs overlap and each GPU receives its awaited data at the same time. On the other hand, it would be $RTT$ if GPU 1 sends to GPU 2 and then waits for GPU 2 to send its data back; concisely, 1:send <-> 2:recv, then 1:recv <-> 2:send.

sjeaugey commented 2 months ago

The reported time is the time of the NCCL group call, i.e.

ncclGroupStart();
ncclRecv(...); // from prev rank
ncclSend(...); // to next rank
ncclGroupEnd();

It is not a ping-pong test; it's more like a single ring connecting the previous rank and the next rank.
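
To make the ring picture concrete, here is a minimal sketch of how each rank's peers could be computed. The helper names `prev_rank` and `next_rank` are hypothetical and not part of NCCL or nccl-tests; the wrap-around modulo arithmetic is an assumption based on the "prev rank" / "next rank" comments above.

```c
/* Hypothetical helpers (not NCCL API): ring neighbours of a rank.
 * Assumes ranks are numbered 0 .. nranks-1 and the ring wraps around. */

/* Rank this process would receive from. */
int prev_rank(int rank, int nranks) {
    return (rank - 1 + nranks) % nranks;
}

/* Rank this process would send to. */
int next_rank(int rank, int nranks) {
    return (rank + 1) % nranks;
}
```

With two ranks, both helpers return the other rank, so each GPU sends to and receives from its single peer within the same group call.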

osayamenja commented 2 months ago

@sjeaugey Thank you. Assuming the alpha-beta cost model, would you agree that the following accurately describes the total time per rank $t_r$ and the ideal reported time $t_R$? $$t_r = \alpha_{r-1, r} + n \cdot \beta_{r-1, r} = \frac{RTT}{2}$$ $$t_R = \max_{r \in W} t_r$$ where $\alpha_{ij}$ and $\beta_{ij}$ denote the latency and inverse bandwidth (time per unit of data) of sending from $i$ to $j$, $n$ is the data size, and $W$ is the process world.

That is, for GPU 0, $t_0 = \alpha_{1, 0} + n \cdot \beta_{1, 0}$ since both GPU 0 and GPU 1 send simultaneously through the ring, and GPU 0 will have to wait until it receives from GPU 1, and vice versa.
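
The model proposed above can be sketched as a small calculation. This only illustrates the formulas in this comment; it has not been confirmed as how NCCL behaves. The function names, the row-major `alpha`/`beta` layout, and the choice of units (seconds and seconds per byte) are all assumptions made for the example.

```c
/* Hypothetical sketch of the proposed alpha-beta model (not NCCL code).
 * alpha[i * world + j]: latency in seconds for sending from rank i to rank j.
 * beta[i * world + j]:  seconds per byte for sending from rank i to rank j. */

/* t_r = alpha_{r-1,r} + n * beta_{r-1,r}, with the previous rank wrapping around. */
double rank_time(const double *alpha, const double *beta,
                 int world, int r, double n_bytes) {
    int prev = (r - 1 + world) % world;
    int idx = prev * world + r;
    return alpha[idx] + n_bytes * beta[idx];
}

/* t_R = max over all ranks r in the world of t_r. */
double reported_time(const double *alpha, const double *beta,
                     int world, double n_bytes) {
    double worst = 0.0;
    for (int r = 0; r < world; ++r) {
        double t = rank_time(alpha, beta, world, r, n_bytes);
        if (t > worst) {
            worst = t;
        }
    }
    return worst;
}
```

For two GPUs, `rank_time(alpha, beta, 2, 0, n)` reads the 1-to-0 entries, matching $t_0$ above, and `reported_time` returns the slower of the two directions.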