Open osayamenja opened 2 months ago
The reported time is the time of the NCCL group call, i.e.
ncclGroupStart();
ncclRecv(...); // from prev rank
ncclSend(...); // to next rank
ncclGroupEnd();
It is not a ping-pong test; it is more like a single ring connecting each rank to its previous and next rank.
@sjeaugey Thank you. Assuming the alpha-beta cost model, would you agree that the following accurately describes the total time per rank $t_r$ and the ideal reported time $t_R$? $$t_r = \alpha_{r-1, r} + n \cdot \beta_{r-1, r} = \frac{RTT}{2}$$ $$t_R = \max_{r \in W} t_r$$ where $\alpha_{ij}$ and $\beta_{ij}$ denote the latency and the inverse bandwidth (time per unit of data) of sending from $i$ to $j$, $n$ is the data size, and $W$ is the process world.
That is, for GPU 0 in a two-GPU ring, $t_0 = \alpha_{1, 0} + n \cdot \beta_{1, 0}$, since GPU 0 and GPU 1 both send simultaneously through the ring, and GPU 0 has to wait until it receives from GPU 1, and vice versa.
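To make the proposed model concrete, here is a minimal sketch in C that evaluates $t_r$ for each rank and takes the maximum as $t_R$. The `link_cost` type and the function names are hypothetical and not part of NCCL or nccl-tests; the sketch only assumes the alpha-beta model as stated above, with $\beta$ expressed in seconds per byte.

```c
#include <stddef.h>

/* Hypothetical per-link parameters of the alpha-beta model (illustrative only). */
typedef struct {
    double alpha; /* latency of the link, in seconds */
    double beta;  /* inverse bandwidth of the link, in seconds per byte */
} link_cost;

/* t_r: time for rank r to receive n_bytes over its incoming link. */
double rank_time(link_cost incoming, size_t n_bytes) {
    return incoming.alpha + (double)n_bytes * incoming.beta;
}

/* t_R: the slowest rank in the ring.
 * incoming[r] describes the link from rank (r - 1 + world) % world to rank r. */
double ring_time(const link_cost *incoming, int world, size_t n_bytes) {
    double worst = 0.0;
    for (int r = 0; r < world; ++r) {
        double t = rank_time(incoming[r], n_bytes);
        if (t > worst) {
            worst = t;
        }
    }
    return worst;
}
```

For example, with two ranks whose incoming links have latencies of 5 µs and 10 µs and the same $\beta$ of 1 ns per byte, a 1 MB message gives per-rank times of about 1.005 ms and 1.010 ms, so the model predicts a reported time of about 1.010 ms.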
Is the time reported for `sendrecv_perf` $RTT$ (Round-Trip Time) or $\frac{RTT}{2}$? That is, given that the test does a send and a receive, the time could be $\frac{RTT}{2}$ if the sends from both GPUs overlap and each GPU receives its awaited data at the same time. On the other hand, $RTT$ would occur if, theoretically, GPU 1 sends to GPU 2 and then waits for GPU 2 to send its data or, concisely, 1:send <-> 2:recv followed by 1:recv <-> 2:send.
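The two scenarios in the question can be contrasted with a small sketch in C. It assumes the alpha-beta model and a full-duplex link on which the two directions do not contend with each other; the function names are hypothetical and not taken from NCCL or nccl-tests.

```c
#include <stddef.h>

/* One-way transfer time under the alpha-beta model:
 * alpha in seconds, beta in seconds per byte. */
double one_way_time(double alpha, double beta, size_t n_bytes) {
    return alpha + (double)n_bytes * beta;
}

/* Overlapped exchange: both GPUs send at once, so the exchange
 * completes when the slower of the two directions completes. */
double overlapped_exchange_time(double t_forward, double t_backward) {
    return (t_forward > t_backward) ? t_forward : t_backward;
}

/* Serialized exchange: the second transfer starts only after the
 * first one has finished, so the two one-way times add up. */
double serialized_exchange_time(double t_forward, double t_backward) {
    return t_forward + t_backward;
}
```

With symmetric links, the overlapped case equals one one-way time ($\frac{RTT}{2}$) and the serialized case equals two one-way times ($RTT$), which is the distinction the question is asking about.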