NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

NCCL fallback to Ring,LL on broadcast perf and NCCL_ALGO=Tree #1304

Open arttianezhu opened 5 months ago

arttianezhu commented 5 months ago

Hi, we recently observed that when running with NCCL_ALGO=Tree,NCCL_PROTO=Simple, NCCL falls back to Ring,LL on broadcast. It seems NCCL_PROTO is ignored when no ALGO/PROTO pair is found for the collective.

[1] NCCL INFO Broadcast: opCount a sendbuff 0x7f6cb9e00200 recvbuff 0x7f6cb9e00200 count 24 datatype 0 op 0 root 0 comm 0x55a2f87b6bb0 [nranks=1024] stream 0x55a2f87bbd40
[1] 15269.938482 topoGetAlgoInfo:1535 NCCL TRACE 24 Bytes -> Algo 1 proto 0 time 2771.401367

The above trace is from 2.21. Algo 1 and proto 0 decode to Ring and LL, so the forced Tree/Simple combination was replaced by Ring,LL.
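
To make the failure mode concrete, here is a minimal, self-contained C++ model of the selection logic. This is a sketch, not NCCL's actual implementation: the tables, the made-up latency/bandwidth numbers, and the selection loop are assumptions that only mirror names appearing in this thread (bandwidths, ringbdw, topoGetAlgoInfo).

// Build with: g++ -std=c++11 -o fallback fallback.cc
#include <cstdio>

enum { ALGO_TREE, ALGO_RING, NUM_ALGOS };                 // Algo 1 == Ring in the trace
enum { PROTO_LL, PROTO_LL128, PROTO_SIMPLE, NUM_PROTOS }; // proto 0 == LL

static const char* algoStr[]  = { "Tree", "Ring" };
static const char* protoStr[] = { "LL", "LL128", "Simple" };

int main() {
  // State after applying NCCL_ALGO=Tree,NCCL_PROTO=Simple to a Broadcast:
  // Broadcast has no Tree implementation, and Ring is disabled by NCCL_ALGO,
  // so every (algo,proto) entry in the main table ends up zeroed.
  double bandwidths[NUM_ALGOS][NUM_PROTOS] = {};
  // Fallback ring table. Before the 2.22 fix it was never filtered by
  // NCCL_PROTO, so LL and LL128 stayed available here.
  double ringbdw[NUM_PROTOS] = { 20, 60, 100 }; // GB/s, made-up numbers
  double latency[NUM_PROTOS] = { 2, 4, 10 };    // us, made-up numbers
  double bytes = 24;                            // the 24-byte Broadcast from the trace

  // Normal selection: best enabled (algo,proto) pair by predicted time.
  int bestA = -1, bestP = -1; double bestTime = 1e30;
  for (int a = 0; a < NUM_ALGOS; a++)
    for (int p = 0; p < NUM_PROTOS; p++)
      if (bandwidths[a][p] > 0) {
        double t = latency[p] + bytes / (bandwidths[a][p] * 1e3);
        if (t < bestTime) { bestTime = t; bestA = a; bestP = p; }
      }

  if (bestA < 0) {
    // No pair survived the filters: fall back to Ring. With the unfiltered
    // ringbdw table, LL wins on latency for 24 bytes -> "Algo 1 proto 0".
    for (int p = 0; p < NUM_PROTOS; p++) {
      if (ringbdw[p] <= 0) continue;
      double t = latency[p] + bytes / (ringbdw[p] * 1e3);
      if (t < bestTime) { bestTime = t; bestA = ALGO_RING; bestP = p; }
    }
  }
  printf("%g Bytes -> Algo %d (%s) proto %d (%s)\n",
         bytes, bestA, algoStr[bestA], bestP, protoStr[bestP]);
  // Prints: 24 Bytes -> Algo 1 (Ring) proto 0 (LL), matching the trace.
  return 0;
}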

sjeaugey commented 5 months ago

Indeed, that was a long-standing issue when setting NCCL_ALGO=Tree; the fallback-to-ring code was re-enabling all protocols.

Here is a fix which will be in NCCL 2.22:

diff --git a/src/graph/tuning.cc b/src/graph/tuning.cc
index 24025ad17..4087b6acf 100644
--- a/src/graph/tuning.cc
+++ b/src/graph/tuning.cc
@@ -315,6 +315,7 @@ ncclResult_t ncclTopoTuneModel(struct ncclComm* comm, int minCompCap, int maxCom
     }
     if (pEnable == 0) comm->bandwidths[c][a][p] = 0;
     if (algoEnable[a] == 0) comm->bandwidths[c][a][p] = 0;
+    if (a == NCCL_ALGO_RING && pEnable == 0) comm->ringbdw[c][p] = 0;
   }

   for (int c = 0; c < NCCL_NUM_FUNCTIONS; c++) {
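
In other words, the added line propagates the NCCL_PROTO filter into the ring fallback table: for any protocol the user disabled (pEnable == 0), the corresponding ringbdw entry is zeroed as well, so when no enabled ALGO/PROTO pair exists and the code falls back to Ring, it can only pick a protocol permitted by NCCL_PROTO. In the scenario above, the broadcast would then run as Ring,Simple rather than Ring,LL.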