Open arttianezhu opened 5 months ago
Indeed, that was a long-standing issue when setting NCCL_ALGO=Tree; the fallback-to-ring code was re-enabling all protocols.
Here is a fix which will be in NCCL 2.22:
diff --git a/src/graph/tuning.cc b/src/graph/tuning.cc
index 24025ad17..4087b6acf 100644
--- a/src/graph/tuning.cc
+++ b/src/graph/tuning.cc
@@ -315,6 +315,7 @@ ncclResult_t ncclTopoTuneModel(struct ncclComm* comm, int minCompCap, int maxCom
}
if (pEnable == 0) comm->bandwidths[c][a][p] = 0;
if (algoEnable[a] == 0) comm->bandwidths[c][a][p] = 0;
+ if (a == NCCL_ALGO_RING && pEnable == 0) comm->ringbdw[c][p] = 0;
}
for (int c = 0; c < NCCL_NUM_FUNCTIONS; c++) {
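To illustrate why the one-line change matters, here is a minimal, self-contained sketch of the fallback behavior for a single collective. It is a simplified model, not the NCCL source: the `Tuning` struct, the `tuneBroadcast` function, the `applyFix` flag, the enum names, and the bandwidth numbers are all hypothetical, and the model assumes that broadcast only supports Ring and that the fallback copies a saved ring table back when no algorithm/protocol pair is left.

```cpp
#include <cstdio>

// Hypothetical, simplified model of the tuning tables for ONE collective
// (broadcast). Names and numbers are illustrative only, not NCCL's.
enum Algo { ALGO_TREE = 0, ALGO_RING = 1, NUM_ALGOS = 2 };
enum Proto { PROTO_LL = 0, PROTO_LL128 = 1, PROTO_SIMPLE = 2, NUM_PROTOS = 3 };

struct Tuning {
  double bandwidths[NUM_ALGOS][NUM_PROTOS];  // 0 means "pair disabled"
  double ringbdw[NUM_PROTOS];                // saved ring values used by the fallback
};

// algoEnable / protoEnable model the effect of NCCL_ALGO / NCCL_PROTO.
// applyFix toggles the extra line added by the patch.
Tuning tuneBroadcast(const bool (&algoEnable)[NUM_ALGOS],
                     const bool (&protoEnable)[NUM_PROTOS],
                     bool applyFix) {
  // Placeholder ring bandwidths; assumption: broadcast has no Tree support.
  const double ringBase[NUM_PROTOS] = {1.0, 2.0, 4.0};
  Tuning t = {};

  for (int a = 0; a < NUM_ALGOS; a++) {
    for (int p = 0; p < NUM_PROTOS; p++) {
      double bw = (a == ALGO_RING) ? ringBase[p] : 0.0;
      if (a == ALGO_RING) t.ringbdw[p] = bw;   // saved before user filters apply
      if (!protoEnable[p]) bw = 0;             // user disabled this protocol
      if (!algoEnable[a]) bw = 0;              // user disabled this algorithm
      t.bandwidths[a][p] = bw;
      // The patched behavior: a disabled protocol is also cleared in the
      // saved ring table, so the fallback cannot bring it back.
      if (applyFix && a == ALGO_RING && !protoEnable[p]) t.ringbdw[p] = 0;
    }
  }

  // Fallback to ring: if no pair survived the filters, restore ring entries.
  bool available = false;
  for (int a = 0; a < NUM_ALGOS; a++)
    for (int p = 0; p < NUM_PROTOS; p++)
      if (t.bandwidths[a][p] != 0) available = true;
  if (!available)
    for (int p = 0; p < NUM_PROTOS; p++)
      t.bandwidths[ALGO_RING][p] = t.ringbdw[p];

  return t;
}
```

In this model, calling `tuneBroadcast` with only Tree and only Simple enabled leaves Ring/LL usable when `applyFix` is false, which mirrors the reported behavior; with `applyFix` set to true, only Ring/Simple survives the fallback.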
Hi, we recently observed that when running with NCCL_ALGO=Tree and NCCL_PROTO=Simple, NCCL falls back to Ring,LL for broadcast. It seems that NCCL_PROTO is ignored when no ALGO/PROTO pair is found for the collective.
The above trace is from 2.21.