Open ezhang887 opened 2 weeks ago
Maybe cc @sjeaugey @KaimingOuyang?
Indeed, good catch. Thanks! We'll fix that.
Thanks for confirming @sjeaugey! Closing this issue.
Actually -- I'll leave this issue open until the next release (or whenever this is fixed up here since IIRC NCCL doesn't really accept PRs) in case others run into the same issue so they can easily find it.
Hi NCCL experts,
I was recently investigating some of the output of
NCCL_DEBUG_SUBSYS=Graph
with the latest NCCL version (NCCL 2.23).I was specifically looking at some of the output that
ncclTopoPrintGraph
prints out, and I was confused by some of the output whencrossNic
was 1.After tracing it back to the source code, I found this line. Should the last element be
NCCL_TOPO_ID_LOCAL_ID(graph->inter[2*c+1])
instead of2*c
? https://github.com/NVIDIA/nccl/blob/2ea4ee94bfb04c886c79ccae60ac9961000fdee2/src/graph/search.cc#L1145Looking at the commit history looks like it used to be
2*c+1
but got (accidentally?) changed due to additions in the lastest release? [link to commit]