NVIDIA / nccl

Optimized primitives for collective multi-GPU communication
Other
3.19k stars 806 forks source link

Topology printing bug in `ncclTopoPrintGraph` in NCCL 2.23? #1468

Open ezhang887 opened 2 weeks ago

ezhang887 commented 2 weeks ago

Hi NCCL experts,

I was recently investigating some of the output of NCCL_DEBUG_SUBSYS=Graph with the latest NCCL version (NCCL 2.23).

I was specifically looking at some of the output that ncclTopoPrintGraph prints out, and I was confused by some of the output when crossNic was 1.

After tracing it back to the source code, I found this line. Should the last element be NCCL_TOPO_ID_LOCAL_ID(graph->inter[2*c+1]) instead of 2*c? https://github.com/NVIDIA/nccl/blob/2ea4ee94bfb04c886c79ccae60ac9961000fdee2/src/graph/search.cc#L1145

Looking at the commit history looks like it used to be 2*c+1 but got (accidentally?) changed due to additions in the lastest release? [link to commit] Image

ezhang887 commented 2 weeks ago

Maybe cc @sjeaugey @KaimingOuyang?

sjeaugey commented 2 weeks ago

Indeed, good catch. Thanks! We'll fix that.

ezhang887 commented 2 weeks ago

Thanks for confirming @sjeaugey! Closing this issue.

ezhang887 commented 2 weeks ago

Actually -- I'll leave this issue open until the next release (or whenever this is fixed up here since IIRC NCCL doesn't really accept PRs) in case others run into the same issue so they can easily find it.