Potential NVLS topology discovery error

bwbarrett commented 1 year ago

We've been chasing a repeatable crash with NCCL 1.18.3 on P5 (with one EFA device of 100 Gbps per GPU), and I suspect something is wrong with the topology code in the NVLS path. We end up with a fairly deep call stack (below), which ends up with an Invalid read (according to valgrind) in ncclTopoSearchRecNet() and goes south from there (eventually leading to a segmentation faulre). Nets is an array of system->nodes[NET].count entries (which is 8 in our case), but we end up dereferencing nets[graph->nChannels] as part of the call to ncclTopoSearchTryGPU() where graph->nChannels is also 8, meaning we read garbage past that calloc'ed array and it goes south from there.

I added a couple asserts, particularly one in ncclTopoSearchTryGPU() (patch attached), and that assert fires every time on our hardware when building the NVLS topology. It looks like we got there because when we get to ncclTopoSearchRecGPU() where step == ngpus, but the number of channels (8) is less than the graph's maxChannels (32).

I'm not entirely sure I know what the topology is supposed to look like in the NVLS (but not IB Sharp) case. Any thoughts on where to look next? nvls-discovery.tar.gz

sjeaugey commented 1 year ago

Sorry about that, that code if obviously wrong.

Can you check it's fixed with the commit I just pushed?

AmedeoSapio commented 1 year ago

I have confirmed that the tests that were failing due to this issue are completing successfully with the new commit. We really appreciate the help and the quick response. Thanks!

junpuf commented 1 year ago

Hi @sjeaugey, I wonder if you are planning to make a release to incorporate this fix?

NVIDIA / nccl

Potential NVLS topology discovery error #931