NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

all_reduce_perf hangs by going with dual-ports NIC #1035

Open yanminjia opened 11 months ago

yanminjia commented 11 months ago

When I tested NCCL 2.19.3 with a dual-port NIC and NVLS acceleration, all_reduce_perf unfortunately hung. The all_reduce_perf processes were still alive, and GPU utilization went to 100%. In this case NVLS is enabled (NCCL_NVLS_ENABLE=1), and I'm sure the topology graph is generated with the NVLS_TREE pattern (graph->pattern == NCCL_TOPO_PATTERN_NVLS). I'm not sure what went wrong. Additionally, if I shut down one port on each NIC, it worked and I did see the test result.

I noticed that one line was added to search.cc:ncclTopoSearchRecNet(...) by commit "Fixes for H800" (commit id: bfd7bf8), as follows:

if (graph->pattern == NCCL_TOPO_PATTERN_NVLS && i>0) continue;

With this newly added line, it looks like a dual-port NIC can use only one of its ports as an outgoing channel when the topology graph is constructed. I don't quite understand why this line was added.

Any ideas would be highly appreciated. Many thanks.

yanminjia commented 11 months ago

It looks like all_reduce_perf dropped into an infinite loop somehow.

yanminjia commented 10 months ago

My understanding is that the line below may be used to restrict the maximum number of NVLS heads.

if (graph->pattern == NCCL_TOPO_PATTERN_NVLS && i>0) continue;

But I'm not sure.

Additionally, I'm curious how, when NCCL_ALGO_NVLS_TREE is enabled, the kernel function completes the intra-node AllReduce with a single scatter and gather via the NVLS mechanism. I don't quite understand how the array nvls->up is used: it looks like some of its elements, such as comm->nranks + h, are greater than comm->nranks.