Open yanminjia opened 11 months ago
It looks like all_reduce_perf somehow dropped into an infinite loop.
My understanding is that the line below may be used to restrict the maximum number of NVLS heads.
if (graph->pattern == NCCL_TOPO_PATTERN_NVLS && i>0) continue;
But I'm not sure.
Additionally, if NCCL_ALGO_NVLS_TREE is enabled, I'm curious how the kernel function completes an intra-node AllReduce with a single scatter and gather by means of the NVLS mechanism. I don't quite understand how the array nvls->up is used. It looks like the values of the nvls->up elements, such as comm->nranks + h, are greater than comm->nranks.
When I tested version 2.19.3 with dual-port NICs and NVLS acceleration, all_reduce_perf unfortunately hung. The all_reduce_perf processes were still there, and GPU utilization went up to 100%. In this case, NVLS is enabled (NCCL_NVLS_ENABLE = 1), and I'm sure the topology graph is generated based on NVLS_TREE (graph->pattern == NCCL_TOPO_PATTERN_NVLS). I'm not sure what went wrong. Additionally, if I shut down one port on each NIC, the test works and I do see the test results.
I noticed that one line was added to search.cc:ncclTopoSearchRecNet(...) by the commit "Fixes for H800" (commit id: bfd7bf8), as follows:
if (graph->pattern == NCCL_TOPO_PATTERN_NVLS && i>0) continue;
With this newly added line, in the case of a dual-port NIC, it looks like only one port can be used as an outgoing channel when constructing the topology graph. I don't quite understand why this line was added.
Any ideas would be highly appreciated. Many thanks.
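To make my reading of that line concrete, here is a toy sketch. It is not the NCCL source: the enum, the function name candidatePorts, and the assumption that i indexes the candidate network ports in the search loop are all hypothetical and only illustrate the effect I think the `i>0` check has.

```cpp
#include <cassert>
#include <vector>

// Toy model only, NOT the NCCL source. All names here are hypothetical.
enum Pattern { PATTERN_RING, PATTERN_NVLS };

// Returns the indices of the candidate ports that the loop would actually try.
// Assumption: i indexes the candidate network ports, as in my reading of the line.
std::vector<int> candidatePorts(Pattern pattern, int numPorts) {
  std::vector<int> tried;
  for (int i = 0; i < numPorts; i++) {
    // Same shape as the added line: for the NVLS pattern,
    // every candidate after the first one is skipped.
    if (pattern == PATTERN_NVLS && i > 0) continue;
    tried.push_back(i);
  }
  return tried;
}
```

In this sketch, calling candidatePorts(PATTERN_NVLS, 2) tries only port 0, while candidatePorts(PATTERN_RING, 2) tries both ports, which matches what I seem to observe with a dual-port NIC.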