NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

Mismatch between ncclTopoTrimSystem and ncclTopoCompute #1406

Closed: shanleo2024 closed this issue 3 months ago

shanleo2024 commented 3 months ago

Hi dear developers,

I ran rccl_test with NCCL_P2P_DISABLE=1 and NCCL_SHM_DISABLE=1 on two GPUs and an IB NIC. The final dumped graph.xml is as follows:

<graphs version="1">
  <graph id="0" pattern="4" crossnic="0" nchannels="1" speedintra="24" speedinter="24" latencyinter="0" typeintra="LOC" typeinter="PXB" samechannels="1">
    <channel>
      <net dev="0"/>
      <gpu dev="0"/>
      <net dev="0"/>
    </channel>
  </graph>
  <graph id="1" pattern="3" crossnic="0" nchannels="1" speedintra="48" speedinter="24" latencyinter="0" typeintra="LOC" typeinter="PXB" samechannels="1">
    <channel>
      <net dev="0"/>
      <gpu dev="0"/>
      <net dev="0"/>
    </channel>
  </graph>
  <graph id="2" pattern="3" crossnic="0" nchannels="0" speedintra="0" speedinter="0" latencyinter="0" typeintra="LOC" typeinter="LOC" samechannels="0"/>
  <graph id="3" pattern="5" crossnic="0" nchannels="0" speedintra="0" speedinter="0" latencyinter="0" typeintra="LOC" typeinter="LOC" samechannels="0"/>
</graphs>

RANK1 was removed in ncclTopoTrimSystem because the path type between RANK0 and RANK1 is PATH_NET.

But the channels logged with NCCL_DEBUG=TRACE are as follows:

[0] NCCL INFO Channel 00/04 :    0   1
[0] NCCL INFO Channel 01/04 :    0   1
[0] NCCL INFO Channel 02/04 :    0   1
[0] NCCL INFO Channel 03/04 :    0   1
[0] NCCL INFO Channel 00/0 : 1[3a000] -> 0[37000] [receive] via NET/IB/0/GDRDMA comm 0x55a4258c1170
[0] NCCL INFO Channel 01/0 : 1[3a000] -> 0[37000] [receive] via NET/IB/0/GDRDMA comm 0x55a4258c1170
[0] NCCL INFO Channel 02/0 : 1[3a000] -> 0[37000] [receive] via NET/IB/0/GDRDMA comm 0x55a4258c1170
[0] NCCL INFO Channel 03/0 : 1[3a000] -> 0[37000] [receive] via NET/IB/0/GDRDMA comm 0x55a4258c1170
[0] NCCL INFO Channel 00/0 : 0[37000] -> 1[3a000] [send] via NET/IB/0 comm 0x55a4258c1170 nRanks 02
[0] NCCL INFO Channel 01/0 : 0[37000] -> 1[3a000] [send] via NET/IB/0 comm 0x55a4258c1170 nRanks 02
[0] NCCL INFO Channel 02/0 : 0[37000] -> 1[3a000] [send] via NET/IB/0 comm 0x55a4258c1170 nRanks 02
[0] NCCL INFO Channel 03/0 : 0[37000] -> 1[3a000] [send] via NET/IB/0 comm 0x55a4258c1170 nRanks 02
[1] NCCL INFO Channel 00/0 : 0[37000] -> 1[3a000] [receive] via NET/IB/1/GDRDMA comm 0x55a425839140
[1] NCCL INFO Channel 01/0 : 0[37000] -> 1[3a000] [receive] via NET/IB/1/GDRDMA comm 0x55a425839140
[1] NCCL INFO Channel 02/0 : 0[37000] -> 1[3a000] [receive] via NET/IB/1/GDRDMA comm 0x55a425839140
[1] NCCL INFO Channel 03/0 : 0[37000] -> 1[3a000] [receive] via NET/IB/1/GDRDMA comm 0x55a425839140
[1] NCCL INFO Channel 00/0 : 1[3a000] -> 0[37000] [send] via NET/IB/1 comm 0x55a425839140 nRanks 02
[1] NCCL INFO Channel 01/0 : 1[3a000] -> 0[37000] [send] via NET/IB/1 comm 0x55a425839140 nRanks 02
[1] NCCL INFO Channel 02/0 : 1[3a000] -> 0[37000] [send] via NET/IB/1 comm 0x55a425839140 nRanks 02
[1] NCCL INFO Channel 03/0 : 1[3a000] -> 0[37000] [send] via NET/IB/1 comm 0x55a425839140 nRanks 02

I have two questions: (1) Why is RANK1 removed in ncclTopoTrimSystem in this test case? (2) NCCL has removed RANK1 in ncclTopoTrimSystem, so why do the final channels still include RANK1?

sjeaugey commented 3 months ago

(1) P2P and SHM have been disabled, so NCCL cannot communicate between the two GPUs using the intra-node code paths. The other GPU is therefore removed from the "intra-node" view, which makes NCCL use the network to communicate between the two GPUs. (2) Once the intra-node channels have been created, we connect them "inter-node", which creates the final channels.
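
To make point (1) concrete, here is a minimal Python sketch of the trimming idea. It is an illustration under stated assumptions, not NCCL's actual code: the function name `trim_system`, the matrix layout, and the numeric ranking of path types are all hypothetical. The only things taken from this thread are that a GPU pair reachable only through `PATH_NET` is not treated as intra-node, and that such a GPU is dropped from the local view.

```python
# Simplified illustration only; not NCCL's implementation.
# Hypothetical ranking of path types: a lower value means a "closer" path.
# Only PATH_NET is named in this thread; the other constants are placeholders.
PATH_LOC, PATH_PXB, PATH_SYS, PATH_NET = 0, 1, 2, 3


def trim_system(path_type, my_gpu):
    """Return the GPU indices that stay in my_gpu's intra-node view.

    path_type[a][b] is the (hypothetical) path type between GPU a and GPU b.
    """
    n = len(path_type)
    # Start with every GPU in its own domain.
    domain = list(range(n))
    for g in range(n):
        for p in range(g):
            # GPUs joined by anything better than a network path share a domain.
            if path_type[g][p] < PATH_NET:
                domain[g] = min(domain[g], domain[p])
    # Keep only the GPUs that ended up in the same domain as my_gpu.
    return [g for g in range(n) if domain[g] == domain[my_gpu]]


# With P2P and SHM disabled, the only path between the two GPUs is the network.
net_only = [[PATH_LOC, PATH_NET],
            [PATH_NET, PATH_LOC]]
print(trim_system(net_only, my_gpu=0))  # rank 0 keeps only its own GPU
print(trim_system(net_only, my_gpu=1))  # rank 1 keeps only its own GPU
```

In this sketch each rank ends up with a one-GPU "node", which matches the behavior described above: the intra-node graph is built from that reduced view, and the two views are then connected over the network.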

shanleo2024 commented 3 months ago

Thank you. Do you mean that when P2P and SHM are disabled, intra-node communication has to be turned into inter-node communication over NET, so it looks as if there are two nodes with one GPU each? The final XML file is somewhat misleading, though, since no GPU has actually been removed. I think the following graph.xml would be better:

<graphs version="1">
  <graph id="0" pattern="4" crossnic="0" nchannels="1" speedintra="24" speedinter="24" latencyinter="0" typeintra="LOC" typeinter="PXB" samechannels="1">
    <channel>
      <net dev="0"/>
      <gpu dev="0"/>
      <gpu dev="1"/>
      <net dev="0"/>
    </channel>
  </graph>
  <graph id="1" pattern="3" crossnic="0" nchannels="1" speedintra="48" speedinter="24" latencyinter="0" typeintra="LOC" typeinter="PXB" samechannels="1">
    <channel>
      <net dev="0"/>
      <gpu dev="0"/>
      <gpu dev="1"/>
      <net dev="0"/>
    </channel>
  </graph>
  <graph id="2" pattern="3" crossnic="0" nchannels="0" speedintra="0" speedinter="0" latencyinter="0" typeintra="LOC" typeinter="LOC" samechannels="0"/>
  <graph id="3" pattern="5" crossnic="0" nchannels="0" speedintra="0" speedinter="0" latencyinter="0" typeintra="LOC" typeinter="LOC" samechannels="0"/>
</graphs>

kiskra-nvidia commented 3 months ago

The XML graph can vary from node to node and includes only what NCCL considers to be node-local resources. Because NCCL thinks it is running on two nodes when NCCL_P2P_DISABLE=1 and NCCL_SHM_DISABLE=1 are set, it is normal and expected that GPU rank 1 from "the other" node is not included. If you pass NCCL_GRAPH_DUMP_FILE_RANK=1, you will get the graph from "the other" node, which will include GPU rank 1 but not GPU rank 0.
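
One way to check which GPUs a given dump contains is to parse it and list the `gpu` entries per graph. The snippet below is a small sketch using Python's standard `xml.etree.ElementTree`. The helper name `gpus_per_graph` is made up for illustration, and the embedded XML is an abridged excerpt of the rank 0 dump from this thread (most attributes and two graphs are omitted).

```python
import xml.etree.ElementTree as ET

# Abridged excerpt of the rank 0 dump posted earlier in this thread.
GRAPH_XML = """<graphs version="1">
  <graph id="0" pattern="4" nchannels="1">
    <channel>
      <net dev="0"/>
      <gpu dev="0"/>
      <net dev="0"/>
    </channel>
  </graph>
  <graph id="2" pattern="3" nchannels="0"/>
</graphs>"""


def gpus_per_graph(xml_text):
    """Map each graph id to the sorted list of gpu 'dev' values it contains."""
    root = ET.fromstring(xml_text)
    result = {}
    for graph in root.findall("graph"):
        # Collect every <gpu dev="..."/> found in any channel of this graph.
        devs = sorted({int(gpu.get("dev")) for gpu in graph.iter("gpu")})
        result[int(graph.get("id"))] = devs
    return result


print(gpus_per_graph(GRAPH_XML))  # {0: [0], 2: []}
```

Running the same helper on a dump produced with NCCL_GRAPH_DUMP_FILE_RANK=1 should, per the explanation above, list the other GPU instead.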

shanleo2024 commented 3 months ago

Thank you, @kiskra-nvidia. Your comments taught me a lot that I had previously overlooked, and this makes sense. I have tested NCCL_GRAPH_DUMP_FILE_RANK and it works. The answer is very helpful, thanks a lot.