Understand the tree topology

Dear NCCL Developer,

I'm confused about the tree topology NCCL uses on a 4-GPU DGX1-V100 (GPUs 0, 1, 2, 3). My topology file looks like this:

<system version="1">
  <cpu numaid="0" affinity="0000,0fffff00,000fffff" arch="x86_64" vendor="GenuineIntel" familyid="6" modelid="79">
    <pci busid="0000:03:00.0" class="0x060400" vendor="0x10b5" device="0x8764" subsystem_vendor="0x10b5" subsystem_device="0x8764" link_speed="8 GT/s" link_width="16">
      <pci busid="0000:06:00.0" class="0x030200" vendor="0x10de" device="0x1db5" subsystem_vendor="0x10de" subsystem_device="0x1249" link_speed="8 GT/s" link_width="16">
        <gpu dev="0" sm="70" rank="0" gdr="1">
          <nvlink target="0000:07:00.0" count="1" tclass="0x030200"/>
          <nvlink target="0000:0a:00.0" count="1" tclass="0x030200"/>
          <nvlink target="0000:0b:00.0" count="2" tclass="0x030200"/>
          <nvlink target="0000:85:00.0" count="2" tclass="0x030200"/>
        </gpu>
      </pci>
      <pci busid="0000:07:00.0" class="0x030200" vendor="0x10de" device="0x1db5" subsystem_vendor="0x10de" subsystem_device="0x1249" link_speed="8 GT/s" link_width="16">
        <gpu dev="1" sm="70" rank="1" gdr="1">
          <nvlink target="0000:86:00.0" count="2" tclass="0x030200"/>
          <nvlink target="0000:0a:00.0" count="2" tclass="0x030200"/>
          <nvlink target="0000:0b:00.0" count="1" tclass="0x030200"/>
          <nvlink target="0000:06:00.0" count="1" tclass="0x030200"/>
        </gpu>
      </pci>
    </pci>
    <pci busid="0000:08:00.0" class="0x060400" vendor="0x10b5" device="0x8780" subsystem_vendor="0x10b5" subsystem_device="0x8780" link_speed="8 GT/s" link_width="16">
      <pci busid="0000:0a:00.0" class="0x030200" vendor="0x10de" device="0x1db5" subsystem_vendor="0x10de" subsystem_device="0x1249" link_speed="8 GT/s" link_width="16">
        <gpu dev="2" sm="70" rank="2" gdr="1">
          <nvlink target="0000:89:00.0" count="1" tclass="0x030200"/>
          <nvlink target="0000:0b:00.0" count="2" tclass="0x030200"/>
          <nvlink target="0000:06:00.0" count="1" tclass="0x030200"/>
          <nvlink target="0000:07:00.0" count="2" tclass="0x030200"/>
        </gpu>
      </pci>
      <pci busid="0000:0b:00.0" class="0x030200" vendor="0x10de" device="0x1db5" subsystem_vendor="0x10de" subsystem_device="0x1249" link_speed="8 GT/s" link_width="16">
        <gpu dev="3" sm="70" rank="3" gdr="1">
          <nvlink target="0000:06:00.0" count="2" tclass="0x030200"/>
          <nvlink target="0000:07:00.0" count="1" tclass="0x030200"/>
          <nvlink target="0000:0a:00.0" count="2" tclass="0x030200"/>
          <nvlink target="0000:8a:00.0" count="1" tclass="0x030200"/>
        </gpu>
      </pci>
    </pci>
    <pci busid="0000:01:00.0" class="0x020000" vendor="0x8086" device="0x1528" subsystem_vendor="0x152d" subsystem_device="0x89b7" link_speed="5 GT/s" link_width="8">
      <nic>
        <net name="enp1s0f0" dev="0" speed="1000" port="0" latency="0.000000" guid="0x0" maxconn="65536" gdr="0"/>
      </nic>
    </pci>
    <nic>
      <net name="lxcbr0" dev="1" speed="10000" port="0" latency="0.000000" guid="0x1" maxconn="65536" gdr="0"/>
      <net name="br-d774142ed332" dev="2" speed="10000" port="0" latency="0.000000" guid="0x2" maxconn="65536" gdr="0"/>
    </nic>
  </cpu>
</system>

The NCCL INFO output when running allreduce with NCCL_ALGO=Tree is:

como:60725:60725 [3] NCCL INFO Bootstrap : Using enp1s0f0:169.229.48.119<0>
como:60725:60725 [3] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
como:60725:60725 [3] NCCL INFO NET/IB : No device found.
como:60725:60725 [3] NCCL INFO NET/Socket : Using [0]enp1s0f0:169.229.48.119<0> [1]lxcbr0:10.0.3.1<0> [2]br-d774142ed332:172.18.0.1<0>
como:60725:60725 [3] NCCL INFO Using network Socket
como:60725:60725 [3] NCCL INFO NCCL version 2.12.7+cuda11.6
como:60725:60750 [2] NCCL INFO Setting affinity for GPU 2 to 0fffff00,000fffff
como:60725:60749 [1] NCCL INFO Setting affinity for GPU 1 to 0fffff00,000fffff
como:60725:60751 [3] NCCL INFO Setting affinity for GPU 3 to 0fffff00,000fffff
como:60725:60748 [0] NCCL INFO Setting affinity for GPU 0 to 0fffff00,000fffff
como:60725:60751 [3] NCCL INFO Trees [0] 2/-1/-1->3->0 [1] 0/-1/-1->3->2 [2] 2/-1/-1->3->0 [3] 0/-1/-1->3->2 [4] 2/-1/-1->3->0 [5] 0/-1/-1->3->2 [6] 2/-1/-1->3->0 [7] 0/-1/-1->3->2
como:60725:60750 [2] NCCL INFO Trees [0] 1/-1/-1->2->3 [1] 3/-1/-1->2->1 [2] 1/-1/-1->2->3 [3] 3/-1/-1->2->1 [4] 1/-1/-1->2->3 [5] 3/-1/-1->2->1 [6] 1/-1/-1->2->3 [7] 3/-1/-1->2->1
como:60725:60748 [0] NCCL INFO Channel 00/08 :    0   1   2   3
como:60725:60749 [1] NCCL INFO Trees [0] -1/-1/-1->1->2 [1] 2/-1/-1->1->-1 [2] -1/-1/-1->1->2 [3] 2/-1/-1->1->-1 [4] -1/-1/-1->1->2 [5] 2/-1/-1->1->-1 [6] -1/-1/-1->1->2 [7] 2/-1/-1->1->-1
como:60725:60748 [0] NCCL INFO Channel 01/08 :    0   3   2   1
como:60725:60748 [0] NCCL INFO Channel 02/08 :    0   3   1   2
como:60725:60748 [0] NCCL INFO Channel 03/08 :    0   2   1   3
como:60725:60748 [0] NCCL INFO Channel 04/08 :    0   1   2   3
como:60725:60748 [0] NCCL INFO Channel 05/08 :    0   3   2   1
como:60725:60748 [0] NCCL INFO Channel 06/08 :    0   3   1   2
como:60725:60748 [0] NCCL INFO Channel 07/08 :    0   2   1   3
como:60725:60748 [0] NCCL INFO Trees [0] 3/-1/-1->0->-1 [1] -1/-1/-1->0->3 [2] 3/-1/-1->0->-1 [3] -1/-1/-1->0->3 [4] 3/-1/-1->0->-1 [5] -1/-1/-1->0->3 [6] 3/-1/-1->0->-1 [7] -1/-1/-1->0->3
como:60725:60749 [1] NCCL INFO Channel 00 : 1[7000] -> 2[a000] via P2P/direct pointer
como:60725:60751 [3] NCCL INFO Channel 00 : 3[b000] -> 0[6000] via P2P/direct pointer
como:60725:60751 [3] NCCL INFO Channel 03 : 3[b000] -> 0[6000] via P2P/direct pointer
como:60725:60749 [1] NCCL INFO Channel 02 : 1[7000] -> 2[a000] via P2P/direct pointer
como:60725:60750 [2] NCCL INFO Channel 00 : 2[a000] -> 3[b000] via P2P/direct pointer
como:60725:60751 [3] NCCL INFO Channel 04 : 3[b000] -> 0[6000] via P2P/direct pointer
como:60725:60749 [1] NCCL INFO Channel 04 : 1[7000] -> 2[a000] via P2P/direct pointer
como:60725:60750 [2] NCCL INFO Channel 04 : 2[a000] -> 3[b000] via P2P/direct pointer
como:60725:60751 [3] NCCL INFO Channel 07 : 3[b000] -> 0[6000] via P2P/direct pointer
como:60725:60749 [1] NCCL INFO Channel 06 : 1[7000] -> 2[a000] via P2P/direct pointer
como:60725:60748 [0] NCCL INFO Channel 00 : 0[6000] -> 1[7000] via P2P/direct pointer
como:60725:60748 [0] NCCL INFO Channel 04 : 0[6000] -> 1[7000] via P2P/direct pointer
como:60725:60748 [0] NCCL INFO Channel 03 : 0[6000] -> 2[a000] via P2P/direct pointer
como:60725:60750 [2] NCCL INFO Channel 02 : 2[a000] -> 0[6000] via P2P/direct pointer
como:60725:60749 [1] NCCL INFO Channel 03 : 1[7000] -> 3[b000] via P2P/direct pointer
como:60725:60751 [3] NCCL INFO Channel 02 : 3[b000] -> 1[7000] via P2P/direct pointer
como:60725:60748 [0] NCCL INFO Channel 07 : 0[6000] -> 2[a000] via P2P/direct pointer
como:60725:60750 [2] NCCL INFO Channel 06 : 2[a000] -> 0[6000] via P2P/direct pointer
como:60725:60749 [1] NCCL INFO Channel 07 : 1[7000] -> 3[b000] via P2P/direct pointer
como:60725:60751 [3] NCCL INFO Channel 06 : 3[b000] -> 1[7000] via P2P/direct pointer
como:60725:60748 [0] NCCL INFO Channel 01 : 0[6000] -> 3[b000] via P2P/direct pointer
como:60725:60748 [0] NCCL INFO Channel 02 : 0[6000] -> 3[b000] via P2P/direct pointer
como:60725:60750 [2] NCCL INFO Channel 01 : 2[a000] -> 1[7000] via P2P/direct pointer
como:60725:60748 [0] NCCL INFO Channel 05 : 0[6000] -> 3[b000] via P2P/direct pointer
como:60725:60750 [2] NCCL INFO Channel 03 : 2[a000] -> 1[7000] via P2P/direct pointer
como:60725:60748 [0] NCCL INFO Channel 06 : 0[6000] -> 3[b000] via P2P/direct pointer
como:60725:60750 [2] NCCL INFO Channel 05 : 2[a000] -> 1[7000] via P2P/direct pointer
como:60725:60749 [1] NCCL INFO Channel 01 : 1[7000] -> 0[6000] via P2P/direct pointer
como:60725:60751 [3] NCCL INFO Channel 01 : 3[b000] -> 2[a000] via P2P/direct pointer
como:60725:60750 [2] NCCL INFO Channel 07 : 2[a000] -> 1[7000] via P2P/direct pointer
como:60725:60749 [1] NCCL INFO Channel 05 : 1[7000] -> 0[6000] via P2P/direct pointer
como:60725:60751 [3] NCCL INFO Channel 05 : 3[b000] -> 2[a000] via P2P/direct pointer
como:60725:60749 [1] NCCL INFO Connected all rings
como:60725:60749 [1] NCCL INFO Channel 01 : 1[7000] -> 2[a000] via P2P/direct pointer
como:60725:60748 [0] NCCL INFO Connected all rings
como:60725:60750 [2] NCCL INFO Connected all rings
como:60725:60751 [3] NCCL INFO Connected all rings
como:60725:60749 [1] NCCL INFO Channel 03 : 1[7000] -> 2[a000] via P2P/direct pointer
como:60725:60749 [1] NCCL INFO Channel 05 : 1[7000] -> 2[a000] via P2P/direct pointer
como:60725:60749 [1] NCCL INFO Channel 07 : 1[7000] -> 2[a000] via P2P/direct pointer
como:60725:60750 [2] NCCL INFO Channel 01 : 2[a000] -> 3[b000] via P2P/direct pointer
como:60725:60750 [2] NCCL INFO Channel 02 : 2[a000] -> 3[b000] via P2P/direct pointer
como:60725:60750 [2] NCCL INFO Channel 03 : 2[a000] -> 3[b000] via P2P/direct pointer
como:60725:60751 [3] NCCL INFO Channel 01 : 3[b000] -> 0[6000] via P2P/direct pointer
como:60725:60750 [2] NCCL INFO Channel 05 : 2[a000] -> 3[b000] via P2P/direct pointer
como:60725:60751 [3] NCCL INFO Channel 02 : 3[b000] -> 0[6000] via P2P/direct pointer
como:60725:60750 [2] NCCL INFO Channel 06 : 2[a000] -> 3[b000] via P2P/direct pointer
como:60725:60751 [3] NCCL INFO Channel 05 : 3[b000] -> 0[6000] via P2P/direct pointer
como:60725:60750 [2] NCCL INFO Channel 07 : 2[a000] -> 3[b000] via P2P/direct pointer
como:60725:60751 [3] NCCL INFO Channel 06 : 3[b000] -> 0[6000] via P2P/direct pointer
como:60725:60748 [0] NCCL INFO Channel 00 : 0[6000] -> 3[b000] via P2P/direct pointer
como:60725:60748 [0] NCCL INFO Channel 03 : 0[6000] -> 3[b000] via P2P/direct pointer
como:60725:60748 [0] NCCL INFO Channel 04 : 0[6000] -> 3[b000] via P2P/direct pointer
como:60725:60748 [0] NCCL INFO Channel 07 : 0[6000] -> 3[b000] via P2P/direct pointer
como:60725:60751 [3] NCCL INFO Channel 00 : 3[b000] -> 2[a000] via P2P/direct pointer
como:60725:60751 [3] NCCL INFO Channel 02 : 3[b000] -> 2[a000] via P2P/direct pointer
como:60725:60750 [2] NCCL INFO Channel 00 : 2[a000] -> 1[7000] via P2P/direct pointer
como:60725:60751 [3] NCCL INFO Channel 03 : 3[b000] -> 2[a000] via P2P/direct pointer
como:60725:60750 [2] NCCL INFO Channel 02 : 2[a000] -> 1[7000] via P2P/direct pointer
como:60725:60751 [3] NCCL INFO Channel 04 : 3[b000] -> 2[a000] via P2P/direct pointer
como:60725:60750 [2] NCCL INFO Channel 04 : 2[a000] -> 1[7000] via P2P/direct pointer
como:60725:60751 [3] NCCL INFO Channel 06 : 3[b000] -> 2[a000] via P2P/direct pointer
como:60725:60750 [2] NCCL INFO Channel 06 : 2[a000] -> 1[7000] via P2P/direct pointer
como:60725:60751 [3] NCCL INFO Channel 07 : 3[b000] -> 2[a000] via P2P/direct pointer
como:60725:60749 [1] NCCL INFO Connected all trees
como:60725:60749 [1] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 8/8/512
como:60725:60749 [1] NCCL INFO 8 coll channels, 8 p2p channels, 2 p2p channels per peer
como:60725:60748 [0] NCCL INFO Connected all trees
como:60725:60748 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 8/8/512
como:60725:60748 [0] NCCL INFO 8 coll channels, 8 p2p channels, 2 p2p channels per peer
como:60725:60750 [2] NCCL INFO Connected all trees
como:60725:60750 [2] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 8/8/512
como:60725:60750 [2] NCCL INFO 8 coll channels, 8 p2p channels, 2 p2p channels per peer
como:60725:60751 [3] NCCL INFO Connected all trees
como:60725:60751 [3] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 8/8/512
como:60725:60751 [3] NCCL INFO 8 coll channels, 8 p2p channels, 2 p2p channels per peer
como:60725:60749 [1] NCCL INFO comm 0x7f784f101d90 rank 1 nranks 4 cudaDev 1 busId 7000 - Init COMPLETE
como:60725:60750 [2] NCCL INFO comm 0x7f7853101cf0 rank 2 nranks 4 cudaDev 2 busId a000 - Init COMPLETE
como:60725:60751 [3] NCCL INFO comm 0x7f7844000f60 rank 3 nranks 4 cudaDev 3 busId b000 - Init COMPLETE
como:60725:60748 [0] NCCL INFO comm 0x7f785b102010 rank 0 nranks 4 cudaDev 0 busId 6000 - Init COMPLETE
como:60725:60725 [0] NCCL INFO Launch mode Parallel
como:60725:60725 [3] NCCL INFO comm 0x7f785b102010 rank 0 nranks 4 cudaDev 0 busId 6000 - Destroy COMPLETE
como:60725:60725 [3] NCCL INFO comm 0x7f784f101d90 rank 1 nranks 4 cudaDev 1 busId 7000 - Destroy COMPLETE
como:60725:60725 [3] NCCL INFO comm 0x7f7853101cf0 rank 2 nranks 4 cudaDev 2 busId a000 - Destroy COMPLETE
como:60725:60725 [3] NCCL INFO comm 0x7f7844000f60 rank 3 nranks 4 cudaDev 3 busId b000 - Destroy COMPLETE

I don't quite understand what the NCCL INFO Trees [0] ... part represents. What does the tree topology for NCCL INFO Channel 01/08 : 0 3 2 1 look like? Is it 0 -> 3, 3 -> 2, 3 -> 1? Does every tree start with its root at 0?

Thank you so much for your help.

Xiao

Extracting the tree info:

como:60725:60748 [0] NCCL INFO Trees [0] 3/-1/-1->0->-1 [1] -1/-1/-1->0->3 [2] 3/-1/-1->0->-1 [3] -1/-1/-1->0->3 [4] 3/-1/-1->0->-1 [5] -1/-1/-1->0->3 [6] 3/-1/-1->0->-1 [7] -1/-1/-1->0->3
como:60725:60749 [1] NCCL INFO Trees [0] -1/-1/-1->1->2 [1] 2/-1/-1->1->-1 [2] -1/-1/-1->1->2 [3] 2/-1/-1->1->-1 [4] -1/-1/-1->1->2 [5] 2/-1/-1->1->-1 [6] -1/-1/-1->1->2 [7] 2/-1/-1->1->-1
como:60725:60750 [2] NCCL INFO Trees [0] 1/-1/-1->2->3 [1] 3/-1/-1->2->1 [2] 1/-1/-1->2->3 [3] 3/-1/-1->2->1 [4] 1/-1/-1->2->3 [5] 3/-1/-1->2->1 [6] 1/-1/-1->2->3 [7] 3/-1/-1->2->1
como:60725:60751 [3] NCCL INFO Trees [0] 2/-1/-1->3->0 [1] 0/-1/-1->3->2 [2] 2/-1/-1->3->0 [3] 0/-1/-1->3->2 [4] 2/-1/-1->3->0 [5] 0/-1/-1->3->2 [6] 2/-1/-1->3->0 [7] 0/-1/-1->3->2

We have 8 channels. The first one is:

[0] 3/-1/-1->0->-1
[0] -1/-1/-1->1->2
[0] 1/-1/-1->2->3
[0] 2/-1/-1->3->0

Given -1 means "no rank", it means the tree goes through the GPUs with this order: 1->2->3->0. You can do the same for the other channels, or you can also dump the graph XML and look at it; it should be consistent with that order.
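
The reading above can be checked mechanically. The sketch below assumes each entry has the form `[channel] child/child/child->rank->parent`, with -1 meaning "no rank". That interpretation matches the log in this thread, but it is an assumption rather than documented behavior, and the helper names `parse_trees` and `chain_leaf_to_root` are made up for illustration. The log lines are truncated to channels 0 and 1 to keep the example short.

```python
import re

# One entry per channel: "[c] down0/down1/down2->rank->up" (assumed layout).
TREE_ENTRY = re.compile(r"\[(\d+)\] (-?\d+)/(-?\d+)/(-?\d+)->(-?\d+)->(-?\d+)")

LOG = [
    "como:60725:60748 [0] NCCL INFO Trees [0] 3/-1/-1->0->-1 [1] -1/-1/-1->0->3",
    "como:60725:60749 [1] NCCL INFO Trees [0] -1/-1/-1->1->2 [1] 2/-1/-1->1->-1",
    "como:60725:60750 [2] NCCL INFO Trees [0] 1/-1/-1->2->3 [1] 3/-1/-1->2->1",
    "como:60725:60751 [3] NCCL INFO Trees [0] 2/-1/-1->3->0 [1] 0/-1/-1->3->2",
]


def parse_trees(line):
    """Return {channel: (rank, children, parent)} for one 'Trees' log line."""
    entries = {}
    for match in TREE_ENTRY.finditer(line):
        channel, c0, c1, c2, rank, parent = (int(g) for g in match.groups())
        children = [c for c in (c0, c1, c2) if c != -1]
        entries[channel] = (rank, children, parent)
    return entries


def chain_leaf_to_root(lines, channel):
    """Walk one channel's tree from its single leaf up to the root."""
    parent_of = {}
    leaves = []
    for line in lines:
        rank, children, parent = parse_trees(line)[channel]
        parent_of[rank] = parent
        if not children:
            leaves.append(rank)
    if len(leaves) != 1:
        raise ValueError("not a chain: expected exactly one leaf")
    chain, node = [], leaves[0]
    while node != -1:
        chain.append(node)
        node = parent_of[node]
    return chain


for ch in (0, 1):
    print(ch, "->".join(str(r) for r in chain_leaf_to_root(LOG, ch)))
```

For channel 0 this walks 1, 2, 3, 0, which is the order given above; channel 1 is the same chain walked from the other end.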

The lines:

NCCL INFO Channel 01/08 : 0 3 2 1

represent the rings, which are rotated to always start at the local rank. Since rank 0 prints the rings, they all start at 0, but those are rings, there is no start or end.

Thanks a lot for the response.

Based on what you suggested: is a channel shared by the tree and the ring? You mentioned there are 8 channels; does this mean there are 8 trees (if we choose to use Tree) or 8 rings (if we choose to use Ring)?
If the tree goes through the 4 GPUs in the order 1->2->3->0, this isn't a binary tree any more, right? It would be a degenerate tree with only one child per node?
```
1
 \
  2
   \
    3
     \
      0
```
Why isn't the tree something like this, which is slightly more balanced?
```
  3
  |
  0
 / \
1   2
```

For the dumped graph XML file, do you mean the one produced by NCCL_GRAPH_DUMP_FILE? I have the output below. How can I change this file to use the more balanced tree mentioned above? I have read the source code and there are multiple patterns; should I use pattern 3 or 2? Also, how can I represent that node 0 has two children, 1 and 2?

#define NCCL_TOPO_PATTERN_SPLIT_TREE_LOOP 1 // Split tree (send/recv from different ranks) always flowing in the same direction
#define NCCL_TOPO_PATTERN_SPLIT_TREE 2      // Split tree (send/recv from different ranks) flowing in both directions
#define NCCL_TOPO_PATTERN_TREE 3            // Simple tree (send/recv from same rank) flowing in both directions
#define NCCL_TOPO_PATTERN_RING 4            // Ring

graph.xml

<graphs version="1">
  <graph id="0" pattern="4" crossnic="0" nchannels="4" speedintra="22" speedinter="22" latencyinter="0" typeintra="NVL" typeinter="PIX" samechannels="0">
    <channel>
      <gpu dev="0"/>
      <gpu dev="1"/>
      <gpu dev="2"/>
      <gpu dev="3"/>
    </channel>
    <channel>
      <gpu dev="0"/>
      <gpu dev="3"/>
      <gpu dev="2"/>
      <gpu dev="1"/>
    </channel>
    <channel>
      <gpu dev="0"/>
      <gpu dev="3"/>
      <gpu dev="1"/>
      <gpu dev="2"/>
    </channel>
    <channel>
      <gpu dev="0"/>
      <gpu dev="2"/>
      <gpu dev="1"/>
      <gpu dev="3"/>
    </channel>
  </graph>
  <graph id="1" pattern="1" crossnic="0" nchannels="4" speedintra="22" speedinter="22" latencyinter="0" typeintra="NVL" typeinter="PIX" samechannels="0">
    <channel>
      <gpu dev="0"/>
      <gpu dev="3"/>
      <gpu dev="2"/>
      <gpu dev="1"/>
    </channel>
    <channel>
      <gpu dev="1"/>
      <gpu dev="2"/>
      <gpu dev="3"/>
      <gpu dev="0"/>
    </channel>
    <channel>
      <gpu dev="0"/>
      <gpu dev="3"/>
      <gpu dev="2"/>
      <gpu dev="1"/>
    </channel>
    <channel>
      <gpu dev="1"/>
      <gpu dev="2"/>
      <gpu dev="3"/>
      <gpu dev="0"/>
    </channel>
  </graph>
  <graph id="2" pattern="3" crossnic="0" nchannels="4" speedintra="22" speedinter="22" latencyinter="0" typeintra="NVL" typeinter="PIX" samechannels="0">
    <channel>
      <gpu dev="0"/>
      <gpu dev="1"/>
      <gpu dev="2"/>
      <gpu dev="3"/>
    </channel>
    <channel>
      <gpu dev="0"/>
      <gpu dev="3"/>
      <gpu dev="2"/>
      <gpu dev="1"/>
    </channel>
    <channel>
      <gpu dev="0"/>
      <gpu dev="3"/>
      <gpu dev="2"/>
      <gpu dev="1"/>
    </channel>
    <channel>
      <gpu dev="0"/>
      <gpu dev="2"/>
      <gpu dev="3"/>
      <gpu dev="1"/>
    </channel>
  </graph>
</graphs>
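
A dump like the one above is easier to compare against the log once it is summarized per graph. The following is a minimal sketch that uses only the Python standard library; the pattern names come from the `#define` lines quoted earlier, the sample XML is a trimmed copy of the dump (two channels per graph, most attributes dropped), and `summarize_graphs` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Names taken from the NCCL_TOPO_PATTERN_* defines quoted in this thread.
PATTERN_NAMES = {1: "SPLIT_TREE_LOOP", 2: "SPLIT_TREE", 3: "TREE", 4: "RING"}

SAMPLE = """
<graphs version="1">
  <graph id="0" pattern="4" nchannels="2">
    <channel><gpu dev="0"/><gpu dev="1"/><gpu dev="2"/><gpu dev="3"/></channel>
    <channel><gpu dev="0"/><gpu dev="3"/><gpu dev="2"/><gpu dev="1"/></channel>
  </graph>
  <graph id="1" pattern="1" nchannels="2">
    <channel><gpu dev="0"/><gpu dev="3"/><gpu dev="2"/><gpu dev="1"/></channel>
    <channel><gpu dev="1"/><gpu dev="2"/><gpu dev="3"/><gpu dev="0"/></channel>
  </graph>
</graphs>
"""


def summarize_graphs(xml_text):
    """Return a list of (graph id, pattern name, per-channel GPU orders)."""
    root = ET.fromstring(xml_text)
    summary = []
    for graph in root.findall("graph"):
        pattern = int(graph.get("pattern"))
        channels = [
            [int(gpu.get("dev")) for gpu in channel.findall("gpu")]
            for channel in graph.findall("channel")
        ]
        name = PATTERN_NAMES.get(pattern, "UNKNOWN")
        summary.append((int(graph.get("id")), name, channels))
    return summary


for graph_id, name, channels in summarize_graphs(SAMPLE):
    print(graph_id, name, channels)
```

Each channel is a flat list of GPUs, so the file format as dumped here has no obvious place to express a node with two children.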

Currently, the number of channels is shared between rings and trees, so there are as many trees as rings. It could change in the future though.
NCCL currently does not implement intra-node trees. It would require a specific intra-node topology search for perfect bandwidth and the performance gain may not materialize.
No, you can't change the file to create an intra-node tree, only different chains.

Thanks for the response!

Given -1 means "no rank", it means the tree goes through the GPUs with this order: 1->2->3->0. You can do the same for the other channels, or you can also dump the graph XML and look at it; it should be consistent with that order.

If I provide a graph.txt file for an intra-node setting with the chain 1->2->3->0 (so building a chain, not a ring), how will NCCL do allreduce and broadcast in this case? Will NCCL work if all I have is intra-node communication and the GPUs only form a chain?

Based on my understanding of how the broadcast algorithm works under the chain setting (please correct me if I'm wrong about how NCCL does the job): if the root is 2, the message will be sent 2->1 and 2->3->0, right?

allreduce is implemented as reduce + broadcast. The data will be sent 0->3 (local reduction)->2 (local reduction) and 1->2 (local reduction), and then the broadcast step above is repeated, right?

So is it possible to use a graph.txt file and have NCCL use a chain instead of a ring to communicate inside a node? (I understand this does not give optimal bandwidth; I'm just trying to use this for other ideas.)

Also, if this does work, what settings should I use inside graph.txt? graph id="2" pattern="3"?

If you provide a graph XML with the order <gpu dev="1"/> <gpu dev="2"/> <gpu dev="3"/> <gpu dev="0"/>, then the allreduce will do the reduce phase 0->3->2->1 and the broadcast phase 1->2->3->0.
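
To make the two phases concrete, the sketch below turns a channel order into the list of point-to-point transfers for each phase, following the description above (broadcast runs in the listed order, reduce runs in the reverse order). It only models the ordering; it says nothing about how NCCL schedules or pipelines the transfers, and `allreduce_phases` is a hypothetical name.

```python
def allreduce_phases(order):
    """Return (reduce_steps, broadcast_steps) as lists of (src, dst) pairs.

    `order` is the GPU order of one <channel>, with the root listed first.
    """
    broadcast_steps = list(zip(order, order[1:]))
    reversed_order = order[::-1]
    reduce_steps = list(zip(reversed_order, reversed_order[1:]))
    return reduce_steps, broadcast_steps


reduce_steps, broadcast_steps = allreduce_phases([1, 2, 3, 0])
print("reduce:   ", reduce_steps)     # [(0, 3), (3, 2), (2, 1)]
print("broadcast:", broadcast_steps)  # [(1, 2), (2, 3), (3, 0)]
```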

Broadcast does not use the "tree" description (graph id 1). It uses the "ring" description (graph id 0). Assuming the ring is defined the same way, the ring would be 1->2->3->0->1. Depending on the root of the tree, we will rotate the ring to form a chain, e.g. if the root is 3 it will be 3->0->1->2.
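
The rotation described here is a plain list rotation. A minimal sketch, with `broadcast_chain` as an illustrative name rather than anything in NCCL:

```python
def broadcast_chain(ring, root):
    """Rotate a ring (list of ranks, implicitly closed) so it starts at root."""
    start = ring.index(root)
    return ring[start:] + ring[:start]


ring = [1, 2, 3, 0]              # the ring 1->2->3->0->1
print(broadcast_chain(ring, 3))  # [3, 0, 1, 2]
print(broadcast_chain(ring, 0))  # [0, 1, 2, 3]
```

The same rotation explains why every "Channel NN/08" line printed by rank 0 starts at 0.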

Thanks for the response.

Just to clarify: I have GPUs 0, 1, 2, 3 (within a DGX node). I want Broadcast to follow a chain order with the root set to GPU 0. I also want allreduce within GPUs 0, 1, 2, 3 to follow a chain order (not a ring), with reduce phase 0->3->2->1 and broadcast phase 1->2->3->0.

A few questions I have:

I have set NCCL_GRAPH_FILE to the content below. I'm not sure whether <graph id="1" pattern="2" and <graph id="2" pattern="3" are needed, since you mentioned broadcast relies on graph id 0. I think that to make allreduce use a chain (which is a tree inside a node), we need to set graph id = 1 for the tree? I'm just worried that setting graph ids 0, 1, and 2 at the same time would create multiple working channels for one particular communicator (e.g. allreduce) at the same time, which is not what I need (I need to ensure that only one chain is created and used by NCCL).

<graphs version="1">
  <graph id="0" pattern="4" crossnic="0" nchannels="1" speedintra="21" speedinter="21" typeintra="NVL" typeinter="PIX" samechannels="0">
    <channel>
      <gpu dev="0"/>
      <gpu dev="1"/>
      <gpu dev="2"/>
      <gpu dev="3"/>
    </channel>
  </graph>
  <graph id="1" pattern="2" crossnic="0" nchannels="1" speedintra="21" speedinter="21" typeintra="NVL" typeinter="PIX" samechannels="0">
    <channel>
      <gpu dev="0"/>
      <gpu dev="1"/>
      <gpu dev="2"/>
      <gpu dev="3"/>
    </channel>
  </graph>
  <graph id="2" pattern="3" crossnic="0" nchannels="1" speedintra="21" speedinter="21" typeintra="NVL" typeinter="PIX" samechannels="0">
    <channel>
      <gpu dev="0"/>
      <gpu dev="1"/>
      <gpu dev="2"/>
      <gpu dev="3"/>
    </channel>
  </graph>
</graphs>

Do I need to set NCCL_ALGO=Ring/Tree to enable allreduce using the chain method (because technically this is just allreduce with a tree structure within a node)? I'm thinking maybe I should set NCCL_ALGO=Tree, but I worry this may influence how broadcast works (since, as you mentioned, it works as a ring). If I don't set NCCL_ALGO=Tree, how can I know allreduce is actually running the chain and not the ring?
Do I need to set NCCL_PROTO=Simple?

I'm using NCCL v2.7.8

I have set NCCL_ALGO=Ring, set NCCL_GRAPH_FILE to the content above (so that only 1 channel is used by NCCL), and added a printf at the beginning of __device__ void ncclAllReduceRingKernel(struct CollectiveArgs* args);. I notice the collective kernel is called more than once per GPU, and I'm wondering why. I used to think that one kernel would be launched per channel on each GPU, but in my case I have restricted NCCL to one channel, so why are there still multiple kernel calls?

I want allreduce within GPU 0,1,2,3 to also follow a chain order (not ring) with reduce phase 0->3->2->1 and the broadcast phase 1->2->3->0.

Not sure why you would want that but I believe for that you should set:

<graph id="1" pattern="2" crossnic="0" nchannels="1" speedintra="21" speedinter="21" typeintra="NVL" typeinter="PIX" samechannels="0">
    <channel>
      <gpu dev="1"/>
      <gpu dev="2"/>
      <gpu dev="3"/>
      <gpu dev="0"/>
    </channel>
  </graph>

I'm not sure whether <graph id="1" pattern="2" and <graph id="2" pattern="3" are needed, since you mentioned broadcast relies on graph id 0. I think that to make allreduce use a chain (which is a tree inside a node), we need to set graph id = 1 for the tree?

Are we talking about ncclBroadcast and ncclReduce or about ncclAllReduce? I'm confused. The broadcast/reduce phases inside ncclAllReduce have nothing in common with ncclBroadcast or ncclReduce run by themselves.

I'm just worried that setting graph ids 0, 1, and 2 at the same time would create multiple working channels for one particular communicator (e.g. allreduce) at the same time.

Each graph definition corresponds to one algorithm. Each NCCL operation runs a single algorithm. If you set NCCL_ALGO=TREE, allreduce will only follow the graph with id 1.

I'm thinking maybe I should set NCCL_ALGO=Tree, but I worry this may influence how broadcast works

ncclBroadcast will only use the ring definition regardless of NCCL_ALGO because it does not have an implementation of the tree algorithm.

Do I need to set NCCL_PROTO=Simple?

I don't see why you'd want to do that unless other protocols don't work well.

I'm using NCCL v2.7.8. I notice the collective kernel is called more than once per GPU.

You should probably update to 2.12 if you're going to play around with the code. I could have forgotten how 2.7 works. If you want only one channel to run you should set NCCL_MAX_NCHANNELS=1.
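
Putting the settings discussed in this thread together, one way to launch a test is to build the environment explicitly. This is only a sketch: the environment variable names are the NCCL ones mentioned above plus NCCL_DEBUG=INFO to get the log output, while `nccl_env`, the file name `graph.xml`, and the binary `./my_allreduce_test` are placeholders.

```python
import os


def nccl_env(graph_file, algo="Tree", max_channels=1, base=None):
    """Return a copy of the environment with the NCCL settings applied."""
    env = dict(os.environ if base is None else base)
    env.update({
        "NCCL_DEBUG": "INFO",                     # print the Trees/Channel lines
        "NCCL_ALGO": algo,                        # "Tree" or "Ring"
        "NCCL_GRAPH_FILE": graph_file,            # hand-written graph XML
        "NCCL_MAX_NCHANNELS": str(max_channels),  # cap the number of channels
    })
    return env


env = nccl_env("graph.xml", base={})
for key in sorted(env):
    print(f"{key}={env[key]}")

# Hypothetical launch; replace the path with your own test program:
# import subprocess
# subprocess.run(["./my_allreduce_test"], env=nccl_env("graph.xml"), check=True)
```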

Dear NCCL Developer, I'm wondering whether the nodes in "NCCL INFO Trees [0]" in different machines' logs belong to the same channel [0] when I'm using multiple machines with multiple GPUs?

NVIDIA / nccl

Understand the tree topology #671