Open XiaoSong9905 opened 2 years ago
Extracting the tree info:
como:60725:60748 [0] NCCL INFO Trees [0] 3/-1/-1->0->-1 [1] -1/-1/-1->0->3 [2] 3/-1/-1->0->-1 [3] -1/-1/-1->0->3 [4] 3/-1/-1->0->-1 [5] -1/-1/-1->0->3 [6] 3/-1/-1->0->-1 [7] -1/-1/-1->0->3
como:60725:60749 [1] NCCL INFO Trees [0] -1/-1/-1->1->2 [1] 2/-1/-1->1->-1 [2] -1/-1/-1->1->2 [3] 2/-1/-1->1->-1 [4] -1/-1/-1->1->2 [5] 2/-1/-1->1->-1 [6] -1/-1/-1->1->2 [7] 2/-1/-1->1->-1
como:60725:60750 [2] NCCL INFO Trees [0] 1/-1/-1->2->3 [1] 3/-1/-1->2->1 [2] 1/-1/-1->2->3 [3] 3/-1/-1->2->1 [4] 1/-1/-1->2->3 [5] 3/-1/-1->2->1 [6] 1/-1/-1->2->3 [7] 3/-1/-1->2->1
como:60725:60751 [3] NCCL INFO Trees [0] 2/-1/-1->3->0 [1] 0/-1/-1->3->2 [2] 2/-1/-1->3->0 [3] 0/-1/-1->3->2 [4] 2/-1/-1->3->0 [5] 0/-1/-1->3->2 [6] 2/-1/-1->3->0 [7] 0/-1/-1->3->2
We have 8 channels. The first one is:
[0] 3/-1/-1->0->-1
[0] -1/-1/-1->1->2
[0] 1/-1/-1->2->3
[0] 2/-1/-1->3->0
Given that -1 means "no rank", the tree goes through the GPUs in this order: 1->2->3->0. You can do the same for the other channels, or you can also dump the graph XML and look at it; it should be consistent with that order.
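The reading described above can be sketched in a few lines of Python. This is only an illustration: it assumes each entry has the shape `[channel] a/b/c->rank->next`, where -1 means "no rank", and it simply follows the right-hand arrow from rank to rank. The names `ENTRY`, `parse_trees` and `chain_for_channel` are made up for this sketch, the log lines are cut down to channels [0] and [1] from the output above, and the walk only works when the tree is a plain chain.

```python
import re

# Log lines from the output above, truncated to channels [0] and [1].
LOG_LINES = [
    "como:60725:60748 [0] NCCL INFO Trees [0] 3/-1/-1->0->-1 [1] -1/-1/-1->0->3",
    "como:60725:60749 [1] NCCL INFO Trees [0] -1/-1/-1->1->2 [1] 2/-1/-1->1->-1",
    "como:60725:60750 [2] NCCL INFO Trees [0] 1/-1/-1->2->3 [1] 3/-1/-1->2->1",
    "como:60725:60751 [3] NCCL INFO Trees [0] 2/-1/-1->3->0 [1] 0/-1/-1->3->2",
]

# Assumed entry shape: "[channel] a/b/c->rank->next" (-1 means "no rank").
ENTRY = re.compile(r"\[(\d+)\] (-?\d+)/(-?\d+)/(-?\d+)->(-?\d+)->(-?\d+)")

def parse_trees(line):
    """Return {channel: (rank, [ranks on the left], rank on the right)}."""
    out = {}
    for m in ENTRY.finditer(line):
        ch, a, b, c, rank, nxt = map(int, m.groups())
        out[ch] = (rank, [r for r in (a, b, c) if r != -1], nxt)
    return out

def chain_for_channel(lines, channel):
    """Follow the right-hand arrows for one channel (chains only)."""
    nxt = {}
    for line in lines:
        rank, _, right = parse_trees(line)[channel]
        nxt[rank] = right
    # The chain starts at the rank that no other rank points to.
    targets = set(nxt.values())
    order = [next(r for r in nxt if r not in targets)]
    while nxt[order[-1]] != -1:
        order.append(nxt[order[-1]])
    return order

print(chain_for_channel(LOG_LINES, 0))
print(chain_for_channel(LOG_LINES, 1))
```

For channel [0] this reproduces the 1->2->3->0 order quoted above; channel [1] comes out in the opposite direction.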
The lines:
NCCL INFO Channel 01/08 : 0 3 2 1
represent the rings, which are rotated to always start at the local rank. Since rank 0 prints the rings, they all start at 0, but those are rings, so there is no real start or end.
Thanks a lot for the response.
Based on what you suggested: is a channel shared by the tree and the ring? You mentioned there are 8 channels; does this mean there are 8 trees (if we choose to use tree) or 8 rings (if we choose to use ring)?
If the tree goes through the 4 GPUs in the order 1->2->3->0, this won't be a binary tree any more, right? It would be a tilted tree with only 1 child per node?
1
 \
  2
   \
    3
     \
      0
Why is the tree not something like this, which is slightly more balanced?
  3
  |
  0
 / \
1   2
For the dumped graph XML file, do you mean the one produced with NCCL_GRAPH_DUMP_FILE? I have the output below. How can I change this file to use the more balanced tree mentioned above? I have read the source code and there are multiple patterns; should I use pattern 3 or 2? Also, how can I represent that node 0 has two children, 1 and 2?
#define NCCL_TOPO_PATTERN_SPLIT_TREE_LOOP 1 // Split tree (send/recv from different ranks) always flowing in the same direction
#define NCCL_TOPO_PATTERN_SPLIT_TREE 2 // Split tree (send/recv from different ranks) flowing in both directions
#define NCCL_TOPO_PATTERN_TREE 3 // Simple tree (send/recv from same rank) flowing in both directions
#define NCCL_TOPO_PATTERN_RING 4 // Ring
graph.xml
<graphs version="1">
<graph id="0" pattern="4" crossnic="0" nchannels="4" speedintra="22" speedinter="22" latencyinter="0" typeintra="NVL" typeinter="PIX" samechannels="0">
<channel>
<gpu dev="0"/>
<gpu dev="1"/>
<gpu dev="2"/>
<gpu dev="3"/>
</channel>
<channel>
<gpu dev="0"/>
<gpu dev="3"/>
<gpu dev="2"/>
<gpu dev="1"/>
</channel>
<channel>
<gpu dev="0"/>
<gpu dev="3"/>
<gpu dev="1"/>
<gpu dev="2"/>
</channel>
<channel>
<gpu dev="0"/>
<gpu dev="2"/>
<gpu dev="1"/>
<gpu dev="3"/>
</channel>
</graph>
<graph id="1" pattern="1" crossnic="0" nchannels="4" speedintra="22" speedinter="22" latencyinter="0" typeintra="NVL" typeinter="PIX" samechannels="0">
<channel>
<gpu dev="0"/>
<gpu dev="3"/>
<gpu dev="2"/>
<gpu dev="1"/>
</channel>
<channel>
<gpu dev="1"/>
<gpu dev="2"/>
<gpu dev="3"/>
<gpu dev="0"/>
</channel>
<channel>
<gpu dev="0"/>
<gpu dev="3"/>
<gpu dev="2"/>
<gpu dev="1"/>
</channel>
<channel>
<gpu dev="1"/>
<gpu dev="2"/>
<gpu dev="3"/>
<gpu dev="0"/>
</channel>
</graph>
<graph id="2" pattern="3" crossnic="0" nchannels="4" speedintra="22" speedinter="22" latencyinter="0" typeintra="NVL" typeinter="PIX" samechannels="0">
<channel>
<gpu dev="0"/>
<gpu dev="1"/>
<gpu dev="2"/>
<gpu dev="3"/>
</channel>
<channel>
<gpu dev="0"/>
<gpu dev="3"/>
<gpu dev="2"/>
<gpu dev="1"/>
</channel>
<channel>
<gpu dev="0"/>
<gpu dev="3"/>
<gpu dev="2"/>
<gpu dev="1"/>
</channel>
<channel>
<gpu dev="0"/>
<gpu dev="2"/>
<gpu dev="3"/>
<gpu dev="1"/>
</channel>
</graph>
</graphs>
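To compare a dump like the one above against the "Trees" and "Channel" log lines, it can help to list the GPU order of every channel per graph. Below is a minimal sketch using Python's standard `xml.etree.ElementTree`. The embedded XML is a trimmed copy of the dump (two graphs, two channels each, most attributes dropped), the pattern names come from the #define list quoted above, and `channel_orders` is a helper name made up for this sketch.

```python
import xml.etree.ElementTree as ET

# Pattern ids as listed in the #define lines quoted above.
PATTERN_NAMES = {1: "SPLIT_TREE_LOOP", 2: "SPLIT_TREE", 3: "TREE", 4: "RING"}

# Trimmed copy of the dumped file (illustration only, attributes abridged).
GRAPH_XML = """<graphs version="1">
  <graph id="0" pattern="4" nchannels="2">
    <channel><gpu dev="0"/><gpu dev="1"/><gpu dev="2"/><gpu dev="3"/></channel>
    <channel><gpu dev="0"/><gpu dev="3"/><gpu dev="2"/><gpu dev="1"/></channel>
  </graph>
  <graph id="1" pattern="1" nchannels="2">
    <channel><gpu dev="0"/><gpu dev="3"/><gpu dev="2"/><gpu dev="1"/></channel>
    <channel><gpu dev="1"/><gpu dev="2"/><gpu dev="3"/><gpu dev="0"/></channel>
  </graph>
</graphs>"""

def channel_orders(xml_text):
    """Return {graph id: (pattern name, [GPU order of each channel])}."""
    root = ET.fromstring(xml_text)
    result = {}
    for graph in root.findall("graph"):
        gid = int(graph.get("id"))
        pattern = PATTERN_NAMES.get(int(graph.get("pattern")), "UNKNOWN")
        orders = [[int(g.get("dev")) for g in ch.findall("gpu")]
                  for ch in graph.findall("channel")]
        result[gid] = (pattern, orders)
    return result

for gid, (pattern, orders) in channel_orders(GRAPH_XML).items():
    print(gid, pattern, orders)
```

To inspect a real dump, the same function could be fed the contents of the file written via NCCL_GRAPH_DUMP_FILE.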
Thanks for the response!
Given that -1 means "no rank", the tree goes through the GPUs in this order: 1->2->3->0. You can do the same for the other channels, or you can also dump the graph XML and look at it; it should be consistent with that order.
If I provide a graph.txt file for an intra-node setting with the chain 1->2->3->0 (so building a chain, not a ring), how will NCCL do allreduce and broadcast in this case? Will NCCL work if all I have is intra-node communication and the GPUs only form a chain?
Based on my understanding of how the broadcast algorithm works under the chain setting (please correct me if I'm wrong about how NCCL does the job): if the root is 2, the message will be sent 2->1 and 2->3->0, right?
allreduce is implemented with reduce + broadcast. The data will be sent 0->3 (local reduction)->2 (local reduction) and 1->2 (local reduction), and then the broadcast step above is repeated, right?
So is it possible to use a graph.txt file and have NCCL use a chain to communicate instead of a ring inside a node? (I understand this is not optimal for bandwidth; I'm just trying to use this for other ideas.)
Also, if this does work, what settings should I use inside graph.txt? graph id="2" pattern="3"?
If you provide a graph XML with the order <gpu dev="1"/><gpu dev="2"/><gpu dev="3"/><gpu dev="0"/>, then the allreduce will do the reduce phase 0->3->2->1 and the broadcast phase 1->2->3->0.
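As a small illustration of that statement, the two phases can be written out as lists of (sender, receiver) hops. This is only a model of the data flow described in the reply, not NCCL code; `allreduce_phases` is a hypothetical helper name and it assumes the channel is a plain chain.

```python
def allreduce_phases(order):
    """Model the hops of a chain allreduce for a given GPU order.

    Reduce walks the order backwards towards order[0]; broadcast walks it
    forwards again, as described in the reply above.
    """
    up = order[::-1]
    reduce_hops = list(zip(up, up[1:]))
    broadcast_hops = list(zip(order, order[1:]))
    return reduce_hops, broadcast_hops

reduce_hops, broadcast_hops = allreduce_phases([1, 2, 3, 0])
print("reduce:   ", " ".join(f"{a}->{b}" for a, b in reduce_hops))
print("broadcast:", " ".join(f"{a}->{b}" for a, b in broadcast_hops))
```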
Broadcast does not use the "tree" description (graph id 1). It uses the "ring" description (graph id 0). Assuming the ring is defined the same way, the ring would be 1->2->3->0->1. Depending on the root of the tree, we will rotate the ring to form a chain, e.g. if the root is 3 it will be 3->0->1->2.
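The rotation described here is easy to model. The sketch below assumes the ring is given as a list in send order and returns the chain that starts at the broadcast root; `broadcast_chain` is a name chosen for this illustration and is not an NCCL function.

```python
def broadcast_chain(ring, root):
    """Rotate a ring (list in send order) so that it starts at the root."""
    i = ring.index(root)
    return ring[i:] + ring[:i]

ring = [1, 2, 3, 0]  # the ring 1->2->3->0->1 from the reply above
print(broadcast_chain(ring, 3))
print(broadcast_chain(ring, 1))
```

With root 3 this gives the chain 3->0->1->2 mentioned in the reply.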
Thanks for the response.
Just to clarify: I have GPUs 0,1,2,3 (within a DGX node). I want Broadcast to follow a chain order with the root set to GPU 0. I want allreduce within GPUs 0,1,2,3 to also follow a chain order (not ring), with reduce phase 0->3->2->1 and broadcast phase 1->2->3->0.
A few questions I have:
I set NCCL_GRAPH_FILE to be like below. I'm not sure if the <graph id="1" pattern="2" and <graph id="2" pattern="3" entries are needed, since you mentioned broadcast relies on graph id 0. I feel that if we want allreduce to use a chain (which is a tree inside a node), we need to set graph id = 1 for the tree? I'm just worried that setting graph id = 0 & 1 & 2 at the same time would create multiple working channels for one particular communicator (e.g. allreduce) at the same time, which is not what I need (I need to ensure only one chain is created and used by NCCL).
<graphs version="1">
<graph id="0" pattern="4" crossnic="0" nchannels="1" speedintra="21" speedinter="21" typeintra="NVL" typeinter="PIX" samechannels="0">
<channel>
<gpu dev="0"/>
<gpu dev="1"/>
<gpu dev="2"/>
<gpu dev="3"/>
</channel>
</graph>
<graph id="1" pattern="2" crossnic="0" nchannels="1" speedintra="21" speedinter="21" typeintra="NVL" typeinter="PIX" samechannels="0">
<channel>
<gpu dev="0"/>
<gpu dev="1"/>
<gpu dev="2"/>
<gpu dev="3"/>
</channel>
</graph>
<graph id="2" pattern="3" crossnic="0" nchannels="1" speedintra="21" speedinter="21" typeintra="NVL" typeinter="PIX" samechannels="0">
<channel>
<gpu dev="0"/>
<gpu dev="1"/>
<gpu dev="2"/>
<gpu dev="3"/>
</channel>
</graph>
</graphs>
Do I need to set NCCL_ALGO=Ring/Tree to enable AllReduce using the chain method (because technically this is just AllReduce with a tree structure within a node)? I'm thinking maybe I should set NCCL_ALGO=Tree, but I worry this may influence how broadcast works (since, as you mentioned, it works as a ring). If I don't set NCCL_ALGO=Tree, how can I know allreduce is actually running the chain and not the ring?
Do I need to set NCCL_PROTO=Simple?
I'm using NCCL v2.7.8. When I set NCCL_ALGO=Ring, set the NCCL_GRAPH_FILE to the content above (so that only 1 channel is used by NCCL), and add a printf message at the beginning of __device__ void ncclAllReduceRingKernel(struct CollectiveArgs* args); I notice the collective kernel is called more than once per GPU. I'm wondering why that's the case. I used to think that we would launch one kernel per channel on each GPU, but in my case I have set NCCL to only consider one channel, so why would there still be multiple kernel calls?
I want allreduce within GPU 0,1,2,3 to also follow a chain order (not ring) with reduce phase 0->3->2->1 and the broadcast phase 1->2->3->0.
Not sure why you would want that but I believe for that you should set:
<graph id="1" pattern="2" crossnic="0" nchannels="1" speedintra="21" speedinter="21" typeintra="NVL" typeinter="PIX" samechannels="0">
<channel>
<gpu dev="1"/>
<gpu dev="2"/>
<gpu dev="3"/>
<gpu dev="0"/>
</channel>
</graph>
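If the graph file is going to be edited repeatedly with different GPU orders, it may be easier to generate it. The sketch below builds a one-channel graph with the attributes shown in the snippet above, using Python's standard `xml.etree.ElementTree`. `build_tree_graph` is a hypothetical helper, and whether NCCL accepts a file containing only this single graph is an assumption that the thread does not confirm.

```python
import xml.etree.ElementTree as ET

def build_tree_graph(order, graph_id=1, pattern=2):
    """Build a one-channel graph XML string for the given GPU order.

    Attribute values are copied from the snippet above; they are not
    validated against any NCCL version.
    """
    graphs = ET.Element("graphs", version="1")
    graph = ET.SubElement(graphs, "graph", {
        "id": str(graph_id), "pattern": str(pattern), "crossnic": "0",
        "nchannels": "1", "speedintra": "21", "speedinter": "21",
        "typeintra": "NVL", "typeinter": "PIX", "samechannels": "0",
    })
    channel = ET.SubElement(graph, "channel")
    for dev in order:
        ET.SubElement(channel, "gpu", dev=str(dev))
    return ET.tostring(graphs, encoding="unicode")

xml_text = build_tree_graph([1, 2, 3, 0])
print(xml_text)
```

The resulting string can be written to the path that NCCL_GRAPH_FILE points to.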
I'm not sure if the <graph id="1" pattern="2" and <graph id="2" pattern="3" entries are needed, since you mentioned broadcast relies on graph id 0. I feel that if we want allreduce to use a chain (which is a tree inside a node), we need to set graph id = 1 for the tree?
Are we talking about ncclBroadcast and ncclReduce or about ncclAllReduce? I'm confused. The broadcast/reduce phases inside ncclAllReduce have nothing in common with ncclBroadcast or ncclReduce run by themselves.
I'm just worried that setting graph id = 0 & 1 & 2 at the same time would create multiple working channels for one particular communicator (e.g. allreduce) at the same time.
Each graph definition corresponds to one algorithm. Each NCCL operation runs a single algorithm. If you set NCCL_ALGO=TREE, allreduce will only follow the graph with id 1.
I'm thinking maybe I should set NCCL_ALGO=Tree, but I worry this may influence how broadcast works
ncclBroadcast will only use the ring definition regardless of NCCL_ALGO because it does not have an implementation of the tree algorithm.
Do I need to set NCCL_PROTO=Simple?
I don't see why you'd want to do that unless other protocols don't work well.
I'm using NCCL v2.7.8. I notice the collective kernel is called more than once per GPU.
You should probably update to 2.12 if you're going to play around with the code. I could have forgotten how 2.7 works. If you want only one channel to run you should set NCCL_MAX_NCHANNELS=1.
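Putting the settings from this reply together, a launcher script could prepare the environment as sketched below. The variables NCCL_GRAPH_FILE, NCCL_ALGO, NCCL_MAX_NCHANNELS and NCCL_DEBUG are the ones discussed in this thread or visible in its log output; the helper name `nccl_env`, the file name graph.xml and the commented-out test binary are placeholders for this example.

```python
import os

def nccl_env(graph_file, algo="TREE", max_channels=1):
    """Return a copy of the current environment with the NCCL settings
    discussed above applied."""
    env = dict(os.environ)
    env.update({
        "NCCL_GRAPH_FILE": graph_file,            # user-provided graph XML
        "NCCL_ALGO": algo,                        # allreduce follows the tree graph
        "NCCL_MAX_NCHANNELS": str(max_channels),  # limit to a single channel
        "NCCL_DEBUG": "INFO",                     # print the Trees/Channel lines
    })
    return env

env = nccl_env("graph.xml")
# Hypothetical launch of your own test program:
# import subprocess
# subprocess.run(["./my_allreduce_test"], env=env, check=True)
print({k: v for k, v in env.items() if k.startswith("NCCL_")})
```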
Dear NCCL Developer, I'm wondering whether the nodes in "NCCL INFO Trees [0]" in different machines' logs belong to the same channel [0] when I'm using multiple machines with multiple GPUs?
Dear NCCL Developer,
I'm confused about the Tree topology used in the 4 GPU DGX1-V100 (GPU 0,1,2,3) algorithm. My topology file looks like this
The NCCL INFO output when running allreduce & set NCCL_ALGO=Tree is
I don't quite understand what the NCCL INFO Trees [0] ... part represents. What does the tree topology of channel NCCL INFO Channel 01/08 : 0 3 2 1 look like? 0 -> 3, 3 -> 2, 3->1? So does every tree start with its root at 0?
Thank you so much for the help.
Xiao