[Open] XiaoSong9905 opened this issue 2 years ago
I'm not sure I understand why you want to do that. Do you think you would get better performance? Can you explain why?
What you describe here seems to me to be a LOT of work. Sure, the tree algorithm already has most of that functionality, but you'd need to change the XML representation to be able to represent trees and not just chains, and more importantly you'd need to add support for intermediate nodes which are not part of the communicator, which is absolutely not supported today, since ranks outside the communicator are not even in the topology graph.
Two concurrent NCCL allreduces can work in parallel if they use different CUDA streams. They're not guaranteed to run in parallel, though, and could even deadlock each other if different GPUs launch them in a different order and they block each other. Most of the time it works, but there is no way to guarantee it will always work.
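Concretely, if you do run two allreduces on two communicators that share GPUs, every rank should at least issue them in the same order. A minimal sketch, assuming the communicators, streams, and buffers below already exist (the names are placeholders):

```cpp
#include <nccl.h>
#include <cuda_runtime.h>

// If one rank launches (A then B) while another launches (B then A), each
// collective can end up waiting on peers that are busy in the other one,
// which is how the deadlock mentioned above can happen.
void issueBothAllreduces(const float* sendA, float* recvA, size_t countA,
                         const float* sendB, float* recvB, size_t countB,
                         ncclComm_t commA, ncclComm_t commB,
                         cudaStream_t streamA, cudaStream_t streamB) {
  // Same order on every rank:
  ncclAllReduce(sendA, recvA, countA, ncclFloat, ncclSum, commA, streamA);
  ncclAllReduce(sendB, recvB, countB, ncclFloat, ncclSum, commB, streamB);
  // Because they are on different streams, the two operations may overlap on
  // the GPU, but that overlap is opportunistic, not guaranteed.
}
```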
Thank you so much for the response.
We do this to get better bandwidth. We have a situation where 4 GPUs are available but only the first 3 are actually running the model, so we want to use the idle bandwidth between GPUs 0,1,2 and GPU 3 to improve the bandwidth of collective operations on GPUs 0,1,2.
I don't think there will be an issue with ranks outside the communicator not being in the topology graph. We'll just create two communicators, one with GPUs 0,1,2 and one with GPUs 0,1,2,3. The first 70% of the data will be sent by the communicator with GPUs 0,1,2, and the remaining 30% by the communicator with GPUs 0,1,2,3. The tree idea mentioned above can be done with a modified tree kernel plus the communicator with GPUs 0,1,2,3.
Can you explain a little more what you mean by "XML representation to be able to represent trees and not just chains"? Does NCCL_GRAPH_FILE not support specifying a tree?
Given that I'll use the 4-GPU communicator, GPU 3 would now be part of the communicator. Is there a way to add a customized kernel that only uses the NCCL internal buffer?
0 ---- 3
| \ / |
| / \ |
1 ---- 2
Based on the feedback you provided in https://github.com/NVIDIA/nccl/issues/672, it seems that NCCL currently does not support an intra-node tree and the user cannot input an intra-node tree via the graph.txt file. Do you have any suggestions on where I should make modifications to enable an intra-node tree (just a single tree, not packing multiple binary trees to maximize bandwidth)?
Creating a 4-GPU communicator, you would get 44 GB/s already. You only need to have GPU 3 set its buffer to all zeroes and you have your 3-way allreduce at 44 GB/s. That's less complicated than running a second 4-GPU communicator alongside the 3-GPU one, and should have better performance.
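As a rough sketch (the buffer, communicator, and stream names are placeholders), that idea amounts to:

```cpp
#include <nccl.h>
#include <cuda_runtime.h>

// 3-way allreduce over a 4-GPU communicator: GPU 3 contributes the identity
// element of the reduction (0 for sum), so it only adds bandwidth, not data.
void threeWayAllreduceOnFourGpus(int rank, float* sendbuf, float* recvbuf,
                                 size_t count, ncclComm_t comm4,
                                 cudaStream_t stream) {
  if (rank == 3) {
    // GPU 3 has no real input: zero its send buffer before the collective.
    cudaMemsetAsync(sendbuf, 0, count * sizeof(float), stream);
  }
  // All four ranks participate; ranks 0-2 read their usual result from
  // recvbuf, rank 3 simply ignores its copy.
  ncclAllReduce(sendbuf, recvbuf, count, ncclFloat, ncclSum, comm4, stream);
}
```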
Hi NCCL Developers,
I'm trying to extend NCCL to support some extra functionality. Specifically, I have 4 GPUs connected (half of a DGX-1 with V100s) and I'm trying to build a communicator with only GPUs 0,1,2. With GPUs 0,1,2, NCCL would choose either the tree or the ring algorithm.
My Plan:
I want to add another path 1 -> 3, 3 -> 2, 3 -> 0 (and vice versa for the complementary tree) to communicate part of the data for a communicator with only 0,1,2, so that 3 does not have input/output buffers and is only used to increase bandwidth.
This can be considered as adding an additional tree with the root at 1, an intermediate node at 3, and leaves at 0 and 2. Node 3 (unlike how NCCL normally implements the tree algorithm) does not have input or output data, and only uses the NCCL internal buffer (defined by BUFFER_SIZE) to take data from 1 and pass it on to 0 and 2 (at least this is what I plan to achieve, but I'm not sure; please see the questions below).

I'm planning to add support for broadcast and allreduce. Based on what I understand from other GitHub issues, the tree allreduce in NCCL is implemented as a reduce-to-root followed by a broadcast, with different channels having different roots. Given the tree above, leaves 0 and 2 will send their data to 3, 3 will store the incoming data in its internal buffer and send it on to 1, and 1 will do the reduction and broadcast the result back down.
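For reference, the same data flow can also be written against the public point-to-point API (ncclSend / ncclRecv, available since NCCL 2.7) on the 4-GPU communicator, without touching NCCL's internal buffers. This is only an illustrative sketch of the reduce-to-root half (0,2 -> 3 -> 1); the relay's scratch buffers and the small add kernel are my own helpers, not NCCL internals:

```cpp
#include <nccl.h>
#include <cuda_runtime.h>

// Hypothetical helper: element-wise in-place add used by the relay and the root.
__global__ void addInPlace(float* dst, const float* src, size_t n) {
  size_t i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) dst[i] += src[i];
}

// scratch/tmp are device buffers GPU 3 allocates itself (cudaMalloc); they
// stand in for the "internal buffer" role in the plan above.
void sideChannelReduce(int rank, const float* input, float* rootResult,
                       float* scratch, float* tmp, size_t count,
                       ncclComm_t comm4, cudaStream_t stream) {
  int blocks = (int)((count + 255) / 256);
  if (rank == 0 || rank == 2) {
    // Leaves: push their chunk toward the relay GPU 3.
    ncclSend(input, count, ncclFloat, /*peer=*/3, comm4, stream);
  } else if (rank == 3) {
    // Relay: no user input/output; receive both leaves into scratch memory,
    // reduce them, then forward the partial sum to the root.
    ncclGroupStart();
    ncclRecv(scratch, count, ncclFloat, /*peer=*/0, comm4, stream);
    ncclRecv(tmp,     count, ncclFloat, /*peer=*/2, comm4, stream);
    ncclGroupEnd();
    addInPlace<<<blocks, 256, 0, stream>>>(scratch, tmp, count);
    ncclSend(scratch, count, ncclFloat, /*peer=*/1, comm4, stream);
  } else { // rank == 1, the root of this side-channel tree
    ncclRecv(rootResult, count, ncclFloat, /*peer=*/3, comm4, stream);
    addInPlace<<<blocks, 256, 0, stream>>>(rootResult, input, count);
    // The broadcast back down (1 -> 3 -> 0,2) would mirror these calls.
  }
}
```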
I'm planning to run two communicators: one with GPUs 0,1,2 running the standard NCCL algorithm, and the other with GPUs 0,1,2,3 running the side-channel tree algorithm mentioned above. So the code would be roughly like the sketch below.
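(Rough sketch only: a single process drives all four GPUs, error checking is omitted, buffer/stream names are placeholders, and comms3 / comms4 would come from ncclCommInitAll over {0,1,2} and {0,1,2,3}. The 70/30 split is the one described above.)

```cpp
#include <nccl.h>
#include <cuda_runtime.h>

// Issue one allreduce per communicator: the first ~70% of the elements over
// the 3-GPU communicator, the remaining ~30% over the 4-GPU one.
void splitAllreduce(float* send[4], float* recv[4], size_t count,
                    ncclComm_t comms3[3], cudaStream_t streams3[3],
                    ncclComm_t comms4[4], cudaStream_t streams4[4]) {
  size_t n3 = count * 7 / 10;   // chunk handled by GPUs 0,1,2 only
  size_t n4 = count - n3;       // chunk handled by GPUs 0,1,2,3

  // Stock NCCL allreduce on the 3-GPU communicator. When one thread drives
  // several communicators, the per-device calls must be grouped.
  ncclGroupStart();
  for (int i = 0; i < 3; ++i)
    ncclAllReduce(send[i], recv[i], n3, ncclFloat, ncclSum, comms3[i], streams3[i]);
  ncclGroupEnd();

  // Second chunk on the 4-GPU communicator. In the plan above this would run
  // the modified tree kernel where GPU 3 only relays data; with the stock
  // kernel, GPU 3 needs its own (e.g. zeroed) send[3]/recv[3] buffers.
  ncclGroupStart();
  for (int i = 0; i < 4; ++i)
    ncclAllReduce(send[i] + n3, recv[i] + n3, n4, ncclFloat, ncclSum, comms4[i], streams4[i]);
  ncclGroupEnd();
}
```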
What I know
Given the 4-GPU communicator, its tree structure can be set by an external file (or computed based on the topology, which is how NCCL does it in most cases). I have changed the logic here so that it only considers the single tree mentioned above.
What I'm not sure about
Will two allreduce calls on different communicators & streams block each other (i.e., does the first allreduce need to finish before the second can run)? I'm planning to send 70% of the data using the communicator and algorithm provided by NCCL, and 30% of the data using the side-channel tree idea mentioned above with a 4-GPU communicator and a customized kernel (explained in the code above).
For the tree to run correctly, GPU 3 needs to realize it does not have an input buffer or output buffer and should use the internal buffer to store data. Also, GPU 3 should only do the reduction with inputs from 0 and 2 (not itself, since it has no input buffer) and send the reduction result to 1. I'm planning to have an if branch inside the allreduce tree kernel so that GPU 3 follows this behavior.
I'm currently stuck on having GPU 3 use the NCCL internal buffer to send and receive data and do the reduction with the result saved to the internal buffer. I'm not sure how to implement this.
I think I should change the primitives class, but all the APIs of the primitives class use the user-provided input/output buffers, which makes me even more unsure of where to make changes.
Where is the NCCL internal buffer (defined by BUFFER_SIZE) used during a collective call?
How should I change the graph.txt file to represent the above topology? I got this graph.txt file from another GitHub issue. I think pattern=2 indicates using a tree, but I'm not sure what speedintra, speedinter, typeintra, and typeinter should be set to, or how the GPU list inside <channel></channel> should be set.
I'm using NCCL v2.7.8, which is slightly easier to add code on top of.
Thank you so much for the help.
Xiao