NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

What is CollNet? #320

Closed changlan closed 4 years ago

changlan commented 4 years ago

NCCL 2.6 added a new algorithm called CollNet, but I could not find any documentation about it. It seems to be related to SHARP, but it is not clear to me what its relationship is to https://github.com/Mellanox/nccl-rdma-sharp-plugins. Would you describe what CollNet is?

Thanks.

kwen2501 commented 4 years ago

CollNet is a new algorithm in NCCL that allows GPUs on multiple nodes to do in-network reductions. When NCCL_COLLNET_ENABLE is set to 1, NCCL will detect network plugins (libnccl-net.so) loaded through LD_LIBRARY_PATH and use the in-network reduction functionalities implemented therein.
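As a minimal sketch of how this is enabled (the helper function below is hypothetical, not part of NCCL's API; it only assumes the plugin's libnccl-net.so is already on LD_LIBRARY_PATH of every launched process):

/* Hypothetical helper: ask NCCL to consider the CollNet algorithm.
 * NCCL still falls back to Ring/Tree automatically if no in-network
 * reduction plugin (libnccl-net.so) is found. */
#include <stdlib.h>
#include <nccl.h>

ncclResult_t initCommWithCollnet(ncclComm_t* comm, int nRanks, ncclUniqueId id, int myRank) {
  setenv("NCCL_COLLNET_ENABLE", "1", 1);      /* read by NCCL during communicator init */
  return ncclCommInitRank(comm, nRanks, id, myRank);
}

Setting NCCL_COLLNET_ENABLE=1 in the job's environment before launching achieves the same thing.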

The NCCL-SHARP plugin is such an example that connects NCCL with the SHARP reduction feature of Mellanox switches. The plugin's source code is hosted in the repo you mentioned. The binary is also available through the HPC-X toolkit provided by Mellanox.

changlan commented 4 years ago

Thanks!

changlan commented 4 years ago

To follow up: Besides SHARP, is there any other plugin that CollNet supports?

sjeaugey commented 4 years ago

I'm not aware of another plugin that implements collnet.

azuresol commented 4 years ago

Hi @sjeaugey and @kwen2501, I was also trying to understand how allreduce works in CollNet and have some questions:

Let's say we have two hosts and each host has 8 GPUs: ranks 0-7 and ranks 8-15. If I understand correctly, rank 0 and rank 8 will be the master ranks that communicate with the CollNet root, and the other ranks communicate with each other via P2P. How does the allreduce algorithm work in this case? A possible algorithm is: (1) reduce buffers into the local master rank, (2) master ranks perform collNetIallreduce, (3) broadcast to the other local ranks. But I am not sure if this is the case.

kwen2501 commented 4 years ago

Hi @azuresol what you described is correct, if each host has only 1 NIC connecting to the reduction-capable network switch. If you have multiple NICs, then NCCL can create multiple channels, each channel having a distinct master rank.
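For illustration, the three phases can be sketched with a plain MPI analogy (a hedged sketch, not NCCL's implementation; with CollNet/SHARP, step (2) is offloaded to the network instead of running as an MPI_Allreduce among the head ranks):

#include <mpi.h>

/* Sketch of the flow described above (assumes MPI_Init has been called):
 * (1) reduce onto the local master/head, (2) reduce across nodes among the
 * heads, (3) broadcast back inside the node. */
void collnet_style_allreduce(float* buf, int count) {
  int rank, localRank;
  MPI_Comm node, heads;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  /* One communicator per node; localRank 0 plays the master/head role. */
  MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, rank, MPI_INFO_NULL, &node);
  MPI_Comm_rank(node, &localRank);
  /* One communicator containing the head of every node. */
  MPI_Comm_split(MPI_COMM_WORLD, localRank == 0 ? 0 : MPI_UNDEFINED, rank, &heads);

  if (localRank == 0) {
    MPI_Reduce(MPI_IN_PLACE, buf, count, MPI_FLOAT, MPI_SUM, 0, node);   /* (1) */
    MPI_Allreduce(MPI_IN_PLACE, buf, count, MPI_FLOAT, MPI_SUM, heads);  /* (2) done in-network with SHARP */
    MPI_Comm_free(&heads);
  } else {
    MPI_Reduce(buf, NULL, count, MPI_FLOAT, MPI_SUM, 0, node);           /* (1) */
  }
  MPI_Bcast(buf, count, MPI_FLOAT, 0, node);                             /* (3) */
  MPI_Comm_free(&node);
}

With multiple NICs, NCCL runs several such channels in parallel, each with its own head rank, as noted above.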

hanyunfan commented 4 years ago

When should CollNet be used? Is there any data showing the performance comparison between Tree, Ring, and CollNet? Will NCCL pick the best one automatically?

sjeaugey commented 4 years ago

Yes, NCCL should pick the best algorithm automatically.

ConnollyLeon commented 1 year ago

@sjeaugey In https://github.com/NVIDIA/nccl/issues/457 you mention

for collnet (network-accelerated allreduce) it would be 2 * S intra-node and S inter-node.

The S inter-node volume is impressive. As far as I know, AllReduce needs to transfer at least 2 * S of data to finish the job. Could you please tell us where we can find documents that introduce CollNet AllReduce?

sjeaugey commented 1 year ago

Collnet is for systems which perform collective reductions in the network. For example, SHARP (NVIDIA IB switches), or parameter servers (PS) where you perform reductions on a pool of CPU instances (which have to match the total bandwidth of the GPU nodes).

In that case, you only need to send all your data to the switch (or PS) and receive the reduced values, once.

ConnollyLeon commented 1 year ago

Collnet is for systems which perform collective reductions in the network. For example, SHARP (NVIDIA IB switches), or parameter servers (PS) where you perform reductions on a pool of CPU instances (which have to match the total bandwidth of the GPU nodes).

In that case, you only need to send all your data to the switch (or PS) and receive the reduced values, once.

I think in this case, send and receive both need to transfer S data. Am I missing something? I guess you mean that by applying SHARP, send and receive can act like a pipeline, i.e., sending to the switches, reducing on the switches, then the switches sending data back to the hosts. The key is to make full use of the duplex bandwidth of the NICs. Then the bandwidth time cost might become S/B. It's kind of like BytePS, isn't it?

sjeaugey commented 1 year ago

Yes, as I mentioned, to use collnet you need to have your reductions done outside of the compute nodes, somewhere in the network: either in the switches, like SHARP, or on some CPU instances like BytePS. From the NCCL perspective it is the same: you send the values for your node, and get back the values summed for all nodes.

And indeed, this is fully pipelined with the intra-node communication, and the NIC is used in both directions continuously.
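As a back-of-the-envelope summary of the above (my own reading, not an official NCCL formula), for an allreduce of S bytes across n nodes, the bytes each node pushes through its NIC in one direction are roughly:

V_{\text{ring}} \approx 2S \cdot \tfrac{n-1}{n} \approx 2S \qquad \text{vs.} \qquad V_{\text{collnet}} = S \quad\Rightarrow\quad T_{\text{collnet}} \approx \tfrac{S}{B_{\text{NIC}}}

The S bytes coming back from the switch travel in the other direction of the full-duplex NIC, overlapped with the S bytes going up, which is why only S (not 2S) appears in the inter-node cost.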

ConnollyLeon commented 1 year ago

@sjeaugey I found a CollNet nccl-tests result here.

For 128 nodes doing AllReduce on 2 GB of data, it achieves an algbw of 94.94 GB/s. What's the network configuration of each node (type and number of NICs)? I didn't mention the busbw because nccl-tests calculates the busbw based on the ring algorithm here:

void AllReduceGetBw(size_t count, int typesize, double sec, double* algBw, double* busBw, int nranks) {
  // Algorithm bandwidth: bytes per rank divided by elapsed time.
  double baseBw = (double)(count * typesize) / 1.0E9 / sec;

  *algBw = baseBw;
  // Bus bandwidth: algBw scaled by the theoretical 2*(n-1)/n factor.
  double factor = ((double)(2*(nranks - 1)))/((double)nranks);
  *busBw = baseBw * factor;
}
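For example (a hypothetical calculation, assuming the AllReduceGetBw function above is in scope and that the run used 128 nodes x 8 GPUs = 1024 ranks, which the linked result does not state):

#include <stdio.h>

int main(void) {
  double algBw, busBw;
  size_t count = 536870912;            /* 512M floats * 4 B = ~2.15 GB per rank */
  double sec = 2.147483648 / 94.94;    /* elapsed time that yields 94.94 GB/s   */
  AllReduceGetBw(count, sizeof(float), sec, &algBw, &busBw, 1024);
  printf("algBw = %.2f GB/s, busBw = %.2f GB/s\n", algBw, busBw);
  /* Prints algBw = 94.94 and busBw ~ 189.7 (= algBw * 2*1023/1024). */
  return 0;
}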

If there are 8 HDR NICs on each node, I think the algbw should be up to 200 GB/s according to the discussion above.

Did I miss something?

sjeaugey commented 1 year ago

My rule of thumb is ~20 GB/s per NIC, so the algorithm bandwidth could be up to 8x20GB/s = 160 GB/s. However, at this point, the bottleneck becomes NVLink which still has to transmit ~2x more data intra-node. So with 4 NICs we should be able to get ~80GB/s algorithm bandwidth, but with 8 NICs we're limited to ~110GB/s AlgBw or ~220 GB/s BusBw.

Note, the bus bandwidth is not based on the ring algorithm. It uses a theoretical perfect formula (a factor of 2*(n-1)/n) for algorithms based on point-to-point communication, which happens to be exactly what you get on a ring, but also with a direct all-to-all based allreduce algorithm, and potentially other optimal algorithms. The tree is not optimal (a factor of 2), but close to optimal at scale. But of course with SHARP and other HW-accelerated mechanisms, the BusBw is much harder to make sense of and we usually prefer the AlgBw -- except for comparing against the Ring/Tree performance.

ConnollyLeon commented 1 year ago

Thanks for your reply and the correction, @sjeaugey.

Note, the bus bandwidth is not based on the ring algorithm. It uses a theoretical perfect formula (a factor of 2*(n-1)/n) for algorithms based on point-to-point communication, which happens to be exactly what you get on a ring, but also with a direct all-to-all based allreduce algorithm, and potentially other optimal algorithms. The tree is not optimal (a factor of 2), but close to optimal at scale. But of course with SHARP and other HW-accelerated mechanisms, the BusBw is much harder to make sense of and we usually prefer the AlgBw -- except for comparing against the Ring/Tree performance.

Let's make this more clear.

My rule of thumb is ~20 GB/s per NIC, so the algorithm bandwidth could be up to 8x20GB/s = 160 GB/s. However, at this point, the bottleneck becomes NVLink which still has to transmit ~2x more data intra-node. So with 4 NICs we should be able to get ~80GB/s algorithm bandwidth, but with 8 NICs we're limited to ~110GB/s AlgBw or ~220 GB/s BusBw.

In DGX-A100 servers that use CollNet, the data in each channel flows like this:

Channel 0-1:   0->1->2->3->4->5->6->7 -> NIC7 -> 7->6->5->4->3->2->1->0
Channel 2-3:   1->2->3->4->5->6->7->0 -> NIC0 -> 0->7->6->5->4->3->2->1
Channel 4-5:   2->3->4->5->6->7->0->1 -> NIC1 -> 1->0->7->6->5->4->3->2
Channel 6-7:   3->4->5->6->7->0->1->2 -> NIC2 -> 2->1->0->7->6->5->4->3
Channel 8-9:   4->5->6->7->0->1->2->3 -> NIC3 -> 3->2->1->0->7->6->5->4
Channel 10-11: 5->6->7->0->1->2->3->4 -> NIC4 -> 4->3->2->1->0->7->6->5
Channel 12-13: 6->7->0->1->2->3->4->5 -> NIC5 -> 5->4->3->2->1->0->7->6
Channel 14-15: 7->0->1->2->3->4->5->6 -> NIC6 -> 6->5->4->3->2->1->0->7

Since this is fully pipelined, each GPU has 28 send transmissions and 28 recv transmissions in flight simultaneously. In each channel, a GPU may need to transmit up to 2x the data (e.g., GPU 1 in channel 0 needs to recv and send 2x data).

Globally, each GPU needs to send 28/16 S data and recv 28/16 S data, while each NIC needs to send 2/16 S data and recv 2/16 S data. I think this means the bandwidth of p2p GPU communication needs to be 14x that of the NIC to make full use of the NIC bandwidth, which is 14*20 = ~280 GB/s. However, ~280 GB/s exceeds the NVLink bandwidth in DGX-A100 servers, so NVLink becomes the bottleneck.

In the case of 4 NICs, each GPU sends and recvs ~7x more data. Since 7*20 = 140 < 220, NVLink won't be the bottleneck here.

Is my understanding correct?

sjeaugey commented 1 year ago

Yes, I think this is all correct.

VxOvOxV commented 1 year ago

Yes, NCCL should pick the best algorithm automatically.

Hi @sjeaugey and @kwen2501. When I try to use nccl-rdma-sharp-plugins with the HPC-X packages, it fails. Can you help me solve this problem? The specific error message is as follows:

[Feb 23 02:57:53 144853][SD][46337][error] - no AM service record found
[Feb 23 02:57:53 149665][SD][46337][error] - failed to connect to AM - error -1 received
[Feb 23 02:57:53 150784][SD][46337][error] - unable to connect to AM
[01:0:46115 unique id 8145462613505354939] ERROR Failed to connect to Aggregation Manager (sharp_am) in sharp_create_job.
[01:0:46115 - context.c:706] ERROR sharp_create_job failed: Failed to connect to Aggregation Manager (sharp_am)(-53)
01:46116:46340 [1] sharp_plugin.c:320 NCCL WARN NET/IB : SHARP coll init error: Cannot create SHARP job(-11)

AddyLaddy commented 1 year ago

You have to have one [extra] node, which is connected to the SHARP capable IB switch over IB, running the SHARP Aggregation Manager daemon. https://docs.nvidia.com/networking/display/SHARPv200/Running+Mellanox+SHARP+Deamons

VxOvOxV commented 1 year ago

You have to have one [extra] node, which is connected to the SHARP capable IB switch over IB, running the SHARP Aggregation Manager daemon. https://docs.nvidia.com/networking/display/SHARPv200/Running+Mellanox+SHARP+Deamons

Thanks for your reply, @AddyLaddy. I tested on a network with two nodes, ran the SHARP Aggregation Manager daemon, and still got the above error. I run the NCCL program with sharp_am registered on the master node and sharpd registered on the compute node. The steps to reproduce are below; can you help me troubleshoot which step went wrong? master node [01]:

$HPCX_SHARP_DIR/sbin/sharp_daemons_setup.sh -s -d sharp_am

# service sharp_am start
[01] Redirecting to /bin/systemctl start sharp_am.service
[01] Running in chroot, ignoring request.

compute node[02]:

$HPCX_SHARP_DIR/sbin/sharp_daemons_setup.sh -s -d sharpd

# service sharpd start
[02] Redirecting to /bin/systemctl start sharpd.service
[02] Running in chroot, ignoring request.

By the way, the above operations are all carried out in Docker; I don't know if that has an impact.

bureddy commented 1 year ago

@VxOvOxV Is sharp_am running? Check "service sharp_am status", and please check /var/log/sharp_am.log for any errors. BTW, if you are using the latest HPC-X, the sharpd service is not needed.

Before running NCCL, can you please verify that the SHARP setup is fine with sharp_hello: $HPCX_SHARP_DIR/bin/sharp_hello -s mlx5_0:1 -v 3

VxOvOxV commented 1 year ago

Hi @bureddy, thanks for your reply. The HPC-X version is v2.13. I have checked "service sharp_am status"; it shows "Running in chroot, ignoring request". BTW, there is no /var/log/sharp_am.log file. When I check whether SHARP is set up successfully, the following error message appears:

root@-01:# $HPCX_SHARP_DIR/bin/sharp_hello -d mlx5_0:1 -v 3
[01:0:14644 - context.c:696] INFO job (ID: 8145405344325762139) resource request quota: ( osts:0 user_data_per_ost:0 max_groups:0 max_qps:1 max_group_channels:1, num_trees:1)
[01][Feb 27 02:14:53 441035][SD][14644][error] - no AM service record found
[01][Feb 27 02:14:53 446362][SD][14644][error] - failed to connect to AM - error -1 received
[01][Feb 27 02:14:53 447512][SD][14644][error] - unable to connect to AM
[01:0:14644 unique id 8145405344325762139] ERROR Failed to connect to Aggregation Manager (sharp_am) in sharp_create_job.

bureddy commented 1 year ago

It seems the SHARP service is not enabled in the setup. sharp_am needs to run on the same server where the UFM/opensm service is running in the fabric.

jiangxiaobin96 commented 1 year ago

Hi @sjeaugey. I ran into some confusion while reading the collnet code.

  1. Where is the collnet topology constructed, in which all GPUs are leaf nodes and the switch is the aggregation node? Is the topology constructed in the SHARP plugin?
  2. What are the differences between the ALGOs collnetChain and collnetDirect, especially their up and down arrays? My understanding is that a GPU's up and down represent the next and previous communication GPUs within the node, except for the master rank. collnetChain transfers data like a ring, while collnetDirect transfers data directly to the master rank. All local aggregations gather in the master rank, and the master rank sends them to the switch.
VxOvOxV commented 1 year ago

It seems the SHARP service is not enabled in the setup. sharp_am needs to run on the same server where the UFM/opensm service is running in the fabric.

Sorry, I don't clearly understand the meaning of this paragraph. Can you elaborate in a little more detail? Thank you very much.

sjeaugey commented 1 year ago

@jiangxiaobin96

  1. Not sure I understood the question, but the collnet setup is in init.cc/transport.cc, where we create the Collnet groups (separate the multiple rails and assign a GPU as "head" for each NIC). The plugin only creates the SHARP groups based on what we decided in NCCL core and runs the allreduce for each NIC, but it doesn't have a global view of the NVLink+SHARP operation. (A rough sketch of this rail/head grouping follows after this list.)
  2. That seems correct.
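The rail/head grouping from point 1 can be pictured roughly as below (purely illustrative; the channel-to-NIC and head-to-NIC mapping here is an assumption, not the actual init.cc/transport.cc logic):

#include <stdio.h>

int main(void) {
  const int nGpus = 8, nNics = 4, nChannels = 8;   /* example node configuration */
  for (int c = 0; c < nChannels; c++) {
    int nic  = c % nNics;                /* each channel is pinned to one rail/NIC        */
    int head = nic * (nGpus / nNics);    /* assume the GPU nearest that NIC acts as head  */
    printf("channel %d -> NIC %d, head GPU %d\n", c, nic, head);
  }
  return 0;
}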

@VxOvOxV the sharp_am service needs to run on the same machine running the IB subnet manager, e.g. opensm or ufm. If your switch is managed (the subnet manager runs inside the switch), then you need to run opensm on the same system as sharp_am and it will take control over the switch's subnet manager.

Moson864 commented 5 months ago

After reading the v2.20 source code repeatedly, I still feel confused about how the collNet topo graph works. My questions are as follows:

  1. How does the root rank work? Is it a virtual rank in the system, or does it actually run on the NIC or switch?
  2. Does the root rank actually build connections with the local master ranks, or do the local master ranks just use the connector in the root rank to build connections with each other? It seems from the code in nccl-rdma-sharp-plugins that the local master ranks assemble into a SHARP coll group and build connections with the previous and next master:

     int next = (cComm->rank + 1) % nranks;
     do {
       if (cComm->sendComm == NULL) NCCLCHECK(ncclNetPlugin_v6.connect(lComm->dev, handles[next], &cComm->sendComm));
       if (cComm->recvComm == NULL) NCCLCHECK(ncclNetPlugin_v6.accept(lComm->listenCommP2P, &cComm->recvComm)); // From prev
     } while (cComm->sendComm == NULL || cComm->recvComm == NULL);

  3. How does the in-network computing work? I suppose there should be a "root" gathering all the data and sending it into the switch. However, I could not find any trace of switch information. Is it included in the function "sharp_coll_comm_init"? Looking forward to your replies, many thanks! @sjeaugey
echobinarybytes commented 1 month ago

(Quoting @ConnollyLeon's earlier comment above on the DGX-A100 CollNet channel flows and the NVLink bottleneck analysis.)

Hello, I am now studying the details of CollNet, but my environment does not support CollNet. Could you please share your CollNet algo logs with me?

Looking forward to your reply.