CollNet is a new algorithm in NCCL that allows GPUs on multiple nodes to do in-network reductions. When `NCCL_COLLNET_ENABLE` is set to 1, NCCL will detect network plugins (`libnccl-net.so`) loaded through `LD_LIBRARY_PATH` and use the in-network reduction functionality implemented therein.
The NCCL-SHARP plugin is such an example that connects NCCL with the SHARP reduction feature of Mellanox switches. The plugin's source code is hosted in the repo you mentioned. The binary is also available through the HPC-X toolkit provided by Mellanox.
Thanks!
To follow up: Besides SHARP, is there any other plugin that CollNet supports?
I'm not aware of another plugin that implements collnet.
Hi @sjeaugey and @kwen2501, I was also trying to understand how allreduce works in CollNet and have some questions:
Let's say we have two hosts and each host has 8 GPUs: ranks 0-7 and ranks 8-15. If I understand correctly, rank 0 and rank 8 will be the master ranks that communicate with the CollNet root, and the other ranks communicate with each other via P2P. How does the allreduce algorithm work in this case? A possible algorithm is: (1) reduce buffers into the local master rank, (2) master ranks perform `collNetIallreduce`, (3) broadcast into the other local ranks. But I am not sure if this is the case.
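If it helps to make the three phases concrete, here is a minimal CPU-only sketch of that scheme. This is my own toy code, not NCCL's implementation; `net_reduce` is just a stand-in for whatever the in-network reduction (`collNetIallreduce`) does in the switch, and the host/rank counts are arbitrary:

```c
enum { NHOSTS = 2, LOCAL = 8, COUNT = 4 };

/* Stand-in for the in-network reduction (what collNetIallreduce would
 * offload to the switch): sum the master buffers across hosts. */
static void net_reduce(float m[NHOSTS][COUNT]) {
    for (int i = 0; i < COUNT; i++) {
        float sum = 0.f;
        for (int h = 0; h < NHOSTS; h++) sum += m[h][i];
        for (int h = 0; h < NHOSTS; h++) m[h][i] = sum;
    }
}

void collnet_allreduce(float buf[NHOSTS][LOCAL][COUNT]) {
    /* (1) reduce every local buffer into the host's master rank (local rank 0) */
    for (int h = 0; h < NHOSTS; h++)
        for (int r = 1; r < LOCAL; r++)
            for (int i = 0; i < COUNT; i++)
                buf[h][0][i] += buf[h][r][i];

    /* (2) master ranks run the in-network allreduce */
    float masters[NHOSTS][COUNT];
    for (int h = 0; h < NHOSTS; h++)
        for (int i = 0; i < COUNT; i++)
            masters[h][i] = buf[h][0][i];
    net_reduce(masters);

    /* (3) broadcast the reduced result to all local ranks */
    for (int h = 0; h < NHOSTS; h++)
        for (int r = 0; r < LOCAL; r++)
            for (int i = 0; i < COUNT; i++)
                buf[h][r][i] = masters[h][i];
}
```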
Hi @azuresol what you described is correct, if each host has only 1 NIC connecting to the reduction-capable network switch. If you have multiple NICs, then NCCL can create multiple channels, each channel having a distinct master rank.
When should Collnet be used? Any data shows the performance comparison between Tree, Ring, and Collnet? Will NCCL pick the best one automatically?
Yes, NCCL should pick the best algorithm automatically.
@sjeaugey In https://github.com/NVIDIA/nccl/issues/457 you mention
> for collnet (network-accelerated allreduce) it would be 2 * S intra-node and S inter-node.
Getting only S of inter-node traffic is awesome. As far as I know, AllReduce needs to transfer at least 2 * S of data to finish the job. Could you please tell us where we can find documents that introduce CollNet AllReduce?
Collnet is for systems which perform collective reductions in the network. For example, SHARP (NVIDIA IB switches), or parameter servers (PS) where you perform reductions on a pool of CPU instances (which have to match the total bandwidth of the GPU nodes).
In that case, you only need to send all your data to the switch (or PS) and receive the reduced values, once.
> Collnet is for systems which perform collective reductions in the network. For example, SHARP (NVIDIA IB switches), or parameter servers (PS) where you perform reductions on a pool of CPU instances (which have to match the total bandwidth of the GPU nodes). In that case, you only need to send all your data to the switch (or PS) and receive the reduced values, once.
I think in this case, send and receive each still need to transfer S data. Am I missing something? I guess you mean that with SHARP, send and receive can act as a pipeline: hosts send to the switches, the switches reduce, then the switches send data back to the hosts. The key is to make full use of the duplex bandwidth of the NICs. Then the bandwidth time cost becomes S/B. It's kind of like BytePS, isn't it?
Yes, as I mentioned, to use collnet you need to have your reductions done outside of the compute nodes, somewhere in the network: either in the switches, like SHARP, or on some CPU instances like BytePS. From the NCCL perspective it is the same: you send the values for your node, and get back the values summed for all nodes.
And indeed, this is fully pipelined with the intra-node communication, and the NIC is used in both directions continuously.
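To put rough numbers on the comparison above, here is a tiny arithmetic sketch (my own, using the ring formula that also appears in nccl-tests) of how many bytes cross each node's NIC, per direction, to allreduce S bytes:

```c
/* Bytes each node sends over its NIC (per direction) to allreduce S bytes
 * with a ring algorithm: 2*(n-1)/n * S crosses each inter-node link. */
double ring_internode_bytes(double S, int nnodes) {
    return 2.0 * (nnodes - 1) / nnodes * S;
}

/* With collnet, each node sends its data to the network once and receives
 * the reduced result once; send and receive overlap on a full-duplex NIC. */
double collnet_internode_bytes(double S) {
    return S;
}
```

For large node counts the ring cost approaches 2S per direction, so collnet roughly halves the inter-node traffic.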
@sjeaugey I found a CollNet nccl-tests result here.
For 128 nodes doing AllReduce on 2GB of data, it achieves an algbw of 94.94GB/s. What's the network configuration of each node (type and number of NICs)? I didn't mention the busbw because nccl-tests calculates busbw based on the ring algorithm, as shown here:
```c
void AllReduceGetBw(size_t count, int typesize, double sec, double* algBw, double* busBw, int nranks) {
  double baseBw = (double)(count * typesize) / 1.0E9 / sec;  // bytes moved / time, in GB/s
  *algBw = baseBw;
  double factor = ((double)(2*(nranks - 1)))/((double)nranks);  // 2*(n-1)/n
  *busBw = baseBw * factor;
}
```
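As a quick sanity check on the numbers, one can back out the BusBw implied by the 94.94GB/s AlgBw result using the same factor as `AllReduceGetBw` (a toy calculation of mine, not part of nccl-tests):

```c
/* BusBw derived from AlgBw with the nccl-tests 2*(n-1)/n factor. */
double allreduce_busbw(double algbw_gbs, int nranks) {
    return algbw_gbs * 2.0 * (nranks - 1) / nranks;
}
```

For 128 ranks the factor is 254/128 ≈ 1.984, so 94.94GB/s AlgBw corresponds to roughly 188GB/s BusBw.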
If there are 8 HDR NICs on each node, I think the algbw should be up to 200GB/s according to our above discussion.
Did I miss something?
My rule of thumb is ~20 GB/s per NIC, so the algorithm bandwidth could be up to 8x20GB/s = 160 GB/s. However, at this point, the bottleneck becomes NVLink which still has to transmit ~2x more data intra-node. So with 4 NICs we should be able to get ~80GB/s algorithm bandwidth, but with 8 NICs we're limited to ~110GB/s AlgBw or ~220 GB/s BusBw.
Note, the bus bandwidth is not based on the ring algorithm. It uses a theoretical perfect formula (a 2*(n-1)/n factor) for algorithms based on point-to-point communication, which happens to be exactly what you get on a ring, but also with a direct all-to-all based allreduce algorithm, and potentially other optimal algorithms. The tree is not optimal (a 2x factor), but close to optimal at scale. But of course with SHARP and other HW-accelerated mechanisms, the BusBW is much harder to make sense of and we usually prefer the AlgBw -- except for comparing against the Ring/Tree performance.
Thanks for your reply and the correction, @sjeaugey.
> Note, the bus bandwidth is not based on the ring algorithm. It uses a theoretical perfect formula (2*(n-1)/n factor) for algorithms based on point-to-point communication, which happens to be exactly what you get on a ring, but also with a direct all-to-all based allreduce algorithm, and potentially other optimal algorithms. The tree is not optimal (2x factor), but close to optimal at scale. But of course with SHARP and other HW-accelerated mechanisms, the BusBW is much harder to make sense of and we usually prefer the AlgBw -- except for comparing against the Ring/Tree performance.
Let's make this more clear.
> My rule of thumb is ~20 GB/s per NIC, so the algorithm bandwidth could be up to 8x20GB/s = 160 GB/s. However, at this point, the bottleneck becomes NVLink which still has to transmit ~2x more data intra-node. So with 4 NICs we should be able to get ~80GB/s algorithm bandwidth, but with 8 NICs we're limited to ~110GB/s AlgBw or ~220 GB/s BusBw.
In DGX-A100 servers that use CollNet, the data in each channel flows like this:

```
Channel 0-1:   0->1->2->3->4->5->6->7 -> NIC7 -> 7->6->5->4->3->2->1->0
Channel 2-3:   1->2->3->4->5->6->7->0 -> NIC0 -> 0->7->6->5->4->3->2->1
Channel 4-5:   2->3->4->5->6->7->0->1 -> NIC1 -> 1->0->7->6->5->4->3->2
Channel 6-7:   3->4->5->6->7->0->1->2 -> NIC2 -> 2->1->0->7->6->5->4->3
Channel 8-9:   4->5->6->7->0->1->2->3 -> NIC3 -> 3->2->1->0->7->6->5->4
Channel 10-11: 5->6->7->0->1->2->3->4 -> NIC4 -> 4->3->2->1->0->7->6->5
Channel 12-13: 6->7->0->1->2->3->4->5 -> NIC5 -> 5->4->3->2->1->0->7->6
Channel 14-15: 7->0->1->2->3->4->5->6 -> NIC6 -> 6->5->4->3->2->1->0->7
```
Since this is fully pipelined, each GPU has 28 send transmissions and 28 recv transmissions in flight simultaneously. Within a channel, a GPU may need to transmit up to 2x the data (e.g., GPU 1 in channel 0 needs to both recv and send 2x data).
Globally, each GPU needs to send 28/16 S data and recv 28/16 S data, while each NIC needs to send 2/16 S data and recv 2/16 S data. I think this means the bandwidth of P2P GPU communication needs to be 14x that of the NIC to make full use of the NIC bandwidth, which is 14*20 = ~280GB/s. However, ~280GB/s exceeds the NVLink bandwidth of DGX-A100 servers, so NVLink becomes the bottleneck.
In the case of 4 NICs, each GPU sends and recvs ~7x more data than a NIC. Since 7*20 = 140 < 220, NVLink won't be the bottleneck here.
Is my understanding correct?
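The send-counting above can be checked mechanically. Here is a small sketch of mine that counts NVLink send hops per GPU, assuming the 16-channel layout quoted earlier (8 ring paths, 2 channels each, path p running p -> p+1 -> ... -> p+7 -> NIC and back):

```c
enum { NGPU = 8, NPATHS = 8, CHANNELS_PER_PATH = 2 };

/* Count NVLink send transmissions per GPU. In each path, the GPU at
 * forward position 0..6 sends one NVLink hop outbound; the GPU at the
 * end of the chain sends to the NIC (not NVLink). On the way back, the
 * GPUs at positions 7..1 each send one NVLink hop. */
void count_nvlink_sends(int sends[NGPU]) {
    for (int g = 0; g < NGPU; g++) sends[g] = 0;
    for (int p = 0; p < NPATHS; p++) {
        for (int pos = 0; pos < NGPU; pos++) {
            int g = (p + pos) % NGPU;
            if (pos < NGPU - 1) sends[g] += CHANNELS_PER_PATH; /* outbound hop */
            if (pos > 0)        sends[g] += CHANNELS_PER_PATH; /* return hop */
        }
    }
}
```

Each channel carries S/16 of the data, so 28 hops per GPU means each GPU pushes 28/16 S = 1.75 S over NVLink, while each NIC pushes 2/16 S, which is exactly the 14x ratio above.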
Yes, I think this is all correct.
Hi @sjeaugey and @kwen2501, when I try to use nccl-rdma-sharp-plugin with the HPC-X packages, it fails. Can you help me solve this problem? The specific error message is as follows:
```
[Feb 23 02:57:53 144853][SD][46337][error] - no AM service record found
[Feb 23 02:57:53 149665][SD][46337][error] - failed to connect to AM - error -1 received
[Feb 23 02:57:53 150784][SD][46337][error] - unable to connect to AM
[01:0:46115 unique id 8145462613505354939] ERROR Failed to connect to Aggregation Manager (sharp_am) in sharp_create_job.
[01:0:46115 - context.c:706] ERROR sharp_create_job failed: Failed to connect to Aggregation Manager (sharp_am)(-53)
01:46116:46340 [1] sharp_plugin.c:320 NCCL WARN NET/IB : SHARP coll init error: Cannot create SHARP job(-11)
```
You have to have one [extra] node, which is connected to the SHARP capable IB switch over IB, running the SHARP Aggregation Manager daemon. https://docs.nvidia.com/networking/display/SHARPv200/Running+Mellanox+SHARP+Deamons
> You have to have one [extra] node, which is connected to the SHARP capable IB switch over IB, running the SHARP Aggregation Manager daemon. https://docs.nvidia.com/networking/display/SHARPv200/Running+Mellanox+SHARP+Deamons
Thanks for your reply @AddyLaddy. I tested on a network with two nodes, ran the SHARP Aggregation Manager daemon, and still get the above error. I run NCCL programs with sharp_am started on the master node and sharpd started on the compute node. Here are the steps to reproduce; can you help me troubleshoot which step went wrong?
master node[01]:
```
# service sharp_am start
[01]Redirecting to /bin/systemctl start sharp_am.service
[01]Running in chroot, ignoring request.
```
compute node[02]:
```
# service sharpd start
[02]Redirecting to /bin/systemctl start sharpd.service
[02]Running in chroot, ignoring request.
```
By the way, the above operations are all carried out inside Docker; I don't know if that has an impact.
@VxOvOxV is sharp_am running? Check with `service sharp_am status`. Please also check /var/log/sharp_am.log for any errors. BTW, if you are using the latest HPC-X, the sharpd service is not needed.
Before running NCCL, can you please verify that the SHARP setup is fine with sharp_hello: `$HPCX_SHARP_DIR/bin/sharp_hello -s mlx5_0:1 -v 3`
Hi @bureddy, thanks for your reply.
The HPC-X version is v2.13. I have checked `service sharp_am status`; it shows "Running in chroot, ignoring request". BTW, the file /var/log/sharp_am.log does not exist.
When I check whether SHARP is setup successfully, the following error message appears:
```
root@-01:# $HPCX_SHARP_DIR/bin/sharp_hello -d mlx5_0:1 -v 3
[01:0:14644 - context.c:696] INFO job (ID: 8145405344325762139) resource request quota: ( osts:0 user_data_per_ost:0 max_groups:0 max_qps:1 max_group_channels:1, num_trees:1)
[01][Feb 27 02:14:53 441035][SD][14644][error] - no AM service record found
[01][Feb 27 02:14:53 446362][SD][14644][error] - failed to connect to AM - error -1 received
[01][Feb 27 02:14:53 447512][SD][14644][error] - unable to connect to AM
[01:0:14644 unique id 8145405344325762139] ERROR Failed to connect to Aggregation Manager (sharp_am) in sharp_create_job.
```
It seems the SHARP service is not enabled in this setup. sharp_am needs to run on the same server where the UFM/opensm service is running in the fabric.
Hi @sjeaugey. I ran into some confusion while reading the collnet code.
> it seems sharp service is not enabled in the setup. sharp_am needs to run on the same server where UFM/opensm service is running in the fabric.
Sorry, I didn't clearly understand the meaning of this paragraph. Could you elaborate in a bit more detail? Thank you very much.
@jiangxiaobin96
@VxOvOxV the `sharp_am` service needs to run on the same machine running the IB subnet manager, e.g. `opensm` or `ufm`. If your switch is managed (the subnet manager runs inside the switch), then you need to run opensm on the same system as sharp_am and it will take control over the switch's subnet manager.
After reading the v2.20 source code repeatedly, I still feel confused about how the collNet topo graph works. My questions are as follows:
```c
int next = (cComm->rank + 1) % nranks;
do {
  if (cComm->sendComm == NULL) NCCLCHECK(ncclNetPlugin_v6.connect(lComm->dev, handles[next], &cComm->sendComm)); // To next
  if (cComm->recvComm == NULL) NCCLCHECK(ncclNetPlugin_v6.accept(lComm->listenCommP2P, &cComm->recvComm));       // From prev
} while (cComm->sendComm == NULL || cComm->recvComm == NULL);
```
Hello, I am now studying the details of CollNet, but my environment does not support CollNet. Could you please share your CollNet algo log with me?
Looking forward to your reply.
NCCL 2.6 added a new algorithm called CollNet, but I could not find any documentation about it. It seems to be related to SHARP, but it is not clear to me what its relationship to https://github.com/Mellanox/nccl-rdma-sharp-plugins is. Would you describe what CollNet is?
Thanks.