ProHuper opened 2 months ago

2-node allreduce test, 8 H100 GPUs per node, using 4 NICs: the measured busbw is 309 GB/s, but the theoretical busbw should be 360 GB/s.
4 x 49 GB/s = 196 GB/s. That's your network bandwidth, and it's also what you should see when setting NCCL_ALGO=RING. However, on 2 nodes, the Tree algorithm puts more traffic on NVLink and less on the network, allowing it to reach a bandwidth that is a mix of the NVLink and network bandwidths, which is why it can be higher than the network bandwidth.
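A quick sanity check of that arithmetic (a minimal sketch, not NCCL code; it assumes the 49 GB/s per-NIC figure and 2 nodes x 8 GPUs = 16 ranks from this thread, and uses the nccl-tests allreduce convention busbw = algbw * 2(n-1)/n):

```python
# Assumptions: 4 NICs at 49 GB/s each, 16 total ranks, as in this thread.
nics, per_nic = 4, 49.0
n = 16

net_bw = nics * per_nic          # 196 GB/s of network bandwidth per node
factor = 2 * (n - 1) / n         # 1.875 for 16 ranks

# With NCCL_ALGO=RING the network is the bottleneck, so the expected
# busbw is roughly the per-node network bandwidth:
print(f"ring busbw ~ {net_bw:.0f} GB/s")            # ~196 GB/s
print(f"ring algbw ~ {net_bw / factor:.1f} GB/s")   # ~104.5 GB/s
```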
Thanks for replying. As you mentioned, under the Ring algorithm the busbw I measured is very close to the theoretical peak (196 GB/s). However, under the Tree algorithm the busbw I measured is 309 GB/s, and I'm not quite sure if this is close to the theoretical bandwidth. Is there a way to determine the theoretical busbw for the Tree algorithm with 2 nodes?
> I'm not quite sure if this is close to the theoretical bandwidth.
The rings are close to theoretical, so your network hardware is functioning perfectly. You can check that the intra-node NVLink performance is around 370 GB/s to ensure NVLink is functioning properly. If both are good, then the Tree performance is the best it can be.
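For reference, a hedged sketch of one way to run that intra-node check with nccl-tests (the binary path is an assumption; point it at your own build):

```python
import subprocess

subprocess.run(
    ["./build/all_reduce_perf",   # assumed nccl-tests location
     "-b", "16G", "-e", "16G",    # same 16 GiB message size as the runs below
     "-g", "8"],                  # all 8 GPUs of one node, no network involved
    check=True,
)
# On 8x H100 with NVLink/NVSwitch, the reported busbw should be around
# 370 GB/s; a much lower number points at an NVLink problem.
```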
If the communication in the Tree algorithm overlaps well, the algbw should be close to the network bandwidth, right? When I use one NIC or 2 NICs, that is the case:
```
#
#                                                              out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
 17179869184    4294967296     float     sum      -1   347989   49.37   92.57      0   348000   49.37   92.56      0
```
```
#
#                                                              out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
 17179869184    4294967296     float     sum      -1   174239   98.60  184.87      0   174222   98.61  184.89      0
```
But when I use 4 NICs, it is not:
```
#
#                                                              out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
 17179869184    4294967296     float     sum      -1   104090  165.05  309.47      0   103759  165.57  310.45
```
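Plugging the measured algbw into the same busbw formula makes the shortfall concrete (a small sketch; the numbers are copied from the tables above, 16 ranks as before, and 49 GB/s per NIC is treated as line rate):

```python
factor = 2 * (16 - 1) / 16       # busbw = algbw * 1.875 for 16 ranks

for nics, algbw in [(1, 49.37), (2, 98.60), (4, 165.05)]:
    print(f"{nics} NIC(s): busbw = {algbw * factor:6.2f} GB/s, "
          f"per-NIC algbw = {algbw / nics:5.2f} GB/s")

# 1 NIC : busbw ~  92.57 GB/s, ~49.4 GB/s per NIC -> line rate
# 2 NICs: busbw ~ 184.87 GB/s, ~49.3 GB/s per NIC -> line rate
# 4 NICs: busbw ~ 309.47 GB/s, ~41.3 GB/s per NIC -> well below line rate
```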
Also, if I set NCCL_MIN_NCHANNELS=24 (the default is 16 for the 2-node Tree algorithm), the algbw increases, but it still does not meet expectations.
```
#
#                                                              out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
 17179869184    4294967296     float     sum      -1    96251  178.49  334.67      0    96260  178.47  334.64      0
```
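Since NCCL reads NCCL_MIN_NCHANNELS from the environment at communicator creation, it has to be set before NCCL initializes; a minimal single-process sketch (the binary path is an assumption, and for a real 2-node run you would export the variable through your launcher, e.g. mpirun -x NCCL_MIN_NCHANNELS=24):

```python
import os
import subprocess

# Put the NCCL settings in the environment before the process that
# creates the communicator starts.
env = dict(os.environ, NCCL_MIN_NCHANNELS="24", NCCL_ALGO="TREE")
subprocess.run(
    ["./build/all_reduce_perf", "-b", "16G", "-e", "16G", "-g", "8"],
    env=env, check=True,
)
```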
The default number of channels is 16 because beyond that, even though performance would be a bit better, it would use too many GPU compute resources, as well as too much memory for buffers. That would severely impact training.
In other words, that's all good for benchmarks, but not a good compromise for real applications.
I can't confirm whether it's possible to get better performance with the Tree algorithm and 4 NICs. We rarely run in that configuration, and the particular case of 2 nodes is not one we spend a lot of time optimizing.
Alright, thanks!
Hello, regarding this issue, I've done some further testing. It seems that an incorrect logical topology can cause NCCL to select the wrong NIC in certain scenarios. The physical topology of the nodes is shown in the diagram below; each node has 8 GPUs and 4 NICs, and GPU0/GPU1/NIC0 are under the same PCIe switch:

[physical topology diagram]
So if I specify GPU1 on both nodes for the allreduce, NCCL should select NIC0 because it's the closest one. But the log shows that NIC1 was actually selected:
```
qh100-gpu20:67697:67704 [0] NCCL INFO P2P Chunksize set to 131072
qh100-gpu19:87121:87129 [0] NCCL INFO Channel 00/0 : 1[1] -> 0[1] [receive] via NET/IB/1
qh100-gpu19:87121:87129 [0] NCCL INFO Channel 01/0 : 1[1] -> 0[1] [receive] via NET/IB/1
qh100-gpu19:87121:87129 [0] NCCL INFO Channel 02/0 : 1[1] -> 0[1] [receive] via NET/IB/1
qh100-gpu19:87121:87129 [0] NCCL INFO Channel 03/0 : 1[1] -> 0[1] [receive] via NET/IB/1
qh100-gpu19:87121:87129 [0] NCCL INFO Channel 00/0 : 0[1] -> 1[1] [send] via NET/IB/1
qh100-gpu19:87121:87129 [0] NCCL INFO Channel 01/0 : 0[1] -> 1[1] [send] via NET/IB/1
qh100-gpu19:87121:87129 [0] NCCL INFO Channel 02/0 : 0[1] -> 1[1] [send] via NET/IB/1
qh100-gpu19:87121:87129 [0] NCCL INFO Channel 03/0 : 0[1] -> 1[1] [send] via NET/IB/1
qh100-gpu20:67697:67704 [0] NCCL INFO Channel 00/0 : 0[1] -> 1[1] [receive] via NET/IB/1
qh100-gpu20:67697:67704 [0] NCCL INFO Channel 01/0 : 0[1] -> 1[1] [receive] via NET/IB/1
qh100-gpu20:67697:67704 [0] NCCL INFO Channel 02/0 : 0[1] -> 1[1] [receive] via NET/IB/1
qh100-gpu20:67697:67704 [0] NCCL INFO Channel 03/0 : 0[1] -> 1[1] [receive] via NET/IB/1
qh100-gpu20:67697:67704 [0] NCCL INFO Channel 00/0 : 1[1] -> 0[1] [send] via NET/IB/1
qh100-gpu20:67697:67704 [0] NCCL INFO Channel 01/0 : 1[1] -> 0[1] [send] via NET/IB/1
qh100-gpu20:67697:67704 [0] NCCL INFO Channel 02/0 : 1[1] -> 0[1] [send] via NET/IB/1
qh100-gpu20:67697:67704 [0] NCCL INFO Channel 03/0 : 1[1] -> 0[1] [send] via NET/IB/1
qh100-gpu19:87121:87137 [0] NCCL INFO NCCL_IB_GID_INDEX set by environment to 3.
qh100-gpu20:67697:67712 [0] NCCL INFO NCCL_IB_GID_INDEX set by environment to 3.
qh100-gpu19:87121:87129 [0] NCCL INFO Connected all rings
qh100-gpu19:87121:87129 [0] NCCL INFO Connected all trees
qh100-gpu19:87121:87129 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
```
And here's the result of lspci -tv, in which the distance from GPU0 to both NIC0 and NIC1 is the same:

[lspci -tv output]
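Two NCCL-independent ways to cross-check GPU-to-NIC PCIe locality (a hedged sketch; nvidia-smi topo -m and the /sys/class/infiniband symlinks are standard, but interpret the output against your own PCIe layout):

```python
import glob
import os
import subprocess

# 1) nvidia-smi prints a GPU/NIC affinity matrix (PIX = same PCIe switch,
#    PXB/NODE/SYS = progressively more distant).
subprocess.run(["nvidia-smi", "topo", "-m"], check=True)

# 2) Resolve each InfiniBand device to its PCIe path; a NIC that shares a
#    PCIe switch with a GPU shows a common upstream bridge in this path.
for dev in sorted(glob.glob("/sys/class/infiniband/*")):
    print(os.path.basename(dev), "->",
          os.path.realpath(os.path.join(dev, "device")))
```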
It seems that NCCL is building its topology from this logical topology, which does not match the actual physical topology.
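If the detected topology is indeed wrong, NCCL can dump what it built and can also be given a corrected file; a sketch using the documented NCCL_TOPO_DUMP_FILE / NCCL_TOPO_FILE environment variables (the file paths here are just examples, and both must be set before NCCL initializes):

```python
import os

# Dump the topology NCCL detected, to compare against the physical layout:
os.environ["NCCL_TOPO_DUMP_FILE"] = "/tmp/nccl_topo.xml"

# After correcting the XML to match the real PCIe hierarchy, feed it back
# so NCCL builds its graph from the fixed topology:
# os.environ["NCCL_TOPO_FILE"] = "/tmp/nccl_topo_fixed.xml"
```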