I have a bonded RDMA device built from two underlying devices, each with 100 Gb/s of bandwidth. ib_write_bw reaches about 185 Gb/s over the bond, but nccl-tests only reaches about 95 Gb/s, and our monitoring shows traffic on only one of the two underlying devices. Is this expected?
My environment information follows. Please feel free to ask for anything else you need.
Looking forward to your reply. Thanks!
The ib_write_bw command and its output are as follows:
ib_write_bw -x 3 -q 8 -d mlx5_bond_1 {serverip} --report_gbits --run_infinitely
RDMA_Write BW Test
Dual-port : OFF Device : mlx5_bond_1
Number of qps : 8 Transport type : IB
Connection type : RC Using SRQ : OFF
PCIe relax order: ON
ibv_wr* API: ON
TX depth : 128
CQ Moderation : 1
Mtu : 1024[B]
Link type : Ethernet
GID index : 3
Max inline data : 0[B]
rdma_cm QPs : OFF
Data ex. method : Ethernet
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[Gb/sec] BW average[Gb/sec] MsgRate[Mpps]
65536 1765470 185.37 184.94 0.352743
For nccl-tests, the highest busbw we have ever observed is 11.xx GB/s.
mlx5_bond_1 is a bonded RDMA network device built from two devices/network interfaces.
cat /proc/net/bonding/bondpcie1 (bondpcie1 corresponds to mlx5_bond_1) shows that the bond mode is 802.3ad and the Transmit Hash Policy is layer3+4 (1). Each network interface has 100 Gb/s of bandwidth.
ibstatus for mlx5_bond_1 reports a rate of 100 Gb/sec (4X EDR).
The nccl-tests version is 2.10.3 and the NCCL version is 2.14.3.
The device is a ConnectX-6 Dx, the OFED driver is MLNX_OFED_LINUX-5.8-3.0.7.0-rhel7.9-x86_64.iso, and the Linux kernel is 3.10.0.
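A note on units, since the two tools report differently: ib_write_bw (with --report_gbits) prints Gb/s, while nccl-tests prints busbw in GB/s. The small sketch below only does that conversion, so the figures above can be compared directly. The function names are made up for illustration, and the inputs are the rounded numbers quoted in this post, not new measurements.

```python
# Convert between gigabits/s (ib_write_bw --report_gbits) and
# gigabytes/s (nccl-tests busbw). Helper names are hypothetical.

BITS_PER_BYTE = 8


def gbits_to_gbytes(gbits_per_s: float) -> float:
    """Gb/s -> GB/s."""
    return gbits_per_s / BITS_PER_BYTE


def gbytes_to_gbits(gbytes_per_s: float) -> float:
    """GB/s -> Gb/s."""
    return gbytes_per_s * BITS_PER_BYTE


if __name__ == "__main__":
    # ~95 Gb/s (what nccl-tests reaches) expressed in nccl-tests units
    print(f"95 Gb/s  = {gbits_to_gbytes(95):.3f} GB/s")
    # ~185 Gb/s (what ib_write_bw reaches) expressed in nccl-tests units
    print(f"185 Gb/s = {gbits_to_gbytes(185):.3f} GB/s")
```

In other words, a busbw just under 12 GB/s corresponds to roughly one 100 Gb/s link, whereas the ib_write_bw figure would correspond to roughly 23 GB/s.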