When GID is not specified in the command, it works properly (from the NCCL log, GID 0 is used in this case). However, when NCCL_IB_GID_INDEX is specified as 3 or 1 (RoCE v2), it does not work anymore:

/usr/mpi/gcc/openmpi-4.1.7a1/bin/mpirun \
    --show-progress \
    -mca plm_rsh_force_rsh 1 \
    -v -np 2 \
    -H 10.248.0.10,10.248.0.12 \
    -x NCCL_SOCKET_IFNAME=ens9f0np0 \
    -x NCCL_IB_GID_INDEX=3 \
    -x NCCL_IB_DISABLE=0 \
    -x NCCL_IB_CUDA_SUPPORT=1 \
    -x NCCL_IB_HCA=mlx5_0:1,mlx5_1:1,mlx5_2:1,mlx5_3:1,mlx5_6:1,mlx5_7:1,mlx5_8:1,mlx5_9:1 \
    -x CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
    ./build/all_reduce_perf -b 128M -e 128M -f 2 -g 8 -t 1 -c 0

App launch reported: 2 (out of 2) daemons - 1 (out of 2) procs
nThread 1 nGpus 8 minBytes 134217728 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 0 graph: 0
#
# Using devices
# Rank 0 Group 0 Pid 406222 on loving-insect device 0 [0x1b] NVIDIA H100 80GB HBM3
# Rank 1 Group 0 Pid 406222 on loving-insect device 1 [0x29] NVIDIA H100 80GB HBM3
# Rank 2 Group 0 Pid 406222 on loving-insect device 2 [0x45] NVIDIA H100 80GB HBM3
# Rank 3 Group 0 Pid 406222 on loving-insect device 3 [0x4e] NVIDIA H100 80GB HBM3
# Rank 4 Group 0 Pid 406222 on loving-insect device 4 [0x1b] NVIDIA H100 80GB HBM3
# Rank 5 Group 0 Pid 406222 on loving-insect device 5 [0x24] NVIDIA H100 80GB HBM3
# Rank 6 Group 0 Pid 406222 on loving-insect device 6 [0x45] NVIDIA H100 80GB HBM3
# Rank 7 Group 0 Pid 406222 on loving-insect device 7 [0x4e] NVIDIA H100 80GB HBM3
# Rank 8 Group 0 Pid 397262 on epic-skink device 0 [0x1b] NVIDIA H100 80GB HBM3
# Rank 9 Group 0 Pid 397262 on epic-skink device 1 [0x29] NVIDIA H100 80GB HBM3
# Rank 10 Group 0 Pid 397262 on epic-skink device 2 [0x45] NVIDIA H100 80GB HBM3
# Rank 11 Group 0 Pid 397262 on epic-skink device 3 [0x4e] NVIDIA H100 80GB HBM3
# Rank 12 Group 0 Pid 397262 on epic-skink device 4 [0x1b] NVIDIA H100 80GB HBM3
# Rank 13 Group 0 Pid 397262 on epic-skink device 5 [0x24] NVIDIA H100 80GB HBM3
# Rank 14 Group 0 Pid 397262 on epic-skink device 6 [0x45] NVIDIA H100 80GB HBM3
# Rank 15 Group 0 Pid 397262 on epic-skink device 7 [0x4e] NVIDIA H100 80GB HBM3
#
#                                                              out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
   134217728      33554432     float     sum      -1   348745    0.38    0.72    N/A   348365    0.39    0.72    N/A
# Out of bounds values : 0 OK
# Avg bus bandwidth : 0.722005
#
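For reference, the numbers above line up with the usual nccl-tests all_reduce bus-bandwidth convention: algbw = 134217728 B / 348745 us ≈ 0.38 GB/s, and busbw = algbw × 2(n−1)/n with n = 16 ranks gives 0.38 × 1.875 ≈ 0.72 GB/s, i.e. only a few Gb/s on the wire. To confirm which GID index, netdev, and link layer NCCL actually selects in this failing run, the same mpirun command can be repeated with NCCL's standard debug variables added (only the two extra -x flags below are new; everything else stays as in the command above):

    -x NCCL_DEBUG=INFO
    -x NCCL_DEBUG_SUBSYS=INIT,NET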
All the RDMA tests (including the perftest) show that GID 3 has no problem:

ib_write_bw -d mlx5_0 -i 1 -D 1 -m 4096 -q 1 -s 1000000 -x 3 -t 10 --report_gbits 11.0.0.72
RDMA_Write BW Test
Dual-port : OFF Device : mlx5_0
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
PCIe relax order: ON
ibv_wr* API : ON
TX depth : 10
CQ Moderation : 1
Mtu : 4096[B]
Link type : Ethernet
GID index : 3
Max inline data : 0[B]
rdma_cm QPs : OFF
Data ex. method : Ethernet
---------------------------------------------------------------------------------------
local address: LID 0000 QPN 0x20a3 PSN 0x653297 RKey 0x202100 VAddr 0x007feb4fcbd240
GID: 00:00:00:00:00:00:00:00:00:00:255:255:11:00:00:88
remote address: LID 0000 QPN 0x182a PSN 0xa5fa63 RKey 0x200f00 VAddr 0x007f44c98b9240
GID: 00:00:00:00:00:00:00:00:00:00:255:255:11:00:00:72
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[Gb/sec] BW average[Gb/sec] MsgRate[Mpps]
Conflicting CPU frequency values detected: 3800.000000 != 2000.000000. CPU Frequency is not max.
1000000 46672 0.00 373.37 0.046672
---------------------------------------------------------------------------------------
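For completeness: ib_write_bw runs as a client/server pair, so the 373 Gb/s result above presumably comes from a matching server instance started beforehand on the remote host (11.0.0.72) with the same options and GID index but without the peer address, along the lines of:

ib_write_bw -d mlx5_0 -i 1 -D 1 -m 4096 -q 1 -s 1000000 -x 3 -t 10 --report_gbits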
Below is the result of show_gids:
~$ show_gids
DEV PORT INDEX GID IPv4 VER DEV
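Only the show_gids header is shown above; the same GID information can also be read straight from sysfs to double-check that index 3 really is a RoCE v2 entry backed by the expected interface (a sketch assuming mlx5_0, port 1; indices that are not populated simply fail to read):

# print GID value, RoCE type, and backing netdev for the first few GID indices on mlx5_0 port 1
for i in 0 1 2 3; do
    printf "mlx5_0 port 1 GID %d: " "$i"
    cat /sys/class/infiniband/mlx5_0/ports/1/gids/$i
    printf "    type: "
    cat /sys/class/infiniband/mlx5_0/ports/1/gid_attrs/types/$i
    printf "    netdev: "
    cat /sys/class/infiniband/mlx5_0/ports/1/gid_attrs/ndevs/$i
done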
Do we know what might be the cause of the issue?