NVIDIA / nccl-tests

NCCL Tests
BSD 3-Clause "New" or "Revised" License

NCCL Test Does not work with GID 3 or GID 1, but it works fine for GID 0 #192

Open chgdragon2023 opened 7 months ago

chgdragon2023 commented 7 months ago
  1. When GID is not specified in the command, it works properly (the NCCL log shows that GID 0 is used in this case):

              /usr/mpi/gcc/openmpi-4.1.7a1/bin/mpirun  \
                         --show-progress \
                         -mca plm_rsh_force_rsh 1 \
                         -v  -np 2  \
                         -H 10.248.0.10,10.248.0.12 \
                         -x NCCL_SOCKET_IFNAME=ens9f0np0  \
                         -x NCCL_IB_CUDA_SUPPORT=1 \
                         -x NCCL_IB_HCA=mlx5_0:1,mlx5_1:1,mlx5_2:1,mlx5_3:1,mlx5_6:1,mlx5_7:1,mlx5_8:1,mlx5_9:1 \
                         -x CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7  \
                         ./build/all_reduce_perf -b 2G -e 2G -f 2 -g 8 -t 1 -c 0 
                    App launch reported: 2 (out of 2) daemons - 1 (out of 2) procs
                    # nThread 1 nGpus 8 minBytes 2147483648 maxBytes 2147483648 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 0 graph: 0
                    #
                    # Using devices
                    #  Rank  0 Group  0 Pid 405910 on loving-insect device  0 [0x1b] NVIDIA H100 80GB HBM3
                    #  Rank  1 Group  0 Pid 405910 on loving-insect device  1 [0x29] NVIDIA H100 80GB HBM3
                    #  Rank  2 Group  0 Pid 405910 on loving-insect device  2 [0x45] NVIDIA H100 80GB HBM3
                    #  Rank  3 Group  0 Pid 405910 on loving-insect device  3 [0x4e] NVIDIA H100 80GB HBM3
                    #  Rank  4 Group  0 Pid 405910 on loving-insect device  4 [0x1b] NVIDIA H100 80GB HBM3
                    #  Rank  5 Group  0 Pid 405910 on loving-insect device  5 [0x24] NVIDIA H100 80GB HBM3
                    #  Rank  6 Group  0 Pid 405910 on loving-insect device  6 [0x45] NVIDIA H100 80GB HBM3
                    #  Rank  7 Group  0 Pid 405910 on loving-insect device  7 [0x4e] NVIDIA H100 80GB HBM3
                    #  Rank  8 Group  0 Pid 396902 on epic-skink device  0 [0x1b] NVIDIA H100 80GB HBM3
                    #  Rank  9 Group  0 Pid 396902 on epic-skink device  1 [0x29] NVIDIA H100 80GB HBM3
                    #  Rank 10 Group  0 Pid 396902 on epic-skink device  2 [0x45] NVIDIA H100 80GB HBM3
                    #  Rank 11 Group  0 Pid 396902 on epic-skink device  3 [0x4e] NVIDIA H100 80GB HBM3
                    #  Rank 12 Group  0 Pid 396902 on epic-skink device  4 [0x1b] NVIDIA H100 80GB HBM3
                    #  Rank 13 Group  0 Pid 396902 on epic-skink device  5 [0x24] NVIDIA H100 80GB HBM3
                    #  Rank 14 Group  0 Pid 396902 on epic-skink device  6 [0x45] NVIDIA H100 80GB HBM3
                    #  Rank 15 Group  0 Pid 396902 on epic-skink device  7 [0x4e] NVIDIA H100 80GB HBM3
                    #
                    #                                                              out-of-place                       in-place          
                    #       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
                    #        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
                      2147483648     536870912     float     sum      -1    13194  162.77  305.19    N/A    12916  166.27  311.75    N/A
                    # Out of bounds values : 0 OK
                    # Avg bus bandwidth    : 308.468 
  2. However, when NCCL_IB_GID_INDEX is specified as 3 or 1 (RoCEv2), the test no longer works properly (bandwidth collapses to under 1 GB/s):

              /usr/mpi/gcc/openmpi-4.1.7a1/bin/mpirun  \
                         --show-progress \
                         -mca plm_rsh_force_rsh 1 \
                         -v  -np 2  \
                         -H 10.248.0.10,10.248.0.12 \
                         -x NCCL_SOCKET_IFNAME=ens9f0np0  \
                         -x NCCL_IB_GID_INDEX=3 \
                         -x NCCL_IB_DISABLE=0 \
                         -x NCCL_IB_CUDA_SUPPORT=1 \
                         -x NCCL_IB_HCA=mlx5_0:1,mlx5_1:1,mlx5_2:1,mlx5_3:1,mlx5_6:1,mlx5_7:1,mlx5_8:1,mlx5_9:1 \
                         -x CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7  \
                         ./build/all_reduce_perf -b 128M -e 128M -f 2 -g 8 -t 1 -c 0
                    App launch reported: 2 (out of 2) daemons - 1 (out of 2) procs
                # nThread 1 nGpus 8 minBytes 134217728 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 0 graph: 0
                #
                # Using devices
                #  Rank  0 Group  0 Pid 406222 on loving-insect device  0 [0x1b] NVIDIA H100 80GB HBM3
                #  Rank  1 Group  0 Pid 406222 on loving-insect device  1 [0x29] NVIDIA H100 80GB HBM3
                #  Rank  2 Group  0 Pid 406222 on loving-insect device  2 [0x45] NVIDIA H100 80GB HBM3
                #  Rank  3 Group  0 Pid 406222 on loving-insect device  3 [0x4e] NVIDIA H100 80GB HBM3
                #  Rank  4 Group  0 Pid 406222 on loving-insect device  4 [0x1b] NVIDIA H100 80GB HBM3
                #  Rank  5 Group  0 Pid 406222 on loving-insect device  5 [0x24] NVIDIA H100 80GB HBM3
                #  Rank  6 Group  0 Pid 406222 on loving-insect device  6 [0x45] NVIDIA H100 80GB HBM3
                #  Rank  7 Group  0 Pid 406222 on loving-insect device  7 [0x4e] NVIDIA H100 80GB HBM3
                #  Rank  8 Group  0 Pid 397262 on epic-skink device  0 [0x1b] NVIDIA H100 80GB HBM3
                #  Rank  9 Group  0 Pid 397262 on epic-skink device  1 [0x29] NVIDIA H100 80GB HBM3
                #  Rank 10 Group  0 Pid 397262 on epic-skink device  2 [0x45] NVIDIA H100 80GB HBM3
                #  Rank 11 Group  0 Pid 397262 on epic-skink device  3 [0x4e] NVIDIA H100 80GB HBM3
                #  Rank 12 Group  0 Pid 397262 on epic-skink device  4 [0x1b] NVIDIA H100 80GB HBM3
                #  Rank 13 Group  0 Pid 397262 on epic-skink device  5 [0x24] NVIDIA H100 80GB HBM3
                #  Rank 14 Group  0 Pid 397262 on epic-skink device  6 [0x45] NVIDIA H100 80GB HBM3
                #  Rank 15 Group  0 Pid 397262 on epic-skink device  7 [0x4e] NVIDIA H100 80GB HBM3
                #
                #                                                              out-of-place                       in-place          
                #       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
                #        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
                   134217728      33554432     float     sum      -1   348745    0.38    0.72    N/A   348365    0.39    0.72    N/A
                # Out of bounds values : 0 OK
                # Avg bus bandwidth    : 0.722005 
                #
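One way to narrow down where the GID-3 run loses bandwidth (not something tried in the report above) would be to repeat the step-2 command with NCCL's debug logging turned on, so the log shows which NIC, GID, and traffic class each rank actually picked. `NCCL_DEBUG` and `NCCL_DEBUG_SUBSYS` are standard NCCL environment variables; the mpirun path, hosts, and other flags are copied from the report. A minimal sketch that only assembles and prints the command (it needs the actual two-node cluster to run):

```shell
# Hedged sketch: re-run step 2 with NCCL debug logging enabled.
# NCCL_DEBUG=INFO and NCCL_DEBUG_SUBSYS=INIT,NET are standard NCCL
# environment variables; everything else is copied from the report above.
MPIRUN=/usr/mpi/gcc/openmpi-4.1.7a1/bin/mpirun
DEBUG_FLAGS="-x NCCL_DEBUG=INFO -x NCCL_DEBUG_SUBSYS=INIT,NET"
CMD="$MPIRUN --show-progress -mca plm_rsh_force_rsh 1 -v -np 2 \
-H 10.248.0.10,10.248.0.12 \
-x NCCL_SOCKET_IFNAME=ens9f0np0 \
-x NCCL_IB_GID_INDEX=3 \
$DEBUG_FLAGS \
./build/all_reduce_perf -b 128M -e 128M -f 2 -g 8 -t 1 -c 0"
# Print instead of executing: the command needs the actual cluster.
echo "$CMD"
```

The resulting log lines (tagged `NCCL INFO NET/IB`) would confirm whether the GID index and traffic class NCCL selects match the show_gids table in step 4.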
  3. All the RDMA tests (including perftest) show no problem with GID 3:

                        ib_write_bw -d mlx5_0 -i 1 -D 1 -m 4096 -q 1 -s 1000000 -x 3 -t 10 --report_gbits 11.0.0.72

                          RDMA_Write BW Test
       Dual-port       : OFF          Device         : mlx5_0
       Number of qps   : 1            Transport type : IB
       Connection type : RC           Using SRQ      : OFF
       PCIe relax order: ON
       ibv_wr* API     : ON
       TX depth        : 10
       CQ Moderation   : 1
       Mtu             : 4096[B]
       Link type       : Ethernet
       GID index       : 3
       Max inline data : 0[B]
       rdma_cm QPs     : OFF
       Data ex. method : Ethernet
      ---------------------------------------------------------------------------------------
       local address: LID 0000 QPN 0x20a3 PSN 0x653297 RKey 0x202100 VAddr 0x007feb4fcbd240
       GID: 00:00:00:00:00:00:00:00:00:00:255:255:11:00:00:88
       remote address: LID 0000 QPN 0x182a PSN 0xa5fa63 RKey 0x200f00 VAddr 0x007f44c98b9240
       GID: 00:00:00:00:00:00:00:00:00:00:255:255:11:00:00:72
      ---------------------------------------------------------------------------------------
       #bytes     #iterations    BW peak[Gb/sec]    BW average[Gb/sec]   MsgRate[Mpps]
      Conflicting CPU frequency values detected: 3800.000000 != 2000.000000. CPU Frequency is not max.
       1000000    46672            0.00               373.37             0.046672
      ---------------------------------------------------------------------------------------
  4. Below is the result of show_gids:

                  ~$ show_gids
                  DEV     PORT    INDEX   GID                                     IPv4            VER     DEV
                  mlx5_0  1       0       fe80:0000:0000:0000:a288:c2ff:fe5b:03ec                 v1      ens3np0
                  mlx5_0  1       1       fe80:0000:0000:0000:a288:c2ff:fe5b:03ec                 v2      ens3np0
                  mlx5_0  1       2       0000:0000:0000:0000:0000:ffff:0b00:0048 11.0.0.72       v1      ens3np0
                  mlx5_0  1       3       0000:0000:0000:0000:0000:ffff:0b00:0048 11.0.0.72       v2      ens3np0
                  mlx5_1  1       0       fe80:0000:0000:0000:a288:c2ff:fe5b:0374                 v1      ens2np0
                  mlx5_1  1       1       fe80:0000:0000:0000:a288:c2ff:fe5b:0374                 v2      ens2np0
                  mlx5_1  1       2       0000:0000:0000:0000:0000:ffff:0b00:0049 11.0.0.73       v1      ens2np0
                  mlx5_1  1       3       0000:0000:0000:0000:0000:ffff:0b00:0049 11.0.0.73       v2      ens2np0
                  mlx5_2  1       0       fe80:0000:0000:0000:a288:c2ff:fe5b:05e4                 v1      ens4np0
                  mlx5_2  1       1       fe80:0000:0000:0000:a288:c2ff:fe5b:05e4                 v2      ens4np0
                  mlx5_2  1       2       0000:0000:0000:0000:0000:ffff:0b00:004a 11.0.0.74       v1      ens4np0
                  mlx5_2  1       3       0000:0000:0000:0000:0000:ffff:0b00:004a 11.0.0.74       v2      ens4np0
                  mlx5_3  1       0       fe80:0000:0000:0000:a288:c2ff:fe5b:05dc                 v1      ens1np0
                  mlx5_3  1       1       fe80:0000:0000:0000:a288:c2ff:fe5b:05dc                 v2      ens1np0
                  mlx5_3  1       2       0000:0000:0000:0000:0000:ffff:0b00:004b 11.0.0.75       v1      ens1np0
                  mlx5_3  1       3       0000:0000:0000:0000:0000:ffff:0b00:004b 11.0.0.75       v2      ens1np0
                  mlx5_4  1       0       fe80:0000:0000:0000:a288:c2ff:fe4d:3702                 v1      ens9f0np0
                  mlx5_4  1       1       fe80:0000:0000:0000:a288:c2ff:fe4d:3702                 v2      ens9f0np0
                  mlx5_4  1       2       0000:0000:0000:0000:0000:ffff:0af8:000a 10.248.0.10     v1      ens9f0np0
                  mlx5_4  1       3       0000:0000:0000:0000:0000:ffff:0af8:000a 10.248.0.10     v2      ens9f0np0
                  mlx5_5  1       0       fe80:0000:0000:0000:a288:c2ff:fe4d:3703                 v1      ens9f1np1
                  mlx5_5  1       1       fe80:0000:0000:0000:a288:c2ff:fe4d:3703                 v2      ens9f1np1
                  mlx5_6  1       0       fe80:0000:0000:0000:a288:c2ff:fe5a:e4f4                 v1      enP1s8np0
                  mlx5_6  1       1       fe80:0000:0000:0000:a288:c2ff:fe5a:e4f4                 v2      enP1s8np0
                  mlx5_6  1       2       0000:0000:0000:0000:0000:ffff:0b00:004c 11.0.0.76       v1      enP1s8np0
                  mlx5_6  1       3       0000:0000:0000:0000:0000:ffff:0b00:004c 11.0.0.76       v2      enP1s8np0
                  mlx5_7  1       0       fe80:0000:0000:0000:a288:c2ff:fe59:89bc                 v1      enP1s7np0
                  mlx5_7  1       1       fe80:0000:0000:0000:a288:c2ff:fe59:89bc                 v2      enP1s7np0
                  mlx5_7  1       2       0000:0000:0000:0000:0000:ffff:0b00:004d 11.0.0.77       v1      enP1s7np0
                  mlx5_7  1       3       0000:0000:0000:0000:0000:ffff:0b00:004d 11.0.0.77       v2      enP1s7np0
                  mlx5_8  1       0       fe80:0000:0000:0000:a288:c2ff:fe59:9324                 v1      enP1s6np0
                  mlx5_8  1       1       fe80:0000:0000:0000:a288:c2ff:fe59:9324                 v2      enP1s6np0
                  mlx5_8  1       2       0000:0000:0000:0000:0000:ffff:0b00:004e 11.0.0.78       v1      enP1s6np0
                  mlx5_8  1       3       0000:0000:0000:0000:0000:ffff:0b00:004e 11.0.0.78       v2      enP1s6np0
                  mlx5_9  1       0       fe80:0000:0000:0000:a288:c2ff:fe5b:05f4                 v1      enP1s5np0
                  mlx5_9  1       1       fe80:0000:0000:0000:a288:c2ff:fe5b:05f4                 v2      enP1s5np0
                  mlx5_9  1       2       0000:0000:0000:0000:0000:ffff:0b00:004f 11.0.0.79       v1      enP1s5np0
                  mlx5_9  1       3       0000:0000:0000:0000:0000:ffff:0b00:004f 11.0.0.79       v2      enP1s5np0
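As a cross-check on the show_gids table above, the same information can be read straight from sysfs: `/sys/class/infiniband/<dev>/ports/<port>/gids/<index>` and the matching `gid_attrs/types/<index>` file are the standard rdma-core layout, and the device names are taken from the report. A hedged sketch (`gid3_info` is my own helper name, not part of any tool):

```shell
# Hedged sketch: confirm from sysfs that GID index 3 really is a RoCE v2
# entry on the HCAs handed to NCCL. The /sys/class/infiniband layout is the
# standard rdma-core one; device names are copied from the report above.
gid3_info() {
  dev=$1
  gid_file=/sys/class/infiniband/$dev/ports/1/gids/3
  type_file=/sys/class/infiniband/$dev/ports/1/gid_attrs/types/3
  if [ -r "$gid_file" ] && [ -r "$type_file" ]; then
    # On the reported system this should print the IPv4-mapped GID and "RoCE v2".
    echo "$dev index 3: $(cat "$gid_file") ($(cat "$type_file"))"
  else
    echo "$dev: no GID 3 visible in sysfs on this machine"
  fi
}
for dev in mlx5_0 mlx5_4; do
  gid3_info "$dev"
done
```

If these entries disagree with what the NCCL log reports for the GID-3 run, that mismatch would be a useful data point for the issue.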

Do we know what might be the cause of this issue?