NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

Bandwidth is different for GPU 0,1 and GPU 6,7 #1435

Closed JuiceLemonLemon closed 2 months ago

JuiceLemonLemon commented 2 months ago

Hello, I am seeing different bandwidth depending on which GPU pair I use: all_gather_perf on GPUs 0,1 is slower than the same test on GPUs 6,7.

export CUDA_VISIBLE_DEVICES=0,1
./build/all_gather_perf -b 16M -e 1024M -i 16777216 -g 2 -d bfloat16

#                                                              out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
    16777216       4194304  bfloat16    none      -1    108.8  154.20   77.10      0    98.19  170.86   85.43      0
    33554432       8388608  bfloat16    none      -1    183.0  183.33   91.66      0    156.1  214.95  107.48      0
    50331648      12582912  bfloat16    none      -1    237.9  211.57  105.78      0    200.0  251.62  125.81      0
    67108864      16777216  bfloat16    none      -1    292.3  229.59  114.79      0    256.5  261.63  130.82      0
    83886080      20971520  bfloat16    none      -1    354.2  236.81  118.41      0    307.0  273.22  136.61      0
   100663296      25165824  bfloat16    none      -1    400.9  251.11  125.56      0    348.1  289.19  144.59      0
   117440512      29360128  bfloat16    none      -1    490.8  239.27  119.64      0    418.7  280.48  140.24      0
   134217728      33554432  bfloat16    none      -1    545.8  245.92  122.96      0    469.0  286.18  143.09      0
   150994944      37748736  bfloat16    none      -1    595.1  253.72  126.86      0    516.2  292.51  146.25      0
   167772160      41943040  bfloat16    none      -1    669.6  250.57  125.29      0    573.0  292.82  146.41      0
   184549376      46137344  bfloat16    none      -1    717.8  257.12  128.56      0    626.4  294.60  147.30      0
   201326592      50331648  bfloat16    none      -1    770.8  261.19  130.59      0    666.0  302.29  151.14      0
   218103808      54525952  bfloat16    none      -1    868.7  251.06  125.53      0    738.5  295.35  147.68      0
   234881024      58720256  bfloat16    none      -1    911.7  257.62  128.81      0    783.1  299.95  149.98      0
   251658240      62914560  bfloat16    none      -1    958.0  262.69  131.34      0    828.8  303.64  151.82      0
   268435456      67108864  bfloat16    none      -1   1037.2  258.80  129.40      0    886.9  302.67  151.33      0
   285212672      71303168  bfloat16    none      -1   1071.9  266.08  133.04      0    936.4  304.59  152.30      0
   301989888      75497472  bfloat16    none      -1   1130.7  267.09  133.54      0    976.0  309.43  154.71      0
   318767104      79691776  bfloat16    none      -1   1236.4  257.82  128.91      0   1055.0  302.14  151.07      0
   335544320      83886080  bfloat16    none      -1   1267.8  264.67  132.33      0   1093.6  306.83  153.41      0
   352321536      88080384  bfloat16    none      -1   1314.0  268.13  134.06      0   1134.6  310.53  155.27      0
   369098752      92274688  bfloat16    none      -1   1395.6  264.47  132.23      0   1199.5  307.71  153.85      0
   385875968      96468992  bfloat16    none      -1   1425.1  270.78  135.39      0   1236.3  312.13  156.07      0
   402653184     100663296  bfloat16    none      -1   1483.5  271.41  135.71      0   1276.9  315.33  157.67      0
   419430400     104857600  bfloat16    none      -1   1594.5  263.06  131.53      0   1362.0  307.96  153.98      0
   436207616     109051904  bfloat16    none      -1   1609.7  270.99  135.49      0   1399.9  311.61  155.80      0
   452984832     113246208  bfloat16    none      -1   1669.3  271.36  135.68      0   1437.5  315.12  157.56      0
   469762048     117440512  bfloat16    none      -1   1750.1  268.42  134.21      0   1501.0  312.96  156.48      0
   486539264     121634816  bfloat16    none      -1   1771.3  274.68  137.34      0   1542.5  315.43  157.72      0
   503316480     125829120  bfloat16    none      -1   1831.1  274.88  137.44      0   1576.1  319.35  159.67      0
   520093696     130023424  bfloat16    none      -1   1951.0  266.57  133.29      0   1670.2  311.40  155.70      0
   536870912     134217728  bfloat16    none      -1   1942.6  276.37  138.18      0   1697.9  316.19  158.10      0
   553648128     138412032  bfloat16    none      -1   2008.0  275.73  137.86      0   1730.5  319.93  159.97      0
   570425344     142606336  bfloat16    none      -1   2099.3  271.72  135.86      0   1804.0  316.19  158.10      0
   587202560     146800640  bfloat16    none      -1   2111.3  278.13  139.06      0   1840.5  319.05  159.52      0
   603979776     150994944  bfloat16    none      -1   2173.0  277.95  138.98      0   1870.7  322.87  161.43      0
   620756992     155189248  bfloat16    none      -1   2297.0  270.25  135.12      0   1969.2  315.23  157.62      0
   637534208     159383552  bfloat16    none      -1   2291.0  278.27  139.14      0   1988.3  320.64  160.32      0
   654311424     163577856  bfloat16    none      -1   2349.4  278.50  139.25      0   2022.8  323.48  161.74      0
   671088640     167772160  bfloat16    none      -1   2446.9  274.26  137.13      0   2096.5  320.10  160.05      0
   687865856     171966464  bfloat16    none      -1   2454.2  280.28  140.14      0   2137.6  321.79  160.90      0
   704643072     176160768  bfloat16    none      -1   2508.4  280.92  140.46      0   2160.9  326.09  163.04      0
   721420288     180355072  bfloat16    none      -1   2626.9  274.63  137.31      0   2267.7  318.13  159.06      0
   738197504     184549376  bfloat16    none      -1   2633.6  280.30  140.15      0   2288.0  322.64  161.32      0
   754974720     188743680  bfloat16    none      -1   2687.0  280.97  140.49      0   2297.1  328.67  164.33      0
   771751936     192937984  bfloat16    none      -1   2786.5  276.96  138.48      0   2384.8  323.62  161.81      0
   788529152     197132288  bfloat16    none      -1   2787.1  282.92  141.46      0   2427.4  324.84  162.42      0
   805306368     201326592  bfloat16    none      -1   2847.5  282.81  141.40      0   2444.1  329.49  164.74      0
   822083584     205520896  bfloat16    none      -1   2966.7  277.10  138.55      0   2559.4  321.20  160.60      0
   838860800     209715200  bfloat16    none      -1   2957.0  283.69  141.84      0   2581.7  324.92  162.46      0
   855638016     213909504  bfloat16    none      -1   3013.0  283.98  141.99      0   2588.5  330.56  165.28      0
   872415232     218103808  bfloat16    none      -1   3097.3  281.67  140.84      0   2686.0  324.80  162.40      0
   889192448     222298112  bfloat16    none      -1   3117.3  285.25  142.62      0   2717.3  327.23  163.62      0
   905969664     226492416  bfloat16    none      -1   3181.6  284.75  142.38      0   2728.7  332.02  166.01      0
   922746880     230686720  bfloat16    none      -1   3292.5  280.25  140.13      0   2840.3  324.87  162.44      0
   939524096     234881024  bfloat16    none      -1   3287.6  285.78  142.89      0   2875.9  326.69  163.35      0
   956301312     239075328  bfloat16    none      -1   3336.9  286.58  143.29      0   2860.7  334.28  167.14      0
   973078528     243269632  bfloat16    none      -1   3427.7  283.89  141.94      0   2984.0  326.10  163.05      0
   989855744     247463936  bfloat16    none      -1   3445.7  287.28  143.64      0   3006.8  329.21  164.60      0
  1006632960     251658240  bfloat16    none      -1   3492.3  288.24  144.12      0   3015.7  333.80  166.90      0
  1023410176     255852544  bfloat16    none      -1   3631.4  281.83  140.91      0   3124.5  327.54  163.77      0
  1040187392     260046848  bfloat16    none      -1   3608.3  288.28  144.14      0   3153.1  329.90  164.95      0
  1056964608     264241152  bfloat16    none      -1   3651.1  289.50  144.75      0   3154.7  335.05  167.52      0
  1073741824     268435456  bfloat16    none      -1   3770.8  284.75  142.37      0   3248.2  330.56  165.28      0
swat1-05:2408678:2408678 [0] NCCL INFO comm 0x3481380 rank 0 nranks 2 cudaDev 0 busId 7000 - Destroy COMPLETE
swat1-05:2408678:2408678 [1] NCCL INFO comm 0x34b8b20 rank 1 nranks 2 cudaDev 1 busId b000 - Destroy COMPLETE
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 144.144
#

export CUDA_VISIBLE_DEVICES=6,7
./build/all_gather_perf -b 16M -e 1024M -i 16777216 -g 2 -d bfloat16

#                                                              out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
    16777216       4194304  bfloat16    none      -1    113.3  148.10   74.05      0    98.51  170.30   85.15      0
    33554432       8388608  bfloat16    none      -1    184.2  182.20   91.10      0    150.6  222.85  111.42      0
    50331648      12582912  bfloat16    none      -1    223.8  224.88  112.44      0    191.6  262.63  131.32      0
    67108864      16777216  bfloat16    none      -1    274.1  244.84  122.42      0    242.6  276.67  138.33      0
    83886080      20971520  bfloat16    none      -1    335.0  250.38  125.19      0    295.5  283.86  141.93      0
   100663296      25165824  bfloat16    none      -1    377.4  266.75  133.37      0    335.6  299.96  149.98      0
   117440512      29360128  bfloat16    none      -1    451.8  259.94  129.97      0    396.1  296.52  148.26      0
   134217728      33554432  bfloat16    none      -1    503.1  266.79  133.40      0    447.8  299.73  149.86      0
   150994944      37748736  bfloat16    none      -1    557.0  271.08  135.54      0    489.1  308.74  154.37      0
   167772160      41943040  bfloat16    none      -1    621.6  269.90  134.95      0    541.6  309.78  154.89      0
   184549376      46137344  bfloat16    none      -1    666.9  276.72  138.36      0    594.6  310.38  155.19      0
   201326592      50331648  bfloat16    none      -1    718.8  280.10  140.05      0    633.3  317.92  158.96      0
   218103808      54525952  bfloat16    none      -1    799.0  272.96  136.48      0    697.0  312.91  156.46      0
   234881024      58720256  bfloat16    none      -1    837.3  280.54  140.27      0    740.3  317.27  158.64      0
   251658240      62914560  bfloat16    none      -1    889.1  283.05  141.53      0    782.8  321.50  160.75      0
   268435456      67108864  bfloat16    none      -1    955.6  280.92  140.46      0    835.2  321.40  160.70      0
   285212672      71303168  bfloat16    none      -1    987.4  288.85  144.43      0    881.1  323.71  161.86      0
   301989888      75497472  bfloat16    none      -1   1050.5  287.48  143.74      0    920.0  328.23  164.12      0
   318767104      79691776  bfloat16    none      -1   1131.4  281.75  140.87      0    984.9  323.64  161.82      0
   335544320      83886080  bfloat16    none      -1   1155.2  290.46  145.23      0   1028.3  326.31  163.16      0
   352321536      88080384  bfloat16    none      -1   1216.1  289.71  144.85      0   1061.5  331.91  165.96      0
   369098752      92274688  bfloat16    none      -1   1286.5  286.90  143.45      0   1118.7  329.94  164.97      0
   385875968      96468992  bfloat16    none      -1   1298.2  297.24  148.62      0   1165.0  331.22  165.61      0
   402653184     100663296  bfloat16    none      -1   1368.0  294.34  147.17      0   1191.4  337.97  168.99      0
   419430400     104857600  bfloat16    none      -1   1452.1  288.85  144.42      0   1265.8  331.36  165.68      0
   436207616     109051904  bfloat16    none      -1   1456.7  299.45  149.73      0   1307.8  333.53  166.77      0
   452984832     113246208  bfloat16    none      -1   1531.4  295.80  147.90      0   1332.9  339.84  169.92      0
   469762048     117440512  bfloat16    none      -1   1598.2  293.94  146.97      0   1391.6  337.58  168.79      0
   486539264     121634816  bfloat16    none      -1   1607.0  302.75  151.38      0   1430.6  340.08  170.04      0
   503316480     125829120  bfloat16    none      -1   1682.3  299.19  149.59      0   1457.3  345.37  172.69      0
   520093696     130023424  bfloat16    none      -1   1760.9  295.35  147.68      0   1538.2  338.11  169.05      0
   536870912     134217728  bfloat16    none      -1   1754.4  306.02  153.01      0   1582.0  339.37  169.68      0
   553648128     138412032  bfloat16    none      -1   1839.8  300.93  150.47      0   1597.8  346.50  173.25      0
   570425344     142606336  bfloat16    none      -1   1901.7  299.95  149.98      0   1659.4  343.76  171.88      0
   587202560     146800640  bfloat16    none      -1   1916.3  306.42  153.21      0   1703.9  344.61  172.31      0
   603979776     150994944  bfloat16    none      -1   1985.8  304.15  152.08      0   1724.4  350.26  175.13      0
   620756992     155189248  bfloat16    none      -1   2068.0  300.18  150.09      0   1797.4  345.36  172.68      0
   637534208     159383552  bfloat16    none      -1   2060.5  309.41  154.70      0   1843.5  345.83  172.91      0
   654311424     163577856  bfloat16    none      -1   2133.9  306.62  153.31      0   1859.0  351.97  175.98      0
   671088640     167772160  bfloat16    none      -1   2206.5  304.15  152.07      0   1925.2  348.58  174.29      0
   687865856     171966464  bfloat16    none      -1   2210.3  311.21  155.61      0   1966.9  349.72  174.86      0
   704643072     176160768  bfloat16    none      -1   2281.2  308.89  154.44      0   1996.1  353.01  176.50      0
   721420288     180355072  bfloat16    none      -1   2367.9  304.67  152.33      0   2065.2  349.32  174.66      0
   738197504     184549376  bfloat16    none      -1   2359.2  312.91  156.45      0   2111.7  349.58  174.79      0
   754974720     188743680  bfloat16    none      -1   2435.9  309.93  154.97      0   2108.3  358.10  179.05      0
   771751936     192937984  bfloat16    none      -1   2499.4  308.78  154.39      0   2189.9  352.42  176.21      0
   788529152     197132288  bfloat16    none      -1   2508.4  314.35  157.18      0   2224.6  354.45  177.23      0
   805306368     201326592  bfloat16    none      -1   2563.1  314.20  157.10      0   2264.3  355.65  177.83      0
   822083584     205520896  bfloat16    none      -1   2650.8  310.13  155.06      0   2336.2  351.89  175.94      0
   838860800     209715200  bfloat16    none      -1   2650.1  316.54  158.27      0   2355.5  356.13  178.07      0
   855638016     213909504  bfloat16    none      -1   2731.3  313.27  156.64      0   2390.7  357.90  178.95      0
   872415232     218103808  bfloat16    none      -1   2795.7  312.05  156.03      0   2453.7  355.55  177.78      0
   889192448     222298112  bfloat16    none      -1   2809.4  316.51  158.25      0   2485.0  357.83  178.91      0
   905969664     226492416  bfloat16    none      -1   2852.6  317.59  158.80      0   2517.1  359.92  179.96      0
   922746880     230686720  bfloat16    none      -1   2940.4  313.82  156.91      0   2598.5  355.11  177.55      0
   939524096     234881024  bfloat16    none      -1   2955.7  317.87  158.93      0   2617.4  358.95  179.48      0
   956301312     239075328  bfloat16    none      -1   3012.1  317.49  158.74      0   2653.4  360.41  180.21      0
   973078528     243269632  bfloat16    none      -1   3092.3  314.68  157.34      0   2728.0  356.70  178.35      0
   989855744     247463936  bfloat16    none      -1   3095.7  319.75  159.88      0   2746.3  360.43  180.21      0
  1006632960     251658240  bfloat16    none      -1   3145.3  320.05  160.02      0   2779.6  362.15  181.08      0
  1023410176     255852544  bfloat16    none      -1   3242.5  315.62  157.81      0   2857.3  358.18  179.09      0
  1040187392     260046848  bfloat16    none      -1   3246.4  320.41  160.21      0   2883.0  360.80  180.40      0
  1056964608     264241152  bfloat16    none      -1   3291.2  321.15  160.57      0   2926.9  361.12  180.56      0
  1073741824     268435456  bfloat16    none      -1   3378.7  317.79  158.90      0   2986.7  359.51  179.75      0
swat1-05:2409330:2409330 [0] NCCL INFO comm 0x3695380 rank 0 nranks 2 cudaDev 0 busId c8000 - Destroy COMPLETE
swat1-05:2409330:2409330 [1] NCCL INFO comm 0x36ccb20 rank 1 nranks 2 cudaDev 1 busId cb000 - Destroy COMPLETE
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 156.41
#
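
For readers comparing the two runs, here is a minimal sketch of how the algbw and busbw columns relate to size and time. It assumes the all-gather bus bandwidth formula busbw = algbw * (n - 1) / n described in the nccl-tests performance notes, with n = 2 ranks; the function names are made up for illustration and are not part of nccl-tests.

```python
# Sketch only: reproduce the algbw/busbw columns from size and time.
# Assumption: for all_gather, busbw = algbw * (n - 1) / n, where n is the
# number of ranks (2 in both runs above). Helper names are hypothetical.

def algbw_gbps(size_bytes: int, time_us: float) -> float:
    """Algorithm bandwidth in GB/s, with 1 GB = 1e9 bytes."""
    return size_bytes / (time_us * 1e-6) / 1e9

def allgather_busbw_gbps(algbw: float, nranks: int) -> float:
    """Bus bandwidth for all_gather under the assumed formula."""
    return algbw * (nranks - 1) / nranks

# First out-of-place row of the GPU 0,1 run: 16777216 bytes in 108.8 us.
alg = algbw_gbps(16777216, 108.8)
bus = allgather_busbw_gbps(alg, 2)
print(f"algbw={alg:.2f} GB/s  busbw={bus:.2f} GB/s")

# Relative gap between the two reported "Avg bus bandwidth" values.
gap = (156.41 - 144.144) / 144.144
print(f"GPU 6,7 average is {gap:.1%} higher than GPU 0,1")
```

Under these assumptions the first row matches the reported 154.20 and 77.10 GB/s, and the gap between the two averages is roughly 8.5%.
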
sjeaugey commented 2 months ago

What is your node topology, i.e. the output of nvidia-smi topo -m?

JuiceLemonLemon commented 2 months ago

> What is your node topology, i.e. the output of nvidia-smi topo -m?

        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    NIC0    NIC1    NIC2    NIC3    NIC4    NIC5    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NV12    NV12    NV12    NV12    NV12    NV12    NV12    SYS     PXB     SYS     SYS     SYS     SYS     24-31,88-95     3               N/A
GPU1    NV12     X      NV12    NV12    NV12    NV12    NV12    NV12    SYS     PXB     SYS     SYS     SYS     SYS     24-31,88-95     3               N/A
GPU2    NV12    NV12     X      NV12    NV12    NV12    NV12    NV12    PXB     SYS     SYS     SYS     SYS     SYS     8-15,72-79      1               N/A
GPU3    NV12    NV12    NV12     X      NV12    NV12    NV12    NV12    PXB     SYS     SYS     SYS     SYS     SYS     8-15,72-79      1               N/A
GPU4    NV12    NV12    NV12    NV12     X      NV12    NV12    NV12    SYS     SYS     SYS     SYS     SYS     PXB     56-63,120-127   7               N/A
GPU5    NV12    NV12    NV12    NV12    NV12     X      NV12    NV12    SYS     SYS     SYS     SYS     SYS     PXB     56-63,120-127   7               N/A
GPU6    NV12    NV12    NV12    NV12    NV12    NV12     X      NV12    SYS     SYS     PXB     SYS     SYS     SYS     40-47,104-111   5               N/A
GPU7    NV12    NV12    NV12    NV12    NV12    NV12    NV12     X      SYS     SYS     PXB     SYS     SYS     SYS     40-47,104-111   5               N/A
NIC0    SYS     SYS     PXB     PXB     SYS     SYS     SYS     SYS      X      SYS     SYS     SYS     SYS     SYS
NIC1    PXB     PXB     SYS     SYS     SYS     SYS     SYS     SYS     SYS      X      SYS     SYS     SYS     SYS
NIC2    SYS     SYS     SYS     SYS     SYS     SYS     PXB     PXB     SYS     SYS      X      SYS     SYS     SYS
NIC3    SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS      X      PIX     SYS
NIC4    SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     PIX      X      SYS
NIC5    SYS     SYS     SYS     SYS     PXB     PXB     SYS     SYS     SYS     SYS     SYS     SYS     SYS      X

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0
  NIC1: mlx5_1
  NIC2: mlx5_2
  NIC3: mlx5_3
  NIC4: mlx5_4
  NIC5: mlx5_5
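
A quick way to check whether the matrix above explains the gap is to look up the link type for each GPU pair. The sketch below is illustrative only: it copies just the GPU-to-GPU columns of the four relevant rows, and the helper names are hypothetical.

```python
# Sketch only: look up the link type between GPU pairs in `nvidia-smi topo -m`
# output. Only the GPU-to-GPU columns of rows GPU0, GPU1, GPU6 and GPU7 are
# reproduced from the matrix above; helper names are hypothetical.
TOPO = """\
GPU0 X NV12 NV12 NV12 NV12 NV12 NV12 NV12
GPU1 NV12 X NV12 NV12 NV12 NV12 NV12 NV12
GPU6 NV12 NV12 NV12 NV12 NV12 NV12 X NV12
GPU7 NV12 NV12 NV12 NV12 NV12 NV12 NV12 X
"""

def parse_links(text: str) -> dict:
    """Map each GPU name to its row of link types (index i = link to GPU i)."""
    links = {}
    for line in text.splitlines():
        name, *cells = line.split()
        links[name] = cells
    return links

def link(links: dict, a: int, b: int) -> str:
    """Return the link type reported from GPU a to GPU b."""
    return links[f"GPU{a}"][b]

links = parse_links(TOPO)
pair_01 = link(links, 0, 1)
pair_67 = link(links, 6, 7)
print(f"GPU0-GPU1: {pair_01}, GPU6-GPU7: {pair_67}")
```

Both pairs are connected by the same bonded set of 12 NVLinks, so the topology matrix alone does not suggest a reason for the bandwidth difference.
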
JuiceLemonLemon commented 2 months ago

Oh, maybe there was something wrong with our device. It's OK now after a reboot.