NERSC / nccl-ofi-plugin

Repository for building the NCCL OFI plugin from AWS and NVIDIA

Request to compare NCCL Performance with NSF NCAR Derecho #13

Open dphow opened 2 months ago

dphow commented 2 months ago

Hello NERSC Team,

HPCD CISL staff at NSF NCAR would like to request a performance comparison between your use of this nccl-ofi-plugin and ours; our results are copied below. We used a test suite that expands slightly on your test runs, found in this forked GitHub repository by @benkirk. Please let us know whether this is something you can do and whether you could turn it around within a couple of weeks.

Primarily, we would like to compare the settings in https://github.com/benkirk/nccl-ofi-plugin/blob/main/env_nccl_derecho.sh to determine which are optimal.

Note that the test below used NCCL 2.22.3-1 and AWS NCCL Plugin 1.7.4. I am comfortable using the latest NCCL version, but I suspect it would be difficult to adopt an AWS plugin version beyond 1.7.4, given its specific targeting of AWS machines after that version.

Avg bus bandwidth (GB/s) per test run:

Configuration         all-gather   all-reduce   all-to-all   send-recv
Intra-node (2GPUs)       45.3614      57.1399      49.2425     56.3249
Intra-node (4GPUs)       127.491      154.882      135.873     63.7496
Inter-node (2GPUs)       14.0373      16.6409      13.1876     14.6531
Inter-node (4GPUs)       29.4738      29.9176      18.9097     17.0042
Inter-node (8GPUs)       50.7769      51.8871      21.0073     16.3461
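For anyone reproducing the averages above from a raw log, here is a minimal sketch. It assumes the `Avg bus bandwidth` figure is the mean of the out-of-place and in-place `busbw` columns over all message sizes, which matches the intra-node 2-GPU all-gather table in the log below to within rounding. The function name `avg_bus_bandwidth` and the regular expression are illustrative and not part of nccl-tests.

```python
import re

# One nccl-tests result row: size, count, type, redop, root, then
# (time, algbw, busbw, #wrong) for out-of-place and again for in-place.
ROW = re.compile(
    r"^\s*(\d+)\s+(\d+)\s+(\w+)\s+(\w+)\s+(-?\d+)"
    r"\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)\s+(\S+)"
    r"\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)\s+(\S+)\s*$"
)


def avg_bus_bandwidth(log_text):
    """Mean of all out-of-place and in-place busbw values found in log_text."""
    values = []
    for line in log_text.splitlines():
        match = ROW.match(line)
        if match:
            values.append(float(match.group(8)))   # out-of-place busbw
            values.append(float(match.group(12)))  # in-place busbw
    if not values:
        raise ValueError("no nccl-tests result rows found")
    return sum(values) / len(values)


# Rows copied from the intra-node 2-GPU all_gather_perf run in this issue.
SAMPLE = """
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
      524288         65536     float    none      -1    20.92   25.06   12.53      0    19.15   27.38   13.69      0
     2097152        262144     float    none      -1    48.89   42.89   21.45      0    44.49   47.13   23.57      0
     8388608       1048576     float    none      -1    112.3   74.72   37.36      0    97.67   85.88   42.94      0
    33554432       4194304     float    none      -1    371.4   90.33   45.17      0    316.7  105.95   52.98      0
   134217728      16777216     float    none      -1   1344.5   99.83   49.92      0   1139.1  117.83   58.91      0
   536870912      67108864     float    none      -1   4839.1  110.94   55.47      0   4160.2  129.05   64.52      0
  2147483648     268435456     float    none      -1    18651  115.14   57.57      0    16392  131.01   65.50      0
  8589934592    1073741824     float    none      -1    73438  116.97   58.48      0    65355  131.44   65.72      0
# Avg bus bandwidth    : 45.3614
"""

if __name__ == "__main__":
    print(f"recomputed average: {avg_bus_bandwidth(SAMPLE):.4f} GB/s")
```

Because the average is taken over every size from 512K to 8G, the small-message rows pull it well below the large-message plateau, so comparing per-size rows between sites is likely more informative than comparing the single average.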

Thanks!


Tue Aug 13 13:24:19 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-SXM4-40GB          On  | 00000000:03:00.0 Off |                    0 |
| N/A   25C    P0              49W / 400W |      4MiB / 40960MiB |      0%   E. Process |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM4-40GB          On  | 00000000:41:00.0 Off |                    0 |
| N/A   24C    P0              49W / 400W |      4MiB / 40960MiB |      0%   E. Process |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM4-40GB          On  | 00000000:82:00.0 Off |                    0 |
| N/A   25C    P0              51W / 400W |      4MiB / 40960MiB |      0%   E. Process |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-SXM4-40GB          On  | 00000000:C1:00.0 Off |                    0 |
| N/A   24C    P0              51W / 400W |      4MiB / 40960MiB |      0%   E. Process |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
Tue Aug 13 13:24:19 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-SXM4-40GB          On  | 00000000:03:00.0 Off |                    0 |
| N/A   25C    P0              52W / 400W |      4MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM4-40GB          On  | 00000000:41:00.0 Off |                    0 |
| N/A   25C    P0              52W / 400W |      4MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM4-40GB          On  | 00000000:82:00.0 Off |                    0 |
| N/A   26C    P0              51W / 400W |      4MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-SXM4-40GB          On  | 00000000:C1:00.0 Off |                    0 |
| N/A   25C    P0              54W / 400W |      4MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

    linux-vdso.so.1 (0x00007ffc5c9a5000)
    librt.so.1 => /lib64/librt.so.1 (0x0000150618c71000)
    libmpi_nvidia.so.12 => /opt/cray/pe/mpich/8.1.27/ofi/nvidia/20.7/lib/libmpi_nvidia.so.12 (0x0000150615efa000)
    libnccl.so.2 => /glade/u/home/dhoward/work/nccl-plugin_derecho/install/lib/libnccl.so.2 (0x0000150613015000)
    libpthread.so.0 => /lib64/libpthread.so.0 (0x0000150612df2000)
    libdl.so.2 => /lib64/libdl.so.2 (0x0000150612bee000)
    libcuda.so.1 => /usr/lib64/libcuda.so.1 (0x0000150610f91000)
    libcudart.so.12 => /glade/u/apps/common/23.08/spack/opt/spack/cuda/12.2.1/lib64/libcudart.so.12 (0x0000150610ce9000)
    libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x00001506108c4000)
    libpmi.so.0 => /opt/cray/pe/pmi/6.1.12/lib/libpmi.so.0 (0x00001506106c2000)
    libpmi2.so.0 => /opt/cray/pe/pmi/6.1.12/lib/libpmi2.so.0 (0x00001506104a0000)
    libpals.so.0 => /opt/cray/pe/pals/1.2.12/lib/libpals.so.0 (0x0000150610298000)
    libmpi_gtl_cuda.so.0 => /opt/cray/pe/mpich/8.1.27/gtl/lib/libmpi_gtl_cuda.so.0 (0x0000150610052000)
    libacchost.so => /glade/u/apps/common/23.08/spack/opt/spack/nvhpc/24.3/Linux_x86_64/24.3/compilers/lib/libacchost.so (0x000015060fded000)
    libaccdevaux.so => /glade/u/apps/common/23.08/spack/opt/spack/nvhpc/24.3/Linux_x86_64/24.3/compilers/lib/libaccdevaux.so (0x000015060fbd1000)
    libaccdevice.so => /glade/u/apps/common/23.08/spack/opt/spack/nvhpc/24.3/Linux_x86_64/24.3/compilers/lib/libaccdevice.so (0x000015060f8a5000)
    libcudadevice.so => /glade/u/apps/common/23.08/spack/opt/spack/nvhpc/24.3/Linux_x86_64/24.3/compilers/lib/libcudadevice.so (0x000015060f68e000)
    libnvf.so => /glade/u/apps/common/23.08/spack/opt/spack/nvhpc/24.3/Linux_x86_64/24.3/compilers/lib/libnvf.so (0x000015060ef6b000)
    libatomic.so.1 => /usr/lib64/libatomic.so.1 (0x000015060ed62000)
    libnvhpcatm.so => /glade/u/apps/common/23.08/spack/opt/spack/nvhpc/24.3/Linux_x86_64/24.3/compilers/lib/libnvhpcatm.so (0x000015060eb57000)
    libnvomp.so => /glade/u/apps/common/23.08/spack/opt/spack/nvhpc/24.3/Linux_x86_64/24.3/compilers/lib/libnvomp.so (0x000015060db56000)
    libnvcpumath.so => /glade/u/apps/common/23.08/spack/opt/spack/nvhpc/24.3/Linux_x86_64/24.3/compilers/lib/libnvcpumath.so (0x000015060d712000)
    libnvc.so => /glade/u/apps/common/23.08/spack/opt/spack/nvhpc/24.3/Linux_x86_64/24.3/compilers/lib/libnvc.so (0x000015060d4ae000)
    libc.so.6 => /lib64/libc.so.6 (0x000015060d0b9000)
    /lib64/ld-linux-x86-64.so.2 (0x0000150618e7a000)
    libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x000015060cea0000)
    libm.so.6 => /lib64/libm.so.6 (0x000015060cb55000)
    libfabric.so.1 => /glade/u/home/dhoward/work/nccl-plugin_derecho/install/plugin/deps/lib/libfabric.so.1 (0x000015060c856000)
    libjansson.so.4 => /usr/lib64/libjansson.so.4 (0x0000150619068000)
    libcxi.so.1 => /glade/u/home/dhoward/work/nccl-plugin_derecho/install/plugin/deps/lib/libcxi.so.1 (0x000015060c631000)
    libcurl.so.4 => /glade/u/apps/derecho/23.09/opt/view/lib/libcurl.so.4 (0x000015060c388000)
    libjson-c.so.3 => /glade/u/home/dhoward/work/nccl-plugin_derecho/install/plugin/deps/lib/libjson-c.so.3 (0x000015060c178000)
    libnl-3.so.200 => /usr/lib64/libnl-3.so.200 (0x000015060bf56000)
    libnghttp2.so.14 => /glade/u/apps/derecho/23.09/spack/opt/spack/nghttp2/1.48.0/gcc/7.5.0/k6lz/lib/libnghttp2.so.14 (0x000015060bd2a000)
    libidn2.so.0 => /glade/u/apps/derecho/23.09/spack/opt/spack/libidn2/2.3.4/gcc/7.5.0/h3px/lib/libidn2.so.0 (0x000015060bafa000)
    libssh2.so.1 => /glade/u/apps/derecho/23.09/spack/opt/spack/libssh2/1.10.0/gcc/7.5.0/qgbb/lib/libssh2.so.1 (0x000015060b8bc000)
    libmbedtls.so.14 => /glade/u/apps/derecho/23.09/spack/opt/spack/mbedtls/2.28.2/gcc/7.5.0/xeqd/lib/libmbedtls.so.14 (0x000015060b68b000)
    libmbedx509.so.1 => /glade/u/apps/derecho/23.09/spack/opt/spack/mbedtls/2.28.2/gcc/7.5.0/xeqd/lib/libmbedx509.so.1 (0x000015060b46a000)
    libmbedcrypto.so.7 => /glade/u/apps/derecho/23.09/spack/opt/spack/mbedtls/2.28.2/gcc/7.5.0/xeqd/lib/libmbedcrypto.so.7 (0x000015060b1e7000)
    libssl.so.1.1 => /glade/u/home/dhoward/work/nccl-plugin_derecho/install/plugin/deps/lib/libssl.so.1.1 (0x000015060af49000)
    libcrypto.so.1.1 => /glade/u/home/dhoward/work/nccl-plugin_derecho/install/plugin/deps/lib/libcrypto.so.1.1 (0x000015060aa0a000)
    libz.so.1 => /glade/u/apps/derecho/23.09/spack/opt/spack/zlib/1.2.13/gcc/7.5.0/g42i/lib/libz.so.1 (0x000015060a7f2000)
    libiconv.so.2 => /glade/u/apps/derecho/23.09/spack/opt/spack/libiconv/1.17/gcc/7.5.0/k4pm/lib/libiconv.so.2 (0x000015060a4e8000)
    libunistring.so.5 => /glade/u/apps/derecho/23.09/spack/opt/spack/libunistring/1.1/gcc/7.5.0/222j/lib/libunistring.so.5 (0x000015060a13c000)
    libjitterentropy.so.3 => /usr/lib64/libjitterentropy.so.3 (0x0000150609f35000)
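
The listing above looks like `ldd` output for one of the test binaries (an assumption, since the command itself is not shown). A minimal sketch for checking which libraries resolve from the custom install prefix follows; `parse_ldd` and `PREFIX` are hypothetical names, and the sample lines are copied from the listing.

```python
def parse_ldd(text):
    """Map each library name to the path it resolved to (None if not found)."""
    resolved = {}
    for line in text.splitlines():
        line = line.strip()
        if "=>" not in line:
            continue  # skips entries such as linux-vdso and the dynamic loader
        name, _, rest = line.partition("=>")
        path = rest.split("(")[0].strip()
        resolved[name.strip()] = None if path in ("", "not found") else path
    return resolved


# Install prefix taken from the paths in the listing above.
PREFIX = "/glade/u/home/dhoward/work/nccl-plugin_derecho/install"

SAMPLE = """
    linux-vdso.so.1 (0x00007ffc5c9a5000)
    libmpi_nvidia.so.12 => /opt/cray/pe/mpich/8.1.27/ofi/nvidia/20.7/lib/libmpi_nvidia.so.12 (0x0000150615efa000)
    libnccl.so.2 => /glade/u/home/dhoward/work/nccl-plugin_derecho/install/lib/libnccl.so.2 (0x0000150613015000)
    libfabric.so.1 => /glade/u/home/dhoward/work/nccl-plugin_derecho/install/plugin/deps/lib/libfabric.so.1 (0x000015060c856000)
    libcxi.so.1 => /glade/u/home/dhoward/work/nccl-plugin_derecho/install/plugin/deps/lib/libcxi.so.1 (0x000015060c631000)
"""

if __name__ == "__main__":
    for name, path in parse_ldd(SAMPLE).items():
        origin = "custom build" if path and path.startswith(PREFIX) else "system"
        print(f"{name:24s} {origin:12s} {path}")
```

In this listing, `libnccl.so.2`, `libfabric.so.1`, and `libcxi.so.1` come from the custom build, while the MPI library comes from the Cray PE installation.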

========== RUNNING NCCL TESTS ==========
##### RUNNING all_gather_perf PERFORMANCE TEST #####
# --> Intra-node (2GPUs)
# --> mpiexec -n 2 -ppn 2 --cpu-bind numa nccl-tests/build/all_gather_perf -b 512K -e 8G -f 4 -g 1
# --> BEGIN execution (nccl-tests/build/all_gather_perf)
# nThread 1 nGpus 1 minBytes 524288 maxBytes 8589934592 step: 4(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid 118922 on    deg0021 device  0 [0x03] NVIDIA A100-SXM4-40GB
#  Rank  1 Group  0 Pid 118923 on    deg0021 device  1 [0x41] NVIDIA A100-SXM4-40GB
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
      524288         65536     float    none      -1    20.92   25.06   12.53      0    19.15   27.38   13.69      0
     2097152        262144     float    none      -1    48.89   42.89   21.45      0    44.49   47.13   23.57      0
     8388608       1048576     float    none      -1    112.3   74.72   37.36      0    97.67   85.88   42.94      0
    33554432       4194304     float    none      -1    371.4   90.33   45.17      0    316.7  105.95   52.98      0
   134217728      16777216     float    none      -1   1344.5   99.83   49.92      0   1139.1  117.83   58.91      0
   536870912      67108864     float    none      -1   4839.1  110.94   55.47      0   4160.2  129.05   64.52      0
  2147483648     268435456     float    none      -1    18651  115.14   57.57      0    16392  131.01   65.50      0
  8589934592    1073741824     float    none      -1    73438  116.97   58.48      0    65355  131.44   65.72      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 45.3614 
#

# --> END execution (nccl-tests/build/all_gather_perf)
# --> Inter-node (2GPUs)
# --> mpiexec -n 2 -ppn 1 --cpu-bind numa nccl-tests/build/all_gather_perf -b 512K -e 8G -f 4 -g 1
# --> BEGIN execution (nccl-tests/build/all_gather_perf)
# nThread 1 nGpus 1 minBytes 524288 maxBytes 8589934592 step: 4(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid 118963 on    deg0021 device  0 [0x03] NVIDIA A100-SXM4-40GB
#  Rank  1 Group  0 Pid  33304 on    deg0023 device  0 [0x03] NVIDIA A100-SXM4-40GB
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
      524288         65536     float    none      -1    97.30    5.39    2.69      0    83.90    6.25    3.12      0
     2097152        262144     float    none      -1    117.1   17.91    8.96      0    112.5   18.64    9.32      0
     8388608       1048576     float    none      -1    330.3   25.40   12.70      0    294.4   28.49   14.24      0
    33554432       4194304     float    none      -1   1114.7   30.10   15.05      0    979.9   34.24   17.12      0
   134217728      16777216     float    none      -1   4347.1   30.88   15.44      0   3499.2   38.36   19.18      0
   536870912      67108864     float    none      -1    17119   31.36   15.68      0    13502   39.76   19.88      0
  2147483648     268435456     float    none      -1    68261   31.46   15.73      0    54079   39.71   19.86      0
  8589934592    1073741824     float    none      -1   271952   31.59   15.79      0   216612   39.66   19.83      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 14.0373 
#

# --> END execution (nccl-tests/build/all_gather_perf)
# --> Intra-node (4GPUs)
# --> mpiexec -n 4 -ppn 4 --cpu-bind numa nccl-tests/build/all_gather_perf -b 512K -e 8G -f 4 -g 1
# --> BEGIN execution (nccl-tests/build/all_gather_perf)
# nThread 1 nGpus 1 minBytes 524288 maxBytes 8589934592 step: 4(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid 119030 on    deg0021 device  0 [0x03] NVIDIA A100-SXM4-40GB
#  Rank  1 Group  0 Pid 119031 on    deg0021 device  1 [0x41] NVIDIA A100-SXM4-40GB
#  Rank  2 Group  0 Pid 119032 on    deg0021 device  2 [0x82] NVIDIA A100-SXM4-40GB
#  Rank  3 Group  0 Pid 119033 on    deg0021 device  3 [0xc1] NVIDIA A100-SXM4-40GB
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
      524288         32768     float    none      -1    22.02   23.81   17.85      0    21.01   24.96   18.72      0
     2097152        131072     float    none      -1    60.22   34.83   26.12      0    58.17   36.05   27.04      0
     8388608        524288     float    none      -1    90.60   92.59   69.44      0    85.16   98.50   73.88      0
    33554432       2097152     float    none      -1    203.4  164.98  123.73      0    184.9  181.48  136.11      0
   134217728       8388608     float    none      -1    594.5  225.76  169.32      0    562.2  238.73  179.04      0
   536870912      33554432     float    none      -1   2181.6  246.09  184.57      0   2056.6  261.05  195.78      0
  2147483648     134217728     float    none      -1   8233.5  260.82  195.62      0   7734.4  277.65  208.24      0
  8589934592     536870912     float    none      -1    32217  266.63  199.97      0    30046  285.89  214.42      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 127.491 
#

# --> END execution (nccl-tests/build/all_gather_perf)
# --> Inter-node (4GPUs)
# --> mpiexec -n 4 -ppn 2 --cpu-bind numa nccl-tests/build/all_gather_perf -b 512K -e 8G -f 4 -g 1
# --> BEGIN execution (nccl-tests/build/all_gather_perf)
# nThread 1 nGpus 1 minBytes 524288 maxBytes 8589934592 step: 4(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid 119083 on    deg0021 device  0 [0x03] NVIDIA A100-SXM4-40GB
#  Rank  1 Group  0 Pid 119084 on    deg0021 device  1 [0x41] NVIDIA A100-SXM4-40GB
#  Rank  2 Group  0 Pid  33361 on    deg0023 device  0 [0x03] NVIDIA A100-SXM4-40GB
#  Rank  3 Group  0 Pid  33362 on    deg0023 device  1 [0x41] NVIDIA A100-SXM4-40GB
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
      524288         32768     float    none      -1    73.69    7.12    5.34      0    73.92    7.09    5.32      0
     2097152        131072     float    none      -1    100.7   20.83   15.62      0    99.37   21.10   15.83      0
     8388608        524288     float    none      -1    235.1   35.69   26.76      0    227.7   36.84   27.63      0
    33554432       2097152     float    none      -1    766.8   43.76   32.82      0    730.6   45.93   34.44      0
   134217728       8388608     float    none      -1   2834.6   47.35   35.51      0   2687.0   49.95   37.46      0
   536870912      33554432     float    none      -1    10535   50.96   38.22      0    10475   51.25   38.44      0
  2147483648     134217728     float    none      -1    41051   52.31   39.23      0    40678   52.79   39.59      0
  8589934592     536870912     float    none      -1   162909   52.73   39.55      0   161850   53.07   39.81      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 29.4738 
#

# --> END execution (nccl-tests/build/all_gather_perf)
# --> Inter-node (8 GPUs)
# --> mpiexec -n 8 -ppn 4 --cpu-bind numa nccl-tests/build/all_gather_perf -b 512K -e 8G -f 4 -g 1
# --> BEGIN execution (nccl-tests/build/all_gather_perf)
# nThread 1 nGpus 1 minBytes 524288 maxBytes 8589934592 step: 4(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid 119126 on    deg0021 device  0 [0x03] NVIDIA A100-SXM4-40GB
#  Rank  1 Group  0 Pid 119127 on    deg0021 device  1 [0x41] NVIDIA A100-SXM4-40GB
#  Rank  2 Group  0 Pid 119128 on    deg0021 device  2 [0x82] NVIDIA A100-SXM4-40GB
#  Rank  3 Group  0 Pid 119129 on    deg0021 device  3 [0xc1] NVIDIA A100-SXM4-40GB
#  Rank  4 Group  0 Pid  33394 on    deg0023 device  0 [0x03] NVIDIA A100-SXM4-40GB
#  Rank  5 Group  0 Pid  33395 on    deg0023 device  1 [0x41] NVIDIA A100-SXM4-40GB
#  Rank  6 Group  0 Pid  33396 on    deg0023 device  2 [0x82] NVIDIA A100-SXM4-40GB
#  Rank  7 Group  0 Pid  33397 on    deg0023 device  3 [0xc1] NVIDIA A100-SXM4-40GB
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
      524288         16384     float    none      -1    77.06    6.80    5.95      0    74.85    7.00    6.13      0
     2097152         65536     float    none      -1    168.2   12.47   10.91      0    168.3   12.46   10.90      0
     8388608        262144     float    none      -1    249.1   33.67   29.47      0    244.4   34.33   30.04      0
    33554432       1048576     float    none      -1    470.4   71.33   62.41      0    457.6   73.32   64.16      0
   134217728       4194304     float    none      -1   1732.3   77.48   67.80      0   1729.3   77.61   67.91      0
   536870912      16777216     float    none      -1   6373.5   84.24   73.71      0   6512.3   82.44   72.13      0
  2147483648      67108864     float    none      -1    24386   88.06   77.05      0    24260   88.52   77.46      0
  8589934592     268435456     float    none      -1    96376   89.13   77.99      0    95850   89.62   78.42      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 50.7769 
#

# --> END execution (nccl-tests/build/all_gather_perf)

##### RUNNING all_reduce_perf PERFORMANCE TEST #####
# --> Intra-node (2GPUs)
# --> mpiexec -n 2 -ppn 2 --cpu-bind numa nccl-tests/build/all_reduce_perf -b 512K -e 8G -f 4 -g 1
# --> BEGIN execution (nccl-tests/build/all_reduce_perf)
# nThread 1 nGpus 1 minBytes 524288 maxBytes 8589934592 step: 4(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid 119186 on    deg0021 device  0 [0x03] NVIDIA A100-SXM4-40GB
#  Rank  1 Group  0 Pid 119187 on    deg0021 device  1 [0x41] NVIDIA A100-SXM4-40GB
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
      524288        131072     float     sum      -1    26.37   19.89   19.89      0    24.82   21.13   21.13      0
     2097152        524288     float     sum      -1    61.80   33.94   33.94      0    61.62   34.04   34.04      0
     8388608       2097152     float     sum      -1    156.8   53.51   53.51      0    154.4   54.33   54.33      0
    33554432       8388608     float     sum      -1    536.4   62.55   62.55      0    534.5   62.78   62.78      0
   134217728      33554432     float     sum      -1   2010.2   66.77   66.77      0   2004.5   66.96   66.96      0
   536870912     134217728     float     sum      -1   7543.8   71.17   71.17      0   7507.6   71.51   71.51      0
  2147483648     536870912     float     sum      -1    29279   73.35   73.35      0    29127   73.73   73.73      0
  8589934592    2147483648     float     sum      -1   115969   74.07   74.07      0   115259   74.53   74.53      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 57.1399 
#

# --> END execution (nccl-tests/build/all_reduce_perf)
# --> Inter-node (2GPUs)
# --> mpiexec -n 2 -ppn 1 --cpu-bind numa nccl-tests/build/all_reduce_perf -b 512K -e 8G -f 4 -g 1
# --> BEGIN execution (nccl-tests/build/all_reduce_perf)
# nThread 1 nGpus 1 minBytes 524288 maxBytes 8589934592 step: 4(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid 119232 on    deg0021 device  0 [0x03] NVIDIA A100-SXM4-40GB
#  Rank  1 Group  0 Pid  33450 on    deg0023 device  0 [0x03] NVIDIA A100-SXM4-40GB
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
      524288        131072     float     sum      -1    151.9    3.45    3.45      0    142.7    3.68    3.68      0
     2097152        524288     float     sum      -1    160.6   13.05   13.05      0    170.0   12.34   12.34      0
     8388608       2097152     float     sum      -1    491.1   17.08   17.08      0    490.5   17.10   17.10      0
    33554432       8388608     float     sum      -1   1765.0   19.01   19.01      0   1754.8   19.12   19.12      0
   134217728      33554432     float     sum      -1   6851.3   19.59   19.59      0   6793.8   19.76   19.76      0
   536870912     134217728     float     sum      -1    26284   20.43   20.43      0    26916   19.95   19.95      0
  2147483648     536870912     float     sum      -1   104908   20.47   20.47      0   105654   20.33   20.33      0
  8589934592    2147483648     float     sum      -1   420075   20.45   20.45      0   419970   20.45   20.45      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 16.6409 
#

# --> END execution (nccl-tests/build/all_reduce_perf)
# --> Intra-node (4GPUs)
# --> mpiexec -n 4 -ppn 4 --cpu-bind numa nccl-tests/build/all_reduce_perf -b 512K -e 8G -f 4 -g 1
# --> BEGIN execution (nccl-tests/build/all_reduce_perf)
# nThread 1 nGpus 1 minBytes 524288 maxBytes 8589934592 step: 4(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid 119269 on    deg0021 device  0 [0x03] NVIDIA A100-SXM4-40GB
#  Rank  1 Group  0 Pid 119270 on    deg0021 device  1 [0x41] NVIDIA A100-SXM4-40GB
#  Rank  2 Group  0 Pid 119271 on    deg0021 device  2 [0x82] NVIDIA A100-SXM4-40GB
#  Rank  3 Group  0 Pid 119272 on    deg0021 device  3 [0xc1] NVIDIA A100-SXM4-40GB
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
      524288        131072     float     sum      -1    22.35   23.46   35.19      0    20.69   25.34   38.01      0
     2097152        524288     float     sum      -1    50.06   41.89   62.84      0    49.76   42.14   63.21      0
     8388608       2097152     float     sum      -1    112.4   74.60  111.90      0    110.0   76.28  114.41      0
    33554432       8388608     float     sum      -1    298.3  112.50  168.75      0    296.5  113.16  169.74      0
   134217728      33554432     float     sum      -1   1024.9  130.96  196.44      0   1022.7  131.24  196.86      0
   536870912     134217728     float     sum      -1   3799.5  141.30  211.95      0   3800.6  141.26  211.89      0
  2147483648     536870912     float     sum      -1    14542  147.67  221.51      0    14527  147.83  221.75      0
  8589934592    2147483648     float     sum      -1    56759  151.34  227.01      0    56851  151.10  226.64      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 154.882 
#

# --> END execution (nccl-tests/build/all_reduce_perf)
# --> Inter-node (4GPUs)
# --> mpiexec -n 4 -ppn 2 --cpu-bind numa nccl-tests/build/all_reduce_perf -b 512K -e 8G -f 4 -g 1
# --> BEGIN execution (nccl-tests/build/all_reduce_perf)
# nThread 1 nGpus 1 minBytes 524288 maxBytes 8589934592 step: 4(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid 119324 on    deg0021 device  0 [0x03] NVIDIA A100-SXM4-40GB
#  Rank  1 Group  0 Pid 119325 on    deg0021 device  1 [0x41] NVIDIA A100-SXM4-40GB
#  Rank  2 Group  0 Pid  33477 on    deg0023 device  0 [0x03] NVIDIA A100-SXM4-40GB
#  Rank  3 Group  0 Pid  33478 on    deg0023 device  1 [0x41] NVIDIA A100-SXM4-40GB
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
      524288        131072     float     sum      -1    127.3    4.12    6.18      0    127.0    4.13    6.19      0
     2097152        524288     float     sum      -1    581.7    3.61    5.41      0    585.1    3.58    5.38      0
     8388608       2097152     float     sum      -1    375.2   22.36   33.54      0    380.1   22.07   33.11      0
    33554432       8388608     float     sum      -1   1368.0   24.53   36.79      0   1407.4   23.84   35.76      0
   134217728      33554432     float     sum      -1   5208.5   25.77   38.65      0   5153.3   26.05   39.07      0
   536870912     134217728     float     sum      -1    20337   26.40   39.60      0    20499   26.19   39.29      0
  2147483648     536870912     float     sum      -1    81013   26.51   39.76      0    80847   26.56   39.84      0
  8589934592    2147483648     float     sum      -1   321696   26.70   40.05      0   321560   26.71   40.07      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 29.9176 
#

# --> END execution (nccl-tests/build/all_reduce_perf)
# --> Inter-node (8 GPUs)
# --> mpiexec -n 8 -ppn 4 --cpu-bind numa nccl-tests/build/all_reduce_perf -b 512K -e 8G -f 4 -g 1
# --> BEGIN execution (nccl-tests/build/all_reduce_perf)
# nThread 1 nGpus 1 minBytes 524288 maxBytes 8589934592 step: 4(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid 119685 on    deg0021 device  0 [0x03] NVIDIA A100-SXM4-40GB
#  Rank  1 Group  0 Pid 119686 on    deg0021 device  1 [0x41] NVIDIA A100-SXM4-40GB
#  Rank  2 Group  0 Pid 119687 on    deg0021 device  2 [0x82] NVIDIA A100-SXM4-40GB
#  Rank  3 Group  0 Pid 119688 on    deg0021 device  3 [0xc1] NVIDIA A100-SXM4-40GB
#  Rank  4 Group  0 Pid  33833 on    deg0023 device  0 [0x03] NVIDIA A100-SXM4-40GB
#  Rank  5 Group  0 Pid  33834 on    deg0023 device  1 [0x41] NVIDIA A100-SXM4-40GB
#  Rank  6 Group  0 Pid  33835 on    deg0023 device  2 [0x82] NVIDIA A100-SXM4-40GB
#  Rank  7 Group  0 Pid  33836 on    deg0023 device  3 [0xc1] NVIDIA A100-SXM4-40GB
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
      524288        131072     float     sum      -1    325.2    1.61    2.82      0    324.3    1.62    2.83      0
     2097152        524288     float     sum      -1    553.9    3.79    6.63      0    563.1    3.72    6.52      0
     8388608       2097152     float     sum      -1    515.4   16.28   28.48      0    513.7   16.33   28.58      0
    33554432       8388608     float     sum      -1    856.0   39.20   68.60      0    841.5   39.88   69.78      0
   134217728      33554432     float     sum      -1   3221.2   41.67   72.92      0   3224.2   41.63   72.85      0
   536870912     134217728     float     sum      -1    12202   44.00   77.00      0    12184   44.06   77.11      0
  2147483648     536870912     float     sum      -1    47830   44.90   78.57      0    47803   44.92   78.62      0
  8589934592    2147483648     float     sum      -1   189019   45.44   79.53      0   189400   45.35   79.37      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 51.8871 
#

# --> END execution (nccl-tests/build/all_reduce_perf)

##### RUNNING alltoall_perf PERFORMANCE TEST #####
# --> Intra-node (2GPUs)
# --> mpiexec -n 2 -ppn 2 --cpu-bind numa nccl-tests/build/alltoall_perf -b 512K -e 8G -f 4 -g 1
# --> BEGIN execution (nccl-tests/build/alltoall_perf)
# nThread 1 nGpus 1 minBytes 524288 maxBytes 8589934592 step: 4(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid 119746 on    deg0021 device  0 [0x03] NVIDIA A100-SXM4-40GB
#  Rank  1 Group  0 Pid 119747 on    deg0021 device  1 [0x41] NVIDIA A100-SXM4-40GB
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
      524288         65536     float    none      -1    24.60   21.31   10.66      0    23.81   22.02   11.01    N/A
     2097152        262144     float    none      -1    43.11   48.65   24.33      0    42.72   49.09   24.54    N/A
     8388608       1048576     float    none      -1    102.2   82.07   41.03      0    100.2   83.68   41.84    N/A
    33554432       4194304     float    none      -1    330.6  101.51   50.76      0    327.5  102.47   51.24    N/A
   134217728      16777216     float    none      -1   1210.5  110.88   55.44      0   1207.3  111.17   55.59    N/A
   536870912      67108864     float    none      -1   4180.6  128.42   64.21      0   3591.8  149.47   74.74    N/A
  2147483648     268435456     float    none      -1    16453  130.52   65.26      0    14141  151.86   75.93    N/A
  8589934592    1073741824     float    none      -1    65400  131.34   65.67      0    56778  151.29   75.64    N/A
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 49.2425 
#

# --> END execution (nccl-tests/build/alltoall_perf)
# --> Inter-node (2GPUs)
# --> mpiexec -n 2 -ppn 1 --cpu-bind numa nccl-tests/build/alltoall_perf -b 512K -e 8G -f 4 -g 1
# --> BEGIN execution (nccl-tests/build/alltoall_perf)
# nThread 1 nGpus 1 minBytes 524288 maxBytes 8589934592 step: 4(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid 119787 on    deg0021 device  0 [0x03] NVIDIA A100-SXM4-40GB
#  Rank  1 Group  0 Pid  33884 on    deg0023 device  0 [0x03] NVIDIA A100-SXM4-40GB
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
      524288         65536     float    none      -1    57.90    9.05    4.53      0    54.37    9.64    4.82    N/A
     2097152        262144     float    none      -1    109.1   19.23    9.62      0    94.37   22.22   11.11    N/A
     8388608       1048576     float    none      -1    309.5   27.11   13.55      0    271.0   30.96   15.48    N/A
    33554432       4194304     float    none      -1   1110.1   30.23   15.11      0    971.4   34.54   17.27    N/A
   134217728      16777216     float    none      -1   4853.4   27.65   13.83      0   4183.7   32.08   16.04    N/A
   536870912      67108864     float    none      -1    19656   27.31   13.66      0    16747   32.06   16.03    N/A
  2147483648     268435456     float    none      -1    77648   27.66   13.83      0    66890   32.10   16.05    N/A
  8589934592    1073741824     float    none      -1   307340   27.95   13.97      0   266708   32.21   16.10    N/A
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 13.1876 
#

# --> END execution (nccl-tests/build/alltoall_perf)
# --> Intra-node (4GPUs)
# --> mpiexec -n 4 -ppn 4 --cpu-bind numa nccl-tests/build/alltoall_perf -b 512K -e 8G -f 4 -g 1
# --> BEGIN execution (nccl-tests/build/alltoall_perf)
# nThread 1 nGpus 1 minBytes 524288 maxBytes 8589934592 step: 4(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid 119822 on    deg0021 device  0 [0x03] NVIDIA A100-SXM4-40GB
#  Rank  1 Group  0 Pid 119823 on    deg0021 device  1 [0x41] NVIDIA A100-SXM4-40GB
#  Rank  2 Group  0 Pid 119824 on    deg0021 device  2 [0x82] NVIDIA A100-SXM4-40GB
#  Rank  3 Group  0 Pid 119825 on    deg0021 device  3 [0xc1] NVIDIA A100-SXM4-40GB
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
      524288         32768     float    none      -1    20.75   25.27   18.95      0    21.31   24.61   18.45    N/A
     2097152        131072     float    none      -1    37.37   56.11   42.08      0    36.59   57.31   42.98    N/A
     8388608        524288     float    none      -1    67.65  123.99   93.00      0    59.57  140.82  105.62    N/A
    33554432       2097152     float    none      -1    178.6  187.84  140.88      0    169.6  197.81  148.35    N/A
   134217728       8388608     float    none      -1    587.8  228.33  171.25      0    568.8  235.96  176.97    N/A
   536870912      33554432     float    none      -1   2059.2  260.72  195.54      0   1970.7  272.42  204.32    N/A
  2147483648     134217728     float    none      -1   8149.9  263.50  197.62      0   7668.4  280.04  210.03    N/A
  8589934592     536870912     float    none      -1    32633  263.23  197.42      0    30605  280.67  210.51    N/A
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 135.873 
#

# --> END execution (nccl-tests/build/alltoall_perf)
# --> Inter-node (4GPUs)
# --> mpiexec -n 4 -ppn 2 --cpu-bind numa nccl-tests/build/alltoall_perf -b 512K -e 8G -f 4 -g 1
# --> BEGIN execution (nccl-tests/build/alltoall_perf)
# nThread 1 nGpus 1 minBytes 524288 maxBytes 8589934592 step: 4(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid 119877 on    deg0021 device  0 [0x03] NVIDIA A100-SXM4-40GB
#  Rank  1 Group  0 Pid 119878 on    deg0021 device  1 [0x41] NVIDIA A100-SXM4-40GB
#  Rank  2 Group  0 Pid  33916 on    deg0023 device  0 [0x03] NVIDIA A100-SXM4-40GB
#  Rank  3 Group  0 Pid  33917 on    deg0023 device  1 [0x41] NVIDIA A100-SXM4-40GB
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
      524288         32768     float    none      -1    61.89    8.47    6.35      0    62.33    8.41    6.31    N/A
     2097152        131072     float    none      -1    101.9   20.58   15.44      0    103.2   20.32   15.24    N/A
     8388608        524288     float    none      -1    303.3   27.66   20.75      0    360.4   23.28   17.46    N/A
    33554432       2097152     float    none      -1   1159.5   28.94   21.70      0   1071.4   31.32   23.49    N/A
   134217728       8388608     float    none      -1   4638.3   28.94   21.70      0   4416.5   30.39   22.79    N/A
   536870912      33554432     float    none      -1    18245   29.43   22.07      0    18272   29.38   22.04    N/A
  2147483648     134217728     float    none      -1    73949   29.04   21.78      0    73748   29.12   21.84    N/A
  8589934592     536870912     float    none      -1   294923   29.13   21.84      0   296173   29.00   21.75    N/A
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 18.9097 
#

# --> END execution (nccl-tests/build/alltoall_perf)
# --> Inter-node (8 GPUs)
# --> mpiexec -n 8 -ppn 4 --cpu-bind numa nccl-tests/build/alltoall_perf -b 512K -e 8G -f 4 -g 1
# --> BEGIN execution (nccl-tests/build/alltoall_perf)
# nThread 1 nGpus 1 minBytes 524288 maxBytes 8589934592 step: 4(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid 120212 on    deg0021 device  0 [0x03] NVIDIA A100-SXM4-40GB
#  Rank  1 Group  0 Pid 120213 on    deg0021 device  1 [0x41] NVIDIA A100-SXM4-40GB
#  Rank  2 Group  0 Pid 120214 on    deg0021 device  2 [0x82] NVIDIA A100-SXM4-40GB
#  Rank  3 Group  0 Pid 120215 on    deg0021 device  3 [0xc1] NVIDIA A100-SXM4-40GB
#  Rank  4 Group  0 Pid  34237 on    deg0023 device  0 [0x03] NVIDIA A100-SXM4-40GB
#  Rank  5 Group  0 Pid  34238 on    deg0023 device  1 [0x41] NVIDIA A100-SXM4-40GB
#  Rank  6 Group  0 Pid  34239 on    deg0023 device  2 [0x82] NVIDIA A100-SXM4-40GB
#  Rank  7 Group  0 Pid  34240 on    deg0023 device  3 [0xc1] NVIDIA A100-SXM4-40GB
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
      524288         16384     float    none      -1    67.18    7.80    6.83      0    67.10    7.81    6.84    N/A
     2097152         65536     float    none      -1    112.4   18.65   16.32      0    119.9   17.48   15.30    N/A
     8388608        262144     float    none      -1    313.7   26.74   23.40      0    315.0   26.63   23.30    N/A
    33554432       1048576     float    none      -1   1170.0   28.68   25.09      0   1146.3   29.27   25.61    N/A
   134217728       4194304     float    none      -1   4787.7   28.03   24.53      0   4766.6   28.16   24.64    N/A
   536870912      16777216     float    none      -1    19516   27.51   24.07      0    19309   27.80   24.33    N/A
  2147483648      67108864     float    none      -1    78499   27.36   23.94      0    78549   27.34   23.92    N/A
  8589934592     268435456     float    none      -1   313563   27.39   23.97      0   312835   27.46   24.03    N/A
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 21.0073 
#

# --> END execution (nccl-tests/build/alltoall_perf)

##### RUNNING sendrecv_perf PERFORMANCE TEST #####
# --> Intra-node (2GPUs)
# --> mpiexec -n 2 -ppn 2 --cpu-bind numa nccl-tests/build/sendrecv_perf -b 512K -e 8G -f 4 -g 1
# --> BEGIN execution (nccl-tests/build/sendrecv_perf)
# nThread 1 nGpus 1 minBytes 524288 maxBytes 8589934592 step: 4(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid 120274 on    deg0021 device  0 [0x03] NVIDIA A100-SXM4-40GB
#  Rank  1 Group  0 Pid 120275 on    deg0021 device  1 [0x41] NVIDIA A100-SXM4-40GB
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
      524288        131072     float     sum      -1    31.15   16.83   16.83      0    27.97   18.74   18.74    N/A
     2097152        524288     float     sum      -1    64.45   32.54   32.54      0    66.30   31.63   31.63    N/A
     8388608       2097152     float     sum      -1    178.9   46.89   46.89      0    178.3   47.04   47.04    N/A
    33554432       8388608     float     sum      -1    634.3   52.90   52.90      0    616.8   54.40   54.40    N/A
   134217728      33554432     float     sum      -1   1819.0   73.78   73.78      0   1832.1   73.26   73.26    N/A
   536870912     134217728     float     sum      -1   7112.0   75.49   75.49      0   7133.1   75.26   75.26    N/A
  2147483648     536870912     float     sum      -1    28318   75.83   75.83      0    28428   75.54   75.54    N/A
  8589934592    2147483648     float     sum      -1   113443   75.72   75.72      0   114010   75.34   75.34    N/A
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 56.3249 
#

# --> END execution (nccl-tests/build/sendrecv_perf)
# --> Inter-node (2GPUs)
# --> mpiexec -n 2 -ppn 1 --cpu-bind numa nccl-tests/build/sendrecv_perf -b 512K -e 8G -f 4 -g 1
# --> BEGIN execution (nccl-tests/build/sendrecv_perf)
# nThread 1 nGpus 1 minBytes 524288 maxBytes 8589934592 step: 4(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid 120315 on    deg0021 device  0 [0x03] NVIDIA A100-SXM4-40GB
#  Rank  1 Group  0 Pid  34288 on    deg0023 device  0 [0x03] NVIDIA A100-SXM4-40GB
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
      524288        131072     float     sum      -1    70.00    7.49    7.49      0    70.79    7.41    7.41    N/A
     2097152        524288     float     sum      -1    154.6   13.57   13.57      0    155.2   13.51   13.51    N/A
     8388608       2097152     float     sum      -1    481.3   17.43   17.43      0    512.1   16.38   16.38    N/A
    33554432       8388608     float     sum      -1   2070.6   16.21   16.21      0   2051.0   16.36   16.36    N/A
   134217728      33554432     float     sum      -1   8672.6   15.48   15.48      0   8568.4   15.66   15.66    N/A
   536870912     134217728     float     sum      -1    33960   15.81   15.81      0    34275   15.66   15.66    N/A
  2147483648     536870912     float     sum      -1   135842   15.81   15.81      0   135626   15.83   15.83    N/A
  8589934592    2147483648     float     sum      -1   538083   15.96   15.96      0   540866   15.88   15.88    N/A
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 14.6531 
#

# --> END execution (nccl-tests/build/sendrecv_perf)
# --> Intra-node (4GPUs)
# --> mpiexec -n 4 -ppn 4 --cpu-bind numa nccl-tests/build/sendrecv_perf -b 512K -e 8G -f 4 -g 1
# --> BEGIN execution (nccl-tests/build/sendrecv_perf)
# nThread 1 nGpus 1 minBytes 524288 maxBytes 8589934592 step: 4(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid 120384 on    deg0021 device  0 [0x03] NVIDIA A100-SXM4-40GB
#  Rank  1 Group  0 Pid 120385 on    deg0021 device  1 [0x41] NVIDIA A100-SXM4-40GB
#  Rank  2 Group  0 Pid 120386 on    deg0021 device  2 [0x82] NVIDIA A100-SXM4-40GB
#  Rank  3 Group  0 Pid 120387 on    deg0021 device  3 [0xc1] NVIDIA A100-SXM4-40GB
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
      524288        131072     float     sum      -1    29.20   17.96   17.96      0    27.94   18.77   18.77    N/A
     2097152        524288     float     sum      -1    56.62   37.04   37.04      0    59.50   35.25   35.25    N/A
     8388608       2097152     float     sum      -1    145.1   57.80   57.80      0    146.1   57.41   57.41    N/A
    33554432       8388608     float     sum      -1    477.5   70.26   70.26      0    477.2   70.32   70.32    N/A
   134217728      33554432     float     sum      -1   1680.9   79.85   79.85      0   1684.0   79.70   79.70    N/A
   536870912     134217728     float     sum      -1   6495.4   82.65   82.65      0   6541.8   82.07   82.07    N/A
  2147483648     536870912     float     sum      -1    25837   83.12   83.12      0    25964   82.71   82.71    N/A
  8589934592    2147483648     float     sum      -1   104018   82.58   82.58      0   104120   82.50   82.50    N/A
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 63.7496 
#

# --> END execution (nccl-tests/build/sendrecv_perf)
# --> Inter-node (4GPUs)
# --> mpiexec -n 4 -ppn 2 --cpu-bind numa nccl-tests/build/sendrecv_perf -b 512K -e 8G -f 4 -g 1
# --> BEGIN execution (nccl-tests/build/sendrecv_perf)
# nThread 1 nGpus 1 minBytes 524288 maxBytes 8589934592 step: 4(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid 120439 on    deg0021 device  0 [0x03] NVIDIA A100-SXM4-40GB
#  Rank  1 Group  0 Pid 120440 on    deg0021 device  1 [0x41] NVIDIA A100-SXM4-40GB
#  Rank  2 Group  0 Pid  34351 on    deg0023 device  0 [0x03] NVIDIA A100-SXM4-40GB
#  Rank  3 Group  0 Pid  34352 on    deg0023 device  1 [0x41] NVIDIA A100-SXM4-40GB
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
      524288        131072     float     sum      -1    75.79    6.92    6.92      0    76.40    6.86    6.86    N/A
     2097152        524288     float     sum      -1    108.3   19.36   19.36      0    111.3   18.84   18.84    N/A
     8388608       2097152     float     sum      -1    458.1   18.31   18.31      0    467.8   17.93   17.93    N/A
    33554432       8388608     float     sum      -1   1831.2   18.32   18.32      0   1826.9   18.37   18.37    N/A
   134217728      33554432     float     sum      -1   7301.9   18.38   18.38      0   7295.2   18.40   18.40    N/A
   536870912     134217728     float     sum      -1    29134   18.43   18.43      0    29195   18.39   18.39    N/A
  2147483648     536870912     float     sum      -1   116847   18.38   18.38      0   116900   18.37   18.37    N/A
  8589934592    2147483648     float     sum      -1   465989   18.43   18.43      0   467635   18.37   18.37    N/A
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 17.0042 
#

# --> END execution (nccl-tests/build/sendrecv_perf)
# --> Inter-node (8 GPUs)
# --> mpiexec -n 8 -ppn 4 --cpu-bind numa nccl-tests/build/sendrecv_perf -b 512K -e 8G -f 4 -g 1
# --> BEGIN execution (nccl-tests/build/sendrecv_perf)
# nThread 1 nGpus 1 minBytes 524288 maxBytes 8589934592 step: 4(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid 120489 on    deg0021 device  0 [0x03] NVIDIA A100-SXM4-40GB
#  Rank  1 Group  0 Pid 120490 on    deg0021 device  1 [0x41] NVIDIA A100-SXM4-40GB
#  Rank  2 Group  0 Pid 120491 on    deg0021 device  2 [0x82] NVIDIA A100-SXM4-40GB
#  Rank  3 Group  0 Pid 120492 on    deg0021 device  3 [0xc1] NVIDIA A100-SXM4-40GB
#  Rank  4 Group  0 Pid  34391 on    deg0023 device  0 [0x03] NVIDIA A100-SXM4-40GB
#  Rank  5 Group  0 Pid  34392 on    deg0023 device  1 [0x41] NVIDIA A100-SXM4-40GB
#  Rank  6 Group  0 Pid  34393 on    deg0023 device  2 [0x82] NVIDIA A100-SXM4-40GB
#  Rank  7 Group  0 Pid  34394 on    deg0023 device  3 [0xc1] NVIDIA A100-SXM4-40GB
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
      524288        131072     float     sum      -1    71.55    7.33    7.33      0    71.55    7.33    7.33    N/A
     2097152        524288     float     sum      -1    107.9   19.43   19.43      0    107.7   19.47   19.47    N/A
     8388608       2097152     float     sum      -1    499.6   16.79   16.79      0    487.9   17.19   17.19    N/A
    33554432       8388608     float     sum      -1   1923.8   17.44   17.44      0   1939.6   17.30   17.30    N/A
   134217728      33554432     float     sum      -1   7715.4   17.40   17.40      0   7729.7   17.36   17.36    N/A
   536870912     134217728     float     sum      -1    30753   17.46   17.46      0    30875   17.39   17.39    N/A
  2147483648     536870912     float     sum      -1   123493   17.39   17.39      0   123326   17.41   17.41    N/A
  8589934592    2147483648     float     sum      -1   491763   17.47   17.47      0   494281   17.38   17.38    N/A
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 16.3461 
#

# --> END execution (nccl-tests/build/sendrecv_perf)
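
A note on reading the summary lines: the "Avg bus bandwidth" value that closes each run appears to be the plain mean of every busbw entry in that run's table, out-of-place and in-place together, so the small message sizes pull it well below the large-message plateau. A minimal sketch that checks this against the 8-GPU all_reduce_perf run above (the values are copied from that table; the helper name `avg_busbw` is only illustrative):

```python
# busbw columns (GB/s) copied from the 8-GPU all_reduce_perf table in this log
out_of_place = [2.82, 6.63, 28.48, 68.60, 72.92, 77.00, 78.57, 79.53]
in_place = [2.83, 6.52, 28.58, 69.78, 72.85, 77.11, 78.62, 79.37]


def avg_busbw(*columns):
    """Mean of all busbw samples across the given columns."""
    values = [v for col in columns for v in col]
    return sum(values) / len(values)


avg = avg_busbw(out_of_place, in_place)
peak = max(out_of_place + in_place)

# The log reports 51.8871; the rounded table values land within about 0.001 of it.
print(f"mean busbw over all sizes: {avg:.4f} GB/s")
print(f"large-message peak:        {peak:.2f} GB/s")
```

When comparing systems, the large-message rows are therefore likely a better basis than the averaged figure, since the average depends on which message sizes were swept.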
pzharrington commented 2 months ago

Hello! These numbers look broadly reasonable and actually a tiny bit better than what we observe on Perlmutter (e.g. 2-node, 8 GPU allreduce at large message sizes saturates around 76 GB/s of busbw for us).
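
For anyone lining up busbw figures between the two systems: nccl-tests derives busbw from algbw with a per-collective factor. The sketch below assumes the factors described in the nccl-tests performance notes (2(n-1)/n for all-reduce, (n-1)/n for all-gather and all-to-all, 1 for send/recv) and checks them against the largest-message rows posted above. The function name `busbw_factor` is illustrative, not part of nccl-tests.

```python
def busbw_factor(collective: str, n_ranks: int) -> float:
    """Assumed algbw -> busbw scale factor for a few nccl-tests collectives."""
    if collective == "all_reduce":
        return 2 * (n_ranks - 1) / n_ranks
    if collective in ("all_gather", "alltoall"):
        return (n_ranks - 1) / n_ranks
    if collective == "sendrecv":
        return 1.0
    raise ValueError(f"no factor known for {collective!r}")


# (collective, ranks, algbw GB/s, busbw GB/s) from the 8 GiB out-of-place rows
rows = [
    ("all_reduce", 8, 45.44, 79.53),   # inter-node, 8 GPUs
    ("all_reduce", 4, 26.70, 40.05),   # inter-node, 4 GPUs
    ("alltoall", 8, 27.39, 23.97),     # inter-node, 8 GPUs
    ("alltoall", 4, 263.23, 197.42),   # intra-node, 4 GPUs
    ("sendrecv", 8, 17.47, 17.47),     # inter-node, 8 GPUs
]

for name, n, algbw, busbw in rows:
    estimate = algbw * busbw_factor(name, n)
    print(f"{name:>10} n={n}: algbw {algbw:7.2f} -> busbw ~{estimate:7.2f} (log: {busbw})")
```

The estimates agree with the logged busbw to within rounding, which suggests the two sites can compare busbw directly as long as the rank counts match.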

It's interesting that it is working for you using the 1.7.4-aws branch of the NCCL plugin -- if you check the release notes for the releases tagged with -aws, they say:

> This release is intended only for use on AWS P* instances. A general release that supports other Libfabric networks will be made in the near future.

As such we've avoided using them and stuck with 1.6.0, which is the most recent non -aws tag they've released. I guess the -aws branches are more broadly compatible than that message might indicate, but also maybe any potential problems aren't getting rooted out by simple nccl-tests. Have you tried any more complex (e.g. deep learning training) workloads on your system with this stack?

benkirk commented 2 months ago

I can add that with the 1.6.0 NCCL plugin we've recently had some success with pytorch. We've also compiled it from source with some very minimal patches so that we can use NCCL+AWS-OFI but also use the MPI backend and run with cray-mpich. That all holds together, with MPI and NCCL performance pretty comparable. We did not try that with any of the newer -aws tags, though.

sparticlesteve commented 2 months ago

> We've also compiled it from source with some very minimal patches so that we can use NCCL+AWS-OFI but also use the MPI backend and run with cray-mpich

Can you share more detail about what needed to be patched in your case?

benkirk commented 2 months ago

Sure can - took a while to get all the build dependencies satisfied on our system, but once we did I was dismayed to see this runtime error:

[rank0]:   File "/glade/work/benkirk/conda-envs/pytorch-mpi-derecho-gcc-12.2.0-cray-mpich-8.1.27/lib/python3.12/site-packages/torch/distributed/distributed_c10d.py", line 2219, in all_reduce
[rank0]:     work = group.allreduce([tensor], opts)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: RuntimeError: CUDA tensor detected and the MPI used doesn't have CUDA-aware MPI support

Running git grep on that message and following the results led me to one or two files that need patching, depending on the pytorch version:

caffe2/mpi/mpi_ops_gpu.cc  # <-- gone in pytorch-v2.4.0
torch/csrc/distributed/c10d/ProcessGroupMPI.cpp

The root cause is that all of the CUDA-awareness logic is wrapped in #if defined(OPEN_MPI) && OPEN_MPI guards, and the fallback is to assume that whatever MPI is being used is not CUDA-aware.

So on Derecho with cray-mpich I simply hard-coded the fallbacks to assume the opposite, i.e. that the MPI is CUDA-aware.

https://github.com/benkirk/derecho-pytorch-mpi/tree/main/patches

This is all a work in progress, but we've had success running it. Ultimately I'd like to turn those patches into a PR for pytorch so that, for non-OpenMPI builds, you can define the fallback assumptions with a configure argument or something similar.

sparticlesteve commented 2 months ago

Ah, right, thanks! I build pytorch with cray-mpich support but never bothered with trying to enable cuda-awareness because of those issues, and figured NCCL would be the more performant option anyway. It's cool to know that it can be made to work, and it's interesting to hear you saw comparable performance. If you have any performance results to share I'd be interested to see them. If you manage to upstream the "fix" let us know. That'd be a nice pytorch contribution. Thanks again for sharing.