dphow opened 2 months ago
Hello! These numbers look broadly reasonable and actually a tiny bit better than what we observe on Perlmutter (e.g. 2-node, 8 GPU allreduce at large message sizes saturates around 76 GB/s of busbw for us).
It's interesting that it is working for you using the `1.7.4-aws` branch of the NCCL plugin -- if you check the release notes for the releases tagged with `-aws`, they say:

> This release is intended only for use on AWS P* instances. A general release that supports other Libfabric networks will be made in the near future.

As such we've avoided using them and stuck with `1.6.0`, which is the most recent non-`-aws` tag they've released. I guess the `-aws` branches are more broadly compatible than that message might indicate, but also maybe any potential problems aren't getting rooted out by simple `nccl-tests`. Have you tried any more complex (e.g. deep learning training) workloads on your system with this stack?
I can add that with the `1.6.0` NCCL plugin we've recently had some success with `pytorch`. We've also compiled it from source with some very minimal patches so that we can use NCCL+AWS-OFI but also use the MPI backend and run with `cray-mpich`. That all holds together, with MPI and NCCL performance pretty comparable. We did not try that with any of the newer `-aws` tags, though.
> We've also compiled it from source with some very minimal patches so that we can use NCCL+AWS-OFI but also use the MPI backend and run with `cray-mpich`
Can you share more detail about what needed to be patched in your case?
Sure can - took a while to get all the build dependencies satisfied on our system, but once we did I was dismayed to see this runtime error:
```
[rank0]:   File "/glade/work/benkirk/conda-envs/pytorch-mpi-derecho-gcc-12.2.0-cray-mpich-8.1.27/lib/python3.12/site-packages/torch/distributed/distributed_c10d.py", line 2219, in all_reduce
[rank0]:     work = group.allreduce([tensor], opts)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: RuntimeError: CUDA tensor detected and the MPI used doesn't have CUDA-aware MPI support
```
`git grep` and following that message led me to one or two files that need patching, depending on the pytorch version:
- `caffe2/mpi/mpi_ops_gpu.cc` (gone in pytorch-v2.4.0)
- `torch/csrc/distributed/c10d/ProcessGroupMPI.cpp`
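For context (paraphrasing from memory of a recent pytorch tree, so the exact lines vary by version), the Open MPI-only include and the check that produces the error above look roughly like this in `ProcessGroupMPI.cpp`:

```cpp
// Approximate excerpt, not verbatim: mpi-ext.h (which provides
// MPIX_Query_cuda_support) is only pulled in when building against Open MPI.
#if defined(OPEN_MPI) && OPEN_MPI
#include <mpi-ext.h> // Needed for CUDA-aware check
#endif

// Each collective funnels its input tensors through a helper along these
// lines, which is where the RuntimeError above comes from when
// cudaAwareMpiCheck() (see below) reports no CUDA support:
void checkSingleTensorHelper(const at::Tensor& tensor) {
  if (tensor.is_cuda() && !cudaAwareMpiCheck()) {
    TORCH_CHECK(
        false,
        "CUDA tensor detected and the MPI used doesn't have CUDA-aware MPI support");
  }
}
```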
The root cause is that all the CUDA-awareness is wrapped in `#if defined(OPEN_MPI) && OPEN_MPI` guards, and the fallback is to assume that whatever MPI is being used is not CUDA-aware. So on Derecho with `cray-mpich` I simply hard-coded the fallbacks to assume the opposite, i.e. CUDA-awareness.
https://github.com/benkirk/derecho-pytorch-mpi/tree/main/patches
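To give a flavor of what "hard-coding the fallback" means (an illustrative sketch only -- the actual patches are in the repo linked above), the idea is to flip the fallback in `cudaAwareMpiCheck()`:

```cpp
// Illustrative sketch of hard-coding the fallback, not a verbatim patch.
bool cudaAwareMpiCheck() {
#if defined(MPIX_CUDA_AWARE_SUPPORT)
  // Open MPI path: mpi-ext.h lets us query CUDA support at run time.
  return MPIX_Query_cuda_support() == 1;
#else
  // Upstream fallback: assume the MPI in use is NOT CUDA-aware.
  //   return false;
  // Patched fallback for a CUDA-aware cray-mpich build: assume that it is.
  return true;
#endif
}
```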
This is all a work-in-progress, but we've had success running it. Ultimately I'd like to turn those patches into a PR for pytorch so you can define the fallback assumptions with a configure argument or something for non-OpenMPI builds.
Ah, right, thanks! I build pytorch with cray-mpich support but never bothered trying to enable CUDA-awareness because of those issues, and figured NCCL would be the more performant option anyway. It's cool to know that it can be made to work, and it's interesting to hear you saw comparable performance. If you have any performance results to share I'd be interested to see them. If you manage to upstream the "fix", let us know -- that'd be a nice pytorch contribution. Thanks again for sharing.
Hello NERSC Team,
HPCD CISL staff at NSF NCAR would like to request a performance comparison between your use of this nccl-ofi-plugin and ours, as copied below. We used a test suite that expands a bit on your test runs, found in the GitHub fork by @benkirk -- please let us know if this is something you can do and turn around within a couple of weeks.
Primarily, we would like to compare the settings in https://github.com/benkirk/nccl-ofi-plugin/blob/main/env_nccl_derecho.sh in order to determine what should be optimal.
Note that the tests below used NCCL 2.22.3-1 and AWS NCCL plugin 1.7.4. I am comfortable using the latest NCCL version, but suspect it would be difficult to adopt AWS plugin versions beyond 1.7.4 given their specific targeting of AWS machines after that version.
Avg bus bandwidth (GB/s) per test suite run:

| Test | Intra-node (2 GPUs) | Intra-node (4 GPUs) | Inter-node (2 GPUs) | Inter-node (4 GPUs) | Inter-node (8 GPUs) |
| --- | --- | --- | --- | --- | --- |
| all-gather | 45.3614 | 127.491 | 14.0373 | 29.4738 | 50.7769 |
| all-reduce | 57.1399 | 154.882 | 16.6409 | 29.9176 | 51.8871 |
| all-to-all | 49.2425 | 135.873 | 13.1876 | 18.9097 | 21.0073 |
| send-recv | 56.3249 | 63.7496 | 14.6531 | 17.0042 | 16.3461 |
Thanks!