UCL-ARC / hpc-spack

Solutions - HPC's Spack config

Benchmarking OpenMPI with OSU micro-benchmarks #45

Open heatherkellyucl opened 1 year ago

heatherkellyucl commented 1 year ago

From #44 we want to know what Spack variants to build our main OpenMPI with. We are going to use the C MPI benchmarks from https://mvapich.cse.ohio-state.edu/benchmarks/ to compare how well the different builds perform on our OmniPath clusters.

Our existing mpi/openmpi/4.1.1/gnu-4.9.2 should (we assume!) be below acceptable performance, since it uses only vader.
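
For reference, the kind of candidate specs being compared would look something like the following. This is a sketch only, assuming the usual fabrics/schedulers variants of Spack's openmpi package; the values shown are placeholders, and choosing between them is the point of this benchmarking.

# hypothetical candidate builds to compare, not decisions
spack spec openmpi fabrics=psm2 schedulers=sge
spack spec openmpi fabrics=ofi schedulers=sge
# then spack install whichever spec the benchmarks favour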

Compiling the OSU microbenchmarks on Young

# wget couldn't validate cert
wget --no-check-certificate https://mvapich.cse.ohio-state.edu/download/mvapich/osu-micro-benchmarks-7.2.tar.gz
mkdir openmpi-4.1.1_vader
cd openmpi-4.1.1_vader
tar -xvf ../osu-micro-benchmarks-7.2.tar.gz

# modules for existing install
module purge
module load gcc-libs/4.9.2
module load compilers/gnu/4.9.2
module load numactl/2.0.12
module load psm2/11.2.185/gnu-4.9.2
module load mpi/openmpi/4.1.1/gnu-4.9.2
module load gnuplot

cd osu-micro-benchmarks-7.2
./configure CC=mpicc CXX=mpicxx --prefix=/home/cceahke/Scratch/mpi_benchmarks/openmpi-4.1.1_vader/osu-micro-benchmarks-7.2_install
make
make install

Now we have directories full of benchmarks:

ls ../osu-micro-benchmarks-7.2_install/libexec/osu-micro-benchmarks/mpi/
collective/ one-sided/  pt2pt/      startup/ 
ls ../osu-micro-benchmarks-7.2_install/libexec/osu-micro-benchmarks/mpi/pt2pt/
osu_bibw  osu_bw  osu_latency  osu_latency_mp  osu_latency_mt  osu_mbw_mr  osu_multi_lat  persistent/

Going to start with point-to-point then look at some collectives.

heatherkellyucl commented 1 year ago

mpi/openmpi/4.1.1/gnu-4.9.2

Point to point (2 processes, one on each node)

osu_latency - Latency Test

The latency tests are carried out in a ping-pong fashion. The sender sends a message with a certain data size to the receiver and waits for a reply from the receiver. The receiver receives the message from the sender and sends back a reply with the same data size. Many iterations of this ping-pong test are carried out and average one-way latency numbers are obtained. Blocking versions of the MPI functions (MPI_Send and MPI_Recv) are used in the tests.

# OSU MPI Latency Test v7.2
# Size          Latency (us)
# Datatype: MPI_CHAR.
1                       0.30
2                       0.30
4                       0.30
8                       0.30
16                      0.30
32                      0.47
64                      0.45
128                     0.47
256                     0.48
512                     0.49
1024                    0.55
2048                    0.72
4096                    1.03
8192                    1.82
16384                   1.89
32768                   2.56
65536                   3.91
131072                  6.57
262144                 12.10
524288                 27.61
1048576                82.84
2097152               205.97
4194304               431.72

[Attached graphs: osu_latency3D0, osu_latency3D1]

osu_bw - Bandwidth Test

The bandwidth tests are carried out by having the sender send out a fixed number (equal to the window size) of back-to-back messages to the receiver and then wait for a reply from the receiver. The receiver sends the reply only after receiving all of these messages. This process is repeated for several iterations and the bandwidth is calculated based on the elapsed time (from the time the sender sends the first message until the time it receives the reply back from the receiver) and the number of bytes sent by the sender. The objective of this bandwidth test is to determine the maximum sustained data rate that can be achieved at the network level. Thus, non-blocking versions of the MPI functions (MPI_Isend and MPI_Irecv) are used in the test.

# OSU MPI Bandwidth Test v7.2
# Size      Bandwidth (MB/s)
# Datatype: MPI_CHAR.
1                       5.69
2                      13.04
4                      26.34
8                      53.91
16                    110.64
32                    131.22
64                    264.14
128                   519.38
256                   985.25
512                  1879.42
1024                 3116.54
2048                 5287.21
4096                 8696.34
8192                 9963.65
16384               11427.77
32768               15292.84
65536               18926.47
131072              21485.62
262144              22887.94
524288              20205.33
1048576             13497.64
2097152             10308.30
4194304              9819.37

[Attached graphs: osu_bw3D0, osu_bw3D1]

There are also 2D graphs at each size, e.g. osu_bw-16, which has a nice repeating pattern.

[Attached graph: osu_bw-16]

heatherkellyucl commented 1 year ago

Example script for osu_latency

#!/bin/bash -l

#$ -l h_rt=0:30:0
#$ -l mem=2G
# like -pe mpi but 'wants single switch'
#$ -pe wss 80
#$ -N osu_latency_2

#$ -P Test
#$ -A Test_allocation

#$ -cwd

module unload -f compilers mpi 
module load compilers/gnu/4.9.2
module load numactl/2.0.12
module load psm2/11.2.185/gnu-4.9.2
module load mpi/openmpi/4.1.1/gnu-4.9.2
module load gnuplot/5.0.1 

#  * Additionally, the benchmarks offer following options:
#    * "-G" option can be used to output result in graphs
#    *      "-G tty" for graph output in terminal using ASCII characters
#    *      "-G png" for graph output in png format
#    *      "-G pdf" for graph output in pdf format (needs imagemagick's convert)

# test requires two processes, one on each node
sort -u "$TMPDIR/machines" > "$TMPDIR/machines.unique"

mpirun -np 2 --hostfile "$TMPDIR/machines.unique" ~/Scratch/mpi_benchmarks/openmpi-4.1.1_vader/osu-micro-benchmarks-7.2_install/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_latency -G png

These two only took 1-2 mins to run.
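
For the osu_bw run the launch line would just point at the other binary, e.g.:

# same two-process, two-node launch, pointed at osu_bw instead
mpirun -np 2 --hostfile "$TMPDIR/machines.unique" ~/Scratch/mpi_benchmarks/openmpi-4.1.1_vader/osu-micro-benchmarks-7.2_install/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bw -G png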

heatherkellyucl commented 1 year ago

osu_mbw_mr - Multiple Bandwidth / Message Rate Test

The multi-pair bandwidth and message rate test evaluates the aggregate uni-directional bandwidth and message rate between multiple pairs of processes. Each of the sending processes sends a fixed number of messages (the window size) back-to-back to the paired receiving process before waiting for a reply from the receiver. This process is repeated for several iterations. The objective of this benchmark is to determine the achieved bandwidth and message rate from one node to another node with a configurable number of processes running on each node.

This test can use all 80 cores across the two nodes. The test requires block (sequential) rather than round-robin rank assignment; our $TMPDIR/machines machinefile is fine.

This one segfaulted, need to check if I'm running it correctly.
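
For reference, a minimal sketch of how a full-size osu_mbw_mr launch could look, assuming the block-ordered $TMPDIR/machines file from -pe wss 80 and the install path above (so not necessarily what triggered the segfault):

# 80 ranks over the two nodes; as I understand it the first half of the ranks send
# to the second half, hence the need for block (per-node) rank ordering
mpirun -np 80 --hostfile "$TMPDIR/machines" ~/Scratch/mpi_benchmarks/openmpi-4.1.1_vader/osu-micro-benchmarks-7.2_install/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_mbw_mr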

heatherkellyucl commented 1 year ago

James has reminded me of -pe wss on Young, which always runs within a single switch; we should use it for benchmarking.

heatherkellyucl commented 1 year ago

mpi/openmpi/4.1.1/gnu-4.9.2

Blocking collectives (two nodes)

The latest OMB version includes benchmarks for various MPI blocking collective operations (MPI_Allgather, MPI_Alltoall, MPI_Allreduce, MPI_Barrier, MPI_Bcast, MPI_Gather, MPI_Reduce, MPI_Reduce_Scatter, MPI_Scatter and vector collectives). These benchmarks work in the following manner: suppose users run the osu_bcast benchmark with N processes; the benchmark measures the min, max and average latency of the MPI_Bcast collective operation across N processes, for various message lengths, over a large number of iterations. In the default version, these benchmarks report the average latency for each message length. Additionally, the benchmarks offer the following options:

"-f" can be used to report additional statistics of the benchmark, such as min and max latencies and the number of iterations.
"-m" can be used to set the minimum and maximum message length to be used in a benchmark. In the default version, the benchmarks report the latencies for up to 1MB message lengths. Examples: -m 128 (min = default, max = 128); -m 2:128 (min = 2, max = 128); -m 2: (min = 2, max = default).
"-x" can be used to set the number of warmup iterations to skip for each message length.
"-i" can be used to set the number of iterations to run for each message length.
"-M" can be used to set per-process maximum memory consumption. By default the benchmarks are limited to 512MB allocations.

Ran with the defaults for now.
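
For reference, a non-default invocation using those options might look something like this (illustrative values only; binary path as installed above, collective/ subdirectory):

# full stats (-f), message sizes 2 B to 1 MB (-m), 200 warmup (-x) and 1000 timed (-i) iterations
mpirun -np 80 --hostfile "$TMPDIR/machines" ~/Scratch/mpi_benchmarks/openmpi-4.1.1_vader/osu-micro-benchmarks-7.2_install/libexec/osu-micro-benchmarks/mpi/collective/osu_bcast -f -m 2:1048576 -x 200 -i 1000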

osu_bcast - MPI_Bcast Latency Test

# OSU MPI Broadcast Latency Test v7.2
# Datatype: MPI_CHAR.
# Size       Avg Latency(us)
1                       3.09
2                       2.92
4                       2.89
8                       2.90
16                      3.01
32                      4.88
64                      3.85
128                     3.84
256                     4.01
512                     4.32
1024                    5.41
2048                    6.95
4096                   10.16
8192                   16.11
16384                  29.23
32768                  52.23
65536                 103.47
131072                203.86
262144                408.88
524288                813.19
1048576              1657.63

[Attached graphs: osu_bcast3D0, osu_bcast3D1]

osu_allgather - MPI_Allgather Latency Test

# OSU MPI Allgather Latency Test v7.2
# Datatype: MPI_CHAR.
# Size       Avg Latency(us)
1                       8.19
2                       8.45
4                       8.70
8                       9.20
16                     10.12
32                     12.98
64                     17.83
128                    29.07
256                    52.60
512                   133.40
1024                  192.84
2048                  314.02
4096                  563.12
8192                  759.27
16384                1355.26
32768                2491.07
65536                5976.11
131072              12510.91
262144              23720.27
524288              37485.95
1048576             68757.53

[Attached graphs: osu_allgather3D0, osu_allgather3D1]

osu_alltoall - MPI_Alltoall Latency Test

# OSU MPI All-to-All Personalized Exchange Latency Test v7.2
# Datatype: MPI_CHAR.
# Size       Avg Latency(us)
1                      20.02
2                      18.70
4                      18.09
8                      18.43
16                     19.81
32                     24.23
64                     32.25
128                    54.86
256                    97.10
512                   186.57
1024                  256.20
2048                  418.41
4096                  848.88
8192                 1790.19
16384                3885.44
32768                4181.89
65536                8774.70
131072              17384.51
262144              35664.92
524288              70882.81
1048576            141618.11

[Attached graphs: osu_alltoall3D0, osu_alltoall3D1]

heatherkellyucl commented 1 year ago

Have built a set with mpi/openmpi/3.1.6/gnu-4.9.2, which ought to be a decently-performing MPI that knows about OmniPath, and submitted one job.
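
Build sketch only, assuming the same recipe as for 4.1.1 with the MPI module and install prefix swapped (the prefix shown here is a guess):

# in a fresh extraction of the osu-micro-benchmarks-7.2 tarball
module unload -f compilers mpi
module load compilers/gnu/4.9.2
module load numactl/2.0.12
module load psm2/11.2.185/gnu-4.9.2
module load mpi/openmpi/3.1.6/gnu-4.9.2
./configure CC=mpicc CXX=mpicxx --prefix=/home/cceahke/Scratch/mpi_benchmarks/openmpi-3.1.6/osu-micro-benchmarks-7.2_install
make
make install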

heatherkellyucl commented 1 year ago

mpi/openmpi/3.1.6/gnu-4.9.2

Point to point (2 processes, one on each node)

osu_latency - Latency Test

# OSU MPI Latency Test v7.2
# Size          Latency (us)
# Datatype: MPI_CHAR.
1                       0.37
2                       0.36
4                       0.36
8                       0.36
16                      0.36
32                      0.54
64                      0.52
128                     0.52
256                     0.56
512                     0.57
1024                    0.64
2048                    0.81
4096                    1.11
8192                    1.92
16384                   1.90
32768                   2.56
65536                   3.91
131072                  6.59
262144                 12.12
524288                 26.61
1048576                82.30
2097152               212.73
4194304               426.75

[Attached graphs: osu_latency3D0, osu_latency3D1]

osu_bw - Bandwidth Test

# OSU MPI Bandwidth Test v7.2
# Size      Bandwidth (MB/s)
# Datatype: MPI_CHAR.
1                       5.69
2                      12.34
4                      24.80
8                      48.95
16                     97.51
32                    133.80
64                    265.44
128                   437.24
256                   860.87
512                  1664.22
1024                 2852.03
2048                 4865.91
4096                 8305.70
8192                 9540.45
16384               11185.89
32768               15396.53
65536               18787.30
131072              21357.57
262144              22704.21
524288              20321.87
1048576             13489.96
2097152             10394.33
4194304             10096.41

[Attached graphs: osu_bw3D0, osu_bw3D1]

Blocking collectives (two nodes)

osu_bcast - MPI_Bcast Latency Test

# OSU MPI Broadcast Latency Test v7.2
# Datatype: MPI_CHAR.
# Size       Avg Latency(us)
1                       2.59
2                       2.45
4                       2.52
8                       2.46
16                      3.62
32                      5.14
64                      4.09
128                     4.11
256                     4.55
512                     5.14
1024                    7.12
2048                   10.28
4096                   14.55
8192                   22.36
16384                  39.09
32768                  71.01
65536                 133.01
131072                256.31
262144                510.76
524288               1516.11
1048576              3181.80

[Attached graphs: osu_bcast3D0, osu_bcast3D1]

osu_allgather - MPI_Allgather Latency Test

# OSU MPI Allgather Latency Test v7.2
# Datatype: MPI_CHAR.
# Size       Avg Latency(us)
1                       7.70
2                       7.94
4                       8.04
8                       8.91
16                      9.94
32                     13.22
64                     17.88
128                    29.24
256                    52.56
512                   112.08
1024                  223.72
2048                  376.52
4096                  677.56
8192                  769.14
16384                1357.01
32768                2451.52
65536                4859.00
131072               9563.56
262144              18630.84
524288              35178.58
1048576             70382.17

[Attached graphs: osu_allgather3D0, osu_allgather3D1]

osu_alltoall - MPI_Alltoall Latency Test

# OSU MPI All-to-All Personalized Exchange Latency Test v7.2
# Datatype: MPI_CHAR.
# Size       Avg Latency(us)
1                      19.00
2                      18.88
4                      18.37
8                      20.28
16                     21.71
32                     26.65
64                     34.82
128                    57.28
256                   119.58
512                   159.12
1024                  250.46
2048                  428.38
4096                  774.50
8192                 1579.99
16384                3284.83
32768                6130.75
65536                9018.75
131072              17932.21
262144              35571.86
524288              70918.13
1048576            141613.42

[Attached graphs: osu_alltoall3D0, osu_alltoall3D1]

heatherkellyucl commented 1 year ago

With the exception of bcast, which is rather different for 3.1.6 at the larger message sizes in both graphs, they'rethesamepicture.gif (± jitter).

The average latency reported on the osu_bcast graph also doesn't seem consistent with the graph pictured for OpenMPI 3.1.6; going to try including the -f option.

krishnakumarg1984 commented 1 year ago

Pretty close

heatherkellyucl commented 1 year ago

Results from rerunning bcast for openmpi 3.1.6 with -f

# OSU MPI Broadcast Latency Test v7.2
# Datatype: MPI_CHAR.
# Size       Avg Latency(us)   Min Latency(us)   Max Latency(us)  Iterations
1                       2.50              1.03              4.64        1000
2                       2.43              1.00              4.55        1000
4                       2.45              0.96              4.62        1000
8                       2.42              0.97              4.47        1000
16                      3.67              1.39              6.04        1000
32                      5.39              3.14              7.80        1000
64                      4.33              2.05              6.88        1000
128                     4.35              2.12              6.76        1000
256                     4.86              2.33              7.39        1000
512                     5.37              2.33              8.28        1000
1024                    7.29              2.66             11.26        1000
2048                    9.68              1.65             13.28        1000
4096                   13.78              2.90             18.03        1000
8192                   21.14              5.23             26.25        1000
16384                  36.56             12.41             44.05         100
32768                  71.62             50.57             82.76         100
65536                 133.35             98.69            152.44         100
131072                259.03            205.29            298.62         100
262144                514.35            413.84            588.67         100
524288               1121.96            144.24           1386.37         100
1048576              2179.01            275.34           2503.54         100

Big difference between max and min latency for the last two sizes, and the graph it draws doesn't show the max points. (It also doesn't make sense with the minimums, since those go below 275.34...)
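
One way to see the max points would be to re-plot the -f table manually with gnuplot (a sketch; bcast_f.dat is a hypothetical copy of the table above, whose # header lines gnuplot skips as comments):

gnuplot <<'EOF'
set terminal png
set output "osu_bcast_minmax.png"
set logscale xy
set xlabel "Size (bytes)"
set ylabel "Latency (us)"
plot "bcast_f.dat" using 1:2 with linespoints title "avg", \
     "bcast_f.dat" using 1:3 with linespoints title "min", \
     "bcast_f.dat" using 1:4 with linespoints title "max"
EOF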

[Attached graphs: osu_bcast-1048576, osu_bcast3D1]

Small sizes:

[Attached graph: osu_bcast3D0]