osu-latency - Latency Test
The latency tests are carried out in a ping-pong fashion. The sender sends a message of a certain data size to the receiver and waits for a reply. The receiver receives the message and sends back a reply of the same data size. Many iterations of this ping-pong exchange are carried out and the average one-way latency is obtained. The blocking MPI functions (MPI_Send and MPI_Recv) are used in the tests.
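As a rough illustration of the measurement loop, here is a minimal ping-pong sketch. It is not the OSU source; the message size, iteration count and printing only on rank 0 are just assumptions for the example.
/* Minimal ping-pong latency sketch: rank 0 and rank 1 bounce a message of
 * 'size' bytes back and forth; one-way latency is taken as half the average
 * round-trip time. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int iters = 1000, size = 1024;   /* illustrative values only */
    int rank;
    char *buf = malloc(size);

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("%d bytes: %.2f us one-way\n", size, (t1 - t0) * 1e6 / (2.0 * iters));

    free(buf);
    MPI_Finalize();
    return 0;
}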
# OSU MPI Latency Test v7.2
# Size Latency (us)
# Datatype: MPI_CHAR.
1 0.30
2 0.30
4 0.30
8 0.30
16 0.30
32 0.47
64 0.45
128 0.47
256 0.48
512 0.49
1024 0.55
2048 0.72
4096 1.03
8192 1.82
16384 1.89
32768 2.56
65536 3.91
131072 6.57
262144 12.10
524288 27.61
1048576 82.84
2097152 205.97
4194304 431.72
osu-bw - Bandwidth Test
The bandwidth tests are carried out by having the sender send a fixed number (equal to the window size) of back-to-back messages to the receiver and then wait for a reply. The receiver sends the reply only after receiving all of these messages. This process is repeated for several iterations, and the bandwidth is calculated from the elapsed time (from when the sender sends the first message until it receives the reply back from the receiver) and the number of bytes sent. The objective of this bandwidth test is to determine the maximum sustained data rate that can be achieved at the network level, so the non-blocking MPI functions (MPI_Isend and MPI_Irecv) are used in the test.
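A minimal sketch of this window-based measurement (again not the OSU source; the window size, message size and iteration count are illustrative assumptions):
/* Minimal bandwidth sketch: rank 0 posts a window of non-blocking sends,
 * rank 1 posts matching receives and replies once all have arrived;
 * bandwidth = bytes sent / elapsed time on the sender. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int iters = 100, window = 64, size = 65536;  /* illustrative values */
    int rank;
    char *buf = malloc((size_t)size * window);
    char ack;
    MPI_Request reqs[64];                               /* one request per message in the window */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            for (int w = 0; w < window; w++)
                MPI_Isend(buf + (size_t)w * size, size, MPI_CHAR, 1, 0,
                          MPI_COMM_WORLD, &reqs[w]);
            MPI_Waitall(window, reqs, MPI_STATUSES_IGNORE);
            MPI_Recv(&ack, 1, MPI_CHAR, 1, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            for (int w = 0; w < window; w++)
                MPI_Irecv(buf + (size_t)w * size, size, MPI_CHAR, 0, 0,
                          MPI_COMM_WORLD, &reqs[w]);
            MPI_Waitall(window, reqs, MPI_STATUSES_IGNORE);
            MPI_Send(&ack, 1, MPI_CHAR, 0, 1, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("%d bytes: %.2f MB/s\n", size,
               ((double)size * window * iters) / (t1 - t0) / 1e6);

    free(buf);
    MPI_Finalize();
    return 0;
}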
# OSU MPI Bandwidth Test v7.2
# Size Bandwidth (MB/s)
# Datatype: MPI_CHAR.
1 5.69
2 13.04
4 26.34
8 53.91
16 110.64
32 131.22
64 264.14
128 519.38
256 985.25
512 1879.42
1024 3116.54
2048 5287.21
4096 8696.34
8192 9963.65
16384 11427.77
32768 15292.84
65536 18926.47
131072 21485.62
262144 22887.94
524288 20205.33
1048576 13497.64
2097152 10308.30
4194304 9819.37
There are also 2D graphs at each size, e.g. osu-bw-16, which has a nice repeating pattern.
Example script for osu-latency
#!/bin/bash -l
#$ -l h_rt=0:30:0
#$ -l mem=2G
# like -pe mpi but 'wants single switch'
#$ -pe wss 80
#$ -N osu_latency_2
#$ -P Test
#$ -A Test_allocation
#$ -cwd
module unload -f compilers mpi
module load compilers/gnu/4.9.2
module load numactl/2.0.12
module load psm2/11.2.185/gnu-4.9.2
module load mpi/openmpi/4.1.1/gnu-4.9.2
module load gnuplot/5.0.1
# * Additionally, the benchmarks offer following options:
# * "-G" option can be used to output result in graphs
# * "-G tty" for graph output in terminal using ASCII characters
# * "-G png" for graph output in png format
# * "-G pdf" for graph output in pdf format (needs imagemagick's convert)
# test requires two processes, one on each node
sort -u "$TMPDIR/machines" > "$TMPDIR/machines.unique"
mpirun -np 2 --hostfile "$TMPDIR/machines.unique" ~/Scratch/mpi_benchmarks/openmpi-4.1.1_vader/osu-micro-benchmarks-7.2_install/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_latency -G png
These two only took 1-2 mins to run.
osu_mbw_mr - Multiple Bandwidth / Message Rate Test
The multi-pair bandwidth and message rate test evaluates the aggregate uni-directional bandwidth and message rate between multiple pairs of processes. Each of the sending processes sends a fixed number of messages (the window size) back-to-back to the paired receiving process before waiting for a reply from the receiver. This process is repeated for several iterations. The objective of this benchmark is to determine the achieved bandwidth and message rate from one node to another node with a configurable number of processes running on each node.
This test can use all 80 cores across the two nodes. The test requires block (sequential) rather than round-robin rank assignment; our $TMPDIR/machines machinefile is fine.
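For reference, a sketch of the pairing logic that makes block rank assignment matter (illustrative only, not the OSU code; it assumes an even number of ranks, with the first half on one node and the second half on the other, and illustrative window/size/iteration values):
/* Multi-pair bandwidth sketch: with 2*P block-assigned ranks, ranks 0..P-1
 * sit on node A and ranks P..2P-1 on node B, so rank r pairs with rank r+P.
 * Each sender streams a window of messages to its partner, as in the
 * single-pair test, and the per-pair bandwidths are summed on rank 0. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int iters = 100, window = 64, size = 65536;  /* illustrative values */
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);            /* assumed even */

    int pairs = nprocs / 2;
    int sender = rank < pairs;                         /* first node sends, second receives */
    int partner = sender ? rank + pairs : rank - pairs;

    char *buf = malloc((size_t)size * window);
    char ack;
    MPI_Request reqs[64];

    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        for (int w = 0; w < window; w++) {
            if (sender)
                MPI_Isend(buf + (size_t)w * size, size, MPI_CHAR, partner, 0,
                          MPI_COMM_WORLD, &reqs[w]);
            else
                MPI_Irecv(buf + (size_t)w * size, size, MPI_CHAR, partner, 0,
                          MPI_COMM_WORLD, &reqs[w]);
        }
        MPI_Waitall(window, reqs, MPI_STATUSES_IGNORE);
        if (sender)
            MPI_Recv(&ack, 1, MPI_CHAR, partner, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        else
            MPI_Send(&ack, 1, MPI_CHAR, partner, 1, MPI_COMM_WORLD);
    }
    double elapsed = MPI_Wtime() - t0;

    /* Sum the per-pair bandwidths on rank 0 to get the aggregate figure. */
    double mbps = sender ? ((double)size * window * iters) / elapsed / 1e6 : 0.0;
    double total;
    MPI_Reduce(&mbps, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("%d bytes: %.2f MB/s aggregate over %d pairs\n", size, total, pairs);

    free(buf);
    MPI_Finalize();
    return 0;
}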
This one segfaulted; need to check whether I'm running it correctly.
James has reminded me about -pe wss on Young, which always runs within a single switch; we should use it for benchmarking.
The latest OMB version includes benchmarks for various MPI blocking collective operations (MPI_Allgather, MPI_Alltoall, MPI_Allreduce, MPI_Barrier, MPI_Bcast, MPI_Gather, MPI_Reduce, MPI_Reduce_Scatter, MPI_Scatter and vector collectives). These benchmarks work in the following manner: suppose users run the osu_bcast benchmark with N processes; the benchmark measures the min, max and average latency of the MPI_Bcast collective operation across N processes, for various message lengths, over a large number of iterations. In the default version, these benchmarks report the average latency for each message length. Additionally, the benchmarks offer the following options:
- "-f" reports additional statistics, such as min and max latencies and the number of iterations.
- "-m" sets the minimum and maximum message length to be used in a benchmark. In the default version, the benchmarks report latencies for up to 1MB message lengths. Examples: "-m 128" (min = default, max = 128), "-m 2:128" (min = 2, max = 128), "-m 2:" (min = 2, max = default).
- "-x" sets the number of warmup iterations to skip for each message length.
- "-i" sets the number of iterations to run for each message length.
- "-M" sets the per-process maximum memory consumption. By default the benchmarks are limited to 512MB allocations.
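A minimal sketch of how this kind of collective latency measurement works (illustrative, not the OSU code; the warm-up count, iteration count and message size are assumptions), including the min/avg/max statistics that "-f" reports:
/* Collective latency sketch: time MPI_Bcast over many iterations on every
 * rank, then reduce the per-rank averages to min/avg/max across ranks. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int iters = 1000, warmup = 200, size = 1024;  /* illustrative values */
    int rank, nprocs;
    char *buf = malloc(size);

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Warm-up iterations are excluded from timing (cf. the "-x" option). */
    for (int i = 0; i < warmup; i++)
        MPI_Bcast(buf, size, MPI_CHAR, 0, MPI_COMM_WORLD);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++)
        MPI_Bcast(buf, size, MPI_CHAR, 0, MPI_COMM_WORLD);
    double local = (MPI_Wtime() - t0) * 1e6 / iters;     /* us per call on this rank */

    /* Reduce to min/avg/max across ranks (the extra statistics "-f" prints). */
    double lmin, lmax, lsum;
    MPI_Reduce(&local, &lmin, 1, MPI_DOUBLE, MPI_MIN, 0, MPI_COMM_WORLD);
    MPI_Reduce(&local, &lmax, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    MPI_Reduce(&local, &lsum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("%d bytes: avg %.2f  min %.2f  max %.2f us\n",
               size, lsum / nprocs, lmin, lmax);

    free(buf);
    MPI_Finalize();
    return 0;
}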
Ran with defaults atm.
osu-bcast - MPI_Bcast Latency Test
# OSU MPI Broadcast Latency Test v7.2
# Datatype: MPI_CHAR.
# Size Avg Latency(us)
1 3.09
2 2.92
4 2.89
8 2.90
16 3.01
32 4.88
64 3.85
128 3.84
256 4.01
512 4.32
1024 5.41
2048 6.95
4096 10.16
8192 16.11
16384 29.23
32768 52.23
65536 103.47
131072 203.86
262144 408.88
524288 813.19
1048576 1657.63
osu_allgather - MPI_Allgather Latency Test
# OSU MPI Allgather Latency Test v7.2
# Datatype: MPI_CHAR.
# Size Avg Latency(us)
1 8.19
2 8.45
4 8.70
8 9.20
16 10.12
32 12.98
64 17.83
128 29.07
256 52.60
512 133.40
1024 192.84
2048 314.02
4096 563.12
8192 759.27
16384 1355.26
32768 2491.07
65536 5976.11
131072 12510.91
262144 23720.27
524288 37485.95
1048576 68757.53
osu_alltoall - MPI_Alltoall Latency Test
# OSU MPI All-to-All Personalized Exchange Latency Test v7.2
# Datatype: MPI_CHAR.
# Size Avg Latency(us)
1 20.02
2 18.70
4 18.09
8 18.43
16 19.81
32 24.23
64 32.25
128 54.86
256 97.10
512 186.57
1024 256.20
2048 418.41
4096 848.88
8192 1790.19
16384 3885.44
32768 4181.89
65536 8774.70
131072 17384.51
262144 35664.92
524288 70882.81
1048576 141618.11
Have built a set with mpi/openmpi/3.1.6/gnu-4.9.2, which ought to be a decently-performing MPI that knows about OmniPath, and submitted one.
osu-latency - Latency Test
# OSU MPI Latency Test v7.2
# Size Latency (us)
# Datatype: MPI_CHAR.
1 0.37
2 0.36
4 0.36
8 0.36
16 0.36
32 0.54
64 0.52
128 0.52
256 0.56
512 0.57
1024 0.64
2048 0.81
4096 1.11
8192 1.92
16384 1.90
32768 2.56
65536 3.91
131072 6.59
262144 12.12
524288 26.61
1048576 82.30
2097152 212.73
4194304 426.75
osu-bw - Bandwidth Test
# OSU MPI Bandwidth Test v7.2
# Size Bandwidth (MB/s)
# Datatype: MPI_CHAR.
1 5.69
2 12.34
4 24.80
8 48.95
16 97.51
32 133.80
64 265.44
128 437.24
256 860.87
512 1664.22
1024 2852.03
2048 4865.91
4096 8305.70
8192 9540.45
16384 11185.89
32768 15396.53
65536 18787.30
131072 21357.57
262144 22704.21
524288 20321.87
1048576 13489.96
2097152 10394.33
4194304 10096.41
osu-bcast - MPI_Bcast Latency Test
# OSU MPI Broadcast Latency Test v7.2
# Datatype: MPI_CHAR.
# Size Avg Latency(us)
1 2.59
2 2.45
4 2.52
8 2.46
16 3.62
32 5.14
64 4.09
128 4.11
256 4.55
512 5.14
1024 7.12
2048 10.28
4096 14.55
8192 22.36
16384 39.09
32768 71.01
65536 133.01
131072 256.31
262144 510.76
524288 1516.11
1048576 3181.80
osu_allgather - MPI_Allgather Latency Test
# OSU MPI Allgather Latency Test v7.2
# Datatype: MPI_CHAR.
# Size Avg Latency(us)
1 7.70
2 7.94
4 8.04
8 8.91
16 9.94
32 13.22
64 17.88
128 29.24
256 52.56
512 112.08
1024 223.72
2048 376.52
4096 677.56
8192 769.14
16384 1357.01
32768 2451.52
65536 4859.00
131072 9563.56
262144 18630.84
524288 35178.58
1048576 70382.17
osu_alltoall - MPI_Alltoall Latency Test
# OSU MPI All-to-All Personalized Exchange Latency Test v7.2
# Datatype: MPI_CHAR.
# Size Avg Latency(us)
1 19.00
2 18.88
4 18.37
8 20.28
16 21.71
32 26.65
64 34.82
128 57.28
256 119.58
512 159.12
1024 250.46
2048 428.38
4096 774.50
8192 1579.99
16384 3284.83
32768 6130.75
65536 9018.75
131072 17932.21
262144 35571.86
524288 70918.13
1048576 141613.42
With the exception of bcast, which is rather different for 3.1.6 in the larger messages of both graphs, they're the same picture (±jitter).
The average latency reported on the osu_bcast graph also doesn't seem to match the graph pictured for OpenMPI 3.1.6; going to try including the -f option.
Pretty close
Results from rerunning bcast for openmpi 3.1.6 with -f
# OSU MPI Broadcast Latency Test v7.2
# Datatype: MPI_CHAR.
# Size Avg Latency(us) Min Latency(us) Max Latency(us) Iterations
1 2.50 1.03 4.64 1000
2 2.43 1.00 4.55 1000
4 2.45 0.96 4.62 1000
8 2.42 0.97 4.47 1000
16 3.67 1.39 6.04 1000
32 5.39 3.14 7.80 1000
64 4.33 2.05 6.88 1000
128 4.35 2.12 6.76 1000
256 4.86 2.33 7.39 1000
512 5.37 2.33 8.28 1000
1024 7.29 2.66 11.26 1000
2048 9.68 1.65 13.28 1000
4096 13.78 2.90 18.03 1000
8192 21.14 5.23 26.25 1000
16384 36.56 12.41 44.05 100
32768 71.62 50.57 82.76 100
65536 133.35 98.69 152.44 100
131072 259.03 205.29 298.62 100
262144 514.35 413.84 588.67 100
524288 1121.96 144.24 1386.37 100
1048576 2179.01 275.34 2503.54 100
Big difference between max and min latency on the last two sizes, and the graph it draws doesn't show the max points. (It also doesn't make sense with the minimums, since those go below 275.34...)
Small sizes
From #44 we want to know what Spack variants to build our main OpenMPI with. We are going to use the C MPI benchmarks from https://mvapich.cse.ohio-state.edu/benchmarks/ to compare how well they perform on our OmniPath clusters.
Our existing mpi/openmpi/4.1.1/gnu-4.9.2 should be below acceptable performance (we assume!), using only vader.
Compiling the OSU microbenchmarks on Young
Now got directories full of benchmarks:
Going to start with point-to-point then look at some collectives.