
Comb v0.3.1

Comb is a communication performance benchmarking tool. It is used to determine performance tradeoffs in implementing communication patterns on high performance computing (HPC) platforms. At its core, Comb runs combinations of communication patterns, execution patterns, and memory spaces in order to find efficient combinations. The capabilities Comb provides are described in the sections below.

It is important to note that Comb is very much a work-in-progress. Additional features will appear in future releases.

Quick Start

The Comb code lives in a GitHub repository. To clone the repo, use the command:

git clone --recursive https://github.com/llnl/comb.git

On an LC (Livermore Computing) system you can build Comb using the provided CMake scripts and host-configs.

./scripts/lc-builds/blueos_nvcc_gcc.sh 10.1.243 sm_70 8.3.1
cd build_lc_blueos-nvcc10.1.243-sm_70-gcc8.3.1
make

You can also create your own script and host-config, provided you have a C++ compiler that supports the C++11 standard, an MPI library with a compiler wrapper, and, optionally, an install of CUDA 9.0 or later.

./scripts/my-builds/compiler_version.sh
cd build_my_compiler_version
make
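
Such a script is essentially a thin wrapper around a CMake invocation. The sketch below is only an illustration of the general shape; the compiler, wrapper, and option values are assumptions, so see the scripts under scripts/lc-builds for working examples of what Comb actually expects.

#!/usr/bin/env bash
# Hypothetical scripts/my-builds/compiler_version.sh, modeled loosely on the lc-builds scripts.
# The compiler and options below are assumptions; consult scripts/lc-builds for real examples.
mkdir -p build_my_compiler_version && cd build_my_compiler_version
cmake \
  -DCMAKE_BUILD_TYPE=Release \
  -DCMAKE_CXX_COMPILER=mpicxx \
  -DCMAKE_CXX_STANDARD=11 \
  ..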

To run the basic tests, make a directory and create symlinks to the comb executable and scripts. The scripts expect a symlink to comb to exist in the run directory. The run_tests.bash script runs the basic_tests.bash script on 2^3 = 8 processes.

ln -s /path/to/comb/build_my_compiler_version/bin/comb .
ln -s /path/to/comb/scripts/* .
./run_tests.bash 2 basic_tests.bash

User Documentation

Minimal documentation is available.

Comb runs every enabled combination of execution pattern and memory space. Each rank prints its results to stdout. The sep_out.bash script may be used to simplify data collection by piping the output of each rank into a different file. The combine_output.lua script may be used to simplify aggregation of data from multiple files.
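
As a rough illustration of the per-rank separation that sep_out.bash automates, the wrapper below redirects each rank's stdout to its own file. This is a stand-in, not the actual interface of sep_out.bash, and the environment variable shown is Open MPI specific.

# Stand-in for sep_out.bash: redirect each rank's stdout to its own file.
# OMPI_COMM_WORLD_RANK is Open MPI specific; other launchers expose different variables.
mpirun -np 8 bash -c './comb > comb_rank_${OMPI_COMM_WORLD_RANK}.out'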

Comb uses a variety of manual packing/unpacking execution techniques, such as sequential, OpenMP, and CUDA. Comb also uses MPI_Pack/MPI_Unpack with MPI derived datatypes for packing/unpacking. (Note: tests using CUDA managed memory and MPI datatypes are disabled as they sometimes produce incorrect results.)

Comb creates a different MPI communicator for each test. This communicator is assigned a generic name unless MPI datatypes are used for packing and unpacking. When MPI datatypes are used, the name of the memory allocator is appended to the communicator name.

Configure Options

The CMake configuration options change which execution patterns and memory spaces are enabled.
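
For illustration only, a configure line that toggles back-ends might look like the sketch below. The option names are assumptions based on common BLT/CMake conventions, not a definitive list; check the host-config files under scripts/ for the options Comb actually recognizes.

# Option names here are assumptions; see the provided host-configs for the real ones.
cmake \
  -DENABLE_OPENMP=ON \
  -DENABLE_CUDA=ON \
  -DCUDA_ARCH=sm_70 \
  ..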

Runtime Options

The runtime options change the properties of the grid and its decomposition, as well as the communication pattern used.

Example Script

run_tests.bash is an example script that allocates resources and uses a script such as focused_tests.bash to run the code in a variety of configurations. The run_tests.bash script takes two arguments: the number of processes per side used to split the grid into an N x N x N decomposition, and the test script to run.

mkdir 1_1_1
cd 1_1_1
ln -s path/to/comb/build/bin/comb .
ln -s path/to/comb/scripts/* .
./run_tests.bash 1 focused_tests.bash
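
The same pattern scales up: passing 2 as the first argument splits the grid into a 2 x 2 x 2 decomposition and runs the test script on 8 processes.

mkdir 2_2_2
cd 2_2_2
ln -s path/to/comb/build/bin/comb .
ln -s path/to/comb/scripts/* .
./run_tests.bash 2 focused_tests.bash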

The scale_tests.bash script, used with run_tests.bash, shows the options available and how the code may be run with multiple sets of arguments with MPI. The focused_tests.bash script, used with run_tests.bash, shows the options available and how the code may be run with a single set of arguments with MPI.

Output

Comb outputs Comb_(number)_summary and Comb_(number)_proc(number) files. The summary file contains aggregated results from the proc files, which contain per-process results. The files contain the argument and code setup information and the results of multiple tests. The results for each test follow a line beginning with "Starting test" and the name of the test.

The first set of tests consists of memory copy tests, whose names have the following form.

Starting test memcpy (execution policy) dst (destination memory space) src (source memory space)
copy_sync-(number of variables)-(elements per variable)-(bytes per element): num (number of repeats) avg (time) s min (time) s max (time) s

Example:

Starting test memcpy seq dst Host src Host
copy_sync-3-1061208-8: num 200 avg 0.123456789 s min 0.123456789 s max 0.123456789 s

This is a test in which memory is copied via sequential CPU execution to one host memory buffer from another host memory buffer. The test involves one measurement.

copy_sync-3-1061208-8: copying 3 buffers of 1061208 elements of 8 bytes each.
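
Because the result lines share a fixed format, standard text tools are enough to pull out timings. For example, the sketch below (field positions assumed from the format shown above) prints each memcpy measurement name with its average time.

# Print the measurement name and average time for every copy_sync result.
# Field positions are assumed from the output format shown above.
awk '/^copy_sync/ {print $1, $5}' Comb_*_summary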

The second set of tests consists of the message passing tests, whose names have the following form.

Comm (message passing execution policy) Mesh (physics execution policy) (mesh memory space) Buffers (large message execution policy) (large message memory space) (small message execution policy) (small message memory space)
(test phase): num (number of repeats) avg (time) s min (time) s max (time) s
...

Example

Comm mpi Mesh seq Host Buffers seq Host seq Host
pre-comm:  num 200 avg 0.123456789 s min 0.123456789 s max 0.123456789 s
post-recv: num 200 avg 0.123456789 s min 0.123456789 s max 0.123456789 s
post-send: num 200 avg 0.123456789 s min 0.123456789 s max 0.123456789 s
wait-recv: num 200 avg 0.123456789 s min 0.123456789 s max 0.123456789 s
wait-send: num 200 avg 0.123456789 s min 0.123456789 s max 0.123456789 s
post-comm: num 200 avg 0.123456789 s min 0.123456789 s max 0.123456789 s
start-up:   num 8 avg 0.123456789 s min 0.123456789 s max 0.123456789 s
test-comm:  num 8 avg 0.123456789 s min 0.123456789 s max 0.123456789 s
bench-comm: num 8 avg 0.123456789 s min 0.123456789 s max 0.123456789 s

This is a test in which a mesh is updated with physics running via sequential CPU execution using memory allocated in host memory. The buffers used for large messages are packed/unpacked via sequential CPU execution and allocated in host memory, and the buffers used with MPI for small messages are packed/unpacked via sequential CPU execution and allocated in host memory. This test involves multiple measurements; the first six time individual parts of the physics cycle and communication.
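
To inspect one of these configurations across output files, grep for its header line and print the timing lines that follow it. The sketch below assumes the nine result lines appear immediately after the header, as in the example above.

# Show the phase timings that follow a particular "Comm ... Mesh ... Buffers ..." header.
# Assumes the nine result lines directly follow the header, as in the example above.
grep -A 9 "Comm mpi Mesh seq Host Buffers seq Host seq Host" Comb_*_summary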

Execution Policies

Note: The cudaGraph exec policy updates the graph each cycle. There is currently no option to use the same graph for every cycle.

Memory Spaces

Note: Some memory spaces are pooled. This is done to amortize the cost of allocation. After the first allocation, the cost of allocating memory should be trivial for pooled memory spaces. The first allocation is done in a warmup step and is not included in any timers.

Related Software

The RAJA Performance Suite contains a collection of loop kernels implemented in multiple RAJA and non-RAJA variants. We use it to monitor and assess RAJA performance on different platforms using a variety of compilers.

The RAJA Proxies repository contains RAJA versions of several important HPC proxy applications.

Contributions

The Comb team follows the GitFlow development model. Folks wishing to contribute to Comb should include their work in a feature branch created from the Comb develop branch. Then, create a pull request with the develop branch as the destination. That branch contains the latest work in Comb. Periodically, we will merge the develop branch into the master branch and tag a new release.
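
In practice that workflow looks roughly like the following; the branch name is illustrative.

# Create a feature branch from develop and push it for a pull request.
git checkout develop
git pull origin develop
git checkout -b feature/my-change
# ... commit your work ...
git push -u origin feature/my-change
# Then open a pull request on GitHub targeting the develop branch.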

Authors

Thanks to all of Comb's contributors.

Comb was created by Jason Burmark (burmark1@llnl.gov).

Release

Comb is released under an MIT license. For more details, please see the LICENSE, RELEASE, and NOTICE files.

LLNL-CODE-758885