Comb v0.3.1

Comb is a communication performance benchmarking tool. It is used to determine performance tradeoffs in implementing communication patterns on high performance computing (HPC) platforms. At its core, Comb runs combinations of communication patterns, execution patterns, and memory spaces in order to find efficient combinations. The current set of capabilities Comb provides includes:

Configurable structured mesh halo exchange communication.
A variety of communication patterns based on grouping messages.
A variety of execution patterns including serial, openmp threading, cuda, and cuda fused kernels.
A variety of memory spaces including default system allocated memory, pinned host memory, cuda device memory, and cuda managed memory with different cuda memory advice.

It is important to note that Comb is very much a work-in-progress. Additional features will appear in future releases.

Quick Start

The Comb code lives in a GitHub repository. To clone the repo, use the command:

git clone --recursive https://github.com/llnl/comb.git

On an LC system, you can build Comb using the provided cmake scripts and host-configs.

./scripts/lc-builds/blueos_nvcc_gcc.sh 10.1.243 sm_70 8.3.1
cd build_lc_blueos-nvcc10.1.243-sm_70-gcc8.3.1
make

You can also create your own script and host-config, provided you have a C++ compiler that supports the C++11 standard, an MPI library with compiler wrapper, and optionally an installation of cuda 9.0 or later.

./scripts/my-builds/compiler_version.sh
cd build_my_compiler_version
make

To run basic tests, make a directory and create symlinks in it to the comb executable and scripts. The scripts expect a symlink to comb to exist in the run directory. In the example below, the run_tests.bash script runs the basic_tests.bash script on 2^3 = 8 processes.

ln -s /path/to/comb/build_my_compiler_version/bin/comb .
ln -s /path/to/comb/scripts/* .
./run_tests.bash 2 basic_tests.bash

User Documentation

Minimal documentation is available.

Comb runs every enabled combination of execution pattern and memory space. Each rank prints its results to stdout. The sep_out.bash script may be used to simplify data collection by piping the output of each rank into a different file. The combine_output.lua Lua script may be used to simplify data aggregation from multiple files.

Comb uses a variety of manual packing/unpacking execution techniques such as sequential, openmp, and cuda. Comb also uses MPI_Pack/MPI_Unpack with MPI derived datatypes for packing/unpacking. (Note: tests using cuda managed memory and MPI datatypes are disabled, as they sometimes produce incorrect results.)

Comb creates a different MPI communicator for each test. This communicator is assigned a generic name unless MPI datatypes are used for packing and unpacking. When MPI datatypes are used, the name of the memory allocator is appended to the communicator name.

Configure Options

The cmake configuration options change which execution patterns and memory spaces are enabled.

ENABLE_MPI Allow use of mpi and enable test combinations using mpi
ENABLE_OPENMP Allow use of openmp and enable test combinations using openmp
ENABLE_CUDA Allow use of cuda and enable test combinations using cuda
ENABLE_RAJA Allow use of the RAJA performance portability library
ENABLE_CALIPER Allow use of the Caliper performance profiling library
ENABLE_ADIAK Allow use of the Adiak library for recording program metadata

Runtime Options

The runtime options change the properties of the grid and its decomposition, as well as the communication pattern used.

#_#_# Grid size in each dimension (Required)
-divide #_#_# Number of subgrids in each dimension (Required)
-periodic #_#_# Periodicity in each dimension
-ghost #_#_# The halo width or number of ghost zones in each dimension
-vars # The number of grid variables
-comm option Communication options
- cutoff # Number of elements cutoff between large and small message packing kernels
- enable|disable option Enable or disable specific message passing execution policies
  - all all message passing execution patterns
  - mock mock message passing execution pattern (do not communicate)
  - mpi mpi message passing execution pattern
  - gdsync libgdsync message passing execution pattern (experimental)
  - gpump libgpump message passing execution pattern
  - mp libmp message passing execution pattern (experimental)
  - umr umr message passing execution pattern (experimental)
- post_recv option Communication post receive (MPI_Irecv) options
  - wait_any Post recvs one-by-one
  - wait_some Post recvs in groups
  - wait_all Post all recvs
  - test_any Post recvs one-by-one
  - test_some Post recvs in groups
  - test_all Post all recvs
- post_send option Communication post send (MPI_Isend) options
  - wait_any pack and send messages one-by-one
  - wait_some pack messages then send them in groups
  - wait_all pack all messages then send them all
  - test_any pack messages asynchronously and send when ready
  - test_some pack multiple messages asynchronously and send when ready
  - test_all pack all messages asynchronously and send when ready
- wait_recv option Communication wait to recv and unpack (MPI_Wait, MPI_Test) options
  - wait_any recv and unpack messages one-by-one (MPI_Waitany)
  - wait_some recv messages then unpack them in groups (MPI_Waitsome)
  - wait_all recv all messages then unpack them all (MPI_Waitall)
  - test_any recv and unpack messages one-by-one (MPI_Testany)
  - test_some recv messages then unpack them in groups (MPI_Testsome)
  - test_all recv all messages then unpack them all (MPI_Testall)
- wait_send option Communication wait on sends (MPI_Wait, MPI_Test) options
  - wait_any Wait for each send to complete one-by-one (MPI_Waitany)
  - wait_some Wait for all sends to complete in groups (MPI_Waitsome)
  - wait_all Wait for all sends to complete (MPI_Waitall)
  - test_any Wait for each send to complete one-by-one by polling (MPI_Testany)
  - test_some Wait for all sends to complete in groups by polling (MPI_Testsome)
  - test_all Wait for all sends to complete by polling (MPI_Testall)
- allow|disallow option Allow or disallow specific communications options
  - per_message_pack_fusing Combine packing/unpacking kernels for boundaries communicated in the same message
  - message_group_pack_fusing Fuse packing/unpacking kernels across messages (and variables) in the same message group
-cycles # Number of times the communication pattern is tested
-omp_threads # Number of openmp threads requested
-exec option Execution options
- enable|disable option Enable or disable specific execution patterns
  - all all execution patterns
  - seq sequential CPU execution pattern
  - omp openmp threaded CPU execution pattern
  - cuda cuda GPU execution pattern
  - cuda_graph cuda GPU batched via cuda graph API execution pattern
  - hip hip GPU execution pattern
  - raja_seq RAJA sequential CPU execution pattern
  - raja_omp RAJA openmp threaded CPU execution pattern
  - raja_cuda RAJA cuda GPU execution pattern
  - raja_hip RAJA hip GPU execution pattern
  - mpi_type MPI datatypes MPI implementation execution pattern
-memory option Memory space options
- UseType enable|disable Optional UseType modifier for enable|disable, default is all. UseType specifies which uses to enable|disable; for example, "-memory buffer disable cuda_hostpinned" disables cuda_hostpinned buffer allocations.
  - all all use types
  - mesh mesh use type
  - buffer buffer use type
- enable|disable option Enable or disable specific memory spaces for UseType allocations
  - all all memory spaces
  - host host CPU memory space
  - cuda_hostpinned cuda pinned memory space (pooled)
  - cuda_device cuda device memory space (pooled)
  - cuda_managed cuda managed memory space (pooled)
  - cuda_managed_host_preferred cuda managed with host preferred advice memory space (pooled)
  - cuda_managed_host_preferred_device_accessed cuda managed with host preferred and device accessed advice memory space (pooled)
  - cuda_managed_device_preferred cuda managed with device preferred advice memory space (pooled)
  - cuda_managed_device_preferred_host_accessed cuda managed with device preferred and host accessed advice memory space (pooled)
  - hip_hostpinned hip pinned memory space (pooled)
  - hip_hostpinned_coarse hip coarse grained (non-coherent) pinned memory space (pooled)
  - hip_device hip device memory space (pooled)
  - hip_device_fine hip fine grained device memory space (pooled)
  - hip_managed hip managed memory space (pooled)
-cuda_aware_mpi Assert that you are using a cuda aware mpi implementation and enable tests that pass cuda device or managed memory to MPI
-hip_aware_mpi Assert that you are using a hip aware mpi implementation and enable tests that pass hip device or managed memory to MPI
-cuda_host_accessible_from_device Assert that your system supports pageable host memory access from the device and enable tests that access pageable host memory on the device
-use_device_preferred_for_cuda_util_aloc Use device preferred host accessed memory for cuda utility allocations instead of host pinned memory, mainly affects fused kernels
-use_device_for_hip_util_aloc Use device memory for hip utility allocations instead of host pinned memory, mainly affects fused kernels
-print_packing_sizes Print message and packing sizes to proc files
-print_message_sizes Print message sizes to proc files
-caliper_config Caliper performance profiling config (e.g., "runtime-report")
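
As a rough illustration of how the options above combine, the sketch below assembles one command line and prints it instead of running it. The launcher (mpirun -np), the grid size, and every option value shown are assumptions chosen for illustration, and the helper names are hypothetical; substitute the launcher your system uses. It also assumes that the number of MPI processes should equal the product of the -divide values.

```shell
#!/bin/sh
# Dry-run sketch: build a Comb command line from the options above and print it.
# The launcher (mpirun) and all option values are illustrative assumptions.

# Multiply the three fields of a #_#_# string, e.g. 2_2_2 -> 8.
# The function body runs in a subshell so the IFS change does not leak.
product_of_dims() (
  IFS=_
  set -- $1
  echo $(( $1 * $2 * $3 ))
)

# Print a command line for the given grid size and decomposition.
build_comb_command() {
  size=$1
  divide=$2
  # Assumption: one MPI process per subgrid.
  procs=$(product_of_dims "$divide")
  echo "mpirun -np $procs ./comb $size" \
    "-divide $divide -periodic 1_1_1 -ghost 1_1_1" \
    "-vars 3 -cycles 25" \
    "-comm cutoff 250" \
    "-comm post_recv wait_any -comm post_send wait_any" \
    "-comm wait_recv wait_any -comm wait_send wait_all" \
    "-exec enable seq" \
    "-memory enable host"
}

build_comb_command 100_100_100 2_2_2
```

Printing the command first makes it easy to check the option spelling before handing the line to a batch system.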

Example Script

The run_tests.bash script is an example that allocates resources and uses a script such as focused_tests.bash to run the code in a variety of configurations. It takes two arguments: the number of processes per side, N, used to split the grid into an N x N x N decomposition, and the tests script.

mkdir 1_1_1
cd 1_1_1
ln -s path/to/comb/build/bin/comb .
ln -s path/to/comb/scripts/* .
./run_tests.bash 1 focused_tests.bash
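
To make the relationship between the first argument and the job size concrete, the sketch below prints a run directory name and the resulting process count for a few values of N. The helper name is hypothetical, and naming the directory N_N_N is only inferred from the 1_1_1 example above.

```shell
#!/bin/sh
# Sketch: for N processes per side, print "<N_N_N> <N^3>".
# describe_run is a hypothetical helper, not one of the Comb scripts.
describe_run() {
  n=$1
  echo "${n}_${n}_${n} $(( n * n * n ))"
}

# Directory name and process count for three decomposition sizes.
for side in 1 2 4; do
  describe_run "$side"
done
```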

The scale_tests.bash script, used with run_tests.bash, shows the options available and how the code may be run with multiple sets of arguments with mpi. The focused_tests.bash script, used with run_tests.bash, shows the options available and how the code may be run with one set of arguments with mpi.

Output

Comb outputs Comb_(number)_summary and Comb_(number)_proc(number) files. The summary file contains aggregated results from the proc files, which contain per-process results. The files contain the argument and code setup information and the results of multiple tests. The results for each test follow a line starting with "Starting test" and the name of the test.

The first set of tests are memory copy tests with names of the following form.

Starting test memcpy (execution policy) dst (destination memory space) src (source memory space)
copy_sync-(number of variables)-(elements per variable)-(bytes per element): num (number of repeats) avg (time) s min (time) s max (time) s

Example:

Starting test memcpy seq dst Host src Host
copy_sync-3-1061208-8: num 200 avg 0.123456789 s min 0.123456789 s max 0.123456789 s

This is a test in which memory is copied via sequential cpu execution to one host memory buffer from another host memory buffer. The test involves one measurement.

copy_sync-3-1061208-8 Copying 3 buffers of 1061208 elements, each element 8 bytes in size.
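
The label carries enough information to work out the data volume of one copy. The sketch below splits a label on its dashes and multiplies the three numeric fields. The helper name is hypothetical, and the field meanings are taken from the name form shown above.

```shell
#!/bin/sh
# Sketch: decode a copy_sync label into the total number of bytes copied.
# copy_sync_bytes is a hypothetical helper, not part of Comb.
# The function body runs in a subshell so the IFS change does not leak.
copy_sync_bytes() (
  IFS=-
  set -- $1
  # $1=copy_sync  $2=variables  $3=elements per variable  $4=bytes per element
  echo $(( $2 * $3 * $4 ))
)

copy_sync_bytes copy_sync-3-1061208-8   # prints 25468992
```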

The second set of tests are the message passing tests with names of the following form.

Comm (message passing execution policy) Mesh (physics execution policy) (mesh memory space) Buffers (large message execution policy) (large message memory space) (small message execution policy) (small message memory space)
(test phase): num (number of repeats) avg (time) s min (time) s max (time) s
...

Example:

Comm mpi Mesh seq Host Buffers seq Host seq Host
pre-comm:  num 200 avg 0.123456789 s min 0.123456789 s max 0.123456789 s
post-recv: num 200 avg 0.123456789 s min 0.123456789 s max 0.123456789 s
post-send: num 200 avg 0.123456789 s min 0.123456789 s max 0.123456789 s
wait-recv: num 200 avg 0.123456789 s min 0.123456789 s max 0.123456789 s
wait-send: num 200 avg 0.123456789 s min 0.123456789 s max 0.123456789 s
post-comm: num 200 avg 0.123456789 s min 0.123456789 s max 0.123456789 s
start-up:   num 8 avg 0.123456789 s min 0.123456789 s max 0.123456789 s
test-comm:  num 8 avg 0.123456789 s min 0.123456789 s max 0.123456789 s
bench-comm: num 8 avg 0.123456789 s min 0.123456789 s max 0.123456789 s

This is a test in which messages are passed via mpi and a mesh is updated with physics running via sequential cpu execution using memory allocated in host memory. The buffers used for large messages are packed/unpacked via sequential cpu execution and allocated in host memory, and the buffers used with MPI for small messages are packed/unpacked via sequential cpu execution and allocated in host memory. This test involves multiple measurements; the first six time individual parts of the physics cycle and communication.

pre-comm "Physics" before point-to-point communication, in this case setting memory to initial values.
post-recv Allocating MPI receive buffers and calling MPI_Irecv.
post-send Allocating MPI send buffers, packing buffers, and calling MPI_Isend.
wait-recv Waiting to receive MPI messages, unpacking MPI buffers, and freeing MPI receive buffers.
wait-send Waiting for MPI send messages to complete and freeing MPI send buffers.
post-comm "Physics" after point-to-point communication, in this case resetting memory to initial values.

The final three measure problem setup, correctness testing, and total benchmark time.

start-up Setting up mesh and point-to-point communication.
test-comm Testing correctness of point-to-point communication.
bench-comm Running benchmark, starts after an initial MPI_Barrier and ends after a final MPI_Barrier.
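
When comparing many configurations it can help to reduce the output to one line per measurement. The sketch below is one possible awk filter that reads Comb output on stdin and prints the test name, the phase, and the average time, separated by tabs. The helper name is hypothetical, it accepts test name lines with or without the "Starting test" prefix, and the sample input uses invented times.

```shell
#!/bin/sh
# Sketch: print "test name <TAB> phase <TAB> avg seconds" for each timing line.
# comb_phase_avgs is a hypothetical helper, not one of the Comb scripts.
comb_phase_avgs() {
  awk '
    # Remember the most recent test name, with or without the prefix.
    /^Starting test / { test = substr($0, 15); next }
    /^Comm /          { test = $0; next }
    # Timing lines have the form: <phase>: num N avg T s min T s max T s
    test != "" && $2 == "num" && $4 == "avg" {
      phase = $1
      sub(/:$/, "", phase)
      printf "%s\t%s\t%s\n", test, phase, $5
    }
  '
}

# Demo with invented times; real input would be a summary or proc file.
comb_phase_avgs <<'EOF'
Comm mpi Mesh seq Host Buffers seq Host seq Host
pre-comm:  num 200 avg 0.5 s min 0.4 s max 0.6 s
post-recv: num 200 avg 0.25 s min 0.2 s max 0.3 s
EOF
```

The same filter also picks up the memcpy tests, since their timing lines share the num/avg/min/max layout.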

Execution Policies

seq Sequential CPU execution
omp Parallel CPU execution via OpenMP
cuda Parallel GPU execution via cuda
cudaGraph Parallel GPU execution via cuda graphs
hip Parallel GPU execution via hip
raja_seq RAJA Sequential CPU execution
raja_omp RAJA Parallel CPU execution via OpenMP
raja_cuda RAJA Parallel GPU execution via cuda
raja_hip RAJA Parallel GPU execution via hip
mpi_type Packing or unpacking execution done via mpi datatypes used with MPI_Pack/MPI_Unpack

Note: The cudaGraph exec policy updates the graph each cycle. There is currently no option to use the same graph for every cycle.

Memory Spaces

Host CPU memory (malloc)
HostPinned Cuda/Hip Pinned CPU memory pool (cudaHostAlloc/hipMallocHost)
Device Cuda/Hip GPU memory pool (cudaMalloc/hipMalloc)
Managed Cuda/Hip Managed GPU memory pool (cudaMallocManaged/hipMallocManaged)
ManagedHostPreferred Cuda Managed memory pool with host preferred advice (cudaMallocManaged + cudaMemAdviseSetPreferredLocation cudaCpuDeviceId)
ManagedHostPreferredDeviceAccessed Cuda Managed memory pool with host preferred and device accessed advice (cudaMallocManaged + cudaMemAdviseSetPreferredLocation cudaCpuDeviceId + cudaMemAdviseSetAccessedBy 0)
ManagedDevicePreferred Cuda Managed memory pool with device preferred advice (cudaMallocManaged + cudaMemAdviseSetPreferredLocation 0)
ManagedDevicePreferredHostAccessed Cuda Managed memory pool with device preferred and host accessed advice (cudaMallocManaged + cudaMemAdviseSetPreferredLocation 0 + cudaMemAdviseSetAccessedBy cudaCpuDeviceId)

Note: Some memory spaces are pooled. This is done to amortize the cost of allocation. After the first allocation, the cost of allocating memory should be trivial for pooled memory spaces. The first allocation is done in a warmup step and is not included in any timers.

Related Software

The RAJA Performance Suite contains a collection of loop kernels implemented in multiple RAJA and non-RAJA variants. We use it to monitor and assess RAJA performance on different platforms using a variety of compilers.

The RAJA Proxies repository contains RAJA versions of several important HPC proxy applications.

Contributions

The Comb team follows the GitFlow development model. Folks wishing to contribute to Comb should include their work in a feature branch created from the Comb develop branch, which contains the latest work in Comb. Then, create a pull request with the develop branch as the destination. Periodically, we will merge the develop branch into the master branch and tag a new release.

Authors

Thanks to all of Comb's contributors.

Comb was created by Jason Burmark (burmark1@llnl.gov).

Release

Comb is released under an MIT license. For more details, please see the LICENSE, RELEASE, and NOTICE files.

LLNL-CODE-758885
