cms-patatrack / pixeltrack-standalone

Standalone Patatrack pixel tracking
Apache License 2.0

Introduction

The purpose of this package is to explore various performance portability solutions with the Patatrack pixel tracking application. The version here corresponds to CMSSW_11_2_0_pre8_Patatrack.

The application is designed to require minimal dependencies on the system. All programs require

In addition, the individual programs assume the following to be found on the system

| Application | CMake (>= 3.16) | CUDA 11.2 | ROCm 5.0 | Intel oneAPI Base Toolkit |
| --- | --- | --- | --- | --- |
| cudatest | | :heavy_check_mark: | | |
| cuda | | :heavy_check_mark: | | |
| cudadev | | :heavy_check_mark: | | |
| cudauvm | | :heavy_check_mark: | | |
| cudacompat | | :heavy_check_mark: | | |
| hiptest | | | :heavy_check_mark: | |
| hip | | | :heavy_check_mark: | |
| kokkostest | :heavy_check_mark: | :white_check_mark: (1) | :white_check_mark: (2) | |
| kokkos | :heavy_check_mark: | :white_check_mark: (1) | :white_check_mark: (2) | |
| alpakatest | | :white_check_mark: (3) | :white_check_mark: (4) | |
| alpaka | | :white_check_mark: (3) | :white_check_mark: (4) | |
| sycltest | | | | :heavy_check_mark: |
| sycl | | (5) | (6) | :heavy_check_mark: (7) |
| stdpar | | :heavy_check_mark: | | |
  1. kokkos and kokkostest have an optional dependence on CUDA, by default it is required (see kokkos and kokkostest for more details)
  2. kokkos and kokkostest have an optional dependence on ROCm, by default it is not required (see kokkos and kokkostest for more details)
  3. alpaka and alpakatest have an optional dependence on CUDA, by default it is required (see alpaka and alpakatest for more details)
  4. alpaka and alpakatest have an optional dependence on ROCm, by default it is not required (see alpaka and alpakatest for more details)
  5. sycl has an optional dependence on CUDA, by default it is not required (see sycl and sycltest for more details)
  6. sycl has an optional dependence on ROCm, by default it is not required (see sycl and sycltest for more details)
  7. As an alternative, the open source llvm compiler can be used (see sycl and sycltest for more details)

All other dependencies (listed below) are downloaded and built automatically
Application TBB Eigen Kokkos Boost (1) Alpaka libbacktrace hwloc
fwtest :heavy_check_mark:
serial :heavy_check_mark: :heavy_check_mark: :heavy_check_mark: :heavy_check_mark:
cudatest :heavy_check_mark: :heavy_check_mark: :heavy_check_mark:
cuda :heavy_check_mark: :heavy_check_mark: :heavy_check_mark: :heavy_check_mark:
cudadev :heavy_check_mark: :heavy_check_mark: :heavy_check_mark: :heavy_check_mark:
cudauvm :heavy_check_mark: :heavy_check_mark: :heavy_check_mark: :heavy_check_mark:
cudacompat :heavy_check_mark: :heavy_check_mark: :heavy_check_mark: :heavy_check_mark:
hiptest :heavy_check_mark: :heavy_check_mark: :heavy_check_mark:
hip :heavy_check_mark: :heavy_check_mark: :heavy_check_mark: :heavy_check_mark:
kokkostest :heavy_check_mark: :heavy_check_mark: :heavy_check_mark: :heavy_check_mark: :heavy_check_mark: (2)
kokkos :heavy_check_mark: :heavy_check_mark: :heavy_check_mark: :heavy_check_mark: (2)
alpakatest :heavy_check_mark: :heavy_check_mark: :heavy_check_mark:
alpaka :heavy_check_mark: :heavy_check_mark: :heavy_check_mark:
sycltest :heavy_check_mark:
sycl :heavy_check_mark: (3) :heavy_check_mark: :heavy_check_mark: :heavy_check_mark:
stdpar :heavy_check_mark: :heavy_check_mark: :heavy_check_mark: :heavy_check_mark:
  1. Boost libraries from the system can also be used, but they need to be version 1.73.0 or newer
  2. kokkos and kokkostest have an optional dependence on hwloc, by default it is not required (see kokkos and kokkostest for more details)
  3. When oneAPI is used, TBB is taken from the oneAPI installation instead of being cloned into the external directory

The input data set consists of a minimal binary dump of 1000 ttbar+PU events from the /TTToHadronic_TuneCP5_13TeV-powheg-pythia8/RunIIAutumn18DR-PUAvg50IdealConditions_IdealConditions_102X_upgrade2018_design_v9_ext1-v2/FEVTDEBUGHLT dataset from the CMS Open Data. The data are downloaded automatically during the build process.

Newer GCC versions

RHEL 7.x / CentOS 7.x use GCC 4.8 as their system compiler. More recent versions can be used from the "Developer Toolset" software collections:

# list available software collections
$ scl -l
devtoolset-9

# load the GCC 9.x environment
$ source scl_source enable devtoolset-9
$ gcc --version
gcc (GCC) 9.3.1 20200408 (Red Hat 9.3.1-2)
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Various versions of GCC are also available from the SFT CVMFS area, for example:

$ source /cvmfs/sft.cern.ch/lcg/contrib/gcc/8.3.0/x86_64-centos7/setup.sh
$ gcc --version
gcc (GCC) 8.3.0
Copyright (C) 2018 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

RHEL 8.x / CentOS 8.x use GCC 8 as their system compiler.

Status

Application Description Framework Device framework Test code Raw2Cluster RecHit Pixel tracking Vertex Transfers to CPU Validation code Validated
fwtest Framework test :heavy_check_mark: :heavy_check_mark:
serial CPU version (via cudaCompat) :heavy_check_mark: :heavy_check_mark: :heavy_check_mark: :heavy_check_mark: :heavy_check_mark: :heavy_check_mark: :heavy_check_mark: :heavy_check_mark:
cudatest CUDA FW test :heavy_check_mark: :heavy_check_mark: :heavy_check_mark:
cuda CUDA version (frozen) :heavy_check_mark: :heavy_check_mark: :heavy_check_mark: :heavy_check_mark: :heavy_check_mark: :heavy_check_mark: :heavy_check_mark: :heavy_check_mark: :heavy_check_mark:
cudadev CUDA version (development) :heavy_check_mark: :heavy_check_mark: :heavy_check_mark: :heavy_check_mark: :heavy_check_mark: :heavy_check_mark: :heavy_check_mark: :heavy_check_mark: :heavy_check_mark:
cudauvm CUDA version with managed memory :heavy_check_mark: :heavy_check_mark: :heavy_check_mark: :white_check_mark: :white_check_mark: :white_check_mark: :white_check_mark: :heavy_check_mark: :heavy_check_mark:
cudacompat cudaCompat version :heavy_check_mark: :heavy_check_mark: :white_check_mark: :white_check_mark: :white_check_mark: :white_check_mark: :white_check_mark: :white_check_mark: :heavy_check_mark:
hiptest HIP FW test :heavy_check_mark: :heavy_check_mark: :heavy_check_mark:
hip HIP version :heavy_check_mark: :heavy_check_mark: :white_check_mark: :white_check_mark: :white_check_mark: :white_check_mark: :white_check_mark: :white_check_mark:
kokkostest Kokkos FW test :heavy_check_mark: :heavy_check_mark: :heavy_check_mark:
kokkos Kokkos version :heavy_check_mark: :heavy_check_mark: :heavy_check_mark: :heavy_check_mark: :heavy_check_mark: :heavy_check_mark: :heavy_check_mark: :heavy_check_mark: :heavy_check_mark:
alpakatest Alpaka FW test :heavy_check_mark: :white_check_mark:
alpaka Alpaka version :white_check_mark: :white_check_mark:
sycltest SYCL/oneAPI FW test :heavy_check_mark: :heavy_check_mark: :heavy_check_mark:
sycl SYCL/oneAPI version :white_check_mark: :white_check_mark: :white_check_mark: :white_check_mark: :white_check_mark: :white_check_mark: :white_check_mark: :white_check_mark: :white_check_mark: :white_check_mark:
stdpar std::execution::par version :white_check_mark: :white_check_mark: :white_check_mark: :white_check_mark: :white_check_mark: :white_check_mark: :white_check_mark: :white_check_mark: :white_check_mark: :white_check_mark:

The "Device framework" refers to a mechanism similar to cms::cuda::Product and cms::cuda::ScopedContext that supports chains of modules using the same device and the same work queue.

The column "Validated" means that the program produces the same histograms as the reference cuda program within numerical precision (judged "by eye").

Quick recipe

# Build application using all available CPUs
$ make -j`nproc` cuda

# For CUDA installations elsewhere than /usr/local/cuda
$ make -j`nproc` cuda CUDA_BASE=/path/to/cuda

# Source environment
$ source env.sh

# Process 1000 events in 1 thread
$ ./cuda

# Command line arguments
$ ./cuda -h
./cuda: [--numberOfThreads NT] [--numberOfStreams NS] [--maxEvents ME] [--data PATH] [--transfer] [--validation] [--empty]

Options
 --numberOfThreads   Number of threads to use (default 1)
 --numberOfStreams   Number of concurrent events (default 0=numberOfThreads)
 --maxEvents         Number of events to process (default -1 for all events in the input file)
 --data              Path to the 'data' directory (default 'data' in the directory of the executable)
 --transfer          Transfer results from GPU to CPU (default is to leave them on GPU)
 --validation        Run (rudimentary) validation at the end (implies --transfer)
 --empty             Ignore all producers (for testing only)
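The options can be combined; for example, the following (illustrative) invocation processes 100 events using 4 threads and runs the validation at the end:

```
$ ./cuda --numberOfThreads 4 --maxEvents 100 --validation
```

Since --validation implies --transfer, the results are also copied back to the CPU.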

Additional make targets

Note that the contents of the all, test, and all test_<arch> targets are filtered based on the availability of compilers and toolchains.

| Target | Description |
| --- | --- |
| all (default) | Build all programs |
| print_targets | Print the programs that would be built with all |
| test | Run all tests |
| test_cpu | Run tests that use only CPU |
| test_nvidiagpu | Run tests that require NVIDIA GPU |
| test_amdgpu | Run tests that require AMD GPU |
| test_intelgpu | Run tests that require Intel GPU |
| test_auto | Run tests that auto-discover the available hardware |
| test_<program> | Run tests for program <program> |
| test_<program>_<arch> | Run tests for program <program> that require <arch> |
| format | Format the code with clang-format |
| clean | Remove all build artifacts |
| distclean | clean and remove all externals |
| dataclean | Remove downloaded data files |
| external_kokkos_clean | Remove Kokkos build and installation directory |
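As an illustrative workflow combining these targets, one might first check which programs the availability filtering would build, then build them and run only the CPU tests:

```
# Print the programs that would be built with all on this machine
$ make print_targets

# Build everything the available toolchains support
$ make -j`nproc` all

# Run only the tests that do not require a GPU
$ make test_cpu
```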

Test program specific notes (if any)

fwtest

The printouts can be disabled at compile time with

make fwtest ... USER_CXXFLAGS="-DFWTEST_SILENT"

serial

This program is a fork of cudacompat with all dependencies on CUDA removed, in order to be a "pure CPU" version. Note that the name refers to (the absence of) intra-algorithm parallelization and is thus comparable to the Serial backend of Alpaka or Kokkos. The event-level parallelism is implemented as in fwtest.

cudatest

The use of the caching allocator can be disabled at compile time by setting the CUDATEST_DISABLE_CACHING_ALLOCATOR preprocessor symbol:

make cudatest ... USER_CXXFLAGS="-DCUDATEST_DISABLE_CACHING_ALLOCATOR"

If the caching allocator is disabled and CUDA version 11.2 or greater is detected, device allocations and deallocations use the stream-ordered CUDA functions cudaMallocAsync and cudaFreeAsync. Their use can be disabled explicitly at compile time by also setting the CUDATEST_DISABLE_ASYNC_ALLOCATOR preprocessor symbol:

make cudatest ... USER_CXXFLAGS="-DCUDATEST_DISABLE_CACHING_ALLOCATOR -DCUDATEST_DISABLE_ASYNC_ALLOCATOR"

cuda

This program is frozen to correspond to CMSSW_11_2_0_pre8_Patatrack.

The location of the CUDA 11 libraries can be set with the CUDA_BASE variable.

The use of the caching allocator can be disabled at compile time by setting the CUDA_DISABLE_CACHING_ALLOCATOR preprocessor symbol:

make cuda ... USER_CXXFLAGS="-DCUDA_DISABLE_CACHING_ALLOCATOR"

If the caching allocator is disabled and CUDA version 11.2 or greater is detected, device allocations and deallocations use the stream-ordered CUDA functions cudaMallocAsync and cudaFreeAsync. Their use can be disabled explicitly at compile time by also setting the CUDA_DISABLE_ASYNC_ALLOCATOR preprocessor symbol:

make cuda ... USER_CXXFLAGS="-DCUDA_DISABLE_CACHING_ALLOCATOR -DCUDA_DISABLE_ASYNC_ALLOCATOR"

cudadev

This program corresponds to the updated version of the pixel tracking software integrated in CMSSW_12_0_0_pre3.

The use of the caching allocator can be disabled at compile time by setting the CUDADEV_DISABLE_CACHING_ALLOCATOR preprocessor symbol:

make cudadev ... USER_CXXFLAGS="-DCUDADEV_DISABLE_CACHING_ALLOCATOR"

If the caching allocator is disabled and CUDA version 11.2 or greater is detected, device allocations and deallocations use the stream-ordered CUDA functions cudaMallocAsync and cudaFreeAsync. Their use can be disabled explicitly at compile time by also setting the CUDADEV_DISABLE_ASYNC_ALLOCATOR preprocessor symbol:

make cudadev ... USER_CXXFLAGS="-DCUDADEV_DISABLE_CACHING_ALLOCATOR -DCUDADEV_DISABLE_ASYNC_ALLOCATOR"

cudauvm

The purpose of this program is to test the performance of CUDA managed memory. There are various macros that can be used to switch various behaviors on and off. The default behavior is to use managed memory only for those memory blocks that are used for memory transfers, to call cudaMemPrefetchAsync(), and to call cudaMemAdvise(cudaMemAdviseSetReadMostly). The macros can be set at compile time along the lines of

make cudauvm ... USER_CXXFLAGS="-DCUDAUVM_DISABLE_ADVISE"

| Macro | Effect |
| --- | --- |
| -DCUDAUVM_DISABLE_ADVISE | Disable cudaMemAdvise(cudaMemAdviseSetReadMostly) |
| -DCUDAUVM_DISABLE_PREFETCH | Disable cudaMemPrefetchAsync |
| -DCUDAUVM_DISABLE_CACHING_ALLOCATOR | Disable caching allocator, use cudaMallocAsync |
| -DCUDAUVM_DISABLE_ASYNC_ALLOCATOR | Disable cudaMallocAsync, use cudaMalloc |
| -DCUDAUVM_MANAGED_TEMPORARY | Use managed memory also for temporary data structures |
| -DCUDAUVM_DISABLE_MANAGED_BEAMSPOT | Disable managed memory in BeamSpotToCUDA |
| -DCUDAUVM_DISABLE_MANAGED_CLUSTERING | Disable managed memory in SiPixelRawToClusterCUDA |
| -DCUDAUVM_DISABLE_MANAGED_RECHIT | Disable managed memory in SiPixelRecHitCUDA |
| -DCUDAUVM_DISABLE_MANAGED_TRACK | Disable managed memory in CAHitNtupletCUDA |
| -DCUDAUVM_DISABLE_MANAGED_VERTEX | Disable managed memory in PixelVertexProducerCUDA |
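Several of these macros can be combined in a single build; for example, to disable both the prefetching and the advising (an illustrative combination):

```
make cudauvm ... USER_CXXFLAGS="-DCUDAUVM_DISABLE_PREFETCH -DCUDAUVM_DISABLE_ADVISE"
```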

To use managed memory also for temporary device-only allocations, compile with

make cudauvm ... USER_CXXFLAGS="-DCUDAUVM_MANAGED_TEMPORARY"

cudacompat

This program is a fork of cuda that extends the use of cudaCompat to clustering and RecHits. The aim is to run the same code on CPU. Currently, however, the program requires a GPU because it (still) uses pinned host memory in a few places. In the future the program could be extended to provide both CUDA and CPU flavors.

The program contains the changes from the following external PRs on top of cuda

hip and hiptest

hip and hiptest are ports of the cuda and cudatest programs to HIP, built for the AMD ROCm backend.

The path to ROCm can be set with the ROCM_BASE variable.

The use of the caching allocator can be disabled at compile time by setting the HIP_DISABLE_CACHING_ALLOCATOR preprocessor symbol:

make hip ... USER_CXXFLAGS="-DHIP_DISABLE_CACHING_ALLOCATOR"

If the caching allocator is disabled and HIP version 5.2.0 or greater is detected, device allocations and deallocations use the stream-ordered HIP functions hipMallocAsync and hipFreeAsync. Their use can be disabled explicitly at compile time by also setting the HIP_DISABLE_ASYNC_ALLOCATOR preprocessor symbol:

make hip ... USER_CXXFLAGS="-DHIP_DISABLE_CACHING_ALLOCATOR -DHIP_DISABLE_ASYNC_ALLOCATOR"

kokkos and kokkostest

# If nvcc is not in $PATH, create environment file and source it
$ make environment [CUDA_BASE=...]
$ source env.sh

# Actual build command
$ make -j N kokkos [CUDA_BASE=...] [KOKKOS_CUDA_ARCH=...] [...]
$ ./kokkos --cuda

# If changing KOKKOS_HOST_PARALLEL or KOKKOS_DEVICE_PARALLEL, clean up existing build first
$ make clean external_kokkos_clean
$ make kokkos ...

| Make variable | Description |
| --- | --- |
| CMAKE | Path to CMake executable (by default assume cmake is found in $PATH) |
| KOKKOS_HOST_PARALLEL | Host-parallel backend (default empty, possible values: empty, PTHREAD) |
| KOKKOS_DEVICE_PARALLEL | Device-parallel backend (default CUDA, possible values: empty, CUDA, HIP) |
| CUDA_BASE | Path to CUDA installation. Relevant only if KOKKOS_DEVICE_PARALLEL=CUDA. |
| KOKKOS_CUDA_ARCH | Target CUDA architecture for Kokkos build (default: 70, possible values: 50, 70, 75; trivial to extend). Relevant only if KOKKOS_DEVICE_PARALLEL=CUDA. |
| ROCM_BASE | Path to ROCm installation. Relevant only if KOKKOS_DEVICE_PARALLEL=HIP. |
| KOKKOS_HIP_ARCH | Target AMD GPU architecture for Kokkos build (default: VEGA900, possible values: VEGA900, VEGA909; trivial to extend). Relevant only if KOKKOS_DEVICE_PARALLEL=HIP. |
| KOKKOS_KOKKOS_PTHREAD_DISABLE_HWLOC | If defined, do not use hwloc. Relevant only if KOKKOS_HOST_PARALLEL=PTHREAD. |

| Macro | Effect |
| --- | --- |
| -DKOKKOS_SERIALONLY_DISABLE_ATOMICS | Disable Kokkos (real) atomics, can be used with Serial-only build |
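Putting the variables together, a HIP/ROCm build could look like the following sketch (the /opt/rocm path is an assumption for illustration):

```
$ make -j N kokkos KOKKOS_DEVICE_PARALLEL=HIP ROCM_BASE=/opt/rocm KOKKOS_HIP_ARCH=VEGA900
```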

alpaka and alpakatest

Supported backends

The alpaka code base is loosely based on the cuda code base, with some minor changes introduced during the porting.

The alpaka and alpakatest programs always support the CPU backends (serial synchronous and oneTBB asynchronous). They can be built with either the CUDA backend or the HIP/ROCm backend, with

make alpaka ... CUDA_BASE=path_to_cuda ROCM_BASE=

or

make alpaka ... CUDA_BASE= ROCM_BASE=path_to_rocm

Due to conflicting symbols in the two backends and in Alpaka itself, enabling both backends at the same time results in compilation errors or undefined behaviour.

Memory allocation strategy

The use of the caching allocator can be disabled at compile time by setting the ALPAKA_DISABLE_CACHING_ALLOCATOR preprocessor symbol:

make alpaka ... USER_CXXFLAGS="-DALPAKA_DISABLE_CACHING_ALLOCATOR"

If the caching allocator is disabled and CUDA version 11.2 or greater is detected, device allocations and deallocations use the stream-ordered CUDA functions cudaMallocAsync and cudaFreeAsync. Their use can be disabled explicitly at compile time by also setting the ALPAKA_DISABLE_ASYNC_ALLOCATOR preprocessor symbol:

make alpaka ... USER_CXXFLAGS="-DALPAKA_DISABLE_CACHING_ALLOCATOR -DALPAKA_DISABLE_ASYNC_ALLOCATOR"

sycl and sycltest

Compiler

To compile sycl and sycltest there are a few choices of compiler:

The default installation of Intel oneAPI supports only x86 CPUs (using the Intel OpenCL runtime) and Intel GPUs (using the Intel OpenCL and Level Zero back-ends). For these targets the recommended compiler is icpx; it can be selected in the Makefile by setting SYCL_USE_INTEL_ONEAPI to any non-empty value and SYCL_CXX to $(SYCL_BASE)/bin/icpx.

The plugins to support NVIDIA and AMD GPUs can be downloaded separately from the Codeplay web site, and installed on top of the corresponding oneAPI installation. When targeting NVIDIA or AMD GPUs it is recommended to use the clang++ compiler instead of icpx; it can be selected in the Makefile by setting SYCL_USE_INTEL_ONEAPI to any non-empty value and SYCL_CXX to $(SYCL_BASE)/bin-llvm/clang++.

The open source Intel LLVM compiler can be built with support for x86 CPUs, Intel GPUs, NVIDIA GPUs and AMD GPUs. It can be selected in the Makefile by leaving SYCL_USE_INTEL_ONEAPI undefined or empty, and setting SYCL_CXX to $(SYCL_BASE)/bin/clang++.
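As a sketch, selecting the Codeplay clang++ described above could then look like the following (assuming SYCL_BASE points to the oneAPI installation; the quoting keeps the expansion to make):

```
make sycl SYCL_USE_INTEL_ONEAPI=1 SYCL_CXX='$(SYCL_BASE)/bin-llvm/clang++'
```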

All compilers should support multiple targets at the same time, e.g. x86 CPUs and different Intel GPUs. In practice this does not seem to work consistently, so it is recommended to enable a single back-end at a time, setting only one of the JIT_TARGETS or AOT_..._TARGETS variables in the Makefile.

To help testing different back-ends, the sycl and sycltest targets support being built with arbitrary names, using the syntax

make sycl TARGET_NAME=sycl_cpu

This affects

Device choice

The device can be chosen at runtime with the --device argument:

Memory allocation strategy

The use of the caching allocator can be disabled at compile time by setting the SYCL_DISABLE_CACHING_ALLOCATOR preprocessor symbol:

make sycl ... USER_CXXFLAGS="-DSYCL_DISABLE_CACHING_ALLOCATOR"

The queue-ordered memory allocations are not available in SYCL.

stdpar

The stdpar program is cloned from cudauvm and is currently intended to experiment with the use of NVIDIA's implementation of std::execution::par with nvc++, in conjunction with direct CUDA code.

The stdpar implementation requires a C++20 implementation of the C++ standard library (atomic_ref, ranges). It has only been tested with the GCC 11.2.0 implementation, libstdc++.

As it is work in progress and contains CUDA kernels, it currently only supports nvc++. Other compilers will eventually be supported once the kernels have been ported to their stdpar equivalents.

The stdpar implementation only supports a single GPU. A multi-GPU implementation would require either multiple processes or the use of vendor-specific APIs.
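Since only one GPU is used, on a multi-GPU machine the device can be restricted with the standard CUDA environment variable, e.g. (illustrative):

```
$ CUDA_VISIBLE_DEVICES=0 ./stdpar
```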

Code structure

The project is split into several programs, one (or more) for each test case. Each test case has its own directory under the src directory. A test case contains the full application: framework, data formats, device tooling, plugins for the algorithmic modules run by the framework, and the executable.

Each test program is structured as follows within src/<program name> (examples point to cuda)
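As an illustration only (the names below are a hypothetical sketch based on the description above, not an exact listing), a test program directory could look like:

```
src/<program name>/
  Makefile       # program-specific part of the build
  bin/           # main() of the executable
  Framework/     # the framework
  DataFormats/   # data formats
  CUDACore/      # device tooling (in the case of cuda)
  plugin-*/      # plugins for the algorithmic modules
```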

For more detailed description of the application structure (mostly plugins) see CodeStructure.md

Build system

The build system is based on pure GNU Make. There are two levels of Makefiles. The top-level Makefile handles the building of the entire project: it defines general build flags, paths to external dependencies in the system, recipes to download and build the externals, and targets for the test programs.

For more information see BuildSystem.md.

Contribution guide

Given that the approach of this project is to maintain many programs in a single branch, in order to keep the commit history readable, each commit should contain changes only for one test program, and the short commit message should start with the program name, e.g. [cuda]. A pull request may touch many test programs. General commits (e.g. top-level Makefile or documentation) can be left without such a prefix.

When starting work for a new portability technology, the first steps are to figure out the installation of the necessary external software packages and the build rules (both can be adjusted later). It is probably best to start by cloning the fwtest code for the new program (e.g. footest for a technology foo), adjust the test modules to exercise the API of the technology (see cudatest for examples), and start crafting the tools package (CUDACore in cuda).

Pull requests are expected to build (make all succeeds) and pass tests (make test). Programs that have build errors should primarily be filtered out from $(TARGETS), and failing tests should primarily be removed from the set of tests run by default. Breakages can, however, be accepted for short periods of time with a good justification.

The code is formatted with clang-format version 10.