ANL-CESAR / RSBench

A mini-app to represent the multipole resonance representation lookup cross section algorithm.
MIT License

RSBench

Published in Proceedings of EASC 2014

RSBench is a mini-app representing a key computational kernel of the Monte Carlo neutron transport algorithm. Specifically, RSBench represents the multipole method of performing continuous energy macroscopic neutron cross section lookups. The multipole method is a recently developed strategy for building microscopic cross section data "on-the-fly" that requires orders of magnitude less memory than traditional methods (e.g., those represented in XSBench). RSBench serves as a useful performance stand-in for full neutron transport applications like OpenMC that support multipole cross section representations.

Table of Contents

  1. Selecting a Programming Model
  2. Compilation
  3. Running RSBench / Command Line Interface
  4. Verification Support
  5. Theory & Algorithms
  6. Optimized Kernels
  7. Citing RSBench
  8. Development Team

Selecting A Programming Model

RSBench has been implemented in multiple different languages to target a variety of computational architectures and accelerators. The available implementations can be found in their own directories:

  1. RSBench/openmp-threading This is the "default" version of RSBench that is appropriate for serial and multicore CPU architectures. The method of parallelism is via the OpenMP threading model.

  2. RSBench/openmp-offload This method of parallelism uses OpenMP 4.5 (or newer) to map program data to a remote accelerator memory space and run targeted kernels on the accelerator. This method of parallelism could be used for a wide variety of architectures (besides CPUs) that support OpenMP 4.5 targeting. NOTE: The Makefile will likely not work by default and will need to be adjusted to utilize your OpenMP accelerator compiler.

  3. RSBench/cuda This version of RSBench is written in CUDA for use with NVIDIA GPU architectures. NOTE: You will likely want to specify in the makefile the SM version for the card you are running on.

  4. RSBench/opencl This version of RSBench is written in OpenCL, and can be used for CPU, GPU, FPGA, or other architectures that support OpenCL. It was written with GPUs in mind, so if running on other architectures you may need to heavily re-optimize the code. You will also likely need to edit the makefile to supply the path to your OpenCL compiler.

  5. RSBench/sycl This version of RSBench is written in SYCL, and can be used for CPU, GPU, FPGA, or other architectures that support OpenCL and SYCL. It was written with GPUs in mind, so if running on other architectures you may need to heavily re-optimize the code. You will also likely need to edit the makefile to supply the path to your SYCL compiler.

  6. RSBench/hip This version of RSBench is written in HIP for use with GPU architectures. This version is derived from CUDA using an automatic conversion tool with only a few small manual changes.

Compilation

To compile RSBench with default settings, navigate to your selected source directory and use the following command:

make

You can alter compiler settings in the included Makefile. Alternatively, for the OpenMP threading version of RSBench you may specify a compiler via the CC environment variable and then run make as normal, e.g.:

export CC=clang
make

Debugging, Optimization & Profiling

There are also a number of switches that can be set in the makefile. Here is a sample of the control panel at the top of the makefile:

OPTIMIZE = yes
DEBUG    = no
PROFILE  = no

Running RSBench

To run RSBench with default settings, use the following command:

./RSBench

For non-default settings, RSBench supports the following command line options:

| Argument | Description | Options | Default |
|----------|-------------|---------|---------|
| -t | # of OpenMP threads | integer value | System Default |
| -m | Simulation method | history, event | history |
| -s | Problem Size | small, large | large |
| -p | # of particle histories (if running using "history" method) | integer value | 500,000 |
| -l | # of Cross-section (XS) lookups. If using the history-based method, this is lookups per particle history. If using the event-based method, this is total lookups. | integer value | (History: 34) (Event: 17,000,000) |
| -p | # of avg poles per nuclide | integer value | 1,000 |
| -w | # of windows per nuclide | integer value | 100 |
| -d | Flag to disable Doppler broadening | | |
| -k | Optimized kernel ID | integer value | 0 |

Verification Support

Legacy versions of RSBench had a special "Verification" compiler flag to enable verification of the results. A much more performant and portable verification scheme has since been developed and is now used for all configurations, so there is no separate verification build: verification is always enabled. RSBench generates a hash of the results at the end of the simulation and displays it with the other output once the code has finished executing. This hash can then be compared against the hashes generated by other versions or configurations of the code. For instance, running RSBench with 4 threads vs 8 threads (on a machine that supports that configuration) should generate the same hash, and running on GPU vs CPU should not change the hash. Changing the type of lookup performed (e.g., nuclide, unionized, or hash) should also result in the same hash being generated. However, changing the model or run parameters (e.g., increasing the number of particles, number of gridpoints, etc.) is expected to generate a totally different hash, and changing the simulation mode (history or event) will also generate different hashes.
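One way to obtain a hash that does not depend on thread count is to give every lookup a value that depends only on its index and to combine those values with a commutative integer operation. The sketch below illustrates that idea only; it is an assumption made for this example and is not taken from the RSBench source. The names `mix64` and `verification_hash` are hypothetical, and the "chunks" merely mimic the way work would be divided among threads.

```c
#include <stdint.h>

/* Hypothetical stand-in for the result of one lookup. It depends only on
 * the lookup index, never on which thread performed the lookup.
 * The mixing steps follow the well-known splitmix64 finalizer. */
static uint64_t mix64(uint64_t x)
{
    x += 0x9E3779B97F4A7C15ULL;
    x = (x ^ (x >> 30)) * 0xBF58476D1CE4E5B9ULL;
    x = (x ^ (x >> 27)) * 0x94D049BB133111EBULL;
    return x ^ (x >> 31);
}

/* Sum the per-lookup values over [0, n_lookups), with the range divided
 * into n_chunks contiguous pieces (one per imaginary thread).
 * Unsigned addition wraps modulo 2^64 and is commutative and associative,
 * so the total is the same for any chunking. */
uint64_t verification_hash(uint64_t n_lookups, unsigned n_chunks)
{
    uint64_t total = 0;
    for (unsigned c = 0; c < n_chunks; c++) {
        uint64_t begin = n_lookups * c / n_chunks;
        uint64_t end   = n_lookups * (c + 1) / n_chunks;
        uint64_t partial = 0;
        for (uint64_t i = begin; i < end; i++)
            partial += mix64(i);
        total += partial;   /* order of accumulation does not matter */
    }
    return total;
}
```

Calling `verification_hash(1000, 4)` and `verification_hash(1000, 8)` returns the same value, which mirrors the 4-thread vs 8-thread comparison described above.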

Theory & Algorithms

Transport Simulation Styles

History-Based Transport

The default simulation model used in RSBench is the "history-based" model. In this model, parallelism is expressed over independent particle histories, with each particle being simulated in a serial fashion from birth to death:

for each particle do           // Independent
    while particle is alive do // Dependent
        Move particle to collision site
        Process particle collision

This method of parallelism is very memory efficient, as the total number of particles that must be kept in memory at once is equivalent to the total number of active threads being run in the simulation. However, as there are many different types of collision events, the history-based model means that there is no natural SIMD style parallelism available for work happening between different threads.
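The pseudocode above can be sketched in C with OpenMP as follows. This is a toy model under stated assumptions, not the RSBench kernel: the `Particle` struct, the `next_random` generator, and the 10% absorption probability are all hypothetical placeholders. It does show why memory use is low: each thread holds exactly one `Particle` on its stack at a time.

```c
#include <stdint.h>

/* Hypothetical particle state, for illustration only. */
typedef struct {
    uint64_t rng_state;   /* private random number stream */
    int alive;
    int n_collisions;
} Particle;

/* Toy 64-bit linear congruential generator returning a value in [0, 1). */
static double next_random(uint64_t *state)
{
    *state = *state * 6364136223846793005ULL + 1442695040888963407ULL;
    return (double)(*state >> 11) / 9007199254740992.0; /* divide by 2^53 */
}

/* Simulate n_particles independent histories and return the total number
 * of collisions processed. */
long run_histories(int n_particles, int max_collisions)
{
    long total = 0;
#ifdef _OPENMP
    #pragma omp parallel for reduction(+:total)
#endif
    for (int p = 0; p < n_particles; p++) {          /* independent */
        Particle particle = { (uint64_t)p + 1, 1, 0 };
        while (particle.alive) {                      /* dependent */
            /* "Move particle to collision site" would go here. */
            /* "Process particle collision": toy 10% absorption chance. */
            particle.n_collisions++;
            if (next_random(&particle.rng_state) < 0.1 ||
                particle.n_collisions >= max_collisions)
                particle.alive = 0;
        }
        total += particle.n_collisions;
    }
    return total;
}
```

Because each particle seeds its own generator from its index, the returned total is the same with or without OpenMP enabled.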

Event-Based Transport

An alternative simulation model is the "event-based" model. In this model, parallelism is instead expressed over different collision (or "event") types. To facilitate this, all particles in the simulation are stored in memory at once. Each event kernel is executed in parallel on vectors of particles that currently require that event to be executed:

Get vector of source particles
while any particles are alive do         // Dependent
    for each living particle do          // Independent
        Move particle to collision site
    for each living particle do          // Independent
        Process particle collision
    Sort/consolidate surviving particles

This method of parallelism requires more memory and an extra stream compaction kernel to sort and organize the particles periodically to ready them for the different event kernels. The benefit of this model is that kernels can potentially be executed in a SIMD manner and with higher cache efficiency due to the potential to sort particles by material and energy. On CPU architectures, the costs of sorting and buffering particles typically outweigh the benefits of the event-based model, but on accelerator architectures the tradeoff has been found to usually be more favorable.
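A minimal serial C sketch of the event-based loop is given below. It is an illustration under assumptions rather than RSBench code: the `Particle` struct, the kernel names (`move_all`, `collide_all`, `compact`), and the toy 10% absorption probability are hypothetical. The points to notice are that the whole particle bank is allocated up front and that a compaction step packs the survivors to the front of the bank after every pair of event kernels.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical particle state, for illustration only. */
typedef struct {
    uint64_t rng_state;
    double path_length;
    int alive;
    int n_collisions;
} Particle;

/* Toy 64-bit linear congruential generator returning a value in [0, 1). */
static double next_random(uint64_t *state)
{
    *state = *state * 6364136223846793005ULL + 1442695040888963407ULL;
    return (double)(*state >> 11) / 9007199254740992.0; /* divide by 2^53 */
}

/* Event kernel 1: move every living particle (each iteration independent). */
static void move_all(Particle *bank, size_t n)
{
    for (size_t i = 0; i < n; i++)
        bank[i].path_length += next_random(&bank[i].rng_state);
}

/* Event kernel 2: process one collision per living particle. */
static void collide_all(Particle *bank, size_t n, int max_collisions)
{
    for (size_t i = 0; i < n; i++) {
        bank[i].n_collisions++;
        if (next_random(&bank[i].rng_state) < 0.1 ||
            bank[i].n_collisions >= max_collisions)
            bank[i].alive = 0;
    }
}

/* Stream compaction: pack survivors to the front, return their count. */
static size_t compact(Particle *bank, size_t n)
{
    size_t n_alive = 0;
    for (size_t i = 0; i < n; i++)
        if (bank[i].alive)
            bank[n_alive++] = bank[i];
    return n_alive;
}

/* Returns the total number of collisions, or -1 if allocation fails. */
long simulate_events(size_t n_particles, int max_collisions)
{
    if (n_particles == 0)
        return 0;
    Particle *bank = malloc(n_particles * sizeof *bank); /* all in memory */
    if (bank == NULL)
        return -1;
    for (size_t i = 0; i < n_particles; i++)
        bank[i] = (Particle){ (uint64_t)i + 1, 0.0, 1, 0 };

    long total = 0;
    size_t n_alive = n_particles;
    while (n_alive > 0) {                               /* dependent */
        move_all(bank, n_alive);
        collide_all(bank, n_alive, max_collisions);
        total += (long)n_alive;  /* one collision per living particle */
        n_alive = compact(bank, n_alive);
    }
    free(bank);
    return total;
}
```

In a real accelerator implementation each of the three loops would be a separate parallel kernel, and the compaction could also sort by material and energy.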

The Multipole Cross Section (XS) Lookup Method

RSBench represents the multipole macroscopic cross section lookup kernel. This kernel is responsible for adding together microscopic cross section data from all nuclides present in the material the neutron is travelling through, given a certain energy:

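$$\Sigma_c(E) = \sum_{n=1}^{N} \rho_n \, \sigma_{c,n}(E)$$

where $\Sigma_c(E)$ is the macroscopic cross section for reaction channel $c$ at neutron energy $E$, $N$ is the number of nuclides in the material, $\rho_n$ is the number density of nuclide $n$, and $\sigma_{c,n}(E)$ is its microscopic cross section.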

Macroscopic cross section data is typically required for multiple reaction channels "c", such as the total cross section, fission cross section, etc.
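The summation over nuclides described above can be sketched in C as a loop that accumulates density-weighted microscopic cross sections for every channel. The types and names here (`Material`, `MicroXSFn`, `macro_xs`, `N_CHANNELS`) are hypothetical and chosen for this example, and `toy_micro_xs` is a placeholder for the multipole evaluation, which is where nearly all of the real work happens.

```c
#include <stddef.h>

/* Hypothetical number of reaction channels (e.g., total, fission, ...). */
enum { N_CHANNELS = 4 };

/* Hypothetical material description: which nuclides it contains and the
 * number density of each. */
typedef struct {
    size_t n_nuclides;
    const int *nuclide_ids;
    const double *number_densities;
} Material;

/* Callback that fills micro_xs with the microscopic cross sections of one
 * nuclide at one energy. */
typedef void (*MicroXSFn)(int nuclide, double energy,
                          double micro_xs[N_CHANNELS]);

/* Placeholder microscopic data: NOT physical, just easy to check by hand. */
void toy_micro_xs(int nuclide, double energy, double micro_xs[N_CHANNELS])
{
    (void)energy;
    for (int c = 0; c < N_CHANNELS; c++)
        micro_xs[c] = (double)((nuclide + 1) * (c + 1));
}

/* Macroscopic lookup: for each channel, sum density * microscopic XS over
 * all nuclides in the material. */
void macro_xs(const Material *mat, double energy, MicroXSFn micro,
              double macro[N_CHANNELS])
{
    for (int c = 0; c < N_CHANNELS; c++)
        macro[c] = 0.0;

    for (size_t n = 0; n < mat->n_nuclides; n++) {
        double micro_xs[N_CHANNELS];
        micro(mat->nuclide_ids[n], energy, micro_xs);
        for (int c = 0; c < N_CHANNELS; c++)
            macro[c] += mat->number_densities[n] * micro_xs[c];
    }
}
```

The outer loop length equals the number of nuclides in the material, so the cost of one macroscopic lookup varies strongly from material to material.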

Historically, cross section data has been stored in pointwise format, sometimes requiring in excess of 100,000 energy level data points be stored for a single nuclide. There are a variety of methods for performing cross section lookups on traditional pointwise data, as represented in the mini-app XSBench. However, a more memory and bandwidth efficient cross section representation method has recently been developed known as the "multipole" format that models the quantum mechanical resonances (or "poles") that underlie the pointwise data. By this method, the resonances can be modeled mathematically and assembled on-the-fly while storing only a fraction of the data that is required for the traditional pointwise format. The tradeoff is that a greatly increased amount of floating point work must be performed when expanding the quantum mechanical "residues" into usable cross section data.

More information regarding the mathematics and equations used in the multipole method can be found in:

C. Josey, P. Ducru, B. Forget, K. Smith, Windowed multipole for cross section Doppler broadening, Journal of Computational Physics, Volume 307, 2016, Pages 715-727. https://doi.org/10.1016/j.jcp.2015.08.013

Faddeeva Function Evaluation

Doppler broadening of resonance data requires evaluation of the complex error function, also known as the Faddeeva function. This function is not typically available as a language supplied or standard library intrinsic, so use of the multipole method requires implementing our own or adding a library dependency. Typical libraries that implement the Faddeeva function (e.g., the MIT Faddeeva Package written by Steven Johnson) break up the phase space of the function into many different areas, with each area using its own evaluation technique. This minimizes the number of floating point operations that must be performed, but can create a lot of branching which often precludes high SIMD efficiency. An alternative lightweight formulation is available, known as the Fast Nuclear Faddeeva (FNF) Function. FNF is a much simpler source implementation that only has one possible branch, allowing it to be specialized for use in the complex phase space commonly seen in light water reactor simulations. FNF is therefore used in RSBench to minimize code complexity while providing acceptable accuracy for the use case of neutron transport.
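To give a feel for what a single, branch-free evaluation region looks like, the sketch below evaluates the standard large-|z| asymptotic series of the Faddeeva function, truncated after three terms: w(z) ≈ i/(√π z) · (1 + 1/(2z²) + 3/(4z⁴)). This is not the FNF implementation used in RSBench and is only accurate far from the origin in the upper half of the complex plane; the function name `faddeeva_asymptotic` is hypothetical.

```c
#include <complex.h>

/* Three-term asymptotic series for the Faddeeva function w(z).
 * Assumption: |z| is large and Im(z) >= 0. Near the origin this series
 * is inaccurate and a different technique must be used. */
double complex faddeeva_asymptotic(double complex z)
{
    const double inv_sqrt_pi = 0.5641895835477563;   /* 1 / sqrt(pi) */
    double complex z2 = z * z;
    double complex series = 1.0 + 0.5 / z2 + 0.75 / (z2 * z2);
    return I * inv_sqrt_pi / z * series;
}
```

At z = 10i the result can be checked against the identity w(iy) = exp(y²) erfc(y), which gives approximately 0.05614 for y = 10. A full implementation needs at least one more technique to cover small |z|, which is where branching enters.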

Optimized Kernels

If using the event-based model, we will be executing the lookup kernel in RSBench across all particles at once. While SIMD execution is possible using this method, issues typically arise that greatly reduce SIMD efficiency. In particular, different materials in the simulation have very different numbers of nuclides in them. For instance, spent fuel has 300+ nuclides, while moderator regions only have 10 or so nuclides. This creates a significant load imbalance across lanes in a SIMD vector, as some particles may only need a few iterations to complete all nuclides while others would need hundreds. Therefore, efficient SIMD execution of the event-based model is not possible without some optimizations.

One promising optimization for the event-based model is to perform a key-value sort of particles: first by material, and then by energy within each material. The first sort by material allows for adjacent particles in the vector to typically reside in the same type of material -- meaning that they will require the same number of nuclide lookup iterations. Then, the energy sort means that adjacent particles in the vector will be located close in energy space -- potentially allowing for many adjacent particles to access the same energy indices in each nuclide and therefore perform many or all of the same branching operations and read the same cache lines into memory at the same time. Once sorted, separate event kernels are then called for each material in the simulation. These two sorts can potentially boost both SIMD efficiency and cache efficiency, with effects being amplified as more particles are simulated at each event stage. The downside to this optimization is the introduction of the key-value particle sorting operations, which can be costly and potentially outweigh any gains due to improved SIMD efficiency and cache performance.
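The ordering described above can be sketched serially in C with the standard library `qsort`. Note the substitution: instead of two parallel key-value sorts, this example uses a single comparison sort with a composite (material, energy) key, which produces the same final ordering. The `LookupSite` struct and both function names are hypothetical.

```c
#include <stdlib.h>

/* Hypothetical record for one pending lookup. */
typedef struct {
    int material;
    double energy;
} LookupSite;

/* Order by material first, then by energy within each material. */
static int compare_sites(const void *a, const void *b)
{
    const LookupSite *x = a;
    const LookupSite *y = b;
    if (x->material != y->material)
        return (x->material > y->material) - (x->material < y->material);
    return (x->energy > y->energy) - (x->energy < y->energy);
}

void sort_by_material_then_energy(LookupSite *sites, size_t n)
{
    qsort(sites, n, sizeof *sites, compare_sites);
}

/* After sorting, all sites of one material are contiguous. Store the
 * start of that run in *begin and return its length, so that a separate
 * kernel can be launched per material. */
size_t material_range(const LookupSite *sites, size_t n, int material,
                      size_t *begin)
{
    size_t i = 0;
    while (i < n && sites[i].material < material)
        i++;
    *begin = i;
    while (i < n && sites[i].material == material)
        i++;
    return i - *begin;
}
```

Within each contiguous run every lookup iterates over the same nuclide list, which is what removes the load imbalance between SIMD lanes.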

We have implemented this optimization in both the OpenMP threading and CUDA models. It is not enabled by default and must be enabled with the "-k 1" flag when running with OpenMP or the "-k 6" flag when running with CUDA. This optimization has not yet been implemented in the other programming models, as an efficient parallel sorting function is not easily available there without creating an external library dependency.

Citing RSBench

Papers citing the RSBench program in general should refer to:

Tramm J.R., Siegel A.R., Forget B., Josey C., "Performance Analysis of a Reduced Data Movement Algorithm for Neutron Cross Section Data in Monte Carlo Simulations," Presented at EASC 2014 - Solving Software Challenges for Exascale, Stockholm. https://doi.org/10.1007/978-3-319-15976-8_3

Bibtex Entry:

@inproceedings{Tramm:rs,
author="Tramm, John R. and Siegel, Andrew R. and Forget, Benoit and Josey, Colin",
title="Performance Analysis of a Reduced Data Movement Algorithm for Neutron Cross Section Data in Monte Carlo Simulations",
booktitle = {{EASC} 2014 - Solving Software Challenges for Exascale},
address = {Stockholm},
year = "2014",
url = "https://doi.org/10.1007/978-3-319-15976-8_3"
}

Development Team

Authored and maintained by John Tramm (@jtramm) with help from Ron Rahaman, Amanda Lund, and other contributors.