This repository provides multiple parallel n-body simulation algorithms, implemented in portable ISO C++ that runs on multi-core CPUs and GPUs:

- `all-pairs`, parallelized over bodies (see the sketch after this list).
- `all-pairs-collapsed`, parallelized over force pairs.
- `octree` algorithm: requires parallel forward progress.
- `bvh` (bounding volume hierarchy) algorithm: requires weakly parallel forward progress.
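As a rough illustration of the `all-pairs` strategy, the following minimal 2D sketch (not the repository's actual code; the `Body` layout, gravitational constant `G`, and softening term are illustrative assumptions) runs one parallel iteration per body with C++ standard parallelism, each iteration accumulating force contributions from every other body:

```cpp
// Minimal sketch: all-pairs force computation, parallelized over bodies.
// NOT the repository's implementation; Body, G, and the softening term
// are assumptions made for this example.
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <execution>
#include <ranges>
#include <vector>

struct Body {
  double x, y;    // position
  double m;       // mass
  double fx, fy;  // accumulated force
};

void all_pairs_forces(std::vector<Body>& bodies, double G) {
  auto ids = std::views::iota(std::size_t{0}, bodies.size());
  // One parallel task per body i; inner loop visits all other bodies j.
  std::for_each(std::execution::par_unseq, ids.begin(), ids.end(),
                [bs = bodies.data(), n = bodies.size(), G](std::size_t i) {
                  double fx = 0.0, fy = 0.0;
                  for (std::size_t j = 0; j != n; ++j) {
                    if (j == i) continue;
                    double dx = bs[j].x - bs[i].x;
                    double dy = bs[j].y - bs[i].y;
                    double r2 = dx * dx + dy * dy + 1e-9;  // softening
                    double inv_r = 1.0 / std::sqrt(r2);
                    double f = G * bs[i].m * bs[j].m * inv_r * inv_r;
                    fx += f * dx * inv_r;
                    fy += f * dy * inv_r;
                  }
                  bs[i].fx = fx;
                  bs[i].fy = fy;
                });
}
```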
Pre-requisites: `docker` and HPCCM:
$ pip install hpccm
Run samples as follows:
# Options
# ./ci/run_docker <toolchain> <algorithm> <workload case> <dim> <precision> <bodies> <steps>
# Example: nvc++ gpu compiler, octree algorithm, galaxy simulation, 3D, double precision:
$ ./ci/run_docker nvgpu octree galaxy 3 double
# Build only, do not run:
$ BUILD_ONLY=1 ./ci/run_docker nvgpu octree galaxy 3 double
# Run, assuming the binary is already built:
$ RUN_ONLY=1 ./ci/run_docker nvgpu octree galaxy 3 double
To reproduce the results without a container, a properly set-up environment is required; in that case, the `./ci/run` script can be used instead.
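For example, an invocation mirroring the container example above might look as follows (this assumes `./ci/run` accepts the same arguments as `./ci/run_docker`):

$ ./ci/run nvgpu octree galaxy 3 double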
The following options are available:

- Toolchain: `acpp` (AdaptiveCpp), `gcc` (Intel TBB), `clang` (Intel TBB), `amdclang`; `nvgpu` (`nvc++ -stdpar=gpu`), `nvcpu` (`nvc++ -stdpar=cpu`); `dpc++`.
- Algorithm: `all-pairs`, `all-pairs-collapsed`, `octree`, `bvh`.
- Dimensions: `2` (2D), `3` (3D).
- Precision: `float`, `double`.
- Workload case: `galaxy`; `nasa`, which loads a data-set from file and requires running `./ci/run_docker thuering fetch` for set up.

To run all benchmarks on a given system, you can use `./ci/run_docker bench`.
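As another example of combining the options listed above (assuming this particular combination is supported in your environment):

$ ./ci/run_docker acpp bvh galaxy 2 float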
MIT License, see LICENSE.
Thomas Lane Cassell, Tom Deakin, Aksel Alpay, Vincent Heuveline, and Gonzalo Brito Gadeschi. "Efficient Tree-Based Parallel Algorithms for N-Body Simulations Using C++ Standard Parallelism." In Workshop on Irregular Applications: Architectures and Algorithms Held in Conjunction with Supercomputing (P3HPC). IEEE, 2024.
When contributing code, you may format your contributions as follows:
$ ./ci/run_docker fmt
but doing this is not required.
The environment is made portable through mamba/conda.
This must be installed as a prerequisite, e.g., run the Miniforge installer from https://github.com/conda-forge/miniforge.
Then create the `stdpar-nbody` environment:
$ mamba env create -f environment.yaml
Other things you might want:
Use `make` to build the program. This must be done within the mamba environment:

$ mamba activate stdpar-bh

The number of dimensions can be specified with the `D=<dim>` parameter to `make`. By default, `D=2` is used.
These are the available targets:

- CPU:
  - `make gcc`
  - `make clang`
  - `make nvcpp`
- GPU:
  - `make gpu` to build for NVIDIA GPUs using `nvc++`
The output will be `./nbody_d<dim>_<target>`.
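For example, to build a 3D GPU binary (assuming the `D` parameter can be combined with any of the targets above, this should produce `./nbody_d3_gpu`):

$ make gpu D=3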
When running the `nvcpp` version, it is recommended to use the following environment variables:
OMP_PLACES=cores OMP_PROC_BIND=close ./nbody_d2_nvcpp -s 5 -n 1000000
If you get an error about missing libraries, try running with the following environment variable:
LD_LIBRARY_PATH=${CONDA_PREFIX}/lib ./nbody_d2_clang -s 5 -n 1000000
Run Barnes-Hut with $\theta=0$ and compare with the all-pairs algorithm. Run 5 steps with 10 bodies. Both runs should produce the same output.
$ ./nbody_d2_gpu -s 5 -n 10 --print-state --theta 0
$ ./nbody_d2_gpu -s 5 -n 10 --print-state --algorithm all-pairs
Run a large Barnes-Hut simulation with 1,000,000 bodies:
$ ./nbody_d2_gpu -s 5 -n 1000000
Generate an image similar to the GIF above:
$ ./nbody_d2_gpu -s 1000 -n 10000 --save pos --workload galaxy
$ python3 scripts/plotter.py pos --galaxy --gif
To find other program arguments:
$ ./nbody_d2_gpu --help