This is a Rust version of the examples from the "Performance with Stencil" course, with a few new tricks of mine.
In addition to a recent Rust toolchain, you will need to install development packages for the following C/C++ libraries:
Additionally, GPU examples use the Vulkan API through the vulkano library, which comes with extra build requirements.
In addition to the vulkano build requirements, actually running the GPU examples requires at least one working Vulkan implementation. Any reasonably modern Linux GPU driver will do, or if you just want them to run and don't care about actual performance, you may alternatively use the llvmpipe software renderer.
Debug builds additionally enable Vulkan validation layers for richer debug logs, so these must be installed too.
Overall, if you want to be able to run these examples in all possible configurations, you will want to install the following native packages:
# Example given for Ubuntu, other Linux distributions will be similar except the
# packages will be named a little differently
sudo apt install git build-essential curl \
libhdf5-dev libhwloc-dev libudev-dev pkgconf \
cmake ninja-build python3 \
vulkan-validationlayers-dev libvulkan-dev vulkan-tools
# A Rust toolchain can be installed in a distribution-agnostic fashion
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
The microbenchmarks are implemented using criterion, and we use the newer
cargo-criterion runner mechanism, which requires a separate binary that you
can install using this command:
$ cargo install cargo-criterion
In the same spirit as the C++ version, the code is sliced into several crates:
- data defines the general data model, parameters and HDF5 file I/O.
- compute/xyz crates implement the various compute backends, based on a small
  abstraction layer defined in compute/shared. Here are the compute backends
  in suggested learning order:
  - naive backend follows the original naive algorithm, but makes
    idiomatic use of the NumPy-like ndarray multidimensional array library
    for the sake of readability.
  - regular backend leverages the fact that the computation is simpler
    at the center of the domain than it is at the edges in order to get about
    2x more performance on the center pixels, at the cost of some code
    duplication between the center and edge computations.
  - autovec backend shapes the computation and data in such a way that
    the compiler can automatically vectorize most of the code. The code is
    simpler and more portable than if it were written directly against hardware
    intrinsics, but this implementation strategy also puts us at the mercy of
    compiler autovectorizer whims. Data layout is also improved, pretty much
    like what was done in the _intrinsics C++ version.
  - manualvec backend does the vectorization manually instead, like the
    _intrinsics C++ version does under the hood. It is significantly more
    complex and less portable than autovec while having comparable runtime
    performance, which shows that for this particular problem
    autovectorization can actually be a better tradeoff.
    Species concentration storage code is implemented in the data crate
    instead, see data/src/concentration/simd/safe_arch.rs.
  - block backend demonstrates how to use a blocked iteration technique
    to improve CPU cache locality, as the _link_block C++ version does.
  - parallel backend implements multi-threaded iteration using rayon, via a
    fork/join recursive splitting technique.
  - gpu_xyz backends implement GPU-based computations using the Vulkan API.
    - naive backend starts simple with image-based concentrations
      and a straightforward algorithm.
    - specialized backend uses specialization constants in order to...
- compute/selector crate provides a way for compute binaries to
  selectively enable compute backends and pick the most powerful backend
  amongst those that are currently enabled.
- ui crate lets the various binaries listed below share code and
  command-line options where appropriate.
- simulate is a binary that runs the simulation. It uses the same CLI argument
  syntax as the xyz_gray_scott binaries from the C++ version, but the
  choice of compute backend is made through Cargo features. For each
  compute/xyz backend, there is a matching compute_xyz feature.
- livesim is a variation of simulate that displays each image to a live
  window instead of writing images to files, and runs indefinitely. It is
  designed to compute as many simulation steps per second as possible while
  keeping the animation smooth, and should thus provide a nice visual overview
  of how fast backends are.
- data-to-pics is a binary that converts HDF5 output datafiles from simulate
  into PNG images, much like the gray_scott2pic binary from the C++ version
  except it uses a different color palette.

To run the simulation, build and run the simulate program as follows...
$ cargo run --release --bin simulate --features <backend> -- <CLI args>
...where <backend> is the Cargo feature name of a compute backend, such as
"compute_block", and <CLI args> accepts the same arguments as the C++ version.
You can put a --help in there for self-documentation.
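
If you have not used Cargo features before, the sketch below illustrates how a feature passed via --features can steer backend selection at compile time. It is only an illustration: the function name and the priority order are hypothetical, the feature names other than compute_block are extrapolated from the compute_xyz naming convention, and the actual logic of the compute/selector crate may look quite different.

```rust
// Hypothetical sketch of compile-time backend selection through Cargo
// features. This is NOT the project's selector code: the priority order
// and the function name are assumptions made for illustration.

/// Returns the name of the most capable backend enabled at compile time,
/// falling back to the naive backend when no feature is enabled.
fn selected_backend() -> &'static str {
    // cfg!(...) expands to a compile-time boolean, so the compiler can
    // discard the branches that do not apply to the current build.
    if cfg!(feature = "compute_parallel") {
        "parallel"
    } else if cfg!(feature = "compute_block") {
        "block"
    } else if cfg!(feature = "compute_autovec") {
        "autovec"
    } else {
        "naive"
    }
}

fn main() {
    println!("selected backend: {}", selected_backend());
}
```

In a real crate, each feature would also have to be declared in the [features] table of Cargo.toml, and would usually gate an optional dependency on the matching backend crate.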
Then, to convert the HDF5 output into PNG images for visualization purposes, you
can use the data-to-pics program, using something like the following...
$ mkdir -p pics
$ cargo run --release --bin data-to-pics -- -i <input> -o pics
...where <input> is the name of the input HDF5 file produced by simulate
(output.h5 by default).
Alternatively, you can run a live version of the simulation which produces a visual render similar to the aforementioned PNG images in real time, using the following command:
$ cargo run --release --bin livesim --features <backend> -- <CLI args>
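
As described earlier, livesim aims to compute as many simulation steps per second as possible while keeping the animation smooth. How it actually schedules its work is not covered here, but one common way to get this kind of behavior is to keep stepping the simulation until a per-frame time budget runs out, then render. The sketch below shows that idea only; the function name, the 60 Hz budget and the dummy step closure are all assumptions.

```rust
use std::time::{Duration, Instant};

/// Runs `step` repeatedly until `budget` has elapsed and returns the number
/// of steps executed. At least one step always runs, so the simulation keeps
/// advancing even if a single step is slower than the budget.
fn steps_within_budget(budget: Duration, mut step: impl FnMut()) -> usize {
    let start = Instant::now();
    let mut steps = 0;
    loop {
        step();
        steps += 1;
        if start.elapsed() >= budget {
            break;
        }
    }
    steps
}

fn main() {
    // Hypothetical 60 Hz display: roughly 16.6 ms available per frame.
    let frame_budget = Duration::from_micros(16_600);
    // Stand-in for the simulation state that a real backend would update.
    let mut total_steps = 0u64;
    for frame in 0..3 {
        let steps = steps_within_budget(frame_budget, || {
            total_steps = total_steps.wrapping_add(1);
        });
        // A real program would render the current image here.
        println!("frame {frame}: {steps} simulation steps");
    }
    println!("total steps: {total_steps}");
}
```

With this approach, a faster backend shows up as more simulation steps per displayed frame rather than as a higher frame rate.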
To run all the microbenchmarks, you can use this command:
$ cargo criterion
Alternatively, you can run microbenchmarks for a specific compute backend xyz,
which can speed up compilation by avoiding compilation of unused backends:
$ cargo criterion --package xyz
You can also selectively run benchmarks based on a regular expression, like so:
$ cargo criterion -- '(parallel|gpu).*2048x.*32'
The microbenchmark runner exports a more detailed HTML report in
target/criterion/reports/index.html that you may want to have a look at.
The build system is configured to generate binaries that are optimized for your
CPU, using the Rust equivalent of GCC's -march=native. You can change this
using the .cargo/config.toml configuration file.
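
For reference, rustc's counterpart of -march=native is the target-cpu=native codegen option. The fragment below is a minimal sketch of what the relevant part of .cargo/config.toml typically looks like. The file shipped in this repository may contain other settings, and the commented-out x86-64-v3 baseline is only an illustrative assumption.

```toml
# .cargo/config.toml (sketch, the repository's actual file may differ)
[build]
# Optimize for the CPU of the build machine, like GCC's -march=native
rustflags = ["-C", "target-cpu=native"]

# Hypothetical alternative for binaries that must run on other machines:
# target a fixed baseline instead of the host CPU.
# rustflags = ["-C", "target-cpu=x86-64-v3"]
```

Binaries built with target-cpu=native may fail with an illegal instruction error on CPUs that lack some of the host's instruction set extensions, so a fixed baseline is the safer choice if you plan to copy them to other machines.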