CTSRD-CHERI / SIMTight

Synthesisable SIMT-style RISC-V GPGPU

SIMTight

SIMTight is a fully synthesisable GPGPU core implementing the Single Instruction Multiple Threads (SIMT) model, featuring:

Further details about SIMTight can be found in the following publications.

SIMTight is being developed on the CAPcelerate project, part of the UKRI's Digital Security by Design programme.

Default SoC

The default SIMTight SoC consists of a host CPU and a 32-lane 64-warp streaming multiprocessor sharing DRAM, both supporting the CHERI-RISC-V ISA. A sample project is included for the DE10-Pro (revD and revE) FPGA development board.
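As a quick sanity check on the default configuration, the number of hardware threads follows from the lane and warp counts. This is a minimal sketch that assumes each warp holds one thread per lane; the names LANES, WARPS and hardware_threads are ours for illustration and do not come from the SIMTight sources.

```python
LANES = 32  # vector lanes in the default streaming multiprocessor
WARPS = 64  # warps in the default streaming multiprocessor


def hardware_threads(lanes: int = LANES, warps: int = WARPS) -> int:
    """Total threads, assuming every warp holds one thread per lane."""
    return lanes * warps


if __name__ == "__main__":
    print(hardware_threads())
```

Under that assumption the default SoC supports 32 x 64 = 2048 hardware threads.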

Build instructions

We'll need Verilator, a RISC-V compiler, and GHC 9.2.1 or later.

On Ubuntu 20.04 or 22.04, we can do:

$ sudo apt install verilator
$ sudo apt install gcc-riscv64-unknown-elf
$ sudo apt install libgmp-dev

For GHC 9.2.1 or later, ghcup can be used.

If you're having difficulty meeting the dependencies, please use our docker container.

Getting started

Recursively clone the repo:

$ git clone --recursive https://github.com/CTSRD-CHERI/SIMTight

Inside the repo, there are various things to try. For example, to build and run the SIMTight simulator:

$ cd sim
$ make
$ ./sim &

With the simulator running in the background, we can build and run the test suite:

$ cd apps/TestSuite
$ make test-cpu-sim     # Run on the CPU
$ make test-simt-sim    # Run on the SIMT core

Alternatively, we can run one of the SIMT kernels:

$ cd apps/Samples/Histogram
$ make RunSim
$ ./RunSim

To run all tests and benchmarks, we can use the test script. This script will launch the simulator automatically, so we first make sure it's not already running.

$ killall sim
$ cd test
$ ./test.sh            # Run in simulation

To build an FPGA image for the DE10-Pro revE board (Quartus 21.3pro or later recommended):

$ cd de10-pro-e
$ make                 # Assumes quartus is in your PATH
$ make download-sof    # Assumes DE10-Pro revE is connected via USB

We can now run a SIMT kernel on FPGA:

$ cd apps/Samples/Histogram
$ make
$ ./Run

To run the test suite and all benchmarks on a DE10-Pro revE FPGA:

$ cd test
$ ./test.sh --fpga-e    # Assumes FPGA image built and FPGA connected via USB

Use the --stats option to generate performance stats.

Enabling CHERI

To enable CHERI, some additional preparation is required. First, edit inc/Config.h and apply the following settings:

Second, install the CHERI-Clang compiler using our script. Assuming all of cheribuild's dependencies are met, we can simply do:

$ cd cheri-tools
$ ./build-cheri.sh

This will install the compiler into $(pwd)/cheri/output/sdk/bin, which we can then add to our PATH:

export PATH=$(pwd)/cheri/output/sdk/bin:$PATH

If you're having difficulty meeting any of cheribuild's dependencies, please use our docker container.

We mustn't forget to run make clean in the root of the SIMTight repo any time inc/Config.h is changed. At this point, all of the standard build instructions should work as before.

CHERI instructions for getting and setting bounds on capabilities are quite expensive in terms of logic area and typically not performance critical. Therefore, it can be useful to share bounds getting/setting logic between vector lanes:

Various optimisations are enabled by this setting. It leads to a large reduction in area overhead, at almost no performance cost across the benchmark suite.

Another option that reduces the area overhead of CHERI is:

But beware, this setting removes some CHERI functionality. Specifically, it tells the SIMT core to ignore changes to the bounds and permissions of the PCC. Once the bounds and permissions of the PCC for each warp are set at kernel startup, they can never be changed.

Enabling scalarisation

Scalarisation is an optimisation that detects uniform and affine vectors and processes them more efficiently as scalars, reducing on-chip storage and increasing performance density. An affine vector is one in which there is a constant stride between each element; a uniform vector is an affine vector where the stride is zero, i.e. all elements are equal.
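To make the two definitions concrete, here is a small software sketch that checks whether a vector is affine and, if so, returns its compact (base, stride) form. It is illustrative only: SIMTight performs the detection in hardware, and the names Affine, scalarise and is_uniform are hypothetical. The sketch also uses Python's unbounded integers, whereas hardware registers have a fixed width.

```python
from typing import NamedTuple, Optional, Sequence


class Affine(NamedTuple):
    """Compact form of an affine vector: element i equals base + i * stride."""
    base: int
    stride: int


def scalarise(vec: Sequence[int]) -> Optional[Affine]:
    """Return the (base, stride) form of vec, or None if vec is not affine."""
    if not vec:
        return None
    if len(vec) == 1:
        return Affine(vec[0], 0)
    stride = vec[1] - vec[0]
    # Every element must sit on the line defined by the first two elements.
    for i, x in enumerate(vec):
        if x != vec[0] + i * stride:
            return None
    return Affine(vec[0], stride)


def is_uniform(vec: Sequence[int]) -> bool:
    """A uniform vector is an affine vector whose stride is zero."""
    compact = scalarise(vec)
    return compact is not None and compact.stride == 0


if __name__ == "__main__":
    print(scalarise([7, 7, 7, 7]))    # uniform
    print(scalarise([0, 4, 8, 12]))   # affine with stride 4
    print(scalarise([1, 2, 4, 8]))    # neither
```

A uniform or affine vector therefore needs only two numbers of storage regardless of how many lanes it spans, which is where the on-chip storage saving comes from.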

SIMTight implements dynamic scalarisation (i.e. in hardware, at runtime), and it can be enabled separately for the integer register file and the register file holding capability meta-data. To enable scalarisation of both register files, edit inc/Config.h and apply the following settings:

These options alone only enable scalarisation of uniform vectors. To enable scalarisation of affine vectors, apply the following settings:

Note that affine scalarisation is never used in the register file holding capability meta-data, where it wouldn't make much sense.

SIMTight exploits scalarisation to reduce register file storage requirements. Hence, it is desirable to set the number of physical registers to a value smaller than the number of architectural registers. In cases where scalarisation cannot prevent overflow of the physical register file, the hardware implements dynamic register spilling, where registers are evicted to and fetched from DRAM as required. In the default configuration, the size of the physical register files is equal to the number of architectural registers (so dynamic spilling is not required):

At the moment we have two spill policies: pick-first and least-recently-used. To enable the latter:

When CHERI is enabled, it's possible to share vector register memory between the integer and capability meta-data register files.

In this case, both register file sizes must be set to the same value. This option causes a one-cycle pipeline bubble when loading a capability meta-data vector from the register file.

SIMTight also supports an experimental scalarised vector store buffer to reduce the cost of compiler-inserted register spills (as opposed to hardware-inserted dynamic spills), at low hardware cost, which can be enabled as follows.

As well as reducing on-chip storage, scalarisation is also exploited to improve runtime performance. Enabling a scalar pipeline in the SIMT core allows an entire warp to be executed on a single execution unit in a single cycle (when the instruction is detected as scalarisable). The scalar pipeline operates in parallel with the main vector pipeline. For many workloads, this increases performance density significantly.
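The reason a single execution unit can stand in for a whole warp is that some operations on compact vectors produce another compact vector. The sketch below shows this for addition only: adding two affine vectors takes two scalar additions, whatever the lane count. Which instructions SIMTight actually treats as scalarisable is not stated here, so this illustrates the principle rather than the hardware. The names AffineVec, expand, add_scalarised and add_vector are hypothetical.

```python
from typing import List, Tuple

# Hypothetical compact form: (base, stride), meaning lane i holds base + i * stride.
AffineVec = Tuple[int, int]

LANES = 32  # matches the 32-lane default SoC


def expand(v: AffineVec, lanes: int = LANES) -> List[int]:
    """Materialise the per-lane values of a compact affine vector."""
    base, stride = v
    return [base + i * stride for i in range(lanes)]


def add_scalarised(a: AffineVec, b: AffineVec) -> AffineVec:
    """Add two affine vectors using two scalar additions in total."""
    return (a[0] + b[0], a[1] + b[1])


def add_vector(a: List[int], b: List[int]) -> List[int]:
    """Reference: the same addition done once per lane."""
    return [x + y for x, y in zip(a, b)]


if __name__ == "__main__":
    addrs = (100, 4)   # e.g. a per-thread address with stride 4
    offset = (8, 0)    # a uniform offset shared by all threads
    fast = add_scalarised(addrs, offset)
    assert expand(fast) == add_vector(expand(addrs), expand(offset))
    print(fast)
```

The per-lane reference does 32 additions while the compact version does two, and both describe the same result vector.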

To enable the initial value optimisation (IVO) in the capability meta-data register file:

This is a simple form of partial scalarisation allowing compact storage of vectors that can be partitioned into an arbitrary scalar value and the initial value (null capability meta-data in this case) using a bit mask. These bit masks need to be stored alongside their associated scalar values but are allocated dynamically on demand so that the cost is not paid for every scalar register. The parameter SIMTCapRFLogNumPartialMasks determines the max number of masks that can be stored. If the limit is reached then the optimisation simply becomes unavailable.
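The partitioning described above can be sketched in software as follows. A vector is stored as one scalar plus a bit mask, where bit i of the mask says whether lane i holds the scalar or the initial value. This is an illustration under our own assumptions, not the hardware design: NULL_META is a stand-in integer for null capability meta-data, and compress and expand are hypothetical names.

```python
from typing import List, Optional, Tuple

NULL_META = 0  # stand-in for the initial value (null capability meta-data)


def compress(vec: List[int], initial: int = NULL_META) -> Optional[Tuple[int, int]]:
    """Return (scalar, mask) if every lane holds either `initial` or one shared
    scalar; bit i of mask is set when lane i holds the scalar.
    Return None if the vector cannot be partitioned this way."""
    scalar: Optional[int] = None
    mask = 0
    for lane, x in enumerate(vec):
        if x == initial:
            continue
        if scalar is None:
            scalar = x          # first non-initial value becomes the scalar
        elif x != scalar:
            return None         # a second distinct value: not representable
        mask |= 1 << lane
    if scalar is None:
        scalar = initial        # every lane still holds the initial value
    return scalar, mask


def expand(scalar: int, mask: int, lanes: int, initial: int = NULL_META) -> List[int]:
    """Rebuild the full vector from its scalar, mask and initial value."""
    return [scalar if (mask >> lane) & 1 else initial for lane in range(lanes)]


if __name__ == "__main__":
    vec = [NULL_META, 42, 42, NULL_META]
    packed = compress(vec)
    print(packed)
    assert packed is not None and expand(*packed, lanes=len(vec)) == vec
```

In this model a vector with two or more distinct non-initial values cannot be compressed and would have to be stored as a full vector.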

In future, we are interested in looking at general partial scalarisation, as well as inter-warp scalarisation.



Supported by


Digital Security by Design (DSbD) Programme