The growing interest in deploying machine learning (ML) applications on devices with restricted processing power and energy capacity underscores the need for computing solutions that not only excel in power and memory efficiency but also ensure low latency for time-sensitive applications. Research has shown that networks whose individual parameters are quantized to varying low precisions can attain accuracies comparable to their full-precision counterparts. However, modern embedded microprocessors provide very limited support for mixed-precision Neural Networks, both in their Instruction Set Architecture (ISA) extensions and in their hardware design, which introduces several performance bottlenecks: numerous instructions for data packing and unpacking, under-utilization of the arithmetic units, etc.
In this work, we propose ISA extensions tailored to mixed-precision hardware optimizations, targeting energy-efficient Deep Neural Network inference on leading RISC-V CPU architectures. To this end, we introduce a hardware-software co-design framework that couples hardware design, mixed-precision quantization, ISA extensions, and inference in cycle-accurate emulations.
This repository includes:
A brief overview of the whole process can be seen in the following flowchart:
To get started, we'll first build the RISC-V toolchain. Before proceeding, ensure that all necessary dependencies are installed on the system:
sudo apt-get install autoconf automake autotools-dev curl python3 libmpc-dev libmpfr-dev libgmp-dev gawk \
build-essential bison flex texinfo gperf libtool patchutils bc zlib1g-dev libexpat-dev
Next, clone the toolchain from the official repository:
git clone https://github.com/riscv/riscv-gnu-toolchain
cd riscv-gnu-toolchain
git submodule update --init --recursive
To implement the modifications in the RISC-V GNU toolchain, apply the changes outlined below to the following files:
File Path: path/to/riscv-gnu-toolchain/binutils/include/opcode/riscv-opc.h
#ifndef RISCV_ENCODING_H
#define RISCV_ENCODING_H
#define MATCH_NEUR_INIT 0x1047
#define MASK_NEUR_INIT 0xfe00707f
#define MATCH_NEUR_MACC_8B 0x10002047
#define MASK_NEUR_MACC_8B 0xfe00707f
#define MATCH_NEUR_MACC_4B 0x08002047
#define MASK_NEUR_MACC_4B 0xfe00707f
#define MATCH_NEUR_MACC_2B 0x04002047
#define MASK_NEUR_MACC_2B 0xfe00707f
#define MATCH_NEUR_RES 0x4047
#define MASK_NEUR_RES 0xfe00707f
...
#endif /* RISCV_ENCODING_H */
#ifdef DECLARE_INSN
DECLARE_INSN(neur_init, MATCH_NEUR_INIT, MASK_NEUR_INIT)
DECLARE_INSN(nn_mac_8b, MATCH_NEUR_MACC_8B, MASK_NEUR_MACC_8B)
DECLARE_INSN(nn_mac_4b, MATCH_NEUR_MACC_4B, MASK_NEUR_MACC_4B)
DECLARE_INSN(nn_mac_2b, MATCH_NEUR_MACC_2B, MASK_NEUR_MACC_2B)
DECLARE_INSN(neur_res, MATCH_NEUR_RES, MASK_NEUR_RES)
#endif /* DECLARE_INSN */
File Path: path/to/riscv-gnu-toolchain/binutils/opcodes/riscv-opc.c
...
const struct riscv_opcode riscv_opcodes[] =
{
{"neur_init", 0, INSN_CLASS_I, "d,s,t", MATCH_NEUR_INIT, MASK_NEUR_INIT, match_opcode, 0},
{"nn_mac_8b", 0, INSN_CLASS_I, "d,s,t", MATCH_NEUR_MACC_8B, MASK_NEUR_MACC_8B, match_opcode, 0},
{"nn_mac_4b", 0, INSN_CLASS_I, "d,s,t", MATCH_NEUR_MACC_4B, MASK_NEUR_MACC_4B, match_opcode, 0},
{"nn_mac_2b", 0, INSN_CLASS_I, "d,s,t", MATCH_NEUR_MACC_2B, MASK_NEUR_MACC_2B, match_opcode, 0},
{"neur_res", 0, INSN_CLASS_I, "d,s,t", MATCH_NEUR_RES, MASK_NEUR_RES, match_opcode, 0},
...
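Once the toolchain is rebuilt with these opcodes, the new instructions can be emitted directly from C via inline assembly, since they are registered with standard R-type operands ("d,s,t", i.e. rd, rs1, rs2). The wrappers below are a minimal sketch; what the accelerator actually expects in the source registers (and returns in rd) follows our hardware design, so treat the argument semantics as illustrative:

#include <stdint.h>

/* Illustrative wrappers for the custom instructions; the argument
   semantics are assumptions for demonstration purposes. */
static inline uint32_t neur_init(uint32_t rs1, uint32_t rs2) {
    uint32_t rd;
    asm volatile ("neur_init %0, %1, %2" : "=r"(rd) : "r"(rs1), "r"(rs2));
    return rd;
}

static inline uint32_t nn_mac_8b(uint32_t packed_act, uint32_t packed_wgt) {
    uint32_t rd;
    /* one nn_mac_8b performs multiply-accumulates on the packed operands */
    asm volatile ("nn_mac_8b %0, %1, %2" : "=r"(rd) : "r"(packed_act), "r"(packed_wgt));
    return rd;
}

static inline uint32_t neur_res(uint32_t rs1, uint32_t rs2) {
    uint32_t rd;
    asm volatile ("neur_res %0, %1, %2" : "=r"(rd) : "r"(rs1), "r"(rs2));
    return rd;
}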
Now, let's configure and install the toolchain, with the custom instructions included. For the Ibex core, we require the RV32IMC instruction set:
./configure --prefix=/opt/riscv --with-arch=rv32imc --with-abi=ilp32 --with-isa-spec=2.2
sudo make    # for riscv-gnu-toolchain, 'make' both builds and installs into the prefix
sudo make clean
We also need to make sure that the directory containing the RISC-V tools is included in the PATH environment variable. Since our binaries are located in /opt/riscv/bin, we must add it to our PATH in the shell configuration file (~/.bashrc or ~/.bash_profile):
export PATH=$PATH:/opt/riscv/bin
After adding this line, apply the changes:
source ~/.bashrc
Finally, we should verify that the RISC-V GCC compiler and other tools are accessible. Run the following command to check the RISC-V GCC compiler version:
riscv32-unknown-elf-gcc --version
You should see output similar to:
riscv32-unknown-elf-gcc (gc891d8dc23e) 13.2.0
Copyright (C) 2023 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
The next step involves creating the Quantized Model using the Brevitas Library. To begin, we must acquire the full-precision model, either by importing a pre-trained version or training it from scratch. Once we have the model, we will evaluate its accuracy on the given dataset. Following this, we will replicate the model's architecture and substitute its layers with the corresponding quantized versions available in the Brevitas library. Finally, we need to determine the appropriate weight precision for each layer. Two options can be explored:
One of these options evaluates only log2(3^L) of the 3^L possible configurations, swiftly identifying a solution that conforms to the user's accuracy requirements. To utilize this approach for a specific network, we need to set the max_acc_drop variable to a specific value. Finally, once the optimal configuration aligning with our constraints is determined, we proceed to automatically generate the files that will be utilized for the inference (and simulation) of the QNN on our extended RISC-V core. More specifically, we will create files that contain:
Comprehensive examples that implement the procedure described above are provided in this subfolder.
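To put the log2(3^L) search in perspective: assuming three candidate weight precisions per layer (e.g., the 2-, 4-, and 8-bit modes matching the custom MAC instructions), an exhaustive search over a network with L quantizable layers must consider 3^L configurations, whereas the reduced search evaluates only log2(3^L) = L·log2(3) ≈ 1.58·L of them. For LeNet-5 with L = 5 weight layers, that is 3^5 = 243 candidate configurations versus roughly 8.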
The base design utilized in this work is a generic microarchitecture implementation of Ibex, an open-source 32-bit RISC-V CPU core written in SystemVerilog. Ibex was initially developed as part of the PULP platform under the name "Zero-riscy" and was contributed to lowRISC, which now maintains and actively develops it. The CPU core is heavily parametrizable and well suited for embedded control applications. Ibex is extensively verified and has seen multiple tape-outs. It supports the Integer (I) or Embedded (E), Integer Multiplication and Division (M), Compressed (C), and Bit Manipulation (B) extensions.
You can explore the codebase for the Ibex repository on GitHub by visiting lowRISC/ibex.
Since the core's decoder is not initially configured to recognize the newly introduced instructions, it must be modified alongside the integration of the accelerator. The updated decoder can then correctly interpret the opcode and function-code fields of the new instructions, identify and differentiate them, extract the necessary values from the source registers, and forward this information to the newly added unit.
The overarching objective is to amplify the throughput of MAC operations per cycle, a critical metric for computational efficiency in NN processing, while simultaneously minimizing resource utilization and power consumption. These optimizations also let us activate the system's prefetcher and use the advanced optimization flags of the GCC compiler, generating faster and more efficient code without stalling the processor's pipeline. The most notable ones are:
For the simulation of our QNNs on the modified Ibex core, with Verilator, we can build and test either:
Ibex Simple System, which is an Ibex-based system that can run stand-alone binaries and contains the core, a single memory for instructions and data, and a basic peripheral to write ASCII output.
Ibex Demo System, which is an example RISC-V SoC targeting the Arty-A7 FPGA board that includes the core alongside UART, GPIO, PWM, Timer, SPI, and debug support.
All the necessary files for the two designs are located in the extended_ibex folder.
For the simulations we first need Verilator (we recommend building version 5.008). In brief, to install it from git:
# Prerequisites:
sudo apt-get install git help2man perl python3 make autoconf g++ flex bison ccache
sudo apt-get install libgoogle-perftools-dev numactl perl-doc
sudo apt-get install libelf-dev
sudo apt-get install srecord
sudo apt-get install libfl2 # Ubuntu only (ignore if gives error)
sudo apt-get install libfl-dev # Ubuntu only (ignore if gives error)
sudo apt-get install zlibc zlib1g zlib1g-dev # Ubuntu only (ignore if gives error)
git clone https://github.com/verilator/verilator
# Every time you need to build:
unsetenv VERILATOR_ROOT # For csh; ignore error if on bash
unset VERILATOR_ROOT # For bash
cd verilator
git pull # Make sure we're up-to-date
git checkout v5.008
autoconf
./configure
make -j$(nproc)
sudo make install
At the end you can ensure that Verilator was successfully installed by checking its version:
verilator --version
The output should be the following:
Verilator 5.008 2023-03-04 rev v5.008
We also need FuseSoC, an award-winning package manager and a set of build tools for HDL (Hardware Description Language) code, to automate the process of building the simulations. Ibex uses a custom fork of FuseSoC, so we must install it through the repository's python-requirements.txt rather than installing FuseSoC separately. For project dependencies, it is more appropriate to use a virtual environment:
python3 -m venv .venv
source .venv/bin/activate
pip3 install -U -r python-requirements.txt
To build the simulation for the Simple System, run the following from its directory (while still inside the virtual environment):
cd extended_ibex/ibex_simple_system/
fusesoc --cores-root=. run --target=sim --setup --build lowrisc:ibex:ibex_simple_system --RV32E=0 --RV32M=ibex_pkg::RV32MSingleCycle
Similarly, to build the simulation for the Demo System (the one that we will map onto the FPGA device), run from its directory the following:
cd extended_ibex/ibex_demo_system/
fusesoc --cores-root=. run --target=sim --tool=verilator --setup --build lowrisc:ibex:demo_system
When we are done, we can deactivate the virtual environment by running:
deactivate
To compile the appropriate C executable for our Quantized Neural Network, we should first make sure that the previously generated C file, along with the header files containing the compressed network's parameters and inputs, is located in the same folder under the inference_codes directory. Within this directory, you'll find compatible software examples that work for both system simulations. Additionally, in the common subfolder you'll find the optimized kernels featuring the custom instructions.
For example, to build the LeNet5 MNIST example, from our root repository run:
make -C inference_codes/lenet5_mnist/optimized
The compiled .elf program is available at inference_codes/lenet5_mnist/optimized. The same directory also contains a Verilog memory file (vmem) to be used with some simulators.
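To give a flavor of how such an optimized kernel can employ the custom instructions, below is a minimal sketch of a dot-product loop over operands packed four 8-bit values per 32-bit word. The packing layout and the init/result calling convention are assumptions for illustration; consult the kernels in the common subfolder for the exact semantics.

#include <stdint.h>

/* Hypothetical 8-bit dot-product kernel: 'act' and 'wgt' each hold n
   values already packed four per 32-bit word (n divisible by 4).
   neur_init/nn_mac_8b/neur_res are inline-asm wrappers around the
   custom instructions, as sketched earlier. */
int32_t dot_product_8b(const uint32_t *act, const uint32_t *wgt, int n) {
    neur_init(0, 0);                /* assumed: reset the internal accumulator */
    for (int i = 0; i < n / 4; i++) {
        nn_mac_8b(act[i], wgt[i]);  /* four 8-bit MACs per instruction */
    }
    return (int32_t)neur_res(0, 0); /* assumed: read back the accumulated sum */
}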
Having built the simulator and software, we can simulate with Verilator using the following commands. <sw_elf_file> should be a path to an ELF file (or alternatively a vmem file) built as described above. Pass -t to get a VCD trace of the execution.
# For the Simple System Simulation:
./extended_ibex/ibex_simple_system/build/lowrisc_ibex_ibex_simple_system_0/sim-verilator/Vibex_simple_system [-t] --meminit=ram,<sw_elf_file>
# For the Demo System Simulation:
./extended_ibex/ibex_demo_system/build/lowrisc_ibex_demo_system_0/sim-verilator/Vibex_demo_system [-t] --meminit=ram,<sw_elf_file>
# For example :
./extended_ibex/ibex_demo_system/build/lowrisc_ibex_demo_system_0/sim-verilator/Vibex_demo_system \
--meminit=ram,./inference_codes/lenet5_mnist/optimized/lenet5_mnist.elf
We can utilize the simulation output to extract measurements about the latency of our program and estimate speedup with respect to the original Ibex core.
Simulation statistics
=====================
Executed cycles: 488901
Wallclock time: 12.188 s
Simulation speed: 40113.3 cycles/s (40.1133 kHz)
Performance Counters
====================
Cycles: 480856
NONE: 0
Instructions Retired: 334753
LSU Busy: 121484
Fetch Wait: 2168
Loads: 117569
Stores: 3915
Jumps: 7
Conditional Branches: 45792
Taken Conditional Branches: 8224
Compressed Instructions: 43
Multiply Wait: 0
Divide Wait: 0
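The speedup is simply the ratio of executed cycles between the original and the extended core running the same network. For instance, if a baseline run on the unmodified Ibex reported (hypothetically) 1,500,000 cycles, the run above would correspond to a speedup of 1500000 / 480856 ≈ 3.1x; the actual baseline figure is obtained by simulating the same QNN on the original core.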
If you find our project helpful, please consider citing our paper:
@misc{mixedICCAD24,
title={Mixed-precision Neural Networks on RISC-V Cores: ISA extensions for Multi-Pumped Soft SIMD Operations},
author={Giorgos Armeniakos and Alexis Maras and Sotirios Xydis and Dimitrios Soudris},
year={2024},
eprint={2407.14274},
archivePrefix={arXiv},
primaryClass={cs.AR},
url={https://arxiv.org/abs/2407.14274},
}