chili-chips-ba / wireguard-fpga

Full-throttle, wire-speed hardware implementation of the Wireguard VPN, using a low-cost Artix-7 FPGA with an open-source toolchain. If you seek security and privacy, nothing is private in our codebase. Our door is wide open for backdoor scrutiny, be it related to RTL, embedded, build, bitstream or any other aspect of the design and delivery package. Bujrum (welcome)!
https://nlnet.nl/project/KlusterLab-Wireguard
BSD 3-Clause "New" or "Revised" License

Wireguard FPGA

Virtual Private Networks (VPNs) are a central and indispensable component of Internet security. They comprise a set of technologies that connect geographically dispersed, heterogeneous networks through encrypted tunnels, creating the impression of a homogeneous private network on top of the public, shared physical medium.

With traditional solutions (such as OpenVPN / IPSec) starting to run out of steam, Wireguard is increasingly coming to the forefront as a modern, secure data tunneling and encryption method, one that's also easier to manage than the incumbents. Both software and hardware implementations of Wireguard already exist. However, software performance falls far short of wire speed, while existing hardware approaches are prohibitively expensive and based on proprietary, closed-source IP blocks and tools.

The intent of this project is to bridge these gaps with an FPGA open-source implementation of Wireguard, written in SystemVerilog HDL.

A Glimpse into History

We have contributed to the Blackwire project, a 100 Gbps hardware implementation of a Wireguard switch, based on the AMD/Xilinx-proprietary Alveo U50 PC accelerator card (SmartNIC form factor) and implementable only with the proprietary Vivado toolchain.

While working on Blackwire, we touched multiple sections of it, focusing on a novel algorithm for Balanced Binary Tree Search of IP tables. However, the Blackwire hardware platform is expensive, priced out of reach of most educational institutions. Its gateware is written in SpinalHDL, a niche HDL that is hardly popular in either academia or industry. While Blackwire has since been released as open source, that decision came from financial hardship: it was originally meant for sale. Moreover, the company behind it faces potential lawsuits that bring into question the legality of its ownership of the codebase it donated to the open-source community.

Back to the Future

To make hardware Wireguard truly accessible, in the genuine spirit of the open-source movement, this project implements it in SystemVerilog, on a low-cost Artix-7 board, with an open-source toolchain.

References

[Ref1] Wireguard implementations in software:

[Ref2] 100Gbps Blackwire Wireguard

[Ref3] Corundum, open-source FPGA-NIC platform

[Ref4] ChaCha20-Poly1305 open-source Crypto RTL

[Ref5] Cookie Cutter SOC

[Ref6] RISC-V ISS

[Ref7] 10Gbps Ethernet Switch

[Ref8] OpenXC7 open-source tools for Xilinx Series7

Project Outline

Scope

Phase 1 (this!) is primarily a Proof of Concept, i.e. not full-featured, and definitely not a deployable product. It is envisioned as a mere on-ramp, a springboard for future build-up and optimizations.

A Phase 2 continuation project is therefore also in the plans, to maximize efficiency and overall usability, such as by increasing the number of channels, facilitating management with GUI apps, or whatever else is identified through community feedback.

Recognized Challenges

1) HW/SW partitioning, interface, interactions and workload distribution

  • While, contrary to Blackwire, we don't rely on an external PC connected via PCIe, we will still have an on-chip RISC-V CPU with an intricate hardware interface and a significant embedded software component that controls the backbone of the wire-speed datapath.

2) HW/SW co-development, integration and debugging

  • Standard simulation is impractical for a project of this size and complexity. We therefore intend to put to the test, and to good use, the very promising new VProc ISS [Ref6].
  • It is also impractical and expensive to provide full test systems with real traffic generators and checkers to all developers. We therefore plan to rent space for a central lab that will host two test systems, then provide remote access to all developers.

3) Real-life, at-speed testing

4) Extent of open-source tools support for SystemVerilog and all needed FPGA primitives and IP functions

5) QoR of the (still maturing) open-source tools

  • Blackwire used the commercial, AMD/Xilinx-proprietary Vivado toolchain, as well as high-end Alveo U50 FPGA silicon. Even then, they ran into multiple timing-closure, utilization and routing-congestion challenges.

6) Financial resources

  • Given that this is a complex, multi-disciplinary development effort, the available funding may not be sufficient to bring it to completion. Blackwire, despite a larger allocated budget, ended up in a funding crisis, with an abrupt cessation of development activities.

Project Execution Plan / Tracking

This project is a WIP at the moment. The checkmarks below indicate our status. Until all checkmarks are in place, anything you get from here comes without guarantee: use it at your own risk, as you see fit, and don't blame us if it does not work 🌤️

Take1

Board bring up. In-depth review of Wireguard ecosystem and prior art. Design Blueprint

While the board we're using is low-cost, it is also not particularly well known in the open-source community, and we have no prior experience with it. In this opening take we will build a solid foundation for efficient project execution, since good preparation is crucial for a smooth run. We thus seek to first understand and document what we will be designing: SOC Architecture, Datapath Microarchitecture, Hardware/Software Partitioning, DV and Validation Strategy.

Getting a good feel for our Fmax is also a goal of this take. Artix-7 does not support High-Performance (HP) I/O. Consequently, we cannot push its I/O beyond 600 MHz, nor its core logic beyond 100 MHz.

Take2

Implementation of a basic, statically pre-configured Wireguard link

It is in this take that we start creating the hardware Datapath and hardening the Wireguard encryption protocols, all using Vivado and Xilinx primitives.

Take3

Development and integration of embedded management software (Control Plane)

This work package is about hardware/software co-design and integration. The firmware will run on a soft RISC-V processor inside the FPGA. At this point our vanilla SOC starts to be customized to Wireguard needs. This work can, to some extent, proceed in parallel with the hardware activities of Take2.

Take4

VPN Tunnel: Session initialization, maintenance, and secure closure

This is about managing the bring-up, maintenance and tear-down of VPN tunnels between two devices.

Take5

Testing, Profiling and Porting to OpenXC7

Take6 (time-permitting Bonus)

Flow control module for efficient and stable VPN tunnel data management

The objective of this optional deliverable is to ensure stable and efficient links, thus taking this project one step closer to a deployable product.

Design Blueprint (WIP)

HW/SW Partitioning

Since the Wireguard node essentially functions as an IP router with Wireguard protocol support, we have decided to design the system according to a two-layer architecture: a control plane responsible for managing IP routing processes and executing the Wireguard protocol (managing remote peers, sessions, and keys), and a data plane that will perform IP routing and cryptography processes at wire speed. The control plane will be implemented as software running on a soft CPU, while the data plane will be fully realized in RTL on an FPGA.

HWSWPartitioning

In the HW/SW partitioning diagram, we can observe two types of network traffic: control traffic, which originates from the control plane and goes toward the external network (and vice versa), and data traffic, which arrives from the external network and, after processing in the data plane, returns to the external network. Specifically, control traffic represents Wireguard protocol handshake messages, while data traffic consists of end-user traffic, either encrypted or in plaintext, depending on the perspective.

Hardware Architecture

HW Block Diagram

HWArchitecture

HW Theory of Operation

The hardware architecture essentially follows the HW/SW partitioning and consists of two domains: a soft CPU for the control plane and RTL for the data plane.

The soft CPU is equipped with a Boot ROM and a DDR3 SDRAM controller for interfacing with off-chip memory. External memory is exclusively used for control plane processes and does not store packets. The connection between the control and data planes is established through a CSR-based HAL.

The data plane consists of several IP cores, which are listed and explained in the direction of network traffic propagation:

The ChaCha20-Poly1305 Encryptor/Decryptor use RFC 7539's AEAD (Authenticated Encryption with Associated Data) construction, based on ChaCha20 for symmetric encryption and Poly1305 for authentication.
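For reference, the RFC 7539 AEAD operation takes a 256-bit key, a 96-bit nonce, optional additional authenticated data (AAD) and a payload, and produces a ciphertext plus a 128-bit Poly1305 tag. A declaration-only sketch of that interface is shown below; the names and argument order are illustrative, not the actual port list of the RTL cores.

#include <cstdint>
#include <cstddef>

// Seal: encrypts buf[0..len) in place with ChaCha20 and writes the 16-byte Poly1305 tag.
void chacha20_poly1305_seal(const uint8_t key[32], const uint8_t nonce[12],
                            const uint8_t* aad, size_t aad_len,
                            uint8_t* buf, size_t len, uint8_t tag[16]);

// Open: verifies the tag, then decrypts buf[0..len) in place; returns false on authentication failure.
bool chacha20_poly1305_open(const uint8_t key[32], const uint8_t nonce[12],
                            const uint8_t* aad, size_t aad_len,
                            uint8_t* buf, size_t len, const uint8_t tag[16]);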

Software Architecture

SW Conceptual Class Diagram

SWArchitecture

SW Theory of Operation

The conceptual class diagram provides an overview of the components in the software part of the system without delving into implementation details. The focus is on the Wireguard Agent, which implements the protocol's handshake procedures, along with the following supplementary components:

Hardware Data Flow

HW Flow Chart, Throughputs and Pushbacks

The hardware architecture features three clock signal domains:

The blue domain is defined based on the SDR GMII interface, which operates at 125 MHz and connects the Realtek PHY controller with the MAC cores on the FPGA.

The red domain encompasses the entire CSR with all peripherals. The clock signal frequency and bus width are defined based on the assumption that Wireguard peers exchange handshake messages sporadically—during connection initialization and periodically, typically every few minutes, for key rotation. Since handshake signaling does not significantly impact network traffic, we decided to implement the connection between the data and control planes without DMA, utilizing direct CPU interaction with Tx/Rx FIFOs through a CSR interface.
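A minimal CPU-side sketch of that DMA-less FIFO access is shown below. The CSR addresses, status-bit layout and helper names are assumptions made purely for illustration; the actual CSR map is generated elsewhere in the project.

#include <cstdint>

// Hypothetical CSR addresses for one Rx FIFO (toward the data plane) -- placeholders only
static constexpr uintptr_t CSR_RX_FIFO_DATA = 0x10002000u;  // write one 32-bit packet word
static constexpr uintptr_t CSR_RX_FIFO_STAT = 0x10002004u;  // assumed: bit 0 = FIFO full

static inline void     reg_wr(uintptr_t a, uint32_t v) { *reinterpret_cast<volatile uint32_t*>(a) = v; }
static inline uint32_t reg_rd(uintptr_t a)             { return *reinterpret_cast<volatile uint32_t*>(a); }

// Push a packet from RAM into the Rx FIFO, one CSR write per 32-bit word (no DMA)
void send_packet_to_dpe(const uint32_t* words, unsigned nwords)
{
    for (unsigned i = 0; i < nwords; i++) {
        while (reg_rd(CSR_RX_FIFO_STAT) & 1u)   // back off while the FIFO reports full
            ;
        reg_wr(CSR_RX_FIFO_DATA, words[i]);
    }
}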

Although the Data Plane Engine (green domain) transfers packets at approximately 10 Gbps, the cores in the DPE pipeline are not expected to process packets at such a rate. Given that we have 4 x 1Gbps Ethernet interfaces, the cryptographic cores must process packets at a rate of at least 4 Gbps to ensure the system works at wire speed. For some components, such as the IP Lookup Engine, packet rate is more critical than data rate because their processing focuses on the packet headers rather than the payload. Assuming that, in the worst-case scenario, the smallest packets (64 bytes) arrive via the 1 Gbps Ethernet interface, the packet rate for each Ethernet interface would be 1,488,096 packets per second (pps). Therefore, in the worst-case scenario, such components must process packets at approximately 6 Mpps rate (e.g. 6 million IP table lookups per second).
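As a back-of-the-envelope cross-check of these figures (assuming the standard 8-byte preamble and 12-byte inter-frame gap on top of a minimum 64-byte frame):

#include <cstdint>
#include <cstdio>

int main()
{
    constexpr uint64_t bits_per_frame = (64 + 8 + 12) * 8;              // 672 bits on the wire per minimum-size frame
    constexpr uint64_t line_rate_bps  = 1000000000ULL;                  // 1 Gbps per Ethernet interface
    constexpr uint64_t pps_per_port   = line_rate_bps / bits_per_frame; // ~1.488 Mpps per port
    constexpr uint64_t pps_total      = 4 * pps_per_port;               // ~5.95 Mpps across 4 x 1 Gbps ports
    std::printf("%llu pps per port, %llu pps total\n",
                (unsigned long long)pps_per_port, (unsigned long long)pps_total);
    return 0;
}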

ExampleTopology

The cores within the DPE transmit packets via the AXI4-Stream interface. Although data transfer on the TDATA bus is organized as little-endian, it is important to note that the internal organization of fields within the headers of Ethernet, IP, and UDP protocols follows big-endian format (also known as network byte order). On the other hand, the fields within the headers of the WireGuard protocol are transmitted in little-endian format.
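As a small illustration of this mixed byte ordering (helper names are ours, not the project's parser code), an Ethernet/IP/UDP header field is read big-endian while a WireGuard header field, such as the 32-bit message type, is read little-endian:

#include <cstdint>

// Network byte order (big-endian), e.g. the 16-bit UDP destination port
static inline uint16_t get_be16(const uint8_t* p)
{
    return uint16_t((uint16_t(p[0]) << 8) | p[1]);
}

// Little-endian, e.g. the 32-bit WireGuard message type field
static inline uint32_t get_le32(const uint8_t* p)
{
    return uint32_t(p[0]) | (uint32_t(p[1]) << 8) | (uint32_t(p[2]) << 16) | (uint32_t(p[3]) << 24);
}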

ExampleTopology

Software Control Flow

SW Flow Chart, Messages and HW Intercepts

During the initial WireGuard handshake and subsequent periodic key rotations, the control plane must update the cryptokey routing table implemented in register memory within the CSR. Since the CSR manages the operation of the DPE, such changes must be made atomically to prevent unpredictable behavior in the DPE.

One way to achieve this is by using write-buffered registers (WBR). However, implementing 1 bit of WBR memory requires three flip-flops: one to store the current value, one to hold the future value, and one for the write-enable signal. Therefore, we consider an alternative mechanism for atomic CSR updates based on flow control between the CPU and the DPE.

Suppose the CPU needs to update the contents of a routing table implemented using many registers. Before starting the update, the CPU must pause packet processing within the DPE. However, such a pause cannot be implemented using the inherent stall mechanism supported by the AXI protocol (by deactivating the TREADY signal at the end of the pipeline), as a packet that has already entered the DPE must be processed according to the rules in effect at the time of its entry. We introduce a graceful flow control mechanism coordinated through a dedicated Flow Control Register (FCR) to address this.

ExampleTopology

The atomic CSR update mechanism works as follows:

  1. When the CPU needs to update the routing table, it activates the PAUSE signal by writing to the FCR.P register.
  2. The active FCR.P signal instructs the input multiplexer to transition into the PAUSED state after completing the servicing of the current queue. The CPU periodically checks the ready bits in the FCR register.
  3. Once the first component finishes processing its packet and clears its datapath, it deactivates the TVALID signal and transitions to the IDLE state. The CPU continues to check the ready bits in the FCR register.
  4. The processing of remaining packets and datapath clearing continues until all components transition to the IDLE state. The CPU monitors the ready bits in the FCR register, which now indicates that the DPE has been successfully paused.
  5. The CPU updates the necessary registers (e.g., the routing table) over multiple cycles.
  6. Upon completing the updates, the CPU deactivates the PAUSE signal (FCR.P).
  7. The multiplexer returns to its default operation mode and begins accepting packets from the next queue in a round-robin fashion.
  8. As packets start arriving, all components within the DPE gradually transition back to their active states.
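From the CPU's perspective, the sequence above boils down to a simple pause/poll/update/resume routine. The sketch below assumes hypothetical register addresses and bit masks (CSR_FCR, FCR_P, FCR_READY_MASK, CSR_ROUTE_TABLE_BASE); the real CSR map is generated from SystemRDL and will differ.

#include <cstdint>

// Hypothetical CSR layout -- placeholders for illustration only
static constexpr uintptr_t CSR_BASE             = 0x10000000u;
static constexpr uintptr_t CSR_FCR              = CSR_BASE + 0x000;   // Flow Control Register
static constexpr uintptr_t CSR_ROUTE_TABLE_BASE = CSR_BASE + 0x100;   // cryptokey routing table registers
static constexpr uint32_t  FCR_P                = 1u << 0;            // PAUSE request bit
static constexpr uint32_t  FCR_READY_MASK       = 0xFFu << 8;         // assumed per-stage IDLE/ready bits

static inline void     csr_write(uintptr_t a, uint32_t v) { *reinterpret_cast<volatile uint32_t*>(a) = v; }
static inline uint32_t csr_read(uintptr_t a)              { return *reinterpret_cast<volatile uint32_t*>(a); }

void atomic_routing_table_update(const uint32_t* entries, unsigned n)
{
    csr_write(CSR_FCR, FCR_P);                                     // 1. assert PAUSE via FCR.P
    while ((csr_read(CSR_FCR) & FCR_READY_MASK) != FCR_READY_MASK)
        ;                                                          // 2-4. poll ready bits until the whole DPE is idle
    for (unsigned i = 0; i < n; i++)
        csr_write(CSR_ROUTE_TABLE_BASE + 4u * i, entries[i]);      // 5. multi-cycle routing-table update
    csr_write(CSR_FCR, 0);                                         // 6. deassert PAUSE; mux resumes round-robin (7-8)
}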

ExampleTopology

HW/SW Working Together as a Coherent System

The example is based on a capture of real Wireguard traffic, recorded and decoded using the Wireshark tool (encrypted and decrypted). The experimental topology consists of four nodes:

ExampleTopology

To illustrate the operation of the system as a whole, we will follow the step-by-step passage of packets through the system in several phases:

Example1

  1. The WireGuard Agent on peer A initiates the establishment of the VPN tunnel by generating the contents of the Handshake Initiation packet.
  2. The CPU transfers the Handshake Initiation packet from RAM to the Rx FIFO via the CSR interface towards the data plane.
  3. Once the entire packet is stored in the Rx FIFO, the Round Robin Multiplexer services the packet from the FIFO and injects it into the data plane pipeline.
  4. In the first three cycles of the packet transfer, the Header Parser extracts important information from the packet header (including the destination IP address and type of WireGuard message) and supplements the extracted metadata to the packet before passing it along. Now, knowing the message type (Handshake Initiation), the WireGuard/UDP Packet Disassembler and ChaCha20-Poly1305 Decryptor forward the packet without any further processing.
  5. The IP Lookup Engine searches the routing table based on the destination IP address and determines the outgoing Ethernet interface, supplementing this information to the packet before forwarding it. Similar to the previous step, the WireGuard/UDP Packet Assembler and ChaCha20-Poly1305 Encryptor forward the packet without any additional processing.
  6. Based on the accompanying metadata, the Demultiplexer directs the packet to the corresponding Tx FIFO (to if1).
  7. Once the entire packet is stored in the Tx FIFO, it is dispatched to the MAC core of the outgoing interface if1, provided that the corresponding 1 Gbps link is active and ready.
  8. The 1G MAC writes its MAC address as the source address, calculates the FCS on the fly, adds it to the end of the Ethernet frame, and sends it to peer B.

Example23

  1. On peer B, the 1G MAC receives the incoming Ethernet frame and calculates the FCS on the fly. If the frame is valid, it is forwarded to the Rx FIFO (from if1).
  2. Once the entire packet is stored, the Rx FIFO signals the Round-Robin Multiplexer.
  3. The Round-Robin Multiplexer services the packet from the FIFO and injects it into the data plane pipeline.
  4. The Header Parser extracts important information from the packet header (including the destination IP address and type of WireGuard message) and supplements the extracted metadata to the packet before passing it along. Now, knowing the message type (Handshake Initiation), the WireGuard/UDP Packet Disassembler and ChaCha20-Poly1305 Decryptor forward the packet without any further processing.
  5. The IP Lookup Engine searches the routing table based on the destination IP address and determines that the control plane is the destination, supplementing this information to the packet before forwarding it. Similar to the previous step, the WireGuard/UDP Packet Assembler and ChaCha20-Poly1305 Encryptor forward the packet without any additional processing.
  6. Based on the accompanying metadata, the Demultiplexer directs the packet to the corresponding Tx FIFO (toward the CPU).
  7. Once the entire packet is stored in the Tx FIFO, the CPU transfers the packet from the FIFO to RAM via the CSR-based interface and hands it over to the WireGuard Agent for further processing.
  8. The WireGuard Agent processes the Handshake Initiation request and generates the Handshake Response.
  9. The Routing DB Updater updates the routing table per the WireGuard Agent's instructions (adding the peer's IP address and WireGuard-related data).
  10. The CPU updates the registers from which the data plane reads the routing table and the corresponding cryptographic keys via the CSR interface.
  11. The CPU transfers the Handshake Response packet from RAM to the Rx FIFO (to the data plane) via the CSR interface.
  12. The Round-Robin Multiplexer services the packet from the FIFO and injects it into the data plane pipeline.
  13. The Header Parser extracts important information from the packet header (including the destination IP address and type of WireGuard message) and supplements the extracted metadata to the packet before passing it along. Now, knowing the message type (Handshake Response), the WireGuard/UDP Packet Disassembler and ChaCha20-Poly1305 Decryptor forward the packet without any further processing.
  14. The IP Lookup Engine searches the routing table based on the destination IP address and determines the outgoing Ethernet interface, supplementing this information to the packet before forwarding it. Similar to the previous step, the WireGuard/UDP Packet Assembler and ChaCha20-Poly1305 Encryptor forward the packet without any additional processing.
  15. Based on the accompanying metadata, the Demultiplexer directs the packet to the corresponding Tx FIFO (toward if1).
  16. Once the entire packet is stored in the Tx FIFO, it is dispatched to the MAC core of the outgoing interface if1, provided that the corresponding 1 Gbps link is active and ready.
  17. The 1G MAC writes its MAC address as the source address, calculates the FCS on the fly, adds it to the end of the Ethernet frame, and sends it to peer A.

Example4

  1. On peer A, the 1G MAC receives the incoming Ethernet frame and calculates the FCS on the fly. If the frame is valid, it is forwarded to the Rx FIFO (from if1).
  2. Once the entire packet is stored, the Rx FIFO signals the Round-Robin Multiplexer.
  3. The Round-Robin Multiplexer services the packet from the FIFO and injects it into the data plane pipeline.
  4. The Header Parser extracts important information from the packet header (including the destination IP address and type of WireGuard message) and supplements the extracted metadata to the packet before passing it along. Now, knowing the message type (Handshake Response), the WireGuard/UDP Packet Disassembler and ChaCha20-Poly1305 Decryptor forward the packet without any further processing.
  5. The IP Lookup Engine searches the routing table based on the destination IP address and determines that the control plane is the destination, supplementing this information to the packet before forwarding it. Similar to the previous step, the WireGuard/UDP Packet Assembler and ChaCha20-Poly1305 Encryptor forward the packet without any additional processing.
  6. Based on the accompanying metadata, the Demultiplexer directs the packet to the corresponding Tx FIFO (toward the CPU).
  7. Once the entire packet is stored in the Tx FIFO, the CPU transfers the packet from the FIFO to RAM via the CSR-based interface and hands it over to the WireGuard Agent for further processing.
  8. The WireGuard Agent processes the Handshake Response.
  9. The Routing DB Updater updates the routing table per the WireGuard Agent's instructions (adding the peer's IP address and WireGuard-related data).
  10. The CPU updates the registers from which the data plane reads the routing table and the corresponding cryptographic keys. The session is now officially established, and the exchange of user data over the encrypted VPN tunnel can commence.

Example5

  1. On peer A, an end-user packet (ICMP Echo Request) arrives via the if2 Ethernet interface. The 1G MAC receives the incoming Ethernet frame and calculates the FCS on the fly. If the frame is valid, it is forwarded to the Rx FIFO (from if2).
  2. Once the entire packet is stored, the Rx FIFO signals the Round-Robin Multiplexer.
  3. The Round-Robin Multiplexer services the packet from the FIFO and injects it into the data plane pipeline.
  4. The Header Parser extracts important information from the packet header (including the destination IP address and protocol type) and supplements the extracted metadata to the packet before passing it along. Now, knowing the protocol type (ICMP), the WireGuard/UDP Packet Disassembler and ChaCha20-Poly1305 Decryptor forward the packet without any further processing.
  5. The IP Lookup Engine searches the routing table based on the destination IP address and determines the target WireGuard peer and the outgoing Ethernet interface, supplementing this information to the packet before forwarding it.
  6. Based on the information about the target peer and the corresponding key, the ChaCha20-Poly1305 Encryptor encrypts the packet and adds an authentication tag.
  7. The WireGuard/UDP Packet Assembler adds WireGuard, UDP, IP, and Ethernet headers filled with the appropriate data to the encrypted packet and forwards it.
  8. Based on the accompanying metadata, the Demultiplexer directs the packet to the corresponding Tx FIFO (toward if1).
  9. Once the entire packet is stored in the Tx FIFO, it is sent to the MAC core of the outgoing interface if1, provided that the corresponding 1 Gbps link is active and ready.
  10. The 1G MAC writes its MAC address as the source, calculates the FCS on the fly, which it ultimately appends to the end of the Ethernet frame, and then sends it to peer B.

Example6

  1. On peer B, the 1G MAC receives the incoming Ethernet frame and calculates the FCS on the fly. If the frame is valid, it is forwarded to the Rx FIFO (from if1).
  2. Once the entire packet is stored, the Rx FIFO signals the Round-Robin Multiplexer.
  3. The Round-Robin Multiplexer services the packet from the FIFO and injects it into the data plane pipeline.
  4. The Header Parser extracts important information from the packet header (including source/destination IP addresses and the type of WireGuard message) and supplements the extracted metadata to the packet before passing it along.
  5. Based on the destination IP address, the WireGuard/UDP Packet Disassembler knows that the packet is intended for this peer, extracting the encrypted payload and forwarding it for further processing.
  6. The ChaCha20-Poly1305 Decryptor decrypts the packet and, after verifying the authentication tag, forwards it further.
  7. The IP Lookup Engine now receives the decrypted plaintext user packet (ICMP Echo Request). It searches the cryptokey routing table based on the source IP address of the decrypted plaintext packet and decides whether to accept or reject the packet. If accepted, the packet is routed accordingly and forwarded.
  8. Based on the accompanying metadata, the Demultiplexer directs the packet to the corresponding Tx FIFO (toward if2).
  9. Once the entire packet is stored in the Tx FIFO, it is sent to the MAC core of the outgoing interface if2, provided that the corresponding 1 Gbps link is active and ready.
  10. The 1G MAC writes its MAC address as the source, calculates the FCS on the fly, which it ultimately appends to the end of the Ethernet frame, and then sends it to the end-user host of peer B.

Simulation Test Bench

References:

The WireGuard test bench aims for a flexible approach to simulation, allowing a common test environment to be used whilst selecting between alternative CPU components, one of which uses the VProc virtual processor co-simulation element. This allows simulations to be fully HDL, with a RISC-V processor RTL implementation such as picoRV32 or EDUBOS5, or to co-simulate software using the virtual processor, with a significant speed-up in simulation times.

The VProc component is wrapped up into a soc_cpu.VPROC component with interfaces identical to the RTL. Some conversion logic is added to this BFM to convert between VProc's generic memory-mapped interface and the soc_if defined interface. This is very lightweight logic, with fewer than ten combinatorial gates to match the control signals. In addition, the soc_cpu.VPROC component has a mem_model component instantiated. This is a 'memory port' to the mem_model C software implementation of a sparse memory model, allowing updates to the RISC-V program if using the rv32 RISC-V ISS model (see below). The diagram below shows a block diagram of the test bench HDL.

Shown in the diagram is the WireGuard top-level component (top), with the soc_cpu.VPROC component instantiated in it as one of three possible devices selectable for the soc_cpu. The IMEM write port is connected to the UART for program updates, and the soc_if from soc_cpu.VPROC is connected to the interconnect fabric (soc_fabric), just as for the two RTL components. The test bench around the top-level WireGuard component has a driver for the UART (bfm_uart), and the four GMII interfaces (including MDIO signalling) connect the WireGuard core to verification IP that drives this signalling. This VIP implementation is TBD, but might be based on a modified tcpIpPg model, which is only XGMII compliant at this time. Finally, the test bench generates the clocks and key-press resets that go to the top level's clk_rst_gen and debounce components.

Auto-selection of soc_cpu Component

The WireGuard top-level component has its required RTL files listed in 1.hw/top.filelist. This includes files for the soc_cpu, under the ip.cpu directory. The simulation build make file (see below) will process the top.filelist file to generate a new local copy, having removed all references to the files under the ip.cpu directory. Since the VProc soc_cpu component is a test model, the soc_cpu.VPROC.sv HDL file is placed in 4.sim/models, whilst the rest of the HDL files come from the VProc and mem_model repositories (auto-checked-out by the make file, if necessary). These are referenced within the make file, along with the other test models used in the test bench. Thus the VProc device is selected as the CPU component for the simulation.

VProc Software

The VProc software consists of DPI-C code for communication and synchronisation with the simulation, for both the memory model and VProc itself. On top of this are the APIs for VProc and mem_model, for use by the running code. In the case of VProc there is a low-level C API or, if preferred, a C++ API. In WireGuard, the VProc soc_cpu is node 0, and so the entry point for user software is VUserMain0, in place of main.

The C++ API is defined in a class VProc (declared in VProcClass.h) from the VProc repository, whose constructor creates an API object, defining the node to which it is connected:

VProc (const unsigned node);

For the C++ VProc API there are two basic word access methods:

    int  write (const unsigned   addr, const unsigned    data, const int delta=0);
    int  read  (const unsigned   addr,       unsigned   *data, const int delta=0);

For these methods, the address argument is agnostic to being a byte address or a word address, but for the WireGuard implementation these are byte addresses. The delta argument is unused in WireGuard, and should be left at its default value, with just the address and data arguments used in the call to these methods. Along with these basic methods is a method to advance simulation time without doing a read or write transaction.

int  tick (const unsigned ticks);

The units of this method's ticks argument are clock cycles, as per the clock to which the VProc HDL component is connected. A basic VProc program, then, is shown below:

#include "VProcClass.h"
extern "C" {
#include "mem.h"
}

static const int node    = 0;

extern "C" void VUserMain0(void)
{   
    // Create VProc access object for this node
    VProc* vp0 = new VProc(node);

    // Wait a bit
    vp0->tick(100);

    uint32_t addr  = 0x10001000;
    uint32_t wdata = 0x900dc0de;

    vp0->write(addr, wdata);
    VPrint("Written   0x%08x  to  addr 0x%08x\n", wdata, addr);

    vp0->tick(3);

    uint32_t rdata;
    vp0->read(addr, &rdata);

    if (rdata == wdata)
    {
        VPrint("Read back 0x%08x from addr 0x%08x\n", rdata, addr);
    }
    else
    {   VPrint("***ERROR: data mis-match at addr = 0x%08x. Got 0x%08x, expected 0x%08x\n", addr, rdata, wdata);
    }

    // Sleep forever
    while(true)
        vp0->tick(GO_TO_SLEEP);
}

The above code is a slightly abbreviated version of the code in 4.sim/usercode. Note that the VUserMain0 function must have C linkage as the VProc software that calls it is in C (as all the programming logic interfaces, including DPI-C, are C). The API also has a set of other methods for finer access control which are listed below, and more details can be found in the VProc manual.

    int  writeByte    (const unsigned   byteaddr, const unsigned    data, const int delta=0);
    int  writeHword   (const unsigned   byteaddr, const unsigned    data, const int delta=0);
    int  writeWord    (const unsigned   byteaddr, const unsigned    data, const int delta=0);
    int  readByte     (const unsigned   byteaddr,       unsigned   *data, const int delta=0);
    int  readHword    (const unsigned   byteaddr,       unsigned   *data, const int delta=0);
    int  readWord     (const unsigned   byteaddr,       unsigned   *data, const int delta=0);

The other methods in this class are not, at this point, used by WireGuard. The methods above can be used to write test code that drives the soc_if bus of the soc_cpu component, and this is the basic way of writing test software. As well as the VProc API, the user software can have direct access to the sparse memory model API by including mem.h, which provides a set of C functions (mem.h must be included as extern "C" in C++ code). The functions relevant to WireGuard are shown below:

void     WriteRamByte  (const uint64_t addr, const uint32_t data, const uint32_t node);
void     WriteRamHWord (const uint64_t addr, const uint32_t data, const int little_endian, const uint32_t node);
void     WriteRamWord  (const uint64_t addr, const uint32_t data, const int little_endian, const uint32_t node);
uint32_t ReadRamByte   (const uint64_t addr, const uint32_t node);
uint32_t ReadRamHWord  (const uint64_t addr, const int little_endian, const uint32_t node);
uint32_t ReadRamWord   (const uint64_t addr, const int little_endian, const uint32_t node);

Note that, as C functions, there are no default parameters, and the little_endian and node arguments must be passed in, even though they are constant. The little_endian argument is non-zero for little endian and zero for big endian. The node argument is not the same as for VProc, but allows multiple separate memory spaces to be modelled, just as VProc allows multiple virtual processor instantiations. For WireGuard, this is always 0. All instantiated mem_model components in the HDL have (through the DPI) access to the same memory space model as the API, and so data can be exchanged between the simulation and the running code, such as RISC-V programs loaded over the IMEM write interface.
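As a minimal usage sketch of this API (the address and data values are arbitrary), pre-loading a word into node 0 of the sparse memory and reading it back might look like:

#include <cstdint>
extern "C" {
#include "mem.h"
}

void preload_example(void)
{
    const uint64_t addr = 0x00000100;                 // arbitrary illustrative address
    WriteRamWord(addr, 0x12345678u, 1 /*little_endian*/, 0 /*node*/);
    uint32_t check = ReadRamWord(addr, 1, 0);         // the same sparse memory is visible to the HDL mem_model ports
    (void)check;
}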

Co-designed application code, whether compiled for the native host machine or to run on the rv32 RISC-V ISS, will need further layers on top of these APIs, which are virtualised away by that point (see the sections below). The diagram below summarises the software layers that make up a program running on the VProc HDL component. The "native test code" use case, shown at the top left, is the case just described above that uses the APIs directly.

Other Software Use Cases

Natively Compiled Application

As well as the native test code case seen in the previous section, the WireGuard application can be compiled natively for the host machine, including the hardware access layer (HAL), generated from SystemRDL. The HAL software output from this is processed (Work in Progress) to generate a version that makes accesses to the VProc and mem_model APIs in place of accesses with pointers to and from memory. The rest of the application software has these details hidden away in the HAL and sees the same API as presented by the auto-generated code. In both cases transactions happen on the soc_if bus port of the soc_cpu component. The main entry point is also swapped for VUserMain0 (method is a Work in Progress).
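One plausible shape for that redirection, purely as a sketch (the VPROC_SIM guard, the vp0 pointer and the accessor names are our assumptions, and the actual generated HAL is still Work in Progress), is a pair of register accessors that forward to the VProc API when built for simulation:

#include <cstdint>
#include "VProcClass.h"

extern VProc* vp0;   // assumed: created once in VUserMain0

static inline uint32_t hal_read32(uint32_t byteaddr)
{
#ifdef VPROC_SIM
    unsigned data;
    vp0->read(byteaddr, &data);          // simulation transaction on the soc_if bus
    return data;
#else
    return *reinterpret_cast<volatile uint32_t*>(static_cast<uintptr_t>(byteaddr));   // direct access on real hardware
#endif
}

static inline void hal_write32(uint32_t byteaddr, uint32_t data)
{
#ifdef VPROC_SIM
    vp0->write(byteaddr, data);
#else
    *reinterpret_cast<volatile uint32_t*>(static_cast<uintptr_t>(byteaddr)) = data;
#endif
}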

RISC-V Compiled Application

To execute RISC-V compiled application code, the rv32 instruction set simulator is used as the code running on the virtual processor. The VUserMain0 program now becomes software that creates an iss object and integrates it with VProc. This uses the ISS's external memory access callback function to direct loads and stores either towards the sparse memory model, to the VProc API for simulation transactions, or back to the ISS itself to handle. This ISS integration VUserMain0 program is located in 4.sim/models/rv32/usercode. When built, the code here is compiled and linked against the pre-built library in 4.sim/models/rv32/lib/librv32lnx.a containing the ISS, with its headers in 4.sim/models/rv32/include (Work in Progress).
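The decision logic in that callback is conceptually as below. This is only an illustration: the real rv32 callback prototype and registration call are defined in the ISS headers under 4.sim/models/rv32/include and will differ from this hypothetical handler.

#include <cstdint>
#include "VProcClass.h"
extern "C" {
#include "mem.h"
}

extern VProc*   vp0;                     // assumed: VProc API object for node 0
extern uint32_t ext_base, ext_top;       // external access region, from vusermain.cfg -x/-X

// Hypothetical handler shape: route a load/store and return true when handled here
bool mem_access(uint32_t byteaddr, uint32_t& data, bool is_write)
{
    if (byteaddr >= ext_base && byteaddr < ext_top) {
        // External region: generate a simulation transaction on the soc_if bus
        if (is_write) { vp0->write(byteaddr, data); }
        else          { unsigned d; vp0->read(byteaddr, &d); data = d; }
    } else {
        // Everything else: use the sparse C memory model directly, with no simulation transaction
        if (is_write) { WriteRamWord(byteaddr, data, 1 /*little_endian*/, 0 /*node*/); }
        else          { data = ReadRamWord(byteaddr, 1, 0); }
    }
    return true;
}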

The ISS supports interrupts, but these are not currently used in WireGuard. The integration software can read a configuration file, called vusermain.cfg, if present in the 4.sim/ directory. This allows the ISS and other features to be configured at run-time. The configuration file stands in for command-line options, and its entries are formatted as if they were such, with a command name matching the VUserMain program:

vusermain0 [options]

One of the options is -h for a help message, which is as shown below:

Usage:vusermain0 -t <test executable> [-hHebdrgxXRcI][-n <num instructions>]
      [-S <start addr>][-A <brk addr>][-D <debug o/p filename>][-p <port num>]
      [-l <line bytes>][-w <ways>][-s <sets>][-j <imem base addr>][-J <imem top addr>]
      [-P <cycles>][-x <base addr>][-X <top addr>][-V <core>]
   -t specify test executable (default test.exe)
   -n specify number of instructions to run (default 0, i.e. run until unimp)
   -d Enable disassemble mode (default off)
   -r Enable run-time disassemble mode (default off. Overridden by -d)
   -C Use cycle count for internal mtime timer (default real-time)
   -a display ABI register names when disassembling (default x names)
   -T Use external memory mapped timer model (default internal)
   -H Halt on unimplemented instructions (default trap)
   -e Halt on ecall instruction (default trap)
   -E Halt on ebreak instruction (default trap)
   -b Halt at a specific address (default off)
   -A Specify halt address if -b active (default 0x00000040)
   -D Specify file for debug output (default stdout)
   -R Dump x0 to x31 on exit (default no dump)
   -c Dump CSR registers on exit (default no dump)
   -g Enable remote gdb mode (default disabled)
   -p Specify remote GDB port number (default 49152)
   -S Specify start address (default 0)
   -I Enable instruction cache timing model (default disabled)
   -l Specify number of bytes in icache line (default 8)
   -w Specify number of ways in icache (default 2)
   -s Specify number of sets in icache (default 256)
   -j Specify cached IMEM base address (default 0x00000000)
   -J Specify cached IMEM top address (default 0x7fffffff)
   -P Specify penalty, in cycles, of one slow mem access (default 4)
   -x Specify base address of external access region (default 0xFFFFFFFF)
   -X Specify top address of external access region (default 0xFFFFFFFF)
   -V Specify RISC-V core timing model to use (default "DEFAULT")
   -h display this help message

With these options the model can load an elf executable to memory directly and be set up with some execution termination conditions. Disassembly output can also be switched on and registers dumped on exit. More details of all these features can be found in the rv32 ISS manual.

Specific to the WireGuard project is the ability to specify the region where memory loads and stores will make external simulation transactions, rather than use internal memory modelling or peripherals, using the -x and -X options. This is useful to allow access to the CSR registers in the HDL whilst mapping all of the memory internally, using the sparse C memory model of mem_model. The cache model can be enabled with the -I option and the cache configured. The -l option specifies the number of bytes in a cache line, which can be 4, 8 or 16. The number of ways is set with -w and can be either 1 or 2, and the number of sets is specified with the -s option and can be 128, 256, 512 or 1024. A set of pre-configured timing models can be specified with the -V option. The argument must be one of the following:

This reflects the available models as detailed in the Configuring ISS timing model section below.
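For example, a vusermain.cfg entry that enables the instruction-cache timing model with a 16-byte line, 2 ways and 512 sets (illustrative values, all within the ranges listed above) might read:

vusermain0 -t test.exe -I -l 16 -w 2 -s 512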

Building and Running Code (Work in Progress)

A MakefileVProc.mk file is provided in the 4.sim/ directory to compile the VProc software and to build and run the test bench HDL. The make file makes use of VProc's own makefile.verilator file to compile all the software for VProc, mem_model and the user code, where the user code is the rv32 ISS code when an ISS build is selected (see make file variables below). The software is compiled into a local static library, libvproc.a which is linked to the simulation code within Verilator.

The make file has a target help, which produces the following output:

make -f MakefileVProc.mk help          Display this message
make -f MakefileVProc.mk               Build C/C++ and HDL code without running simulation
make -f MakefileVProc.mk run           Build and run batch simulation
make -f MakefileVProc.mk rungui/gui    Build and run GUI simulation
make -f MakefileVProc.mk clean         clean previous build artefacts
make -f MakefileVProc.mk deepclean     clean previous build artefacts and checked out repos

Command line configurable variables:
  USER_C:       list of user source code files (default VUserMain0.cpp)
  USRCODEDIR:   directory containing user source code (default $(CURDIR)/usercode)
  OPTFLAG:      Optimisation flag for VProc code (default -g)
  TIMINGOPT:    Verilator timing flags (default --timing)
  TRACEOPTS:    Verilator trace flags (default --trace-fst --trace-structs)
  TOPFILELIST:  RTL file list name (default top.filelist)
  SOCCPUMATCH:  string to match for soc_cpu filtering in h/w file list (default ip.cpu)
  USRSIMOPTS:   additional Verilator flags, such as setting generics (default blank)
  WAVESAVEFILE: name of .gtkw file to use when displaying waveforms (default waves.gtkw)
  BUILD:        Select build type from DEFAULT or ISS (default DEFAULT)
  TIMEOUTUS:    Test bench timeout period in microseconds (default 15000)

By default, without a named target, the simulation executable will be built but not run. With a run target, the simulation executable is built and then executed. To bring up waveforms after the run, a target of rungui or gui can be used. A target of clean removes all intermediate files. A target of deepclean will also remove the VProc and mem_model repositories (actually renaming them with an _old suffix). This is useful for getting any new versions of these repositories that have been specified for WireGuard.

The make file has a set of variables (with default settings) that can be overridden when running make, e.g. make VAR=NewVal. The help output shows these variables with brief descriptions. Entries with multiple values should be enclosed in double quotes. By default, native test code is built, but if BUILD is set to ISS, then the rv32 ISS and VProc program is compiled; in this case the USER_C and USRCODEDIR variables are ignored, as the make file compiles the supplied source code for the ISS.

The user code variables allow different (and multiple) file names from the default, and allow the location of the user code to be changed (if not an ISS build). This makes it possible to run different programs by simply changing these variables, and to organise the different source code into different directories, etc. By default, the VProc code is compiled for debugging (-g), but this can be overridden by changing OPTFLAG. The trace and timing options can also be overridden to allow a faster executable. The WireGuard top.filelist filename can be overridden to allow selection from multiple configurations, if required. The processing of this file to remove the listed soc_cpu HDL files selects on a pattern (ip.cpu), but this can be changed using SOCCPUMATCH. If any additional options for Verilator are required, they can be added to USRSIMOPTS. The GTKWave waveform file can be selected with WAVESAVEFILE.

Control of when the simulation exits can be specified with the TIMEOUTUS variable in units of microseconds.
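For example (file names and directory are illustrative), a native build that picks up different user sources and turns on optimisation could be invoked as:

make -f MakefileVProc.mk USER_C="mytest0.cpp helpers.cpp" USRCODEDIR=$(pwd)/mytests OPTFLAG=-O2 run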

Configuring ISS timing model

Configuration of the timing model can be done from the supplied integration code in VUserMain0.cpp. The main pre_run_setup() function, in VUserMain0.cpp, creates an rv32_timing_config object (rv32_time_cfg) which has an update_timing method that takes a pointer to the iss object and an enumerated type to select the model to use for the particular core timings required. This second argument is selected from one of the following:

As detailed in the RISC-V Compiled Application section above, the ISS can be configured via the vusermain.cfg file using the -V option.

Running ISS code

When the test bench is built for the rv32 ISS, the actual 'user' application code is run on the RISC-V ISS model itself, and is compiled using the normal RISC-V GNU toolchain to produce a binary file that the ISS can load and run. As described above, the code that is run is selected with the vusermain.cfg file and the -t option. The various flags configure the ISS and determine when the ISS is halted (if at all). An example assembly file is provided in 4.sim/models/rv32/riscvtest/main.s (as well as a pre-compiled main.bin). This assembly code reproduces the functionality of the example VUserMain0.cpp program discussed previously, writing to memory, reading back and comparing for a mismatch. The example assembly code is compiled with:

$riscv64-unknown-elf-as.exe -fpic -march=rv32imafdc -aghlms=main.list -o main.o main.s
$riscv64-unknown-elf-ld.exe main.o -Ttext 0 -Tdata 1000 -melf32lriscv -o main.bin

In this instance, the code is compiled to use the MAFDC extensions (integer multiply/divide, atomics, single- and double-precision floating point, and compressed instructions). To run this code, vusermain.cfg is set to:

vusermain0 -x 0x10000000 -X 0x20000000 -rEHRca -t ./models/rv32/riscvtest/main.bin

This sets the address region that will be sent to the HDL soc_cpu bus to be between byte addresses 0x10000000 and 0x1FFFFFFF. All other accesses will use the direct memory model's API, with no simulation transactions. The next set of options turns on run-time disassembly (-r), exit on ebreak (-E) or on an unimplemented instruction (-H), dumping of the general-purpose registers (-R) and CSRs (-c) on exit, and display of registers with their ABI names (-a). The pre-compiled example program binary is then selected with the -t option. Of course, many of these options are not necessary and, for example, the output flags (-rRca) can be removed and the program will still run correctly. In the 4.sim/ directory, using make to build and run the code gives the following output:

$make -f MakefileVProc.mk BUILD=ISS run
- V e r i l a t i o n   R e p o r t: Verilator 5.024 2024-04-05 rev v5.024-42-gc561fe8ba
- Verilator: Built from 2.145 MB sources in 40 modules, into 0.556 MB in 20 C++ files needing 0.001 MB
- Verilator: Walltime 0.298 s (elab=0.020, cvt=0.087, bld=0.000); cpu 0.000 s on 1 threads; alloced 14.059 MB
Archive ar -rcs Vtb__ALL.a Vtb__ALL.o
VInit(0): initialising DPI-C interface
  VProc version 1.11.4. Copyright (c) 2004-2024 Simon Southwell.
                   0 TOP.tb.error_mon (0) - ERROR_CLEARED

  ******************************
  *   Wyvern Semiconductors    *
  * rv32 RISC-V ISS (on VProc) *
  *     Copyright (c) 2024     *
  ******************************

00000000: 0x00001197    auipc     gp, 0x00000001
00000004: 0x0101a183    lw        gp, 16(gp)
00000008: 0x0001a103    lw        sp, 0(gp)
0000000c: 0x10001237    lui       tp, 0x00010001
00000010: 0x00222023    sw        sp, 0(tp)
00000014: 0x00022283    lw        t0, 0(tp)
00000018: 0x00229663    bne       t0, sp, 12
0000001c: 0x00004505'   addi      a0, zero, 1
0000001e: 0x00004501'   addi      a0, zero, 0
00000020: 0x05d00893    addi      a7, zero, 93
00000024: 0x00009002'   ebreak   
    *

Register state:

  zero = 0x00000000   ra = 0x00000000   sp = 0x900dc0de   gp = 0x00001000 
    tp = 0x10001000   t0 = 0x900dc0de   t1 = 0x00000000   t2 = 0x00000000 
    s0 = 0x00000000   s1 = 0x00000000   a0 = 0x00000000   a1 = 0x00000000 
    a2 = 0x00000000   a3 = 0x00000000   a4 = 0x00000000   a5 = 0x00000000 
    a6 = 0x00000000   a7 = 0x0000005d   s2 = 0x00000000   s3 = 0x00000000 
    s4 = 0x00000000   s5 = 0x00000000   s6 = 0x00000000   s7 = 0x00000000 
    s8 = 0x00000000   s9 = 0x00000000  s10 = 0x00000000  s11 = 0x00000000 
    t3 = 0x00000000   t4 = 0x00000000   t5 = 0x00000000   t6 = 0x00000000 

CSR state:

  mstatus    = 0x00003800
  mie        = 0x00000000
  mvtec      = 0x00000000
  mscratch   = 0x00000000
  mepc       = 0x00000000
  mcause     = 0x00000000
  mtval      = 0x00000000
  mip        = 0x00000000
  mcycle     = 0x0000000000000037
  minstret   = 0x000000000000000b
  mtime      = 0x0006263f2bfc6bcf
  mtimecmp   = 0xffffffffffffffff
Exited running ./models/rv32/riscvtest/main.bin
- /mnt/hgfs/winhome/simon/git/wireguard-fpga/4.sim/tb.sv:44: Verilog $finish

Note that the disassembled output is a mixture of 32-bit and compressed 16-bit instructions, with compressed instructions shown with a ' character after the hexadecimal value, and the instruction held in the lower 16 bits. Unlike the natively compiled code use cases, the test bench does not need to be re-built when the RISC-V source code is changed or a different binary is to be run (unless the HDL has changed); only the RISC-V code is re-compiled, or vusermain.cfg is updated to point to a different binary file.

Debugging Code

Each of the three software use cases can be debugged using gdb, either the host computer's gdb or the GNU RISC-V toolchain's gdb.

Natively Compiled code

For natively compiled code, whether test code or natively compiled application code, as long as it was compiled with the -g flag set (see above for make file options), the Verilator-compiled simulation is an executable file (compiled into an output/ directory) that contains all the compiled user code. Therefore, to debug using gdb, this executable just needs to be run with the host computer's gdb. E.g., from the 4.sim/ directory:

gdb output/Vtb

Debugging then proceeds just as for any other executable.

ISS Software

The ISS has a remote gdb interface (enabled with the -g option in the vusermain.cfg file), allowing programs to be loaded via this connection and all the normal debugging steps to be performed on the RISC-V code. The ISS manual details how to use the gdb remote debug interface but, to summarise, when the ISS is run in GDB mode, it will create a TCP socket and advertise the port number to the screen (e.g. RV32GDB: Using TCP port number: 49152). The RISC-V gdb is then run, and a remote connection is made with a command:

 (gdb) target remote :49152

A blank before the colon character in the port number indicates that the connection is on the local host, but a remote host name can be used to do remote debugging from another machine on the network, or even over the internet, given sufficient access permissions. The program (if not loaded by other means) can be loaded over this connection, and then debugging commences as normal.

The ISS manual has more details on this and also has an appendix showing how to setup an Eclipse IDE project to debug the code via gdb.

Lab Test and Validation Setup

TODO

Shared Linux Server with tools

WIP

Tool Versions

Simulation

Note: the test bench make file (4.sim/MakefileVProc.mk) will check out VProc and mem_model to the specified revisions when not already present. The rv32 ISS has a pre-compiled library and associated headers already present in the 4.sim/models/rv32 directory at the specified version.

Build process

Hardware

TODO

Software

TODO

CPU Live debug and reload

TODO

Closing Notes

TODO

Acknowledgements

We are grateful to NLnet Foundation for their sponsorship of this development activity.

NGI-Entrust-Logo

Public posts:

End of Document