chili-chips-ba opened this issue 2 weeks ago
Here are a few relevant excerpts from other discussion threads on this question:
----1----
"... the most important part is the atomicity of one CryptoKey routing table record. The table will contain only
16 such records in the first, POC version of the project
. Each record describes the configuration of one WireGuard peer, and will store data in registers of various lengths:
- MAC addresses: 48 bits
- IPv4 network address: 32-bit address + 32-bit mask
- CryptoKeys: 32-bit
- Endpoint address: 32-bit
- UDP port: 16-bit
- Performance counters, etc.
The rough guesstimate is that one record is 200-300 bits, which we can refer to as a "super-register". This super-register MUST be updated atomically. However, while as a bare minimum we must provide the atomicity of accessing each individual super-register, the entire routing table should ideally also be updated atomically. That would then allow the CPU to sort the records after each table update, such as by prefix length, thus enabling the Longest Prefix Match (LPM) lookup algorithm. There are other lookup algorithms, e.g. Balanced Binary Tree (BBT). For efficient searches, they all require some kind of table organization and pre-sorting. To simplify things, we assume that our device will not work as a standard IP router/switch, but only as a WireGuard endpoint. We thus eliminate the need for MAC tables, but we still have to implement a limited version of the MAC learning and ARP protocols.
Once we fully work out the DPE (Data Plane Engine) components, we will know the exact dimensions of the super-registers, as well as the purpose of their fields..."
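
To make the record layout above concrete, here is a sketch of one such record as a C structure. The field names, their ordering, and the single example counter are placeholders until the DPE components are worked out; this particular arrangement lands at 224 bits, within the 200-300 bit guesstimate.

```c
#include <stdint.h>

/* Hypothetical layout of one CryptoKey routing table record ("super-register").
 * Widths follow the estimate above; names and field order are placeholders. */
typedef struct {
    uint8_t  peer_mac[6];      /* MAC address, 48 bits                 */
    uint16_t udp_port;         /* endpoint UDP port, 16 bits           */
    uint32_t ipv4_addr;        /* IPv4 network address, 32 bits        */
    uint32_t ipv4_mask;        /* IPv4 network mask, 32 bits           */
    uint32_t crypto_key;       /* CryptoKey handle/index, 32 bits      */
    uint32_t endpoint_addr;    /* endpoint address, 32 bits            */
    uint32_t pkt_counter;      /* example performance counter, 32 bits */
} wg_peer_record_t;            /* 224 bits = 28 bytes of control data  */

/* The POC routing table holds 16 such records. */
#define WG_NUM_PEERS 16
```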
----2----
"... putting aside the maximalist goal for atomic access of the entire table, the access atomicity of an individual super-register can be accomplished by providing one common/shared set of staging registers which are updated in background. They are used to construct a 200/300-bit super-word using 8, 16, or 32-bit CPU transactions. A final register write would then transfer this staged value into a designated super-register target.
SystemRDL supports "external registers", the facility we can employ for building such staging "hold" register. It would require some custom logic outside the auto-generated register RTL, but the main cost is the additional 200/300 bits for holding the staged value..."
----3----
"... while in the POC Phase 1 we go for only 16 super-registers, what is a reasonable goal for the full Wireguard product version? This question ties into the target capacity of Routing Processor
----4----
"... pipeline stalling mechanism should be viewed more as safety than a hazard. Perhaps term "stalling" isn't appropriate. This is rather a
flow control
handshake between control and data planes.When there is a need to update the DPE tables, the CPU would first assert a Pause signal. That notifies the DPE to:
- stop accepting new packets
- gracefully complete all started jobs
- and clear the pipeline.
When the DPE has cleared the pipeline, i.e. entered the Pause state, it returns an acknowledge to the CPU, which can then start updating and sorting the DPE tables. When the CPU has finished this job, it shall deactivate the Pause signal and data flow may continue. The Pause handshake may be implemented through dedicated registers within the CSR.
Let's note that Pause-based Flow Control is not foreign to Ethernet. On the contrary, it is a part of the standard. Ethernet is also tolerant of packet loss -- a few drops here and there are not necessarily bad..."
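
In driver terms, the Pause handshake sketched above might look like this. The CSR addresses, bit positions, and the polling bound are assumptions for illustration; the real Pause/Ack bits would live in the dedicated CSR registers mentioned above.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical flow-control CSR layout -- for illustration only. */
#define CSR_FC_CTRL    0x40000100u     /* bit 0: PAUSE request          */
#define CSR_FC_STAT    0x40000104u     /* bit 0: DPE acknowledges PAUSE */
#define FC_PAUSE       (1u << 0)
#define FC_PAUSED_ACK  (1u << 0)

static inline void csr_wr32(uint32_t a, uint32_t v) { *(volatile uint32_t *)(uintptr_t)a = v; }
static inline uint32_t csr_rd32(uint32_t a)         { return *(volatile uint32_t *)(uintptr_t)a; }

/* Pause the data plane, run the table update, then resume.
 * update_fn() is the caller-supplied routine that rewrites and re-sorts
 * the routing table while the DPE pipeline is drained.                  */
bool wg_update_tables_paused(void (*update_fn)(void))
{
    csr_wr32(CSR_FC_CTRL, FC_PAUSE);                     /* 1. request pause           */

    unsigned t = 0;
    while (!(csr_rd32(CSR_FC_STAT) & FC_PAUSED_ACK)) {   /* 2. wait for pipeline drain */
        if (++t > 1000000u) {
            csr_wr32(CSR_FC_CTRL, 0);                    /* timeout: withdraw request  */
            return false;
        }
    }

    update_fn();                                         /* 3. update and sort tables  */
    csr_wr32(CSR_FC_CTRL, 0);                            /* 4. release pause, resume   */
    return true;
}
```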
----5----
"... from this description, it sounds like the "super register" records, that is the routing table, might be better off for implementation in on-chip BRAM than in individual flops. Also, since our DPE is 4-threaded, and it will never need all 4 tables at the same time, which plays into hand of a RAM-based solution..."
I have updated README.md to include a description of the atomic CSR update mechanism. Here I will also present a draft of latency calculations and buffer sizing.
If we assume that the DPE just started processing a 1500B frame, PAUSE setup latency at the multiplexer can be expressed as:
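
One way to bound this term, assuming the multiplexer must first let the 1500 B frame already in flight drain at the 1 Gbps line rate (one byte every $t_B = 8\,\mathrm{ns}$):

$$T_{SETUP} \le L_{frame} \cdot t_B = 1500\,\mathrm{B} \times 8\,\mathrm{ns/B} = 12\,\mu\mathrm{s}$$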
Furthermore, the PAUSE to READY latency can be expressed separately for the cut-through pipeline and for the store-and-forward DPE (the worst-case scenario). Hence, the PAUSE to READY latency can be calculated as follows:
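
A sketch of both cases, with assumed symbols: $N_{stg}$ is the DPE pipeline depth in stages, $T_{clk}$ its clock period, $L_{frame} = 1500\,\mathrm{B}$, and $t_B = 8\,\mathrm{ns}$ the byte time at 1 Gbps; CT stands for cut-through, SF for store-and-forward.

$$T_{P2R}^{CT} \approx N_{stg} \cdot T_{clk} \qquad\qquad T_{P2R}^{SF} \approx L_{frame} \cdot t_B + N_{stg} \cdot T_{clk}$$

$$T_{P2R} \le T_{P2R}^{SF} \approx 12\,\mu\mathrm{s} + N_{stg} \cdot T_{clk}$$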
Update latency per byte of written CSR data if a 32-bit bus is used:
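
As an illustration, if one 32-bit bus write transaction takes $T_{wr}$ (an assumed, implementation-dependent value), the per-byte figure is:

$$t_{CSR/B} = \frac{T_{wr}}{4\,\mathrm{B}}$$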
Given that the Routing Table contains 16 entries of 300 bytes each, the update latency can be calculated as follows:
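
Using the per-byte figure above:

$$T_{TABLE} = 16 \times 300\,\mathrm{B} \times t_{CSR/B} = 4800\,\mathrm{B} \cdot t_{CSR/B}$$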
Finally, the total CSR update latency can be calculated as follows:
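
Summing the terms sketched above:

$$T_{CSR} = T_{SETUP} + T_{P2R} + T_{TABLE}$$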
Since no packets are received in the DPE during the FCR handshake procedure, from the establishment of the pause until the completion of the CSR update, it is necessary to size the input FIFOs to at least:
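
With traffic still arriving at the full 1 Gbps line rate (one byte every $t_B = 8\,\mathrm{ns}$) on each port during that window:

$$S_{FIFO} \ge \frac{T_{CSR}}{t_B} \quad \mathrm{bytes\ per\ port}$$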
Given that there are four Rx FIFOs connected to 1GbE interfaces, the total required capacity is equal to:
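
Under the same assumptions, across the four ports:

$$S_{TOTAL} = 4 \cdot S_{FIFO} \ge 4 \cdot \frac{T_{CSR}}{t_B}$$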
As the CSR update latency has the biggest contribution to the total latency, we can conclude that the additional consumption of BRAM (needed for Rx FIFOs) is approximately equal to 2 bytes of memory for each byte of written CSR data.
If we want to support Jumbo frames, then each Rx FIFO must be at least 9000 bytes in size, which gives us a huge margin for CSR update latency (4-5 times more than needed).
Considering the large number of variables that will depend on the implementation details, I will make a spreadsheet for easier calculation of expected latency and necessary buffer sizes.
The number of VPN peers is an important element in the above calculations. It affects both the expected latency and the necessary buffer sizes.
The 16 peers that our first POC prototype will be built for is far below the number that real-life applications need.
While a spreadsheet is a great idea that will facilitate design space exploration, could we, for the sake of argument, also specify a ballpark number of peers that commercial products of this kind typically have, or need?!
Commercial-grade devices in this range (1 Gbps) support up to 20-25 peers.
From what I have observed of users' experiences, the needs are mostly up to 100 peers, so commercial vendors are working to support that.
We are looking for a way to ensure atomic writes to registers wider than 32 bits (which is our CPU access bus).
Some control structures (such as the CryptoKey routing table) must be updated as a whole, as a partial update could cause unexpected behavior in the Data Plane Engine (DPE).
SystemRDL supports write-buffered registers. However, they seem inefficient in terms of FPGA resources.
Consider, for example, a table of 16 records, each 224 bits in size, for a total of 3584 flops -- that is, control bits from the DPE viewpoint. However, the SystemRDL-generated RTL output expends almost 3x that number of flops:
- "live" flops for the values that go out to the DPE
- "shadow" flops that hold the write-buffered value
- "bit-enable" flops.
Granted, these per-bit enables can be rationalized, and even completely eliminated when full 32-bit writes (without even the byte enables) are acceptable.
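
As a quick sanity check of the 3x figure, counting the three flop groups above (the exact count will depend on the generator and its options):

$$N_{flops} \approx 3 \times 16 \times 224 = 10752 \quad \mathrm{vs.} \quad 16 \times 224 = 3584 \ \mathrm{control\ bits}$$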
For maximum resource utilization, we are most seriously considering the option of stalling the DPE while the CPU is updating its control registers. That would ensure access atomicity using ordinary, i.e. the least expensive, SystemRDL reg types. Granted, this flop gain would be paid in BRAM loss, as the depth of the Rx FIFOs would now need to be enlarged to absorb the slack while the DPE processing is held off.
As an alternative, we are also assessing the cost of the register bank swap method. This is similar to the Z80 register swaps. The first thought is to have SystemRDL create two full sets of ordinary registers, A and B, and then implement a mux in our RTL, outside of SystemRDL, so that an A<->B swap can be instantly executed on CPU command. Ideally, SystemRDL would introduce a Swap register type, so that the user can define only one set, and the flow under the hood creates the other set, along with a mux and a register for selecting the bank.
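
For the bank-swap alternative, the CPU-side sequence could be as simple as the sketch below. The two bank apertures and the BANK_SEL register are assumptions for illustration, with the mux and the select flop living in hand-written RTL outside of SystemRDL.

```c
#include <stdint.h>

/* Hypothetical CSR map for the A/B bank-swap scheme -- illustration only. */
#define CSR_TABLE_A    0x40001000u     /* full routing table, bank A          */
#define CSR_TABLE_B    0x40002000u     /* full routing table, bank B          */
#define CSR_BANK_SEL   0x40003000u     /* 0: DPE uses bank A, 1: DPE uses bank B */

static inline void csr_wr32(uint32_t a, uint32_t v) { *(volatile uint32_t *)(uintptr_t)a = v; }
static inline uint32_t csr_rd32(uint32_t a)         { return *(volatile uint32_t *)(uintptr_t)a; }

/* Rewrite the whole table in the inactive bank with ordinary 32-bit writes,
 * then flip the mux with a single write -- the DPE sees the new table
 * atomically, without pausing the data plane.                              */
void wg_swap_update_table(const uint32_t *table_words, unsigned nwords)
{
    uint32_t active   = csr_rd32(CSR_BANK_SEL) & 1u;
    uint32_t inactive = active ^ 1u;
    uint32_t base     = inactive ? CSR_TABLE_B : CSR_TABLE_A;

    for (unsigned i = 0; i < nwords; i++)
        csr_wr32(base + 4u * i, table_words[i]);

    csr_wr32(CSR_BANK_SEL, inactive);   /* single write = instant A<->B swap */
}
```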