chili-chips-ba opened this issue 2 weeks ago
Here are a few relevant excerpts from other discussion threads on this question:
----1----
"... the most important part is the atomicity of one CryptoKey routing table record. The table will contain only
16 such records in the first, POC version of the project
. Each record describes the configuration of one WireGuard peer, and will store data in registers of various lengths:
- MAC addresses: 48 bits
- IPv4 network address: 32-bit address + 32-bit mask
- CryptoKeys: 32-bit
- Endpoint address: 32-bit
- UDP port: 16-bit
- Performance counters, etc.
The rough guesstimate is that one record is 200-300 bits, which we can refer to as a "super-register". This super-register MUST be updated atomically. However, while as a bare minimum we must provide the atomicity of accessing each individual super-register, the entire routing table should ideally also be updated atomically. That would then allow the CPU to sort the records after each table update, such as by prefix length, thus enabling the Longest Prefix Match (LPM) lookup algorithm. There are other lookup algorithms, e.g. Balanced Binary Tree (BBT). For efficient searches, they all require some kind of table organization and pre-sorting. To simplify things, we assume that our device will not work as a standard IP router/switch, but only as a WireGuard endpoint. We thus eliminate the need for MAC tables, but we still have to implement a limited version of the MAC learning and ARP protocols.
Once we fully work out the DPE (Data Plane Engine) components, we will know the exact dimensions of the super-registers, as well as the purpose of their fields..."
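
To make the record layout above concrete, here is a sketch of one such record as a C structure. The field names, their ordering, and the single example counter are placeholders until the DPE components are worked out; this particular arrangement lands at 224 bits, within the 200-300 bit guesstimate.

```c
#include <stdint.h>

/* Hypothetical layout of one CryptoKey routing table record ("super-register").
 * Widths follow the estimate above; names and field order are placeholders. */
typedef struct {
    uint8_t  peer_mac[6];      /* MAC address, 48 bits                 */
    uint16_t udp_port;         /* endpoint UDP port, 16 bits           */
    uint32_t ipv4_addr;        /* IPv4 network address, 32 bits        */
    uint32_t ipv4_mask;        /* IPv4 network mask, 32 bits           */
    uint32_t crypto_key;       /* CryptoKey handle/index, 32 bits      */
    uint32_t endpoint_addr;    /* endpoint address, 32 bits            */
    uint32_t pkt_counter;      /* example performance counter, 32 bits */
} wg_peer_record_t;            /* 224 bits = 28 bytes of control data  */

/* The POC routing table holds 16 such records. */
#define WG_NUM_PEERS 16
```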
----2----
"... putting aside the maximalist goal for atomic access of the entire table, the access atomicity of an individual super-register can be accomplished by providing one common/shared set of staging registers which are updated in background. They are used to construct a 200/300-bit super-word using 8, 16, or 32-bit CPU transactions. A final register write would then transfer this staged value into a designated super-register target.
SystemRDL supports "external registers", the facility we can employ for building such staging "hold" register. It would require some custom logic outside the auto-generated register RTL, but the main cost is the additional 200/300 bits for holding the staged value..."
----3----
"... while in the POC Phase 1 we go for only 16 super-registers, what is a reasonable goal for the full Wireguard product version? This question ties into the target capacity of Routing Processor
----4----
"... pipeline stalling mechanism should be viewed more as safety than a hazard. Perhaps term "stalling" isn't appropriate. This is rather a
flow control
handshake between control and data planes.When there is a need to update the DPE tables, the CPU would first assert a Pause signal. That notifies the DPE to:
- stop accepting new packets
- gracefully complete all started jobs
- and clear the pipeline.
When the DPE has cleared the pipeline, i.e. entered the Pause state, it returns an acknowledge to the CPU, which can then start updating and sorting the DPE tables. When the CPU has finished this job, it shall deactivate the Pause signal and data flow may continue. The Pause handshake may be implemented through dedicated registers within the CSR.
Let's note that Pause-based Flow Control is not foreign to Ethernet. On the contrary, it is a part of the standard. Ethernet is also tolerant of packet loss -- a few drops here and there are not necessarily bad..."
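
In driver terms, the Pause handshake sketched above might look like this. The CSR addresses, bit positions, and the polling bound are assumptions for illustration; the real Pause/Ack bits would live in the dedicated CSR registers mentioned above.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical flow-control CSR layout -- for illustration only. */
#define CSR_FC_CTRL    0x40000100u     /* bit 0: PAUSE request          */
#define CSR_FC_STAT    0x40000104u     /* bit 0: DPE acknowledges PAUSE */
#define FC_PAUSE       (1u << 0)
#define FC_PAUSED_ACK  (1u << 0)

static inline void csr_wr32(uint32_t a, uint32_t v) { *(volatile uint32_t *)(uintptr_t)a = v; }
static inline uint32_t csr_rd32(uint32_t a)         { return *(volatile uint32_t *)(uintptr_t)a; }

/* Pause the data plane, run the table update, then resume.
 * update_fn() is the caller-supplied routine that rewrites and re-sorts
 * the routing table while the DPE pipeline is drained.                  */
bool wg_update_tables_paused(void (*update_fn)(void))
{
    csr_wr32(CSR_FC_CTRL, FC_PAUSE);                     /* 1. request pause           */

    unsigned t = 0;
    while (!(csr_rd32(CSR_FC_STAT) & FC_PAUSED_ACK)) {   /* 2. wait for pipeline drain */
        if (++t > 1000000u) {
            csr_wr32(CSR_FC_CTRL, 0);                    /* timeout: withdraw request  */
            return false;
        }
    }

    update_fn();                                         /* 3. update and sort tables  */
    csr_wr32(CSR_FC_CTRL, 0);                            /* 4. release pause, resume   */
    return true;
}
```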
----5----
"... from this description, it sounds like the "super register" records, that is the routing table, might be better off for implementation in on-chip BRAM than in individual flops. Also, since our DPE is 4-threaded, and it will never need all 4 tables at the same time, which plays into hand of a RAM-based solution..."
I have updated README.md to include a description of the atomic CSR update mechanism. Here I will also present a draft of latency calculations and buffer sizing.
If we assume that the DPE just started processing a 1500B frame, PAUSE setup latency at the multiplexer can be expressed as:
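
One way to bound this term, assuming the multiplexer must first let the 1500 B frame already in flight drain at the 1 Gbps line rate (one byte every $t_B = 8\,\mathrm{ns}$):

$$T_{SETUP} \le L_{frame} \cdot t_B = 1500\,\mathrm{B} \times 8\,\mathrm{ns/B} = 12\,\mu\mathrm{s}$$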
Furthermore, the PAUSE to READY latency can be expressed separately for the cut-through pipeline and for the store-and-forward DPE (the worst-case scenario). Hence, the PAUSE to READY latency can be calculated as follows:
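
A sketch of both cases, with assumed symbols: $N_{stg}$ is the DPE pipeline depth in stages, $T_{clk}$ its clock period, $L_{frame} = 1500\,\mathrm{B}$, and $t_B = 8\,\mathrm{ns}$ the byte time at 1 Gbps; CT stands for cut-through, SF for store-and-forward.

$$T_{P2R}^{CT} \approx N_{stg} \cdot T_{clk} \qquad\qquad T_{P2R}^{SF} \approx L_{frame} \cdot t_B + N_{stg} \cdot T_{clk}$$

$$T_{P2R} \le T_{P2R}^{SF} \approx 12\,\mu\mathrm{s} + N_{stg} \cdot T_{clk}$$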
Update latency per byte of written CSR data if a 32-bit bus is used:
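
As an illustration, if one 32-bit bus write transaction takes $T_{wr}$ (an assumed, implementation-dependent value), the per-byte figure is:

$$t_{CSR/B} = \frac{T_{wr}}{4\,\mathrm{B}}$$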
Given that the Routing Table contains 16 entries of 300 bytes each, the update latency can be calculated as follows:
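
Using the per-byte figure above:

$$T_{TABLE} = 16 \times 300\,\mathrm{B} \times t_{CSR/B} = 4800\,\mathrm{B} \cdot t_{CSR/B}$$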
Finally, the total CSR update latency can be calculated as follows:
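
Summing the terms sketched above:

$$T_{CSR} = T_{SETUP} + T_{P2R} + T_{TABLE}$$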
Since no packets are received in the DPE during the FCR handshake procedure, from the establishment of the pause until the completion of the CSR update, it is necessary to size the input FIFOs to at least:
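
With traffic still arriving at the full 1 Gbps line rate (one byte every $t_B = 8\,\mathrm{ns}$) on each port during that window:

$$S_{FIFO} \ge \frac{T_{CSR}}{t_B} \quad \mathrm{bytes\ per\ port}$$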
Given that there are four Rx FIFOs connected to 1GbE interfaces, the total required capacity is equal to:
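
Under the same assumptions, across the four ports:

$$S_{TOTAL} = 4 \cdot S_{FIFO} \ge 4 \cdot \frac{T_{CSR}}{t_B}$$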
As the CSR update latency has the biggest contribution to the total latency, we can conclude that the additional consumption of BRAM (needed for Rx FIFOs) is approximately equal to 2 bytes of memory for each byte of written CSR data.
If we want to support Jumbo frames, then each Rx FIFO must be at least 9000 bytes in size, which gives us a huge margin for CSR update latency (4-5 times more than needed).
Considering the large number of variables that will depend on the implementation details, I will make a spreadsheet for easier calculation of expected latency and necessary buffer sizes.
The number of VPN peers is an important element in the above calculations. It affects both the expected latency and the necessary buffer sizes.
The 16 peers that our first POC prototype will be built for is far below the number that real-life applications need.
While a spreadsheet is a great idea that will facilitate design space exploration, could we, for the sake of argument, also specify a ballpark number of peers that commercial products of this kind typically have, or need?!
Commercial-grade devices in this range (1 Gbps) support up to 20-25 peers.
From what I have observed of users' experiences, the needs are mostly up to 100 peers, so commercial vendors are working to support that.
We are looking for a way to ensure atomic writes to registers wider than 32 bits (which is our CPU access bus).
Some control structures (such as the CryptoKey routing table) must be updated as a whole, as a partial update could cause unexpected behavior in the Data Plane Engine (DPE).
SystemRDL supports write-buffered registers. However, they seem inefficient in terms of FPGA resources.
Consider, for example, a table of 16 records, each 224 bits in size, for a total of 3584 flops -- that is, control bits from the DPE viewpoint. However, the SystemRDL-generated RTL output expends almost 3x that number of flops:
- "live" flops for the values that go out to the DPE
- "shadow" flops that hold the write-buffered value
- "bit-enable" flops.
Granted, these per-bit enables can be rationalized, and even completely eliminated when full 32-bit writes (without even the byte enables) are acceptable.
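
As a quick sanity check of the 3x figure, counting the three flop groups above (the exact count will depend on the generator and its options):

$$N_{flops} \approx 3 \times 16 \times 224 = 10752 \quad \mathrm{vs.} \quad 16 \times 224 = 3584 \ \mathrm{control\ bits}$$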
For maximum resource utilization, we are most seriously considering the option of stalling the DPE while the CPU is updating its control registers. That would ensure access atomicity using ordinary, i.e. the least expensive, SystemRDL reg types. Granted, this flop gain would be paid in BRAM loss, as the depth of the Rx FIFOs would now need to be enlarged to absorb the slack while the DPE processing is held off.
As an alternative, we are also assessing the cost of the register bank swap method. This is similar to the Z80 register swaps. The first thought is to have SystemRDL create two full sets of ordinary registers, A and B, and then implement a mux in our RTL, outside of SystemRDL, so that an A<->B swap can be instantly executed on CPU command. Ideally, SystemRDL would introduce a Swap register type, so that the user can define only one set, and the flow under the hood creates the other set, along with a mux and a register for selecting the bank.
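
For the bank-swap alternative, the CPU-side sequence could be as simple as the sketch below. The two bank apertures and the BANK_SEL register are assumptions for illustration, with the mux and the select flop living in hand-written RTL outside of SystemRDL.

```c
#include <stdint.h>

/* Hypothetical CSR map for the A/B bank-swap scheme -- illustration only. */
#define CSR_TABLE_A    0x40001000u     /* full routing table, bank A          */
#define CSR_TABLE_B    0x40002000u     /* full routing table, bank B          */
#define CSR_BANK_SEL   0x40003000u     /* 0: DPE uses bank A, 1: DPE uses bank B */

static inline void csr_wr32(uint32_t a, uint32_t v) { *(volatile uint32_t *)(uintptr_t)a = v; }
static inline uint32_t csr_rd32(uint32_t a)         { return *(volatile uint32_t *)(uintptr_t)a; }

/* Rewrite the whole table in the inactive bank with ordinary 32-bit writes,
 * then flip the mux with a single write -- the DPE sees the new table
 * atomically, without pausing the data plane.                              */
void wg_swap_update_table(const uint32_t *table_words, unsigned nwords)
{
    uint32_t active   = csr_rd32(CSR_BANK_SEL) & 1u;
    uint32_t inactive = active ^ 1u;
    uint32_t base     = inactive ? CSR_TABLE_B : CSR_TABLE_A;

    for (unsigned i = 0; i < nwords; i++)
        csr_wr32(base + 4u * i, table_words[i]);

    csr_wr32(CSR_BANK_SEL, inactive);   /* single write = instant A<->B swap */
}
```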