Open ZipCPU opened 5 days ago
One drawback of this approach is that it doesn't directly allow address modification. This is great for switches, which don't need to modify their network addresses, but not so great for routers between networks. Hence, if a packet comes into the router from IP1 on network 1 destined for IP2 on network 2, then the network addresses should be changed as the packet switches from network 1 to network 2. Specifically, the network addresses (i.e., the Ethernet MACs) from the prior link need to be adjusted to those of the new link, where the source would now be the router's MAC address, and the destination would be that of IP2 (assuming it exists on network 2).
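As a concrete sketch of that rewrite step (the struct layout and the `rewrite_macs` helper are illustrative, not part of the proposed hardware):

```c
#include <stdint.h>
#include <string.h>

/* Ethernet header: destination MAC, source MAC, EtherType. */
typedef struct {
    uint8_t  dst[6];
    uint8_t  src[6];
    uint16_t ethertype;
} eth_hdr_t;

/* When forwarding from network 1 to network 2, the addresses are
 * rewritten: the source becomes the router's own MAC on the egress
 * port, and the destination becomes the resolved MAC of the next hop
 * (here, IP2's MAC on network 2). */
static void rewrite_macs(eth_hdr_t *h,
                         const uint8_t router_mac[6],
                         const uint8_t next_hop_mac[6])
{
    memcpy(h->src, router_mac, 6);
    memcpy(h->dst, next_hop_mac, 6);
}
```

Any such modifier would sit downstream of the routing lookup, keyed off the "target" field as suggested above.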
While it would be possible to map the "target" field to control a set of address modifiers and thus "correct" this lack, such packet modifications would need to be external to this routing algorithm at present.
A second drawback is found in the "Increment" instruction. For one, it scatters the performance metrics all over the table's memory area. This could be "fixed" by replacing the immediate field in the increment instruction with a performance table address. Likewise, there's no condition on the increment instruction--such as "If Z, increment." Hence, it would be difficult to capture metrics from the RZ and RNZ instructions separately from any metrics available for the RA instruction. Fixing this would require two more variants of the increment instruction.
Assuming that this is a typo and that the processor is not using 30-bit, but rather 32-bit instructions, we wonder why only 32 bits, given that the accumulator is 128 bits. With 128-bit VLIW, the processor would be able to load the complete address, along with shift value, all in one cycle.
We may also consider a bit in this long instruction word to control RxFIFO advancement. Consider that the RxFIFO may come in different form factors, such as 32 or 64 bits, where some of the address shifts happen not in space but in time, by advancing the RxFIFO.
Also consider expanding the instruction set so that this processor can be used to transfer the complete packet from the RxFIFO to the calculated destination--that is, serve as a mini-DMA.
What's the vision for the compiler framework of this network processor? Do you see it programmed manually in machine code, in some kind of assembly, in a new custom DSL, or in one of the already existing open-source DSLs with a networking focus?
As I see this in a block diagram sense, it would look something like the following:
Packets would come in using the AXIN protocol. A small buffer would exist within the router to buffer the packet header until the destination target had been determined. If this buffer ever overflowed before the target was known, the packet would be discarded and an error condition would be generated--in much the same way as with illegal instructions.
Dan
"... 1GbE only receives and processes 8b at a time. 128b VLIW instructions would be overkill for such an approach..."
Our Wireguard-1GE is actually, for multiple reasons, using 128 bits in the Data Plane Engine (DPE) core.
We have in the past used search-key tables as wide as 1K bits, so that multiple addresses can be compared in the same clock cycle.
The search-key capacity and the time to complete a search are critical parameters of a real-life networking product. For that reason we have also always used BRAM for search-key storage. Reaching out to an external DRAM is typically too expensive, performance-wise, when one has to match against hundreds of keys within the time allotment of a minimum Ethernet packet or, even worse, an ATM cell.
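To put numbers on that budget (the 200 MHz fabric clock below is an assumed figure, not from the thread): a minimum Ethernet frame occupies 84 byte times on the wire, so a 1GbE lookup must complete in about 672 ns--far too few cycles for hundreds of sequential DRAM accesses, which is why wide BRAM compares pay off.

```c
/* Cycle budget available to finish one lookup within the time of a
 * minimum-size Ethernet frame on the wire:
 * 64B frame + 8B preamble + 12B inter-frame gap = 84 byte times. */
static double lookup_cycle_budget(double line_rate_bps, double clk_hz)
{
    const double min_frame_bits = (64 + 8 + 12) * 8;   /* 672 bits   */
    double frame_s = min_frame_bits / line_rate_bps;   /* 672 ns @ 1GbE */
    return frame_s * clk_hz;                           /* clocks/frame  */
}
```

At 1 Gb/s with a 200 MHz clock this comes to roughly 134 cycles per minimum frame.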
A good state diagram for this processor would be worth a thousand words 😉
How about "SORT" instruction?
Many search algorithms depend on a particular table organization, and typically must reorganize the table whenever an entry is inserted or deleted. Here is one of them.
"SORT" is a rather complicated operation for hardware, but much easier in software. It makes sense that the table might require some form or kind of sorting within it, but it also makes more sense that said sorting be done before the table is made active to the hardware.
That's exactly why we are rooting for the Harvard architecture of this Search-and-Route (SAR) processor!
By clearly separating the Data space from the Instruction space, the design would facilitate co-processing scenarios, where a standard CPU would occasionally step in to make the life of the HW FSM easier.
These interventions would likely be for the following reasons:
1) add or delete a Data Table entry
2) sort the Data Table afterwards, per the requirements of the SEARCH algorithm used by the HW FSM of the SAR Proc.
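A sketch of that CPU-side maintenance, assuming a binary-search-friendly table organization (the entry layout and helper names are illustrative):

```c
#include <stdint.h>
#include <stdlib.h>

/* One Data Table entry: a search key and its route target. */
typedef struct {
    uint64_t key;
    uint32_t target;
} entry_t;

static int cmp_entry(const void *a, const void *b)
{
    uint64_t ka = ((const entry_t *)a)->key;
    uint64_t kb = ((const entry_t *)b)->key;
    return (ka > kb) - (ka < kb);
}

/* CPU-side intervention: append an entry, then re-sort the table so
 * the HW FSM's search keeps working. In the real system this would
 * happen while the SAR CoProc is paused (Run/Pause_N deasserted). */
static void table_add(entry_t *tbl, size_t *len, entry_t e)
{
    tbl[(*len)++] = e;
    qsort(tbl, *len, sizeof(entry_t), cmp_entry);
}
```

Deletion would work the same way: remove the entry, re-sort (or compact), then resume the FSM.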
Then, we could take the Instructions to a high level that even forgoes the need for a compiler... such as by promoting them to
Commands:
- Reset
- Run / Pause_N
Config settings:
- Table_Depth
- (optional) Net_Protocol (MAC, IP, VLAN, ...)
- (optional) Search_Algo (LPM, BBT, ...)
and Status:
- Ready / Busy_N
This setup is more amenable to co-processing arrangements, effectively transforming the SAR Proc into a SAR CoProc, or Accelerator.
That would arguably lead to a better-performing system, doing away with instruction fetches, caches, bug catches, and other degrading complications that come with programmability at the machine level.
This also plays nicely with the FPGA kind of programmability, which can be exploited as an alternative to a SAR Proc program, or can reduce the SAR CoProc config options to only Table_Depth. Everything else can be a different FPGA program, compiled from a pin-compatible Plug-and-Play RTL module that the user can swap in or out of their SoC by linking in the corresponding file.
We should also define the capacity and performance objectives for the Search part of the SAR CoProc. While commercial implementations accommodate thousands of entries, we could start with a modest 1K at 10GE arrival rates...
I'm not sure I follow. Can you slow down, back up, and explain your thoughts please?
Thank you.
Rather than designing a specialized processor (Master), look at this design task through the lens of a co-processor (an intelligent and capable peripheral, but still a Slave).
Forget about instruction fetches, assemblers, compilers, and all that jazz, and design it as an FSM that an external CPU can reset, run, pause, and configure, and which can operate directly on local BRAM-based tables that the external CPU manages and sorts in order to facilitate the searches.
The bulk of its programmability--that is, its adaptability to the given use case--comes from the softness of FPGA gates.
While the current routing algorithm used by the eth10g controller is blazing fast, it is not very configurable. Worse, it requires a lot of logic to manage. Further, the current routing method will only work on Ethernet headers, not on IPv4 or IPv6 addressing, nor on IP/UDP or IP/TCP protocol information (ports). For this reason, a highly configurable memory-based routing table is of interest.
Many routing methods exist in the literature, often parameterized by how the table is structured. A user configurable structure, which can be optimized in many ways, might therefore be valuable.
I therefore propose the following routing table approach. It's based upon 32-bit "instruction" words, which will feed a small FSM based router.
The router shall maintain 3 registers: a "program counter", controlling where the next table word is read from, a "shift register", containing a 128b window into the current packet, and a 128b accumulator containing the results from comparing table values with the 128b window into the current packet. Once comparisons are complete, the router can either 1) jump to a new instruction on a match, 2) jump to a new instruction on a failed match, 3) route to a given address a) always, b) on a match, or c) on a failed match.
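A software model of that loop might look like the following. The opcode names are hypothetical, the window is narrowed to 64 bits for the sketch, and the 128b accumulator is collapsed to a single match flag for brevity:

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical opcodes for the 32-bit table words. */
enum {
    OP_CMP,     /* compare masked window against a value */
    OP_JMP_Z,   /* jump on match                         */
    OP_JMP_NZ,  /* jump on failed match                  */
    OP_RA,      /* route always                          */
    OP_RZ,      /* route on match                        */
    OP_RNZ      /* route on failed match                 */
};

typedef struct {
    uint8_t  op;
    uint32_t arg;          /* jump address or route target */
    uint64_t value, mask;  /* comparison operands          */
} insn_t;

/* One routing pass. 'window' models the shift register's view into
 * the packet header (64b here; the proposal uses 128b). Returns the
 * target port, or -1 if the program runs off the end of the table. */
static int route(const insn_t *tbl, int len, uint64_t window)
{
    int  pc    = 0;     /* "program counter" into the table  */
    bool match = true;  /* accumulator, collapsed to one bit */

    while (pc < len) {
        const insn_t *i = &tbl[pc++];
        switch (i->op) {
        case OP_CMP:    match = ((window & i->mask) == i->value); break;
        case OP_JMP_Z:  if (match)  pc = (int)i->arg; break;
        case OP_JMP_NZ: if (!match) pc = (int)i->arg; break;
        case OP_RA:     return (int)i->arg;
        case OP_RZ:     if (match)  return (int)i->arg; break;
        case OP_RNZ:    if (!match) return (int)i->arg; break;
        }
    }
    return -1;
}
```

The real FSM would additionally reload the shift register from the packet buffer as comparisons consume header bytes.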
The current draft instruction set is shown below:
MAC Example
To match against an Ethernet MAC 86:33:24:17:bf:dd and send it to target "8", the table would have instructions:
IP Example
To match against an IP packet and route subnetwork 192.168.3.X to target 3, the table would use two steps. The first would determine if the packet were (or were not) an IP packet:
The IP table would then have:
Routing to a second destination, such as 192.168.17.X, to target 1 would only take two further instructions:
All other packets might then be routed to target 0.
RA 0. (Route Always to target zero)
In these examples, the "targets" will be defined with local meaning. For example, target zero might be set up as a catch all to be sent to a CPU for inspection, and potentially updating the routing table.
Performance monitoring
Performance counters are maintained within the table. Upon hitting a performance counter instruction, that instruction will be re-read and atomically incremented. How this happens will depend upon the bus in use. For example, on WB, the Wishbone cycle line will be held high while the performance counter is read, one is added to it, and the counter is then written back to memory. With AXI, an exclusive access operation can be used to guarantee an atomic increment. This helps to ensure that both the CPU and the router, and potentially other concurrent routers, will always have a consistent view of the performance counters.
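In software terms, the guarantee being described is the same one an atomic fetch-and-add gives: no other bus master can observe or clobber the intermediate value. A model of the behavior (not the RTL itself):

```c
#include <stdatomic.h>
#include <stdint.h>

/* Performance counters live in the same memory as the routing table.
 * The held Wishbone cycle (WB) or exclusive access (AXI) makes the
 * read-modify-write indivisible, equivalent to an atomic increment. */
static void perf_increment(_Atomic uint64_t *counter)
{
    atomic_fetch_add_explicit(counter, 1, memory_order_relaxed);
}
```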
Potential issues
The most obvious "issue" with this router is that it is memory based. Memory will need to be read to determine the optimal routing. While this is taking place, the packet will be forced to "stall" and wait for routing to complete. As long as packets are stored in virtual FIFOs (as they are within this design), such stalls shouldn't cause packets to drop, but they will delay packets moving through the system.
Because this approach is designed to look and feel like an instruction stream, a standard CPU instruction fetch may be used to feed it. This opens up the possibility of repurposing a pre-built cache.
As with any CPU, branches will cause pipeline stalls and hence delays.
There is no requirement in this proposal that any specific type of memory be used. The memory used could therefore come from on-chip memory, or equivalently an off-chip SDRAM type of memory, with expected performance advantages and disadvantages naturally following.
Conclusions
This table-based "instruction-set" solution solves several problems with the current approach, and so I offer it here for anyone to review and/or discuss.
Dan