ZipCPU / eth10g

10Gb Ethernet Switch
GNU General Public License v3.0

Routing tables: A proposal #4

Open ZipCPU opened 5 days ago

ZipCPU commented 5 days ago

While the current routing algorithm used by the eth10g controller is blazing fast, it is not very configurable. Worse, it requires a lot of logic to manage. Further, the current routing method only works on Ethernet headers, not on IPv4 or IPv6 addressing, or on IP/UDP or IP/TCP protocol information (ports). For this reason, a highly configurable memory-based routing table is of interest.

Many routing methods exist in the literature, often parameterized by how the table is structured. A user configurable structure, which can be optimized in many ways, might therefore be valuable.

I therefore propose the following routing table approach. It is based upon 32-bit "instruction" words, which feed a small FSM-based router.

The router shall maintain three registers: a "program counter", controlling where the next table word is read from; a "shift register", containing a 128-bit window into the current packet; and a 128-bit accumulator holding the results of comparing table values against that window. Once comparisons are complete, the router can either 1) jump to a new instruction on a match, 2) jump to a new instruction on a failed match, or 3) route to a given address a) always, b) on a match, or c) on a failed match.

The current draft instruction set is shown below:

[Image: tblrouter -- draft instruction-set listing]

MAC Example

To match against an Ethernet MAC 86:33:24:17:bf:dd and send it to target "8", the table would have instructions:

CMP (shift=12, IMM=0x863324) -- match against the first 24 bits
OR (shift=9, IMM=0x17bfdd) -- continue matching the second set of 24 bits
RZ 8 -- on "zero" (i.e. a match), route this packet to target "8"
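To pin down the intended semantics, here is a small Python model of the FSM sketched above. This is a sketch under assumptions, not the spec: the missing instruction-set listing defines the real encodings, so the byte-offset interpretation of `shift`, the 24-bit immediate width, and therefore the shift values used below (13 and 10, rather than the 12 and 9 quoted in the example) are all assumptions made to keep the model self-consistent.

```python
def run_table(program, packet):
    """Interpret a routing table over the first 16 bytes of a packet.

    Returns a routing target number, or None if the program falls off
    the end without routing.  Instruction semantics are assumptions.
    """
    # 128-bit big-endian window: packet byte 0 lands in bits [127:120].
    window = int.from_bytes(packet[:16].ljust(16, b"\x00"), "big")
    acc = 0                      # accumulator: zero means "match so far"
    pc = 0
    while pc < len(program):
        op, arg, imm = program[pc]
        if op in ("CMP", "OR"):
            if op == "CMP":      # CMP starts a fresh comparison
                acc = 0
            field = 0xFFFFFF << (8 * arg)         # 24-bit immediate field
            acc |= (window ^ (imm << (8 * arg))) & field
        elif op == "MASK":       # keep only the masked comparison bits
            acc &= imm << (8 * arg)
        elif op == "JZ" and acc == 0:
            pc = arg             # jump on match
            continue
        elif op == "RZ" and acc == 0:
            return arg           # route on match
        elif op == "RNZ" and acc != 0:
            return arg           # route on failed match
        elif op == "RA":
            return arg           # route always
        pc += 1
    return None

# Match destination MAC 86:33:24:17:bf:dd and route it to target 8;
# everything else falls through to target 0.  Under the byte-offset
# assumption, the first 24 MAC bits sit 13 bytes up from the window's
# least significant end, and the next 24 bits sit 10 bytes up.
prog = [
    ("CMP", 13, 0x863324),
    ("OR",  10, 0x17BFDD),
    ("RZ",   8, 0),
    ("RA",   0, 0),
]
```

A matching packet then routes to target 8, and any other packet falls through to the catch-all `RA 0`.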

IP Example

To match against an IP packet and route subnetwork 192.168.3.X to target 3, the table would use two steps. The first would determine if the packet were (or were not) an IP packet:

CMP (shift=2, IMM=0x0800)
MASK (shift=2, IMM=0x00ffff) -- make sure bits [23:16] of the prior compare don't affect the comparison
JZ (ip_table) -- to jump to an IPv4 processing table

The IP table would then have:

STEP 14 (Shift 14 bytes through, bypassing the rest of the Ethernet header)
CMP (shift=1, IMM=192.168.3) (Compare the IPv4 destination, bits [8 +: 24], against the given address)
RZ 3 (Route on a successful comparison to target "3")

Routing to a second destination, such as 192.168.17.X, to target 1 would only take two further instructions:

CMP (shift=1, IMM=192.168.17)  (Compare bits[8 +: 24] against the given destination address)
RZ 1 (Route on zero to target 1)

All other packets might then be routed to target 0.

RA 0 (Route Always to target zero)
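The offset arithmetic behind the table above can be checked against standard framing (assuming no VLAN tag and an IPv4 header without options; the named constants are mine, not part of the proposal):

```python
# Header offsets relevant to the routing table above.
ETH_HDR = 14                   # STEP 14 skips the full Ethernet header
ETHERTYPE_OFF = 12             # EtherType (0x0800 for IPv4) at bytes 12..13
IPV4_SRC_OFF = 12              # source address within the IPv4 header
IPV4_DST_OFF = 16              # destination address within the IPv4 header

# The destination IP therefore begins 30 bytes into the frame, so a
# 14-byte STEP leaves it 16 bytes into the (new) packet window.
assert ETH_HDR + IPV4_DST_OFF == 30
```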

In these examples, the "targets" will be defined with local meaning. For example, target zero might be set up as a catch-all to be sent to a CPU for inspection and, potentially, for updating the routing table.

Performance monitoring

Performance counters are maintained within the table. Upon hitting a performance counter instruction, that instruction will be re-read and atomically incremented. How this happens will depend upon the bus in use. For example, on Wishbone, the cycle line will be held high while the performance counter is read, one is added to it, and the counter is written back to memory. With AXI, an exclusive access operation can be used to guarantee an atomic increment. This helps to ensure that the CPU, the router, and potentially other concurrent routers will always have a consistent view of the performance counters.
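The invariant being described, independent of bus details, is that the read-increment-write sequence must never interleave with another agent's access. A toy Python model, with a lock standing in for holding the Wishbone cycle line (or an AXI exclusive access), illustrates it; the class and method names are mine:

```python
import threading

class CounterTable:
    """Toy model of in-table performance counters with atomic increment.

    A real router would hold the Wishbone CYC line, or use an AXI
    exclusive access, for the read-modify-write; here a lock stands in
    for that bus-level guarantee.
    """
    def __init__(self, n):
        self._mem = [0] * n
        self._lock = threading.Lock()    # stands in for bus ownership

    def bump(self, addr):
        with self._lock:                 # read ...
            value = self._mem[addr]
            self._mem[addr] = value + 1  # ... increment, write back

    def read(self, addr):
        with self._lock:
            return self._mem[addr]
```

With the lock held across the whole read-modify-write, concurrent "router" and "CPU" agents bumping the same counter never lose an increment.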

Potential issues

The most obvious "issue" with this router is that it is memory based. Memory will need to be read to determine the optimal routing. While this is taking place, the packet will be forced to "stall" and wait for routing to complete. As long as packets are stored in virtual FIFOs (as they are within this design), such stalls shouldn't cause packets to drop, but they will delay packets moving through the system.

Because this approach is designed to look and feel like an instruction stream, a standard CPU instruction fetch may be used to feed it. This opens the possibility of repurposing a pre-built cache.

As with any CPU, branches will cause pipeline stalls and hence delays.

There is no requirement in this proposal that any specific type of memory be used. The memory used could therefore come from on-chip memory, or equivalently an off-chip SDRAM type of memory, with expected performance advantages and disadvantages naturally following.

Conclusions

This table-based "instruction-set" solution solves several problems with the current approach, and so I offer it here for anyone to review and/or discuss.

Dan

ZipCPU commented 3 days ago

One drawback of this approach is that it doesn't directly allow address modification. This is great for switches that don't need to modify their network address, but not so great for routers between networks. Hence, if a packet comes into the router from IP1 on network 1 destined for IP2 on network 2, then the network addresses should be changed as the packet switches from network 1 to network 2. Specifically, the network addresses (i.e. Ethernet MAC) from the prior link need to be adjusted to those of the new link where the source would now be the router's MAC address, and the destination would be that of IP2 (assuming it exists on network 2).

While it would be possible to map the "target" field to control a set of address modifiers and thus "correct" this lack, such packet modifications would need to be external to this routing algorithm at present.

A second drawback is found in the "Increment" instruction. For one, it places the performance metrics all over the table's memory area. This could be "fixed" by replacing the immediate field in the increment instruction with a performance table address. Likewise, there's no condition on the increment instruction--such as "If Z, increment." Hence, it would be difficult to capture metrics from the RZ and RNZ instructions separate from any metrics available for the RA instruction. Fixing this would require 2 more variants of the increment instruction.

chili-chips-ba commented 3 days ago

Assuming that this is a typo and that the processor is not using 30-bit, but rather 32-bit, instructions, we wonder why only 32 bits, given that the accumulator is 128 bits. With a 128-bit VLIW, the processor would be able to load the complete address, along with the shift value, all in one cycle.

We may also consider a bit in this long instruction word for the control of RxFIFO advancement. Consider that the RxFIFO may come in different form factors, such as 32 or 64 bits, where some of the address shifts happen not in space but in time, by advancing the RxFIFO.

Also consider expanding the instruction set so that this processor can be used to transfer the complete packet from RxFIFO to the calculated destination, that is serve as a mini DMA.

What's the vision for the compiler framework of this network processor? Do you see it programmed manually, in machine code, or in some kind of Assembly, or in a new, custom DSL, or in one of the already existing open-source DSLs with a networking focus?

ZipCPU commented 3 days ago
  1. Yes, thank you, that was a typo. All instructions in my vision are 32b. The JZ and JNZ instructions therefore support 30b offsets to memory. This effectively allows the router to access up to 1GB of memory. If we forced all accesses to be aligned (a reasonable assumption), this might allow the router access to a full 4GB of memory.
  2. Why not 128b VLIW instructions? Well, first, I hadn't thought about going that wide. Thinking about it now, I would note: A) 1GbE only receives and processes 8b at a time; 128b VLIW instructions would be overkill for such an approach, and would therefore limit this solution to wider pipelines. B) Yes, this particular Eth10G design is set up for 10GbE traffic, and therefore processes either 64b/clock at 156MHz, 128b/clock at 78MHz, or 512b/clock at the memory rate of 100MHz. I see this as working on the memory bus clock, so it would be receiving and processing either 512b every five clocks or so, or 128b/clock, depending on which side of the width adjuster it got placed on. C) Yes, a buffer/FIFO would be necessary. D) A 128b VLIW wouldn't be as useful as it sounds, since you still wouldn't be able to match all 128b, and since you'd need a masking flag. Hence, you might only match 96b with a 128b VLIW. Perhaps more interesting might be a variable-length instruction word, allowing an 8b prefix followed by an immediate of up to 128b. Whether or not this would be "better" would still be up for analysis. E) Many architectures I've worked with don't have 128b paths to memory. This particular architecture has a 512b path, so ... this architecture is a bit unusual in that respect.
  3. This routine would work in tandem with the RxFIFO. If the RxFIFO fed into the router, then the RxFIFO would nominally advance on every STEP instruction, and packets would be flushed through the system on either a ROUTE (RZ, RNZ, RA) or an illegal instruction (ILL).
  4. I don't really see a need for a mini-DMA here. As I just mentioned, the Route and ILLegal instructions already serve the purpose of advancing the entire packet through the system.
  5. As for the compiler, I foresee a couple of implementations. A) Initially, the tool set would consist of a basic assembler--such as one might write in an afternoon with FLEX. This assembler would also have a disassembly option. This initial tool would serve for initial bringup and subsequent debug--at least to the point where you knew the router worked. B) After working with the router for a while, I foresee the creation of a software library that can write and compile instructions from a higher-level language. C) Eventually, I'd expect the software library to translate tables in other formats into this machine language. D) The language proposed here is not really meant for the user, but rather for the router. As such, it fits the role of a machine language: nobody would write it by hand. An assembler would be available for debug. Eventually, other higher-level languages would do all the work.
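The address-space and throughput arithmetic in points 1 and 2 can be checked directly (the text rounds the clock rates; the exact 10GbE figures are 156.25 MHz and 78.125 MHz):

```python
# Point 1: reach of a 30-bit offset field.
BYTES_PER_WORD = 4                            # 32-bit instruction words
assert 2 ** 30 == 1 << 30                     # byte offsets: 1 GiB
assert 2 ** 30 * BYTES_PER_WORD == 1 << 32    # word-aligned offsets: 4 GiB

# Point 2: each width/clock pairing sustains the 10 Gb/s line rate.
GBPS = 1e9
assert 64 * 156.25e6 == 10 * GBPS             # 64b/clock at 156.25 MHz
assert 128 * 78.125e6 == 10 * GBPS            # 128b/clock at 78.125 MHz
assert 512 * 100e6 / 5 > 10 * GBPS            # 512b every 5 clocks at 100 MHz
```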

As I see this in a block diagram sense, it would look something like the following:

[Image: 20241114-prorouter-blockd -- proposed router block diagram]

Packets would come in using the AXIN protocol. A small buffer would exist within the router to buffer the packet header until the destination target had been determined. If this buffer ever overflowed before the target was known, the packet would be discarded and an error condition would be generated--in much the same way as with illegal instructions.

Dan

chili-chips-ba commented 2 days ago

"... 1GbE only receives and processes 8b at a time. 128b VLIW instructions would be overkill for such an approach..."

Our Wireguard-1GE is actually, for multiple reasons, using 128 bits in the Data Plane Engine (DPE) core.

[Image: Wireguard-1GE Data Plane Engine]

We have in the past used tables of search keys as wide as 1K bits, so that multiple addresses can be compared in the same clock cycle.

The search key capacity and time to complete are critical parameters of a real-life networking product. We have for that reason also always used the BRAM for the search key storage. Reaching out to an external DRAM is typically too expensive for performance when one has to match against hundreds of keys in the time allotment of a minimum Ethernet packet or, even worse, an ATM cell.
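The time allotment in question is tight. A rough budget for a minimum-size Ethernet frame at 10GbE (64-byte frame plus 8-byte preamble and 12-byte inter-frame gap; the variable names are mine):

```python
# Time budget for back-to-back minimum-size frames at 10 Gb/s.
MIN_WIRE_BYTES = 64 + 8 + 12          # frame + preamble + inter-frame gap
ns_per_frame = MIN_WIRE_BYTES * 8 / 10e9 * 1e9     # bits over 10 Gb/s
clocks_at_156mhz = ns_per_frame * 156.25e6 / 1e9   # 10GbE 64b-path clock

assert MIN_WIRE_BYTES == 84
assert abs(ns_per_frame - 67.2) < 1e-6     # ~67 ns per minimum frame
assert 10 < clocks_at_156mhz < 11          # ~10.5 clocks at 156.25 MHz
```

Roughly ten clocks per minimum frame is why wide, parallel BRAM key comparisons beat sequential DRAM lookups here.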


A good state diagram for this processor would be worth a thousand words 😉

chili-chips-ba commented 1 day ago

How about a "SORT" instruction?

Many search algorithms count on a certain table organization, and typically must reorganize the table whenever an entry is inserted or deleted. Here is one of them.

ZipCPU commented 1 day ago

"SORT" is a rather complicated operation in hardware, but much easier in software. It makes sense that the table might require some form of sorting, but it makes more sense for that sorting to be done before the table is made active to the hardware.

chili-chips-ba commented 23 hours ago

That's exactly why we are rooting for a Harvard architecture for this Search-and-Route (SAR) processor!

By clearly separating the Data from the Instruction space, the design would facilitate co-processing scenarios, where a standard CPU would occasionally step in to make the life of the HW FSM easier.

These interventions would likely be for the following reasons: 1) to add or delete a Data Table entry, and 2) to sort the Data Table afterwards, per the requirements of the SEARCH algorithm used by the HW FSM of the SAR Proc.

Then, we could raise the Instructions to a level so high that it even forgoes the need for a compiler... such as by promoting them to

Commands:

  • Reset
  • Run / Pause_N

Config settings:

  • Table_Depth
  • (optional) Net_Protocol (MAC, IP, VLAN, ...)
  • (optional) Search_Algo (LPM, BBT, ...)

and Status:

  • Ready / Busy_N

This setup is more amenable to co-processing arrangements, effectively transforming the SAR Proc into a SAR CoProc, or Accelerator.

That would arguably lead to a better-performing system, doing away with instruction fetches, caches, bug catches, and other degrading complications that come with programmability at the machine level.

This also plays nicely with the FPGA kind of programmability, which can be exploited as an alternative to a SAR Proc program, or can reduce the SAR CoProc config options to only Table_Depth. Everything else can be a different FPGA program, compiled from a pin-compatible, Plug-and-Play RTL module that the user can swap in or out of their SoC by linking in the corresponding file.


We should also define the capacity and performance objectives for the Search part of the SAR CoProc. While commercial implementations accommodate thousands of entries, we could start with a modest 1K at 10GE arrival rates...

ZipCPU commented 16 hours ago

I'm not sure I follow. Can you slow down, back up, and explain your thoughts please?

Thank you.

chili-chips-ba commented 16 hours ago

Rather than designing a specialized processor (Master), look at this design task through the lens of a co-processor (an intelligent and capable peripheral, but still a Slave).

Forget about instruction fetches, assembler, compiler, and all that jazz, and design it as an FSM that an external CPU can:

  1. minimally configure
  2. reset, start and stop

and which can:

  1. return "busy" to the CPU.

Make it operate directly on local BRAM-based tables that the external CPU can manage and sort in order to facilitate the searches.

The bulk of its programmability, that is, its adaptability to the given use case, comes from the softness of FPGA gates.