
[RFC] New basis of operation: Reliable Datagram Pipe #354

whitequark opened this issue 1 year ago · Status: Open

whitequark commented 1 year ago

Currently we configure USB as either four or two completely unrelated streams: two IN/two OUT, or one IN/one OUT (with bigger buffers), and we expect applets to fit into this scheme. We also have an I2C sideband where applets can add arbitrary registers.

This scheme is problematic for many reasons:

Considering just the case of many applets and many pipes in isolation, a few solutions were proposed:

I propose a scheme called RDP: Reliable Datagram Pipe.

The scheme has the following advantages:

Using uleb128 as the encoding of the endpoint, credit, length fields has the following advantages:
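For reference, uleb128 is LEB128 (the variable-length integer encoding used by DWARF and WebAssembly) restricted to unsigned values. A minimal Python sketch of the encoding itself:

```python
def encode_uleb128(value: int) -> bytes:
    """Encode a non-negative integer 7 bits at a time, LSB first;
    the high bit of each byte flags a continuation."""
    assert value >= 0
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        out.append((byte | 0x80) if value else byte)
        if not value:
            return bytes(out)

def decode_uleb128(data: bytes) -> tuple[int, int]:
    """Decode one uleb128 value; returns (value, bytes consumed)."""
    value = shift = 0
    for index, byte in enumerate(data):
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, index + 1
        shift += 7
    raise ValueError("truncated uleb128")

assert encode_uleb128(300) == b"\xac\x02"
assert decode_uleb128(b"\xac\x02") == (300, 2)
```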

Downsides:

Open questions:

attie commented 1 year ago

This sounds very good - I also prefer it to both COBS and a packet header.

Some questions:

Can we address some of the "undefined behaviour", perhaps declaring that it resets / raises an exception / etc...?

Unbounded latency (but we can't really make it bounded with Python and off-the-shelf OSes anyway)

Is it worth imposing a maximum payload size, to help break up long busy periods / allow others in? Dynamically based on what is running, or a hard figure as part of the spec?

Is it worth it allowing multiple levels of cut-through routing within the FPGA?

I was also wondering how to handle sub-endpoints better than placing header info in the payload... Given we have 128-bit IDs, is it even worth worrying about multi-level routing? We could just allow applets to register multiple endpoints (or a range, like network masks, e.g. 192.168.0.0/24). The endpoint is then masked, and the equivalent of the "Host ID" is then passed through to the applet at the start, like an "address". Alignment and stuff becomes important, but that shouldn't be too difficult to handle.

whitequark commented 1 year ago
  • Is your expectation wrt Credits to primarily provide flow control, or is there more going on there?

I am thinking of this as solely a mechanism for flow control.

  • Do "credits" become "bytes", which instructs the other end how much data may be transferred before full, rather than being a more abstract packet count?

I think this actually depends on the buffering strategy used by the gateware! If it uses packets (e.g. it double-buffers a maximum-sized RAM) then a credit would be a datagram. If it is a stream endpoint then a credit would be a byte. (This also answers the previous question, I think?)

Can we address some of the "undefined behaviour", perhaps declaring that it resets / raises an exception / etc...?

It does whatever. (In practice: routes a packet somewhere wrong; overflows an internal counter and gets a weird idea about the amount of credits; etc.) I think it's important not to spend too many gateware resources enforcing contracts that should never be violated.

Note that all of the undefined behavior happens on the (abstracted) channel between software and gateware. The end user (or applet developer) never encounters any of it.

Is it worth imposing a maximum payload size, to help break up long busy periods / allow others in? Dynamically based on what is running, or a hard figure as part of the spec?

I don't know. This seems like it could be potentially limiting to some applets. I think I'd like to understand how difficult it is to work around such a limitation being imposed before committing to it.

Given we have 128-bit IDs

The ID is just a cache key. I don't think it should go into the gateware; that would be a waste of resources of our tiny iCE40!

I was also wondering how to handle sub-endpoints better than placing header info in the payload...

I think I know how I want to do routing now.


I propose splitting the spec effectively in two parts: the routing part and the flow control part.

For the routing, I propose ditching the endpoint number entirely. Instead, the endpoint address becomes an arbitrary sequence of non-zero bytes that ends with a zero byte. (This is so that no address is a prefix of another address.) So the frame becomes: endpoint address (bytes), delimiter (zero byte), length (uleb128), credits (whatever), (rest of) payload. The length would include the credits field and any other management data.
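A minimal host-side sketch of this framing (using the uleb128 helper sketched earlier, and arbitrarily picking a single uleb128 credits field, since the credits encoding is still an open question):

```python
def encode_uleb128(value: int) -> bytes:
    out = bytearray()
    while True:
        out.append((value & 0x7F) | (0x80 if value > 0x7F else 0))
        value >>= 7
        if not value:
            return bytes(out)

def encode_frame(address: bytes, credits: int, payload: bytes) -> bytes:
    """Frame: endpoint address (non-zero bytes), zero delimiter,
    uleb128 length, then the body the length covers: credits + payload."""
    assert 0 not in address, "no address may be a prefix of another"
    body = encode_uleb128(credits) + payload
    return bytes(address) + b"\x00" + encode_uleb128(len(body)) + body

# e.g. two hops of routing (ports 2 then 1), 5 credits, 3 payload bytes:
assert encode_frame(b"\x02\x01", 5, b"abc") == b"\x02\x01\x00\x04\x05abc"
```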

With this scheme, it is possible to build up arbitrary routing topologies in the gateware (to reflect the needs of e.g. clock domain transfer) and then treat the addresses that result as opaque byte sequences in software, which is kind of how you want to do it anyway. You could actually still make them valid uleb128, it's just no longer necessary since they are delimited at the end.

This means that sub-endpoints inside applets, which can be located e.g. after the CDC FIFO for async applets, are as cheap as normal endpoints are in the original proposal (since they are all treated uniformly anyway). It also means that routers never have to think about credits.

What do you think?

attie commented 1 year ago

I don't think it should go into the gateware; that would be a waste of resources of our tiny iCE40!

Fair.

In gateware, each address byte is parsed by a cut-through stream router that selects the next router, waits for a zero byte, then looks at the length, and goes back to the initial state after forwarding length bytes.

I like it a lot.

Presumably each router also absorbs the first byte in the stream, so the address/"route" gets shifted up/shorter each time?


```
          +---- "output port"
          v
source:   0x42 0x01 0x00 ...
router1:       0x01 0x00 ...   (0x42 consumed)
router2:            0x00 ...   (0x01 consumed)
```
whitequark commented 1 year ago

Yes, exactly. Each router would be a tiny FSM that is essentially linear:

  1. after receiving the first byte, connects the input stream to the right output stream (or null sink),
  2. waits for a zero byte,
  3. latches the next number into a counter,
  4. forwards that many bytes, then goes back to 1.
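A behavioral model of that FSM in plain Python (a sketch only, with hypothetical names; the gateware would be an FSM over Amaranth streams):

```python
def route_one_frame(stream, outputs):
    """One pass of the linear router FSM: consume the selector byte,
    then cut through the rest of the frame to the selected output."""
    it = iter(stream)
    sink = outputs[next(it)]        # 1. first byte selects the output
    for byte in it:                 # 2. forward remaining address bytes
        sink.append(byte)
        if byte == 0x00:            #    ...until the zero delimiter
            break
    length = shift = 0
    for byte in it:                 # 3. latch the uleb128 length (forwarded
        sink.append(byte)           #    too; downstream routers re-parse it)
        length |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:
            break
    for _ in range(length):         # 4. forward `length` bytes, then
        sink.append(next(it))       #    return to the initial state

out1, out2 = [], []
route_one_frame(b"\x02\x01\x00\x04\x05abc", {0x01: out1, 0x02: out2})
assert bytes(out2) == b"\x01\x00\x04\x05abc"  # 0x02 consumed, rest forwarded
```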

attie commented 1 year ago

Sounds very good to me!

whitequark commented 1 year ago

All right. We can prototype this once streams land in upstream Amaranth (and this will be a good test case for them too).

whitequark commented 1 year ago
  • latches the next number into a counter,

Actually, even this is wasteful. It would make a lot more sense to first pass the stream through a filter that asserts end-of-packet when length is reached, and then routers switch to the initial state on end-of-packet assertion.

This saves a (potentially) large counter in every router by instantiating it exactly once. And the router FSM becomes:

  1. after receiving the first byte, connect the input stream to the right output stream (or null sink),
  2. do nothing until end-of-packet is reached, then goto 1.

This means that routers are fantastically cheap. We can instantiate them completely automatically and without much regard for placement and routing. How cool is this?
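A behavioral sketch of the split (hypothetical names; the gateware would carry the EOP flag alongside each stream byte, and this assumes well-formed input): the framing is parsed exactly once at the front, and every router behind it reduces to the two states above:

```python
def eop_generator(raw):
    """Parse the framing once: forward every byte, asserting EOP
    alongside the last byte of each frame."""
    it = iter(raw)
    for byte in it:
        while byte != 0x00:             # address bytes
            yield byte, False
            byte = next(it)
        yield 0x00, False               # delimiter
        length = shift = 0
        while True:                     # uleb128 length (forwarded too)
            byte = next(it)
            yield byte, False           # length >= 1: body holds the credits
            length |= (byte & 0x7F) << shift
            shift += 7
            if not byte & 0x80:
                break
        for n in range(length):         # body: EOP on the final byte
            yield next(it), n == length - 1

def router(tagged, outputs):
    """The whole two-state router: select on the first byte, then
    copy until EOP, then back to the initial state. No counter."""
    it = iter(tagged)
    for byte, eop in it:
        sink = outputs[byte]
        while not eop:
            byte, eop = next(it)
            sink.append((byte, eop))

out = []
router(eop_generator(b"\x01\x00\x04\x05abc"), {0x01: out})
assert out[-1] == (ord("c"), True)
```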

attie commented 1 year ago

It would make a lot more sense to first pass the stream through a filter that asserts end-of-packet when length is reached, and then routers switch to the initial state on end-of-packet assertion.

This means that routers are fantastically cheap. We can instantiate them completely automatically and without much regard for placement and routing. How cool is this?

Yeah, I love it!

Flow-wise, I'm envisaging something like this. (Port selection mechanism is obviously not based on 255 instances per router)

[diagram]

whitequark commented 1 year ago

I'm thinking about it more generally. The applet would have a router in it! So for example, consider the system with two UART applets and one LPC applet (which is an async one). It could have a topology like this:

[diagram]

attie commented 1 year ago

Yep, I think we're on the same page... I just drew the router as a distinct box, because it'd be an "off the shelf" component:

[diagram]

whitequark commented 1 year ago

Yep! And given that messages for the same applet should not be re-ordered (this new scheme actually calls the re-ordering policy into question somewhat... "only reorder based on first byte"? that's kind of gross) we can safely interleave register access with pipe commands and so on.

This is actually even more reminiscent of PCIe, except I guess a lot more... message-oriented redesign of it?

whitequark commented 1 year ago

Actually, looking further at it, it seems that it's logical to put length first. So the framing becomes: length, endpoint address, delimiter, (credits), payload.

Is it worth imposing a maximum payload size, to help break up long busy periods / allow others in? Dynamically based on what is running, or a hard figure as part of the spec?

I think I know how to handle this now. The length field (per above) now solely delimits the datagram stream, right? It became a separate entity that can be used elsewhere now, without involving addressing, credits, and so on.

So what we do is add some nesting. Let's consider the case of sending 100 KB to UART-A and 64 KB to UART-B. This could be encoded as:

```
<Length=64.01KB Endpoint=0.1.0 Payload=<Length=100KB Payload=[64KB of data ... (unterminated)>
<Length=64.01KB Endpoint=0.3.0 Payload=<Length=64KB Payload=[64KB of data]>>
<Length=36.01KB Endpoint=0.1.0 Payload=<(continuation) ... other 36 KB of data]>>
```

In gateware, the way this could be implemented is that another EOP generator is connected to some of the router outputs (specifically, in front of endpoints, or groups of them, where long packets are allowed), exactly the same as the one at the front of the whole router tree. And instead of paying attention to the EOP signal at its input, it just looks at the length and counts! Meaning that each such point is a point where you can add fragmentation if you need it.

In software... this works exactly the same! A tree of Python objects that add framing when data passes through them.

I think it's really beautiful because it's completely composable. It's just two blocks, EOP generator and prefix router, connected in any way that seems useful for the design.
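A host-side sketch of one such Python object, using the revised framing from above (length, endpoint address, delimiter, payload; credits omitted for brevity, and all names hypothetical): an inner length-prefixed frame is cut into outer frames wherever the fragment limit falls:

```python
def encode_uleb128(value: int) -> bytes:
    out = bytearray()
    while True:
        out.append((value & 0x7F) | (0x80 if value > 0x7F else 0))
        value >>= 7
        if not value:
            return bytes(out)

def fragments(address: bytes, payload: bytes, max_fragment: int):
    """Wrap `payload` in an inner length-prefixed frame, then cut that
    byte stream into outer frames of at most `max_fragment` body bytes."""
    inner = encode_uleb128(len(payload)) + payload
    for offset in range(0, len(inner), max_fragment):
        chunk = inner[offset:offset + max_fragment]
        body = bytes(address) + b"\x00" + chunk
        yield encode_uleb128(len(body)) + body

# 100 KB to UART-A at address 0x01, interruptible every 64 KB; with
# max_fragment >= len(inner) this degenerates to a single frame.
frames = list(fragments(b"\x01", b"\xaa" * 100_000, max_fragment=64_000))
assert len(frames) == 2
```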

The other thing I really like about it is that if you do not need to fragment, you do not have to. Nothing stops you from forming the following packet:

```
<Length=1.00001GB Endpoint=0.1.0 Payload=<Length=1GB Payload=[1GB of data]>>
```

It will be incredibly efficient, especially assuming you're stuffing the data in there using sendfile() or an equivalent. Yes, it can't be interrupted, but what if you don't even have any other applets there to interrupt you? This is in stark contrast with TCP/IP-style fragmentation/segmentation, which you have to do whether you like it or not, because there it's a correctness issue, not a latency issue.


I imagine that endpoints will state their maximum tolerable latency (e.g. "needs to be serviced at least every 10 ms") and the Glasgow infrastructure would insert EOP generators in the places where the data flow would have to be regularly interrupted to preserve fairness.

whitequark commented 1 year ago

this new scheme actually calls the re-ordering policy into question somewhat... "only reorder based on first byte"? that's kind of gross

I think what we can do here is twofold. First... do you know set_clock_groups -asynchronous from the Synopsys constraint files? Asserting that the groups of clock domains are independent from each other and need not be timed together. We can have the same thing for endpoints: a way to mark them asynchronous. Then they could be mapped to e.g. different TCP streams for when we have TCP/IP running on this thing.

Second, we can actually enforce this. We can make it possible to check that a single asyncio.Task never awaits two endpoints that are set as asynchronous to each other, and emit a warning (or even crash it) if it ever does.
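A sketch of what that enforcement could look like on the host (all names hypothetical), tagging each endpoint with its asynchronous group and recording, per asyncio.Task, which groups it has awaited:

```python
import asyncio

# which asynchronous groups each task has awaited so far
# (a real version would evict entries when tasks finish)
_groups_awaited: dict[asyncio.Task, set[str]] = {}

class Endpoint:
    def __init__(self, name: str, group: str):
        self.name, self.group = name, group

    async def recv(self) -> bytes:
        task = asyncio.current_task()
        seen = _groups_awaited.setdefault(task, set())
        seen.add(self.group)
        if len(seen) > 1:
            # endpoints in different groups are asynchronous to each other
            raise RuntimeError(
                f"task {task.get_name()!r} awaits endpoints from "
                f"asynchronous groups {sorted(seen)}")
        ...  # actually receive a datagram here
```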

The actual reordering policy will likely be still prefix based but formed based on the policy definition for endpoints. Ideally (as we reach higher bandwidths) we'd have some Rust code that yeets datagrams for endpoints from asynchronous groups into different threads (on revE this would be several instantiations of the hardware TCP state machine) and then each thread can run an asyncio reactor handling datagrams for its endpoint group. So it's not just multiple applets running simultaneously, but they could actually run in parallel! (Provided they're actually independent from each other.)

Which, combined with the bound on latency from the previous message, actually makes it seem like we can make multiple-applet configurations that do not have applets starve each other of either USB pipe bandwidth or of host resources? That would be really nice.

whitequark commented 1 year ago

Okay, it looks like I thought long and hard enough about this problem that I ended up finding a completely new (to myself) way to think about it ~which this margin is too narrow to contain~. I might have even better proposals soon! (Building on these particular ones, it seems.)

sorear commented 1 year ago

Are all of our writes posted? How do we establish ordering between an applet responding to a write and subsequent read data on another endpoint? In particular, if we read data after a reset do we know if the data was generated before or after the reset?

whitequark commented 1 year ago

Are all of our writes posted? How do we establish ordering between an applet responding to a write and subsequent read data on another endpoint?

The interconnect knows which endpoints may not have flows be reordered between them and ensures this doesn't happen. Consider a multiplexer which has two input streams pending. If the flows in these streams are declared asynchronous, it can switch at any point (in practice it will switch to maintain latency requirements) and head-of-line blocking isn't possible. If the flows are declared synchronous (all of the flows within one applet are synchronous unless declared otherwise) then head-of-line blocking is possible.

In particular, if we read data after a reset do we know if the data was generated before or after the reset?

But this isn't enough to solve reset endpoints specifically because it is possible (likely, even) that we want to reset the applet because it has filled every single buffer in the pipeline.

I thought I had a solution for this but I think it doesn't work out. I'll sleep on it; maybe it'll make sense then.

attie commented 1 year ago

Adding my words & diagrams from #glasgow on 2023-07-29, so they don't get lost.


I was going for more: a router consumes a byte, and addresses the downstream. It's cut-through until downstream "releases" it; then a "consumer(?)" signals back to indicate it's there and wants data. It can take whatever logic it likes to determine length, before releasing a "busy/more" type signal back upstream.

I'm always better with diagrams; will try to put something together before the meeting.

(This results in a series of bytes that describe the route into the FPGA, and then an endpoint that might be variable length (e.g. standardized on uleb128), fixed length with no length field in the payload, or a zero-length "signal", etc...)

For the return path, routers add the port address.

[diagram]

Something like this (probably needs more thought / proving through). The endpoint then just connects the applet up, instead of the output:

[diagram]

This would result in a stream that has [ n Address Octets ][ n Payload Octets ], and can be interpreted as required. E.g. 02 00 05 01 02 03 04 05 could be address 02 -> 00, followed by 05 (length) and 01 02 03 04 05 (payload).

whitequark commented 1 year ago

I don't think we should have an address bus at all--that's gates we could be using for something else.