
[RFC] New basis of operation: Reliable Datagram Pipe #354

whitequark opened this issue 1 year ago · Status: Open

whitequark commented 1 year ago

Currently we configure USB as either four or two completely unrelated streams: two IN/two OUT, or one IN/one OUT (with bigger buffers), and we expect applets to fit into this scheme. We also have an I2C sideband where applets can add arbitrary registers.

This scheme is problematic for many reasons:

Considering just the case of many applets and many pipes in isolation, a few solutions were proposed:

I propose a scheme called RDP: Reliable Datagram Pipe.

The scheme has the following advantages:

Using uleb128 as the encoding of the endpoint, credit, length fields has the following advantages:
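For reference, uleb128 is LEB128 (the variable-length integer encoding used by DWARF and WebAssembly) restricted to unsigned values. A minimal Python sketch of the encoding itself:

```python
def encode_uleb128(value: int) -> bytes:
    """Encode a non-negative integer 7 bits at a time, LSB first;
    the high bit of each byte flags a continuation."""
    assert value >= 0
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        out.append((byte | 0x80) if value else byte)
        if not value:
            return bytes(out)

def decode_uleb128(data: bytes) -> tuple[int, int]:
    """Decode one uleb128 value; returns (value, bytes consumed)."""
    value = shift = 0
    for index, byte in enumerate(data):
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, index + 1
        shift += 7
    raise ValueError("truncated uleb128")

assert encode_uleb128(300) == b"\xac\x02"
assert decode_uleb128(b"\xac\x02") == (300, 2)
```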

Downsides:

Open questions:

attie commented 1 year ago

This sounds very good - I also prefer it to both COBS and a packet header.

Some questions:

Can we address some of the "undefined behaviour", perhaps declaring that it resets / raises an exception / etc...?

Unbounded latency (but we can't really make it bounded with Python and off-the-shelf OSes anyway)

Is it worth imposing a maximum payload size, to help break up long busy periods / allow others in? Dynamically based on what is running, or a hard figure as part of the spec?

Is it worth it allowing multiple levels of cut-through routing within the FPGA?

I was also wondering how to handle sub-endpoints better than placing header info in the payload... Given we have 128-bit IDs, is it even worth worrying about multi-level routing? We could just allow applets to register multiple endpoints (or a range, like network masks, e.g. 192.168.0.0/24). The endpoint is then masked, and the equivalent of the "Host ID" is then passed through to the applet at the start, like an "address". Alignment and stuff becomes important, but that shouldn't be too difficult to handle.

whitequark commented 1 year ago
  • Is your expectation wrt Credits to primarily provide flow control, or is there more going on there?

I am thinking of this as solely a mechanism for flow control.

  • Do "credits" become "bytes", which instructs the other end how much data may be transferred before full, rather than being a more abstract packet count?

I think this actually depends on the buffering strategy used by the gateware! If it uses packets (e.g. it double-buffers a maximum-sized RAM) then a credit would be a datagram. If it is a stream endpoint then a credit would be a byte. (This also answers the previous question, I think?)

Can we address some of the "undefined behaviour", perhaps declaring that it resets / raises an exception / etc...?

It does whatever. (In practice: routes a packet somewhere wrong; overflows an internal counter and gets a weird idea about the amount of credits; etc.) I think it's important not to spend too many gateware resources enforcing contracts that should never be violated.

Note that all of the undefined behavior happens on the (abstracted) channel between software and gateware. The end user (or applet developer) never encounters any of it.

Is it worth imposing a maximum payload size, to help break up long busy periods / allow others in? Dynamically based on what is running, or a hard figure as part of the spec?

I don't know. This seems like it could be potentially limiting to some applets. I think I'd like to understand how difficult it is to work around such a limitation being imposed before committing to it.

Given we have 128-bit IDs

The ID is just a cache key. I don't think it should go into the gateware; that would be a waste of resources of our tiny iCE40!

I was also wondering how to handle sub-endpoints better than placing header info in the payload...

I think I know how I want to do routing now.


I propose splitting the spec effectively in two parts: the routing part and the flow control part.

For the routing, I propose ditching the endpoint number entirely. Instead, the endpoint address becomes an arbitrary sequence of non-zero bytes that ends with a zero byte. (This is so that no address is a prefix of another address.) So the frame becomes: endpoint address (bytes), delimiter (zero byte), length (uleb128), credits (whatever), (rest of) payload. The length would include the credits field and any other management data.
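A minimal host-side sketch of this framing (using the uleb128 helper sketched earlier, and arbitrarily picking a single uleb128 credits field, since the credits encoding is still an open question):

```python
def encode_uleb128(value: int) -> bytes:
    out = bytearray()
    while True:
        out.append((value & 0x7F) | (0x80 if value > 0x7F else 0))
        value >>= 7
        if not value:
            return bytes(out)

def encode_frame(address: bytes, credits: int, payload: bytes) -> bytes:
    """Frame: endpoint address (non-zero bytes), zero delimiter,
    uleb128 length, then the body the length covers: credits + payload."""
    assert 0 not in address, "no address may be a prefix of another"
    body = encode_uleb128(credits) + payload
    return bytes(address) + b"\x00" + encode_uleb128(len(body)) + body

# e.g. two hops of routing (ports 2 then 1), 5 credits, 3 payload bytes:
assert encode_frame(b"\x02\x01", 5, b"abc") == b"\x02\x01\x00\x04\x05abc"
```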

With this scheme, it is possible to build up arbitrary routing topologies in the gateware (to reflect the needs of e.g. clock domain transfer) and then treat the addresses that result as opaque byte sequences in software, which is kind of how you want to do it anyway. You could actually still make them valid uleb128, it's just no longer necessary since they are delimited at the end.

This means that sub-endpoints inside applets, which can be located e.g. after the CDC FIFO for async applets, are as cheap as normal endpoints are in the original proposal (since they are all treated uniformly anyway). It also means that routers never have to think about credits.

What do you think?

attie commented 1 year ago

I don't think it should go into the gateware; that would be a waste of resources of our tiny iCE40!

Fair.

In gateware, each address byte is parsed by a cut-through stream router that selects the next router, waits for a zero byte, then looks at the length, and goes back to the initial state after forwarding length bytes.

I like it a lot.

Presumably each router also absorbs the first byte in the stream, so the address/"route" gets shifted up/shorter each time?


```
          +---- "output port"
          v
source:   0x42 0x01 0x00 ...
router1:       0x01 0x00 ...   (0x42 consumed)
router2:            0x00 ...   (0x01 consumed)
```
whitequark commented 1 year ago

Yes, exactly. Each router would be a tiny FSM that is essentially linear:

  1. after receiving the first byte, connects the input stream to the right output stream (or null sink),
  2. waits for a zero byte,
  3. latches the next number into a counter,
  4. forwards that many bytes, then goes back to 1.
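A behavioral model of that FSM in plain Python (a sketch only, with hypothetical names; the gateware would be an FSM over Amaranth streams):

```python
def route_one_frame(stream, outputs):
    """One pass of the linear router FSM: consume the selector byte,
    then cut through the rest of the frame to the selected output."""
    it = iter(stream)
    sink = outputs[next(it)]        # 1. first byte selects the output
    for byte in it:                 # 2. forward remaining address bytes
        sink.append(byte)
        if byte == 0x00:            #    ...until the zero delimiter
            break
    length = shift = 0
    for byte in it:                 # 3. latch the uleb128 length (forwarded
        sink.append(byte)           #    too; downstream routers re-parse it)
        length |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:
            break
    for _ in range(length):         # 4. forward `length` bytes, then
        sink.append(next(it))       #    return to the initial state

out1, out2 = [], []
route_one_frame(b"\x02\x01\x00\x04\x05abc", {0x01: out1, 0x02: out2})
assert bytes(out2) == b"\x01\x00\x04\x05abc"  # 0x02 consumed, rest forwarded
```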

attie commented 1 year ago

Sounds very good to me!

whitequark commented 1 year ago

All right. We can prototype this once streams land in upstream Amaranth (and this will be a good test case for them too).

whitequark commented 1 year ago
  • latches the next number into a counter,

Actually, even this is wasteful. It would make a lot more sense to first pass the stream through a filter that asserts end-of-packet when length is reached, and then routers switch to the initial state on end-of-packet assertion.

This saves a (potentially) large counter in every router by instantiating it exactly once. And the router FSM becomes:

  1. after receiving the first byte, connect the input stream to the right output stream (or null sink),
  2. do nothing until end-of-packet is reached, then goto 1.

This means that routers are fantastically cheap. We can instantiate them completely automatically and without much regard for placement and routing. How cool is this?
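A behavioral sketch of the split (hypothetical names; the gateware would carry the EOP flag alongside each stream byte, and this assumes well-formed input): the framing is parsed exactly once at the front, and every router behind it reduces to the two states above:

```python
def eop_generator(raw):
    """Parse the framing once: forward every byte, asserting EOP
    alongside the last byte of each frame."""
    it = iter(raw)
    for byte in it:
        while byte != 0x00:             # address bytes
            yield byte, False
            byte = next(it)
        yield 0x00, False               # delimiter
        length = shift = 0
        while True:                     # uleb128 length (forwarded too)
            byte = next(it)
            yield byte, False           # length >= 1: body holds the credits
            length |= (byte & 0x7F) << shift
            shift += 7
            if not byte & 0x80:
                break
        for n in range(length):         # body: EOP on the final byte
            yield next(it), n == length - 1

def router(tagged, outputs):
    """The whole two-state router: select on the first byte, then
    copy until EOP, then back to the initial state. No counter."""
    it = iter(tagged)
    for byte, eop in it:
        sink = outputs[byte]
        while not eop:
            byte, eop = next(it)
            sink.append((byte, eop))

out = []
router(eop_generator(b"\x01\x00\x04\x05abc"), {0x01: out})
assert out[-1] == (ord("c"), True)
```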

attie commented 1 year ago

It would make a lot more sense to first pass the stream through a filter that asserts end-of-packet when length is reached, and then routers switch to the initial state on end-of-packet assertion.

This means that routers are fantastically cheap. We can instantiate them completely automatically and without much regard for placement and routing. How cool is this?

Yeah, I love it!

Flow-wise, I'm envisaging something like this. (Port selection mechanism is obviously not based on 255 instances per router)

[diagram]

whitequark commented 1 year ago

I'm thinking about it more generally. The applet would have a router in it! So for example, consider the system with two UART applets and one LPC applet (which is an async one). It could have a topology like this:

[diagram]

attie commented 1 year ago

Yep, I think we're on the same page... I just drew the router as a distinct box, because it'd be an "off the shelf" component:

[diagram]

whitequark commented 1 year ago

Yep! And given that messages for the same applet should not be re-ordered (this new scheme actually calls the re-ordering policy into question somewhat... "only reorder based on first byte"? that's kind of gross) we can safely interleave register access with pipe commands and so on.

This is actually even more reminiscent of PCIe, except I guess a lot more... message-oriented redesign of it?

whitequark commented 1 year ago

Actually, looking further at it, it seems that it's logical to put length first. So the framing becomes: length, endpoint address, delimiter, (credits), payload.

Is it worth imposing a maximum payload size, to help break up long busy periods / allow others in? Dynamically based on what is running, or a hard figure as part of the spec?

I think I know how to handle this now. The length field (per above) now solely delimits the datagram stream, right? It became a separate entity that can be used elsewhere now, without involving addressing, credits, and so on.

So what we do is add some nesting. Let's consider the case of sending 100 KB to UART-A and 64 KB to UART-B. This could be encoded as:

```
<Length=64.01KB Endpoint=0.1.0 Payload=<Length=100KB Payload=[64KB of data ... (unterminated)>
<Length=64.01KB Endpoint=0.3.0 Payload=<Length=64KB Payload=[64KB of data]>>
<Length=36.01KB Endpoint=0.1.0 Payload=<(continuation) ... other 36 KB of data]>>
```

In gateware, the way this could be implemented is that another EOP generator is connected to some of the router outputs (specifically, in front of endpoints, or groups of them, where long packets are allowed), exactly the same as the one at the front of the whole router tree. And instead of paying attention to the EOP signal at its input, it just looks at the length and counts! Meaning that each such point is a point where you can add fragmentation if you need it.

In software... this works exactly the same! A tree of Python objects that add framing when data passes through them.

I think it's really beautiful because it's completely composable. It's just two blocks, EOP generator and prefix router, connected in any way that seems useful for the design.
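A host-side sketch of one such Python object, using the revised framing from above (length, endpoint address, delimiter, payload; credits omitted for brevity, and all names hypothetical): an inner length-prefixed frame is cut into outer frames wherever the fragment limit falls:

```python
def encode_uleb128(value: int) -> bytes:
    out = bytearray()
    while True:
        out.append((value & 0x7F) | (0x80 if value > 0x7F else 0))
        value >>= 7
        if not value:
            return bytes(out)

def fragments(address: bytes, payload: bytes, max_fragment: int):
    """Wrap `payload` in an inner length-prefixed frame, then cut that
    byte stream into outer frames of at most `max_fragment` body bytes."""
    inner = encode_uleb128(len(payload)) + payload
    for offset in range(0, len(inner), max_fragment):
        chunk = inner[offset:offset + max_fragment]
        body = bytes(address) + b"\x00" + chunk
        yield encode_uleb128(len(body)) + body

# 100 KB to UART-A at address 0x01, interruptible every 64 KB; with
# max_fragment >= len(inner) this degenerates to a single frame.
frames = list(fragments(b"\x01", b"\xaa" * 100_000, max_fragment=64_000))
assert len(frames) == 2
```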

The other thing I really like about it is that if you do not need to fragment, you do not have to. Nothing stops you from forming the following packet:

```
<Length=1.00001GB Endpoint=0.1.0 Payload=<Length=1GB Payload=[1GB of data]>>
```

It will be incredibly efficient, especially assuming you're stuffing the data in there using sendfile() or an equivalent. Yes, it can't be interrupted, but what if you don't even have any other applets there to interrupt you? This is in stark contrast with TCP/IP-style fragmentation/segmentation, which you have to do whether you like it or not, because there it's a correctness issue, not a latency issue.


I imagine that endpoints will state their maximum tolerable latency (e.g. "needs to be serviced at least every 10 ms") and the Glasgow infrastructure would insert EOP generators in the places where the data flow would have to be regularly interrupted to preserve fairness.

whitequark commented 1 year ago

this new scheme actually calls the re-ordering policy into question somewhat... "only reorder based on first byte"? that's kind of gross

I think what we can do here is twofold. First... do you know set_clock_groups -asynchronous from the Synopsys constraint files? Asserting that the groups of clock domains are independent from each other and need not be timed together. We can have the same thing for endpoints: a way to mark them asynchronous. Then they could be mapped to e.g. different TCP streams for when we have TCP/IP running on this thing.

Second, we can actually enforce this. We can make it possible to check that a single asyncio.Task never awaits two endpoints that are set as asynchronous to each other, and emit a warning (or even crash it) if it ever does.
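A sketch of what that enforcement could look like on the host (all names hypothetical), tagging each endpoint with its asynchronous group and recording, per asyncio.Task, which groups it has awaited:

```python
import asyncio

# which asynchronous groups each task has awaited so far
# (a real version would evict entries when tasks finish)
_groups_awaited: dict[asyncio.Task, set[str]] = {}

class Endpoint:
    def __init__(self, name: str, group: str):
        self.name, self.group = name, group

    async def recv(self) -> bytes:
        task = asyncio.current_task()
        seen = _groups_awaited.setdefault(task, set())
        seen.add(self.group)
        if len(seen) > 1:
            # endpoints in different groups are asynchronous to each other
            raise RuntimeError(
                f"task {task.get_name()!r} awaits endpoints from "
                f"asynchronous groups {sorted(seen)}")
        ...  # actually receive a datagram here
```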

The actual reordering policy will likely be still prefix based but formed based on the policy definition for endpoints. Ideally (as we reach higher bandwidths) we'd have some Rust code that yeets datagrams for endpoints from asynchronous groups into different threads (on revE this would be several instantiations of the hardware TCP state machine) and then each thread can run an asyncio reactor handling datagrams for its endpoint group. So it's not just multiple applets running simultaneously, but they could actually run in parallel! (Provided they're actually independent from each other.)

Which, combined with the bound on latency from the previous message, actually makes it seem like we can make multiple-applet configurations that do not have applets starve each other of either USB pipe bandwidth or of host resources? That would be really nice.

whitequark commented 1 year ago

Okay, it looks like I thought long and hard enough about this problem that I ended up finding a completely new (to myself) way to think about it ~which this margin is too narrow to contain~. I might have even better proposals soon! (Building on these particular ones, it seems.)

sorear commented 1 year ago

Are all of our writes posted? How do we establish ordering between an applet responding to a write and subsequent read data on another endpoint? In particular, if we read data after a reset do we know if the data was generated before or after the reset?

whitequark commented 1 year ago

Are all of our writes posted? How do we establish ordering between an applet responding to a write and subsequent read data on another endpoint?

The interconnect knows which endpoints may not have flows be reordered between them and ensures this doesn't happen. Consider a multiplexer which has two input streams pending. If the flows in these streams are declared asynchronous, it can switch at any point (in practice it will switch to maintain latency requirements) and head-of-line blocking isn't possible. If the flows are declared synchronous (all of the flows within one applet are synchronous unless declared otherwise) then head-of-line blocking is possible.

In particular, if we read data after a reset do we know if the data was generated before or after the reset?

But this isn't enough to solve reset endpoints specifically because it is possible (likely, even) that we want to reset the applet because it has filled every single buffer in the pipeline.

I thought I had a solution for this but I think it doesn't work out. I'll sleep on it; maybe it'll make sense then.

attie commented 1 year ago

Adding my words & diagrams from #glasgow on 2023-07-29, so they don't get lost.


I was going for more: a router consumes a byte, and addresses the downstream. It's cut-through until downstream "releases" it; then a "consumer(?)" signals back to indicate it's there and wants data. It can take whatever logic it likes to determine length, before releasing a "busy/more" type signal back upstream.

I'm always better with diagrams; will try to put something together before the meeting.

(This results in a series of bytes that describe the route into the FPGA, and then an endpoint that might be variable length (e.g. standardized on uleb128), fixed length with no length field in the payload, or a zero-length "signal", etc...)

For the return path, routers add the port address.

[diagram]

Something like this (probably needs more thought / proving through). The endpoint then just connects the applet up, instead of the output:

[diagram]

This would result in a stream that has [ n Address Octets ][ n Payload Octets ], and can be interpreted as required. E.g. 02 00 05 01 02 03 04 05 could be address 02 -> 00, followed by 05 (length) and 01 02 03 04 05 (payload).

whitequark commented 1 year ago

I don't think we should have an address bus at all--that's gates we could be using for something else.