ZipCPU / wb2axip

Bus bridges and other odds and ends

Kicking off a transfer with AXIS2MM #42

Closed baileyji closed 2 years ago

baileyji commented 2 years ago

Hi Dan, I managed to get the core implemented without too much trouble and have (I think) a rudimentary Python driver (gist here) for use with PYNQ. (Interestingly, the core shows up as a PYNQ Hierarchy wrapped around a DefaultIP; I suspect this is because of the way Vivado wraps RTL in block designs.)

The short of this is: how does one move the core from "idle" to "busy"?

From the ILA I see that data is passing through the core (WDATA is toggling). After writing an address with my driver it reads back as I'd expect, and I see AWADDR outputting the value I set (0x500000000, the base address of some MIG DDR4 I'm using), so I don't seem to have an endianness issue.

After writing the length I can read it back, and the value I set affects AWLEN, but no writes have been issued. The command register always reads 0x90000, which I infer is telling me the FIFO is 512 deep (as I set in Vivado).

AWLEN started out at 255 (and stayed there for any large transaction), which, as I understand it, is not a legal value for a burst with 256b TDATA (anything above 128 beats would cross a 4k boundary). Even if it is formally allowed, I've repeatedly seen Xilinx docs indicating that crossing that boundary is not a good idea, so I'll be keeping an eye on whether the SmartConnect or MIG chokes.

I've looked through the Verilog to see how I should tell the core to start, and short of a couple of bits you have labeled "reserved" (but which in fact seem to be related to start and prestart) I'm drawing a blank.

Apologies if these implementation details aren't familiar or are largely irrelevant. I figure more (clear) info is better than less.

Finally, a quick note on how I'm planning to use this: I feed it a stream that asserts TLAST between every other and every 256th clock (not transaction). So the stream will have between 2 and 256 beats every 256 clocks (generally this will be steady state, i.e. the same 128 beats out of 256 possible will be there every time, over and over). The core does not accept backpressure (its TREADY input is tied high since it's fed from an ADC). I'd configure axis2mm to sync on TLAST (a very nice bonus) and capture some multiple of 32 (bytes per beat) x N (active beats, i.e. 128) x M (the number of sets I want, ~0.5-4 GiB of data). I'd start the core and let it spew unmolested into the MIG (which does have enough bandwidth), with it finishing after the total number of bytes and hopefully not griping that it has seen M TLAST assertions along the way.

ZipCPU commented 2 years ago

To start a transfer, you'll need to write to the control register to set the busy bit. I tend to also automatically clear the error bit at the same time, since the design will not start once an error has been received until the error is acknowledged. You can find the start logic here, and learn how I'm handling the AXI-lite control bus here--should you be interested. If you use a declaration like this:

    typedef struct S2MM_S {
        unsigned  a_control;
        unsigned  a_unused1[3];
        unsigned *a_dest;
        unsigned  a_unused2;
        unsigned  a_len;
        unsigned  a_unused3;
    } S2MM;

and declare a variable as,

    static volatile S2MM *const _s2mm = ((S2MM *) /* your S2MM address */);

you should then be able to kick the transaction off with:

    // Set the two high bits of the control register
    //   Bit 31 starts the transaction
    //   Bit 30 clears any prior transaction
    _s2mm->a_control = 0xc0000000;

If you look here, you'll see where I kick it off in a Verilator based simulation.

The control register offers other options for dealing with the IP as well. For example, if you want to make sure that the IP never drops samples, you can set r_continuous (bit 28). This will create backpressure (I know, you can't handle backpressure ..) when the design is idle or the FIFO is full. It will also ready itself for a next transfer even while the FIFO isn't empty. There's a third benefit to r_continuous: the design will then tell you about any overflows, by interrupting the CPU and setting the overflow flag.
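
As a rough illustration (not code from the repo itself), setting r_continuous together with the start and clear bits could look like the following, reusing the _s2mm declaration above; the bit positions are the ones named in this thread (31 start, 30 clear, 28 r_continuous):

    // Sketch: start the transfer, clear any prior error, and request
    // continuous operation so overflows get reported.
    //   Bit 31: start (r_busy)
    //   Bit 30: clear any prior error
    //   Bit 28: r_continuous
    _s2mm->a_control = 0xd0000000;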

If your design is busting the 4kB boundary with its burst size, then that is an error that I will fix. I can look at that today. As a temporary fix, you can set the LGMAXBURST local parameter to the maximum burst size for your data width. Something like:

    localparam LGMAXBURST = $clog2(4096 * 8 / C_AXI_DATA_WIDTH);

A "proper" fix (which I'll work on) should clip this at a maximum of either 8 or half your FIFO size, but the above should work as a temporary fix if you already know that you are breaking the 4kB boundary. Once fixed, AWLEN should range from zero to (1<<LGMAXBURST)-1. (If your address isn't aligned with the burst size, then the first burst will attempt to align it --- so not all bursts will start with a full length.)

Beware of configuring the IP for packet sizes other than those you expect. Unlike Xilinx's IP, which will halt on TLAST, this IP will transfer the length you request independent of when/where TLAST shows up. I consider this a "feature", but it might surprise those who are familiar with Xilinx's IP. (Stopping on TLAST is a future capability I would consider adding as an option ...) If you sync on TLAST, the S2MM IP will then wait until it sees a TLAST before it starts collecting at the top of the next packet. This means that, if you configure the IP for a packet size of 256 and send a packet of size two followed by a packet of size 256, the IP will copy 256 items from the stream to memory--2 from the first packet and 254 from the next--before finishing.

The design does put a bit of a timing load on S_AXIS_TREADY, so if you struggle to meet timing then a skid buffer might be appropriate on the input AXI stream. I think Xilinx calls these AXI slices. It should be easy enough to add one to the IP if necessary.

Dan

baileyji commented 2 years ago

Excellent, that did indeed kick it off. Interestingly, it did appear to capture just fine with 8k bursts for a while. Now the core has entered a state where err and overflow are set and I can't get them to clear. Command reads 0xc0890000, and writing err to clear it is either having no effect or the error is re-asserting itself immediately. Either way, no more captures.

I'll make the change to the local params you suggest, as it's very unlikely I'd ever capture less than 4096 bytes of data.

As for monitoring overflow, that sounds like a great idea, but it would require careful attention or using the interrupt, if I follow correctly: an overflow before completion indicates that the AXI slave couldn't keep up, but the bit will also be set as soon as the FIFO fills up after the transfer is complete. So I've got only LGFIFO clocks, best case, to read the cmd register and tell the difference, right?

ZipCPU commented 2 years ago

Okay, so ... you can't clear overflow while the design is in operation. You may have to issue an abort to the design before being able to clear overflow. That seems appropriate. To issue an abort, write 0x26000000 to the design and it will halt its transfer. You'll then need to restart it. Otherwise, the overflow bit is simply telling you that samples were dropped. Whether or not that is catastrophic depends on your application.
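
A minimal sketch of that abort-then-restart sequence, again using the hypothetical _s2mm pointer from above, and assuming busy reads back on bit 31 of the control register as implied earlier in this thread:

    // Abort the transfer in flight (key value given above)
    _s2mm->a_control = 0x26000000;

    // Wait for the core to drain its outstanding AXI transactions and
    // drop out of "busy" (busy assumed to read back on bit 31)
    while (_s2mm->a_control & 0x80000000)
        ;

    // Now clear the error/overflow state and restart the transfer
    _s2mm->a_control = 0xc0000000;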

LGFIFO is a number less than 32. The FIFO size isn't LGFIFO, though; it's (1<<LGFIFO). Once the transfer completes, you will have until this FIFO fills up to start your next transfer. The challenge is that ... the transfer won't complete until the last BVALID has been received. So, let's think this through: there's a round-trip time from when the data is in the FIFO and ready to be sent to memory, to AWVALID && WVALID getting set to request the transfer, to when the acknowledgment (BVALID) comes back for that transfer. During this time, the FIFO can accumulate data. Once the last BVALID is returned, your application will get interrupted. (The FIFO can still be accumulating data.) You can then reconfigure the S2MM controller and start it again. (The FIFO will still be accumulating data during this time.) You can accumulate no more than (1<<LGFIFO) words of data during all this time before the FIFO overflows.

It might help your application to set LGFIFO to something a bit larger. 12 or 13 perhaps? All depending on how much block RAM hardware you have available to you. If the design forces you into distributed RAM, let me know, and I'll show you how to keep it in block RAM--and push that option to the top level in the next release.

Dan

baileyji commented 2 years ago

Right, I forgot about trying to issue an abort.

I'm pretty clear on LGFIFO (your documentation in the file is superb, though I'd add a note to r_busy that one can write to it to start things). Right now I've got the core using 3,627 LUTs (2368/1.1% of which are LUTRAM) and 715 FFs, and I haven't had any timing issues in this testbench. Knowing how to force it into BRAM would be good; we are using something like 50-70% of the device LUTRAM (multichannel Xilinx FIRs that don't like BRAM), so any savings we can make there are valuable.

The nice thing about our design is that our control isn't time critical; most of my HLS blocks don't even have their interrupt output hooked up, to save on that tiny bit of resource use. The only reason I'd need to go longer on the FIFO is if the SmartConnect/MIG can't keep up, but if that happens a (slightly larger) FIFO won't help against gigabytes of data. The situation I was alluding to was kicking off, say, a 2GiB transfer (that will take ~120 ms) and going off to do other things (when it completes I'd not be starting another one for macroscopic time). So if I don't get back to the core and read out the command register within 1<<LGFIFO clocks (1 us for LGFIFO=9), I won't know whether it overflowed because I didn't start it again or because it was stalled waiting on the slave. This isn't a requirement for me, though, as by design I'm never going to be in a situation where the slave can't keep up, and I'd see it in the analysis if somehow it didn't. I think it just means this particular flag isn't very useful in this use case where I'm being lazy and not interrupt driven.

ZipCPU commented 2 years ago

> (your documentation in the file is superb, though I'd add a note to r_busy that one can write to it to start things)

Thanks! I am putting an update together based on your comments, and this will be in it. I'm also going to add the skid buffer on the input as well as trimming the maximum burst length to guarantee its always less than 4kB.

> Knowing how to force it into BRAM would be good

There's a parameter that the synchronous FIFO supports, OPT_ASYNC_READ, that forces the memory into distributed RAMs when set. If you clear this parameter, then all memory accesses will be registered--allowing the synthesizer to use block RAM. A touch of combinatorial logic makes sure that the two paths are functionally identical so nothing other than the parameter would need to change.

Something else to keep in your back pocket is the fact that there is a scatter gather version of the DMA available. This allows the DMA to configure itself from an external table in system memory. While I haven't yet built this approach for the S2MM IP, it shouldn't be all that hard to do. I just haven't had a need to do so.

I've also thought about adding a high water mark to the IP, so you can tell the maximum FIFO fill at any time. Might be useful at some point, but ... I haven't had the opportunity (or need) for it (yet).

Dan

ZipCPU commented 2 years ago

Also, quick question: since you were looking heavily at resource usage, how does my S2MM compare to Xilinx's in terms of resource usage? About the same? More, less?

Thanks!

baileyji commented 2 years ago

Their blocks don't offer what I need; the closest is their AXIS DMA, but it maxes out at 64MB without scatter-gather, doesn't offer sync, and in SG mode has all sorts of timing issues at 512MHz (plus it is huge).

I've gotten several HLS AXIS2MM cores that all pass C/co-simulation and should make timing (in isolation) around 730MHz; depending on the design, they use between ~1400 and 2200 LUTs and FFs and 16 32k BRAMs, so yours is probably a touch more than I'm managing in HLS with the best core I think should work.

The thing is, I can't get them to work in situ: they never issue a write request on the bus, and even after adding all kinds of debug probes to the innards I'm about out of ideas on how to edit the HLS code. I'm pretty sure it's some internal control deadlock relating to FIFOs, but without capturing it in co-sim it's hard to figure out how to fix. See here if you are curious.

baileyji commented 2 years ago

Note too that the address bits for ABORT aren't in the docs. It's reasonably clear that they are [15:8], though after I failed to abort I'm interpreting your earlier comment more carefully. It wasn't clear that you had repurposed bits from elsewhere.

baileyji commented 2 years ago

This is great, I've managed to get things working (as long as continuous=False).

There are three cycles of WREADY low after the first data beat of every burst of 128. That is a throughput limitation I don't think I had with a different HLS core, but it comes from the SmartConnect/MIG, so it may be that it somehow got synthesized differently with this core vs another HLS core I'd tested. I'll need to look into this, as the raw ADC data rate is 32 bytes @ 512 MHz (16.384 GB/s) while the MIG is 64 bytes @ 300 MHz (19.2 GB/s) with a touch of overhead, so this should be possible. I'm counting on the SmartConnect going from 256 -> 512 without a delay; I thought I'd successfully done this with an older HLS core, but it is possible I missed that I had dropped some samples. It might help if I up my data width to 512 before your core to halve the number of AXI transactions.

This has led me to discover that I can't get a reliable abort: writing the key will cause aborting to be set, but it is never cleared. One possibility here is that at full data rate the master isn't keeping up and WDATA goes low, the core then overflows because I ignore the TREADY de-assertion, and then the final AXI transaction never finishes, so when I try to abort it just hangs at aborting.

ZipCPU commented 2 years ago

Let me know about the WREADY issue.

From the IP side of the house, the startup should be:

  1. User sets the destination address and length, and then sets r_busy. This issues the start command and the IP then becomes busy. (A register-level sketch of this sequence follows this list.)
  2. The S2MM core might then wait for a LAST signal--depending on how you configure it.
  3. Once it has synchronized to a packet, it will then wait until a burst of data is available in the FIFO. This is nominally once the FIFO is half full, but you can set the FIFO to be larger than two AXI bursts in size. If you do so, the IP will wait until a burst's worth of data is available before starting. Note that the first packet will not accumulate into the buffer until/unless r_continuous is set--which (in my vision) will get set on the second and/or subsequent invocations of the design.
  4. Once a burst's worth of data is available, AWVALID and WVALID will then be raised together. Once WVALID && WLAST are both accepted, AWVALID && WVALID will immediately be raised again. This can cause some hiccups with some interconnects--since I wait to raise subsequent AWVALIDS--however my measure of success has been that WVALID will remain raised throughout the entire transaction.
  5. If you issue an abort, it may take the design a while to stop. All of the AXI transactions need to finish, and all acknowledgments must be received.
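
As a rough register-level sketch of steps 1-5 (not code from the repo; it reuses the hypothetical _s2mm pointer from earlier in this thread, and dest_buf / nbytes are placeholders):

    // Placeholders (hypothetical): fill in for your system
    unsigned *dest_buf = (unsigned *) /* DDR destination address */;
    unsigned  nbytes   = /* transfer length, per the core's docs */;

    // 1. Program destination and length, then set r_busy to start
    _s2mm->a_dest    = dest_buf;
    _s2mm->a_len     = nbytes;
    _s2mm->a_control = 0xc0000000;   // bit 31: start, bit 30: clear prior error

    // 2-4. The core synchronizes to TLAST (if so configured), waits for a
    //      burst's worth of data in its FIFO, then issues AW/W bursts.

    // 5. Wait for completion (busy assumed to read back on bit 31),
    //    or field the interrupt instead of polling.
    while (_s2mm->a_control & 0x80000000)
        ;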

Within a single AXIS2MM command, setting r_continuous should be fairly irrelevant. The whole purpose of r_continuous is to help you go from one command to the next without interruptions. If your goal is to transfer 2GB of data and be done, then the r_continuous flag won't help you much. You'll still get overflow warnings without it, they'll just be limited to overflows that take place while the transfer is in operation.

Upping your data width to 512 might help. I came across some more lint errors when doing so here, but nothing more. Tests at 512 bit widths are currently passing here.

Let me look into the abort issue from my end. I know I've demonstrated that I can start an abort sequence, but ... I'm not necessarily certain I've been able to finish one--so let me double check that.

Dan

ZipCPU commented 2 years ago

I have found a race condition with the abort. If you issue an abort command while nothing is in flight, it will lock up: 1) it won't issue any more requests, and 2) it never transitions out of busy unless it gets a last BVALID. There remains the possibility that no data is in transit, so AWVALID and WVALID are both low and nothing is outstanding. Neither AWVALID nor WVALID will be raised again, but if nothing is outstanding then it will never transition back to idle.

This is easily fixed, and I'll add this to the fixes to be posted.

Looking into the packet synchronization criteria, I found a bug there too--a bit of a tautology between TVALID && TREADY but not writing to the FIFO, and writing to the FIFO being defined as TVALID && TREADY and synchronized. The correct logic there should be VALID && READY and the FIFO is not in reset.

These updates are now being regression tested. I'll post them once regression is complete (and successful).

Dan

baileyji commented 2 years ago

Ok, I just tried going to 512b TDATA and got a synth error at axis2mm.v:1432

part-select [7:0] out of range of prefix 'r_max_burst'

if (!M_AXI_AWVALID || M_AXI_AWREADY)
    axi_awlen  <= r_max_burst[7:0] - 8'd1;

The relevant param lines for me are (I think)

    localparam LGMAXBURST = $clog2(4096 * 8 / C_AXI_DATA_WIDTH);
    localparam  LGMAX_FIXED_BURST = (LGMAXBURST > 4) ? 4 : LGMAXBURST;
    localparam  MAX_FIXED_BURST = (1<<LGMAX_FIXED_BURST);

which all evaluate to 64.

ZipCPU commented 2 years ago

Try this commit. It should fix the issues you've listed above.

Dan

baileyji commented 2 years ago

I saw your commit and already have it synthesizing. Thanks!

ZipCPU commented 2 years ago

It turns out that a width of 512 is the key width to use when testing widths big enough to break the 4kB boundary, so let me add that to my regular proof regimen.

baileyji commented 2 years ago

Initial synthesis of this commit (with your TUSER modification reapplied) does allow synthesis at 512b width, but with greater usage: 1.9/2.0/2.2x LUTs/LUTRAM/FFs. I changed the FIFO to registered reads and that brought things way down, ~1k LUTs/FFs and 3.5 36K BRAMs, and it made timing at 512MHz just fine. By cobbling together my own version of Xilinx's AXI Interconnect, I've managed to use this core to write a steady stream of 4K bursts at 16.384GB/s for a total of 4GiB (well, 512b shy, as I set the length register to 32 bits) to DDR4 via the Xilinx MIG. Thanks for creating something excellent.