enjoy-digital / litesata

Small footprint and configurable SATA core
Other
119 stars 33 forks source link

frontend/dma: support multi-sector transfer #25

Closed gsomlo closed 1 year ago

gsomlo commented 1 year ago

CAUTION: DO NOT APPLY (yet)!]

Allow up to 8 contiguous sectors to be transfered in a single DMA operation.

The max. 8 limit is based on my observation of the linux kernel's behavior, but seems rather arbitrary.

CAUTION: I also kinda sorta hacked the LiteSATA DMA frontend to increase the sector-buffer FIFO(s) by a factor of 8 (to fit up to 8 sectors). It might make more sense to decouple the FIFO size from the max. transfer size and have it work in a true "streaming" fashion, but for that I'd have to study LiteX streams, the SATA protocol, etc. in a LOT more detail :)

At the moment, I just want to start talking about this rather than push an actual change into upstream LiteSATA.

Statistics (EDIT: on nexys_video, with sata-gen 1, quad-core fpu-enabled rocket at 50MHz):

Signed-off-by: Gabriel Somlo gsomlo@gmail.com

gsomlo commented 1 year ago

One alternative to overcomplicating the litesata gateware would be to simply handle transfer sizes of either 1 or 8 sectors, and nothing else.

I've observed that each bvec worth of data to be transfered inside the linux drivers is either exactly one, or exactly eight sectors in size. I've added a warning print statement to notify us if ever the litesata linux driver is transfering a block that's sized other than 1 or 8 sectors.

But assuming that's rare, we could, instead of a 16-bit register containing an arbitrary sector count, use a single-bit register (or additional bit flag in an existing register) to specify whether the transfer will be 8 contiguous sectors or just one sector.

We can then use 1-sector-at-a-time mode for anything that's not a multiple of 8, and keep the gateware as simple as possible...

@enjoy-digital -- what do you think?

enjoy-digital commented 1 year ago

@gsomlo: Thanks for the RFC, I'll look at this on Monday.

gsomlo commented 1 year ago

On Sat, Aug 27, 2022 at 12:08:20AM -0700, enjoy-digital wrote:

@gsomlo: Thanks for the RFC, I'll look at this on Monday.

OK, one more thought -- if decoupling the buffer size from the size of the data unit being transferred (512 or 4096 bytes are the two values I've observed Linux use so far) would over-complicate the design too much, hard-coding buffers 8 times as large as until now might be made optional -- if the gateware is built with support for 4096-byte transfers as an option, there might be a flag indicating so in phy or identify or whatnot, and the DMA start register might have two bits instead of one, and the Linux driver might chose to do 4096-byte DMA transfers if the gateware indicates it has that capability...

Anyway, once you've had a chance to look over my "rough draft" maybe we can chat on IRC before deciding what the best option might be...

enjoy-digital commented 1 year ago

Hi @gsomlo,

sorry for the delay. It would in fact be possible to support multi-sectors transfer probably without increasing the FIFO depth (or by just doubling it):

I can do the implementation and do simple test with the LiteX BIOS and you'll then be able to test with the Linux driver. Does it make sense to you?

gsomlo commented 1 year ago

Hi @enjoy-digital,

It would in fact be possible to support multi-sectors transfer probably without increasing the FIFO depth (or by just doubling it):

Sounds interesting -- if you can "stream" through the FIFO (instead of what currently appears to be a store-and-forward mode of operation) we could then figure out the optimal FIFO size that's still economical in terms of FPGA memory utilization but gets us a good data transfer rate (i.e., without brute-force hard-coding the FIFO to 8x its current size, like what my naive experiment did :)

The largest individual data size transfered by the linux driver in its current form is 8 sectors (4096 bytes, or a standard memory page).

I'm not quite sure how likely it is that we'd need larger than 8-sector contiguous transfer capability without adding support for hardware-based scatter-gather lists (which would probably be overkill if the goal is to keep LiteSATA as simple as reasonably possible). @geertu: maybe you might be able to comment on that?

Either way, I'd be happy to test a streaming-FIFO PR against LiteSATA, and do measurements on FIFO length vs. transfer speeds to come up with good defaults!

enjoy-digital commented 1 year ago

Good, I'll work on implementing this then :)

geertu commented 1 year ago

512 bytes is the sector size on traditional HDDs. 4096 bytes is the sector size on some recent HDDs. As OpenRISC uses a page size of 8 KiB, I guess Linux will usually read 16 sectors on OpenRISC?

gsomlo commented 1 year ago

hey @enjoy-digital -- I happen to have a few days to focus on this, so I was wondering if either

  1. you have some unreleased work toward this goal -- streaming sectors through the sata frontend, as opposed to store-and-forward-ing one (or 8, or some fixed number) of sectors at a time.

or

  1. you have some high-level idea of how one would go about implementing such a change that you could share, and I could try and implement?

Thanks!

gsomlo commented 1 year ago

I've been playing with LiteSATA a bit over the last few days, trying to get myself reacquainted with its operation.

According to some tests I just ran, the portion of the proposed PR addressing disk reads (LiteSATASector1MemDMA) should work fine as-is: multiple sectors can be streamed through a sector buffer as small as the original single sector (logical_sector_size//port_bytes). I did test a 16-sector disk-to-RAM transfer through a single-sector sized buffer successfully, on both rocket and vexriscv!

I will now focus on the disk write portion (LiteSATAMem2SectorDMA), where the FSM loads the buffer in a separate loop (state READ-DATA-DMA) from the portion where the buffer is then pushed out to disk (state SEND-CMD-AND-DATA). If I could somehow merge these two FSM states into a single one that streams the entire multi-sector data chunk out to disk in a single loop, rather than the current "store-and-forward" mode, progress might happen :)

Speaking of disk writes. If, using rocket at 50MHz with --sata-gen 1, I write a disk sector:

litex> sata_write 4321000 xy
Memory dump:
0x11001ce0  78 79 78 79 78 79 78 79 78 79 78 79 78 79 78 79  xyxyxyxyxyxyxyxy
0x11001cf0  78 79 78 79 78 79 78 79 78 79 78 79 78 79 78 79  xyxyxyxyxyxyxyxy
0x11001d00  78 79 78 79 78 79 78 79 78 79 78 79 78 79 78 79  xyxyxyxyxyxyxyxy
0x11001d10  78 79 78 79 78 79 78 79 78 79 78 79 78 79 78 79  xyxyxyxyxyxyxyxy
0x11001d20  78 79 78 79 78 79 78 79 78 79 78 79 78 79 78 79  xyxyxyxyxyxyxyxy
0x11001d30  78 79 78 79 78 79 78 79 78 79 78 79 78 79 78 79  xyxyxyxyxyxyxyxy
0x11001d40  78 79 78 79 78 79 78 79 78 79 78 79 78 79 78 79  xyxyxyxyxyxyxyxy
0x11001d50  78 79 78 79 78 79 78 79 78 79 78 79 78 79 78 79  xyxyxyxyxyxyxyxy
0x11001d60  78 79 78 79 78 79 78 79 78 79 78 79 78 79 78 79  xyxyxyxyxyxyxyxy
0x11001d70  78 79 78 79 78 79 78 79 78 79 78 79 78 79 78 79  xyxyxyxyxyxyxyxy
0x11001d80  78 79 78 79 78 79 78 79 78 79 78 79 78 79 78 79  xyxyxyxyxyxyxyxy
0x11001d90  78 79 78 79 78 79 78 79 78 79 78 79 78 79 78 79  xyxyxyxyxyxyxyxy
0x11001da0  78 79 78 79 78 79 78 79 78 79 78 79 78 79 78 79  xyxyxyxyxyxyxyxy
0x11001db0  78 79 78 79 78 79 78 79 78 79 78 79 78 79 78 79  xyxyxyxyxyxyxyxy
0x11001dc0  78 79 78 79 78 79 78 79 78 79 78 79 78 79 78 79  xyxyxyxyxyxyxyxy
0x11001dd0  78 79 78 79 78 79 78 79 78 79 78 79 78 79 78 79  xyxyxyxyxyxyxyxy
0x11001de0  78 79 78 79 78 79 78 79 78 79 78 79 78 79 78 79  xyxyxyxyxyxyxyxy
0x11001df0  78 79 78 79 78 79 78 79 78 79 78 79 78 79 78 79  xyxyxyxyxyxyxyxy
0x11001e00  78 79 78 79 78 79 78 79 78 79 78 79 78 79 78 79  xyxyxyxyxyxyxyxy
0x11001e10  78 79 78 79 78 79 78 79 78 79 78 79 78 79 78 79  xyxyxyxyxyxyxyxy
0x11001e20  78 79 78 79 78 79 78 79 78 79 78 79 78 79 78 79  xyxyxyxyxyxyxyxy
0x11001e30  78 79 78 79 78 79 78 79 78 79 78 79 78 79 78 79  xyxyxyxyxyxyxyxy
0x11001e40  78 79 78 79 78 79 78 79 78 79 78 79 78 79 78 79  xyxyxyxyxyxyxyxy
0x11001e50  78 79 78 79 78 79 78 79 78 79 78 79 78 79 78 79  xyxyxyxyxyxyxyxy
0x11001e60  78 79 78 79 78 79 78 79 78 79 78 79 78 79 78 79  xyxyxyxyxyxyxyxy
0x11001e70  78 79 78 79 78 79 78 79 78 79 78 79 78 79 78 79  xyxyxyxyxyxyxyxy
0x11001e80  78 79 78 79 78 79 78 79 78 79 78 79 78 79 78 79  xyxyxyxyxyxyxyxy
0x11001e90  78 79 78 79 78 79 78 79 78 79 78 79 78 79 78 79  xyxyxyxyxyxyxyxy
0x11001ea0  78 79 78 79 78 79 78 79 78 79 78 79 78 79 78 79  xyxyxyxyxyxyxyxy
0x11001eb0  78 79 78 79 78 79 78 79 78 79 78 79 78 79 78 79  xyxyxyxyxyxyxyxy
0x11001ec0  78 79 78 79 78 79 78 79 78 79 78 79 78 79 78 79  xyxyxyxyxyxyxyxy
0x11001ed0  78 79 78 79 78 79 78 79 78 79 78 79 78 79 78 79  xyxyxyxyxyxyxyxy

and then try to read it back out, I get:

litex> sata_read 4321000    
Memory dump:
0x11001cf0  78 79 78 79 78 79 78 79 78 79 78 79 78 79 78 79  xyxyxyxyxyxyxyxy
0x11001d00  78 79 78 79 78 79 78 79 78 79 78 79 78 79 78 79  xyxyxyxyxyxyxyxy
0x11001d10  78 79 78 79 78 79 78 79 78 79 78 79 78 79 78 79  xyxyxyxyxyxyxyxy
0x11001d20  78 79 78 79 78 79 78 79 78 79 78 79 78 79 78 79  xyxyxyxyxyxyxyxy
0x11001d30  78 79 78 79 78 79 78 79 78 79 78 79 78 79 78 79  xyxyxyxyxyxyxyxy
0x11001d40  78 79 78 79 78 79 78 79 78 79 78 79 78 79 78 79  xyxyxyxyxyxyxyxy
0x11001d50  78 79 78 79 78 79 78 79 78 79 78 79 78 79 78 79  xyxyxyxyxyxyxyxy
0x11001d60  78 79 78 79 78 79 78 79 78 79 78 79 78 79 78 79  xyxyxyxyxyxyxyxy
0x11001d70  78 79 78 79 78 79 78 79 78 79 78 79 78 79 78 79  xyxyxyxyxyxyxyxy
0x11001d80  78 79 78 79 78 79 78 79 78 79 78 79 78 79 78 79  xyxyxyxyxyxyxyxy
0x11001d90  78 79 78 79 78 79 78 79 78 79 78 79 78 79 78 79  xyxyxyxyxyxyxyxy
0x11001da0  78 79 78 79 78 79 78 79 78 79 78 79 78 79 78 79  xyxyxyxyxyxyxyxy
0x11001db0  78 79 78 79 78 79 78 79 78 79 78 79 78 79 78 79  xyxyxyxyxyxyxyxy
0x11001dc0  78 79 78 79 78 79 78 79 78 79 78 79 78 79 78 79  xyxyxyxyxyxyxyxy
0x11001dd0  78 79 78 79 78 79 78 79 78 79 78 79 78 79 78 79  xyxyxyxyxyxyxyxy
0x11001de0  78 79 78 79 78 79 78 79 78 79 78 79 78 79 78 79  xyxyxyxyxyxyxyxy
0x11001df0  78 79 78 79 78 79 78 79 78 79 78 79 78 79 78 79  xyxyxyxyxyxyxyxy
0x11001e00  78 79 78 79 78 79 78 79 78 79 78 79 78 79 78 79  xyxyxyxyxyxyxyxy
0x11001e10  78 79 78 79 78 79 78 79 78 79 78 79 78 79 78 79  xyxyxyxyxyxyxyxy
0x11001e20  78 79 78 79 78 79 78 79 78 79 78 79 78 79 78 79  xyxyxyxyxyxyxyxy
0x11001e30  78 79 78 79 78 79 78 79 78 79 78 79 78 79 78 79  xyxyxyxyxyxyxyxy
0x11001e40  78 79 78 79 78 79 78 79 78 79 78 79 78 79 78 79  xyxyxyxyxyxyxyxy
0x11001e50  78 79 78 79 78 79 78 79 78 79 78 79 78 79 78 79  xyxyxyxyxyxyxyxy
0x11001e60  78 79 78 79 78 79 78 79 78 79 78 79 78 79 78 79  xyxyxyxyxyxyxyxy
0x11001e70  78 79 78 79 78 79 78 79 00 00 00 00 00 00 00 00  xyxyxyxy........
0x11001e80  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
0x11001e90  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
0x11001ea0  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
0x11001eb0  00 00 00 00 00 00 00 00 e8 69 00 10 00 00 00 00  .........i......
0x11001ec0  d0 5b 00 10 00 00 00 00 c8 69 00 10 00 00 00 00  .[.......i......
0x11001ed0  e0 69 00 10 00 00 00 00 e0 25 00 10 00 00 00 00  .i.......%......
0x11001ee0  78 79 78 79 78 79 78 79 f8 1e 00 11 00 00 00 00  xyxyxyxy........

This is not reproducible using vexriscv at any frequency (50 or 100 MHz) or sata generation (1 or 2).

Funny thing is, even on rocket at 50MHz and sata-gen 1, the problem goes away if I enforce 512-byte alignment on the buffer being used:

diff --git a/litex/soc/software/bios/cmds/cmd_litesata.c b/litex/soc/software/bios/cmds/cmd_litesata.c
index 81199833..6ddc2e70 100644
--- a/litex/soc/software/bios/cmds/cmd_litesata.c
+++ b/litex/soc/software/bios/cmds/cmd_litesata.c
@@ -40,7 +40,8 @@ static void sata_read_handler(int nb_params, char **params)
 {
    unsigned int sector;
    char *c;
-   uint8_t buf[512];
+   uint8_t foo[1024];
+   uint8_t *buf = (uint8_t *)(((unsigned long)foo + 0x1ff) & 0xfffffe00);

    if (nb_params < 1) {
        printf("sata_read <sector>");
@@ -70,7 +71,8 @@ define_command(sata_read, sata_read_handler, "Read SATA sector", LITESATA_CMDS);
 static void sata_write_handler(int nb_params, char **params)
 {
    int i;
-   uint8_t buf[512];
+   uint8_t foo[1024];
+   uint8_t *buf = (uint8_t *)(((unsigned long)foo + 0x1ff) & 0xfffffe00);
    unsigned int sector;
    char *c;

I think this has something to do with stalls or similar delays caused by Rocket's internal cache consistency logic (since on rocket we're DMA-ing through the CPU block's dedicated axi port and back out to a dedicated RAM port, rather than via a shared bus, the way I think it's still set up on vexriscv).

So, DMA writes in particular seem to be timing and stall sensitive, a fact exposed by using Rocket's dma setup.

As mentioned earlier, I'll "zoom in" on LiteSATA disk write logic, and see what I can come up with in terms of converting store-and-forward buffered writes into streaming mode...

Any comments/clue/advice much welcome. @enjoy-digital please LMK if you have any thoughts :)

Thanks, --G

gsomlo commented 1 year ago

I tested this successfully on rocket and vexriscv, from the bios, and from linux on rocket only.

I still don't know about exactly why unaligned memory buffers cause garbage to be written when using rocket (but it's not a huge priority, since linux quite reliably uses aligned buffers when writing to disk) :)

I ended up modifying the disk-write (mem-to-sector) FSM to iterate over multiple single-sector store-and-forward operations, which allows for future improvements without further modifications to the software-visible API (i.e., the gateware register map).

@enjoy-digital, please take a look and apply if you think it's ready.

gsomlo commented 1 year ago

@enjoy-digital -- in addition to the overall review, please think about whether you have a better suggestion for the name of the sector-count register :) If we end up renaming it, I can respin both this PR and https://github.com/enjoy-digital/litex/pull/1715 to reflect that, before applying either, to reduce "churn"...

EDIT: I've tested multiple writes by filling 8192 bytes of RAM with a byte pattern, then doing a multi-sector RAM-to-SATA transfer using one of the new bios commands from https://github.com/enjoy-digital/litex/pull/1715 to write 16 sectors to disk. I was able to read back the data from the disk, by trying several of the 16 contiguous sectors at random.

I am getting some lock-ups in Linux when trying to copy large files between partitions of the sata disk, but not before significant portions of the data have been copying. I therefore suspect a kernel driver locking/mutual-exclusion issue rather than instability in the updated gateware, but am not yet 100% certain.

@enjoy-digital -- do you have some stress-testing procedure to determine if LiteSATA reads and writes are reliable?

I think we should try something like that before applying this, so if you don't have anything on your end I'll search around and implement something in the bios. E.g., loop around initializing some random number of contiguous 512-byte memory blocks, write them to the sata disk, read them back to someplace else in RAM, compare for differences, repeat ad nauseam with randomly chosen multi-sector chunk sizes, etc...

enjoy-digital commented 1 year ago

Hi @gsomlo,

this looks good. For the name of the additional register I would just avoid the abbreviation and call it sector_count. LiteSATA has mostly been used with directly FPGA logic driving it, so I mostly have test for this (BIST). Feel free to add some stress test in the BIOS for this, this would also be useful for the LiteSDCard and would allow us to detect eventual issue before using the core in Linux.

Thanks!

Florent

gsomlo commented 1 year ago

So far, testing with sata-gen=2 on vexriscv at 100MHz (on nexys video), it turns out that multisector store-and-forward writes are rock solid.

Multi-sector streaming reads only seem to work for up to 16 sectors, and increasing the buffer size doesn't seem to help.

I'm inclined to rewrite multisector reads the same way I've done writes, by iterating over single-sector reads in the gateware FSM.

That way we can reliably read an arbitrary number of sectors, without worrying about stalls and whatever else causes problems when streaming more than 16 sectors in read mode.

I'll test with rocket and sata-gen 1 at 50MHz as well, and rename the register.

I'll post here once that's done.

Thanks for the feedback! --Gabriel

gsomlo commented 1 year ago

@enjoy-digital -- please have another final look and apply if it all looks good to you.

I've decided to modify the FSMs to loop over single-sector transfers rather than try to use "true" streaming, even in the sector2mem (read) direction. But the software-visible register (CSR) interface should not require further changes in case we ever decide to try to optimize the gateware to do "for-real" streaming instead of "store-and-forward".

Either way, allowing the hardware/gateware to DMA multiple sectors at a time should allow e.g. Linux to work much more efficiently in interrupt-driven mode, with transfer speeds comparable to when polling mode is used.

Thanks, --Gabriel

enjoy-digital commented 1 year ago

Thanks a lot @gsomlo, this looks good to me!