Closed gsomlo closed 1 year ago
One alternative to overcomplicating the litesata gateware would be to simply handle transfer sizes of either 1 or 8 sectors, and nothing else.
I've observed that each bvec
worth of data to be transfered inside the linux drivers is either exactly one, or exactly eight sectors in size. I've added a warning print statement to notify us if ever the litesata linux driver is transfering a block that's sized other than 1 or 8 sectors.
But assuming that's rare, we could, instead of a 16-bit register containing an arbitrary sector count, use a single-bit register (or additional bit flag in an existing register) to specify whether the transfer will be 8 contiguous sectors or just one sector.
We can then use 1-sector-at-a-time mode for anything that's not a multiple of 8, and keep the gateware as simple as possible...
@enjoy-digital -- what do you think?
@gsomlo: Thanks for the RFC, I'll look at this on Monday.
On Sat, Aug 27, 2022 at 12:08:20AM -0700, enjoy-digital wrote:
@gsomlo: Thanks for the RFC, I'll look at this on Monday.
OK, one more thought -- if decoupling the buffer size from the size of the data unit being transferred (512 or 4096 bytes are the two values I've observed Linux use so far) would over-complicate the design too much, hard-coding buffers 8 times as large as until now might be made optional -- if the gateware is built with support for 4096-byte transfers as an option, there might be a flag indicating so in phy or identify or whatnot, and the DMA start register might have two bits instead of one, and the Linux driver might chose to do 4096-byte DMA transfers if the gateware indicates it has that capability...
Anyway, once you've had a chance to look over my "rough draft" maybe we can chat on IRC before deciding what the best option might be...
Hi @gsomlo,
sorry for the delay. It would in fact be possible to support multi-sectors transfer probably without increasing the FIFO depth (or by just doubling it):
I can do the implementation and do simple test with the LiteX BIOS and you'll then be able to test with the Linux driver. Does it make sense to you?
Hi @enjoy-digital,
It would in fact be possible to support multi-sectors transfer probably without increasing the FIFO depth (or by just doubling it):
Sounds interesting -- if you can "stream" through the FIFO (instead of what currently appears to be a store-and-forward mode of operation) we could then figure out the optimal FIFO size that's still economical in terms of FPGA memory utilization but gets us a good data transfer rate (i.e., without brute-force hard-coding the FIFO to 8x its current size, like what my naive experiment did :)
The largest individual data size transfered by the linux driver in its current form is 8 sectors (4096 bytes, or a standard memory page).
I'm not quite sure how likely it is that we'd need larger than 8-sector contiguous transfer capability without adding support for hardware-based scatter-gather lists (which would probably be overkill if the goal is to keep LiteSATA as simple as reasonably possible). @geertu: maybe you might be able to comment on that?
Either way, I'd be happy to test a streaming-FIFO PR against LiteSATA, and do measurements on FIFO length vs. transfer speeds to come up with good defaults!
Good, I'll work on implementing this then :)
512 bytes is the sector size on traditional HDDs. 4096 bytes is the sector size on some recent HDDs. As OpenRISC uses a page size of 8 KiB, I guess Linux will usually read 16 sectors on OpenRISC?
hey @enjoy-digital -- I happen to have a few days to focus on this, so I was wondering if either
or
Thanks!
I've been playing with LiteSATA a bit over the last few days, trying to get myself reacquainted with its operation.
According to some tests I just ran, the portion of the proposed PR addressing disk reads (LiteSATASector1MemDMA
) should work fine as-is: multiple sectors can be streamed through a sector buffer as small as the original single sector (logical_sector_size//port_bytes
). I did test a 16-sector disk-to-RAM transfer through a single-sector sized buffer successfully, on both rocket and vexriscv!
I will now focus on the disk write portion (LiteSATAMem2SectorDMA
), where the FSM loads the buffer in a separate loop (state READ-DATA-DMA
) from the portion where the buffer is then pushed out to disk (state SEND-CMD-AND-DATA
). If I could somehow merge these two FSM states into a single one that streams the entire multi-sector data chunk out to disk in a single loop, rather than the current "store-and-forward" mode, progress might happen :)
Speaking of disk writes. If, using rocket at 50MHz with --sata-gen 1
, I write a disk sector:
litex> sata_write 4321000 xy
Memory dump:
0x11001ce0 78 79 78 79 78 79 78 79 78 79 78 79 78 79 78 79 xyxyxyxyxyxyxyxy
0x11001cf0 78 79 78 79 78 79 78 79 78 79 78 79 78 79 78 79 xyxyxyxyxyxyxyxy
0x11001d00 78 79 78 79 78 79 78 79 78 79 78 79 78 79 78 79 xyxyxyxyxyxyxyxy
0x11001d10 78 79 78 79 78 79 78 79 78 79 78 79 78 79 78 79 xyxyxyxyxyxyxyxy
0x11001d20 78 79 78 79 78 79 78 79 78 79 78 79 78 79 78 79 xyxyxyxyxyxyxyxy
0x11001d30 78 79 78 79 78 79 78 79 78 79 78 79 78 79 78 79 xyxyxyxyxyxyxyxy
0x11001d40 78 79 78 79 78 79 78 79 78 79 78 79 78 79 78 79 xyxyxyxyxyxyxyxy
0x11001d50 78 79 78 79 78 79 78 79 78 79 78 79 78 79 78 79 xyxyxyxyxyxyxyxy
0x11001d60 78 79 78 79 78 79 78 79 78 79 78 79 78 79 78 79 xyxyxyxyxyxyxyxy
0x11001d70 78 79 78 79 78 79 78 79 78 79 78 79 78 79 78 79 xyxyxyxyxyxyxyxy
0x11001d80 78 79 78 79 78 79 78 79 78 79 78 79 78 79 78 79 xyxyxyxyxyxyxyxy
0x11001d90 78 79 78 79 78 79 78 79 78 79 78 79 78 79 78 79 xyxyxyxyxyxyxyxy
0x11001da0 78 79 78 79 78 79 78 79 78 79 78 79 78 79 78 79 xyxyxyxyxyxyxyxy
0x11001db0 78 79 78 79 78 79 78 79 78 79 78 79 78 79 78 79 xyxyxyxyxyxyxyxy
0x11001dc0 78 79 78 79 78 79 78 79 78 79 78 79 78 79 78 79 xyxyxyxyxyxyxyxy
0x11001dd0 78 79 78 79 78 79 78 79 78 79 78 79 78 79 78 79 xyxyxyxyxyxyxyxy
0x11001de0 78 79 78 79 78 79 78 79 78 79 78 79 78 79 78 79 xyxyxyxyxyxyxyxy
0x11001df0 78 79 78 79 78 79 78 79 78 79 78 79 78 79 78 79 xyxyxyxyxyxyxyxy
0x11001e00 78 79 78 79 78 79 78 79 78 79 78 79 78 79 78 79 xyxyxyxyxyxyxyxy
0x11001e10 78 79 78 79 78 79 78 79 78 79 78 79 78 79 78 79 xyxyxyxyxyxyxyxy
0x11001e20 78 79 78 79 78 79 78 79 78 79 78 79 78 79 78 79 xyxyxyxyxyxyxyxy
0x11001e30 78 79 78 79 78 79 78 79 78 79 78 79 78 79 78 79 xyxyxyxyxyxyxyxy
0x11001e40 78 79 78 79 78 79 78 79 78 79 78 79 78 79 78 79 xyxyxyxyxyxyxyxy
0x11001e50 78 79 78 79 78 79 78 79 78 79 78 79 78 79 78 79 xyxyxyxyxyxyxyxy
0x11001e60 78 79 78 79 78 79 78 79 78 79 78 79 78 79 78 79 xyxyxyxyxyxyxyxy
0x11001e70 78 79 78 79 78 79 78 79 78 79 78 79 78 79 78 79 xyxyxyxyxyxyxyxy
0x11001e80 78 79 78 79 78 79 78 79 78 79 78 79 78 79 78 79 xyxyxyxyxyxyxyxy
0x11001e90 78 79 78 79 78 79 78 79 78 79 78 79 78 79 78 79 xyxyxyxyxyxyxyxy
0x11001ea0 78 79 78 79 78 79 78 79 78 79 78 79 78 79 78 79 xyxyxyxyxyxyxyxy
0x11001eb0 78 79 78 79 78 79 78 79 78 79 78 79 78 79 78 79 xyxyxyxyxyxyxyxy
0x11001ec0 78 79 78 79 78 79 78 79 78 79 78 79 78 79 78 79 xyxyxyxyxyxyxyxy
0x11001ed0 78 79 78 79 78 79 78 79 78 79 78 79 78 79 78 79 xyxyxyxyxyxyxyxy
and then try to read it back out, I get:
litex> sata_read 4321000
Memory dump:
0x11001cf0 78 79 78 79 78 79 78 79 78 79 78 79 78 79 78 79 xyxyxyxyxyxyxyxy
0x11001d00 78 79 78 79 78 79 78 79 78 79 78 79 78 79 78 79 xyxyxyxyxyxyxyxy
0x11001d10 78 79 78 79 78 79 78 79 78 79 78 79 78 79 78 79 xyxyxyxyxyxyxyxy
0x11001d20 78 79 78 79 78 79 78 79 78 79 78 79 78 79 78 79 xyxyxyxyxyxyxyxy
0x11001d30 78 79 78 79 78 79 78 79 78 79 78 79 78 79 78 79 xyxyxyxyxyxyxyxy
0x11001d40 78 79 78 79 78 79 78 79 78 79 78 79 78 79 78 79 xyxyxyxyxyxyxyxy
0x11001d50 78 79 78 79 78 79 78 79 78 79 78 79 78 79 78 79 xyxyxyxyxyxyxyxy
0x11001d60 78 79 78 79 78 79 78 79 78 79 78 79 78 79 78 79 xyxyxyxyxyxyxyxy
0x11001d70 78 79 78 79 78 79 78 79 78 79 78 79 78 79 78 79 xyxyxyxyxyxyxyxy
0x11001d80 78 79 78 79 78 79 78 79 78 79 78 79 78 79 78 79 xyxyxyxyxyxyxyxy
0x11001d90 78 79 78 79 78 79 78 79 78 79 78 79 78 79 78 79 xyxyxyxyxyxyxyxy
0x11001da0 78 79 78 79 78 79 78 79 78 79 78 79 78 79 78 79 xyxyxyxyxyxyxyxy
0x11001db0 78 79 78 79 78 79 78 79 78 79 78 79 78 79 78 79 xyxyxyxyxyxyxyxy
0x11001dc0 78 79 78 79 78 79 78 79 78 79 78 79 78 79 78 79 xyxyxyxyxyxyxyxy
0x11001dd0 78 79 78 79 78 79 78 79 78 79 78 79 78 79 78 79 xyxyxyxyxyxyxyxy
0x11001de0 78 79 78 79 78 79 78 79 78 79 78 79 78 79 78 79 xyxyxyxyxyxyxyxy
0x11001df0 78 79 78 79 78 79 78 79 78 79 78 79 78 79 78 79 xyxyxyxyxyxyxyxy
0x11001e00 78 79 78 79 78 79 78 79 78 79 78 79 78 79 78 79 xyxyxyxyxyxyxyxy
0x11001e10 78 79 78 79 78 79 78 79 78 79 78 79 78 79 78 79 xyxyxyxyxyxyxyxy
0x11001e20 78 79 78 79 78 79 78 79 78 79 78 79 78 79 78 79 xyxyxyxyxyxyxyxy
0x11001e30 78 79 78 79 78 79 78 79 78 79 78 79 78 79 78 79 xyxyxyxyxyxyxyxy
0x11001e40 78 79 78 79 78 79 78 79 78 79 78 79 78 79 78 79 xyxyxyxyxyxyxyxy
0x11001e50 78 79 78 79 78 79 78 79 78 79 78 79 78 79 78 79 xyxyxyxyxyxyxyxy
0x11001e60 78 79 78 79 78 79 78 79 78 79 78 79 78 79 78 79 xyxyxyxyxyxyxyxy
0x11001e70 78 79 78 79 78 79 78 79 00 00 00 00 00 00 00 00 xyxyxyxy........
0x11001e80 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
0x11001e90 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
0x11001ea0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
0x11001eb0 00 00 00 00 00 00 00 00 e8 69 00 10 00 00 00 00 .........i......
0x11001ec0 d0 5b 00 10 00 00 00 00 c8 69 00 10 00 00 00 00 .[.......i......
0x11001ed0 e0 69 00 10 00 00 00 00 e0 25 00 10 00 00 00 00 .i.......%......
0x11001ee0 78 79 78 79 78 79 78 79 f8 1e 00 11 00 00 00 00 xyxyxyxy........
This is not reproducible using vexriscv at any frequency (50 or 100 MHz) or sata generation (1 or 2).
Funny thing is, even on rocket at 50MHz and sata-gen 1
, the problem goes away if I enforce 512-byte alignment on the buffer being used:
diff --git a/litex/soc/software/bios/cmds/cmd_litesata.c b/litex/soc/software/bios/cmds/cmd_litesata.c
index 81199833..6ddc2e70 100644
--- a/litex/soc/software/bios/cmds/cmd_litesata.c
+++ b/litex/soc/software/bios/cmds/cmd_litesata.c
@@ -40,7 +40,8 @@ static void sata_read_handler(int nb_params, char **params)
{
unsigned int sector;
char *c;
- uint8_t buf[512];
+ uint8_t foo[1024];
+ uint8_t *buf = (uint8_t *)(((unsigned long)foo + 0x1ff) & 0xfffffe00);
if (nb_params < 1) {
printf("sata_read <sector>");
@@ -70,7 +71,8 @@ define_command(sata_read, sata_read_handler, "Read SATA sector", LITESATA_CMDS);
static void sata_write_handler(int nb_params, char **params)
{
int i;
- uint8_t buf[512];
+ uint8_t foo[1024];
+ uint8_t *buf = (uint8_t *)(((unsigned long)foo + 0x1ff) & 0xfffffe00);
unsigned int sector;
char *c;
I think this has something to do with stalls or similar delays caused by Rocket's internal cache consistency logic (since on rocket we're DMA-ing through the CPU block's dedicated axi port and back out to a dedicated RAM port, rather than via a shared bus, the way I think it's still set up on vexriscv).
So, DMA writes in particular seem to be timing and stall sensitive, a fact exposed by using Rocket's dma setup.
As mentioned earlier, I'll "zoom in" on LiteSATA disk write logic, and see what I can come up with in terms of converting store-and-forward buffered writes into streaming mode...
Any comments/clue/advice much welcome. @enjoy-digital please LMK if you have any thoughts :)
Thanks, --G
I tested this successfully on rocket and vexriscv, from the bios, and from linux on rocket only.
I still don't know about exactly why unaligned memory buffers cause garbage to be written when using rocket (but it's not a huge priority, since linux quite reliably uses aligned buffers when writing to disk) :)
I ended up modifying the disk-write (mem-to-sector) FSM to iterate over multiple single-sector store-and-forward operations, which allows for future improvements without further modifications to the software-visible API (i.e., the gateware register map).
@enjoy-digital, please take a look and apply if you think it's ready.
@enjoy-digital -- in addition to the overall review, please think about whether you have a better suggestion for the name of the sector-count register :) If we end up renaming it, I can respin both this PR and https://github.com/enjoy-digital/litex/pull/1715 to reflect that, before applying either, to reduce "churn"...
EDIT: I've tested multiple writes by filling 8192 bytes of RAM with a byte pattern, then doing a multi-sector RAM-to-SATA transfer using one of the new bios commands from https://github.com/enjoy-digital/litex/pull/1715 to write 16 sectors to disk. I was able to read back the data from the disk, by trying several of the 16 contiguous sectors at random.
I am getting some lock-ups in Linux when trying to copy large files between partitions of the sata disk, but not before significant portions of the data have been copying. I therefore suspect a kernel driver locking/mutual-exclusion issue rather than instability in the updated gateware, but am not yet 100% certain.
@enjoy-digital -- do you have some stress-testing procedure to determine if LiteSATA reads and writes are reliable?
I think we should try something like that before applying this, so if you don't have anything on your end I'll search around and implement something in the bios. E.g., loop around initializing some random number of contiguous 512-byte memory blocks, write them to the sata disk, read them back to someplace else in RAM, compare for differences, repeat ad nauseam with randomly chosen multi-sector chunk sizes, etc...
Hi @gsomlo,
this looks good. For the name of the additional register I would just avoid the abbreviation and call it sector_count
. LiteSATA has mostly been used with directly FPGA logic driving it, so I mostly have test for this (BIST). Feel free to add some stress test in the BIOS for this, this would also be useful for the LiteSDCard and would allow us to detect eventual issue before using the core in Linux.
Thanks!
Florent
So far, testing with sata-gen=2 on vexriscv at 100MHz (on nexys video), it turns out that multisector store-and-forward writes are rock solid.
Multi-sector streaming reads only seem to work for up to 16 sectors, and increasing the buffer size doesn't seem to help.
I'm inclined to rewrite multisector reads the same way I've done writes, by iterating over single-sector reads in the gateware FSM.
That way we can reliably read an arbitrary number of sectors, without worrying about stalls and whatever else causes problems when streaming more than 16 sectors in read mode.
I'll test with rocket and sata-gen 1 at 50MHz as well, and rename the register.
I'll post here once that's done.
Thanks for the feedback! --Gabriel
@enjoy-digital -- please have another final look and apply if it all looks good to you.
I've decided to modify the FSMs to loop over single-sector transfers rather than try to use "true" streaming, even in the sector2mem (read) direction. But the software-visible register (CSR) interface should not require further changes in case we ever decide to try to optimize the gateware to do "for-real" streaming instead of "store-and-forward".
Either way, allowing the hardware/gateware to DMA multiple sectors at a time should allow e.g. Linux to work much more efficiently in interrupt-driven mode, with transfer speeds comparable to when polling mode is used.
Thanks, --Gabriel
Thanks a lot @gsomlo, this looks good to me!
CAUTION: DO NOT APPLY (yet)!]Allow up to 8 contiguous sectors to be transfered in a single DMA operation.
The max. 8 limit is based on my observation of the linux kernel's behavior, but seems rather arbitrary.
CAUTION: I also kinda sorta hacked the LiteSATA DMA frontend to increase the sector-buffer FIFO(s) by a factor of 8 (to fit up to 8 sectors). It might make more sense to decouple the FIFO size from the max. transfer size and have it work in a true "streaming" fashion, but for that I'd have to study LiteX streams, the SATA protocol, etc. in a LOT more detail :)
At the moment, I just want to start talking about this rather than push an actual change into upstream LiteSATA.
Statistics (EDIT: on nexys_video, with
sata-gen 1
, quad-core fpu-enabled rocket at 50MHz):Signed-off-by: Gabriel Somlo gsomlo@gmail.com