enjoy-digital / litex

Build your hardware, easily!

RocketChip: MEM port to LiteDRAM width matching (internal vs. LiteX AXIUpConverter) #1753

Open gsomlo opened 1 year ago

gsomlo commented 1 year ago

First, let's start with a diagram of how RocketChip is wired into LiteX:
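(Rough text sketch of the wiring; the points below spell out the same connections:)

                 +------------+   MEM port (AXI, point-to-point)   +----------+
DMA masters ---->| RocketChip |----------------------------------->| LiteDRAM |
(LiteSDCard,     |  (caches)  |                                    +----------+
 LiteSATA)       +-----+------+
                       | MMIO port
                       v
                 LiteX Wishbone bus (CSRs)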

- Rocket has a MEM port which we wire directly to LiteDRAM over a point-to-point AXI link.

- MMIO (LiteX CSR) access is routed over a separate port wired to the LiteX Wishbone bus.

- Finally, DMA is routed through RocketChip, which is supposed to maintain coherence between LiteDRAM and its internal caches.

RocketChip's various AXI port widths can be configured independently. We currently keep MMIO and DMA at a constant 64-bit width, and build (sub-)variants with MEM port widths of 64, 128, 256, or 512 bits to match the LiteDRAM port widths of the various FPGA development boards.

If the chosen Rocket (sub-)variant's MEM port width doesn't match LiteDRAM's, LiteX inserts an AXIUpConverter and issues a warning: https://github.com/enjoy-digital/litex/blob/master/litex/soc/integration/soc.py#L1564
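For context, the decision boils down to something like this (a paraphrased, simplified sketch of the logic around the line linked above, not the verbatim LiteX code; connect_mem_port is my own illustrative name):

from litex.soc.interconnect import axi

def connect_mem_port(soc, cpu_mem_bus, litedram_port):
    # Same data width: point-to-point connection, no conversion needed.
    if cpu_mem_bus.data_width == litedram_port.data_width:
        return cpu_mem_bus
    # Width mismatch: warn and insert an up-converter in the path.
    soc.logger.warning("Converting MEM data width to match LiteDRAM.")
    wide = axi.AXIInterface(
        data_width    = litedram_port.data_width,
        address_width = cpu_mem_bus.address_width)
    soc.submodules += axi.AXIUpConverter(axi_from=cpu_mem_bus, axi_to=wide)
    return wide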

The problem: when using a Rocket variant with a 64-bit-wide MEM port and a wider (128-bit or more) LiteDRAM (i.e., with the LiteX-supplied AXIUpConverter in the path), memory accesses by the Rocket core itself seem to work fine, but DMA (booting from or accessing LiteSDCard or LiteSATA) fails: the data loaded from external devices ends up corrupted. I have not (yet) figured out the nature of this corruption. So far the symptom is that I can boot such a system via netboot, but if I try loading the same binaries from sdcard or sata, the process hangs at "liftoff".

The other problem is that when Rocket is augmented with its own internal L2 cache, it only works with a 64-bit wide MEM port, which precludes its use with SoCs using a wider LiteDRAM. Related tickets are here:

(The latter is me inquiring about assumptions/restrictions regarding the relative widths of Rocket's ports.)

While the Rocket L2 cache's inability to work with wider MEM ports is a problem, so is the fact that the LiteX AXIUpConverter seems to break DMA transfers routed through Rocket, and the latter is what this ticket is about...

I'll try to do some more troubleshooting as I get spare cycles, but any help/ideas/suggestions are very much welcome!

gsomlo commented 1 year ago

The thing I find particularly confusing: whether I've built my bitstream with --cpu-mem-width 1 (LiteX provides the AXIUpConverter) or with --cpu-mem-width 2 (the Rocket variant has a matching 128-bit MEM port, e.g. on the nexys_video), reading some random sata or sdcard block (with sata_read X or sdcard_read Y) displays the same data (for the same value of X or Y). Yet native Rocket width matching is the only configuration that boots (passes "liftoff") from sata or sdcard.

I therefore continue to have no idea why netboot (no DMA) works in both cases, while sdcard/sata booting only works for Rocket variants that natively match the LiteDRAM width, and not when LiteX provides its own AXIUpConverter.

n-kremeris commented 1 year ago

@gsomlo I can confirm that booting Linux from SD with --cpu-mem-width 2 (128 bits) did work when we tested it (~2 months ago) on Digilent Nexys Video (no L2 cache)

gsomlo commented 1 year ago

booting Linux from SD with --cpu-mem-width 2 (128 bits) did work [...] on Digilent Nexys Video (no L2 cache)

If you get a chance to try with --cpu-mem-width 1 (64-bit width, using the LiteX AXIUpConverter), I expect you will be able to netboot and run Linux, which means CPU-to-RAM traffic works fine through the AXIUpConverter. You will not be able to boot from sdcard, which means device-to-RAM DMA is somehow compromised by (at least) one as-yet-unknown component in the path, in an as-yet-unknown way... If you get a chance to test/confirm this part as well, that'd be a nice sanity check.

This DMA problem does not occur when Rocket itself matches the width of LiteDRAM (AXIUpConverter is not used).

So far I've kept the DMA and MMIO port widths at 64 bits in all cases. I have to wonder whether matching the DMA and MEM port widths would help (i.e., whether that's an "unwritten" assumption in the Rocket sources). Unfortunately Chisel isn't (yet) readable to me, so I'll have to keep testing random combinations and crazy hypotheses to find out what the actual "terms and conditions" are... :)

gsomlo commented 1 year ago

Ran some tests today. Since LiteETH doesn't use DMA, I can netboot Linux on both the 2x (128-bit) and 1x (64-bit) MEM port versions (on the nexys_video). Both detect the litesata and mmcblk0 (SATA and, respectively, LiteSDCard) devices. The 2x version also successfully detects partitions on these devices, whereas the 1x version does not (device-to-memory DMA transfers are corrupted somehow). Since litesata is detected, I decided to run:

dd if=/dev/litesata bs=1M count=1 | hexdump

on both versions. The correct output (from the 2x wide version) is:

0000000 0000 0000 0000 0000 0000 0000 0000 0000
*
00001b0 0000 0000 0000 0000 7b5b 0558 0000 2000
00001c0 0021 3506 0570 0800 0000 0000 0040 3500
00001d0 0571 fe83 ffff 0800 0040 5170 1cdc 0000
00001e0 0000 0000 0000 0000 0000 0000 0000 0000
00001f0 0000 0000 0000 0000 0000 0000 0000 aa55
0000200 0000 0000 0000 0000 0000 0000 0000 0000
*
0100000

When I run the same command on the 1x-wide version (which uses the LiteX AXIUpConverter between the MEM port and LiteDRAM, successfully I might add, since it can boot Linux over ethernet that way!), I get this:

0000000 0000 0000 0000 0000 2efc 804f ffff ffff
*
00001b0 7b5b 0558 0000 2000 2efc 804f ffff ffff
00001c0 0000 0000 0040 3500 2efc 804f ffff ffff
00001d0 0040 5170 1cdc 0000 2efc 804f ffff ffff
00001e0 0000 0000 0000 0000 2efc 804f ffff ffff
00001f0 0000 0000 0000 aa55 2efc 804f ffff ffff
0000200 0000 0000 0000 0000 0003 0000 0002 0000
*
0000220 0000 0000 0000 0000 0005 0000 0000 0000
0000230 0000 0000 0000 0000 0000 0000 0000 0000
*
00002c0 0000 0000 0000 0000 d000 0003 ffc8 ffff
*
00002e0 0000 0000 0000 0000 8000 0033 ffc8 ffff
00002f0 0000 0000 0000 0000 0002 0000 0001 0000
0000300 0000 0000 0000 0000 0338 02d0 ffd8 ffff
*
0000360 0000 0000 0000 0000 eb80 014f ffd8 ffff
0000370 0000 0000 0000 0000 0000 0000 0000 0000
*
00003b0 0000 0000 0000 0000 cb20 bd74 0000 0000
*
00003e0 0000 0000 0000 0000 0000 0800 60dc 0001
... <keeps printing garbage non-zero data>

or, on a second attempt:

0000000 0000 0000 0000 0000 30de 8001 ffff ffff
*
00001b0 7b5b 0558 0000 2000 30de 8001 ffff ffff
00001c0 0000 0000 0040 3500 30de 8001 ffff ffff
00001d0 0040 5170 1cdc 0000 30de 8001 ffff ffff
00001e0 0000 0000 0000 0000 30de 8001 ffff ffff
00001f0 0000 0000 0000 aa55 30de 8001 ffff ffff
0000200 0000 0000 0000 0000 8000 0034 ffc8 ffff
0000210 0000 0000 0000 0000 0000 0000 0000 0000
*
0000230 0000 0000 0000 0000 137d 0000 0000 0000
0000240 0000 0000 0000 0000 5c00 1f4d 0005 0000
*
0000260 0000 0000 0000 0000 b020 8001 ffff ffff
0000270 0000 0000 0000 0000 0000 0000 0000 0000
*
00002f0 0000 0000 0000 0000 d000 0003 ffc8 ffff
0000300 0000 0000 0000 0000 001a 8000 001a 0000
... <keeps printing garbage non-zero data>

The second 8 bytes of each 16-byte group appear to be corrupted, and at some point later in the data a shift by 8 bytes occurs, with garbage being read in on an ongoing basis.
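For anyone wanting to characterize this kind of corruption, here is a small helper (my own sketch, not part of LiteX) that compares a known-good buffer against a corrupted one in 8-byte lanes per 16-byte group, matching the pattern described above:

# Hypothetical helper: report which 64-bit lane of each 128-bit group
# differs between a known-good dump and a corrupted one.
def diff_lanes(good: bytes, bad: bytes, group=16, lane=8):
    for off in range(0, min(len(good), len(bad)), group):
        for l in range(group // lane):
            a = good[off + l*lane : off + (l+1)*lane]
            b = bad [off + l*lane : off + (l+1)*lane]
            if a != b:
                print(f"0x{off:07x} lane {l}: {a.hex()} != {b.hex()}")

# Usage (buffers captured e.g. with dd from /dev/litesata on each build):
# diff_lanes(open("good.bin", "rb").read(), open("bad.bin", "rb").read())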

I still don't know how/where the corruption occurs, but figured I'd start with getting an idea of what it even looks like.

Investigation continues... :)

Dolu1990 commented 1 year ago

The issue is very likely the AXI4 upsizer. Looking at the AXI4 upsizer code, I can see:

        # W Channel.
        w_converter = stream.StrideConverter(
            description_from = [("data", dw_from), ("strb", dw_from//8)],
            description_to   = [("data",   dw_to), ("strb",   dw_to//8)],
        )
        self.submodules += w_converter
        self.comb += axi_from.w.connect(w_converter.sink, omit={"id", "dest", "user"})
        self.comb += w_converter.source.connect(axi_to.w)
        self.comb += axi_to.w.id.eq(axi_from.w.id)
        self.comb += axi_to.w.dest.eq(axi_from.w.dest)
        self.comb += axi_to.w.user.eq(axi_from.w.user)
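        # (Note: nothing above consults the AW channel's address, so
        # narrow W beats are packed purely sequentially into wide beats.)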

Thing is, that code does not seem to take into account the alignment of the AW channel address, which is, I think, the issue. You can't make an AXI upsizer without the W channel checking the alignment of the address.

I think the R channel handling also has the same issue.
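Concretely, the missing piece looks something like this (a plain-Python behavioral sketch of address-aware lane placement, my own illustration rather than proposed LiteX code; names are illustrative):

def upsize_write_beat(aw_addr, beat_index, wdata, wstrb, dw_from=64, dw_to=128):
    bytes_from = dw_from // 8  # 8-byte narrow beats
    bytes_to   = dw_to   // 8  # 16-byte wide beats
    # The byte offset of this beat inside the wide word comes from the
    # *address*, not just from the beat count (INCR burst assumed):
    beat_addr = aw_addr + beat_index * bytes_from
    lane = (beat_addr % bytes_to) // bytes_from
    return (wdata << lane * dw_from), (wstrb << lane * bytes_from)

# A single 4-byte write to address 0x8 must land in the upper 64-bit
# lane of the 128-bit word (strb bits 8..11); a sequential packer that
# ignores aw_addr would put it in the lower lane instead.
data, strb = upsize_write_beat(aw_addr=0x8, beat_index=0, wdata=0xDEADBEEF, wstrb=0x0F)
assert strb == 0x0F << 8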

Dolu1990 commented 1 year ago

So, I tested the NaxRiscv SoC with a matching AXI data width to avoid the LiteX upsizer. This works around the issue well; I can get the SDCARD to work in Linux.

The proper fix would be to reimplement the LiteX AXI4 upsizer.

gsomlo commented 1 year ago

On Tue, Sep 05, 2023 at 07:36:45AM -0700, Dolu1990 wrote:

Proper fix would be to reimplement the litex AXI4 upsizer.

The interesting/confusing question I have is:

How come this only manifests when routing DMA through the CPU, whereas CPU-to-LiteDRAM accesses through the upsizer work fine?

I'm guessing the answer lies in the fine details of the AXI protocol vs. the specific upsizer implementation thereof, but it's still weird... :)

Dolu1990 commented 1 year ago

How come this only manifests when routing DMA through the CPU, whereas CPU-to-LiteDRAM accesses through the upsizer work fine?

The reason is that the CPUs (Rocket with L2 / NaxRiscv) use write-back caches, so only 64-byte-aligned accesses are issued, and the current upsizer is OK for those. The DMA accesses are only 4 bytes wide (at least with the sdcard), which makes the issue appear.
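A quick way to see this (my own sketch, assuming the 64-to-128-bit case from the nexys_video discussion above):

bytes_from, bytes_to = 8, 16  # 64-bit narrow beats, 128-bit wide words
lane = lambda addr: (addr % bytes_to) // bytes_from

# A 64-byte-aligned cache-line writeback emits beats at 0x00, 0x08, ..., 0x38:
print([lane(a) for a in range(0x00, 0x40, bytes_from)])  # [0, 1, 0, 1, 0, 1, 0, 1]
# Alternating lanes starting at lane 0 -- exactly what a purely
# sequential packer produces anyway, so cached CPU traffic survives.

# A 4-byte DMA write can target any 4-byte-aligned address:
print(lane(0x04), lane(0x08), lane(0x18))  # 0 1 1
# Half the time the beat belongs in the upper lane, which a packer
# that never looks at the address will put in the lower lane instead.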

I'm guessing the answer lies in the fine details of the AXI protocol

The more I understand the nasty details of AXI, the more I love TileLink XD

gsomlo commented 1 year ago

My plan moving forward is to remove all Rocket variants with a MEM port wider than 64 bits from pythondata-cpu-rocket, after first ensuring that the LiteX AXI up-converter is fixed, whether by myself at some point when I get enough spare cycles, or by someone who pre-empts me and does it first (@enjoy-digital @trabucayre @Dolu1990) :)

In the long term, finding a rv64gc model with FPU, MMU, and S-mode support capable of running e.g. Fedora, other than Rocket (preferably one written directly in Verilog, so I don't have to learn two additional object-oriented languages to understand how it all works), will be a personal goal of mine :) Either way, having a fully compliant up-converter for boards with wider LiteDRAM will come in handy, so that's the highest priority...

Dolu1990 commented 1 year ago

@gsomlo Just a recent update: I got dual-core NaxRiscv to run Debian well on the Digilent Nexys Video, including DMA peripherals (sdcard + USB OHCI host).

I documented it here : https://github.com/SpinalHDL/NaxSoftware/tree/main/debian_litex

Note the patch to the LiteX sdcard driver: sed -i 's/SD_SLEEP_US 5/SD_SLEEP_US 0/g' drivers/mmc/host/litex_mmc.c, which multiplies the speed by 8.

enjoy-digital commented 1 year ago

@gsomlo: I was also going to suggest doing some tests with @Dolu1990's NaxRiscv :)

We are currently busy with other projects with @trabucayre, but we will try to reproduce @Dolu1990's results very soon and help with the tests/integration.

gsomlo commented 1 year ago

@Dolu1990: I'm really excited about NaxRiscv and its potential to handle "real" distros, and, most importantly, to be a potential replacement for Rocket :D I'm planning on finding a SpinalHDL tutorial to determine for myself whether understanding it is materially less opaque than trying to read Chisel :)

Anyway, regarding this:

sed -i 's/SD_SLEEP_US 5/SD_SLEEP_US 0/g' drivers/mmc/host/litex_mmc.c

Why do you think reducing SD_SLEEP_US to 0 makes such a big difference for you? Any chance you're not using the IRQ for the sdcard? As far as I can tell, once a DMA transfer completes and you're woken up by the IRQ, the "poll interval" shouldn't matter: the "DONE" bits are already set everywhere and the poll function succeeds on the first try.

The interval "should" only come into play when you're really polling, in which case polling harder (at shorter intervals) lets you know that your transfer is done earlier (at the expense of spinning the CPU, of course).

Any thoughts? (We can of course take this off-line, as it's a bit off-topic for this issue :D)

Dolu1990 commented 1 year ago

@gsomlo Keep in mind the SpinalHDL TileLink stuff is very recent, and some corners are rough XD I plan to implement/test an L2 cache, then I will do a second round to polish the SoC side of things.

Here is a talk about it:

Why do you think reducing SD_SLEEP_US to 0 makes such a big difference for you?

The performance issue seems to be related to: https://github.com/litex-hub/linux/blob/f0b0d403eabbe135d8dbb40ad5e41018947d336c/drivers/mmc/host/litex_mmc.c#L107

In my measurements, that wait was only woken up to check status on the next kernel timer tick (4 ms), instead of after just a few microseconds, slowing things down a lot. The IRQ was functional in the test; I mean, I had mmc interrupts counting up in /proc/interrupts.

With that patch, it went from 400 KB/s to 4000 KB/s.

As far as I'm able to tell, once a DMA transfer completes and you're woken up by IRQ, the "poll interval" shouldn't matter, as your "DONE" bits are set everywhere and the poll function succeeds on the first try.

Maybe I had an old version of the driver which wasn't using the IRQ for that specific part, or I missed something else :).

gsomlo commented 10 months ago

For anyone who happens to find this thread: removing this block: https://github.com/enjoy-digital/litex/blob/master/litex/soc/integration/soc.py#L1635-L1650 (and thereby using Wishbone instead of native AXI for the CPU-to-LiteDRAM width up-conversion) appears to work correctly, if somewhat "choppy" and suboptimal from a performance standpoint.