enjoy-digital / litex

Build your hardware, easily!
Other
2.89k stars 556 forks source link

large, non-4-byte multiple CSRs vs "array" accessors #546

Open gsomlo opened 4 years ago

gsomlo commented 4 years ago

Right now, large CSRs spread across multiple "simple CSRs" whose size is NOT a multiple of 4 bytes are distributed in a counter-intuitive way. For example, a 17 byte CSR whose hardware value is:

foo = 0x0102030405060708091011121314151617

will be represented starting with address addr, on designs with csr_data_width == 8 like this:

addr   addr+align   ...
 |     |
 v     v
 (01), (02), (03), ..., (17)

(so far, so good, nothing weird happening yet).

However, with csr_data_width == 32 we have:

addr   addr+align   ...
 |     |
 v     v
 (01), (02030405), (06070809), (10111213), (14151617)

NOTE: this is endianness-agnostic, as a number like (0a0b0c0d) represents the value stored at an appropriately aligned memory location in cpu-native endianness.

This means that the "array" accessors in litex/soc/software/include/hw/common.h are wrong unless the CSR size is a multiple of 4 bytes. They assume (currently wrongly so) that the CSR would be stored like this (in what some might assume to be a more "natural" or "intuitive" way):

addr         addr+align   ...
 |           |
 v           v
 (01020304), (05060708), (09101112), (13141516), (17)

The question is, of course, whether to fix the way CSRs are allocated for csr_data_width > 8 (basically, implement the second, "natural/intuitive" way of distribution), or rather fix the array accessors?

I'm pretty strongly biased in favor of fixing CSR allocation and making the array accessors (retroactively) correct. Major advantage to that would be that we could keep things like this:

https://github.com/enjoy-digital/litex/blob/master/litex/soc/software/liblitesdcard/sdcard.c#L231

from proliferating outside of litex/soc/software/include/hw/common.h, and prevent loss of endianness/alignment/csr_data_width agnostic access to hardware registers (currently sdcard only works on 32-bit CPUs, due to the above-referenced code fragment.

@enjoy-digital @mateusz-holenko @ozbenh -- any thoughts?

enjoy-digital commented 4 years ago

@gsomlo: thanks for the analysis, fixing the gateware seems indeed more logical/intuitive.

enjoy-digital commented 4 years ago

@gsomlo: can you explain your analysis a bit? On a large register, the LSB is the CSR simple with the higher index, so the gateware should be implementing this:

addr         addr+align   ...
 |           |
 v           v
 (xxxxxx01), (02030405), (06070809), (10111213), (14151617)

So i'm not sure there is an issue in the implementation, what you see could just be an artifact of the endianness: your addr is not really addr but addr+3.

gsomlo commented 4 years ago

So i'm not sure there is an issue in the implementation, what you see could just be an artifact of the endianness: your addr is not really addr but addr+3.

That's only true on 32-bit big-endian (e.g., mor1kx). Here's how a 17-byte scratch CSR defined as

self._scrsdtest= CSRStorage(136, reset=0x0102030405060708091011121314151617)

is laid out in the address space on (vexriscv, mor1kx, rocket) x (csr_data_with \in {8, 32}):

vexriscv (8-bit CSRs):
    - csr_register,ctrl_scrsdtest,0x82000014,17,rw
0x82000010              01 00 00 00 02 00 00 00 03 00 00 00      ............
0x82000020  04 00 00 00 05 00 00 00 06 00 00 00 07 00 00 00  ................
0x82000030  08 00 00 00 09 00 00 00 10 00 00 00 11 00 00 00  ................
0x82000040  12 00 00 00 13 00 00 00 14 00 00 00 15 00 00 00  ................
0x82000050  16 00 00 00 17 00 00 00                          ........

vexriscv (32-bit CSRs):
    - csr_register,ctrl_scrsdtest,0x82000008,5,rw
0x82000000                          01 00 00 00 05 04 03 02          ........
0x82000010  09 08 07 06 13 12 11 10 17 16 15 14              ............

mor1kx (8-bit CSRs):
    - csr_register,ctrl_scrsdtest,0x82000014,17,rw
0x82000010              00 00 00 01 00 00 00 02 00 00 00 03      ............
0x82000020  00 00 00 04 00 00 00 05 00 00 00 06 00 00 00 07  ................
0x82000030  00 00 00 08 00 00 00 09 00 00 00 10 00 00 00 11  ................
0x82000040  00 00 00 12 00 00 00 13 00 00 00 14 00 00 00 15  ................
0x82000050  00 00 00 16 00 00 00 17                          ........

mor1kx (32-bit CSRs):
    - csr_register,ctrl_scrsdtest,0x82000008,5,rw
0x82000000                          00 00 00 01 02 03 04 05          ........
0x82000010  06 07 08 09 10 11 12 13 14 15 16 17              ............

rocket (8-bit CSRs):
    - csr_register,ctrl_scrsdtest,0x12000028,17,rw
0x12000020                          01 00 00 00 00 00 00 00          ........
0x12000030  02 00 00 00 00 00 00 00 03 00 00 00 00 00 00 00  ................
0x12000040  04 00 00 00 00 00 00 00 05 00 00 00 00 00 00 00  ................
0x12000050  06 00 00 00 00 00 00 00 07 00 00 00 00 00 00 00  ................
0x12000060  08 00 00 00 00 00 00 00 09 00 00 00 00 00 00 00  ................
0x12000070  10 00 00 00 00 00 00 00 11 00 00 00 00 00 00 00  ................
0x12000080  12 00 00 00 00 00 00 00 13 00 00 00 00 00 00 00  ................
0x12000090  14 00 00 00 00 00 00 00 15 00 00 00 00 00 00 00  ................
0x120000a0  16 00 00 00 00 00 00 00 17 00 00 00 00 00 00 00  ................

rocket (32-bit CSRs):
    - csr_register,ctrl_scrsdtest,0x12000010,5,rw
0x12000010  01 00 00 00 00 00 00 00 05 04 03 02 00 00 00 00  ................
0x12000020  09 08 07 06 00 00 00 00 13 12 11 10 00 00 00 00  ................
0x12000030  17 16 15 14 00 00 00 00                          ........

Are you saying that the large-CSR fill policy is to:

  1. compute how many contiguous simple CSRs are needed
  2. allocate a contiguous set of simple CSRs
  3. start filling them beginning with the high address simple CSR, counting down
  4. potentially only partially fill the lowest address simple CSR
  5. set the "official" address for the large-CSR to be the low address in the sequence

If that's true, then that explains the current behavior when the size (in bytes) is not a multiple of csr_data_width/8

My (uninformed) intuition assumed we'd instead:

  1. compute how many contiguous simple CSRs we need
  2. allocate
  3. start filling from the low address simple CSR
  4. if not an exact multiple, partially fill the high address simple CSR

This matches the above when things are an exact multiple of csr_data_width, and we didn't have any non-multiples before https://github.com/enjoy-digital/litex/commit/0db3506997469a34ace453210bea640c6227cf05 [] so it never occurred to me there's a difference, so this assumption ended up being codified in the `csr_[rd|wr]buf()methods inlitex/soc/software/include/hw/common.h`. I'm happy to fix the methods if we decide to keep the current fill policy (but they'd likely end up looking even uglier than they do now ;)

[*] edited to point to the correct commit ID

mithro commented 4 years ago

As I mentioned at https://github.com/enjoy-digital/litex/issues/544 -- there are actually 3 data widths going on here; (a) CPU native data width (b) CPU bus width (c) CSR bus width

Generally (a) and (b) are matched but not always (64bit CPU with 32bit bus exists).

ozbenh commented 4 years ago

On Fri, 2020-05-29 at 11:27 -0700, gsomlo wrote:

Right now, large CSRs spread across multiple "simple CSRs" whose size is NOT a multiple of 4 bytes are distributed in a counter-intuitive way. For example, a 17 byte CSR whose hardware value is: foo = 0x0102030405060708091011121314151617 will be represented starting with address addr, on designs with csr_data_width == 8 like this: addr addr+align ... | | v v (01), (02), (03), ..., (17) (so far, so good, nothing weird happening yet).

Actually it is weird for a LE CPU, one would expect the LSB first.

However, with csr_data_width == 32 we have: addr addr+align ... | | v v (01), (02030405), (06070809), (10111213), (14151617)

NOTE: this is endianness-agnostic, as a number like (0a0b0c0d) represents the value stored at an appropriately aligned memory location in cpu-native endianness.

So you are basically right for the organization but I'm not sure I agree with you on endianness:

Any piece of HW should have a well defined HW regardless of the CPU endianness. Anything else leads to crazy. Linux fought that battle over and over again with ridiculous things like OHCI controllers having big endian registers on some freescale parts (imagine mixing an SoC OHCI and a PCI OHCI in the same system ?) forcing drivers to have conditionals on every accessor.

Just define CSRs to be one endian once and for all. SW swaps aren't that expensive and CSRs shouldn't be preformance critical.

However, if "foo" is a byte stream then this is a completely different kettle of fish. A byte stream accessed via a wider-than-byte "window" doesn't have an endianness per-se and bytes should be represented in ascending address order. Ignoring that stray 01 in your example and focusing on 02030405 just for the sake of this description, that means that 02 should be at the first address such as if the CPU, regardless of its endian, reads that word and writes it back to memory, 02 will apepar to have landed in the first address byte.

What that means is that such a representation typically requires non-swapping accesses regardless of the CSR vs CPU endianness. The CSRs should present the byte with the lower address onto the byte lane that correspond to the lower address on that bus.

This is why for example in Linux, "fifo" accessors (readsl/writesl) are non-swapping regardless of the endianness of the architecture.

This means that the "array" accessors in litex/soc/software/include/hw/common.h are wrong unless the CSR size is a multiple of 4 bytes. They assume (currently wrongly so) that the CSR would be stored like this (in what some might assume to be a more "natural" or "intuitive" way): addr addr+align ... | | v v (01020304), (05060708), (09101112), (13141516), (17)

Right, this is a more logical way of doing things, minus the endianness related comments above :)

The question is, of course, whether to fix the way CSRs are allocated for csr_data_width > 8 (basically, implement the second, "natural/intuitive" way of distribution), or rather fix the array accessors?

Fix the HW :)

I'm pretty strongly biased in favor of fixing CSR allocation and making the array accessors (retroactively) correct. Major advantage to that would be that we could keep things like this:

https://github.com/enjoy-digital/litex/blob/master/litex/soc/software/liblitesdcard/sdcard.c#L231

The above code is somewhat broken for endianness if you follow the "sane" rule of CSRs always being of one well defined endianness since csr_simple_read() could potentially perform a byte swap which for that type of usage is wrong as explained above.

Again provided the HW lays things out sanely :)

That said, you are right here:

from proliferating outside of litex/soc/software/include/hw/common.h, and prevent loss of endianness/alignment/csr_data_width agnostic access to hardware registers (currently sdcard only works on 32-bit CPUs, due to the above-referenced code fragment. @enjoy-digital @mateusz-holenko @ozbenh -- any thoughts?

Yes, there should be definitely such accessors to "memcpy_csr" regions that are more than one CSR wide.

I think however that it's a mistake to treat such as CSR as a "large 17 bytes quantity" but rather it should be treated as a "17 bytes memory" with funky offsets between the words as explained above.

Also note that the traditional way in HW of doing that sort of thing is via a FIFO rather than an array of registers but I digress :)

Cheers, Ben.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or unsubscribe.

ozbenh commented 4 years ago

On Fri, 2020-05-29 at 13:47 -0700, enjoy-digital wrote:

@gsomlo: can you explain your analysis a bit? On a large register, the LSB is the CSR simple with the higher index, so the gateware should be implementing this: addr addr+align ... | | v v (xxxxxx01), (02030405), (06070809), (10111213), (14151617) So i'm not sure there is an issue in the implementation, what you see could just be an artifact of the endianness: your addr is not really addr but addr+3.

Endianness should have nothing to do with it as I wrote earlier but sadly it does I think.

It's still wrong IMHO to have the first address contain a stray byte but it makes sense if you treat the CSR as a whole large 17-byte sized number in big endian format.

However it shouldn't be like that. You shouldn't have such 'large numbers' unless in exceptional cases.

This should be treated as a stream of bytes (which it really is for the sd-cards as far as I can tell) in which case it should be endian neutral and stores in ascending byte address order (and thus left packed in the above diagram).

Cheers, Ben.

mithro commented 4 years ago

@ozbenh / @ozbenh -- CSRs should be implemented in whatever is the most resource efficient method on the FPGA and then software provided to make accessing them easy.

mithro commented 4 years ago

BTW - A "17 byte long CSR" is probably not a great way to implement whatever was being implemented.

ozbenh commented 4 years ago

@mithro "whatever is the most resource efficient for an FPGA" will lead to a number of software inefficiencies and will simply not scale if you start targeting OSes like Linux. It's also completely besides the point(s) I'm trying to make here. None of what I suggest above leads to any significant resource usage difference, it's purely how the lanes are organised.

Do you have actual valuable input into this discussion otherwise ?

enjoy-digital commented 4 years ago

@gsomlo: the CSR fill policy is indeed the first one you are describing, the gateware side is implemented here:

@ozbenh: Large CSR like the 17-bytes one are generally avoided and i was planning to rework the interface since not optimal for this reason but also for resource usage & CDC, but i first wanted we have first version of the gateware/software working before changing things, so we still have to with it.

I'm not able to have closer look until next Tuesday but in a first time 1) we should probably fix the accessors to match the gateware implementation and in a second time 2) discuss what would be the best for software/OSes support from @ozbenh background/experience.

If 2) requires gateware changes, we'll still have to be able to generate gateware with a similar behaviour than the current implementation since some users are using their own software/CSR accessors and changing the gateware implementation would break it.