enjoy-digital / litex

Build your hardware, easily!
Other
2.87k stars 551 forks source link

CSR accessor performance / feature / build Regressions #365

Closed xobs closed 4 years ago

xobs commented 4 years ago

Due to a recent change, the CSR accessor methods have undergone a large performance regression. Additionally, the Etherbone library is now incompatible.

Performance Regression

Previously, CSRs were defined similarly to:

static inline void csr_writel(uint32_t value, uint32_t addr)
{
    *((volatile uint32_t *)addr) = value;
}

This allowed the compiler to inline writes very easily. For example, in Fomu we have the following function:

static void spi_single_tx(uint8_t out) {
    int bit;

    for (bit = 7; bit >= 0; bit--) {
        if (out & (1 << bit)) {
            lxspi_bitbang_write((0 << PIN_CLK) | (1 << PIN_MOSI));
            lxspi_bitbang_write((1 << PIN_CLK) | (1 << PIN_MOSI));
            lxspi_bitbang_write((0 << PIN_CLK) | (1 << PIN_MOSI));
        } else {
            lxspi_bitbang_write((0 << PIN_CLK) | (0 << PIN_MOSI));
            lxspi_bitbang_write((1 << PIN_CLK) | (0 << PIN_MOSI));
            lxspi_bitbang_write((0 << PIN_CLK) | (0 << PIN_MOSI));
        }
    }
}

This would get compiled down to the following function:

(gdb) disassemble spi_single_tx
Dump of assembler code for function spi_single_tx:
   0x0000014c <+0>:     li      a4,7
   0x00000150 <+4>:     lui     a5,0xe0008
   0x00000154 <+8>:     li      a6,2
   0x00000158 <+12>:    li      a2,1
   0x0000015c <+16>:    li      a7,3
   0x00000160 <+20>:    li      a1,-1
   0x00000164 <+24>:    sra     a3,a0,a4
   0x00000168 <+28>:    andi    a3,a3,1
   0x0000016c <+32>:    beqz    a3,0x188 <spi_single_tx+60>
   0x00000170 <+36>:    sw      a2,-2048(a5) # 0xe0007800
   0x00000174 <+40>:    sw      a7,-2048(a5)
   0x00000178 <+44>:    sw      a2,-2048(a5)
   0x0000017c <+48>:    addi    a4,a4,-1
   0x00000180 <+52>:    bne     a4,a1,0x164 <spi_single_tx+24>
   0x00000184 <+56>:    ret
   0x00000188 <+60>:    sw      zero,-2048(a5)
   0x0000018c <+64>:    sw      a6,-2048(a5)
   0x00000190 <+68>:    sw      zero,-2048(a5)
   0x00000194 <+72>:    j       0x17c <spi_single_tx+48>
End of assembler dump.
(gdb)

With the most recent changes to the CSR accessors, this is now expanded to:

(gdb) disassemble spi_single_tx
Dump of assembler code for function spi_single_tx:
   0x000001dc <+0>:     addi    sp,sp,-16
   0x000001e0 <+4>:     sw      s0,8(sp)
   0x000001e4 <+8>:     sw      s1,4(sp)
   0x000001e8 <+12>:    sw      s2,0(sp)
   0x000001ec <+16>:    sw      ra,12(sp)
   0x000001f0 <+20>:    mv      s2,a0
   0x000001f4 <+24>:    li      s0,7
   0x000001f8 <+28>:    li      s1,-1
   0x000001fc <+32>:    sra     a5,s2,s0
   0x00000200 <+36>:    andi    a5,a5,1
   0x00000204 <+40>:    beqz    a5,0x240 <spi_single_tx+100>
   0x00000208 <+44>:    li      a0,1
   0x0000020c <+48>:    jal     ra,0x1c8 <lxspi_bitbang_write>
   0x00000210 <+52>:    li      a0,3
   0x00000214 <+56>:    jal     ra,0x1c8 <lxspi_bitbang_write>
   0x00000218 <+60>:    li      a0,1
   0x0000021c <+64>:    addi    s0,s0,-1
   0x00000220 <+68>:    jal     ra,0x1c8 <lxspi_bitbang_write>
   0x00000224 <+72>:    bne     s0,s1,0x1fc <spi_single_tx+32>
   0x00000228 <+76>:    lw      ra,12(sp)
   0x0000022c <+80>:    lw      s0,8(sp)
   0x00000230 <+84>:    lw      s1,4(sp)
   0x00000234 <+88>:    lw      s2,0(sp)
   0x00000238 <+92>:    addi    sp,sp,16
   0x0000023c <+96>:    ret
   0x00000240 <+100>:   li      a0,0
   0x00000244 <+104>:   jal     ra,0x1c8 <lxspi_bitbang_write>
   0x00000248 <+108>:   li      a0,2
   0x0000024c <+112>:   jal     ra,0x1c8 <lxspi_bitbang_write>
   0x00000250 <+116>:   li      a0,0
   0x00000254 <+120>:   j       0x21c <spi_single_tx+64>
End of assembler dump.
(gdb) disassemble lxspi_bitbang_write
Dump of assembler code for function lxspi_bitbang_write:
   0x000001c8 <+0>:     mv      a1,a0
   0x000001cc <+4>:     lui     a0,0xe0008
   0x000001d0 <+8>:     li      a2,0
   0x000001d4 <+12>:    addi    a0,a0,-2048 # 0xe0007800
   0x000001d8 <+16>:    j       0x20 <_csr_wr>
End of assembler dump.
(gdb) disassemble _csr_wr
Dump of assembler code for function _csr_wr:
   0x00000020 <+0>:     sw      a1,0(a0)
   0x00000024 <+4>:     ret
End of assembler dump.
(gdb)

Each single-instruction store is turned into two function calls and a load, which increases the instruction count 9x (ignoring penalties for jumps and returns).

Feature regression

It can be extremely useful to use libeb-c with the wishbone bridge in order to prototype and debug code on a local machine before porting it to an embedded softcore. Previously, the functions csr_(read|write)[bwl]() were conditionally defined if CSR_ACCESSORS_DEFINED was not defined. This allowed for C code to be compiled both for the device itself, as well as on a host system connected to a wishbone bridge simply by defining the macro CSR_ACCESSORS_DEFINED and providing replacement functions.

With the most recent patch, this can no longer be done. Instead it inserts an unconditional call to _csr_(rd|wr) which does direct array indexing.

Build regression

The new function appears to require a new symbol when compiling with 8-bit CSRs:

d:/software/fomu-toolchain-windows-v1.5.3/bin/../lib/gcc/riscv64-unknown-elf/8.3.0/../../../../riscv64-unknown-elf/bin/ld.exe: C:\Users\smcro\AppData\Local\Temp\ccK4Ba47.ltrans0.ltrans.o: in function `.L0 ':
D:\Code\Fomu\foboot\hw\build\software\bios/../../../../sw/include/hw/common.h:78: undefined reference to `__lshrdi3'
collect2.exe: error: ld returned 1 exit status

This appears to be due to the use of 64-bit parameters on a 32-bit system. Changing the argument to a 32-bit parameter (i.e. static inline void _csr_wr(unsigned long *a, uint32_t v, int csr_bytes)) fixes this build issue, but it's not clear why this needs to be a 64-bit value.

gsomlo commented 4 years ago

As I mentioned in the other thread, your example;

static inline void csr_writel(uint32_t value, uint32_t addr)
{
    *((volatile uint32_t *)addr) = value;
}

was broken before PR #339 -- on any design with csr_data_width < 32 this function would throw away any most-significant bits beyond csr_data_width, independently of how nicely the compiler would be able to inline it during code generation...

The new accessors were meant to work on any CPU word width (32 vs. 64), and on any csr_data_width: (8, 16, 32), as well as serve as self-documentation of how CSRs of various total widths are scattered/gathered across subregisters.

I guess the slightly larger footprint on some combinations of (cpu-word-width, csr_data_width) is due to the methods being designed to work on all such combinations, trading optimal generated-code footprint for flexibility.

Happy to see further improvement, though with a strong preference toward anything that can handle both (Rocket_64_bit + 32bit csr_datawidth) and (vexriscv_32_bit + 8bit csr_datawidth) :)

xobs commented 4 years ago

That's how it's supposed to work.

For example, the CTRL block has a register called SCRATCH. This is a 32-bit register. Litex generates a function called ctrl_scratch_read() that does four 32-bit reads and combines them together:

static inline unsigned int ctrl_scratch_read(void) {
    unsigned int r = csr_readl(0xe0000004L);
    r <<= 8;
    r |= csr_readl(0xe0000008L);
    r <<= 8;
    r |= csr_readl(0xe000000cL);
    r <<= 8;
    r |= csr_readl(0xe0000010L);
    return r;
}

In lxsocdoc, we expand this one logical register into four 8-bit registers when documenting these registers, to reflect the fact that there are four 32-bit CSRs that each have 8-bits of data, which maps to one logical 32-bit CSR: https://rm.fomu.im/ctrl.html#ctrl-scratch3

xobs commented 4 years ago

Similarly, the write function looks like:

static inline void ctrl_scratch_write(unsigned int value) {
    csr_writel(value >> 24, 0xe0000004L);
    csr_writel(value >> 16, 0xe0000008L);
    csr_writel(value >> 8, 0xe000000cL);
    csr_writel(value, 0xe0000010L);
}

It's using csr_writel() because the CPU is 32-bits, but it could just as easily use csr_writeb() for this particular instance, since the csr_data_width is 8.

enjoy-digital commented 4 years ago

@xobs: thanks for the investigation and sorry for the regressions. @gsomlo has been doing great work at making CSR access functions generic enough to be used in all cases (CPU endianness, CPU's data_width, CSR's data_width), it already solved use-cases that were not possible before (On Rocket, but also on others cases: it was not possible to do the DDR3 calibration on a system with csr_data_width=32 before and it was for example preventing LiteDRAM + LitePCIe to be used together on the NeTV2 which is now possible and working: https://github.com/enjoy-digital/netv2/blob/master/netv2.py)

Now that things are generic enough, we need to make sure the use-cases you were using before are still possible and that specific configurations that are are widely used (32-bit CPU, 8-bit CSR data width) are optimized. Sorry for the inconvenience, i'll work on this.

mithro commented 4 years ago

https://godbolt.org/ might be helpful here.

gsomlo commented 4 years ago

@xobs, it feels to me like the primary concern is the "suboptimal" dedicated accessors in "include/generated/csr.h". If so, we could fix that in "soc/integration/export.py" by generating something like (e.g., for the 32-bit scratch register):

static inline void ctrl_scratch_write(uint32_t v) {
    *((uint8_t *)0xe0000004L) = v >> 24;
    *((uint8_t *)0xe0000008L) = v >> 16;
    ....
}

if csr_data_width == 8, OR this:

static inline void ctrl_scratch_write(uint32_t v) {
    *((uint16_t *)0xe0000004L) = v >> 16;
    *((uint16_t *)0xe000000cL) = v ;
}

if csr_data_width == 16, OR simply

static inline void ctrl_scratch_write(uint32_t v) {
    *((uint32_t *)0xe0000004L) = v;
}

if csr_data_width == 32. Since "soc/integration/export.py" knows the value of csr_data_width, it can generate the "correct" dedicated accessor, and since "generated/csr.h" is, well, generated, it shouldn't really matter that we're not "encapsulating" those casts.

If that sounds acceptable, I can cook up something over the next day or two.

IMHO, it feels more "honest" this way, because, to me, having something like:

static inline void csr_writel(uint32_t value, uint32_t addr)
{
    *((volatile uint32_t *)addr) = value;
}

is in itself breaking the "least-WTF" design principle, where there exists implicit unstated knowledge that'd tell one whether none, half, or three quarters of the bytes in play woud be silently discarded...

I would, therefore, prefer not to revert the changes in "soc/software/include/hw/common.h", if that's an acceptable compromise. Please advise.

enjoy-digital commented 4 years ago

@gsomlo: that seems a good compromise but @xobs still needs to be able to override the CSR sub-register accessors, so instead of doing directly the memory access in each xxyy_write/read function, maybe add a macro for the CSR sub-register accessor and use it in the xxyy_write/read functions.

xobs commented 4 years ago

It looks like what's going on is we're moving register accessors from a compile-time decision to a runtime decision. Everything is now gated by _csr_wr() and _csr_rd(). With the change, any csr.h can be used with a platform of any endianness and bit size, simply by swapping out hw/common.h. I'm not sure if this is a good thing or not.

But it does mean that we can simply replace _csr_wr() and _csr_rd() with mocked values at runtime, similarly to how we do now with csr_writel() and `csr_readl().

One nice property of the previous csr_readl()/csr_writel() approach is that it didn't assume CSR widths were 8/16/32 bits. Litex could conceivably use 9-bit CSRs and it would still work. The upper bits would be "don't care" just like they are currently.

Also, your example has a tiny error. For the 16-bit accessor, the address would be 0xe0000008L and not 0xe000000cL:

static inline void ctrl_scratch_write(uint32_t v) {
    *((uint16_t *)0xe0000004L) = v >> 16;
    *((uint16_t *)0xe0000008L) = v ;
}

I suppose I'm not understanding the problem that the patch was trying to solve. csr_writel() and csr_readl() ignored the upper 24 bits because the registers they were writing to were only 8-bits.

gsomlo commented 4 years ago

I suppose I'm not understanding the problem that the patch was trying to solve. csr_writel() and csr_readl() ignored the upper 24 bits because the registers they were writing to were only 8-bits.

Hmm. The existence of csr_[read|write][bw]() suggested they were all methods advertising the ability to read/write 8, 16 (or 32, in the case of readl/writel) bits to/from a CSR register, and thus needed to be fixed, since they were clearly not doing that.

I think your use of csr_writel() and csr_readl() is more in line with the MMPTR macro in "common.h", which used to be

#define MMPTR(a) (*((volatile unsigned int *)(a)))

and now has been changed to unsigned long (to work independently of the cpu word width).

I wonder if we generate "csr.h" accessors based on MMPTR (or some dedicated inline function pair that would replace it, so you could override that the way you used to override the old accessors) instead of csr_[read|write]l() and left the rest of common.h as-is (for the purpose of offering user-friendly whole-CSR accessors that work across the whole (cpu-width x csr_data_width) matrix, we could all have what we need. What do you think?

(Also, somewhat semi-related to the topic at hand, any thoughts on what Linux kernel support for CSRs would/should look like? If, e.g., csr_data_width were a hard-coded define (via Kconfig), one could implement accessors as a (cpu-width x csr_data_width) matrix of hardcoded macros or inline functions. However, it might be more useful to have csr_data_width as a run-time variable grabbed from e.g. DTB, in which case we'd need something more along the lines of the new code in common.h.)

xobs commented 4 years ago

Right, MMPTR(a) suffered from two problems:

1) It conflated reading and writing, making it impossible to overload, and 2) It was always inlined, making it difficult to replace with an external function

The patch to replace it with csr_readl() and csr_writel() used those names because, like the macro, it was accessing 32-bit values. You're right in that those names are not appropriate for 64-bit machines, so I agree the name should be changed at least. Maybe just csr_read() and csr_write()?

I'm also unsure of the target demographic. Do you think users will call _csr_rd((unsigned long *)CSR_CTRL_SCRATCH_ADDR, 4); to read the CTRL_SCRATCH register rather than calling the generated ctrl_scratch_read() accessor? Or did ctrl_scratch_read() not work on RV64 when you tried it?

For Linux, you won't be able to use the same accessor without some device tree munging. For example, with 8-bit CSRs, CSR_CTRL_BUS_ERRORS_ADDR is 0xe0000014L, but if you're using 32-bit CSRs then it is at 0xe0000008L. You would need three pieces of information: csr_data_width, csr_addr, and csr_size. You could attach these three pieces of information to the device tree and come up with a csr_readl() and csr_writel() in the kernel that does what you originally thought the litex one did, and stripe accesses across multiple registers.

mithro commented 4 years ago

My understanding was that; (1) If the CSR bus width was known at compile time, then the compiler should have optimized stuff down to direct register reads/writes. (2) If the CSR bus width was not known at compile time (because you wanted to read it from the DTB or something) then it would fall back to more complicated logic.

I'm pretty sure this is possible to make happen.

If you wanted to be super advanced, you could actually read the CSR bus width once and do a "fixup" to the correct function calls, dramatically reducing the cost.

gsomlo commented 4 years ago

@mithro: I was kinda hoping that the new, "flexible" accessors could be optimized automatically like that. However, I'm not a professional compiler-optimizer jockey... :)

Then there's also the consideration that in a place like the LiteX bios we shouldn't focus too much on speed (because it's code that runs only once before we presumably load something like a Linux kernel), although size might matter in tight spaces.

That versus the actual Linux kernel, where CPU word-width is a global constant, csr-data-width is a SoC property (unless we decide to make it a global constant as well), and individual CSRs' base address and total-width are properties of their respective device nodes. I guess I'd put flexibility and speed as my equally-most-important criteria, with code size as something I'm willing to compromise on.

I also figured that LiteX itself should have an authoritative piece of reference code describing how the whole CSR scheme works overall, which was the other half of my motivation for PR #339 (besides the csr_[read|write]bw() breakage).

mithro commented 4 years ago

@gsomlo - BTW Have you seen lxsocdoc stuff that @xobs did? He also cares about having the CSR access methods documented correctly! He is also generating SVD for rust bindings.

mithro commented 4 years ago

@gsomlo - It's probably the case we just need to tweak some code to get the compiler to do the right thing.

gsomlo commented 4 years ago

BTW Have you seen lxsocdoc stuff that @xobs did? He also cares about having the CSR access methods documented correctly! He is also generating SVD for rust bindings.

Yeah, I've seen it and I think it's great. It's mostly complementary with what I'm worried about, though (documenting what each CSR is for on one hand, vs. documenting how CSRs work under the hood in general, in my case)

enjoy-digital commented 4 years ago

With https://github.com/enjoy-digital/litex/pull/366 merged, @xobs do you still see regressions for your use case? Even if so, we could keep this issue open (eventually change the title) to continue the discussion on CSR accessors in the different cases (with BIOS, OSes, ...)

xobs commented 4 years ago

With #366 merged, it doesn't build at all:

In file included from ../../../../sw/src/rgb.c:2:
../include/generated/csr.h: In function 'ctrl_reset_read':
../include/generated/csr.h:20:9: warning: implicit declaration of function 'csr_read_simple'; did you mean 'csr_rd_uint64'? [-Wimplicit-function-declaration]   return csr_read_simple(0xe0000000L);
         ^~~~~~~~~~~~~~~
         csr_rd_uint64
../include/generated/csr.h: In function 'ctrl_reset_write':
../include/generated/csr.h:23:2: warning: implicit declaration of function 'csr_write_simple'; did you mean 'csr_wr_uint64'? [-Wimplicit-function-declaration]  csr_write_simple(v, 0xe0000000L);
  ^~~~~~~~~~~~~~~~
  csr_wr_uint64
"  CC       ../../../../sw/third_party/libbase/uart.c   uart.o"
In file included from ../../../../sw/include/system.h:59,
                 from ../../../../sw/include/irq.h:8,
                 from ../../../../sw/third_party/libbase/uart.c:2:
../include/generated/csr.h: In function 'ctrl_reset_read':
../include/generated/csr.h:20:9: warning: implicit declaration of function 'csr_read_simple'; did you mean 'csr_rd_uint64'? [-Wimplicit-function-declaration]   return csr_read_simple(0xe0000000L);
         ^~~~~~~~~~~~~~~
         csr_rd_uint64
../include/generated/csr.h: In function 'ctrl_reset_write':
../include/generated/csr.h:23:2: warning: implicit declaration of function 'csr_write_simple'; did you mean 'csr_wr_uint64'? [-Wimplicit-function-declaration]  csr_write_simple(v, 0xe0000000L);
  ^~~~~~~~~~~~~~~~
  csr_wr_uint64

Let me see why...

xobs commented 4 years ago

My fault. Forgot to update common.h...

xobs commented 4 years ago

The software definitely is looking much better, with the function back down to its original size (albeit offset by 4 bytes for some reason. Hmm...):

(gdb) disassemble spi_single_tx
Dump of assembler code for function spi_single_tx:
   0x00000150 <+0>:     li      a4,7
   0x00000154 <+4>:     lui     a5,0xe0008
   0x00000158 <+8>:     li      a6,2
   0x0000015c <+12>:    li      a2,1
   0x00000160 <+16>:    li      a7,3
   0x00000164 <+20>:    li      a1,-1
   0x00000168 <+24>:    sra     a3,a0,a4
   0x0000016c <+28>:    andi    a3,a3,1
   0x00000170 <+32>:    beqz    a3,0x18c <spi_single_tx+60>
   0x00000174 <+36>:    sw      a2,-2048(a5) # 0xe0007800
   0x00000178 <+40>:    sw      a7,-2048(a5)
   0x0000017c <+44>:    sw      a2,-2048(a5)
   0x00000180 <+48>:    addi    a4,a4,-1
   0x00000184 <+52>:    bne     a4,a1,0x168 <spi_single_tx+24>
   0x00000188 <+56>:    ret
   0x0000018c <+60>:    sw      zero,-2048(a5)
   0x00000190 <+64>:    sw      a6,-2048(a5)
   0x00000194 <+68>:    sw      zero,-2048(a5)
   0x00000198 <+72>:    j       0x180 <spi_single_tx+48>
End of assembler dump.
(gdb)
gsomlo commented 4 years ago

OK, so I kept quietly obsessing over this while stuck in a bunch of meetings over the last week or so :)

So, I finally decided to try a side-by-side comparison of ctrl_scratch_[read|write]() (based on csr_[read|write]_simple() and shifts generated in hardcoded form by soc/integration/export.py, i.e., the thing everyone appears to be OK with) vs. the new csr_[rd|wr]_uint32() based on the new "universal" _csr_[rd|wr]() and num_subregs() in soc/software/include/hw/common.h, i.e., the thing that reportedly caused everyone's regressions).

I added the following patch to soc/software/bios/main.c to force observable "nop" instructions to be inserted into main.o and allow me to pick apart the code being generated for each of the accessors being used:

diff --git a/litex/soc/software/bios/main.c b/litex/soc/software/bios/main.c
index e4b92216..d2d4fe92 100644
--- a/litex/soc/software/bios/main.c
+++ b/litex/soc/software/bios/main.c
@@ -635,6 +635,22 @@ int main(int i, char **c)
                printf("Memory initialization failed\n");
        printf("\n");
 #endif
+       int foo;
+       __asm__ volatile("addi x0, x0, 10");
+       foo = ctrl_scratch_read();
+       __asm__ volatile("addi x0, x0, 11");
+       printf("ctrl_scratch_read: 0x%08x\n", foo);
+       foo += 1;
+       __asm__ volatile("addi x0, x0, 12");
+       ctrl_scratch_write(foo);
+       __asm__ volatile("addi x0, x0, 13");
+       foo = csr_rd_uint32(0x82000004L);
+       __asm__ volatile("addi x0, x0, 14");
+       printf("csr_rd_uint32: 0x%08x\n", foo);
+       foo += 1;
+       __asm__ volatile("addi x0, x0, 15");
+       csr_wr_uint32(foo, 0x82000004L);
+       __asm__ volatile("addi x0, x0, 16");

        if(sdr_ok) {
                printf("--============== \e[1mBoot\e[0m ==================--\n");

I then built a vexriscv-based SoC:

litex/litex/tools/litex_sim.py --csr-data-width [8|32] --cpu-type vexriscv

and disassembled main.o with riscv64-unknown-linux-gnu-objdump.

The results I got were interesting, as the accessors were indistinguishable from each other:

# this is shared across all examples:
lui     s1,0x82000

# csr_data_width == 8bit:

#ctrl_scratch_read():
lw      a1,4(s1) 
lw      s0,8(s1)
slli    a1,a1,0x8               
or      a1,a1,s0                
lw      s0,12(s1)               
slli    a1,a1,0x8       
or      a1,a1,s0
lw      s0,16(s1)
slli    a1,a1,0x8
or      s0,a1,s0                

#ctrl_scratch_write(a1):
srli    a5,a1,0x18
sw      a5,4(s1)                
srli    a5,a1,0x10
sw      a5,8(s1)
srli    a5,a1,0x8               
sw      a5,12(s1)               
sw      a1,16(s1)               

#csr_rd_uint32():
lw      a1,4(s1)
lw      s0,8(s1)                
slli    a1,a1,0x8
or      s0,s0,a1       
slli    a1,s0,0x8
lw      s0,12(s1)               
or      s0,s0,a1                
lw      a1,16(s1)               
slli    s0,s0,0x8               
or      s0,a1,s0                

#csr_wr_uint32(a1):
srli    a5,a1,0x18
sw      a5,4(s1)                
srli    a5,a1,0x10              
sw      a5,8(s1)                
srli    a5,a1,0x8               
sw      a5,12(s1)               
sw      a1,16(s1)               

# csr_data_width == 32bit:

#ctrl_scratch_read():
lw      s2,4(s1)

#ctrl_scratch_write(s2):
sw      s2,4(s1)

#csr_rd_uint32():
lw      s2,4(s1)

#csr_wr_uint32(s2):
sw      s2,4(s1)

I didn't even tweak any optimization flags in the bios makefile, so this is straight out-of-the-box. So now I'm wondering what "external" force may have caused people's observed regressions, and/or what else I might be missing :)

I could see myself maybe augmenting the "new" accessors by sprinkling in some additional "const" and "volatile" keywords, e.g.:

diff --git a/litex/soc/software/include/hw/common.h b/litex/soc/software/include/hw/common.h
index 708f3075..b0a8552a 100644
--- a/litex/soc/software/include/hw/common.h
+++ b/litex/soc/software/include/hw/common.h
@@ -64,13 +64,13 @@ static inline unsigned long csr_read_simple(unsigned long a)
  *  |  4  | 1 1 1 1 2 2 2 2 |
  *  |  8  | 1 1 1 1 1 1 1 1 |
  *  +-----+-----------------+ */
-static inline int num_subregs(int csr_bytes)
+static inline int num_subregs(const int csr_bytes)
 {
        return (csr_bytes - 1) / CSR_DW_BYTES + 1;
 }

 /* Read a CSR of size 'csr_bytes' located at address 'a'. */
-static inline uint64_t _csr_rd(unsigned long *a, int csr_bytes)
+static inline uint64_t _csr_rd(volatile unsigned long *a, const int csr_bytes)
 {
        uint64_t r = a[0];
        for (int i = 1; i < num_subregs(csr_bytes); i++) {
@@ -81,7 +81,7 @@ static inline uint64_t _csr_rd(unsigned long *a, int csr_bytes)
 }

 /* Write value 'v' to a CSR of size 'csr_bytes' located at address 'a'. */
-static inline void _csr_wr(unsigned long *a, uint64_t v, int csr_bytes)
+static inline void _csr_wr(volatile unsigned long *a, uint64_t v, const int csr_bytes)
 {
        int ns = num_subregs(csr_bytes);
        for (int i = 0; i < ns; i++)
@@ -92,42 +92,42 @@ static inline void _csr_wr(unsigned long *a, uint64_t v, int csr_bytes)

 static inline uint8_t csr_rd_uint8(unsigned long a)
 {
-       return _csr_rd((unsigned long *)a, sizeof(uint8_t));
+       return _csr_rd((volatile unsigned long *)a, sizeof(uint8_t));
 }

 static inline void csr_wr_uint8(uint8_t v, unsigned long a)
 {
-       _csr_wr((unsigned long *)a, v, sizeof(uint8_t));
+       _csr_wr((volatile unsigned long *)a, v, sizeof(uint8_t));
 }
...

to prevent some crazy compiler flags from optimizing anything away, but for now I'm confused...

@xobs, @bunnie, @mithro, @enjoy-digital, what do you all think?

xobs commented 4 years ago

Here's a minimal testcase, taken from real-world code and trimmed down:

#define CONFIG_CSR_ALIGNMENT 32

#include <hw/common.h>

static inline void lxspi_bitbang_write(uint32_t v) {
#ifdef OLD_STYLE_ACCESSORS
#warning "Using old-style accessor"
        csr_write_simple(v, 0xe0007800L);
#else
#warning "Using new-style accessor"
        csr_wr_uint32(v, 0xe0078000L);
#endif
}

int main(int argc, char **argv) {
        int bit;
        unsigned char out = argc;

        for (bit = 7; bit >= 0; bit--) {
                if (out & (1 << bit)) {
                        lxspi_bitbang_write(3);
                        lxspi_bitbang_write(1);
                } else {
                        lxspi_bitbang_write(2);
                        lxspi_bitbang_write(0);
                }
        }
}

I compiled this with -Os, and set CONFIG_CSR_DATA_WIDTH=8 (although in practice, it doesn't matter since it's only a single uint8_t wide).

With the csr_rd_uint8 accessor, the code size is 112 bytes, and includes calls to lxspi_bitbang_write:

xobs@Cuboid:~/Code/FPGA/csr-test$ riscv64-unknown-elf-gcc -DCONFIG_CSR_DATA_WIDTH=8 -march=rv32i -mabi=ilp32 test.c -o test -I. -ggdb3 -Os && riscv64-unknown-elf-gdb test -ex 'disassemble main' -ex 'quit'
test.c: In function 'lxspi_bitbang_write':
test.c:10:2: warning: #warning "Using new-style accessor" [-Wcpp]
 #warning "Using new-style accessor"
  ^~~~~~~
GNU gdb (GDB) 8.2.90.20190228-git
Copyright (C) 2019 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "--host=x86_64-linux-gnu --target=riscv64-unknown-elf".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from test...
Dump of assembler code for function main:
   0x00010074 <+0>:     addi    sp,sp,-16
   0x00010078 <+4>:     sw      s0,8(sp)
   0x0001007c <+8>:     sw      s1,4(sp)
   0x00010080 <+12>:    sw      s2,0(sp)
   0x00010084 <+16>:    sw      ra,12(sp)
   0x00010088 <+20>:    li      s0,7
   0x0001008c <+24>:    andi    s2,a0,255
   0x00010090 <+28>:    li      s1,-1
   0x00010094 <+32>:    sra     a5,s2,s0
   0x00010098 <+36>:    andi    a5,a5,1
   0x0001009c <+40>:    beqz    a5,0x100d4 <main+96>
   0x000100a0 <+44>:    li      a0,3
   0x000100a4 <+48>:    jal     ra,0x101f8 <lxspi_bitbang_write>
   0x000100a8 <+52>:    li      a0,1
   0x000100ac <+56>:    addi    s0,s0,-1
   0x000100b0 <+60>:    jal     ra,0x101f8 <lxspi_bitbang_write>
   0x000100b4 <+64>:    bne     s0,s1,0x10094 <main+32>
   0x000100b8 <+68>:    lw      ra,12(sp)
   0x000100bc <+72>:    lw      s0,8(sp)
   0x000100c0 <+76>:    lw      s1,4(sp)
   0x000100c4 <+80>:    lw      s2,0(sp)
   0x000100c8 <+84>:    li      a0,0
   0x000100cc <+88>:    addi    sp,sp,16
   0x000100d0 <+92>:    ret
   0x000100d4 <+96>:    li      a0,2
   0x000100d8 <+100>:   jal     ra,0x101f8 <lxspi_bitbang_write>
   0x000100dc <+104>:   li      a0,0
   0x000100e0 <+108>:   j       0x100ac <main+56>
End of assembler dump.
xobs@Cuboid:~/Code/FPGA/csr-test$ 

If we defined OLD_STYLE_ACCESSORS to get the previous behavior, we see the code size is 33% smaller at 76 bytes:

xobs@Cuboid:~/Code/FPGA/csr-test$ riscv64-unknown-elf-gcc -DOLD_STYLE_ACCESSORS -DCONFIG_CSR_DATA_WIDTH=8 -march=rv32i -mabi=ilp32 test.c -o test -I. -ggdb3 -Os && riscv64-unknown-elf-gdb test -ex 'disassemble main' -ex 'quit'
test.c: In function 'lxspi_bitbang_write':
test.c:7:2: warning: #warning "Using old-style accessor" [-Wcpp]
 #warning "Using old-style accessor"
  ^~~~~~~
GNU gdb (GDB) 8.2.90.20190228-git
Copyright (C) 2019 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "--host=x86_64-linux-gnu --target=riscv64-unknown-elf".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from test...
Dump of assembler code for function main:
   0x00010074 <+0>:     li      a5,7
   0x00010078 <+4>:     andi    a0,a0,255
   0x0001007c <+8>:     lui     a4,0xe0008
   0x00010080 <+12>:    li      a1,2
   0x00010084 <+16>:    li      a6,3
   0x00010088 <+20>:    li      a7,1
   0x0001008c <+24>:    li      a2,-1
   0x00010090 <+28>:    sra     a3,a0,a5
   0x00010094 <+32>:    andi    a3,a3,1
   0x00010098 <+36>:    beqz    a3,0x100b4 <main+64>
   0x0001009c <+40>:    sw      a6,-2048(a4) # 0xe0007800
   0x000100a0 <+44>:    sw      a7,-2048(a4)
   0x000100a4 <+48>:    addi    a5,a5,-1
   0x000100a8 <+52>:    bne     a5,a2,0x10090 <main+28>
   0x000100ac <+56>:    li      a0,0
   0x000100b0 <+60>:    ret
   0x000100b4 <+64>:    sw      a1,-2048(a4)
   0x000100b8 <+68>:    sw      zero,-2048(a4)
   0x000100bc <+72>:    j       0x100a4 <main+48>
End of assembler dump.
xobs@Cuboid:~/Code/FPGA/csr-test$

Even enabling lto doesn't improve things, as it still has calls to lxspi_bitbang_write():

xobs@Cuboid:~/Code/FPGA/csr-test$ riscv64-unknown-elf-gcc -flto -DCONFIG_CSR_DATA_WIDTH=8 -march=rv32imac -mabi=ilp32 test.c -o test -I. -ggdb3 -Os && riscv64-unknown-elf-gdb test -ex 'disassemble main' -ex 'quit'
test.c: In function 'lxspi_bitbang_write':
test.c:10:2: warning: #warning "Using new-style accessor" [-Wcpp]
 #warning "Using new-style accessor"
  ^~~~~~~
GNU gdb (GDB) 8.2.90.20190228-git
Copyright (C) 2019 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "--host=x86_64-linux-gnu --target=riscv64-unknown-elf".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from test...
Dump of assembler code for function main:
   0x00010074 <+0>:     addi    sp,sp,-16
   0x00010076 <+2>:     sw      s0,8(sp)
   0x00010078 <+4>:     sw      s1,4(sp)
   0x0001007a <+6>:     sw      s2,0(sp)
   0x0001007c <+8>:     sw      ra,12(sp)
   0x0001007e <+10>:    li      s0,7
   0x00010080 <+12>:    andi    s2,a0,255
   0x00010084 <+16>:    li      s1,-1
   0x00010086 <+18>:    sra     a5,s2,s0
   0x0001008a <+22>:    andi    a5,a5,1
   0x0001008c <+24>:    beqz    a5,0x100aa <main+54>
   0x0001008e <+26>:    li      a0,3
   0x00010090 <+28>:    jal     0x1017c <lxspi_bitbang_write>
   0x00010092 <+30>:    li      a0,1
   0x00010094 <+32>:    addi    s0,s0,-1
   0x00010096 <+34>:    jal     0x1017c <lxspi_bitbang_write>
   0x00010098 <+36>:    bne     s0,s1,0x10086 <main+18>
   0x0001009c <+40>:    lw      ra,12(sp)
   0x0001009e <+42>:    lw      s0,8(sp)
   0x000100a0 <+44>:    lw      s1,4(sp)
   0x000100a2 <+46>:    lw      s2,0(sp)
   0x000100a4 <+48>:    li      a0,0
   0x000100a6 <+50>:    addi    sp,sp,16
   0x000100a8 <+52>:    ret
   0x000100aa <+54>:    li      a0,2
   0x000100ac <+56>:    jal     0x1017c <lxspi_bitbang_write>
   0x000100ae <+58>:    li      a0,0
   0x000100b0 <+60>:    j       0x10094 <main+32>
End of assembler dump.
xobs@Cuboid:~/Code/FPGA/csr-test$
mithro commented 4 years ago

You might want to include toolchain versions and architectures (plus compile flags), as that could make a huge difference.

gsomlo commented 4 years ago

After creating an 8-bit scratch CSR and looking for the disassembled opcodes of either ctrl_scratch8_[read|write]() or, respectively, csr_[rd|wr]_uint8(), I get the following:

#scratch8 is at 0x82000014

#common across all accesses:
lui     s2,0x82000

# ctrl_scratch8_read():
lw      s1,20(s2)
andi    s1,s1,255

# ctrl_scratch8_write():
andi    s1,s1,255
sw      s1,20(s2)

# csr_rd_uint8():
lw      s1,20(s2)

# csr_wr_uint8():
andi    s1,s1,255
sw      s1,20(s2)

This with the standard flags in the LiteX BIOS makefile (soc/software/common.mak and soc/software/bios/Makefile):

riscv64-unknown-elf-gcc -std=gnu99 -c -MD -MP -Os -march=rv32im -mabi=ilp32 \
    -D__vexriscv__ -g3 -fomit-frame-pointer -fno-builtin -nostdinc  -fexceptions \
    -Wall -Wstrict-prototypes -Wold-style-definition -Wmissing-prototypes \
    ...

Similar code is generated for both data width 8 and 32, with some small variation in when/how andi opcodes are inserted. This is with 9.2.0 from https://github.com/riscv/riscv-gnu-toolchain at commit 2c037e6.

I think (aside from compiler optimizations) the relevant difference is that csr_[read|write]_simple() is for accessing one subregister (and will therefore be represented by one MMIO access), whereas csr_[rd|wr]_uint[X]() are for accessing complete CSRs of width X. So @xobs in your case if it's an 8-bit register you're going for, the appropriate "cooked" accessor would have been csr_[rd|wr]_uint8(). Using *_uint32() will (correctly IMHO) generate four accesses (on 8-bit csr_data_width, that is) to get all four bytes of the uint32_t we're telling it to transfer...

xobs commented 4 years ago

So @xobs in your case if it's an 8-bit register you're going for, the appropriate "cooked" accessor would have been csr_[rd|wr]_uint8()

The reads are 8-bits wide, so I'm just calling the litex-generated accessor function. In the past, that would indeed have been csr_rd_uint8(), or more generally csr_readl().

I'm using https://static.dev.sifive.com/dev-tools/riscv64-unknown-elf-gcc-8.3.0-2019.08.0-x86_64-linux-centos6.tar.gz so perhaps the newer toolchain does a better job of inlining functions such as this than the older one.

bunnie commented 4 years ago

Just to put 2c in here -- these CSR changes have also broken all my simulation frameworks. They were written in C and depend heavily upon the CSR API, and now that I've upgraded my Litex dependency all of my unit test frameworks are shattered. I didn't really notice until now because most of my work is in Rust and we generate our access routines from scratch for Rust, but I've had to do some test regressions and I'm finding multitudes of subtle bugs.

Background: in order to test new hardware modules in-situ, I create a minimal SoC with a vex core plus the peripheral, and I use a very stripped-down variant of the litex BIOS that dumps me to main as fast as possible so I'm not wasting precious simulator cycles on stuff not directly related to the hardware. I rely on direct CSR API calls and various "cheats" to do this in C. I don't use Rust because by definition most things a test framework does is unsafe, e.g. test double-free, out of bounds situations on the hardware. These simulation frameworks also don't have a place in Litex mainline, but I'd really like them to work months if not years from now without lots of maintenance.

I had assumed the CSR was a stable API that would allow me to interoperate with Litex and use this for the simulation framework. As much as I have also disliked the way the CSR framework was implemented, I have always done targeted work-arounds because I know pushing anything upstream would break other people's code.

Now, the CSR changes in this issue really break things in subtle ways; code size is up, performance is down, access patterns are different. I definitely can't pull the new Litex on any currently shipping project without doing a serious, deep regression of every driver and subsystem, which is a huge time investment, so non-backward compatible changes like this just encourage the ecosystem to fork and fracture.

I would request that in the future if you want to introduce a new API, please don't destroy the old one. Maybe we could name the new CSR calls with a different name, and keep the old ones with the original name. That way we can gradually port code bases over to the new system, instead of throwing a grenade in everyone's derivative frameworks and drivers and forcing people to migrate to the new API just because it fixes some bugs in one branch. There isn't time in project schedules for this type of thing to happen on a regular basis.

Because CSRs tickle hardware state machines directly, these low-level routines sometimes even depend on the relative timing and access order of the instructions, so for example going from in-line loads and stores to a function call will break code that used to run; and in some cases code size is very tight, so even small changes to core APIs like this will cause code to cease to fit into small machines. Yes, this sounds scary, and it means even changing compiler optimization settings break things, but that's just how it is. It's too expensive to wrap robust APIs around every hardware state machine: this is why we have uncached regions and volatile pointers so we can economize on gates and push complexity out of hardware at the expense of less flexible code. But I'd advocate that the idea is that we fix all these sins at the C or Rust API to the system programmer, not at the CSR level to the hardware.

I do think it is wonderful that people are scaling Litex up to 64 bits and beyond (the most recent simulator bug I had to deal with was some routine trying to build 128-bit data types in the runtime libraries???) but, I am not happy with changes that mean the system fails to service the embedded space. If we keep going this way, eventually, Litex will end up a bit like Linux where it's wonderful that you could get it to run "on anything" from huge server farms to a 6502, but it comes at a huge performance cost to the lower end, and the low-end targets are more demoscene show-and-tell fodder; nothing you'd ever consider shipping for revenue.

In more concrete terms, I'm literally living in a world where I am five LUTs and 64 bytes short of fitting into a target sometimes, so any unnecessary abstractions introduced to make things "one size fits all" turns into a real, material cost: having to upgrade to the next increment FPGA bigger would destroy the economic viability of the project. One size fits all is not a practical goal, simply because there are like 6 orders of magnitude we are trying to span with one framework, from a 32-GiB desktop to a 32kiB embedded target.

If we do want to continue driving the ecosystem more toward the high-end, one possible compromise solution that @xobs suggested is that we come up with a separate "raw iron" API out-of-band from Litex, where we generate a stable raw-iron CSR API by reading in the SVD outputs from the build. This effectively decouples the embedded CSR problem from Litex mainline, and we can pin embedded drivers to a given C API while upgrading the Litex gateware. I think this is a clever idea, but I'd rather not acquire more codebase to maintain if I can get Litex mainline to buy into the idea of creating a versioned low-level API.

In summary, I'm not saying don't make new APIs -- just...please do it in a way that is versioned and interoperable so legacy code can still work without heavy regression testing; and also, maybe "one size fits all" is not a viable strategy for a project that may eventually have to span up to six orders of magnitude of hardware capabilities.

gsomlo commented 4 years ago

Now, the CSR changes in this issue really break things in subtle ways; code size is up, performance is down, access patterns are different. I definitely can't pull the new Litex on any currently shipping project without doing a serious, deep regression of every driver and subsystem, which is a huge time investment, so non-backward compatible changes like this just encourage the ecosystem to fork and fracture.

I would request that in the future if you want to introduce a new API, please don't destroy the old one. Maybe we could name the new CSR calls with a different name, and keep the old ones with the original name. That way we can gradually port code bases over to the new system, instead of throwing a grenade in everyone's derivative frameworks and drivers and forcing people to migrate to the new API just because it fixes some bugs in one branch. There isn't time in project schedules for this type of thing to happen on a regular basis.

Before PR #339, at commit 63cd23c9, all three pairs of CSR accessors (csr_[read|write][b|w|l]()) were written on the assumption that csr_data_width was 8 bits. Each of them made a single access to the address provided, attempting to transfer 8, 16, or 32 bits respectively. The 16 and 32 bit transfers also happened to discard anything beyond the least-significant 8 bits (single access to an 8-bit csr_data_width back-end). Further, throughout the code base, only csr_[read|write]l() were used in export.py to generate dedicated accessors for "compound" CSRs. The other pairs (csr_[read|write][b|w]()) were unused, and besides, redundant w.r.t. csr_[read|write]l().

Before PR #339, there was an RFC PR (#324) out between Dec. 23 and Jan. 12, where I was hoping to collect and address any feedback before submitting this type of change. Things seemed to go along pretty well during most of that time :)

Regardles, as things are right now, csr_[read|write]_simple() is functionally identical to the old csr_[read|write]l() pair, and should (given the way the *[b|w] pairs used to be written) also act as a drop-in replacement for those.

If we

#define csr_writel csr_write_simple
#define csr_readl csr_read_simple

you'd probably get back the exact same behavior we had before. Can't vouch for anything else throughout the LiteX code base, though :)

EDIT: Actually, we could probably just put back all of the old csr_[read|write][b|w|l]() declarations into common.h -- they'd be mostly dead code as far as the rest of the upstream LiteX tree is concerned, but if that's helpful to others, it might be worth considering...

enjoy-digital commented 4 years ago

@bunnie: thanks for your feedback, i totally agree with your points and the fact that things should be stable enough to avoid as much as maintenance as possible, that's really what we are trying to do (and i'm clearly in a similar situation you are with several LiteX based project to maintain little time to do the maintenance).

To make things clear and as @gsomlo just answered, there has been no gateware changes on the CSR API and the only things that temporarily changed are the implementation of the provided CSR accessors, before we noticed it was not as optimized as the original ones, and reverted to the old behaviour.

So i think there is a bit of frustration in your answer, which i could understand, but for now i don't understand why the behaviour has changed on your system (i'm not sure it's the CSR) and i'm happy to help you understand.

While we are trying to support more architectures in LiteX and make things more generic, it's mostly in the way the HDL or Software is generated from Python, not on the implementation. Every increase in gateware resource usage/software code size or performance is unwanted and while adding new features or 64-bit CPU in the last years, we've done others optimizations that have improved 32-bit support.

bunnie commented 4 years ago

Sorry the previous comment read grumpy. I actually was trying really hard to tone it down because I know I can be very sharp when I'm frustrated, and I apologize for letting that emotion out.

To be clear, I have been tracking the mainline Litex and was affected by all the earlier CSR API changes and bugs, but by the time I'd bump into it, I'd find that a bug was already filed on the issue, so I'd just do a work-around while it's fixed and carry on.

At this very moment things have gotten better in terms of compatibility, but today's breaking change was something to do with 128-bit datatypes, and I really didn't want to invest any time debugging 128-bit compatibility, so I just commented out the offending line and looked no further.

But my simulation frameworks had been largely stable up until a few weeks ago, because the CSR C API has been stable up until now, so I guess what I'm just trying to say, "hey, I use that API! I know it's a weird use of the API, but please be gentle with it if you can!" And in general, if we're refactoring major APIs it'd be nice to do a soft-handoff. Like, I've seen previously where Florent is phasing out a major API and he'll deliberately put a check for the legacy version and print a message warning of its phase-out along with some monkey patch code to bridge the gap -- that's quite nice, and I really appreciated those messages. I know it's not possible in all cases but I think in this case leaving the old code around might have been possible.

And thank you, I know you are trying to improve things, just letting you know the construction dust is affecting me. Also, sorry I missed the RFC, I actually had not been tracking issues on this repo very closely, so again, my bad.

enjoy-digital commented 4 years ago

@bunnie: this also demonstrates that proper releases would now be useful to still be able to make the project evolve and avoid such inconvenience for users if they are using releases. Something simple like what is done for Buildroot that just takes YYYY.MM as release's name would already be nice and would not require too much extra-work. I'll think about that.

enjoy-digital commented 4 years ago

We can probably close this now since: