enjoy-digital / litex

Build your hardware, easily!
Other
2.89k stars 555 forks source link

Mor1kx support broken by #517 #544

Closed enjoy-digital closed 4 years ago

enjoy-digital commented 4 years ago

517 changes the way accesses are done on the CSRs and broke Mor1kx support: lxsim is not responding to user commands.

Command reproduce it: rm -rf build && lxsim --cpu-type=mor1kx --opt-level=O0

Works before https://github.com/enjoy-digital/litex/commit/4a5072a014c4e5c1bdee4a06fbaf1b2d34a3769d.

gsomlo commented 4 years ago

It's caused by https://github.com/enjoy-digital/litex/commit/4a5072a014c4e5c1bdee4a06fbaf1b2d34a3769d#diff-a806562ead05ea193232765c9227bb22R43

cc @ozbenh -- MMPTR should always be unsigned long, i.e. u32 on 32-bit CPUs (where CSR alignment is 32 bits) and u64 on 64-bit CPUs (where CSR alignment is 64 bits).

CSR_DATA_WIDTH merely indicates how many of the bits within a MMPTR are populated

I'll send a PR and link this issue shortly.

enjoy-digital commented 4 years ago

Fixed with https://github.com/enjoy-digital/litex/pull/545.

enjoy-digital commented 4 years ago

Fixed with https://github.com/enjoy-digital/litex/pull/545.

ozbenh commented 4 years ago

I STRONGLY disagree. MMPTR should not be 64-bit if the bus width isn't 64-bit, otherwise it will cause all CSR accesses to be doubled up. Florent, this is the wrong fix.

The code was written as to take the alignment into account via CSR_OFFSET_BYTE and only do the proper access size accesses to the bus.

I don't know why Mor1kx broke and I can look into it but this is certainly not the right way to fix this.

ozbenh commented 4 years ago

BTW. Is Mor1kx big endian ? That would explain if the CSRs follow native endian (which is also a very wrong thing to do btw, HW interfaces should have fixed well defined endianness, swapping is cheap).

mithro commented 4 years ago

There are a couple of situations here;

mithro commented 4 years ago

@ozbenh mor1kx is big endian.

mithro commented 4 years ago

@ozbenh CSRs following native endianness makes a lot of sense in FPGA based designs where the CPU and the peripherals are coupled. FPGAs pay a huge cost for being reconfigurable, hence you should take advantage of that reconfigurability.

mithro commented 4 years ago

FYI -- https://github.com/enjoy-digital/litex/blob/3e1b17d459c1966dd561ad5262a7cb9299575191/litex/soc/software/include/hw/common.h#L18-L20

mithro commented 4 years ago

Just reiterating, there are 3 data widths going on here; (a) CPU native data width (b) CPU bus width (c) CSR bus width

As the moment (a) and (b) can be either 64bit or 32bit while (c) can be 8bit, 16bit, 32bit or 64bits.

ozbenh commented 4 years ago

No @mithro this is a false sense of benefit to use "native endianness". Your list of situations also misses the 32-bit CSR bus width which is becoming more common and might in fact become the standard.

There is no cost in hardware in using a well defined endianness for register and almost no gain in SW as the cost of the swap is negligible.

On the other hand it goes against having well defined SW/HW interfaces which are a requirement if you ever want to run something other than "what's compiled along with the generated bitfile" on LiteX.

Additionally, when having cases such as different size vs. stride (alignment) of registers, having variable endianness means problems like the one above where the offset of the register within the alignment region varies.

This is the kind of "flexibility" that provides illusory gains at the cost of maintainability in the long run.

If LiteX wants to be able to have upstream Linux support it's time to start solidifying some of the HW/SW interfaces. But that's a discussion for another issue :)

Regarding the CSRs, I had time to think more about this while driving today and I came to the following conclusion:

In that second case, you do not want an access size that is wider than your CSR bus (64-bit) because this will generate multiple accesses on the bus.

In fact, I would argue further, in that case, the CSR alignment should remain 32-bits, there is no reason to space them further appart since the down-converter handles the spacing properly.

Now, I believe @enjoy-digital you intend to make 32-bit wishbone the "standard" interface for CSRs rather than plugging them into random bus width directly, am I correct ? IE, you did that change to litedram recently for example.

In that case, am I correct to assume that there will always be either a proper 64->32 bit down-converter going to the CSRs on a SoC with a 64-bit processor ? If yes, that means that we really only have to deal with CSR alignments of 8 and 32-bit.

If we can agree on the above, that means we can solve the MMPTR problem by:

enjoy-digital commented 4 years ago

@ozbenh: thanks, we should indeed limit the supported configurations. Always having the CSR bus connected to a 32-bit bus (automatically down-converted from 64-bit when the main bus is 64-bit) and limiting csr_data_width to 8 and 32 would indeed simplify things. With this, we could also get rid of the csr_alignment parameter and only supports 32-bit. This will work for all my use cases, would this also work for yours @gsomlo? (IIRC you are using a 64-bit alignment for CSRs).

mithro commented 4 years ago

No @mithro this is a false sense of benefit to use "native endianness". Your list of situations also misses the 32-bit CSR bus width which is becoming more common and might in fact become the standard.

Well, I didn't want to list all the combinations;

(BTW A few other people have also considered doing an 8bit CPU, 8bit bus and 8bit CSR width and some other people have also considered doing 16bit CPU, 16bit bus and 8bit CSR width.)

There is also the potential for the CSR bus to be connected to a direct "IO bus" (and accessed with specific instructions) rather than directly mapped into the data bus.

There is no cost in hardware in using a well defined endianness for register and almost no gain in SW as the cost of the swap is negligible.

This isn't true at all.

A single extra ASM instruction in the CSR access path reduces things like bit-banging performance in bare metal systems by around 33-40%. This can mean something which could previously done without hardware assistance now needs hardware which means that you design no longer fits on the FPGA.

An extra ASM instruction also increases the code sizes which leads to more cache misses and other problems. Cache is super precious on the FPGA - we don't have massive amounts of it like on modern CPUs.

On the other hand it goes against having well defined SW/HW interfaces which are a requirement if you ever want to run something other than "what's compiled along with the generated bitfile" on LiteX.

This is the kind of "flexibility" that provides illusory gains at the cost of maintainability in the long run.

If LiteX wants to be able to have upstream Linux support it's time to start solidifying some of the HW/SW interfaces. But that's a discussion for another issue :)

We probably find a place to discuss this, as there are a bunch of approaches which will make upstream Linux support not worth the effort. What do you think is the best way? Maybe the mailing list at https://groups.google.com/forum/#!forum/linux-litex ?

We should also make a distinction between what LiteX itself supports verses what is required for Linux support. For example, LiteX should continue to support systems without external DDR memory and CPU configurations without MMU but you are not going to be able to run Linux on those configurations.

Regarding the CSRs, I had time to think more about this while driving today and I came to the following conclusion:

<snip>

You are correct that there are two options when going from one bus width to another bus width. You can either have a "proper" converter, or you can have just a direct mapping with holes.

There are two places that can need conversion;

These are two distinct conversion points and this seems to be where the complexity is coming from here?

For (a) -- I would say that having the CPU and data buses the same width is the normal configuration for 32bit CPU architectures. 64bit CPUs frequently do have 64bit -> 32bit bus converter but it should not be mandatory.

For (b) the decision if you want the data bus and the CSR bus width to be the same is complicated. The two most common configurations at the moment are;

(1) makes the most sense when you want the smallest amount of resource usage and easiest routing on an FPGA. This is because frequently you have a lot of peripherals and routing a wide bus around inside an FPGA is very expensive.

(2) makes the most sense when exposing to a "traditional system". If you are creating a PCIe card which exposes the CSRs directly to an x86 system.

ozbenh commented 4 years ago

So let's address alignment and endianness separately and start with the latter.

So first keep in mind that most data intensive use of registers (ie, non-DMA transfers) tend to be fifo or "memcpy" style operations which do not require byte swapping in any case (unless the bus is wired very wrongly but that's easily fixed).

Cases where bits of a registers are used for bit banging can generally be handled by pre-swapping the constants used to AND or OR. Been there, done that.

In my experience, swapping is only required on control path which aren't particularly performance sensitive.

That said, I'm also happy if we simply state that for platforms that support Linux, we fix the CSR endianness, but leave the option for other types of custom HW to have it native. We'll need to create an appropriate #define and a bit more macro gunk in hw/common.h to handle both cases but it's not that hard.

We probably want to cleanup our story in hw/common.h with IOs anyway to add the necessary memory barriers or the ability to use special instructions when needed. So I'm proposing to create a set of IO accessors for _le andbe, and rework csr*_simple() to use them. We'll need that for microwatt anyway since we'll need special instructions for cache-inhibited load stores at boot time (there's a hack in the core right now to avoid that but it will eventually go away).

I understand the desire to support very small platforms, I know the pain of large busses in FPGAs, so supporting 8-bit CSRs for small configs makes sense.

I don't think we want to support all configurations though.

That's why the idea is to limit the supported list to 8-bit and 32-bit CSRs to begin with, I think so far everybody agrees.

Now, my understanding is that the move is to go to always plumb those to a 32-bit wishbone (without down-conversion for the 8-bit case, ie just wiring the 8 bits to the LSBs of the bus, thus incurring an alignement higher than CSR size). This should be fine since the synthesizer should drop the unused data lines afaik.

This leads to two interesting categories of cases here:

That limits the amounts of cases to deal with. We can make hw/common.h able to deal with all of the above setups, for that I would suggest bringing back my patch to make the access size match the CSR size and fix it by adding the necessary BE offset (CSR_BYTE_OFFSET = (alignment - csr_size) / 8) and fold it in MMPTR.

enjoy-digital commented 4 years ago

As we see, supporting too much configurations lead to confusion and the compromise we found with @ozbenh and @gsomlo seem to cover all cases and greatly simplify things:

When a 64-bit main bus is used, since wishbone.sel is now used as a qualifier for the CSR write, it's now possible for the CPU to only write write 32-bit of LSBs/MSBs on the CSR bus. (On this point, we could also avoid doing the downstream access when wishbone.sel == 0, i will work on that, but that's only for performance/efficiency).

These simplifications will also work with a 8-bit or 16-bit CPU, that would use a 8-bit CSR data width. This will maybe impose some restrictions (like 32-bit CSR alignment) but this should be acceptable and a good compromise to simplify things for the more common use cases and ease support/understanding of the CSR bus.

mithro commented 4 years ago

@enjoy-digital - If we are going with supporting two options, then the two options that make sense to me are;

If we are okay with 3 options, then we can also add the 32bit CSR width option for usage on 64bit CPUs.

16bit CSR width only makes sense on a CPU with a native 16 bit width (which is covered by matching the CPU native width). I can't imagine a 24bit CSR width ever making sense.

BTW While LiteX only uses really uses wishbone with 32bit width at the moment, Wishbone actually supports 8bit, 16bit, 32bit, 64bit and 128bit and the same with AXI.

mithro commented 4 years ago

So let's address alignment and endianness separately and start with the latter.

  • Endianness. I disagree but....:)

So first keep in mind that most data intensive use of registers (ie, non-DMA transfers) tend to be fifo or "memcpy" style operations which do not require byte swapping in any case (unless the bus is wired very wrongly but that's easily fixed).

Yeap, I agree with that analysis.

Cases where bits of a registers are used for bit banging can generally be handled by pre-swapping the constants used to AND or OR. Been there, done that.

The best option with LiteX style CSRs is to just split the read / write / direction values into separate CSRs registers so you don't need to do anything apart from byte read/writes and rotates.

In my experience, swapping is only required on control path which aren't particularly performance sensitive.

The actually case where you have CSR registers which are >8bits tends to be either;

Reading the counters wrong tends to be pretty easy to see.

Writing memory addresses tend to be the place where people screw up most of the time.

That said, I'm also happy if we simply state that for platforms that support Linux, we fix the CSR endianness, but leave the option for other types of custom HW to have it native. We'll need to create an appropriate #define and a bit more macro gunk in hw/common.h to handle both cases but it's not that hard.

Yes, that seem like the right approach to me.

For something like Linux we can fix the CSR endianness but allow bare metal to be more flexible.

We probably want to cleanup our story in hw/common.h with IOs anyway to add the necessary memory barriers or the ability to use special instructions when needed. So I'm proposing to create a set of IO accessors for _le andbe, and rework csr*_simple() to use them. We'll need that for microwatt anyway since we'll need special instructions for cache-inhibited load stores at boot time (there's a hack in the core right now to avoid that but it will eventually go away).

I agree this makes sense too.

Up until recently, the solution to this problem has been to make one of the top bits determine if the memory bus is cached or not. With wanting to support gigabytes of memory has lead to wanting different approaches now.

<snip>

That limits the amounts of cases to deal with. We can make hw/common.h able to deal with all of the above setups, for that I would suggest bringing back my patch to make the access size match the CSR size and fix it by adding the necessary BE offset (CSR_BYTE_OFFSET = (alignment - csr_size) / 8) and fold it in MMPTR.

I have a slightly different idea about what the limited cases should be (see my response to @enjoy-digital).

I'm generally supportive of trying to make the code a lot cleaner and easier to understand. I also think having some strong static asserts to make sure that the code fails to compile in unsupported configurations is also useful.

I believe that MMPTR stood for pointer to memory mapped device. Maybe the MMPTR function should be renamed to something like PTR_TO_CSR_BUS or PTR_TO_VOLITILE_MEMORY or something like that?

ozbenh commented 4 years ago

What would "CPU native width" mean in that context ?

For example on Microwatt (and I can imagine other SoCs wanting to do the same thing to save routing pressure in smaller FPGAs) the "dumb IOs" such as CSRs are segregated on a 32-bit bus behind a downconverter. In that case I wouldn't want CSRs to be 64-bits at all.

So we really have:

I think we can at least agree on that the CSR "bus width" which is the data width unit by which underlying CSRs can be accessed, regardless of the actual size of the CSRs, need to be at least supported for sizes 8 and 32 (I'll get back to other sizes later).

I think we can probably agree that the CPU bus width per-se doesn't matter that much, what does is the SoC "bus" by which CSRs are accessed. Of course that bus is constrained by the CPU it doesn't make sense to have a 32-bit CPU with a 64-bit wide IO bus.. memory behind caches is a different matter so let's ignore it for now). So we can assume that SoC "IO" bus <= CPU bus, and thus SoC "IO" bus is what really matters.

Now if we do agree on this, then what you advocate is to have the CSR bus width be either 8 always, or match the above CPU "IO" bus, which opens the door for more than 8 and 32 but also 16 and 64. The question is whether there is value in that to balance with the maintainance and testing burden associated with it.

For things like Microwatt and I suspect most cases of a 64-bit CPU in common "consumer grade" FPGAs (for whatever that means), I suspect 32-bit is a good compromise as it reduce the routing pressure on the FPGA for most IOs.

So are there sufficient high value cases for full 64-bit and 16-bit:

Do we see the above as a realistic issue for CSRs ? Do we see CSRs as something where a host CPU (either the LiteX hosted core or a remote CPU on PCIe etc...) needs to manipulate 64-bit quantities and can't afford to do so via 2x32-bit transactions ?

My gut feeling is no but I don't necessarily have the whole picture.

As for "really large CSRs", they feel like they shouldn't be CSRs to begin with (and as discussed in anotehr issue, their layout leaves somewhat to be desired). Something that is essentially a byte stream should be probably be handled as a FIFO or a dedicated MMIO region. If it's shoved into CSRs, it should be done in a byte-address-invariant scheme that is endian neutral as discussed in that other issue, in which case it doesn't really matter what the CSR access width is.

enjoy-digital commented 4 years ago

Thanks @mithro, @ozbenh for the feedback. I fully agree with @ozbenh and want to remind that:

So i don't think it's a good choice to re-introduce support for other CSR data-width since:

So we should just not change the current gateware regarding CSR data-width support: 8-bit and 32-bit and as tested with @gsomlo on Rocket we can just remove the CSR alignment and always set it to 32-bit. This should already allow to simplify the software support.

mithro commented 4 years ago

Maybe it is worth looking at the use cases for CSR registers?


This is how I group them;

Group A - Performance sensitive on microcontrollers

Group B - Wide CSRs with atomic requirements

Group C - Others


Group A is the area which can be performance sensitive (but is mostly not) on low power FPGA microcontroller. Generally you are just writing whole

Group B is the area where endian and width matters. It is also the area where it is really easy to screw things up (introduce both functionality and security problems). Group B can also be a little performance sensitive as you may want to be updating them inside the interrupt controller.

The "ideal" interface for Group B would just be a read / write to a memory location with the same values you would use in a pointer.

The other important part of Group B is that you need the writes (and reads to) probably be atomic.


How does the 32bit CSR data-width affect Group B style CSRs on 64bit CPUs? For example on a DMA controller which has memory pointers to anywhere in the 64bit memory space?

mithro commented 4 years ago

BTW - We really should write up some documentation about "good practices" on how to use LiteX CSRs.

enjoy-digital commented 4 years ago

Supporting only 8-bit and 32-bit CSR data-with and CSR alignment of 32-bit has no impact on Group A.

For Group B, with the default behaviour of CSR, we can guarantee 8-bit atomicity with 8-bit CSR data-width, 32-bit atomicity with 32-bit CSR-width, so since want atomicity up to 32-bit for Group B, we are fine the 32-bit CSR data-width.

Now, being able to provide atomicity on arbitrary CSR size is a different thing, and we have a mechanism for that on CSRStorage (atomic_write) but that is not possible on CSRStatus due to the nature of the CSR Bus, so we are currently limited to an atomicity of the CSR data-width for reads.

Atomics are almost not used in LiteX and you will generally want to avoid them if you create a peripheral that needs to work on different CPU data widths (even if 64-bit atomicity was supported for 64-bit CPUs, you would need to have a different implementation for 32-bit CPUs). You can almost always use atomics by splitting data and control registers and that's we are generally doing:

So i think we are digressing from the original issue. I'm happy we discussed how to do atomics, if this is recommended practice or not for peripherals, for portability and to try to document best practices for CSRs in the Wiki, but we could probably move to another issue.

ozbenh commented 4 years ago

I tend to agree that atomic 64-bit isn't a huge issue. Ideally a DMA controller should use a memory based descriptor ring anyway, so the address of the ring is configured once at start up and that doesn't need to be atomic as long as it's done before the engine is started.

If you don't use a ring, but descriptor chains, then same thing, just don't start the chain, or put a valid bit in the bottom half and write it last. No biggie.

The interrupt controller might be the most timing sensitive. That said, many procecssor architectures have their own standard interrupt controllers that are MMIO based and I suspect LiteX will eventually have to support that (at least we'll require XICS for Linux on Microwatt, I wouldn't be surprised if RiscV eventually moves in that direction as well).

enjoy-digital commented 4 years ago

Closing since the original issue is fixed.