embench / embench-iot

The main Embench repository
https://www.embench.org/
GNU General Public License v3.0

Discussing potential advantages of unaligned access and bi-endian hardware support #174

Open jeras opened 1 year ago

jeras commented 1 year ago

I am writing a system bus protocol specification and I have spent more time than I expected thinking about explicit bi-endian and unaligned access support. I would like to discuss what you think the impact on software performance and memory consumption would be if such features were implemented in hardware. I think I could implement zero-overhead unaligned support in hardware, but I do not know how much time it would be worth spending on it. I saw a recent presentation on DSP benchmarks in Embench, and I thought you might be willing to discuss my ideas.

I tried to find some research online, but all I could find were discussions on how to avoid misaligned accesses in software, so as to avoid the performance hit caused by missing hardware support. I would like to know whether software could be written to take advantage of a hardware implementation, that is, hardware where unaligned accesses carry zero or minimal performance penalty compared to avoiding them.

One example where there could be an advantage is DSP processing of data whose width is not a power of two, for example 24-bit RGB-888. 24-bit data from a compact data structure in memory could be loaded into a register with just a load operation, without shift instructions. An AND mask might still be needed, depending on how the data is processed after the load. For stores, the unused 8 bits could just be overwritten. FIR filters with 24-bit data (better precision than 16-bit, less memory than 32-bit) and coefficients might also be possible. Compact storage would reduce memory consumption and cache misses.
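To make the RGB-888 case concrete, here is a minimal sketch written as a Python model rather than real DSP code. It assumes little-endian byte order and a buffer padded with a few spare bytes so that a 32-bit load of the last pixel stays in bounds. The helper names (`load_u32`, `pixel_unaligned`, `pixel_aligned_only`, `pack_pixels`) are made up for illustration. The first pixel reader stands for hardware with unaligned loads (one load plus an AND mask); the second stands for software restricted to aligned loads (up to two loads plus shifts and an OR).

```python
import struct


def load_u32(buf: bytes, addr: int) -> int:
    """Model a little-endian 32-bit load at byte address addr."""
    return struct.unpack_from("<I", buf, addr)[0]


def pixel_unaligned(buf: bytes, index: int) -> int:
    """With unaligned loads: one load at byte offset 3*index, then a mask."""
    return load_u32(buf, 3 * index) & 0xFFFFFF


def pixel_aligned_only(buf: bytes, index: int) -> int:
    """With aligned loads only: one or two word loads plus shifts."""
    addr = 3 * index
    base = addr & ~3            # aligned word containing the first byte
    shift = 8 * (addr & 3)      # bit offset of the pixel inside that word
    lo = load_u32(buf, base)
    if shift <= 8:
        # All three bytes sit inside the first aligned word.
        return (lo >> shift) & 0xFFFFFF
    # The pixel straddles two aligned words: fetch the next one and merge.
    hi = load_u32(buf, base + 4)
    return ((lo >> shift) | (hi << (32 - shift))) & 0xFFFFFF


def pack_pixels(pixels):
    """Pack 24-bit values back to back, plus 4 bytes of padding (assumption)."""
    return b"".join(p.to_bytes(3, "little") for p in pixels) + bytes(4)


if __name__ == "__main__":
    pixels = [0x112233, 0x445566, 0x778899, 0xAABBCC]
    buf = pack_pixels(pixels)
    for i, _ in enumerate(pixels):
        print(hex(pixel_unaligned(buf, i)), hex(pixel_aligned_only(buf, i)))
```

Both readers return the same values; the difference is only in how many loads and shift operations each one needs per pixel, which is what the hardware support would be trading against.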

Another use case I was thinking of was a more compact heap and stack, but this would require non-trivial compiler changes, and the advantage is less obvious compared to DSP applications.

Since I do not have much to work with for data problems, I am focusing on unaligned instruction fetches with the RISC-V C extension. There the impact can be quantified with existing benchmarks. Unfortunately, software is not my forte, so I do not have any SW running on my HW models except for RISC-V compliance tests. Until I collect the will to port at least Dhrystone, I will be focusing on the hardware aspects.

I plan to write an RTL library component for splitting unaligned accesses into pairs of aligned accesses, with pipelining for sequential (incrementing address) accesses. This is obviously already used in most optimized instruction fetch implementations, so there is nothing new for me to invent. I was also thinking about adding unaligned access support to memories to have zero overhead for non-sequential accesses. Some of my thoughts on this are written up as an OpenRAM issue.
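The splitting idea can be sketched as a small behavioural model in Python. This is not RTL and not the planned component; the class names `AlignedMemory` and `UnalignedReader` are hypothetical, and a 32-bit little-endian bus is assumed. Each unaligned read is built from two aligned word reads, and the most recently fetched word is held so that a following sequential read only needs one new aligned read.

```python
class AlignedMemory:
    """Memory that only accepts word-aligned 32-bit reads and counts them."""

    def __init__(self, data: bytes):
        assert len(data) % 4 == 0
        self.data = data
        self.reads = 0

    def read_word(self, addr: int) -> int:
        assert addr % 4 == 0, "only aligned accesses reach the memory"
        self.reads += 1
        return int.from_bytes(self.data[addr:addr + 4], "little")


class UnalignedReader:
    """Splits an unaligned 32-bit read into aligned reads, reusing one held word."""

    def __init__(self, mem: AlignedMemory):
        self.mem = mem
        self.held_addr = None   # address of the aligned word currently held
        self.held_word = 0

    def _fetch(self, addr: int) -> int:
        if addr == self.held_addr:
            return self.held_word          # reuse, no memory access
        word = self.mem.read_word(addr)
        self.held_addr, self.held_word = addr, word
        return word

    def read32(self, addr: int) -> int:
        base = addr & ~3
        shift = 8 * (addr & 3)
        lo = self._fetch(base)
        if shift == 0:
            return lo                      # already aligned, single access
        hi = self._fetch(base + 4)         # second half of the pair
        return ((lo >> shift) | (hi << (32 - shift))) & 0xFFFFFFFF


if __name__ == "__main__":
    mem = AlignedMemory(bytes(range(16)))
    reader = UnalignedReader(mem)
    for a in (1, 5, 9):
        print(hex(reader.read32(a)))
    print("aligned reads issued:", mem.reads)
```

In this model, N sequential unaligned reads cost N + 1 aligned reads instead of 2N, while a non-sequential unaligned read still costs two. That remaining cost is the case that unaligned support inside the memory itself would address.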

jeremybennett commented 1 year ago

Hi @jeras

You have an interesting project, and Embench could be suitable for measuring the impact on software. It could certainly be more comprehensive than either Dhrystone or CoreMark, since it is a collection of very different programs.

You are welcome to talk about your ideas at one of our monthly meetings. I am not sure what more we can do to help - it sounds like you need a colleague with firmware experience to get the benchmark running on your platform.

I'll leave this issue open for further comment for a few days. If there is no further discussion, I will mark it closed.

jeras commented 1 year ago

Thanks for a bit of support, Jeremy. I don't really have anyone to talk to about RISC-V offline, so I lack feedback. I will try to join one of your meetings when I have made some progress on the RTL. In the meantime I plan to be a bit more active on the Reddit RISC-V forum; hopefully I can find somebody to help me with the firmware in exchange for some testbench HDL or something similar.