RISC-V DataLayout is wrong which causes many optimizations to not work on RISC-V

Quuxplusone commented 4 years ago


Bugzilla Link	PR45708
Status	NEW
Importance	P enhancement
Reported by	Heikki Kultala (heikki.kultala@gmail.com)
Reported on	2020-04-28 01:40:49 -0700
Last modified on	2020-04-30 16:06:55 -0700
Version	trunk
Hardware	All All
CC	asb@lowrisc.org, efriedma@quicinc.com, htmldeveloper@gmail.com, llvm-bugs@lists.llvm.org
Fixed by commit(s)
Attachments
Blocks
Blocked by
See also

The RV32I and RV64I Base ISAs supports 8-bit and 16-bit memory operations, AND
they support unaligned memory operations.

However, on LLVM

RISCV32 data layout is just "e-m:e-p:32:32-i64:64-n32-S128",
and RISCV64 data layout is just "e-m:e-p:64:64-i64:64-i128:128-n64-S128".

These lacks information about 8- and 16-bit data types, and also the alignment
for 32 and 64 bit memory accesses is wrong.

Because the data layout forces aligned memory accesses, the backend generates
very slow code consisting of multiple narrower accesses when it cannot prove
that a memory access is aligned.

And the total lack of information about 8-bit and 16-bit accesses in the data
layout makes clang think 8-bit and 16-bit types or memory accesses are not
legal, which prevents some optimizations from working, for example -ffine-
grained-bitfield-accesses fails to generate 8-bit or 16-bit accesses because
there is a check that it only generates legal loads.

Quuxplusone commented 4 years ago

(In reply to Heikki Kultala from comment #0)
> Because the data layout forces aligned memory accesses, the backend
> generates very slow code consisting of multiple narrower accesses when it
> cannot prove that a memory access is aligned.

This can't be right; it isn't part of the datalayout at all.  See
TargetTransformInfo::allowsMisalignedMemoryAccesses.

Whether i8 and i16 should be considered "native" on RISC-V is a topic for llvm-
dev.  (I think last time it was discussed for RISC targets generally, we
concluded that the types should not be considered native, and we should fix
transforms to use DataLayout::fitsInLegalInteger when it makes sense.)

I don't know what's happening with -ffine-grained-bitfield-accesses here off
the top of my head; I would be surprised if it's actually checking the
datalayout.

Quuxplusone commented 4 years ago

Hi Heikki - thanks for the report. It's possible that there are issues with the datalayout string for RISC-V, but if so I don't think they're as you state them.

"These lacks information about 8- and 16-bit data types": As Eli says, the LLVM approach is not to declare integer types as native for RISC architectures where there isn't a set of native instructions that operate on them. So while the x86-64 datalayout string will have 8,16,32,64 as native integer widths, targets like RISC-V, AArch64 will not. We don't specify the alignment 8 and 16-bit types in the datalayout string because the default alignment is correct.

"and also the alignment for 32 and 64 bit memory accesses is wrong": 4-byte is indeed the preferred alignment for 32-bit values and 8-byte is the preferred alignment for 64-bit values.

Looking at -ffine-grained-bitfield-accesses, I see that CGRecordLowering::accumulateBitFields in CGRecordLayoutBuilder.cpp will check DataLayout.isLegalInteger(OffsetInRecord). You're right that this will fail 8 and 16-bit values on RISC-V, though you're going to get the same behaviour for AARch64 which also doesn't specify 8 and 16-bit values as native. The check you logically want to perform is whether the target supports load/store with that width, which is different to whether the type is legal. Note it also checks the offset is naturally aligned. I don't believe information on available load/store operations is obtainable just from the datalayout string, but it seems to me that code could reasonable assume any power-of-two word size less than the native word size is supported.

As for misaligned memory accesses, we should probably expose more options to control the assumptions LLVM makes but it's worth noting that 1) the RISC-V spec does not mandate that misaligned memory accesses are supported and 2) some execution environments that do support them do so with incredibly low performance - with a full trap+emulate. Last time I checked, the standard SDK for devices like the SiFive FE310 produced binaries where no trap handler is installed for misaligned memory accesses. i.e. programs with misaligned loads/stores will simple crash.

Could you say more about the problem you're trying to solve? Are you targeting a RISC-V implementation that does support misaligned memory accesses with reasonable performance?

Quuxplusone commented 4 years ago

I proposed a straightforward patch for -ffine-grained-bitfield-accesses: https://reviews.llvm.org/D79155

Quuxplusone commented 4 years ago

(In reply to Alex Bradbury from comment #3)
> I proposed a straightforward patch for -ffine-grained-bitfield-accesses:
> https://reviews.llvm.org/D79155

This seems to fix my main problem.

The lowering of unaligned accesses is then a separate problem from this, and it
seems that the correct fix to that would not be to change the datalayout but to
implement allowsUnalignedAccesses in RISVISelLowering so that it returns true
but sets the Fast flag to false (to not lower any unaligned accesses away, but
does not generate new ones either from for example memcpy/memset optimizations).

Quuxplusone commented 4 years ago

(In reply to Alex Bradbury from comment #2)

> As for misaligned memory accesses, we should probably expose more options to
> control the assumptions LLVM makes

I was originally wrong when I blamed DataLayout for the alignment. The real
reason was really the missing allowsMisalignedAccessses() in the RISC-V backend.

It seems that, for alignment, all the required interfaces for all the alignment
options are there:

In the data layout we have

1) ABI alignment (for structs)
2) Preferred alignment (for linker?)

And on backend we have

3) allowsMisalignedAccessses (is misaligned are legal, do we need get rid of
them)
4) bool flag in this function, whether misaligned accesses are slow or fast,
does it make sense to make optimizations which create new misaligned accesses.

This is all the reasonable interfaces requires for all reasonable decisions
based on alignment. Just the last ones are not implemented correctly on RISC-V
backend.

But what seems to be missing is "are load/stores to this type native?". The
native type in DataLayout consideres only arithmetics, which is somewhat
stupid, as datalayout SHOULD be about data in memory, not in datapath.

> but it's worth noting that 1) the RISC-V
> spec does not mandate that misaligned memory accesses are supported

Yes it mandates.

https://content.riscv.org/wp-content/uploads/2017/05/riscv-spec-v2.2.pdf

page 19:

"For best performance, the effective address for all loads and stores should be
naturally aligned
for each data type (i.e., on a four-byte boundary for 32-bit accesses, and a
two-byte boundary for
16-bit accesses). The base ISA supports misaligned accesses, but these might
run extremely slowly
depending on the implementation. Furthermore, naturally aligned loads and
stores are guaranteed
to execute atomically, whereas misaligned loads and stores might not, and hence
require additional
synchronization to ensure atomicity"

> and 2)
> some execution environments that do support them do so with incredibly low
> performance - with a full trap+emulate. Last time I checked, the standard
> SDK for devices like the SiFive FE310 produced binaries where no trap
> handler is installed for misaligned memory accesses. i.e. programs with
> misaligned loads/stores will simple crash.

Exactly because of this, the allowsMisalignedMemoryAccesses method  takes an
addtional "bool* Fast" parameter.

In case the HW implementation might be slow, then the
allowsMisalignedMemoryAddresses function implementation shoulld set
(if Fast) *fast = false;

This means that LLVM will not generate new misaligned addresses from example
mempcy or memset optimizations, but will still keep unaligned accesses that
were originally unaligned, does not lower these into multiple narrower accesses.

> Could you say more about the problem you're trying to solve? Are you
> targeting a RISC-V implementation that does support misaligned memory
> accesses with reasonable performance?

In my original case, the access was really aligned, but LLVM did not know it
was aligned, because of a struct pointer to a function. So what my hardware was
capable of did not matter here.

And the purpose why RISC-V supports misaligned accesses is clear - to not force
compilers to create slow furballs when they cannot prove something is aligned.

Quuxplusone commented 4 years ago

(In reply to Heikki Kultala from comment #5)
> Yes it mandates.
>
> https://content.riscv.org/wp-content/uploads/2017/05/riscv-spec-v2.2.pdf

See https://riscv.org/specifications/privileged-isa/ for the system-level spec;
it allows implementations that emulate unaligned loads using an alignment trap
instead of native unaligned operations.  It would be pretty hard to implement
the emulation code if the emulation itself traps. :)

Probably for Linux targets we should assume misaligned accesses are fast by
default.  And maybe support -munaligned-access/-mno-unaligned-access to
override the default.

Quuxplusone commented 4 years ago

Also see the current version of the specification https://content.riscv.org/wp-content/uploads/2019/12/riscv-spec-20191213.pdf which explicitly punts the question of misaligned memory handling to the execution environment: "When an EEI does not guarantee misaligned loads and stores are handled invisibly, the EEI must define if exceptions caused by address misalignment result in a contained trap (allowing software running inside the execution environment to handle the trap) or a fatal trap (terminating execution)."

My point is simply that assuming any RISC-V target supports misaligned memory accesses is incorrect, which is why we've started with this conservative starting point. It's certainly safe for any RISC-V Linux target to assume misaligned accesses are supported, and as I said before I'm in agreement that we should implement the interfaces you point to - sorry if I wasn't clear on that.

Eli: I don't think assuming misaligned accesses are fast is safe for Linux. The SiFive U54 and most Rocket-derived designs don't support misaligned accesses in hardware, so will have to go through a pretty slow software trap handler.

Quuxplusone commented 4 years ago

(In reply to Eli Friedman from comment #6)
> (In reply to Heikki Kultala from comment #5)
> > Yes it mandates.
> >
> > https://content.riscv.org/wp-content/uploads/2017/05/riscv-spec-v2.2.pdf
>
> See https://riscv.org/specifications/privileged-isa/ for the system-level
> spec; it allows implementations that emulate unaligned loads using an
> alignment trap instead of native unaligned operations.  It would be pretty
> hard to implement the emulation code if the emulation itself traps. :)

The emulation routine would not perform an unaligned access. It would perform
multiple narrower accesses.

So this argument makes no sense.

Quuxplusone commented 4 years ago

(In reply to Alex Bradbury from comment #7)
> Eli: I don't think assuming misaligned accesses are fast is safe for Linux.
> The SiFive U54 and most Rocket-derived designs don't support misaligned
> accesses in hardware, so will have to go through a pretty slow software trap
> handler.

Oh.  That's a bit surprising to me, although I guess I can see why they might
do that.

If we expect users will routinely be running Linux on hardware that traps on
misaligned access, we probably shouldn't generate misaligned accesses at all,
unless the user explicitly requests them.

Quuxplusone commented 4 years ago

(In reply to Eli Friedman from comment #9)
> (In reply to Alex Bradbury from comment #7)
> > Eli: I don't think assuming misaligned accesses are fast is safe for Linux.
> > The SiFive U54 and most Rocket-derived designs don't support misaligned
> > accesses in hardware, so will have to go through a pretty slow software trap
> > handler.
>
> Oh.  That's a bit surprising to me, although I guess I can see why they
> might do that.
>
> If we expect users will routinely be running Linux on hardware that traps on
> misaligned access, we probably shouldn't generate misaligned accesses at
> all, unless the user explicitly requests them.

Again: This is exactly why there is this "Fast" parameter for the
allowsMisalignedAccesses function. That parameter is made exactly for this,
setting this false will prevent optimizations that generate new unaligned
accesses, because unaligned access can be trapping (very slow) in some
(micro)architectures.

But about "never generating unaligned accesses":

Because something that happens very, very rarely CAN SOMETIMES BE slow is not a
reason to make something that happens much more often ALWAYS slow.

An access which has align 1 can still be align 4 in reality.
And, quite often, the situation is exactly that: alignment information has just
been lost somewhere, the data is still aligned, but the backend just does not
know it. For example, a function takes a pointer parameter which has align 1,
but all calls to the function are made with values that are aligned by 4.

Always lowering this to multiple narrower loads means just totally unnecessary
artificial slowdown in these cases.

Quuxplusone commented 4 years ago

The "fast" boolean is meant to indicate cases where unaligned is slower, but at least the same order of magnitude. If an unaligned load is 1000x slower, it could completely kill the performance of a program; the upside isn't enough to make that gamble worthwhile. (Not sure what exactly the overhead is in practice.)

It might be possible to add an optimization to speculate the alignment of pointers. Not sure what the details of that would look like, but essentially "if (p & 3) unaligned_load(p) else aligned_load(p)". Probably most useful for loops. But that's sort of orthogonal.

Quuxplusone / LLVMBugzillaTest

RISC-V DataLayout is wrong which causes many optimizations to not work on RISC-V #44678