LLVM Generates abysmal code in simple situation

Quuxplusone / LLVMBugzillaTest

0 stars 0 forks source link

LLVM Generates abysmal code in simple situation #24819

Open Quuxplusone opened 9 years ago

Quuxplusone commented 9 years ago


Bugzilla Link	PR24820
Status	NEW
Importance	P normal
Reported by	Ryan Houdek (Sonicadvance1@gmail.com)
Reported on	2015-09-15 05:26:27 -0700
Last modified on	2021-05-24 00:05:22 -0700
Version	trunk
Hardware	PC Linux
CC	andrew.savonichev@gmail.com, david.green@arm.com, james@jamesmolloy.co.uk, llvm-bugs@lists.llvm.org
Fixed by commit(s)
Attachments
Blocks
Blocked by
See also

In the instance of generating code that does a simple '<2xFloat> = SIToFP(bswap(<2xShort>))' it generates absolutely terrible code as seen in this hastebin result. http://hastebin.com/gumabupala.pas

Quuxplusone commented 9 years ago

Hi Ryan,

Could you please upload a reproducer testcase? Was this code generated through
ARM NEON intrinsics, one of the vectorizers or something else?

I'm pasting your hastebin content here because that is not timesafe:

ldrh    w9, [x0, #16]           // Latency: 4
ldrh    w10, [x0, #18]          // Latency: 4
ins.s   v0[0], w9               // Latency: 8
ins.s   v0[1], w10              // Latency: 8
rev32.8b        v0, v0          // Latency: 3
ushr.2s v0, v0, #16             // Latency: 3
shl.2s  v0, v0, #16             // Latency: 3
sshr.2s v0, v0, #16             // Latency: 3
scvtf.2s        v0, v0          // Latency: 10

Cheers,

James

Quuxplusone commented 9 years ago

What we're doing here is:
  * Loading 2 x 16-bit values into a vector register
    * This means it needs to be promoted to <2 x i32> as <2 x i16> is not a valid NEON type.
    * We're doing this by scalar loads + inserts but we could be doing a FP load + NEON extend (- 2 ops).
  * Performing the reverse operation
  * Move upper 16 bits to lower 16 bits, copysign into upper 16 bits
    * We're obviously  emitting two useless ops - the ushr + shl pair isn't needed, just the sshr.

You'd probably get better looking code if you could use <4 x i16> instead of <2
x i16> by the way, as <4 x i16> is a 64-bit vector so has native NEON
operations.

Quuxplusone commented 9 years ago

I was using the C++ API to generate it.
I don't have an easy standalone testcase that I can give currently.

Quuxplusone commented 3 years ago

The problem with 2 extra shift instructions is fixed by:

D102333 [AArch64] Combine shift instructions in SelectionDAG
https://reviews.llvm.org/D102333

D102938 and D102939 were proposed to fix the problem with loads.