Quuxplusone / LLVMBugzillaTest

0 stars 0 forks source link

LLVM Generates abysmal code in simple situation #24819

Open Quuxplusone opened 9 years ago

Quuxplusone commented 9 years ago
Bugzilla Link PR24820
Status NEW
Importance P normal
Reported by Ryan Houdek (Sonicadvance1@gmail.com)
Reported on 2015-09-15 05:26:27 -0700
Last modified on 2021-05-24 00:05:22 -0700
Version trunk
Hardware PC Linux
CC andrew.savonichev@gmail.com, david.green@arm.com, james@jamesmolloy.co.uk, llvm-bugs@lists.llvm.org
Fixed by commit(s)
Attachments
Blocks
Blocked by
See also

In the instance of generating code that does a simple '<2xFloat> = SIToFP(bswap(<2xShort>))' it generates absolutely terrible code as seen in this hastebin result. http://hastebin.com/gumabupala.pas

Quuxplusone commented 9 years ago
Hi Ryan,

Could you please upload a reproducer testcase? Was this code generated through
ARM NEON intrinsics, one of the vectorizers or something else?

I'm pasting your hastebin content here because that is not timesafe:

ldrh    w9, [x0, #16]           // Latency: 4
ldrh    w10, [x0, #18]          // Latency: 4
ins.s   v0[0], w9               // Latency: 8
ins.s   v0[1], w10              // Latency: 8
rev32.8b        v0, v0          // Latency: 3
ushr.2s v0, v0, #16             // Latency: 3
shl.2s  v0, v0, #16             // Latency: 3
sshr.2s v0, v0, #16             // Latency: 3
scvtf.2s        v0, v0          // Latency: 10

Cheers,

James
Quuxplusone commented 9 years ago
What we're doing here is:
  * Loading 2 x 16-bit values into a vector register
    * This means it needs to be promoted to <2 x i32> as <2 x i16> is not a valid NEON type.
    * We're doing this by scalar loads + inserts but we could be doing a FP load + NEON extend (- 2 ops).
  * Performing the reverse operation
  * Move upper 16 bits to lower 16 bits, copysign into upper 16 bits
    * We're obviously  emitting two useless ops - the ushr + shl pair isn't needed, just the sshr.

You'd probably get better looking code if you could use <4 x i16> instead of <2
x i16> by the way, as <4 x i16> is a 64-bit vector so has native NEON
operations.
Quuxplusone commented 9 years ago
I was using the C++ API to generate it.
I don't have an easy standalone testcase that I can give currently.
Quuxplusone commented 3 years ago
The problem with 2 extra shift instructions is fixed by:

D102333 [AArch64] Combine shift instructions in SelectionDAG
https://reviews.llvm.org/D102333

D102938 and D102939 were proposed to fix the problem with loads.