Open Quuxplusone opened 9 years ago
Hi Ryan,
Could you please upload a reproducer testcase? Was this code generated through
ARM NEON intrinsics, one of the vectorizers or something else?
I'm pasting your hastebin content here because that is not timesafe:
ldrh w9, [x0, #16] // Latency: 4
ldrh w10, [x0, #18] // Latency: 4
ins.s v0[0], w9 // Latency: 8
ins.s v0[1], w10 // Latency: 8
rev32.8b v0, v0 // Latency: 3
ushr.2s v0, v0, #16 // Latency: 3
shl.2s v0, v0, #16 // Latency: 3
sshr.2s v0, v0, #16 // Latency: 3
scvtf.2s v0, v0 // Latency: 10
Cheers,
James
What we're doing here is:
* Loading 2 x 16-bit values into a vector register
* This means it needs to be promoted to <2 x i32> as <2 x i16> is not a valid NEON type.
* We're doing this by scalar loads + inserts but we could be doing a FP load + NEON extend (- 2 ops).
* Performing the reverse operation
* Move upper 16 bits to lower 16 bits, copysign into upper 16 bits
* We're obviously emitting two useless ops - the ushr + shl pair isn't needed, just the sshr.
You'd probably get better looking code if you could use <4 x i16> instead of <2
x i16> by the way, as <4 x i16> is a 64-bit vector so has native NEON
operations.
I was using the C++ API to generate it.
I don't have an easy standalone testcase that I can give currently.
The problem with 2 extra shift instructions is fixed by:
D102333 [AArch64] Combine shift instructions in SelectionDAG
https://reviews.llvm.org/D102333
D102938 and D102939 were proposed to fix the problem with loads.
In the instance of generating code that does a simple '<2xFloat> = SIToFP(bswap(<2xShort>))' it generates absolutely terrible code as seen in this hastebin result. http://hastebin.com/gumabupala.pas