Open Quuxplusone opened 4 years ago
This looks like a cost-modelling issue in the SLP vectoriser on AArch64. The
attached slp-repro.ll highlights the issue. On AArch64, the SLP vectoriser does
not kick in.
`bin/opt -slp-vectorizer -instcombine slp-repro.ll -S -mtriple=arm64-apple-iphoneos | bin/llc -o - -mtriple=arm64-apple-iphoneos` gives terrible code.
Now if we pretend to do SLP vectorization for X86, we get very nice code:
`bin/opt -slp-vectorizer -instcombine ~/Desktop/slp-repro.ll -S -mtriple=x86_64-apple-macos | bin/llc -o - -mtriple=arm64-apple-iphoneos` produces:
```
ldp q0, q1, [x0]
ldp q3, q2, [x1]
eor v1.16b, v2.16b, v1.16b
eor v0.16b, v3.16b, v0.16b
orr v0.16b, v0.16b, v1.16b
ext v1.16b, v0.16b, v0.16b, #8
orr v0.16b, v0.16b, v1.16b
ext v1.16b, v0.16b, v0.16b, #4
orr v0.16b, v0.16b, v1.16b
ext v1.16b, v0.16b, v0.16b, #2
orr v0.16b, v0.16b, v1.16b
dup v1.16b, v0.b[1]
orr v0.16b, v0.16b, v1.16b
umov w0, v0.b[0]
ret
```
Alternatively, we could avoid unrolling early and let the loop vectoriser handle vectorization (e.g. passing -fno-unroll-loops to clang gives reasonable code). The reason we vectorise with LEN = 64 is that we *do not* unroll before vectorization: the vectoriser kicks in, and the vectorized loop is unrolled afterwards. There's already an issue discussing excessive unrolling (https://bugs.llvm.org/show_bug.cgi?id=42987), and I think this issue is a good example of harmful unrolling.
Attached: slp-repro.ll (10858 bytes, text/x-matlab)
Yep, looks like a cost-modeling issue. For example:
SLP: Adding cost 349 for reduction that starts with %xor11.15 = xor i8 %31, %30 (It is a splitting reduction)
which is a very high cost. We are looking into this now.