Closed Quuxplusone closed 8 years ago
Attached bug.c
(1506 bytes, text/plain): C file that reproduces the problem
Apple bug report : 24107029
The bug doesn't reproduce for me on trunk (r257365), so I'm going to resolve it
as 'fixed'.
I tried to use llvm bisect to find where this changed:
http://lists.llvm.org/pipermail/llvm-dev/2015-October/091140.html
$ llvmlab bisect /bin/sh -c '%(path)s/bin/clang -O2 bug.c && ./a.out | grep "10"'
clang-r229097-t2015-02-13_02-38-20-b4745: first working build
clang-r229100-t2015-02-13_02-58-18-b4746: next failing build
That seems too far back, so I'm not sure if I invoked that correctly.
So, I looked a little closer. Sanjay's bisect was correct. clang-700 is pretty
old now; I bisected to:
r229099 [SimplifyCFG] Be more aggressive
Sure enough, this still reproduces on trunk with
-mllvm -phi-node-folding-threshold=1.
Long story short: the problematic pattern is:
(c ? -v : v)
which we lower to (because "c" is <4 x i1>, lowered as a vector mask):
(~c & v) | (c & -v)
roughly corresponding to this IR:
define <4 x i32> @t(<4 x i32> %v, <4 x i32> %c) {
%cl = shl <4 x i32> %c, <i32 31, i32 31, i32 31, i32 31>
%cs = ashr <4 x i32> %c, <i32 31, i32 31, i32 31, i32 31>
%tmp2 = trunc <4 x i32> %cs to <4 x i1>
; ^ not as artificial as it looks: equivalent to a legalized vsetcc
%mv = sub nsw <4 x i32> zeroinitializer, %v
%r = select <4 x i1> %tmp2, <4 x i32> %v, <4 x i32> %mv
ret <4 x i32> %r
}
The SSE2 codegen is pretty straightforward:
xorps %xmm1, %xmm1
... # xmm6 <- %v
... # xmm3 <- %c
psubd %xmm6, %xmm1 # 0 - v # 0 - 5 -> -5
movaps %xmm3, %xmm0 # c # 0 -> 0
pandn %xmm6, %xmm0 # ~c & v # ~0 & 5 -> 5
pand %xmm3, %xmm1 # c & -v # -5 & 0 -> 0
por %xmm0, %xmm1 # (~c & v) | (c & -v) # 0 | 5 -> 5
However, when we have SSSE3 (the default on OS X), we try to match it to PSIGND
and emit this instead:
psignd %xmm3, %xmm1 # (c < 0 ? -v : (c > 0 ? v : 0))
# c is a mask, so (c > 0) == 0
# (c ? -v : 0)
# (0 ? -5 : 0)
# -> 0
Which is not equivalent; one does:
(c ? -v : 0)
the other:
(c ? -v : v)
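To make the difference concrete, here is a minimal scalar sketch of a single 32-bit lane. The helper names (blend_lane, psign_lane) are made up for illustration and are not from bug.c or from LLVM; psign_lane follows the PSIGND semantics quoted in the comment above (negate when the sign operand is negative, zero when it is zero, pass through when it is positive).

```c
#include <stdint.h>

/* Hypothetical helper: one lane of the SSE2 blend (~c & v) | (c & -v).
 * c is expected to be a mask lane: all zeros or all ones. */
static int32_t blend_lane(int32_t v, int32_t c) {
    uint32_t uv = (uint32_t)v;
    uint32_t uc = (uint32_t)c;
    uint32_t neg = 0u - uv;               /* 0 - v, wrapping like psubd */
    return (int32_t)((~uc & uv) | (uc & neg));
}

/* Hypothetical helper: one lane of PSIGND with x as the value operand
 * and s as the sign operand. */
static int32_t psign_lane(int32_t x, int32_t s) {
    if (s < 0) return (int32_t)(0u - (uint32_t)x);  /* negate */
    if (s > 0) return x;                            /* pass through */
    return 0;                                       /* s == 0 zeroes the lane */
}
```

With v = 5 and a mask lane of 0, blend_lane gives 5 while psign_lane gives 0; with a mask lane of all ones (-1) both give -5. So the two only disagree on the lanes where the mask is clear.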
Now. This bug has existed since 2010. However, I think we didn't know about
it because operand canonicalization hid it.
The PSIGN combine matches:
(or (and m, x), (pandn m, (0 - x)))
(or (and x, m), (pandn m, (0 - x)))
(or (pandn m, (0 - x)), (and m, x))
(or (pandn m, (0 - x)), (and x, m))
but not the variants of:
(or (and m, (0 - x)), (pandn m, x))
Which is what gets generated for the function above (the most obvious IR that I
could write).
I think this is pretty easy to fix: instead of using c as a mask, put any
non-sign bit in there, to default to the 'v' case.
So, this should work:
por <1,1,1,1>, %xmm3 # c' = c | 1
psignd %xmm3, %xmm1 # (c' < 0 ? -v : (c' > 0 ? v : 0))
# c is a mask, so c' is either 1 or 0xff..f
# (c' == 0xff..f ? -v : (c' != 0 ? v : v))
# (c' == 0xff..f ? -v : v)
# (0 ? -5 : 5)
# -> 5
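The same idea as a scalar sketch of one lane, assuming c is a proper mask lane (0 or all ones). The function names are hypothetical and only model the instruction sequence above; this is not the code in the eventual patch.

```c
#include <stdint.h>

/* Hypothetical model of one PSIGND lane: x is the value, s the sign operand. */
static int32_t psign_model(int32_t x, int32_t s) {
    if (s < 0) return (int32_t)(0u - (uint32_t)x);
    if (s > 0) return x;
    return 0;
}

/* Hypothetical model of the proposed sequence: por <1,1,1,1> then psignd.
 * Setting bit 0 turns a clear mask lane (0) into 1 and leaves a set mask
 * lane (all ones, i.e. -1) unchanged, so the zero case can never trigger. */
static int32_t psign_or1_lane(int32_t v, int32_t c) {
    int32_t c1 = c | 1;          /* c' = c | 1 */
    return psign_model(v, c1);   /* (c' < 0 ? -v : v) */
}
```

For v = 5 this returns 5 when c is 0 and -5 when c is -1, which matches (c ? -v : v).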
Constant-pool (CP) loads are cheap, so this is probably still a win over the SSE2 codegen:
psrad $31, %xmm1
pxor %xmm2, %xmm2
psubd %xmm0, %xmm2
pand %xmm1, %xmm2
pandn %xmm0, %xmm1
por %xmm1, %xmm2
movdqa %xmm2, %xmm0
Note that I don't think the couple of PSIGN tests in trunk are correct either.
Consider test/CodeGen/X86/vec-sign.ll:
define <4 x i32> @signd(<4 x i32> %a, <4 x i32> %b) nounwind {
entry:
%b.lobit = ashr <4 x i32> %b, <i32 31, i32 31, i32 31, i32 31>
%sub = sub nsw <4 x i32> zeroinitializer, %a
%0 = xor <4 x i32> %b.lobit, <i32 -1, i32 -1, i32 -1, i32 -1>
%1 = and <4 x i32> %a, %0
%2 = and <4 x i32> %b.lobit, %sub
%cond = or <4 x i32> %1, %2
ret <4 x i32> %cond
}
If %b is zero, this reduces to:
%b.lobit = <4 x i32> zeroinitializer
%sub = sub nsw <4 x i32> zeroinitializer, %a
%0 = <4 x i32> <i32 -1, i32 -1, i32 -1, i32 -1>
%1 = <4 x i32> %a
%2 = <4 x i32> zeroinitializer
%cond = or <4 x i32> %1, %2
ret <4 x i32> %a
}
whereas we currently generate:
psignd %xmm1, %xmm0
retq
which returns 0, since %xmm1 is 0.
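A scalar sketch of one lane of that test, to check the claim. Here %b is an arbitrary value rather than a mask, so the %b == 0 case is reachable directly. The function names are hypothetical; signd_ir_lane mirrors the IR of @signd line by line, and signd_psign_lane models the single psignd that is currently emitted.

```c
#include <stdint.h>

/* Hypothetical model of one lane of the @signd IR. */
static int32_t signd_ir_lane(int32_t a, int32_t b) {
    uint32_t lobit = (b < 0) ? 0xFFFFFFFFu : 0u;  /* %b.lobit = ashr %b, 31 */
    uint32_t ua = (uint32_t)a;
    uint32_t sub = 0u - ua;                       /* %sub = 0 - %a */
    return (int32_t)((ua & ~lobit) | (lobit & sub));  /* %cond */
}

/* Hypothetical model of one lane of "psignd %xmm1, %xmm0" (a in xmm0, b in xmm1). */
static int32_t signd_psign_lane(int32_t a, int32_t b) {
    if (b < 0) return (int32_t)(0u - (uint32_t)a);
    if (b > 0) return a;
    return 0;
}
```

The two agree for any nonzero b (for example a = 7 gives 7 for b = 3 and -7 for b = -2), but for b = 0 the IR yields a while the psignd model yields 0.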
Attached file_26110.txt
(3833 bytes, text/plain): Reduced IR
I submitted a tentative patch in http://reviews.llvm.org/D17181.
Ahmed - thanks for digging further. Sorry about the premature bug closing!
I feel like we're missing something simple here. I'm going to try some experiments and report back in D17181.
Committed:
r261025 [X86] Remove the now-unused X86ISD::PSIGN. NFC.
r261024 [X86] Generalize logic blend of (x, -x) combine to match (-x, x).
r261023 [X86] Don't turn (c?-v:v) into (c?-v:0) by blindly using PSIGN.
r261022 [X86] Extract PSIGN/BLENDVP tests into vector-blend.ll. NFC.
r261021 [X86] Extract PSIGN/BLENDVP combine. NFC.
r261020 [X86] Extract ANDNP combine. NFC.