clang c compiler produces wrong result for the attached c code with -O2 optimzation

Quuxplusone commented 8 years ago


Bugzilla Link	PR26110
Status	RESOLVED FIXED
Importance	P release blocker
Reported by	Arun Gopalaswami (garunsrinivasan@gmail.com)
Reported on	2016-01-11 08:08:50 -0800
Last modified on	2016-02-16 16:26:32 -0800
Version	trunk
Hardware	Macintosh MacOS X
CC	ahmed@bougacha.org, garunsrinivasan@gmail.com, llvm-bugs@lists.llvm.org, spatel+llvm@rotateright.com
Fixed by commit(s)
Attachments	`bug.c` (1506 bytes, text/plain) `file_26110.txt` (3833 bytes, text/plain)
Blocks
Blocked by
See also

Created attachment 15602
C file that reproduces the problem

A bug in the llvm auto vectoizer produces wrong results for the attached c
file. Turning off auto vectorizer produces the right results, but makes the
bigger test cases run much slower.

Compile the attached c file with xcode 7.0x or later with (clang-700.1.81)
(based on LLVM 3.7.0svn) with the following command:

bad result:
===============================
 % clang -O2 bug.c -Wall -o bad
 % ./bad
 dbg = 10 10 10 10 10 10 10 10

good result:
===============================
 % clang -O2 -fno-vectorize bug.c -Wall -o bad
 %./bad
 dbg = 15 15 15 15 15 15 15 15

gcc 4.7

 % gcc -O2 bug.c -Wall -o bad
 % ./bad
 dbg = 15 15 15 15 15 15 15 15

Quuxplusone commented 8 years ago

Attached bug.c (1506 bytes, text/plain): C file that reproduces the problem

Quuxplusone commented 8 years ago

Apple bug report : 24107029

Quuxplusone commented 8 years ago

The bug doesn't reproduce for me on trunk (r257365), so I'm going to resolve it
as 'fixed'.

I tried to use llvm bisect to find where this changed:
http://lists.llvm.org/pipermail/llvm-dev/2015-October/091140.html

$ llvmlab bisect /bin/sh -c '%(path)s/bin/clang -O2 bug.c && ./a.out | grep
"10"'

clang-r229097-t2015-02-13_02-38-20-b4745: first working build
clang-r229100-t2015-02-13_02-58-18-b4746: next failing build

That seems too far back, so I'm not sure if I invoked that correctly.

Quuxplusone commented 8 years ago

So; I looked a little closer. Sanjay's bisect was correct. clang-700 is pretty
old now; I bisected to:
  r229099 [SimplifyCFG] Be more aggressive

Sure enough, this still reproduces on trunk with -mllvm -phi-node-folding-
threshold=1.

Long story short: the problematic pattern is:
  (c ? -v : v)

which we lower to (because "c" is <4 x i1>, lowered as a vector mask):
  (~c & v) | (c & -v)

roughly corresponding to this IR:
  define <4 x i32> @t(<4 x i32> %v, <4 x i32> %c) {
    %cl = shl <4 x i32> %c, <i32 31, i32 31, i32 31, i32 31>
    %cs = ashr <4 x i32> %c, <i32 31, i32 31, i32 31, i32 31>
    %tmp2 = trunc <4 x i32> %cs to <4 x i1>
    ; ^ not as artificial as it looks: equivalent to a legalized vsetcc

    %mv = sub nsw <4 x i32> zeroinitializer, %v
    %r = select <4 x i1> %tmp2, <4 x i32> %v, <4 x i32> %mv
    ret <4 x i32> %r
  }

The SSE2 codegen is pretty straightforward:

    xorps  %xmm1, %xmm1
    ...                   # xmm6 <- %v
    ...                   # xmm3 <- %c
    psubd  %xmm6, %xmm1   # 0 - v                # 0 - 5 -> -5
    movaps %xmm3, %xmm0   # c                    # 0 -> 0
    pandn  %xmm6, %xmm0   # ~c & v               # ~0 & 5 -> 5
    pand   %xmm3, %xmm1   # c & -v               # -5 & 0 -> 0
    por    %xmm0, %xmm1   # (~c & v) | (c & -v)  # 0 | 5 -> 5

However when we have SSSE3 (the default on OS X), we try to match it to PSIGND,
instead doing:

    psignd    %xmm3, %xmm1    # (c < 0 ? -v : (c > 0 ? v : 0))
                              #   c is a mask, so (c > 0) == 0
                              # (c ? -v : 0)
                              # (0 ? -5 : 0)
                              #   -> 0

Which is not equivalent; one does:
  (c ? -v : 0)
the other:
  (c ? -v : v)

Now. This bug existed since 2010. However, I think we don't know about this
issue because of operand canonicalization.

The PSIGN combine matches:
  (or (and m, x), (pandn m, (0 - x)))
  (or (and x, m), (pandn m, (0 - x)))
  (or (pandn m, (0 - x)), (and m, x))
  (or (pandn m, (0 - x)), (and x, m))

but not the variants of:
  (or (and m, (0 - x)), (pandn m, x))

Which is what gets generated for the function above (the most obvious IR that I
could write).

I think this is pretty easy to fix: instead of using c as a mask, put any non-
sign bit in there, to default to the 'v' case.

So, this should work:
    por       <1,1,1,1>, %xmm3 # c' = c | 1
    psignd    %xmm3, %xmm1     # (c' < 0 ? -v : (c' > 0 ? v : 0))
                               #   c is a mask, so c' is either 1 or 0xff..f
                               # (c' == 0xff..f ? -v : (c' != 0 ? v : v))
                               # (c' == 0xff..f ? -v : v)
                               # (0 ? -5 : 5)
                               #   -> 5

CP loads are cheap, so this is probably still a win over the SSE2 codegen:

    psrad   $31, %xmm1
    pxor    %xmm2, %xmm2
    psubd   %xmm0, %xmm2
    pand    %xmm1, %xmm2
    pandn   %xmm0, %xmm1
    por %xmm1, %xmm2
    movdqa  %xmm2, %xmm0

Note that I don't think the couple of PSIGN tests in trunk are correct either.
Consider test/CodeGen/X86/vec-sign.ll:

define <4 x i32> @signd(<4 x i32> %a, <4 x i32> %b) nounwind {
entry:
  %b.lobit = ashr <4 x i32> %b, <i32 31, i32 31, i32 31, i32 31>
  %sub = sub nsw <4 x i32> zeroinitializer, %a
  %0 = xor <4 x i32> %b.lobit, <i32 -1, i32 -1, i32 -1, i32 -1>
  %1 = and <4 x i32> %a, %0
  %2 = and <4 x i32> %b.lobit, %sub
  %cond = or <4 x i32> %1, %2
  ret <4 x i32> %cond
}

if %b is zero:

  %b.lobit = <4 x i32> zeroinitializer
  %sub = sub nsw <4 x i32> zeroinitializer, %a
  %0 = <4 x i32> <i32 -1, i32 -1, i32 -1, i32 -1>
  %1 = <4 x i32> %a
  %2 = <4 x i32> zeroinitializer
  %cond = or <4 x i32> %1, %2
  ret <4 x i32> %a
}

whereas we currently generate:
  psignd %xmm1, %xmm0
  retq

which return 0, as %xmm1 is 0.

Quuxplusone commented 8 years ago

Attached file_26110.txt (3833 bytes, text/plain): Reduced IR

Quuxplusone commented 8 years ago

I submitted a tentative patch in http://reviews.llvm.org/D17181.

Quuxplusone commented 8 years ago

Ahmed - thanks for digging further. Sorry about the premature bug closing!

I feel like we're missing something simple here. I'm going to try some experiments and report back in D17181.

Quuxplusone commented 8 years ago

Committed:
r261025 [X86] Remove the now-unused X86ISD::PSIGN. NFC.
r261024 [X86] Generalize logic blend of (x, -x) combine to match (-x, x).
r261023 [X86] Don't turn (c?-v:v) into (c?-v:0) by blindly using PSIGN.
r261022 [X86] Extract PSIGN/BLENDVP tests into vector-blend.ll. NFC.
r261021 [X86] Extract PSIGN/BLENDVP combine. NFC.
r261020 [X86] Extract ANDNP combine. NFC.

Quuxplusone / LLVMBugzillaTest

clang c compiler produces wrong result for the attached c code with -O2 optimzation #26109