Quuxplusone / LLVMBugzillaTest

0 stars 0 forks source link

[ppc] too many cr manipulation instructions are used for comparisons #29456

Open Quuxplusone opened 8 years ago

Quuxplusone commented 8 years ago
Bugzilla Link PR30483
Status NEW
Importance P normal
Reported by Carrot (carrot@google.com)
Reported on 2016-09-21 18:50:24 -0700
Last modified on 2017-03-24 13:39:43 -0700
Version trunk
Hardware PC Linux
CC ehsanamiri@gmail.com, hfinkel@anl.gov, kit.barton@gmail.com, llvm-bugs@lists.llvm.org, nemanja.i.ibm@gmail.com, spatel+llvm@rotateright.com
Fixed by commit(s)
Attachments
Blocks
Blocked by
See also PR32401
The source code is simplified from a protocol buffer test case,

class PB1{
public:
  bool has_f1() const;
  int f1() const;

  bool has_f2() const;
  unsigned long long f2() const;

  bool has_f3() const;
  unsigned long long f3() const;

  unsigned int _has_bits_[1];
  unsigned long long f2_;
  unsigned long long f3_;
  int f1_;
};

class PB2{
 public:
  bool has_pb1() const;
  const PB1& pb1() const;

  unsigned int _has_bits_[1];
  PB1* pb1_;
};

inline bool PB1::has_f1() const {
  return (_has_bits_[0] & 0x00000004u) != 0;
}
inline int PB1::f1() const {
  return f1_;
}
inline bool PB1::has_f2() const {
  return (_has_bits_[0] & 0x00000008u) != 0;
}
inline unsigned long long PB1::f2() const {
  return f2_;
}
inline bool PB1::has_f3() const {
  return (_has_bits_[0] & 0x00000010u) != 0;
}
inline unsigned long long PB1::f3() const {
  return f3_;
}
inline bool PB2::has_pb1() const {
  return (_has_bits_[0] & 0x00000008u) != 0;
}
inline const PB1& PB2::pb1() const {
  return *pb1_;
}

bool foo(const PB2& s_a, const PB2& s_b) {
  if (s_a.has_pb1() != s_b.has_pb1()) {
    return s_a.has_pb1() < s_b.has_pb1();
  }
  if (s_a.has_pb1()) {
    const PB1& v_a = s_a.pb1();
    const PB1& v_b = s_b.pb1();
    if (v_a.has_f2() != v_b.has_f2()) { return v_a.has_f2() < v_b.has_f2(); }
    if (v_a.has_f2() && (v_a.f2() != v_b.f2())) { return v_a.f2() < v_b.f2(); }
    if (v_a.has_f3() != v_b.has_f3()) { return v_a.has_f3() < v_b.has_f3(); }
    if (v_a.has_f3() && (v_a.f3() != v_b.f3())) { return v_a.f3() < v_b.f3(); }
    if (v_a.has_f1() != v_b.has_f1()) { return v_a.has_f1() < v_b.has_f1(); }
    if (v_a.has_f1() && (v_a.f1() != v_b.f1())) { return v_a.f1() < v_b.f1(); }
  }
  return false;
}

The function foo contains many comparisons, when compiled with options
-m64 -O2 -mvsx -mcpu=power8

LLVM generates code with many slow cr manipulation instructions, following is
first part of them:

_Z3fooRK3PB2S1_:                        # @_Z3fooRK3PB2S1_
.Lfunc_begin0:
# BB#0:                                 # %entry
    lbz 5, 0(3)
    lbz 6, 0(4)
    srwi 7, 5, 3
    xor 5, 5, 6
    srwi 6, 6, 3
    andi. 7, 7, 1
    srwi 5, 5, 3
    crmove   20, 1
    andi. 6, 6, 1
    crmove   21, 1
    andi. 5, 5, 1
    bc 4, 1, .LBB0_2
# BB#1:                                 # %if.then
    crandc 22, 21, 20
    b .LBB0_14
.LBB0_2:                                # %if.end
    crxor 22, 22, 22
    bc 4, 20, .LBB0_14
# BB#3:                                 # %if.then9
    ld 3, 8(3)
    ld 4, 8(4)
    lwz 6, 0(3)
    lwz 5, 0(4)
    rlwinm 8, 6, 29, 31, 31
    rlwinm 7, 6, 0, 28, 28
    andi. 8, 8, 1
    xor 7, 7, 5
    srwi 8, 5, 3
    srwi 7, 7, 3
    crmove   20, 1
    andi. 8, 8, 1
    crmove   21, 1
    andi. 7, 7, 1
    bc 4, 1, .LBB0_5
# BB#4:                                 # %if.then17
    crandc 22, 21, 20
    b .LBB0_14
        ...

It causes 40% slower than gcc.
It should be handled by PPCBoolRetToInt.cpp.
Quuxplusone commented 8 years ago
I have reduced the function foo even further, to the shortest code, in which I
can find inefficiency. Once I fix the first issue, I will add more code to it,
until I find the next issue to fix.

The smallest problematic code is this:

bool foo(const PB2& s_a, const PB2& s_b) {

  return s_a.has_pb1() != s_b.has_pb1();

}

Here the problem is not related to cr instruction. Instead we are generating
the following two instructions:

   c:   fe e8 63 54     rlwinm  r3,r3,29,3,31
  10:   e0 07 63 78     clrldi  r3,r3,63

While gcc generates only one

   c:   e2 ef 63 78     rldicl  r3,r3,61,63
Quuxplusone commented 8 years ago

https://reviews.llvm.org/D24924

for issue in comment 1

Quuxplusone commented 8 years ago
This sequence has a more serious problem:

bool foo(const PB2& s_a, const PB2& s_b) {
  return s_a.has_pb1() < s_b.has_pb1();
}

LLVM code:

   0:   00 00 84 88     lbz     r4,0(r4)
   4:   00 00 63 88     lbz     r3,0(r3)
   8:   fe e8 84 54     rlwinm  r4,r4,29,3,31
   c:   38 07 63 54     rlwinm  r3,r3,0,28,28
  10:   00 00 83 2c     cmpwi   cr1,r3,0
  14:   01 00 84 70     andi.   r4,r4,1
  18:   01 00 60 38     li      r3,1
  1c:   c2 09 86 4e     crnand  4*cr5+lt,4*cr1+eq,gt
  20:   1e 1d 60 7c     isel    r3,0,r3,20
  24:   20 00 80 4e     blr

GCC code:

   0:   00 00 23 81     lwz     r9,0(r3)
   4:   00 00 64 80     lwz     r3,0(r4)
   8:   e2 ef 29 79     rldicl  r9,r9,61,63
   c:   e2 ef 63 78     rldicl  r3,r3,61,63
  10:   50 48 63 7c     subf    r3,r3,r9
  14:   e0 0f 63 78     rldicl  r3,r3,1,63
  18:   20 00 80 4e     blr
Quuxplusone commented 8 years ago
>
> Here the problem is not related to cr instruction. Instead we are generating
> the following two instructions:
>
>    c:   fe e8 63 54     rlwinm  r3,r3,29,3,31
>   10:   e0 07 63 78     clrldi  r3,r3,63
>
> While gcc generates only one
>
>    c:   e2 ef 63 78     rldicl  r3,r3,61,63

Thanks you, Ehsan, I also noticed this pattern in another case, but I failed to
simplify it.
Quuxplusone commented 8 years ago
(In reply to comment #4)
> >
> > Here the problem is not related to cr instruction. Instead we are generating
> > the following two instructions:
> >
> >    c:   fe e8 63 54     rlwinm  r3,r3,29,3,31
> >   10:   e0 07 63 78     clrldi  r3,r3,63
> >
> > While gcc generates only one
> >
> >    c:   e2 ef 63 78     rldicl  r3,r3,61,63
>
> Thanks you, Ehsan, I also noticed this pattern in another case, but I failed
> to simplify it.

You're welcome. This is a really good code testcase as it reveals multiple
problems.
Quuxplusone commented 8 years ago
For the problem in comment 3, I am thinking of the following solution. I post
it here to get some feedback

The IR currently generated for the testcase is the following:

define zeroext i1 @_Z3fooRK3PB2S1_(%class.PB2* nocapture readonly
dereferenceable(16) %s_a, %class.PB2* nocapture readonly dereferenceable(16)
%s_b) local_unnamed_addr #0 {
entry:
  %arrayidx.i6 = bitcast %class.PB2* %s_a to i32*
  %0 = load i32, i32* %arrayidx.i6, align 8, !tbaa !1
  %and.i = and i32 %0, 8
  %arrayidx.i37 = bitcast %class.PB2* %s_b to i32*
  %1 = load i32, i32* %arrayidx.i37, align 8, !tbaa !1
  %and.i4 = and i32 %1, 8
  %cmp.i5 = icmp ne i32 %and.i4, 0
  %cmptmp = icmp eq i32 %and.i, 0
  %cmp = and i1 %cmptmp, %cmp.i5
  ret i1 %cmp
}

I think it is reasonable to expect InstCombine to recognize that it can remove
all icmp instruction and replace them with the following:

  %cmp = icmp ult i32 %and.i, %and.i4

This is not enough for us though. During lowering for this code we generate a
setcc for which we generate an isel. The next step is to look at the following
during lowering:

1) The user of setcc is not a branch. This value is returned to the caller. So
we need the result of comparison in GPR.
2) This is a setcc with a ult condition code, applied to two i32 values. So we
can instead subtract the values, and look at the sign bit (so subtract and
shift). This is what GCC does when they encounter a < operation during
translation of gimple to RTL. (I talked to the gcc developer who implemented
this).

Does this make sense?
Quuxplusone commented 8 years ago

Talked to David Majnemer on IRC about the InstCombine part. He agreed to the change. If the setcc part is also looks fine I will post patches to fix this pattern.

Quuxplusone commented 8 years ago
(In reply to comment #7)
> Talked to David Majnemer on IRC about the InstCombine part. He agreed to the
> change. If the setcc part is also looks fine I will post patches to fix this
> pattern.

I was just looking at FoldAndOfICmps(); surprised we don't have this fold
already.
Is there a separate bug report for this? If not, should I file one?
Quuxplusone commented 8 years ago
(In reply to comment #8)
> (In reply to comment #7)
> > Talked to David Majnemer on IRC about the InstCombine part. He agreed to the
> > change. If the setcc part is also looks fine I will post patches to fix this
> > pattern.
>
> I was just looking at FoldAndOfICmps(); surprised we don't have this fold
> already.
> Is there a separate bug report for this? If not, should I file one?

Thanks Sanjay for looking. If you plan to make the change, I can open a
separate bugzilla item for this. If not, let's wait to see if Hal/Kit agree
with the overall solution. If they don't we can open a bugzilla item for
instcombine, so it is not forgotten.
Quuxplusone commented 8 years ago

Hal also agreed to this. I will start working on patches for this.

Quuxplusone commented 8 years ago
(In reply to comment #9)
> (In reply to comment #8)
> > (In reply to comment #7)
> > > Talked to David Majnemer on IRC about the InstCombine part. He agreed to
the
> > > change. If the setcc part is also looks fine I will post patches to fix
this
> > > pattern.
> >
> > I was just looking at FoldAndOfICmps(); surprised we don't have this fold
> > already.
> > Is there a separate bug report for this? If not, should I file one?
>
> Thanks Sanjay for looking. If you plan to make the change, I can open a
> separate bugzilla item for this. If not, let's wait to see if Hal/Kit agree
> with the overall solution. If they don't we can open a bugzilla item for
> instcombine, so it is not forgotten.

I think we're going to need multiple fixes to get this right in instcombine, so
I appreciate any bug reports that you'd like to open.

Here are a few possibilities for the patterns that we are missing:

bool andcmps1(int x, int y) {
  return ((x & 16) != 0) && ((y & 8) == 0); // logical-and
}

bool andcmps2(int x, int y) {
  return ((x & 16) != 0) & ((y & 8) == 0); // bitwise-and
}

bool andcmps3(int x, int y) {
  return ((x & 32768) != 0) && ((y & 8) == 0); // special case constant
}

$ ./clang -O1 bittests.c -S -o - -emit-llvm
; ModuleID = 'bittests.c'
source_filename = "bittests.c"
target datalayout = "e-m:o-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-apple-macosx10.11.0"

define zeroext i1 @andcmps1(i32 %x, i32 %y) local_unnamed_addr #0 {
entry:
  %and = and i32 %x, 16
  %and1 = and i32 %y, 8
  %cmp2 = icmp eq i32 %and1, 0
  %not.cmp = icmp ne i32 %and, 0
  %0 = and i1 %not.cmp, %cmp2
  ret i1 %0
}

define zeroext i1 @andcmps2(i32 %x, i32 %y) local_unnamed_addr #0 {
entry:
  %and = lshr i32 %x, 4
  %and1 = lshr i32 %y, 3
  %and1.lobit = and i32 %and1, 1
  %0 = xor i32 %and1.lobit, 1
  %and4 = and i32 %0, %and
  %tobool = icmp ne i32 %and4, 0
  ret i1 %tobool
}

define zeroext i1 @andcmps3(i32 %x, i32 %y) local_unnamed_addr #0 {
entry:
  %0 = trunc i32 %x to i16
  %cmp = icmp slt i16 %0, 0
  %and1 = and i32 %y, 8
  %cmp2 = icmp eq i32 %and1, 0
  %1 = and i1 %cmp, %cmp2
  ret i1 %1
}
Quuxplusone commented 8 years ago

OK. I'll work on this to address our problem. For anything left, I will open a new bug report.

Quuxplusone commented 8 years ago

https://reviews.llvm.org/D25200

patch for InstCombine changes. I will shortly post another patch for changes in PPC dag combine.

Quuxplusone commented 8 years ago
The early patch for dag combine changes:
https://reviews.llvm.org/D25221

This patch can be extended, but I post it to get some feedback before extending
it.
Quuxplusone commented 8 years ago

Carrot,

the fact that we have a function that returns a bool value is a property of the reduced testcase, or is that true in the original code as well?

Quuxplusone commented 8 years ago
(In reply to comment #15)
> Carrot,
>
> the fact that we have a function that returns a bool value is a property of
> the reduced testcase, or is that true in the original code as well?

Yes, the original function returns bool value.
Quuxplusone commented 7 years ago

Currently looking at some approaches that would reduce CR-logical instruction use that will likely improve this test case.