Open Quuxplusone opened 8 years ago
Bugzilla Link | PR30483 |
Status | NEW |
Importance | P normal |
Reported by | Carrot (carrot@google.com) |
Reported on | 2016-09-21 18:50:24 -0700 |
Last modified on | 2017-03-24 13:39:43 -0700 |
Version | trunk |
Hardware | PC Linux |
CC | ehsanamiri@gmail.com, hfinkel@anl.gov, kit.barton@gmail.com, llvm-bugs@lists.llvm.org, nemanja.i.ibm@gmail.com, spatel+llvm@rotateright.com |
Fixed by commit(s) | |
Attachments | |
Blocks | |
Blocked by | |
See also | PR32401 |
I have reduced the function foo even further, to the shortest code, in which I
can find inefficiency. Once I fix the first issue, I will add more code to it,
until I find the next issue to fix.
The smallest problematic code is this:
bool foo(const PB2& s_a, const PB2& s_b) {
return s_a.has_pb1() != s_b.has_pb1();
}
Here the problem is not related to cr instruction. Instead we are generating
the following two instructions:
c: fe e8 63 54 rlwinm r3,r3,29,3,31
10: e0 07 63 78 clrldi r3,r3,63
While gcc generates only one
c: e2 ef 63 78 rldicl r3,r3,61,63
https://reviews.llvm.org/D24924
for issue in comment 1
This sequence has a more serious problem:
bool foo(const PB2& s_a, const PB2& s_b) {
return s_a.has_pb1() < s_b.has_pb1();
}
LLVM code:
0: 00 00 84 88 lbz r4,0(r4)
4: 00 00 63 88 lbz r3,0(r3)
8: fe e8 84 54 rlwinm r4,r4,29,3,31
c: 38 07 63 54 rlwinm r3,r3,0,28,28
10: 00 00 83 2c cmpwi cr1,r3,0
14: 01 00 84 70 andi. r4,r4,1
18: 01 00 60 38 li r3,1
1c: c2 09 86 4e crnand 4*cr5+lt,4*cr1+eq,gt
20: 1e 1d 60 7c isel r3,0,r3,20
24: 20 00 80 4e blr
GCC code:
0: 00 00 23 81 lwz r9,0(r3)
4: 00 00 64 80 lwz r3,0(r4)
8: e2 ef 29 79 rldicl r9,r9,61,63
c: e2 ef 63 78 rldicl r3,r3,61,63
10: 50 48 63 7c subf r3,r3,r9
14: e0 0f 63 78 rldicl r3,r3,1,63
18: 20 00 80 4e blr
>
> Here the problem is not related to cr instruction. Instead we are generating
> the following two instructions:
>
> c: fe e8 63 54 rlwinm r3,r3,29,3,31
> 10: e0 07 63 78 clrldi r3,r3,63
>
> While gcc generates only one
>
> c: e2 ef 63 78 rldicl r3,r3,61,63
Thanks you, Ehsan, I also noticed this pattern in another case, but I failed to
simplify it.
(In reply to comment #4)
> >
> > Here the problem is not related to cr instruction. Instead we are generating
> > the following two instructions:
> >
> > c: fe e8 63 54 rlwinm r3,r3,29,3,31
> > 10: e0 07 63 78 clrldi r3,r3,63
> >
> > While gcc generates only one
> >
> > c: e2 ef 63 78 rldicl r3,r3,61,63
>
> Thanks you, Ehsan, I also noticed this pattern in another case, but I failed
> to simplify it.
You're welcome. This is a really good code testcase as it reveals multiple
problems.
For the problem in comment 3, I am thinking of the following solution. I post
it here to get some feedback
The IR currently generated for the testcase is the following:
define zeroext i1 @_Z3fooRK3PB2S1_(%class.PB2* nocapture readonly
dereferenceable(16) %s_a, %class.PB2* nocapture readonly dereferenceable(16)
%s_b) local_unnamed_addr #0 {
entry:
%arrayidx.i6 = bitcast %class.PB2* %s_a to i32*
%0 = load i32, i32* %arrayidx.i6, align 8, !tbaa !1
%and.i = and i32 %0, 8
%arrayidx.i37 = bitcast %class.PB2* %s_b to i32*
%1 = load i32, i32* %arrayidx.i37, align 8, !tbaa !1
%and.i4 = and i32 %1, 8
%cmp.i5 = icmp ne i32 %and.i4, 0
%cmptmp = icmp eq i32 %and.i, 0
%cmp = and i1 %cmptmp, %cmp.i5
ret i1 %cmp
}
I think it is reasonable to expect InstCombine to recognize that it can remove
all icmp instruction and replace them with the following:
%cmp = icmp ult i32 %and.i, %and.i4
This is not enough for us though. During lowering for this code we generate a
setcc for which we generate an isel. The next step is to look at the following
during lowering:
1) The user of setcc is not a branch. This value is returned to the caller. So
we need the result of comparison in GPR.
2) This is a setcc with a ult condition code, applied to two i32 values. So we
can instead subtract the values, and look at the sign bit (so subtract and
shift). This is what GCC does when they encounter a < operation during
translation of gimple to RTL. (I talked to the gcc developer who implemented
this).
Does this make sense?
Talked to David Majnemer on IRC about the InstCombine part. He agreed to the change. If the setcc part is also looks fine I will post patches to fix this pattern.
(In reply to comment #7)
> Talked to David Majnemer on IRC about the InstCombine part. He agreed to the
> change. If the setcc part is also looks fine I will post patches to fix this
> pattern.
I was just looking at FoldAndOfICmps(); surprised we don't have this fold
already.
Is there a separate bug report for this? If not, should I file one?
(In reply to comment #8)
> (In reply to comment #7)
> > Talked to David Majnemer on IRC about the InstCombine part. He agreed to the
> > change. If the setcc part is also looks fine I will post patches to fix this
> > pattern.
>
> I was just looking at FoldAndOfICmps(); surprised we don't have this fold
> already.
> Is there a separate bug report for this? If not, should I file one?
Thanks Sanjay for looking. If you plan to make the change, I can open a
separate bugzilla item for this. If not, let's wait to see if Hal/Kit agree
with the overall solution. If they don't we can open a bugzilla item for
instcombine, so it is not forgotten.
Hal also agreed to this. I will start working on patches for this.
(In reply to comment #9)
> (In reply to comment #8)
> > (In reply to comment #7)
> > > Talked to David Majnemer on IRC about the InstCombine part. He agreed to
the
> > > change. If the setcc part is also looks fine I will post patches to fix
this
> > > pattern.
> >
> > I was just looking at FoldAndOfICmps(); surprised we don't have this fold
> > already.
> > Is there a separate bug report for this? If not, should I file one?
>
> Thanks Sanjay for looking. If you plan to make the change, I can open a
> separate bugzilla item for this. If not, let's wait to see if Hal/Kit agree
> with the overall solution. If they don't we can open a bugzilla item for
> instcombine, so it is not forgotten.
I think we're going to need multiple fixes to get this right in instcombine, so
I appreciate any bug reports that you'd like to open.
Here are a few possibilities for the patterns that we are missing:
bool andcmps1(int x, int y) {
return ((x & 16) != 0) && ((y & 8) == 0); // logical-and
}
bool andcmps2(int x, int y) {
return ((x & 16) != 0) & ((y & 8) == 0); // bitwise-and
}
bool andcmps3(int x, int y) {
return ((x & 32768) != 0) && ((y & 8) == 0); // special case constant
}
$ ./clang -O1 bittests.c -S -o - -emit-llvm
; ModuleID = 'bittests.c'
source_filename = "bittests.c"
target datalayout = "e-m:o-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-apple-macosx10.11.0"
define zeroext i1 @andcmps1(i32 %x, i32 %y) local_unnamed_addr #0 {
entry:
%and = and i32 %x, 16
%and1 = and i32 %y, 8
%cmp2 = icmp eq i32 %and1, 0
%not.cmp = icmp ne i32 %and, 0
%0 = and i1 %not.cmp, %cmp2
ret i1 %0
}
define zeroext i1 @andcmps2(i32 %x, i32 %y) local_unnamed_addr #0 {
entry:
%and = lshr i32 %x, 4
%and1 = lshr i32 %y, 3
%and1.lobit = and i32 %and1, 1
%0 = xor i32 %and1.lobit, 1
%and4 = and i32 %0, %and
%tobool = icmp ne i32 %and4, 0
ret i1 %tobool
}
define zeroext i1 @andcmps3(i32 %x, i32 %y) local_unnamed_addr #0 {
entry:
%0 = trunc i32 %x to i16
%cmp = icmp slt i16 %0, 0
%and1 = and i32 %y, 8
%cmp2 = icmp eq i32 %and1, 0
%1 = and i1 %cmp, %cmp2
ret i1 %1
}
OK. I'll work on this to address our problem. For anything left, I will open a new bug report.
https://reviews.llvm.org/D25200
patch for InstCombine changes. I will shortly post another patch for changes in PPC dag combine.
The early patch for dag combine changes:
https://reviews.llvm.org/D25221
This patch can be extended, but I post it to get some feedback before extending
it.
Carrot,
the fact that we have a function that returns a bool value is a property of the reduced testcase, or is that true in the original code as well?
(In reply to comment #15)
> Carrot,
>
> the fact that we have a function that returns a bool value is a property of
> the reduced testcase, or is that true in the original code as well?
Yes, the original function returns bool value.
Currently looking at some approaches that would reduce CR-logical instruction use that will likely improve this test case.