Open Quuxplusone opened 8 years ago
Bugzilla Link | PR26445 |
Status | NEW |
Importance | P normal |
Reported by | Carrot (carrot@google.com) |
Reported on | 2016-02-02 16:53:11 -0800 |
Last modified on | 2016-04-08 09:07:55 -0700 |
Version | trunk |
Hardware | PC Linux |
CC | echristo@gmail.com, hfinkel@anl.gov, kit.barton@gmail.com, llvm-bugs@lists.llvm.org, spatel+llvm@rotateright.com |
Fixed by commit(s) | |
Attachments | |
Blocks | |
Blocked by | |
See also |
This should be a mid end problem, x86 back end generates similar inefficient
code.
.LBB0_2: # %for.body
# =>This Inner Loop Header: Depth=1
movss (%rdi), %xmm1 # xmm1 = mem[0],zero,zero,zero
ucomiss %xmm0, %xmm1
movq %rdi, %rcx
ja .LBB0_4
# BB#3: # %select.false
# in Loop: Header=BB0_2 Depth=1
movq %rax, %rcx
.LBB0_4: # %select.end
# in Loop: Header=BB0_2 Depth=1
movl (%rcx), %ecx
movl %ecx, -4(%rsp)
movd %ecx, %xmm0
addq $4, %rdi
decl %esi
jne .LBB0_2
The implementation of std::max is
template<typename _Tp>
inline const _Tp&
max(const _Tp& __a, const _Tp& __b)
{
if (__a < __b)
return __b;
return __a;
}
In LLVM IR it is
define linkonce_odr nonnull dereferenceable(4) float*
@_ZSt3maxIfERKT_S2_S2_(float* readonly dereferenceable(4) %__a, float* readonly
dereferenceable(4) %__b) #2 comdat {
entry:
%0 = load float, float* %__a, align 4, !tbaa !1
%1 = load float, float* %__b, align 4, !tbaa !1
%cmp = fcmp olt float %0, %1
%__b.__a = select i1 %cmp, float* %__b, float* %__a
ret float* %__b.__a
}
And the loop body of the test case is:
for.body: ; preds = %for.cond
%idxprom = sext i32 %j.0 to i64
%arrayidx1 = getelementptr inbounds float, float* %input, i64 %idxprom
%call = call dereferenceable(4) float* @_ZSt3maxIfERKT_S2_S2_(float* nonnull dereferenceable(4) %max_value, float* dereferenceable(4) %arrayidx1)
%5 = bitcast float* %call to i32*
%6 = load i32, i32* %5, align 4, !tbaa !1
%7 = bitcast float* %max_value to i32*
store i32 %6, i32* %7, align 4, !tbaa !1
%inc = add nsw i32 %j.0, 1
br label %for.cond
After inlining, the loop body is changed to:
for.body: ; preds = %for.cond
%idxprom = sext i32 %j.0 to i64
%arrayidx1 = getelementptr inbounds float, float* %input, i64 %idxprom
%5 = load float, float* %max_value, align 4, !tbaa !1
%6 = load float, float* %arrayidx1, align 4, !tbaa !1
%cmp.i = fcmp olt float %5, %6
%__b.__a.i = select i1 %cmp.i, float* %arrayidx1, float* %max_value
%7 = bitcast float* %__b.__a.i to i32*
%8 = load i32, i32* %7, align 4, !tbaa !1
%9 = bitcast float* %max_value to i32*
store i32 %8, i32* %9, align 4, !tbaa !1
%inc = add nsw i32 %j.0, 1
br label %for.cond
And later LLVM failed to find out that value %8 is actually one of %5 or %6.
Any suggestion on how fix it?
Looking at the code after inlining I see two separate problems. These two
problems, with some small changes, survive all the way to the binary code.
1- Unnecessary bitcasts. This causes load to GPR and move from GPR to VSX
registers.
2- Complex data flow due to use of mem address in select. This causes
additional load and store instructions.
It seems easier to start from (1). The bitcasts that we see in the code after
inlining are generated earlier than that (apparently after Combine redundant
instructions). The code does not really need those bitcasts and "ideally"
should look like this (after inlining)
for.body: ; preds = %for.cond
%idxprom = sext i32 %j.0 to i64
%arrayidx1 = getelementptr inbounds float, float* %input, i64 %idxprom
%5 = load float, float* %max_value, align 4, !tbaa !1
%6 = load float, float* %arrayidx1, align 4, !tbaa !1
%cmp.i = fcmp olt float %5, %6
%__b.__a.i = select i1 %cmp.i, float* %arrayidx1, float* %max_value
%7 = load float, float* %__b.__a.i, align 4, !tbaa !1
store float %7, float* %max_value, align 4, !tbaa !1
%inc = add nsw i32 %j.0, 1
br label %for.cond
So this may seem to suggest that we need to look at "Combine redundant
instructions" and fix the issue. But if I manually change the IR and remove the
bitcasts, and pass it through opt again, this time SROA generates some other
bitcasts. So fixing the issue in "Combine redundant instructions" is probably
not enough. At this point I am not sure which approach is the best to address
this issue:
1- Fixing each transformation to avoid generating bitcasts.
2- Having a cleanup pass to remove bitcasts that are not needed.
3- Leaving the bitcasts and handle them properly during code gen.
4- Some other approach?
SROA contains code for "select speculation" which is exactly what we need to fix problem (2) that I mentioned in the beginning of comment 3. We probably need to focus on cleaning up unnecessary bitcasts before and after SROA.
InstCombine, introduces bitcasts that are not needed and prevent SROA from
performing select speculation. If I disable instcombine the final code for the
loop body looks like this:
1c: 10 00 00 48 b 2c <_Z3fooPfi+0x2c>
20: 90 08 00 fc fmr f0,f1
24: 20 00 40 4e bdzlr
28: 04 00 00 48 b 2c <_Z3fooPfi+0x2c>
2c: 04 00 23 c4 lfsu f1,4(r3)
30: 00 08 00 fc fcmpu cr0,f0,f1
34: ec ff 80 41 blt 20 <_Z3fooPfi+0x20>
38: 90 00 20 fc fmr f1,f0
3c: e4 ff ff 4b b 20 <_Z3fooPfi+0x20>
There is still room for improvement, but we have removed all extra load, store
and direct moves.
I will focus on removing bitcasts introduced by InstCombine.
Bitcasts introduced by instCombine are due to canonicalization of load
http://lists.llvm.org/pipermail/llvm-dev/2015-January/080956.html
git commit id: b778cbc0c8f1
Hal suggested that while reducing the scope of canonicalization (so this code
pattern is not canonicalized) may help PPC, it probably hurts most other
platforms and so that is not recommended.
I also checked the fix in http://reviews.llvm.org/D14596#45a2ee0d for PR25342.
That is a different issue and the fix does change the code pattern that we get
in this case.
The right approach for this is still under discussion.
There was an important typo in the previous comment
Incorrect:
the fix does change the code pattern that we get in this case.
Correct:
the fix does **NOT** change the code pattern that we get in this case.
The approach to fix this is discussed in the mailing list http://lists.llvm.org/pipermail/llvm-dev/2016-March/096901.html. I will work on its implementation after my vacation.