[ppc] inefficient code generated for std::max(float, float)

Quuxplusone commented 8 years ago


Bugzilla Link	PR26445
Status	NEW
Importance	P normal
Reported by	Carrot (carrot@google.com)
Reported on	2016-02-02 16:53:11 -0800
Last modified on	2016-04-08 09:07:55 -0700
Version	trunk
Hardware	PC Linux
CC	echristo@gmail.com, hfinkel@anl.gov, kit.barton@gmail.com, llvm-bugs@lists.llvm.org, spatel+llvm@rotateright.com
Fixed by commit(s)
Attachments
Blocks
Blocked by
See also

Compile following code with options: $ ~/llvm/obj2/bin/clang++  --
target=powerpc64le-grtev4-linux-gnu -m64 -mvsx -mcpu=power8 -O2 -c -o t9.o
t9.cc -fno-unroll-loops

#include <algorithm>

float foo(float* input, int s) {
  float max_value = input[0];
  for (int j = 1; j <= s; ++j)
      max_value = std::max(max_value, input[j]);

  return max_value;
}

I got following code for the loop body

.LBB0_2:                                # %for.body
                                        # =>This Inner Loop Header: Depth=1
        lfsu 0, 4(3)
        fcmpu 0, 1, 0
        isel 5, 3, 4, 0
        lwz 5, 0(5)
        mtvsrd 34, 5
        stw 5, -12(1)
        xxsldwi 13, 34, 34, 1
        xscvspdpn 1, 13
        bdnz .LBB0_2

There are several problems in this code snippet

1. instead of compare and choose maximum float value, the generated code uses
isel to choose the address of larger value, then load it into integer register,
then move it to fp register, then expand it to double type.

2. no need to store the max value to memory.

It causes one of our internal applications more than 2x slower.

Quuxplusone commented 8 years ago

This should be a mid end problem, x86 back end generates similar inefficient
code.

.LBB0_2:                                # %for.body
                                        # =>This Inner Loop Header: Depth=1
        movss   (%rdi), %xmm1           # xmm1 = mem[0],zero,zero,zero
        ucomiss %xmm0, %xmm1
        movq    %rdi, %rcx
        ja      .LBB0_4
# BB#3:                                 # %select.false
                                        #   in Loop: Header=BB0_2 Depth=1
        movq    %rax, %rcx
.LBB0_4:                                # %select.end
                                        #   in Loop: Header=BB0_2 Depth=1
        movl    (%rcx), %ecx
        movl    %ecx, -4(%rsp)
        movd    %ecx, %xmm0
        addq    $4, %rdi
        decl    %esi
        jne     .LBB0_2

Quuxplusone commented 8 years ago

The implementation of std::max is

  template<typename _Tp>
    inline const _Tp&
    max(const _Tp& __a, const _Tp& __b)
    {
      if (__a < __b)
        return __b;
      return __a;
    }

In LLVM IR it is

define linkonce_odr nonnull dereferenceable(4) float*
@_ZSt3maxIfERKT_S2_S2_(float* readonly dereferenceable(4) %__a, float* readonly
dereferenceable(4) %__b) #2 comdat {
entry:
  %0 = load float, float* %__a, align 4, !tbaa !1
  %1 = load float, float* %__b, align 4, !tbaa !1
  %cmp = fcmp olt float %0, %1
  %__b.__a = select i1 %cmp, float* %__b, float* %__a
  ret float* %__b.__a
}

And the loop body of the test case is:

for.body:                                         ; preds = %for.cond
  %idxprom = sext i32 %j.0 to i64
  %arrayidx1 = getelementptr inbounds float, float* %input, i64 %idxprom
  %call = call dereferenceable(4) float* @_ZSt3maxIfERKT_S2_S2_(float* nonnull dereferenceable(4) %max_value, float* dereferenceable(4) %arrayidx1)
  %5 = bitcast float* %call to i32*
  %6 = load i32, i32* %5, align 4, !tbaa !1
  %7 = bitcast float* %max_value to i32*
  store i32 %6, i32* %7, align 4, !tbaa !1
  %inc = add nsw i32 %j.0, 1
  br label %for.cond

After inlining, the loop body is changed to:

for.body:                                         ; preds = %for.cond
  %idxprom = sext i32 %j.0 to i64
  %arrayidx1 = getelementptr inbounds float, float* %input, i64 %idxprom
  %5 = load float, float* %max_value, align 4, !tbaa !1
  %6 = load float, float* %arrayidx1, align 4, !tbaa !1
  %cmp.i = fcmp olt float %5, %6
  %__b.__a.i = select i1 %cmp.i, float* %arrayidx1, float* %max_value
  %7 = bitcast float* %__b.__a.i to i32*
  %8 = load i32, i32* %7, align 4, !tbaa !1
  %9 = bitcast float* %max_value to i32*
  store i32 %8, i32* %9, align 4, !tbaa !1
  %inc = add nsw i32 %j.0, 1
  br label %for.cond

And later LLVM failed to find out that value %8 is actually one of %5 or %6.

Any suggestion on how fix it?

Quuxplusone commented 8 years ago

Looking at the  code after inlining I see two separate problems. These two
problems, with some small changes, survive all the way to the binary code.

1- Unnecessary bitcasts. This causes load to GPR and move from GPR to VSX
registers.
2- Complex data flow due to use of mem address in select. This causes
additional load and store instructions.

It seems easier to start from (1).  The bitcasts that we see in the code after
inlining are generated earlier than that (apparently after Combine redundant
instructions).  The code does not really need those bitcasts and "ideally"
should look like this (after inlining)

for.body:                                         ; preds = %for.cond
  %idxprom = sext i32 %j.0 to i64
  %arrayidx1 = getelementptr inbounds float, float* %input, i64 %idxprom
  %5 = load float, float* %max_value, align 4, !tbaa !1
  %6 = load float, float* %arrayidx1, align 4, !tbaa !1
  %cmp.i = fcmp olt float %5, %6
  %__b.__a.i = select i1 %cmp.i, float* %arrayidx1, float* %max_value
  %7 = load float, float* %__b.__a.i, align 4, !tbaa !1
  store float %7, float* %max_value, align 4, !tbaa !1
  %inc = add nsw i32 %j.0, 1
  br label %for.cond

So this may seem to suggest that we need to look at "Combine redundant
instructions" and fix the issue. But if I manually change the IR and remove the
bitcasts, and pass it through opt again, this time SROA generates some other
bitcasts. So fixing the issue in "Combine redundant instructions" is probably
not enough. At this point I am not sure which approach is the best to address
this issue:

1- Fixing each transformation to avoid generating bitcasts.
2- Having a cleanup pass to remove bitcasts that are not needed.
3- Leaving the bitcasts and handle them properly during code gen.
4- Some other approach?

Quuxplusone commented 8 years ago

SROA contains code for "select speculation" which is exactly what we need to fix problem (2) that I mentioned in the beginning of comment 3. We probably need to focus on cleaning up unnecessary bitcasts before and after SROA.

Quuxplusone commented 8 years ago

InstCombine, introduces bitcasts that are not needed and prevent SROA from
performing select speculation. If I disable instcombine the final code for the
loop body looks like this:

  1c:   10 00 00 48     b       2c <_Z3fooPfi+0x2c>
  20:   90 08 00 fc     fmr     f0,f1
  24:   20 00 40 4e     bdzlr
  28:   04 00 00 48     b       2c <_Z3fooPfi+0x2c>
  2c:   04 00 23 c4     lfsu    f1,4(r3)
  30:   00 08 00 fc     fcmpu   cr0,f0,f1
  34:   ec ff 80 41     blt     20 <_Z3fooPfi+0x20>
  38:   90 00 20 fc     fmr     f1,f0
  3c:   e4 ff ff 4b     b       20 <_Z3fooPfi+0x20>

There is still room for improvement, but we have removed all extra load, store
and direct moves.

I will focus on removing bitcasts introduced by InstCombine.

Quuxplusone commented 8 years ago

Bitcasts introduced by instCombine are due to canonicalization of load

http://lists.llvm.org/pipermail/llvm-dev/2015-January/080956.html
git commit id: b778cbc0c8f1

Hal suggested that while reducing the scope of canonicalization (so this code
pattern is not canonicalized) may help PPC, it probably hurts most other
platforms and so that is not recommended.

I also checked the fix in http://reviews.llvm.org/D14596#45a2ee0d for PR25342.
That is a different issue and the fix does change the code pattern that we get
in this case.

The right approach for this is still under discussion.

Quuxplusone commented 8 years ago

There was an important typo in the previous comment

Incorrect:
the fix does change the code pattern that we get in this case.

Correct:
the fix does **NOT** change the code pattern that we get in this case.

Quuxplusone commented 8 years ago

The approach to fix this is discussed in the mailing list http://lists.llvm.org/pipermail/llvm-dev/2016-March/096901.html. I will work on its implementation after my vacation.

Quuxplusone / LLVMBugzillaTest

[ppc] inefficient code generated for std::max(float, float) #26444