Quuxplusone / LLVMBugzillaTest

0 stars 0 forks source link

improve hot path code through inter procedural alias analysis in cold path #28871

Open Quuxplusone opened 8 years ago

Quuxplusone commented 8 years ago
Bugzilla Link PR28876
Status NEW
Importance P normal
Reported by Carrot (carrot@google.com)
Reported on 2016-08-05 17:38:37 -0700
Last modified on 2016-08-06 22:13:31 -0700
Version trunk
Hardware PC Linux
CC ehsanamiri@gmail.com, hfinkel@anl.gov, kit.barton@gmail.com, llvm-bugs@lists.llvm.org, nemanja.i.ibm@gmail.com, riddleriver@gmail.com
Fixed by commit(s)
Attachments
Blocks
Blocked by
See also
The source code is

#include <vector>

void foo(int iters, int size) {
  std::vector<int> fv;
  for (int i = 0; i < iters; ++i) {
    fv.clear();
    for (int j = 0; j < size; ++j) {
      fv.push_back(j);
    }
  }
}

When compiled into powerpc instructions with options
  -m64 -O2 -mvsx -mcpu=power8 -fno-exceptions

LLVM generates following code for the loops:

.LBB0_2:                                # %for.cond.cleanup3.for.body_crit_edge
                                        #   in Loop: Header=BB0_3 Depth=1
        ld 4, 112(31)
.LBB0_3:                                # %for.body
                                        # =>This Loop Header: Depth=1
                                        #     Child Loop BB0_6 Depth 2
        std 4, 120(31)
        stw 26, 108(31)
        bc 4, 9, .LBB0_10
# BB#4:                                 # %for.body4.preheader
                                        #   in Loop: Header=BB0_3 Depth=1
        li 3, 0
        b .LBB0_6
        .p2align        4
.LBB0_5:                                #
%_ZNSt6vectorIiSaIiEE9push_backERKi.exit.for.body4_crit_edge
                                        #   in Loop: Header=BB0_6 Depth=2
        ld 4, 120(31)
.LBB0_6:                                # %for.body4
                                        #   Parent Loop BB0_3 Depth=1
                                        # =>  This Inner Loop Header: Depth=2
        ld 5, 128(31)
        cmpld    4, 5
        beq      0, .LBB0_8
# BB#7:                                 # %if.then.i
                                        #   in Loop: Header=BB0_6 Depth=2
        addi 5, 4, 4                    // update pointer of available space
        stw 3, 0(4)                     // push_back j
        std 5, 120(31)
        b .LBB0_9
        .p2align        4
.LBB0_8:                                # %if.else.i
                                        #   in Loop: Header=BB0_6 Depth=2
        mr 3, 28
        mr 5, 27
        bl _ZNSt6vectorIiSaIiEE13_M_insert_auxEN9__gnu_cxx17__normal_iteratorIPiS1_EERKi
        nop
.LBB0_9:                                #
%_ZNSt6vectorIiSaIiEE9push_backERKi.exit
                                        #   in Loop: Header=BB0_6 Depth=2
        lwz 3, 108(31)
        addi 3, 3, 1                    // j++
        stw 3, 108(31)
        cmpw     3, 30
        blt      0, .LBB0_5
.LBB0_10:                               # %for.cond.cleanup3
                                        #   in Loop: Header=BB0_3 Depth=1
        addi 25, 25, 1
        cmplw    25, 29
        bne      0, .LBB0_2

GCC generates following code for the hot path of the loops, there are more code
in the cold path not shown here:

.L2:
        ble 3,.L4
        mr 31,28
        li 23,0
        .p2align 4,,15
.L14:
        cmpld 7,31,10
        beq 7,.L5
        cmpdi 7,31,0
        beq 7,.L6
        stw 23,0(31)           // push_back j
.L6:
        addi 31,31,4           // update the pointer
.L7:
        addi 9,23,1            // j++
        cmpw 7,9,29
        extsw 23,9
        bne 7,.L14
.L4:
        addi 8,24,1
        cmpw 7,8,27
        extsw 24,8
        bne 7,.L2

Both compilers inlined function push_back, but gcc inlined all sub functions
called by push_back, so it can do more complete alias analysis, and figure out
variable j does not alias with fields of fv, so all variables are allocated to
registers.

On the other hand, llvm didn't inline the function _M_insert_aux, it is
reasonable since it is in cold path only. But it needs pointers of both j and
fv, so it may cause j and fv to be alias with each other. Then it impacts the
accesses of j and fv in the hot path, and generates many memory access
instructions in the hot path. It slows down this function by 5.9x on power8.

One possible improvement is through inter procedural alias analysis, we can
figure out that j is not aliased with fields of fv in function _M_insert_aux
even without inlining it, so we can allocate them in registers in the hot path,
only store and load them around the function call to satisfy the parameter
requirement of _M_insert_aux.
Quuxplusone commented 8 years ago

LLVM has the ability to tag with callsite parameters with NoAlias(The attribute for not aliasing any other address) but I don't know of any current analysis passes that tag anything other then struct return parameters with it.