Quuxplusone / LLVMBugzillaTest

0 stars 0 forks source link

[ppc] redundant extsw and sldi instructions #25580

Open Quuxplusone opened 8 years ago

Quuxplusone commented 8 years ago
Bugzilla Link PR25581
Status REOPENED
Importance P normal
Reported by Carrot (carrot@google.com)
Reported on 2015-11-19 18:09:59 -0800
Last modified on 2017-05-26 17:07:29 -0700
Version trunk
Hardware PC Linux
CC echristo@gmail.com, ehsanamiri@gmail.com, hfinkel@anl.gov, llvm-bugs@lists.llvm.org
Fixed by commit(s)
Attachments
Blocks
Blocked by
See also
The following is a hot function from 471.omnetpp

void cMessageHeap::shiftup(int from)
{
    // restores heap structure (in a sub-heap)
    int i,j;
    cMessage *temp;

    i=from;
    while ((j=2*i) <= n)
    {
        if (j<n && (*h[j] > *h[j+1]))   //direction
            j++;
        if (*h[i] > *h[j])  //is change necessary?
        {
            temp=h[j];
            (h[j]=h[i])->heapindex=j;
            (h[i]=temp)->heapindex=i;
            i=j;
        }
        else
            break;
    }
}

When compiled with options -fno-strict-aliasing -O2 -m64 -mvsx -mcpu=power8
I got:

_ZN12cMessageHeap7shiftupEi:            # @_ZN12cMessageHeap7shiftupEi
.Lfunc_begin14:
# BB#0:                                 # %entry
        lwz 7, 64(3)
        slwi 8, 4, 1
        cmpw     8, 7
        bgtlr 0
# BB#1:                                 # %while.body.lr.ph
        crxor 20, 20, 20
        li 5, 144
        creqv 21, 21, 21
        .align  4
.LBB14_2:                               # %while.body
                                        # =>This Inner Loop Header: Depth=1
        mr 6, 4
        cmpw     8, 7
        bge 0, .LBB14_9
# BB#3:                                 # %land.lhs.true
                                        #   in Loop: Header=BB14_2 Depth=1
        ori 4, 8, 1                  // A
        extsw 9, 8                   // B
        ld 7, 56(3)
        cror 1, 20, 20
        extsw 10, 4                  // C
        sldi 9, 9, 3                 // D
        sldi 11, 10, 3               // E
        ldx 10, 7, 9
        ldx 9, 7, 11
        lxsdx 0, 10, 5
        lxsdx 1, 9, 5
        xscmpudp 1, 0, 1
        blt      1, .LBB14_8
        ...

In the source code note that j comes from (i*2), so the lowest bit is 0, and
then we have following conclusion:

1) j+1 equals to j|1, so instruction A actually computes j+1

2) extsw(j+1) equals to extsw(j) + 1

3) (j+1) << 3 equals to j<<3 + 8

So instructions ABCDE can be replaced with shorter code sequence

    extsw 9, 8
    sldi 9, 9, 3
    addi 11, 9, 8
Quuxplusone commented 8 years ago

Is this a backend or a mid-end issue? Can you please provide some corresponding IR?

Quuxplusone commented 8 years ago
(In reply to comment #1)
> Is this a backend or a mid-end issue? Can you please provide some
> corresponding IR?

FYI: I can reproduce, but I'll also note that the problem does not appear when
compiling with -O3.
Quuxplusone commented 8 years ago

What happens at O3 is that a transformation called SeparateConstOffsetFromGEP in llc, commons some part of address for h[j] and h[j+1]. The optimization will work when we have GEPs with constant distance from each other.

This is some sort of commoning. Can't we turn that on at O2?

We can look at this from another angle as well. The code that we are currently discussing is corresponding to accesses to h[j] and h[j+1] in the first if statement.

There is a subsequent if statement that checks h[i] and h[j]. The optimization above does not work here because i and j do not have a constant distance from each other. So even in O3 code we have two extsw and two sldi for the second if statement.

We can think of an algorithm that can understand both i and j are indices to an array and promote them to 64 bit before the loop. This will remove all extsw instns inside the loop. I suspect that we already do this for canonicalized loops, so this will be useful only for loops that cannot be canonicalized or code that is not inside a loop.

There might be benefit in following both paths.

Quuxplusone commented 8 years ago

SeparateConstOffsetFromGEP is controlled in PPCPassConfig::addIRPasses(). Currently it requires O3. We can turn it on, with no impact on other targets. We need to discuss about all three passes in PPCPassConfig::addIRPasses() whether we want them to be turned on at O2 or not. (I think LICM should be safe to move to O2, I have more doubts about other two).

Quuxplusone commented 8 years ago
(In reply to comment #4)
> SeparateConstOffsetFromGEP is controlled in PPCPassConfig::addIRPasses().
> Currently it requires O3. We can turn it on, with no impact on other
> targets. We need to discuss about all three passes in
> PPCPassConfig::addIRPasses() whether we want them to be turned on at O2 or
> not. (I think LICM should be safe to move to O2, I have more doubts about
> other two).

We have SeparateConstOffsetFromGEP, EarlyCSE (which is designed to be cheap),
and LICM. All of these can be used at -O2. I have no problem with that. I put
them at -O3 only because I was being conservative with
SeparateConstOffsetFromGEP (which was a new optimization at the time). Now that
we have a use case for them to be at -O2, please go ahead with enabling them at
-O2.

Regarding developing a new widening transformations for induction variables in
non-canonical loops, that certainly could be useful.
Quuxplusone commented 8 years ago

I ran the test suite with the three transformations above enable at O3 vs. O2. The difference in compile time is negligible. Will submit the patch to make the change.

Quuxplusone commented 8 years ago

patch http://reviews.llvm.org/D18562 committed

Quuxplusone commented 7 years ago
Definitely doesn't look fixed:

10072360:       a0 00 6c e9     ld 11, 160(12)
10072364:       a0 00 be eb     ld 29, 160(30)
10072368:       40 e8 2b 7c     cmpld    11, 29
1007236c:       5e 30 67 7d     isel 11, 7, 6, 1
10072370:       00 00 0b 28     cmplwi   11, 0
10072374:       9e 20 8a 7c     isel 4, 10, 4, 2
10072378:       b4 07 0a 7d     extsw 10, 8
1007237c:       b4 07 8b 7c     extsw 11, 4
10072380:       24 1f 4a 79     sldi 10, 10, 3
10072384:       24 1f 6b 79     sldi 11, 11, 3
10072388:       2a 50 89 7d     ldx 12, 9, 10
1007238c:       6a 58 69 7d     ldux 11, 9, 11
10072390:       98 2c 0c 7c     lxsdx 0, 12, 5
10072394:       98 2c 2b 7c     lxsdx 1, 11, 5
10072398:       18 09 00 f0     xscmpudp 0, 0, 1
1007239c:       58 00 80 41     bt       0, .+88

just different. Investigating.
Quuxplusone commented 7 years ago

To be clear this is mostly that the code isn't great since it still has the two extsw/sldi with the addition now of some isel, rather than the GEP offset pass not working.

Quuxplusone commented 7 years ago
(In reply to Eric Christopher from comment #9)
> To be clear this is mostly that the code isn't great since it still has the
> two extsw/sldi with the addition now of some isel, rather than the GEP
> offset pass not working.

If you read bug description and comment 3, GEP offset removes other pairs of
extsw/sldi. Bug description is about a very specific problem and did not extend
to any potential improvement in the code.