Quuxplusone / LLVMBugzillaTest

0 stars 0 forks source link

Failing to hoist all loads out of a loop #29665

Open Quuxplusone opened 8 years ago

Quuxplusone commented 8 years ago
Bugzilla Link PR30692
Status REOPENED
Importance P normal
Reported by Charles Li (charles_li@playstation.sony.com)
Reported on 2016-10-13 19:13:34 -0700
Last modified on 2020-10-23 03:52:21 -0700
Version trunk
Hardware PC Linux
CC andrea.dibiagio@gmail.com, danielcdh@gmail.com, dberlin@dberlin.org, ditaliano@apple.com, llvm-bugs@lists.llvm.org, llvm-dev@redking.me.uk, paul_robinson@playstation.sony.com, rafael@espindo.la, sebpop.llvm@gmail.com, simon.f.whittaker@gmail.com, spatel+llvm@rotateright.com
Fixed by commit(s)
Attachments hoist-pregvn.ll (2403 bytes, application/octet-stream)
pre-non-local.diff (5948 bytes, text/plain)
Blocks
Blocked by
See also
Hi Everyone,

Found a optimization issue.

Here is the test case.
$ cat t.cpp
/*************************/
int aaa = 10;
int bbb = 20;
int ccc = 30;

void foo(const int *pData)
{
    while (1)
    {
        int x = *pData++;

        if (x > ccc)
            bbb += x;

        if (x > bbb)
            aaa += x;

        if (x > aaa)
            ccc += x;

        if (x < 0)
            break;
    }
}
/*************************/

Compile this test with optimization.
$ clang t.cpp -S -O2

Here is the assembly.
$ cat t.s

        .text
        .file   "t.cpp"
        .globl  _Z3fooPKi
        .p2align        4, 0x90
        .type   _Z3fooPKi,@function
_Z3fooPKi:                              # @_Z3fooPKi
        .cfi_startproc
# BB#0:                                 # %entry
        movl    ccc(%rip), %eax
        .p2align        4, 0x90
.LBB0_1:                                # %while.body
                                        # =>This Inner Loop Header: Depth=1
        movl    (%rdi), %ecx
        movl    bbb(%rip), %edx
        cmpl    %eax, %ecx
        jle     .LBB0_3
# BB#2:                                 # %if.then
                                        #   in Loop: Header=BB0_1 Depth=1
        addl    %ecx, %edx
        movl    %edx, bbb(%rip)
.LBB0_3:                                # %if.end
                                        #   in Loop: Header=BB0_1 Depth=1
        movl    aaa(%rip), %esi
        cmpl    %edx, %ecx
        jle     .LBB0_5
# BB#4:                                 # %if.then2
                                        #   in Loop: Header=BB0_1 Depth=1
        addl    %ecx, %esi
        movl    %esi, aaa(%rip)
.LBB0_5:                                # %if.end4
                                        #   in Loop: Header=BB0_1 Depth=1
        cmpl    %esi, %ecx
        jle     .LBB0_7
# BB#6:                                 # %if.then6
                                        #   in Loop: Header=BB0_1 Depth=1
        addl    %ecx, %eax
        movl    %eax, ccc(%rip)
.LBB0_7:                                # %if.end8
                                        #   in Loop: Header=BB0_1 Depth=1
        addq    $4, %rdi
        testl   %ecx, %ecx
        jns     .LBB0_1
# BB#8:                                 # %while.end
        retq
.Lfunc_end0:
        .size   _Z3fooPKi, .Lfunc_end0-_Z3fooPKi
        .cfi_endproc

        .type   aaa,@object             # @aaa
        .data
        .globl  aaa
        .p2align        2
aaa:
        .long   10                      # 0xa
        .size   aaa, 4

        .type   bbb,@object             # @bbb
        .globl  bbb
        .p2align        2
bbb:
        .long   20                      # 0x14
        .size   bbb, 4

        .type   ccc,@object             # @ccc
        .globl  ccc
        .p2align        2
ccc:
        .long   30                      # 0x1e
        .size   ccc, 4

Looking at the location of these 3 instructions:
  movl    ccc(%rip), %eax
  movl    bbb(%rip), %edx
  movl    aaa(%rip), %esi
we can see that,
only the load of “ccc” got hoisted into the %entry block,
while the load of “bbb” and “aaa” are still inside the %while.body.
Quuxplusone commented 8 years ago

We hoist ccc in GVN (when we perform PRE). I'm not sure why we don't hoist the other two variables as they don't seem entirely different and if it makes really sense performing this transformation as part of GVN. Daniel/Sebastian, any opinion about it?

Quuxplusone commented 8 years ago
Also, FWIW, gcc hoists all the three loads outside of the loop, i.e.

.Ltext0:
foo(int const*):
.LFB0:
.LVL0:
        movl    bbb(%rip), %esi
        movl    aaa(%rip), %ecx
        movl    ccc(%rip), %edx
.L5:
.LBB2:
        addq    $4, %rdi
.LVL1:
        movl    -4(%rdi), %eax
.LVL2:
        cmpl    %eax, %edx
        jge     .L2
        addl    %eax, %esi
        movl    %esi, bbb(%rip)
.L2:
        cmpl    %esi, %eax
        jle     .L3
        addl    %eax, %ecx
        movl    %ecx, aaa(%rip)
.L3:
        cmpl    %ecx, %eax
        jle     .L4
        addl    %eax, %edx
        movl    %edx, ccc(%rip)
.L4:
        testl   %eax, %eax
        jns     .L5
.LBE2:
        rep ret
.LFE0:
ccc:
        .long   30
bbb:
        .long   20
aaa:
        .long   10
Quuxplusone commented 8 years ago

(In reply to comment #1)

We hoist ccc in GVN (when we perform PRE). I'm not sure why we don't hoist the other two variables as they don't seem entirely different and if it makes really sense performing this transformation as part of GVN. Daniel/Sebastian, any opinion about it?

Can you paste attach the LLVM IR right before PRE?

Quuxplusone commented 8 years ago

Attached hoist-pregvn.ll (2403 bytes, application/octet-stream): pre-gvn ir

Quuxplusone commented 8 years ago

Yet another reason to move to NewGVN =) Well, thanks for taking care of it in the current GVN, at least we'll have a test case to make sure it won't regress when we do the switch

Quuxplusone commented 8 years ago
Actually, i was trying the wrong branch.
That issue was already fixed, AFAIK.

This issue is different, where it doesn't understand the nonFuncLocal type of
clobber, and thus thinks it's unavailable.
Instead, it means it can be recreated anywhere up the CFG, which it also
doesn't understand.

I've attached a fix that works for global values.
We lack the PRE infrastructure to make this more general (in general, we can
move anything where we can prove the operands or available, or we can recreate
them)

I don't have time to test this further :)

Note also that because GVN understands nothing about phi nodes, it generates a
ton of duplicate ones that it requires a large number of passes to clean up :(

NewGVN definitely fixes that.

In any case, patch attached, if someone wants to push it along.
Quuxplusone commented 8 years ago

Attached pre-non-local.diff (5948 bytes, text/plain): Patch to fix this issue

Quuxplusone commented 8 years ago
(note that this also causes pre-single-pred.ll to fail because we are now able
to completely move the load out of the loop.
yay)
Quuxplusone commented 8 years ago
Thanks!
I noticed about lack of phi handling (this actually was the original reason why
I started looking @NewGVN, because we have some cases internally which could
greatly benefit by the new phi-predication mechanism).

For now,I'll test this more, try to bootstrap etc..
It would be good to make this more general but I'm not entirely sure it's worth
our time in particular with a complete rewrite behind the door. So, assuming
the fix is safe, I'd rather land it as is and focus our effort on NewGVN. Makes
sense? Otherwise I'll keep the fix in our private tree for the time being.
Quuxplusone commented 8 years ago

We are basically going to want to rewrite PRE in terms of GVN + MemorySSA anyway, so i'm pretty sure we get nothing from trying to generalize this.

It's also non-trivial.

It's basically "reimplement a real GVNPRE in LLVM".

Doing the operand availability/etc calculations sparsely and quickly is .... not easy (it either requires memoizing and wasting a lot of memory, or giving up and using bitvector dataflow).

Quuxplusone commented 8 years ago
(In reply to comment #11)
> We are basically going to want to rewrite PRE in terms of GVN + MemorySSA
> anyway, so i'm pretty sure we get nothing from trying to generalize this.
>
> It's also non-trivial.
>
> It's basically "reimplement a real GVNPRE in LLVM".
>
> Doing the operand availability/etc calculations sparsely and quickly is ....
> not easy (it either requires memoizing and wasting a lot of memory, or
> giving up and using bitvector dataflow).

(so yeah, makes sense :P)
Quuxplusone commented 8 years ago
I did some additional testing and checked in r284311.
I'll leave this bug open for a bit to ensure no regressions are introduced.
Quuxplusone commented 4 years ago

This was reverted back at rL284796 and is still as bad as when it was first reported:

https://simd.godbolt.org/z/vz5YK6