Quuxplusone / LLVMBugzillaTest

0 stars 0 forks source link

Shrink wrapping not firing on seemingly basic cases? (maybe i misunderstand how it works) #32841

Open Quuxplusone opened 7 years ago

Quuxplusone commented 7 years ago
Bugzilla Link PR33868
Status NEW
Importance P enhancement
Reported by Chandler Carruth (chandlerc@gmail.com)
Reported on 2017-07-20 18:07:29 -0700
Last modified on 2018-03-14 12:25:55 -0700
Version unspecified
Hardware PC All
CC compnerd@compnerd.org, dblaikie@gmail.com, ditaliano@apple.com, francisvm@yahoo.com, junbuml@codeaurora.org, llvm-bugs@lists.llvm.org, matze@braunis.de, quentin.colombet@gmail.com, sfertile@ca.ibm.com
Fixed by commit(s)
Attachments
Blocks
Blocked by
See also
I was digging into sadly unrelated performance issues and noticed a surprising
lack of shrink wrapping for functions with obvious, trivial early-exits.

Here is one example I was able to quickly reproduce:

% cat shrink_wrap.cpp
struct S { long a, b; };

struct C {
  S doit() { return {1, 2}; }
};

S compute() {
  static C* c = new C();
  return c->doit();
}

% ./dev/bin/clang++ -std=c++11 -fno-rtti -fno-exceptions -c -S -o -
shrink_wrap.cpp -O3
        .text
        .file   "shrink_wrap.cpp"
        .globl  _Z7computev             # -- Begin function _Z7computev
        .p2align        4, 0x90
        .type   _Z7computev,@function
_Z7computev:                            # @_Z7computev
        .cfi_startproc
# BB#0:                                 # %entry
        pushq   %rax
.Lcfi0:
        .cfi_def_cfa_offset 16
        movb    _ZGVZ7computevE1c(%rip), %al
        testb   %al, %al
        jne     .LBB0_3
# BB#1:                                 # %init.check
        movl    $_ZGVZ7computevE1c, %edi
        callq   __cxa_guard_acquire
        testl   %eax, %eax
        je      .LBB0_3
# BB#2:                                 # %init
        movl    $1, %edi
        callq   _Znwm
        movq    %rax, _ZZ7computevE1c(%rip)
        movl    $_ZGVZ7computevE1c, %edi
        callq   __cxa_guard_release
.LBB0_3:                                # %init.end
        movl    $1, %eax
        movl    $2, %edx
        popq    %rcx
        retq
.Lfunc_end0:

Why do we need to pushq %rax and popq %rcx in the fast path?
Quuxplusone commented 7 years ago
As discussed on IRC, I'll explain why shrink-wrapping doesn't kick in:

Currently, the only thing we do is delay the execution of the prologue /
epilogue, with an unique saving point, and an unique restoring point.

In this case, both BB#0 (the entry block) and BB#1 jump to BB#3, the only
return block we have here.

Since we don't create any new blocks, BB#3 is the only one eligible for a
restore point (successor of BB#1 and BB#2, which call functions).
BB#3 has BB#0 as a predecessor, forcing us to save in the entry block.
Therefore, no shrink-wrapping.

What would be cool is to split the edges and create intermediary blocks to
allow more shrink-wrapping to be performed.
Quuxplusone commented 7 years ago
We could consider block splitting when it is free.
I mean putting the epilog/prologue on a fall through edge should be free and
adding such free basic block should be easy enough.

Francis, do you want to try it or do you want me to?
Quuxplusone commented 7 years ago
(In reply to Quentin Colombet from comment #2)
> Francis, do you want to try it or do you want me to?

I think it's probably better if you do it, you seem to have a pretty clear idea
of how to do it!
Quuxplusone commented 7 years ago
I looked at this on aarch64 where shrink-wrapping didn't happen as well. For me
it seems that a CSR is allocated for a virtual register which is used in the
entry block. Since the first use of CSR is in the entry block, the shrink-
wrapping pass might see the entry block as a safe save point, so there is no
shrink-wrapping candidate.

Basically, I'm not clear how block splitting can expose more shrink-wrapping
opportunities when a CSR is allocated in the entry block. Did you mean that
splitting blocks and split use of the virtual register before RA? Or block
split only before the first use of CSR?

It's not that hard to find another case where shrink-wrapping doesn't happen in
a simple code. For example, in below C code, I saw that shrink-wrapping didn't
happen due to the very early uses of CSRs (for argument i) in the entry block.
Do you think we could make shrink-wrapping happen by block splitting in the
code below ?

int callee(int i);

int foo(int *P, int i) {
   if (i>0)
     return P[i];

   i = callee(i);
   return P[i];
}
Quuxplusone commented 7 years ago
(In reply to Jun Bum Lim from comment #4)
> I looked at this on aarch64 where shrink-wrapping didn't happen as well. For
> me it seems that a CSR is allocated for a virtual register which is used in
> the entry block. Since the first use of CSR is in the entry block, the
> shrink-wrapping pass might see the entry block as a safe save point, so
> there is no shrink-wrapping candidate.

Right, I just looked at the aarch64 version and on trunk (r317548) with "-
std=c++11 -fno-rtti -fno-exceptions -O3 -target aarch64-unknown-unknown" x19
gets allocated in the entry block, so you're right, no shrink-wrapping
candidate and block splitting won't help either.

> Basically, I'm not clear how block splitting can expose more shrink-wrapping
> opportunities when a CSR is allocated in the entry block. Did you mean that
> splitting blocks and split use of the virtual register before RA? Or block
> split only before the first use of CSR?

In this case on AArch64, block splitting won't help.

On x86_64 however, there is no use of any callee-saved register nor the stack
in the entry block. The uses are in BB#1 and BB#2.

Ideally, we could place the saving point in BB#1, but the only successor of
BB#1 is BB#3, which is the common postdominator of both blocks using the stack
(BB#1 and BB#2).

This forces us to move the saving point in BB#1. If we can insert a block
before BB#3 where we can insert the epilogue, we can place the saving point in
BB#1.

> It's not that hard to find another case where shrink-wrapping doesn't happen
> in a simple code. For example, in below C code, I saw that shrink-wrapping
> didn't happen due to the very early uses of CSRs (for argument i) in the
> entry block. Do you think we could make shrink-wrapping happen by block
> splitting in the code below ?
>
> int callee(int i);
>
> int foo(int *P, int i) {
>    if (i>0)
>      return P[i];
>
>    i = callee(i);
>    return P[i];
> }

This is still the same problem, x19 gets allocated in the entry block (because
it's alive past the call). In this case, block splitting will not help either.
Quuxplusone commented 7 years ago
Thanks Francis for the explanation.
I'm trying to see if we can encourage more shrink-wrapping even in cases where
CSRs are currently allocated in the entry block.

One way I can think of is that we can discourage RegAlloc from allocating CSRs
in the entry block if the function has an early exit. Previously, I proposed
https://reviews.llvm.org/D34608 in this matter, but the way I detect the
profitable cases and the way I increase the CRSCost was somewhat unclear.

Now, I'm thinking to increase CSRCost for the first allocation from CSRs in the
entry block if we detect an early exit so that we encourage more shrink-
wrapping and avoid executing CSR spill/recover when the early exit is taken.

Actually, I sent out this in llvm.dev before, but didn't get too much
attention, so I'm ask one more time to see if doing such thing in RegAlloc is
make sense. I will be happy to hear any opinion about this.
Quuxplusone commented 6 years ago
> We could consider block splitting when it is free. I mean putting the
epilog/prologue on a fall through edge should be free and adding such free
basic block should be easy enough.

Due to this issue, I observed many shrink-wrapping opportunities are blocked.
If no one is actively working on this, I can start looking at this.

Quentin, did you mean to add a new pass for the block splitting before shrink
wrapping pass? Please let me know any detail about what you intended by your
comment above.
Quuxplusone commented 6 years ago
(In reply to Jun Bum Lim from comment #7)
> > We could consider block splitting when it is free. I mean putting the
epilog/prologue on a fall through edge should be free and adding such free
basic block should be easy enough.
>
>
> Due to this issue, I observed many shrink-wrapping opportunities are
> blocked. If no one is actively working on this, I can start looking at this.
>
> Quentin, did you mean to add a new pass for the block splitting before
> shrink wrapping pass? Please let me know any detail about what you intended
> by your comment above.

I was hoping that we could emulate the split in the dominator tree (through new
APIs) and only create the basic block if we choose to put some code on the edge.

Alternatively, a naive approach, would be to split all critical edges and then
merge things back after PEI, but that sounds wasteful.
Quuxplusone commented 6 years ago
>I was hoping that we could emulate the split in the dominator tree (through
>new APIs) and only create the basic block if we choose to put some code on
>the edge.
>
>Alternatively, a naive approach, would be to split all critical edges and
>then merge things back after PEI, but that sounds wasteful.

I'm not sure how splitting critical edges help in general. For example, in the
IR below, %Exit is the common postdominator of %BB0 and %BB1. I cannot see how
splitting critical edge help shrink wrapping choose the better Restore/Save
point. Don't we need to duplicate the return block which should be hooked into
%entry so that we can isolate the Restore point (%Exit) from %entry?

define void @test(i32 %a) {
entry:
  %cmp5 = icmp sgt i32 %a, 0
  br i1 %cmp5, label %BB0, label %Exit

BB0:
  %call = call i32 @fun()
  %c = icmp eq i32 %call, 0
  br i1 %c, label %BB1, label %Exit

BB1:
  %call2 = call i32 @fun()
  %c2 = icmp eq i32 %call2, 0
  br label %Exit

Exit:
  ret void
}
Quuxplusone commented 6 years ago
I think the idea here is to end up with something like this diff:

(In reply to Jun Bum Lim from comment #9)
> define void @test(i32 %a) {
> entry:
>   %cmp5 = icmp sgt i32 %a, 0
>   br i1 %cmp5, label %BB0, label %Exit
>
> BB0:
>   %call = call i32 @fun()
>   %c = icmp eq i32 %call, 0
-   br i1 %c, label %BB1, label %Exit
+   br i1 %c, label %BB1, label %NewExit
>
> BB1:
>   %call2 = call i32 @fun()
>   %c2 = icmp eq i32 %call2, 0
-   br label %Exit
+   br label %NewExit

+ NewExit:
+   ; Fallthrough
>
> Exit:
>   ret void
> }

This would allow us to save in BB0 and restore in NewExit, without inserting
any additional branches from NewExit to Exit.
Quuxplusone commented 6 years ago
Thanks Francis for the detail. Your hand modified IR make sense because only
%BB0 and %BB1 are hooked into %NewExit, while %entry is still jumping %Exit.
However, I'm not clear how we can make such decision. Based on what, we can
group %BB0 and %BB1 into %NewExit and %entry into %Exit?

Should we perform this transformation in shrink-wrap pass when we need to
update the save/restore point? Please correct anything I misunderstand.
Quuxplusone commented 6 years ago
(In reply to Jun Bum Lim from comment #4)
> I looked at this on aarch64 where shrink-wrapping didn't happen as well. For
> me it seems that a CSR is allocated for a virtual register which is used in
> the entry block. Since the first use of CSR is in the entry block, the
> shrink-wrapping pass might see the entry block as a safe save point, so
> there is no shrink-wrapping candidate.
>
> Basically, I'm not clear how block splitting can expose more shrink-wrapping
> opportunities when a CSR is allocated in the entry block. Did you mean that
> splitting blocks and split use of the virtual register before RA? Or block
> split only before the first use of CSR?
>
> It's not that hard to find another case where shrink-wrapping doesn't happen
> in a simple code. For example, in below C code, I saw that shrink-wrapping
> didn't happen due to the very early uses of CSRs (for argument i) in the
> entry block. Do you think we could make shrink-wrapping happen by block
> splitting in the code below ?
>
> int callee(int i);
>
> int foo(int *P, int i) {
>    if (i>0)
>      return P[i];
>
>    i = callee(i);
>    return P[i];
> }

(In reply to Jun Bum Lim from comment #9)
> >I was hoping that we could emulate the split in the dominator tree (through
>new APIs) and only create the basic block if we choose to put some code on
>the edge.
> >
> >Alternatively, a naive approach, would be to split all critical edges and
>then merge things back after PEI, but that sounds wasteful.
>
> I'm not sure how splitting critical edges help in general. For example, in
> the IR below, %Exit is the common postdominator of %BB0 and %BB1. I cannot
> see how splitting critical edge help shrink wrapping choose the better
> Restore/Save point. Don't we need to duplicate the return block which should
> be hooked into %entry so that we can isolate the Restore point (%Exit) from
> %entry?
>
> define void @test(i32 %a) {
> entry:
>   %cmp5 = icmp sgt i32 %a, 0
>   br i1 %cmp5, label %BB0, label %Exit
>
> BB0:
>   %call = call i32 @fun()
>   %c = icmp eq i32 %call, 0
>   br i1 %c, label %BB1, label %Exit
>
> BB1:
>   %call2 = call i32 @fun()
>   %c2 = icmp eq i32 %call2, 0
>   br label %Exit
>
> Exit:
>   ret void
> }

Given there is no critical edges, unless the jump instructions are dealing with
CSRs, then there boundaries should be valid store/restore points, right?
Quuxplusone commented 6 years ago
(In reply to Quentin Colombet from comment #12)
>
> Given there is no critical edges, unless the jump instructions are dealing
> with CSRs, then there boundaries should be valid store/restore points, right?

Scratch that, there is actually a critical edges in your case...
But indeed, in that case that wouldn't be enough.
Quuxplusone commented 6 years ago
(In reply to Jun Bum Lim from comment #11)
> Thanks Francis for the detail. Your hand modified IR make sense because only
> %BB0 and %BB1 are hooked into %NewExit, while %entry is still jumping %Exit.
> However, I'm not clear how we can make such decision. Based on what, we can
> group %BB0 and %BB1 into %NewExit and %entry into %Exit?
>
> Should we perform this transformation in shrink-wrap pass when we need to
> update the save/restore point? Please correct anything I misunderstand.

I agree with Jun, this is more complicated than just splitting edges (critical
or not), I misread the first motivating example. We actually need to come up
with a common post dominator for a region.

That being said, assuming we have data to prove that this is a big opportunity,
we could merge the common target of CSRs blocks in one common dominator.
Like in Jun's example, candidates are:
BB0 and BB1

BB0 jumps to Exit
BB1 jumps to Exit

=> BB0 and BB1 jumps to NewExit; NewExit jumps to Exit

We would probably have to do the same thing for the sources of the current CSRs
blocks.
Quuxplusone commented 6 years ago

=> BB0 and BB1 jumps to NewExit; NewExit jumps to Exit

Note: This is only useful because Exit has more predecessors than just BB0 and BB1.

Quuxplusone commented 6 years ago

In order to continue the discussion, I just posted a patch (https://reviews.llvm.org/D42600) in which I split the restore point to merge the common targets of CSRs blocks in one common post-dominator. Please take a look and let me know any comment or suggestion.

Quuxplusone commented 6 years ago

I just checked that my WIP patch (https://reviews.llvm.org/D42600) handle the very first test case. Please take a look and let me know any comment on it.