Shrink wrapping not firing on seemingly basic cases? (maybe i misunderstand how it works)

Quuxplusone commented 7 years ago


Bugzilla Link	PR33868
Status	NEW
Importance	P enhancement
Reported by	Chandler Carruth (chandlerc@gmail.com)
Reported on	2017-07-20 18:07:29 -0700
Last modified on	2018-03-14 12:25:55 -0700
Version	unspecified
Hardware	PC All
CC	compnerd@compnerd.org, dblaikie@gmail.com, ditaliano@apple.com, francisvm@yahoo.com, junbuml@codeaurora.org, llvm-bugs@lists.llvm.org, matze@braunis.de, quentin.colombet@gmail.com, sfertile@ca.ibm.com
Fixed by commit(s)
Attachments
Blocks
Blocked by
See also

I was digging into sadly unrelated performance issues and noticed a surprising
lack of shrink wrapping for functions with obvious, trivial early-exits.

Here is one example I was able to quickly reproduce:

% cat shrink_wrap.cpp
struct S { long a, b; };

struct C {
  S doit() { return {1, 2}; }
};

S compute() {
  static C* c = new C();
  return c->doit();
}

% ./dev/bin/clang++ -std=c++11 -fno-rtti -fno-exceptions -c -S -o -
shrink_wrap.cpp -O3
        .text
        .file   "shrink_wrap.cpp"
        .globl  _Z7computev             # -- Begin function _Z7computev
        .p2align        4, 0x90
        .type   _Z7computev,@function
_Z7computev:                            # @_Z7computev
        .cfi_startproc
# BB#0:                                 # %entry
        pushq   %rax
.Lcfi0:
        .cfi_def_cfa_offset 16
        movb    _ZGVZ7computevE1c(%rip), %al
        testb   %al, %al
        jne     .LBB0_3
# BB#1:                                 # %init.check
        movl    $_ZGVZ7computevE1c, %edi
        callq   __cxa_guard_acquire
        testl   %eax, %eax
        je      .LBB0_3
# BB#2:                                 # %init
        movl    $1, %edi
        callq   _Znwm
        movq    %rax, _ZZ7computevE1c(%rip)
        movl    $_ZGVZ7computevE1c, %edi
        callq   __cxa_guard_release
.LBB0_3:                                # %init.end
        movl    $1, %eax
        movl    $2, %edx
        popq    %rcx
        retq
.Lfunc_end0:

Why do we need to pushq %rax and popq %rcx in the fast path?

Quuxplusone commented 7 years ago

As discussed on IRC, I'll explain why shrink-wrapping doesn't kick in:

Currently, the only thing we do is delay the execution of the prologue /
epilogue, with an unique saving point, and an unique restoring point.

In this case, both BB#0 (the entry block) and BB#1 jump to BB#3, the only
return block we have here.

Since we don't create any new blocks, BB#3 is the only one eligible for a
restore point (successor of BB#1 and BB#2, which call functions).
BB#3 has BB#0 as a predecessor, forcing us to save in the entry block.
Therefore, no shrink-wrapping.

What would be cool is to split the edges and create intermediary blocks to
allow more shrink-wrapping to be performed.

Quuxplusone commented 7 years ago

We could consider block splitting when it is free.
I mean putting the epilog/prologue on a fall through edge should be free and
adding such free basic block should be easy enough.

Francis, do you want to try it or do you want me to?

Quuxplusone commented 7 years ago

(In reply to Quentin Colombet from comment #2)
> Francis, do you want to try it or do you want me to?

I think it's probably better if you do it, you seem to have a pretty clear idea
of how to do it!

Quuxplusone commented 7 years ago

I looked at this on aarch64 where shrink-wrapping didn't happen as well. For me
it seems that a CSR is allocated for a virtual register which is used in the
entry block. Since the first use of CSR is in the entry block, the shrink-
wrapping pass might see the entry block as a safe save point, so there is no
shrink-wrapping candidate.

Basically, I'm not clear how block splitting can expose more shrink-wrapping
opportunities when a CSR is allocated in the entry block. Did you mean that
splitting blocks and split use of the virtual register before RA? Or block
split only before the first use of CSR?

It's not that hard to find another case where shrink-wrapping doesn't happen in
a simple code. For example, in below C code, I saw that shrink-wrapping didn't
happen due to the very early uses of CSRs (for argument i) in the entry block.
Do you think we could make shrink-wrapping happen by block splitting in the
code below ?

int callee(int i);

int foo(int *P, int i) {
   if (i>0)
     return P[i];

   i = callee(i);
   return P[i];
}

Quuxplusone commented 7 years ago

(In reply to Jun Bum Lim from comment #4)
> I looked at this on aarch64 where shrink-wrapping didn't happen as well. For
> me it seems that a CSR is allocated for a virtual register which is used in
> the entry block. Since the first use of CSR is in the entry block, the
> shrink-wrapping pass might see the entry block as a safe save point, so
> there is no shrink-wrapping candidate.

Right, I just looked at the aarch64 version and on trunk (r317548) with "-
std=c++11 -fno-rtti -fno-exceptions -O3 -target aarch64-unknown-unknown" x19
gets allocated in the entry block, so you're right, no shrink-wrapping
candidate and block splitting won't help either.

> Basically, I'm not clear how block splitting can expose more shrink-wrapping
> opportunities when a CSR is allocated in the entry block. Did you mean that
> splitting blocks and split use of the virtual register before RA? Or block
> split only before the first use of CSR?

In this case on AArch64, block splitting won't help.

On x86_64 however, there is no use of any callee-saved register nor the stack
in the entry block. The uses are in BB#1 and BB#2.

Ideally, we could place the saving point in BB#1, but the only successor of
BB#1 is BB#3, which is the common postdominator of both blocks using the stack
(BB#1 and BB#2).

This forces us to move the saving point in BB#1. If we can insert a block
before BB#3 where we can insert the epilogue, we can place the saving point in
BB#1.

> It's not that hard to find another case where shrink-wrapping doesn't happen
> in a simple code. For example, in below C code, I saw that shrink-wrapping
> didn't happen due to the very early uses of CSRs (for argument i) in the
> entry block. Do you think we could make shrink-wrapping happen by block
> splitting in the code below ?
>
> int callee(int i);
>
> int foo(int *P, int i) {
>    if (i>0)
>      return P[i];
>
>    i = callee(i);
>    return P[i];
> }

This is still the same problem, x19 gets allocated in the entry block (because
it's alive past the call). In this case, block splitting will not help either.

Quuxplusone commented 7 years ago

Thanks Francis for the explanation.
I'm trying to see if we can encourage more shrink-wrapping even in cases where
CSRs are currently allocated in the entry block.

One way I can think of is that we can discourage RegAlloc from allocating CSRs
in the entry block if the function has an early exit. Previously, I proposed
https://reviews.llvm.org/D34608 in this matter, but the way I detect the
profitable cases and the way I increase the CRSCost was somewhat unclear.

Now, I'm thinking to increase CSRCost for the first allocation from CSRs in the
entry block if we detect an early exit so that we encourage more shrink-
wrapping and avoid executing CSR spill/recover when the early exit is taken.

Actually, I sent out this in llvm.dev before, but didn't get too much
attention, so I'm ask one more time to see if doing such thing in RegAlloc is
make sense. I will be happy to hear any opinion about this.