Open Quuxplusone opened 7 years ago
Bugzilla Link | PR33868 |
Status | NEW |
Importance | P enhancement |
Reported by | Chandler Carruth (chandlerc@gmail.com) |
Reported on | 2017-07-20 18:07:29 -0700 |
Last modified on | 2018-03-14 12:25:55 -0700 |
Version | unspecified |
Hardware | PC All |
CC | compnerd@compnerd.org, dblaikie@gmail.com, ditaliano@apple.com, francisvm@yahoo.com, junbuml@codeaurora.org, llvm-bugs@lists.llvm.org, matze@braunis.de, quentin.colombet@gmail.com, sfertile@ca.ibm.com |
Fixed by commit(s) | |
Attachments | |
Blocks | |
Blocked by | |
See also |
As discussed on IRC, I'll explain why shrink-wrapping doesn't kick in:
Currently, the only thing we do is delay the execution of the prologue /
epilogue, with an unique saving point, and an unique restoring point.
In this case, both BB#0 (the entry block) and BB#1 jump to BB#3, the only
return block we have here.
Since we don't create any new blocks, BB#3 is the only one eligible for a
restore point (successor of BB#1 and BB#2, which call functions).
BB#3 has BB#0 as a predecessor, forcing us to save in the entry block.
Therefore, no shrink-wrapping.
What would be cool is to split the edges and create intermediary blocks to
allow more shrink-wrapping to be performed.
We could consider block splitting when it is free.
I mean putting the epilog/prologue on a fall through edge should be free and
adding such free basic block should be easy enough.
Francis, do you want to try it or do you want me to?
(In reply to Quentin Colombet from comment #2)
> Francis, do you want to try it or do you want me to?
I think it's probably better if you do it, you seem to have a pretty clear idea
of how to do it!
I looked at this on aarch64 where shrink-wrapping didn't happen as well. For me
it seems that a CSR is allocated for a virtual register which is used in the
entry block. Since the first use of CSR is in the entry block, the shrink-
wrapping pass might see the entry block as a safe save point, so there is no
shrink-wrapping candidate.
Basically, I'm not clear how block splitting can expose more shrink-wrapping
opportunities when a CSR is allocated in the entry block. Did you mean that
splitting blocks and split use of the virtual register before RA? Or block
split only before the first use of CSR?
It's not that hard to find another case where shrink-wrapping doesn't happen in
a simple code. For example, in below C code, I saw that shrink-wrapping didn't
happen due to the very early uses of CSRs (for argument i) in the entry block.
Do you think we could make shrink-wrapping happen by block splitting in the
code below ?
int callee(int i);
int foo(int *P, int i) {
if (i>0)
return P[i];
i = callee(i);
return P[i];
}
(In reply to Jun Bum Lim from comment #4)
> I looked at this on aarch64 where shrink-wrapping didn't happen as well. For
> me it seems that a CSR is allocated for a virtual register which is used in
> the entry block. Since the first use of CSR is in the entry block, the
> shrink-wrapping pass might see the entry block as a safe save point, so
> there is no shrink-wrapping candidate.
Right, I just looked at the aarch64 version and on trunk (r317548) with "-
std=c++11 -fno-rtti -fno-exceptions -O3 -target aarch64-unknown-unknown" x19
gets allocated in the entry block, so you're right, no shrink-wrapping
candidate and block splitting won't help either.
> Basically, I'm not clear how block splitting can expose more shrink-wrapping
> opportunities when a CSR is allocated in the entry block. Did you mean that
> splitting blocks and split use of the virtual register before RA? Or block
> split only before the first use of CSR?
In this case on AArch64, block splitting won't help.
On x86_64 however, there is no use of any callee-saved register nor the stack
in the entry block. The uses are in BB#1 and BB#2.
Ideally, we could place the saving point in BB#1, but the only successor of
BB#1 is BB#3, which is the common postdominator of both blocks using the stack
(BB#1 and BB#2).
This forces us to move the saving point in BB#1. If we can insert a block
before BB#3 where we can insert the epilogue, we can place the saving point in
BB#1.
> It's not that hard to find another case where shrink-wrapping doesn't happen
> in a simple code. For example, in below C code, I saw that shrink-wrapping
> didn't happen due to the very early uses of CSRs (for argument i) in the
> entry block. Do you think we could make shrink-wrapping happen by block
> splitting in the code below ?
>
> int callee(int i);
>
> int foo(int *P, int i) {
> if (i>0)
> return P[i];
>
> i = callee(i);
> return P[i];
> }
This is still the same problem, x19 gets allocated in the entry block (because
it's alive past the call). In this case, block splitting will not help either.
Thanks Francis for the explanation.
I'm trying to see if we can encourage more shrink-wrapping even in cases where
CSRs are currently allocated in the entry block.
One way I can think of is that we can discourage RegAlloc from allocating CSRs
in the entry block if the function has an early exit. Previously, I proposed
https://reviews.llvm.org/D34608 in this matter, but the way I detect the
profitable cases and the way I increase the CRSCost was somewhat unclear.
Now, I'm thinking to increase CSRCost for the first allocation from CSRs in the
entry block if we detect an early exit so that we encourage more shrink-
wrapping and avoid executing CSR spill/recover when the early exit is taken.
Actually, I sent out this in llvm.dev before, but didn't get too much
attention, so I'm ask one more time to see if doing such thing in RegAlloc is
make sense. I will be happy to hear any opinion about this.
> We could consider block splitting when it is free. I mean putting the
epilog/prologue on a fall through edge should be free and adding such free
basic block should be easy enough.
Due to this issue, I observed many shrink-wrapping opportunities are blocked.
If no one is actively working on this, I can start looking at this.
Quentin, did you mean to add a new pass for the block splitting before shrink
wrapping pass? Please let me know any detail about what you intended by your
comment above.
(In reply to Jun Bum Lim from comment #7)
> > We could consider block splitting when it is free. I mean putting the
epilog/prologue on a fall through edge should be free and adding such free
basic block should be easy enough.
>
>
> Due to this issue, I observed many shrink-wrapping opportunities are
> blocked. If no one is actively working on this, I can start looking at this.
>
> Quentin, did you mean to add a new pass for the block splitting before
> shrink wrapping pass? Please let me know any detail about what you intended
> by your comment above.
I was hoping that we could emulate the split in the dominator tree (through new
APIs) and only create the basic block if we choose to put some code on the edge.
Alternatively, a naive approach, would be to split all critical edges and then
merge things back after PEI, but that sounds wasteful.
>I was hoping that we could emulate the split in the dominator tree (through
>new APIs) and only create the basic block if we choose to put some code on
>the edge.
>
>Alternatively, a naive approach, would be to split all critical edges and
>then merge things back after PEI, but that sounds wasteful.
I'm not sure how splitting critical edges help in general. For example, in the
IR below, %Exit is the common postdominator of %BB0 and %BB1. I cannot see how
splitting critical edge help shrink wrapping choose the better Restore/Save
point. Don't we need to duplicate the return block which should be hooked into
%entry so that we can isolate the Restore point (%Exit) from %entry?
define void @test(i32 %a) {
entry:
%cmp5 = icmp sgt i32 %a, 0
br i1 %cmp5, label %BB0, label %Exit
BB0:
%call = call i32 @fun()
%c = icmp eq i32 %call, 0
br i1 %c, label %BB1, label %Exit
BB1:
%call2 = call i32 @fun()
%c2 = icmp eq i32 %call2, 0
br label %Exit
Exit:
ret void
}
I think the idea here is to end up with something like this diff:
(In reply to Jun Bum Lim from comment #9)
> define void @test(i32 %a) {
> entry:
> %cmp5 = icmp sgt i32 %a, 0
> br i1 %cmp5, label %BB0, label %Exit
>
> BB0:
> %call = call i32 @fun()
> %c = icmp eq i32 %call, 0
- br i1 %c, label %BB1, label %Exit
+ br i1 %c, label %BB1, label %NewExit
>
> BB1:
> %call2 = call i32 @fun()
> %c2 = icmp eq i32 %call2, 0
- br label %Exit
+ br label %NewExit
+ NewExit:
+ ; Fallthrough
>
> Exit:
> ret void
> }
This would allow us to save in BB0 and restore in NewExit, without inserting
any additional branches from NewExit to Exit.
Thanks Francis for the detail. Your hand modified IR make sense because only
%BB0 and %BB1 are hooked into %NewExit, while %entry is still jumping %Exit.
However, I'm not clear how we can make such decision. Based on what, we can
group %BB0 and %BB1 into %NewExit and %entry into %Exit?
Should we perform this transformation in shrink-wrap pass when we need to
update the save/restore point? Please correct anything I misunderstand.
(In reply to Jun Bum Lim from comment #4)
> I looked at this on aarch64 where shrink-wrapping didn't happen as well. For
> me it seems that a CSR is allocated for a virtual register which is used in
> the entry block. Since the first use of CSR is in the entry block, the
> shrink-wrapping pass might see the entry block as a safe save point, so
> there is no shrink-wrapping candidate.
>
> Basically, I'm not clear how block splitting can expose more shrink-wrapping
> opportunities when a CSR is allocated in the entry block. Did you mean that
> splitting blocks and split use of the virtual register before RA? Or block
> split only before the first use of CSR?
>
> It's not that hard to find another case where shrink-wrapping doesn't happen
> in a simple code. For example, in below C code, I saw that shrink-wrapping
> didn't happen due to the very early uses of CSRs (for argument i) in the
> entry block. Do you think we could make shrink-wrapping happen by block
> splitting in the code below ?
>
> int callee(int i);
>
> int foo(int *P, int i) {
> if (i>0)
> return P[i];
>
> i = callee(i);
> return P[i];
> }
(In reply to Jun Bum Lim from comment #9)
> >I was hoping that we could emulate the split in the dominator tree (through
>new APIs) and only create the basic block if we choose to put some code on
>the edge.
> >
> >Alternatively, a naive approach, would be to split all critical edges and
>then merge things back after PEI, but that sounds wasteful.
>
> I'm not sure how splitting critical edges help in general. For example, in
> the IR below, %Exit is the common postdominator of %BB0 and %BB1. I cannot
> see how splitting critical edge help shrink wrapping choose the better
> Restore/Save point. Don't we need to duplicate the return block which should
> be hooked into %entry so that we can isolate the Restore point (%Exit) from
> %entry?
>
> define void @test(i32 %a) {
> entry:
> %cmp5 = icmp sgt i32 %a, 0
> br i1 %cmp5, label %BB0, label %Exit
>
> BB0:
> %call = call i32 @fun()
> %c = icmp eq i32 %call, 0
> br i1 %c, label %BB1, label %Exit
>
> BB1:
> %call2 = call i32 @fun()
> %c2 = icmp eq i32 %call2, 0
> br label %Exit
>
> Exit:
> ret void
> }
Given there is no critical edges, unless the jump instructions are dealing with
CSRs, then there boundaries should be valid store/restore points, right?
(In reply to Quentin Colombet from comment #12)
>
> Given there is no critical edges, unless the jump instructions are dealing
> with CSRs, then there boundaries should be valid store/restore points, right?
Scratch that, there is actually a critical edges in your case...
But indeed, in that case that wouldn't be enough.
(In reply to Jun Bum Lim from comment #11)
> Thanks Francis for the detail. Your hand modified IR make sense because only
> %BB0 and %BB1 are hooked into %NewExit, while %entry is still jumping %Exit.
> However, I'm not clear how we can make such decision. Based on what, we can
> group %BB0 and %BB1 into %NewExit and %entry into %Exit?
>
> Should we perform this transformation in shrink-wrap pass when we need to
> update the save/restore point? Please correct anything I misunderstand.
I agree with Jun, this is more complicated than just splitting edges (critical
or not), I misread the first motivating example. We actually need to come up
with a common post dominator for a region.
That being said, assuming we have data to prove that this is a big opportunity,
we could merge the common target of CSRs blocks in one common dominator.
Like in Jun's example, candidates are:
BB0 and BB1
BB0 jumps to Exit
BB1 jumps to Exit
=> BB0 and BB1 jumps to NewExit; NewExit jumps to Exit
We would probably have to do the same thing for the sources of the current CSRs
blocks.
=> BB0 and BB1 jumps to NewExit; NewExit jumps to Exit
Note: This is only useful because Exit has more predecessors than just BB0 and BB1.
In order to continue the discussion, I just posted a patch (https://reviews.llvm.org/D42600) in which I split the restore point to merge the common targets of CSRs blocks in one common post-dominator. Please take a look and let me know any comment or suggestion.
I just checked that my WIP patch (https://reviews.llvm.org/D42600) handle the very first test case. Please take a look and let me know any comment on it.