Quuxplusone / LLVMBugzillaTest

0 stars 0 forks source link

CFG simplify sinking causes regression for a small testcase #29923

Open Quuxplusone opened 8 years ago

Quuxplusone commented 8 years ago
Bugzilla Link PR30950
Status NEW
Importance P normal
Reported by Wei Mi (wmi@google.com)
Reported on 2016-11-08 13:51:14 -0800
Last modified on 2016-11-09 17:04:03 -0800
Version trunk
Hardware PC Linux
CC davidxl@google.com, hfinkel@anl.gov, james@jamesmolloy.co.uk, llvm-bugs@lists.llvm.org, mkuper@google.com, spatel+llvm@rotateright.com
Fixed by commit(s)
Attachments r282452.ll (1690 bytes, application/octet-stream)
r282453.ll (1690 bytes, application/octet-stream)
Blocks
Blocked by
See also
------------------------------------------------------------------------
r282453 | rsmith | 2016-09-26 16:49:47 -0700 (Mon, 26 Sep 2016) | 4 lines

P0145R3 (C++17 evaluation order tweaks): consistently emit the LHS of array
subscripting before the RHS, regardless of which is the base and which is the
index.

------------------------------------------------------------------------

--------------------- 1.c -----------------------
#define MAX(x, y)   (((x) > (y)) ? (x) : (y))
long foo(long *maxarray, long k, long size) {
  return MAX(maxarray[k], maxarray[k+size+1]);
}
-------------------------------------------------

For the testcase 1.c above, with and without r282453 llvm generates different
code.

code generated by r282452:
        .cfi_startproc
# BB#0:                                 # %entry
        movq    (%rdi,%rsi,8), %rcx
        addq    %rsi, %rdx
        movq    8(%rdi,%rdx,8), %rax
        cmpq    %rax, %rcx
        cmovgeq %rcx, %rax  =====> select from the vals of two array accesses.
        retq
.Lfunc_end0:
        .size   foo, .Lfunc_end0-foo
        .cfi_endproc

code generated by r282453:
        .cfi_startproc
# BB#0:                                 # %entry
        movq    (%rdi,%rsi,8), %rax
        leaq    (%rsi,%rdx), %rcx
        leaq    1(%rsi,%rdx), %rdx
        cmpq    8(%rdi,%rcx,8), %rax
        cmovgq  %rsi, %rdx ====> select from the indexes of two array accesses.
        movq    (%rdi,%rdx,8), %rax
        retq
.Lfunc_end0:
        .size   foo, .Lfunc_end0-foo
        .cfi_endproc

However, the cause of the change above is CFG simplify sinking. The algorithm
to decide InstructionsToSink in CFG simplify sinking depends on the instruction
sequence, i.e. during the reverse traversal of BB, once an unsinkable
instruction is seen, the traversal will be stopped and remaining sinkable
instruction will not be added into InstructionsToSink. That is why when layout
is changed by r282453, CFG simplify sinking may generate different code.

How the different sinking causes the final code to be different:

Without r282453, CFG simplify sinking generates a select which will select from
the addresses of two array accesses: %arrayidx and %arrayidx2.

define i64 @foo(i64* %maxarray, i64 %k, i64 %size) local_unnamed_addr #0 {
entry:
  %arrayidx = getelementptr inbounds i64, i64* %maxarray, i64 %k
  %0 = load i64, i64* %arrayidx, align 8, !tbaa !1
  %add = add nsw i64 %k, %size
  %add1 = add nsw i64 %add, 1
  %arrayidx2 = getelementptr inbounds i64, i64* %maxarray, i64 %add1
  %1 = load i64, i64* %arrayidx2, align 8, !tbaa !1
  %cmp = icmp sgt i64 %0, %1
  %arrayidx.arrayidx2 = select i1 %cmp, i64* %arrayidx, i64* %arrayidx2
  %2 = load i64, i64* %arrayidx.arrayidx2, align 8, !tbaa !1
  ret i64 %2
}

For the IR above, Instcombine can change load(select %arrayidx, %arrayidx2) to
select(%0, %1) and we can remove the last load.

With r282453, CFG simplify sinking generates a select which will select from
the indexes of two array accesses: %k, %add1.

define i64 @foo(i64* %maxarray, i64 %k, i64 %size) local_unnamed_addr #0 {
entry:
  %arrayidx = getelementptr inbounds i64, i64* %maxarray, i64 %k
  %0 = load i64, i64* %arrayidx, align 8, !tbaa !1
  %add = add nsw i64 %k, %size
  %add1 = add nsw i64 %add, 1
  %arrayidx2 = getelementptr inbounds i64, i64* %maxarray, i64 %add1
  %1 = load i64, i64* %arrayidx2, align 8, !tbaa !1
  %cmp = icmp sgt i64 %0, %1
  %k.add1 = select i1 %cmp, i64 %k, i64 %add1
  %arrayidx6 = getelementptr inbounds i64, i64* %maxarray, i64 %k.add1
  %2 = load i64, i64* %arrayidx6, align 8, !tbaa !1
  ret i64 %2
}

For the IR above, Instcombine can not do the load(select) => select(load)
transformation because the select may be deeply embedded inside of an array
expr like load(...select...), so we cannot eliminate the last load.

In summary, two things to be addressed here:
1. It is better for cfg simplify sinking to generate consistent code for
unrelated changes.
2. Either limit cfg simplify sinking to stop sinking at load address, or extend
instcombine to handle the case load(...select(index1, index2)...) and transform
it to select(load1, load2).
Quuxplusone commented 8 years ago

Hi Wei,

Sorry about that. Is it possible to get the IR being input into the SimplifyCFG pass, and how it differs?

If CFG sinking is disabled, I'd presume this testcase would regress significantly, because the selects would not even be produced. Is this correct, or does CFG sinking actually regress performance?

James

Quuxplusone commented 8 years ago

Attached r282452.ll (1690 bytes, application/octet-stream): IR generated by r282452 before cfgsimplify

Quuxplusone commented 8 years ago

Attached r282453.ll (1690 bytes, application/octet-stream): IR generated by r282453 before cfgsimplify

Quuxplusone commented 8 years ago
Hi James,

> Hi Wei,
>
> Sorry about that. Is it possible to get the IR being input into the
> SimplifyCFG pass, and how it differs?

I attached the IR.

>
> If CFG sinking is disabled, I'd presume this testcase would regress
> significantly, because the selects would not even be produced. Is this
> correct, or does CFG sinking actually regress performance?
>
> James

I tried opt -O2 -simplifycfg-sink-common=false on r282452.ll and r282453.ll,
and it could still generate the select and the final code was what we expect.
It is earlycse which helps to eliminate the redundent load.

Wei.
Quuxplusone commented 8 years ago
Hi Wei,

Thanks for this. What's happening seems to be that, in the r282453 case, the
load of %maxarray.addr is first in both conditional blocks. This allows
simplifycfg to *hoist* it, which makes sinking see that the GEPs are of the
same base pointer.

We have heuristic bailouts so we don't produce nasty GEPs, but a GEP with only
a variable (PHI) final operand is allowed.

I'm really not sure what to do here. GEPs with a single variable final operand
were even allowed to be sunk before I rewrote this code, so adding a bailout
for that could cause all sorts of problems...

The issue with this optimization is that some testcases *require* it to fire
before SROA, and others *require* it not to fire before SROA. It's very
difficult to please all consumers.

I'd like to get around to committing the GVN-sinking pass, which would allow us
to cripple simplifycfg to only handle the most trivial cases (without which
SROA can't handle some cases) and to do the heavy lifting *after SROA* in GVN-
sink.

In the meantime I'm open to suggestions about what to do for your testcase :(

Cheers,

James
Quuxplusone commented 8 years ago
(In reply to comment #5)
> Hi Wei,
>
> Thanks for this. What's happening seems to be that, in the r282453 case, the
> load of %maxarray.addr is first in both conditional blocks. This allows
> simplifycfg to *hoist* it, which makes sinking see that the GEPs are of the
> same base pointer.
>
> We have heuristic bailouts so we don't produce nasty GEPs, but a GEP with
> only a variable (PHI) final operand is allowed.
>
> I'm really not sure what to do here. GEPs with a single variable final
> operand were even allowed to be sunk before I rewrote this code, so adding a
> bailout for that could cause all sorts of problems...
>

Thanks for looking at the problem.

Could you be more specific or even better to have a testcase to show what kind
of problem we will see if we bailout for all GEPs?

>
> The issue with this optimization is that some testcases *require* it to fire
> before SROA, and others *require* it not to fire before SROA. It's very
> difficult to please all consumers.
>
> I'd like to get around to committing the GVN-sinking pass, which would allow
> us to cripple simplifycfg to only handle the most trivial cases (without
> which SROA can't handle some cases) and to do the heavy lifting *after SROA*
> in GVN-sink.
>
> In the meantime I'm open to suggestions about what to do for your testcase :(

I was thinking about whether extending instcombine may be easier. The testcase
can be fixed by by supporting the transformation of
load(gep(array, select(idx1, idx2))) ==> select(load(gep(array, idx1)),
load(gep(array, idx2))).
But we need to apply isSafeToLoadUnconditionally on a new GEP node without
actually inserting the new node into IR. It seems a problem.

In addition, we may have load(gep(gep...(select)))... It will be nasty.

Thanks,
Wei.