Quuxplusone / LLVMBugzillaTest


bad generated code for conditionals connected by (&) #23826

Closed Quuxplusone closed 8 years ago

Quuxplusone commented 9 years ago
Bugzilla Link PR23827
Status RESOLVED INVALID
Importance P normal
Reported by Fisnik Kastrati (fkastrati@gmail.com)
Reported on 2015-06-12 03:33:26 -0700
Last modified on 2015-09-10 08:56:18 -0700
Version unspecified
Hardware PC Linux
CC chandlerc@gmail.com, dgregor@apple.com, hans@chromium.org, hfinkel@anl.gov, kparzysz@quicinc.com, llvm-bugs@lists.llvm.org, llvm-dev@redking.me.uk, spatel+llvm@rotateright.com, t.p.northover@gmail.com
Fixed by commit(s)
Attachments 23827_tester.c (1765 bytes, text/plain)
23827_tests.s (1247 bytes, application/octet-stream)
Blocks
Blocked by
See also PR22723, PR23155, PR5615, PR24743, PR24766
consider the following code snippet (C++):

#include <cstdio>

void ampamp(int x, int y) {
  if (x < 3 && y > 10)
    printf("%d%d", x, y);
}

void amp(int x, int y) {
  if ((x < 3) & (y > 10))
    printf("%d%d", x, y);
}

The assembly code generated by clang++ (all versions I tested) is not optimal:
for both functions the generated assembly is identical. For the function "amp"
(with a single `&'), only one jump should be generated, because the branch
misprediction penalty is otherwise very high.

As a side note: Intel's compiler (ICC) does generate optimal code for such
scenarios, at least in the versions I tested (icc13 and icc15).

See: http://goo.gl/oiTPX5
Quuxplusone commented 9 years ago
The optimized IR for both cases is identical (this is from clang/llvm built
from r240416):

define void @ampamp(i32 %x, i32 %y) #0 {
entry:
  %cmp = icmp slt i32 %x, 3
  %cmp1 = icmp sgt i32 %y, 10
  %and5 = and i1 %cmp, %cmp1
  br i1 %and5, label %if.then, label %if.end

if.then:                                          ; preds = %entry
  %call = tail call i32 (i8*, ...) @printf(i8* getelementptr inbounds ([5 x i8], [5 x i8]* @.str, i64 0, i64 0), i32 %x, i32 %y) #2
  br label %if.end

if.end:                                           ; preds = %if.then, %entry
  ret void
}

Note that the IR optimizer has transformed the logical (&&) op into the bitwise
('and') op. It is the x86 backend that is then transforming the 'and' IR back
into a logical op (branches).
Quuxplusone commented 9 years ago
I noticed that PowerPC does what we expect on this test case, and then tracked
this down to a fundamental difference in the initial DAG.

It's controlled by the target lowering hook:
      setJumpIsExpensive();

By default jumps are *not* considered expensive, so x86 and many other targets
will transform comparison logic in SelectionDAGBuilder::visitBr(). There's even
a comment that almost exactly matches the test case in this bug report:

  // If this is a series of conditions that are or'd or and'd together, emit
  // this as a sequence of branches instead of setcc's with and/or operations.
  // As long as jumps are not expensive, this should improve performance.
  // For example, instead of something like:
  //     cmp A, B
  //     C = seteq
  //     cmp D, E
  //     F = setle
  //     or C, F
  //     jnz foo
  // Emit:
  //     cmp A, B
  //     je foo
  //     cmp D, E
  //     jle foo

This is very surprising, to me at least. I didn't realize that LLVM had such a
coarse-grained knob to control this, and of course, this contradicts the logic
in instcombine. Maybe it's expected that later (machine?) passes can undo this?

If we change X86TargetLowering to setJumpIsExpensive(), we'll get closer to the
expected output (I changed the 'printf' to a 'foo' to simplify):

_ampamp:
## BB#0:
    cmpl    $3, %edi
    setl    %al
    cmpl    $10, %esi
    setg    %cl
    andb    %al, %cl
    movzbl  %cl, %eax
    cmpl    $1, %eax
    jne LBB0_1
## BB#2:                                ## %if.then
    jmp _foo                    ## TAILCALL
LBB0_1:                                 ## %if.end
    retq

(Specifying a CPU that has cmov doesn't change this.)

...but I'm guessing that making jumps expensive could have a negative impact
for other code. FWIW, I see 5 regression test failures after making this change:
    LLVM :: CodeGen/X86/2008-02-18-TailMergingBug.ll
    LLVM :: CodeGen/X86/MachineBranchProb.ll
    LLVM :: CodeGen/X86/cmov.ll
    LLVM :: CodeGen/X86/or-branch.ll
    LLVM :: CodeGen/X86/remat-invalid-liveness.ll
Quuxplusone commented 9 years ago
(In reply to comment #2)
> this contradicts the logic in instcombine.

That should be "simplifycfg" not "instcombine":

/// FoldBranchToCommonDest - If this basic block is simple enough, and if a
/// predecessor branches to us and one of our successors, fold the block into
/// the predecessor and use logical operations to pick the right destination.
Quuxplusone commented 9 years ago
Fisnik, on what processor and with what benchmark do you measure the problem
with 'amp's generated code? Why is the branch misprediction penalty very high
there? I don't really understand that at all.

Fundamentally, I disbelieve that these should ever be compiled differently.
They are identical code sequences functionally, and we should generate an
identical sequence of instructions.

Also, note that all versions of GCC also generate two jumps in both test cases.
ICC is the only one that generates different code, and the code sequence it
generates for 'amp' seems absurdly worse than what Clang and GCC do for both
test cases and what ICC itself does for 'ampamp'.

Finally, do note that with Clang's output you have prolog/epilog idiocy in the
code that has nothing to do with the testcase. AFAICT, other than that
silliness, Clang, GCC, and ICC for 'ampamp' all agree, and I think that code is
the best code.

Please re-open with measurements and details about what you think is going
wrong here if there is an actual performance problem.
Quuxplusone commented 9 years ago
(In reply to comment #4)
> Fisnik, on what processor and with what benchmark do you measure the problem
> with 'amp's generated code? Why is the branch misprediction penalty very
> high there? I don't really understand that at all.
>
> Fundamentally, I disbelieve that these should ever be compiled differently.
> They are identical code sequences functionally, and we should generate an
> identical sequence of instructions.
>
>
> Also, note that all versions of GCC also generate two jumps in both test
> cases. ICC is the only one that generates different code, and the code
> sequence it generates for 'amp' seems absurdly worse than what Clang and GCC
> do for both test cases and what ICC itself does for 'ampamp'.
>
>
> Finally, do note that with Clang's output you have prolog/epilog idiocy in
> the code that has nothing to do with the testcase. AFAICT, other than that
> silliness, Clang, GCC, and ICC for 'ampamp' all agree, and I think that code
> is the best code.
>
>
> Please re-open with measurements and details about what you think is going
> wrong here if there is an actual performance problem.

If a predicate is taken with roughly 0.5 probability, then branch prediction
performs really badly. Since the branchy code (&&) generates a jump for each
conjunct, there is a branch prediction associated with each jump. Now, as a
developer I need to control this: depending on the selectivities of the
predicates, I might choose "amp" over 'ampamp'; but clang/gcc does not allow
that, as it generates the same code for both functions! It destroys the
opportunity for optimization.
The example with 'amp' and 'ampamp' was chosen just to illustrate the problem;
I could have n predicates in an if statement, and that would further exacerbate
the problem.

To this end, the compiler should listen to the developer and generate the code
the developer is asking for. We are talking here about an optimization bug,
one that ICC, for instance, handles well; the problem is not with the function
'ampamp' but with 'amp'.

I have tried the code on all the machines (Linux OS) that we have in our
department, all Intel processors, and the same problem shows up everywhere.

To better understand this problem and my point, see the paper "Conjunctive
selection conditions in main memory" by K. Ross. It also contains experimental
results and explains why it makes sense to choose 'amp' over 'ampamp',
depending on predicate selectivity.
Quuxplusone commented 9 years ago
(In reply to comment #5)
> (In reply to comment #4)
> > Fisnik, on what processor and with what benchmark do you measure the problem
> > with 'amp's generated code? Why is the branch misprediction penalty very
> > high there? I don't really understand that at all.
> >
> > Fundamentally, I disbelieve that these should ever be compiled differently.
> > They are identical code sequences functionally, and we should generate an
> > identical sequence of instructions.
> >
> >
> > Also, note that all versions of GCC also generate two jumps in both test
> > cases. ICC is the only one that generates different code, and the code
> > sequence it generates for 'amp' seems absurdly worse than what Clang and GCC
> > do for both test cases and what ICC itself odes for 'ampamp'.
> >
> >
> > Finally, do note that with Clang's output you have prolog/epilog idiocy in
> > the code that has nothing to do with the testcase. AFAICT, other than that
> > silliness, Clang, GCC, and ICC for 'ampamp' all agree, and I think that code
> > is the best code.
> >
> >
> > Please re-open with measurements and details about what you think is going
> > wrong here if there is an actual performance problem.
>
> If a predicate is taken with roughly 0.5 probability, then branch
> prediction is really bad.

This is an oversimplification on all modern x86 processors, which have a
history-buffer-based predictor. These will detect periodic and other patterned
branches even though they may have a 0.5 probability on average.

But certainly, there do exist unpredictable branches.

> Since the branchy code (&&) will generate a jump
> for each conjunct, there is a branch prediction associated with each jump.
> Now, as a developer I need to control this, and depending on selectivities
> of predicates, I might choose "amp" over 'ampamp'; but clang/gcc does
> not allow that, as it generates the same code for both functions! It
> destroys the opportunity for optimization.
> The example with 'amp' and 'ampamp' was chosen just to illustrate the
> problem;
> I could have n predicates in an if statement, and that would further
> exacerbate the problem.
>
> To this end, the compiler should listen to the developer and generate the
> code the developer is asking for. We are talking here about an optimization
> bug, one that ICC, for instance, handles well; the problem is not with the
> function 'ampamp' but with 'amp'.

I firmly disagree.

First off, understand that this has *nothing* to do with LLVM, only to do with
Clang at this point. Clang controls how to interpret the language expressions.

We should not encode some magical meaning to a bitwise-and between two boolean
values versus a logical and between the two boolean values, merely because *if*
the boolean values are the results of expressions with side-effects, the
bitwise-and would happen to expose both sets of side-effects.

If you want to propose an annotation which specifically marks conditions as
unpredictable, propose it (not here, to cfe-dev) and we could discuss it. But
co-opting the bitwise-and seems the wrong design.

At a very fundamental level, LLVM should be producing two branches when that is
likely the best lowering, and a single branch when that is likely the best
lowering. If we have actual information about a lack of predictability (either
through analysis of the *logical* structure, not the syntax, or through an
explicit annotation scheme), certainly, we should generate predicated code that
avoids branches. If you have suggestions on how to automatically detect these
cases, please file a bug with that. If you end up designing syntax to represent
this, you should also suggest to llvmdev a proposal for how to model the
annotation in the IR as well, and then we can explore patches to try to get
LLVM to reasonably transform the code to this end.
Quuxplusone commented 9 years ago

"At a very fundamental level, LLVM should be producing two branches when that is likely the best lowering, and a single branch when that is likely the best lowering."

And ideally, all transformations that could be influenced by this would be able to take this information into account. Since this particular case would affect CFG, this would include simplifycfg or instcombine.

Quuxplusone commented 9 years ago

Attached 23827_tester.c (1765 bytes, text/plain): harness for unpredictable branch tests

Quuxplusone commented 9 years ago

Attached 23827_tests.s (1247 bytes, application/octet-stream): test cases for unpredictable branches

Quuxplusone commented 9 years ago
I've attached a test program (test harness in C, actual tests in asm) to check
performance of the current branch optimization that's happening for x86 in the
DAG. To try it out on Linux or Mac:

$ clang 23827_tester.c 23827_tests.s
$ ./a.out

As-is, the test program is modeling this C code:

extern int bar();//{ return 0; }

int foo(int x, int y) {
        if (x != 0 && y != 0) return bar();
        return 1;
}

I used clang to compile that down to asm while toggling the
setJumpIsExpensive() hook.

The 'super' cases are the asm for a bigger compound predicate because the DAG
optimization appears to happily turn any size predicate into multiple branches:

extern int bar();//{ return 0; }

int foo(int w, int x, int y, int z) {
        if (w != 0 && x != 0 && y != 0 && z != 0) return bar();
        return 1;
}
Quuxplusone commented 9 years ago
Here's what I'm measuring for the branchy code relative to the compound
predicate code:

Intel Haswell:
60-65% perf for the 2-way compound predicate
35-40% perf for the 4-way (super) compound predicate

AMD Jaguar:
70-75% perf for the 2-way compound predicate
60-65% perf for the 4-way compound predicate
Quuxplusone commented 9 years ago
I've tried to be careful about code alignment since that's known to shake
things up in a big way:
https://llvm.org/bugs/show_bug.cgi?id=23155#c16
https://llvm.org/bugs/show_bug.cgi?id=5615#c4

Testing x86 perf at this low a level is notoriously flaky, so please do
review the test code to make sure I haven't introduced any dumb bugs.
Quuxplusone commented 9 years ago
(In reply to comment #11)
> Here's what I'm measuring for the branchy code relative to the compound
> predicate code:
>
> Intel Haswell:
> 60-65% perf for the 2-way compound predicate
> 35-40% perf for the 4-way (super) compound predicate

What does "60-65% perf" mean? Is that a performance improvement over baseline?

>
> AMD Jaguar:
> 70-75% perf for the 2-way compound predicate
> 60-65% perf for the 4-way compound predicate
Quuxplusone commented 9 years ago
(In reply to comment #13)
> (In reply to comment #11)
> > Here's what I'm measuring for the branchy code relative to the compound
> > predicate code:
> >
> > Intel Haswell:
> > 60-65% perf for the 2-way compound predicate
> > 35-40% perf for the 4-way (super) compound predicate
>
> What does "60-65% perf" mean? Is that a performance improvement over
> baseline?

I'm also deeply concerned with the use of relative performance numbers of a
benchmark on branches with nothing *but* branches. That's not representative
and for many processors unnecessarily punitive.

Since LLVM perfectly canonicalizes the two forms of C code, we should be
looking at how real-world benchmarks perform with different cost models for the
branch.
Quuxplusone commented 9 years ago
(In reply to comment #13)
> What does "60-65% perf" mean? Is that a performance improvement over
> baseline?

No - that's perf relative to the compound predicate, so a 35-40% degradation.
If you run the benchmark, you should see something like this:

Branchy perf is 60.6% of compound pred perf (branchy = 2635571 cycles, setne =
1596360 cycles)
Quuxplusone commented 9 years ago
(In reply to comment #14)
> (In reply to comment #13)
> > (In reply to comment #11)
> > > Here's what I'm measuring for the branchy code relative to the compound
> > > predicate code:
> > >
> > > Intel Haswell:
> > > 60-65% perf for the 2-way compound predicate
> > > 35-40% perf for the 4-way (super) compound predicate
> >
> > What does "60-65% perf" mean? Is that a performance improvement over
> > baseline?
>
> I'm also deeply concerned with the use of relative performance numbers of a
> benchmark on branches with nothing *but* branches. That's not representative
> and for many processors unnecessarily punitive.
>
> Since LLVM perfectly canonicalizes the two forms of C code, we should be
> looking at how real-world benchmarks perform with different cost models for
> the branch.

The problem, from personal experience, is that:

int foo(int w, int x, int y, int z) {
        if (w != 0 && x != 0 && y != 0 && z != 0) return bar();
        return 1;
}

*is* real code. This comes up a surprising amount. Even worse are extern
predicate functions, which do nothing but evaluate some combination of their
inputs and then return.

And so even if using branches is generally better when there is other code
around, we really should have a cost model that can do the right thing when
these conditions are on the critical path and contention is low.

(FWIW, the modeling of this in the PPC backend is not great either, and
complicated there by the fact that generating isel instructions leads to better
instruction scheduling, etc. but are actually inferior to using branches on
some cores on the critical path).
Quuxplusone commented 9 years ago
(In reply to comment #14)
> I'm also deeply concerned with the use of relative performance numbers of a
> benchmark on branches with nothing *but* branches. That's not representative
> and for many processors unnecessarily punitive.
>
> Since LLVM perfectly canonicalizes the two forms of C code, we should be
> looking at how real-world benchmarks perform with different cost models for
> the branch.

I'll give test-suite a try, but I don't expect anything to rise above the noise
there based on my past experience with it.

I can assure you that we are getting valid complaints from the real-world on
this (separate from this bug report). Unfortunately, I don't have direct access
to that code, so I won't be able to post anything from it.

At this point, my goal is simply to make the transform optional. I'm not trying
to change the defaults for anyone. I agree that most branches are predictable,
and therefore, converting to branches is a reasonable thing to do most of the
time.

For example, if you take the rand() out of the attached micro-benchmark and
make the branches go all one direction or the other, then I see that branchy
perf will outperform compound pred by ~20% on Haswell.

We just need to allow for the oddball unpredictable case, and that's all I'm
trying to show with the test program here: that we can't always win, no matter
what we do.
Quuxplusone commented 9 years ago
Note: there's an LLVM option to enable/disable the transform after:
http://llvm.org/viewvc/llvm-project?view=revision&revision=241179

I ran the benchmarking subset of test-suite on an AMD Jaguar system using this
option along with -O3 -ffast-math -march=btver2. As expected, I'm not seeing
much rising above the noise, so I added a statistic to see where the transform
fires and dug a little deeper on the top 3:

test-suite/SingleSource/Benchmarks/Polybench/stencils/fdtd-apml/fdtd-apml
  19 isel              - Compound predicate transformed
test-suite/SingleSource/Benchmarks/Linpack/linpack-pc
  14 isel              - Compound predicate transformed
test-suite/SingleSource/Benchmarks/Misc-C++/Large/sphereflake
  13 isel              - Compound predicate transformed

The results are mixed when using compound predicates (bitwise ops on condition
register results) instead of the default branch-is-cheap codegen for x86:
1. fdtd: no perf difference (this test is apparently mostly I/O time)
2. linpack: 2.0% loss
3. sphereflake: 1.4% improvement

One part of the differing linpack codegen when using bitwise ops looks
suspicious:
  4029e0:       4d 85 d2                test   %r10,%r10
  4029e3:       4c 89 f1                mov    %r14,%rcx
  4029e6:       0f 9f c0                setg   %al
...
  4029fe:       c5 f8 2e c3             vucomiss %xmm3,%xmm0
...
  402a0a:       0f 9a c1                setp   %cl
  402a0d:       0f 95 c3                setne  %bl
  402a10:       08 cb                   or     %cl,%bl
  402a12:       20 c3                   and    %al,%bl
  402a14:       20 c3                   and    %al,%bl <--- repeat???
  402a16:       0f b6 c3                movzbl %bl,%eax
  402a19:       83 f8 01                cmp    $0x1,%eax

Regardless of that, the conclusion I'm coming to is that we should allow the
programmer to override the default codegen using an 'unpredictable' hint as
Chandler suggested on the devlist.
Quuxplusone commented 8 years ago
As of:
http://reviews.llvm.org/rL246699

...we can use "__builtin_unpredictable()" to get the desired single branch
codegen from either of the ampamp() or amp() test cases:

#include <stdio.h>

void ampamp(int x, int y) {
  if (__builtin_unpredictable(x < 3 && y > 10))
    printf("%d%d", x, y);
}

void amp(int x, int y) {
  if (__builtin_unpredictable((x < 3) & (y > 10)))
    printf("%d%d", x, y);
}

$ ./clang -O2 -fomit-frame-pointer -S -o - 23827.c
...
## BB#0:                                ## %entry
    movl    %esi, %r8d
    movl    %edi, %edx
    cmpl    $3, %edx
    setl    %al
    cmpl    $10, %r8d
    setg    %cl
    andb    %al, %cl
    movzbl  %cl, %eax
    cmpl    $1, %eax
    jne LBB0_1
## BB#2:                                ## %if.then
    leaq    L_.str(%rip), %rdi
    xorl    %eax, %eax
    movl    %edx, %esi
    movl    %r8d, %edx
    jmp _printf                 ## TAILCALL
LBB0_1:                                 ## %if.end
    retq

-------------------------------------------------------------------------

I'm resolving this as 'invalid' because the request to change codegen in the
original test is not something we can do. Please reopen if you still disagree.
If there are problems with the less branchy codegen after using
'unpredictable', please file a new bug.

I filed bug 24743 to track the problem mentioned in comment 1 / comment 7.