performance opportunity for CG with instruction reordering #12974

Open Quuxplusone opened 12 years ago

Quuxplusone commented 12 years ago
Bugzilla Link: PR12869
Status: NEW
Importance: P enhancement
Reported by: Kostya Serebryany (kcc@google.com)
Reported on: 2012-05-18 09:03:43 -0700
Last modified on: 2012-05-20 13:17:19 -0700
Version: trunk
Hardware: PC Linux
CC: anton@korobeynikov.info, baldrick@free.fr, evan.cheng@apple.com, llvm-bugs@lists.llvm.org, nicholas@mxc.ca, nlewycky@google.com, resistor@mac.com
Fixed by commit(s):
Attachments:
Blocks:
Blocked by:
See also:

Here are two semantically equivalent functions.
LLVM x86 CG produces 2 instructions for function one and 4 for another.

void foo(int * restrict x, int * restrict y) {
  int tx = *x;
  int ty = *y;
  tx = tx >> 10;
  ty = ty >> 10;
  *x = tx;
  *y = ty;
}

void bar(int * restrict x, int * restrict y) {
  *x = *x >> 10;
  *y = *y >> 10;
}

; ModuleID = 'tt.c'
target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128-n8:16:32:64-S128"

define void @foo(i32* nocapture noalias  %A1, i32* nocapture noalias %A2)
nounwind uwtable {
entry:
  %L1 = load i32* %A1, align 4
  %L2 = load i32* %A2, align 4
  %S1  = ashr i32 %L1, 10
  %S2  = ashr i32 %L2, 10
  store i32 %S1, i32* %A1, align 4
  store i32 %S2, i32* %A2, align 4
  ret void
}

define void @bar(i32* nocapture noalias  %A1, i32* nocapture noalias %A2)
nounwind uwtable {
entry:
  %L1 = load i32* %A1, align 4
  %S1  = ashr i32 %L1, 10
  store i32 %S1, i32* %A1, align 4
  %L2 = load i32* %A2, align 4
  %S2  = ashr i32 %L2, 10
  store i32 %S2, i32* %A2, align 4
  ret void
}

First:
        sarl    $10, (%rdi)
        sarl    $10, (%rsi)
Second:
        movl    (%rsi), %eax
        sarl    $10, (%rdi)
        sarl    $10, %eax
        movl    %eax, (%rsi)
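
The equivalence claim in the report depends on the restrict qualifiers. The sketch below repeats foo and bar unchanged and adds two variants without restrict so the aliased case can be exercised without undefined behavior. The names foo_plain and bar_plain are hypothetical and not part of the report. Inputs are kept non-negative because right-shifting a negative int is implementation-defined in C.

```c
/* The two functions from the report, unchanged. */
void foo(int * restrict x, int * restrict y) {
  int tx = *x;
  int ty = *y;
  tx = tx >> 10;
  ty = ty >> 10;
  *x = tx;
  *y = ty;
}

void bar(int * restrict x, int * restrict y) {
  *x = *x >> 10;
  *y = *y >> 10;
}

/* Hypothetical variants without restrict (illustration only), so that
 * passing the same pointer twice is well defined. */
void foo_plain(int *x, int *y) {
  int tx = *x;      /* both loads happen before either store */
  int ty = *y;
  tx = tx >> 10;
  ty = ty >> 10;
  *x = tx;
  *y = ty;
}

void bar_plain(int *x, int *y) {
  *x = *x >> 10;    /* this store is visible to the load of *y below */
  *y = *y >> 10;
}
```

With distinct pointers all four functions produce the same result. With x == y, foo_plain shifts the value once while bar_plain shifts it twice, which is why the backend may only reorder the second load above the first store when it knows the pointers do not alias.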
Quuxplusone commented 12 years ago

Kostya, what happens if you pass the -combiner-aa argument to llc?

Quuxplusone commented 12 years ago
(In reply to comment #1)
> Kostya, what happens if you pass the -combiner-aa argument to llc?
Actually, the code for "foo" is longer and the code for "bar" is shorter. Running
"llc -combiner-alias-analysis -combiner-global-alias-analysis" makes the output identical.
Quuxplusone commented 12 years ago

Sorting blocks into loads+ops+stores is one of my old todo-list wishlist items. It makes a lot of things easier to analyze, and lets backends do trivial load-fusion and store-fusion. We should do this as an IR pass, and it should turn @bar into @foo.
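
A minimal sketch of that reordering on a toy model, not on LLVM's real IR classes. Everything here is an assumption made for illustration: the struct inst type, the sort_block function, a single straight-line block, no calls or volatile accesses, and pointers that are noalias function arguments identified by a small integer. Under those assumptions a stable sort by kind is safe unless a load would be hoisted above an earlier store to the same pointer, so the sketch refuses in that case.

```c
#include <stddef.h>

/* Hypothetical toy instruction kinds, ordered the way we want them sorted. */
enum kind { LOAD = 0, OP = 1, STORE = 2 };

struct inst {
  enum kind kind;
  int ptr;  /* id of the pointer accessed by a LOAD/STORE, -1 for an OP */
  int id;   /* label used only to observe the resulting order */
};

/* Writes the block sorted into loads, then ops, then stores into out
 * (which must hold n entries) and returns 1. Returns 0 and leaves out
 * untouched when the reordering would be unsafe in this toy model. */
int sort_block(const struct inst *in, size_t n, struct inst *out) {
  /* Safety check: distinct pointer ids are assumed not to alias, so the
   * only hazard is a load that follows a store to the same pointer. */
  for (size_t i = 0; i < n; i++) {
    if (in[i].kind != LOAD)
      continue;
    for (size_t j = 0; j < i; j++)
      if (in[j].kind == STORE && in[j].ptr == in[i].ptr)
        return 0;
  }

  /* Stable bucket pass: relative order inside each kind is preserved,
   * so ops still follow the ops they depend on. */
  size_t k = 0;
  for (int want = LOAD; want <= STORE; want++)
    for (size_t i = 0; i < n; i++)
      if ((int)in[i].kind == want)
        out[k++] = in[i];
  return 1;
}
```

Fed the six instructions of @bar (load, ashr, store on %A1, then the same on %A2), this yields the @foo order: both loads, both shifts, both stores. A real pass would also have to handle loads whose address depends on an earlier op, calls, and may-alias pointers, none of which this sketch models.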

Quuxplusone commented 12 years ago
>> Actually, the code for "foo" is longer and for "bar" - shorter.
Sure. Meant to say "for one function", not "for function one".

>> "llc -combiner-alias-analysis -combiner-global-alias-analysis" makes the output identical.

Coolness! Any plans to enable this by default?
OTOH, Nick's suggestion to implement this at the IR level makes sense too.
Quuxplusone commented 12 years ago
(In reply to comment #4)
> Coolness! Any plans to enable this by default?
> OTOH, Nick's suggestion to implement this at the IR level makes sense too.
Well... it has been "experimental" for something like 2 or 3 years already... Maybe
Evan or Owen will comment on why it's not turned on yet...
Quuxplusone commented 12 years ago

It's poorly tested, expensive, and showed little benefit on most test suites when we tried it. Beyond that, it's rapidly being superseded by Andy's new scheduler work.

Quuxplusone commented 12 years ago
(In reply to comment #6)
> It's poorly tested, expensive, and showed little benefit on most test suites
> when we tried it.  Beyond that, it's rapidly being superseded by Andy's new
> scheduler work.
Well, I know of at least 2 cases where it gave much better results:

1. EEMBC on ARM
2. It's impossible to match mem-mem instructions without these options. Right now
the only target in the tree which has mem-mem instructions is msp430 :)
Quuxplusone commented 12 years ago
(In reply to comment #6)
> Beyond that, it's rapidly being superseded by Andy's new scheduler work.
How will the scheduler help to fold memory operands, btw?