Pathological register allocation within gigantic single basic block function (x86-64) #20013

Open Quuxplusone opened 10 years ago

Quuxplusone commented 10 years ago
Bugzilla Link PR20014
Status NEW
Importance P normal
Reported by Vlad Petric (vlad@drpetric.com)
Reported on 2014-06-12 12:15:54 -0700
Last modified on 2015-04-15 22:35:36 -0700
Version trunk
Hardware PC Linux
CC anton@korobeynikov.info, atrick@apple.com, echristo@gmail.com, jonathan.sauer@gmx.de, lhames@gmail.com, llvm-bugs@lists.llvm.org, quentin.colombet@gmail.com, rafael@espindo.la
Fixed by commit(s)
Attachments Evaluator.ll.gz (187631 bytes, application/gzip)
EvalStripped.cpp.ll (262330 bytes, application/octet-stream)
EvalStripped2.ll (218027 bytes, application/octet-stream)
Blocks
Blocked by
See also
The code in question is from an Othello minimax engine.
https://github.com/vladpetric/ntest/branches (jmobs2 branch)

The static evaluation function of ntest is a huge basic block, of around 550
instructions. It's a complex basic block, with a lot of live values.

Unfortunately, clang produces no fewer than 75 spills! I count a spill as a
store/load pair to the stack, which includes pushes, pops, moves and other
instructions with a memory operand that uses %rsp (saves and restores of
callee-saved registers are included here as well). By comparison, gcc 4.8.2
generates 9, and icc only 4. Given that this is a single linear basic block,
all instructions are executed exactly the same number of times.
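
To make the counting concrete, here is a minimal sketch of how such a count
could be scripted (an assumption for illustration, not the actual tool used).
It tallies stack traffic in objdump -d output according to the rules above:

    #include <iostream>
    #include <string>

    // Minimal sketch (assumed, not the reporter's actual script): feed it
    // objdump -d output and it counts instructions that touch the stack,
    // i.e. pushes, pops, and anything with a memory operand through %rsp.
    // A spill then shows up as one stack store plus one stack load.
    int main() {
        int total = 0, stackOps = 0;
        std::string line;
        while (std::getline(std::cin, line)) {
            std::size_t tab = line.rfind('\t');  // mnemonic+operands field
            if (line.find(':') == std::string::npos || tab == std::string::npos)
                continue;                        // label or section header
            std::string insn = line.substr(tab + 1);
            std::string mnem = insn.substr(0, insn.find_first_of(" \t"));
            ++total;
            if (mnem == "push" || mnem == "pop" ||
                insn.find("(%rsp") != std::string::npos)
                ++stackOps;
        }
        std::cout << "total " << total << "  stack ops " << stackOps << "\n";
        return 0;
    }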

Instruction statistics for the function CEvaluator::EvalMobs:

gcc 4.8.2:
    Total st stores     loads  st loads     moves       alu      mult    popcnt
      547         9       122         9       117       261        28         1

icc:
    Total st stores     loads  st loads     moves       alu      mult    popcnt
      541         4       122         4       122       260        28         1

clang (3.5 trunk):
    Total st stores     loads  st loads     moves       alu      mult    popcnt
      704        76       122        75        81       321        28         1

A few notes:

* there are no branches or non-stack stores. "st loads" means stack loads,
"st stores" means stack stores.

* non-stack loads are loads from the evaluation tables. The mults come
primarily from the "magic multiplier" bit-gather trick (see the sketch after
these notes). popcnt is called explicitly, a constant number of times.

* an x86_64 ALU instruction that has a memory operand is counted twice: as
both an ALU operation and a load. Generally speaking, such instructions are
broken into two micro-ops (one load and one alu).
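
For illustration, a minimal sketch of a magic-multiply bit gather. The mask
and multiplier below are the classic "gather a file into a byte" constants
from bitboard programming; ntest's actual magics may differ:

    #include <cstdint>

    // Gather the eight scattered bits of a bitboard file (bits 0, 8, ..., 56)
    // into one byte: the mask isolates them, the multiply funnels each one to
    // a distinct position in the top byte, and the shift extracts that byte.
    // Classic textbook constants; ntest's own magics may differ.
    inline uint8_t gatherFileA(uint64_t board) {
        const uint64_t fileAMask = 0x0101010101010101ULL;
        const uint64_t magic     = 0x0102040810204080ULL;
        return static_cast<uint8_t>(((board & fileAMask) * magic) >> 56);
    }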

The code can be compiled by fetching the source code (it is a straightforward
cmake project), but I've also attached the .ll file, which includes the
pathological function (CEvaluator::EvalMobs).
Quuxplusone commented 10 years ago

Attached Evaluator.ll.gz (187631 bytes, application/gzip): File which includes CEvaluator::EvalMobs pathological function.

Quuxplusone commented 10 years ago
I haven’t looked at the test case, but based on the statistics I would say we,
at least partly, pay the price of the aggressiveness of the register coalescer
(see the numbers of moves).

You can try with -join-liveintervals=false to see what happens.

Also, do you know why we have so many alu operations?
I do not think that those are added by the register allocator, so it is hard to
blame the register allocator if the input unallocated code is not somehow close.
Quuxplusone commented 10 years ago

Are we generating moves that look like alu instructions?

Is the schedule the same?

-debug-only=misched should report the register pressure after coalescing somewhere in its output.

Quuxplusone commented 10 years ago
Thanks for the comments!

Passing down (via -mllvm) -join-liveintervals=false does help:

Total  st stores      loads   st loads      moves        alu       mult    popcnt
  694         37        122         37        147        322         28         1

Sure, there are more moves, but those should be cheaper than stores and loads
(on recent processors, moves may even be eliminated entirely in the front end).

Andrew, I passed the following param:
-mllvm -debug-only=misched

$ /usr/bin/clang++-3.5 -D_M_AMD64 -m64 -msse4 -O3 -std=c++0x -g3 -Wno-unused-result -Wno-deprecated -c /home/vlad/g2/ntest/src/Evaluator.cpp -o eval.o -mllvm -debug-only=misched
clang (LLVM option parsing): Unknown command line argument '-debug-only=misched'.  Try: 'clang (LLVM option parsing) -help'
clang (LLVM option parsing): Did you mean '-debug-pass=misched'?

Regarding the ALU operations: I think I know what's going on with respect to
the ALU/move counts.

A recurring pattern in the code is extracting a row from a bitboard (a
bitboard is a 64-bit binary representation of the board; usually two such
bitboards are used to represent a position). The code is pretty straightforward:

#define BB_EXTRACT_ROW_PATTERN(ROW) \
    base2ToBase3Table[(empty >> (8 * (ROW))) & 0xff] + \
    base2ToBase3Table[(mover >> (8 * (ROW))) & 0xff] * 2

    TConfig Row0 = BB_EXTRACT_ROW_PATTERN(0);
    value += pR1[Row0];
    TConfig Row1 = BB_EXTRACT_ROW_PATTERN(1);
    value += pR2[Row1];
    (and so on 6 more times, and then 8 more times for a flipped board to extract the columns as well)

So what's the issue? Well, base2ToBase3Table is an array of 32-bit integers,
so the byte offset into the table is actually
((empty >> (8 * (ROW))) & 0xff) * 4.

LLVM folds the * 4 into the AND and the shift, so you end up with
(empty >> (8 * ROW - 2)) & (0xff << 2), i.e. & 0x3fc.

GCC and ICC don't do that; they instead use the more complex addressing modes.
The number of instructions is the same though:

LLVM:
1: empty >> (8 * ROW - 2)
2: and #1 with 0x3fc
3: load #2

GCC and ICC:
1: empty >> (8 * ROW)
2: and with 0xff which is implemented as a movzbl
3: load #2 with a multiplier of 4
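
For what it's worth, the two lowerings compute the same address. A quick
sanity check of the fold (a standalone sketch; ROW = 0 is excluded because
8 * ROW - 2 would be negative, so the folded form as written does not apply
there):

    #include <cassert>
    #include <cstdint>

    // Verifies the equivalence described above:
    //   ((x >> (8 * ROW)) & 0xff) * 4  ==  (x >> (8 * ROW - 2)) & 0x3fc
    int main() {
        const uint64_t x = 0x123456789abcdef0ULL;  // arbitrary bit pattern
        for (int row = 1; row < 8; ++row) {
            uint64_t scaledIndex = ((x >> (8 * row)) & 0xff) * 4;
            uint64_t foldedMask  = (x >> (8 * row - 2)) & 0x3fc;
            assert(scaledIndex == foldedMask);
        }
        return 0;
    }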

To improve the comparison I migrated movzbl and movzwl into the ALU section.

gcc:
Total  st stores    loads   st loads      moves        alu       mult     popcnt
  547          9      122          9         81        297         28          1

clang with regular flags:
Total  st stores    loads   st loads      moves        alu       mult     popcnt
  713         75      122         75         76        336         28          1

clang with joining of live intervals disabled:
Total  st stores    loads   st loads      moves        alu       mult     popcnt
  694         37      122         37        133        336         28          1

Anyway:

a) I'm still trying to figure out where the rest of the difference in alu
instructions is coming from.

b) Is it possible to also disable folding the shift/multiply (by 4 in the
example above) into the AND?
Quuxplusone commented 10 years ago

Oh yeah. Hidden flags like -debug-only are not available in a Release build. You (or someone else) need to run with a Release+Asserts or Debug build. If you ping me next week I can probably do it for you.

Quuxplusone commented 10 years ago
Ok, so I've identified another issue:

As I implied earlier, the function computes a lot of base-3 patterns. Most of
these base-3 pattern calculations take the form of:

    base2ToBase3Table[(empty >> (8 * (ROW))) & 0xff] +
    base2ToBase3Table[(mover >> (8 * (ROW))) & 0xff] * 2

Things work well up to the final summation. Essentially, gcc and icc generate
an lea instruction to do the addition, e.g.:

439fad: mov    0x768880(,%rax,4),%eax
439fb4: mov    0x768880(,%rsi,4),%esi
439fbb: lea    (%rax,%rsi,2),%r11d

However, llvm produces an extra add:

434755: mov    0x6ad540(%rcx),%ecx
43475b: add    %ecx,%ecx
43475d: add    0x6ad540(%rdx),%ecx

True, the number of instructions is the same. However(!), the final add is in
fact two micro-ops (see Agner's instruction tables
http://agner.org/optimize/instruction_tables.pdf , page 185; it's not just
Haswell though).

This explains the discrepancy in ALU counts almost entirely, as I'm counting an
instruction such as add 0x6ad540(%rdx),%ecx as a load and an add.

To summarize:

* there's an issue with join-liveintervals causing about 38 of the spills

* there's another issue behind the remaining 37 spills

* finally, a + 2 * b, where a and b are unsigned 32-bit integers, is not
folded into a single lea instruction
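
A minimal standalone reproducer for the third point (a sketch, not code taken
from ntest):

    #include <cstdint>

    // a + 2 * b fits a single x86-64 lea (base + index*2 addressing).
    // gcc/icc emit e.g. lea (%rax,%rsi,2),%r11d for this; clang 3.5
    // produced the add %ecx,%ecx / add pair shown above instead.
    uint32_t sumWithDoubledTerm(uint32_t a, uint32_t b) {
        return a + 2 * b;
    }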
Quuxplusone commented 10 years ago
Andrew,

Per your request, I ran the misched thing. The output regarding register
pressure on EvalMobs looks as follows:

********** MI Scheduling **********
_ZNK10CEvaluator8EvalMobsERK4Pos2jj:BB#0
  From: DBG_VALUE %RDI, %noreg, !"this"; line no:0
    To: RETQ %EAX<kill>; dbg:/home/vlad/pet/git/ntest/src/Evaluator.cpp:564
 RegionInstrs: 643 Remaining: 0
Max Pressure: GR8_ABCD_H=2
GR8_ABCD_L=2
GR8_NOREX=8
GR8_NOREX+GR64_NOREX_and_GR64_TC=8
GR8_NOREX+GR64_TCW64=8
GR64_NOREX=12
GR8=8
GR8_NOREX+GR64_TC=8
GR8+GR64_TCW64=8
GR64_NOREX+GR64_TC=12
GR8+GR64_NOREX=12
GR8+GR64_TC=8
GR64=106
Live In: %BL %BP %BX %CH %CL %DL
Live Out:
Live Thru:
  LiveReg: AH
  LiveReg: AL
GR8_NOREX Limit 4 Actual 8
GR64 Limit 30 Actual 106
Excess PSets: GR8_NOREX GR64
Disabled scoreboard hazard recognizer
Disabled scoreboard hazard recognizer
... Then a bunch of SU(*) entries.

Note that you should be able to run llc on the (gzipped) .ll file that I
attached, and I'm guessing that you'll obtain the same thing.
Quuxplusone commented 10 years ago

Again, only CEvaluator::EvalMobs is of consequence. If need be, I can streamline the file to include that function and its inlined callees.

Quuxplusone commented 10 years ago
> * there's an issue with join-liveintervals cause about 38 spills

Yes, this is a known issue with our splitting mechanism. PR18825 is another
example where we hit that.
http://llvm.org/bugs/show_bug.cgi?id=18825

> * finally, a + 2 * b where a and b are unsigned 32 bit integers is not
> folded into a single lea instruction.

Is it worth filing another PR for this specific issue?

Regarding the register allocator problem, could you reduce the test case even
more? I.e., even if the reduced test case does not show such a huge difference
in spill code between clang and gcc/icc, if we can work on something smaller
that still shows a reasonable increase in spills, that would speed up the
process :).

Thanks.
Quuxplusone commented 10 years ago
Ok, I'll file another report for a + 2*b not being LEA'd. It's a different
issue indeed.

I'll make an attempt to simplify the function as well.
Quuxplusone commented 10 years ago

Attached EvalStripped.cpp.ll (262330 bytes, application/octet-stream): .ll file with just the pathological function (CEvaluator::EvalMobs)

Quuxplusone commented 10 years ago
Alright, so I've added a file with just the function. The function is still the
same (it is not simplified). While preparing it, I noticed something
interesting.

Basically, at a higher level the function uses a lot of accumulator addition:

value += compute something 0
value += compute something 1
...
value += compute something N

return value

The frontend (which produces the .ll file) seems to completely disregard that
ordering, though. It calculates something0 through somethingN and puts them
into IR registers; then, at the end, it does all the chained additions:

  %628 = add i32 %281, %276, !dbg !1207
  %629 = add i32 %628, %287, !dbg !1212
  %630 = add i32 %629, %370, !dbg !1178
  %631 = add i32 %630, %375, !dbg !1183
  %632 = add i32 %631, %381, !dbg !1135
  %633 = add i32 %632, %459, !dbg !1140
  %634 = add i32 %633, %464, !dbg !1079
  %635 = add i32 %634, %470, !dbg !1084
  %636 = add i32 %635, %506, !dbg !1121
  %637 = add i32 %636, %511, !dbg !1164
  %638 = add i32 %637, %517, !dbg !1193
  %639 = add i32 %638, %611, !dbg !1226
  %640 = add i32 %639, %607, !dbg !1231
  %641 = add i32 %640, %615, !dbg !1250
  %642 = add i32 %641, %619, !dbg !1236
  %643 = add i32 %642, %623, !dbg !1241
  %644 = add i32 %643, %627, !dbg !1245
  ret i32 %644, !dbg !970
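
In source terms, the contrast looks roughly like this (a schematic sketch;
compute0..compute3 are hypothetical stand-ins for the row/column evaluations):

    int compute0(), compute1(), compute2(), compute3();  // hypothetical helpers

    // As written in the source: each addition is interleaved with the
    // computation feeding it, so only the running sum and one temporary
    // are live at a time.
    int interleaved() {
        int value = 0;
        value += compute0();
        value += compute1();
        value += compute2();
        value += compute3();
        return value;
    }

    // The shape the .ll file shows: every term is computed up front and the
    // chain of adds comes last, so all the temporaries are live at once,
    // which inflates register pressure.
    int hoistedThenSummed() {
        int t0 = compute0(), t1 = compute1(), t2 = compute2(), t3 = compute3();
        return ((t0 + t1) + t2) + t3;
    }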

By separating the chained addition from the actual computation, the frontend
(TTBOMK) significantly increases the register pressure. I would expect the
misched pass to try to reschedule the instructions and reduce the register
pressure (which in this case seems to be the result of an artificial
reordering by the frontend). Interestingly, turning off misched helps (though
not as much as turning off join-liveintervals). Here are some updated stats
(note that I have revised my methodology slightly; e.g., I'm now counting an
add %reg, (memory location) as a load, an add and a store).

Baseline, neither opt turned off:

Total st stores     loads  st loads     moves       alu      mult    popcnt
  734        76       122        76        85       346        29         0

    add   134
    and    95
    dec     1
   imul    29
    lea     8
    mov   233
 movabs    30
 movzbl     8
 movzwl     6
     or    12
    pop     6
   push     6
    sar     1
    shl     6
    shr    73
    sub     2

misched turned off:

Total st stores     loads  st loads     moves       alu      mult    popcnt
  693        45       122        45       106       346        29         0

    add   134
    and    95
    dec     1
   imul    29
    lea     8
    mov   192
 movabs    31
 movzbl     8
 movzwl     6
     or    12
    pop     6
   push     6
    sar     1
    shl     6
    shr    73
    sub     2

livejoin turned off:

Total st stores     loads  st loads     moves       alu      mult    popcnt
  702        37       122        37       131       346        29         0

    add   134
    and    95
    dec     1
   imul    29
    lea     8
    mov   202
 movabs    30
 movzbl     8
 movzwl     6
     or    12
    pop     6
   push     6
    sar     1
    shl     6
    shr    73
    sub     2

Both livejoin and misched turned off:

Total st stores     loads  st loads     moves       alu      mult    popcnt
  714        45       122        45       127       346        29         0

    add   134
    and    95
    dec     1
   imul    29
    lea     8
    mov   214
 movabs    30
 movzbl     8
 movzwl     6
     or    12
    pop     6
   push     6
    sar     1
    shl     6
    shr    73
    sub     2

The interaction between the two (livejoin and misched) is non-linear.
Quuxplusone commented 10 years ago

Attached EvalStripped2.ll (218027 bytes, application/octet-stream): .ll file with simplified pathological function

Quuxplusone commented 10 years ago
I've added another file, which has about half of the pathological function.
It's still misbehaving.

Note that reducing it further (to half of that function) does not reproduce
the problem; llvm works really well with either half. It seems that a certain
threshold needs to be crossed for the massive spills to appear.

In any case, stats for EvalStripped2.ll:
baseline:
Total st stores     loads  st loads     moves       alu      mult    popcnt
  389        19        69        19        59       194        29         0

    add    75
    and    62
    dec     1
   imul    29
    mov    85
 movabs    22
 movzbl     4
 movzwl     2
    pop     6
   push     6
    shr    49
    sub     1

misched disabled:
Total st stores     loads  st loads     moves       alu      mult    popcnt
  396        21        69        21        62       194        29         0

    add    75
    and    62
    dec     1
   imul    29
    mov    92
 movabs    22
 movzbl     4
 movzwl     2
    pop     6
   push     6
    shr    49
    sub     1

join liveintervals disabled:
Total st stores     loads  st loads     moves       alu      mult    popcnt
  408        18        69        18        80       194        29         0

    add    75
    and    62
    dec     1
   imul    29
    mov   104
 movabs    22
 movzbl     4
 movzwl     2
    pop     6
   push     6
    shr    49
    sub     1

both misched and join liveintervals disabled:

Total st stores     loads  st loads     moves       alu      mult    popcnt
  413        20        69        20        81       194        29         0

    add    75
    and    62
    dec     1
   imul    29
    mov   109
 movabs    22
 movzbl     4
 movzwl     2
    pop     6
   push     6
    shr    49
    sub     1

Based on my visual examination of EvalStripped2.ll:

Again, the frontend separates the chained summation from the computations,
thereby significantly increasing the register pressure, and misched can't
undo the damage. Of course, this is my non-llvm-er opinion.
Quuxplusone commented 10 years ago

Thanks Vlad!

I'll try to have a look this week.

Quuxplusone commented 10 years ago
(In reply to comment #15)
> Again, the frontend distills the chained summation, thereby significantly
> increasing the register pressure, and misched can't undo the damage.

I had a quick look into EvalStripped2.ll and indeed, it seems that we hoist
the something0...somethingN computations above the chain of additions.
To be able to recover from this situation in the register allocator, we would
need to be able to rematerialize the whole dependency chain.
That seems heroic to me and of little use in the general case. Therefore, I
believe the proper fix would be either in misched or earlier in the pipeline.

There may be good reasons why misched cannot recover from this situation (maybe
some kind of aliasing we need to honor?)

Vlad, could you check why misched cannot undo the damage?
Andy would be your source of information if you need help.

That way, we could determine the proper place to fix this.

I'll have a closer look, at some point, of the register allocator to be sure I
am not missing something.
Quuxplusone commented 10 years ago

I completely agree that it's not a register allocation problem.

By "earlier in the pipeline" you mean the front-end? It seems a bit strange to me that it separates computation chains like this. Obviously, neither gcc nor icc does it. Should I file a bug against clang then?

With respect to misched, it seems to be an issue of size. If I halve the chained additions, it is actually doing a much better job.

Andy, any thoughts on misched?

Quuxplusone commented 10 years ago
I agree with Quentin's guess that the scheduler should be able to handle it by
reducing pressure, but there might be some unexpected dependency that prevents
the scheduler from correcting it.

-debug-only=misched should show you the scheduling DAG, the ready queue at each
step, and why the scheduler chose to schedule an instruction.

Note that the scheduler should detect high register pressure *before* it
begins scheduling and use its selection heuristics in an attempt to reduce
pressure. It is also possible that the heuristics are failing to handle this
case. Unfortunately, I don't have time to look at it for the next couple of
weeks.
Quuxplusone commented 9 years ago

Sorry, I haven't looked into this case, but I want to point out that the problem may be in the DAG builder, not the scheduler itself.

One thing that mutates the DAG is load clustering. You can try -misched-cluster=false to see if that's the problem.