Open Quuxplusone opened 10 years ago
Attached Evaluator.ll.gz
(187631 bytes, application/gzip): file containing the pathological function CEvaluator::EvalMobs.
I haven't looked at the test case, but based on the statistics I would say we are, at least in part, paying the price of the register coalescer's aggressiveness (see the move counts).
You can try with -join-liveintervals=false to see what happens.
Also, do you know why we have so many ALU operations?
I do not think those are added by the register allocator, so it is hard to blame the register allocator unless the unallocated input code is already close to this.
Are we generating moves that look like ALU instructions?
Is the schedule the same?
-debug-only=misched should report the register pressure after coalescing somewhere in its output.
Thanks for the comments!
Passing down (-mllvm ) -join-liveintervals=false does help:
Total  st stores  loads  st loads  moves  alu  mult  popcnt
  694         37    122        37    147  322    28       1
Sure, there are more moves, but those should be cheaper than stores and loads (on recent processors, moves may even be eliminated entirely in the CPU front end).
Andrew, I passed the following param:
-mllvm -debug-only=misched
$ /usr/bin/clang++-3.5 -D_M_AMD64 -m64 -msse4 -O3 -std=c++0x -g3 -Wno-unused-result -Wno-deprecated -c /home/vlad/g2/ntest/src/Evaluator.cpp -o eval.o -mllvm -debug-only=misched
clang (LLVM option parsing): Unknown command line argument '-debug-only=misched'. Try: 'clang (LLVM option parsing) -help'
clang (LLVM option parsing): Did you mean '-debug-pass=misched'?
Regarding the ALU operations: I think I know what's going on with the ALU/move counts.
A common pattern in the code extracts a row from a bitboard (a bitboard is a 64-bit binary representation of the board; usually two such bitboards are used to represent a position). The code is pretty straightforward:
#define BB_EXTRACT_ROW_PATTERN(ROW) \
base2ToBase3Table[(empty >> (8 * (ROW))) & 0xff] + \
base2ToBase3Table[(mover >> (8 * (ROW))) & 0xff] * 2
TConfig Row0 = BB_EXTRACT_ROW_PATTERN(0);
value += pR1[Row0];
TConfig Row1 = BB_EXTRACT_ROW_PATTERN(1);
value += pR2[Row1];
(and so on 6 more times, and then 8 more times for a flipped board to extract the columns as well)
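For reference, here is the pattern wrapped into a minimal compilable sketch (my own reduction, not the original source; the typedef of TConfig and the element type of pR1/pR2 are guesses):

#include <cstdint>

typedef uint32_t TConfig;                 // placeholder for the real typedef
extern uint32_t base2ToBase3Table[256];   // 32-bit entries, as discussed below
extern const short pR1[], pR2[];          // pattern-value tables; element type is a guess

#define BB_EXTRACT_ROW_PATTERN(ROW) \
    (base2ToBase3Table[(empty >> (8 * (ROW))) & 0xff] + \
     base2ToBase3Table[(mover >> (8 * (ROW))) & 0xff] * 2)

int evalRows(uint64_t empty, uint64_t mover) {
    int value = 0;
    TConfig Row0 = BB_EXTRACT_ROW_PATTERN(0);
    value += pR1[Row0];
    TConfig Row1 = BB_EXTRACT_ROW_PATTERN(1);
    value += pR2[Row1];
    // ... six more rows, then eight columns on a flipped board
    return value;
}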
So what's the issue? Well, base2ToBase3Table is an array of 32 bit integers.
The address is actually ((empty >> (8 * (ROW))) & 0xff) * 4.
LLVM folds the * 4 into the AND and the shift, so you end up with (empty >> (8 * ROW - 2)) & 0x3fc (that is, 0xff << 2).
GCC and ICC don't do that; they instead use the more complex addressing modes.
The number of instructions is the same though:
LLVM:
1: empty >> (8 * ROW - 2)
2: and #1 with 0x3fc
3: load #2
GCC and ICC:
1: empty >> (8 * ROW)
2: and with 0xff which is implemented as a movzbl
3: load #2 with a multiplier of 4
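The two sequences compute the same address; here is a quick sanity check of the fold (my own sketch; the identity holds whenever the shift amount is at least 2, i.e. for ROW >= 1 here):

#include <cassert>
#include <cstdint>

void checkFold(uint64_t empty, unsigned s) {      // s = 8 * ROW, with s >= 2
    uint64_t scaled = ((empty >> s) & 0xff) * 4;  // index scaled by sizeof(int32_t)
    uint64_t folded = (empty >> (s - 2)) & 0x3fc; // the form LLVM emits
    assert(scaled == folded);
}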
To improve the comparison I migrated movzbl and movzwl into the ALU section.
gcc:
Total  st stores  loads  st loads  moves  alu  mult  popcnt
  547          9    122         9     81  297    28       1
clang with regular flags:
Total  st stores  loads  st loads  moves  alu  mult  popcnt
  713         75    122        75     76  336    28       1
clang with joining of live intervals disabled:
Total  st stores  loads  st loads  moves  alu  mult  popcnt
  694         37    122        37    133  336    28       1
Anyway:
a) I'm still trying to figure out where the rest of the difference in ALU instructions is coming from
b) is it possible to also disable the folding of the shift/multiply (by 4 in the example above) into the AND?
Oh yeah. Hidden flags like -debug-only are not available in the Release build. You (or someone else) need to run with a Release+Asserts or Debug build. If you ping me next week I can probably do it for you.
Ok, so I've identified another issue:
As I implied earlier, the function computes a lot of base-3 patterns. Most of
these base-3 pattern calculations take the form of:
base2ToBase3Table[(empty >> (8 * (ROW))) & 0xff] +
base2ToBase3Table[(mover >> (8 * (ROW))) & 0xff] * 2
Things work well up to the final summation. Essentially, gcc and icc generate
an lea instruction to do the addition, e.g.:
439fad: mov 0x768880(,%rax,4),%eax
439fb4: mov 0x768880(,%rsi,4),%esi
439fbb: lea (%rax,%rsi,2),%r11d
However, llvm produces an extra add:
434755: mov 0x6ad540(%rcx),%ecx
43475b: add %ecx,%ecx
43475d: add 0x6ad540(%rdx),%ecx
True, the number of instructions is the same. However(!), the final add is in fact two micro-ops (see Agner's instruction tables, http://agner.org/optimize/instruction_tables.pdf, page 185; this is not specific to Haswell).
This explains the discrepancy in ALU counts almost entirely, as I'm counting an
instruction such as add 0x6ad540(%rdx),%ecx as a load and an add.
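In source form, the shape that triggers this is roughly the following (a simplified sketch with the index computations elided; whether a given clang version still emits the extra add may of course vary):

#include <cstdint>

extern uint32_t base2ToBase3Table[256];

uint32_t rowConfig(unsigned emptyByte, unsigned moverByte) {
    // gcc/icc fold the a + 2 * b into a single lea;
    // clang (above) emits mov, add %ecx,%ecx, then add with a memory operand.
    return base2ToBase3Table[emptyByte] + base2ToBase3Table[moverByte] * 2;
}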
To summarize:
* there's an issue with join-liveintervals causing about 38 spills
* there's another issue behind the remaining 37 spills
* finally, a + 2 * b, where a and b are unsigned 32-bit integers, is not folded into a single lea instruction.
Andrew,
Per your request, I ran the misched thing. The output regarding register
pressure on EvalMobs looks as follows:
********** MI Scheduling **********
_ZNK10CEvaluator8EvalMobsERK4Pos2jj:BB#0
From: DBG_VALUE %RDI, %noreg, !"this"; line no:0
To: RETQ %EAX<kill>; dbg:/home/vlad/pet/git/ntest/src/Evaluator.cpp:564
RegionInstrs: 643 Remaining: 0
Max Pressure: GR8_ABCD_H=2
GR8_ABCD_L=2
GR8_NOREX=8
GR8_NOREX+GR64_NOREX_and_GR64_TC=8
GR8_NOREX+GR64_TCW64=8
GR64_NOREX=12
GR8=8
GR8_NOREX+GR64_TC=8
GR8+GR64_TCW64=8
GR64_NOREX+GR64_TC=12
GR8+GR64_NOREX=12
GR8+GR64_TC=8
GR64=106
Live In: %BL %BP %BX %CH %CL %DL
Live Out:
Live Thru:
LiveReg: AH
LiveReg: AL
GR8_NOREX Limit 4 Actual 8
GR64 Limit 30 Actual 106
Excess PSets: GR8_NOREX GR64
Disabled scoreboard hazard recognizer
Disabled scoreboard hazard recognizer
... Then a bunch of SU(*) entries.
Note that you should be able to run llc on the (gzipped) .ll file that I attached, and I'm guessing that you'll obtain the same thing.
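For reference, one plausible way to drive it (a sketch; as noted above, -debug-only needs an asserts-enabled build):
$ gunzip Evaluator.ll.gz
$ llc -O3 -debug-only=misched Evaluator.ll -o Evaluator.s 2> misched.log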
Again, only CEvaluator::EvalMobs is of consequence. If need be, I can streamline the file to include that function and its inlined callees.
> * there's an issue with join-liveintervals causing about 38 spills
Yes, this is a known issue with our splitting mechanism. PR18825 is another
example where we hit that.
http://llvm.org/bugs/show_bug.cgi?id=18825
> * finally, a + 2 * b, where a and b are unsigned 32-bit integers, is not
> folded into a single lea instruction.
Is it worth filing another PR for this specific issue?
Regarding the register allocator problem, could you reduce the test case even more? I.e., even if the reduced test case does not show such a huge difference in spill code between clang and gcc/icc, working on something smaller that still shows a reasonable increase in spill code would speed up the process :).
Thanks.
Ok, I'll file another report for a + 2*b not being LEA'd. It's a different
issue indeed.
I'll make an attempt to simplify the function as well.
Attached EvalStripped.cpp.ll
(262330 bytes, application/octet-stream): .ll file with just the pathological function (CEvaluator::EvalMobs)
Alright, so I've added a file with just the function. The function is still the
same (it is not simplified). While preparing it, I noticed something
interesting.
Basically, at a higher level the function uses a lot of accumulator addition:
value += compute something 0
value += compute something 1
...
value += compute something N
return value
The frontend (which produces the .ll file) seems to completely disregard that ordering, though. It computes something0 through somethingN into separate IR values, and only at the end performs the whole chain of additions.
%628 = add i32 %281, %276, !dbg !1207
%629 = add i32 %628, %287, !dbg !1212
%630 = add i32 %629, %370, !dbg !1178
%631 = add i32 %630, %375, !dbg !1183
%632 = add i32 %631, %381, !dbg !1135
%633 = add i32 %632, %459, !dbg !1140
%634 = add i32 %633, %464, !dbg !1079
%635 = add i32 %634, %470, !dbg !1084
%636 = add i32 %635, %506, !dbg !1121
%637 = add i32 %636, %511, !dbg !1164
%638 = add i32 %637, %517, !dbg !1193
%639 = add i32 %638, %611, !dbg !1226
%640 = add i32 %639, %607, !dbg !1231
%641 = add i32 %640, %615, !dbg !1250
%642 = add i32 %641, %619, !dbg !1236
%643 = add i32 %642, %623, !dbg !1241
%644 = add i32 %643, %627, !dbg !1245
ret i32 %644, !dbg !970
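To make the register-pressure effect concrete, here is a schematic contrast (my own sketch with hypothetical helpers f0()..f3(), not from the original source):

int f0(); int f1(); int f2(); int f3();

// Source-level shape: a single accumulator is live across each step.
int interleaved() {
    int value = 0;
    value += f0();
    value += f1();
    value += f2();
    value += f3();
    return value;
}

// Shape of the IR above: every partial result is computed first and stays
// live until the final chain of additions, so all of them compete for
// registers at once.
int hoisted() {
    int t0 = f0(), t1 = f1(), t2 = f2(), t3 = f3();
    return ((t0 + t1) + t2) + t3;
}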
By separating the chained additions from the actual computations, the frontend (TTBOMK) significantly increases the register pressure. I'm assuming that the misched pass would try to reschedule the instructions and reduce the register pressure (which in this case seems to have been inflated artificially by the frontend). Interestingly, turning off misched helps (though not as much as turning off join-liveintervals). Here are some updated stats (note that I have revised my methodology slightly, e.g. I'm now counting an add %reg, (memory location) as a load, an add, and a store).
Baseline, neither opt turned off:
Total  st stores  loads  st loads  moves  alu  mult  popcnt
  734         76    122        76     85  346    29       0
add 134
and 95
dec 1
imul 29
lea 8
mov 233
movabs 30
movzbl 8
movzwl 6
or 12
pop 6
push 6
sar 1
shl 6
shr 73
sub 2
misched turned off:
Total  st stores  loads  st loads  moves  alu  mult  popcnt
  693         45    122        45    106  346    29       0
add 134
and 95
dec 1
imul 29
lea 8
mov 192
movabs 31
movzbl 8
movzwl 6
or 12
pop 6
push 6
sar 1
shl 6
shr 73
sub 2
livejoin turned off:
Total  st stores  loads  st loads  moves  alu  mult  popcnt
  702         37    122        37    131  346    29       0
add 134
and 95
dec 1
imul 29
lea 8
mov 202
movabs 30
movzbl 8
movzwl 6
or 12
pop 6
push 6
sar 1
shl 6
shr 73
sub 2
Both livejoin and misched turned off:
Total  st stores  loads  st loads  moves  alu  mult  popcnt
  714         45    122        45    127  346    29       0
add 134
and 95
dec 1
imul 29
lea 8
mov 214
movabs 30
movzbl 8
movzwl 6
or 12
pop 6
push 6
sar 1
shl 6
shr 73
sub 2
The interaction between the two (livejoin and misched) is non-linear.
Attached EvalStripped2.ll
(218027 bytes, application/octet-stream): .ll file with simplified pathological function
I've added another file, which has about half of the pathological function.
It's still misbehaving.
Note that reducing it further (to half of that function again) does not help; llvm works really well with either half. It seems that a certain threshold needs to be reached before the massive spills appear.
In any case, stats for EvalStripped2.ll:
baseline:
Total  st stores  loads  st loads  moves  alu  mult  popcnt
  389         19     69        19     59  194    29       0
add 75
and 62
dec 1
imul 29
mov 85
movabs 22
movzbl 4
movzwl 2
pop 6
push 6
shr 49
sub 1
misched disabled:
Total  st stores  loads  st loads  moves  alu  mult  popcnt
  396         21     69        21     62  194    29       0
add 75
and 62
dec 1
imul 29
mov 92
movabs 22
movzbl 4
movzwl 2
pop 6
push 6
shr 49
sub 1
join liveintervals disabled:
Total  st stores  loads  st loads  moves  alu  mult  popcnt
  408         18     69        18     80  194    29       0
add 75
and 62
dec 1
imul 29
mov 104
movabs 22
movzbl 4
movzwl 2
pop 6
push 6
shr 49
sub 1
both misched and join liveintervals disabled:
Total  st stores  loads  st loads  moves  alu  mult  popcnt
  413         20     69        20     81  194    29       0
add 75
and 62
dec 1
imul 29
mov 109
movabs 22
movzbl 4
movzwl 2
pop 6
push 6
shr 49
sub 1
Based on my visual examination of EvalStripped2.ll:
Again, the frontend separates out the chained summation, thereby significantly increasing the register pressure, and misched can't undo the damage. Of course, this is my non-llvm-er opinion.
Thanks Vlad!
I'll try to have a look this week.
(In reply to comment #15)
> Again, the frontend separates out the chained summation, thereby significantly
> increasing the register pressure, and misched can't undo the damage.
I had a quick look into EvalStripped2.ll, and it does indeed seem that we hoist something0...somethingN before doing the final computation.
To be able to recover from this situation in the register allocator, we would need to be able to rematerialize the whole dependency chain.
That seems heroic to me and of little use in the general case. Therefore, I believe the proper fix would be either in misched or earlier in the pipeline.
There may be good reasons why misched cannot recover from this situation (maybe some kind of aliasing we need to honor?).
Vlad, could you check why misched cannot undo the damage? Andy would be your source of information if you need help.
That way, we can determine the proper place to fix this.
I'll have a closer look at the register allocator at some point, to be sure I am not missing something.
I completely agree that it's not a register allocation problem.
By "earlier in the pipeline" you mean the front-end? It seems a bit strange to me that it separates computation chains like this. Obviously, neither gcc nor icc do it. Should I file a bug against clang then?
With respect to misched, it seems to be an issue of size. If I halve the chained additions, it is actually doing a much better job.
Andy, any thoughts on misched?
I agree with Quentin's guess that the scheduler should be able to handle this by reducing pressure, but there might be some unexpected dependency that prevents the scheduler from correcting it.
-debug-only=misched should show you the scheduling DAG, the ready queue at each
step, and why the scheduler chose to schedule an instruction.
Note that the scheduler should detect high register pressure *before* it begins scheduling and select heuristics that attempt to reduce pressure. It is also possible that those heuristics are failing to handle this case.
Unfortunately, I don't have time to look at it for the next couple weeks.
Sorry, I haven't looked into this case, but I want to point out that the problem may be in the DAG builder, not the scheduler itself.
One thing that mutates the DAG is load clustering. You can try -misched-cluster=false to see if that's the problem.
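For example (a sketch; when going through clang the option would be prefixed with -mllvm):
$ llc -O3 -misched-cluster=false EvalStripped2.ll -o EvalStripped2.s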