Open Quuxplusone opened 7 years ago
Attached salsabasicblock.png
(911996 bytes, image/png): dag dump pre-scheduling
Attached salsair.ll
(9472 bytes, text/plain): input IR
For easier viewing, the code is on godbolt
Please note that gcc generates the same sequence (4 adds followed by 4 eors, grouped by ror
shift) even without tuning, so maybe there's a missing transform in LLVM that groups these together. Thoughts?
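For context, the add/eor/ror pattern discussed here comes from the Salsa20 quarter-round. The sketch below is a plain Python rendering of the quarter-round as defined in the Salsa20 specification; it is not the test-suite source, and the helper names (`rotl32`, `ror32`, `quarterround`) are chosen for illustration only. A rotate-left by 7, 9, 13 or 18 is the same as a rotate-right by 25, 23, 19 or 14, which lines up with the `ror` amounts seen in the AArch64 disassembly in this thread.

```python
MASK = 0xFFFFFFFF  # Salsa20 operates on 32-bit words


def rotl32(x, n):
    """Rotate a 32-bit word left by n bits (0 < n < 32)."""
    x &= MASK
    return ((x << n) | (x >> (32 - n))) & MASK


def ror32(x, n):
    """Rotate a 32-bit word right by n bits (0 < n < 32)."""
    x &= MASK
    return ((x >> n) | (x << (32 - n))) & MASK


def quarterround(y0, y1, y2, y3):
    # Each step is one add followed by one xor with a rotated operand,
    # which on AArch64 can become "add" + "eor ..., ror #k".
    z1 = y1 ^ rotl32(y0 + y3, 7)    # rotl 7  == ror #25
    z2 = y2 ^ rotl32(z1 + y0, 9)    # rotl 9  == ror #23
    z3 = y3 ^ rotl32(z2 + z1, 13)   # rotl 13 == ror #19
    z0 = y0 ^ rotl32(z3 + z2, 18)   # rotl 18 == ror #14
    return z0, z1, z2, z3
```

The four steps inside one quarter-round form a dependency chain, but the four quarter-rounds of a column or row round work on disjoint words, so their matching steps are independent of each other. That independence is presumably what lets a compiler emit 4 adds followed by 4 eors per rotate amount.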
Do you know why the LLVM version is slower?
I mean from a high level view (meaning I actually didn't look at the dependencies and such), I would have expected the OOO engine to perform well in both cases.
Oh I see that there are differences in the ror modifiers. Does that change the dependency graph?
Cortex-A53 also doesn't have an out of order engine (I think). It's the little core.
(In reply to Quentin Colombet from comment #4)
> Do you know why the LLVM version is slower?
>
> I mean from a high level view (meaning I actually didn't look at the
> dependencies and such), I would have expected the OOO engine to perform well
> in both cases.
It does on an out-of-order engine (checked), but there are still in-order processors
available (the same would be true for ARM, e.g. Cortex-A7, which suffers from
the same stalling problem). I measured with a CPU profiler on Linux and the two
runs show a substantial difference in the number of stalls.
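To illustrate why instruction order matters on an in-order core but not much on an out-of-order one, here is a toy issue model in Python. It is only a sketch under stated assumptions: a fixed issue width, a uniform latency, no structural hazards, and a hypothetical `cycles_in_order` helper. It is not the Cortex-A53 pipeline or LLVM's scheduling model.

```python
def cycles_in_order(instrs, width=2, latency=1):
    """Count cycles for a toy in-order core.

    instrs: list of (dest, sources) tuples in program order.
    An instruction issues only once every source is ready, and a
    stalled instruction blocks everything behind it.
    """
    ready_at = {}  # register -> first cycle in which its value can be read
    cycle = 0
    i = 0
    while i < len(instrs):
        issued = 0
        while i < len(instrs) and issued < width:
            dest, srcs = instrs[i]
            if any(ready_at.get(s, 0) > cycle for s in srcs):
                break  # in-order: stop at the first stalled instruction
            ready_at[dest] = cycle + latency
            issued += 1
            i += 1
        cycle += 1
    return cycle


# add/eor pairs emitted back to back: each eor waits for its add.
interleaved = [("t0", ["a", "b"]), ("x0", ["x0", "t0"]),
               ("t1", ["c", "d"]), ("x1", ["x1", "t1"])]
# Same work, but the independent adds are grouped before the eors.
grouped = [("t0", ["a", "b"]), ("t1", ["c", "d"]),
           ("x0", ["x0", "t0"]), ("x1", ["x1", "t1"])]
```

In this model `cycles_in_order(interleaved)` takes one cycle more than `cycles_in_order(grouped)`, because the second issue slot is wasted whenever an eor directly follows the add it depends on.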
Thanks. That explains it :).
It's scheduling bottom up, and after scheduling two `rotates`, decides to avoid increasing register pressure by scheduling the `add`. Supposedly, that's ok because the machine is dual-issue and the instructions only have one cycle latency.
In the LLVM version, there are 9 adds + 9 eors instead of 8. So it's off to a bad start.
Also, in the gcc version, I don't see any cyclic dependencies between the adds and the eors. If you do have OOO, that's why you have stalls.
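Counts like the ones above can be checked mechanically. The following is a minimal sketch that tallies mnemonics in a disassembly listing, assuming each line has the `address: encoding mnemonic operands` shape used by the snippets in this thread; `count_mnemonics` is a hypothetical helper, not part of any LLVM tool.

```python
from collections import Counter


def count_mnemonics(listing):
    """Count mnemonics in 'addr: encoding mnemonic operands' lines."""
    counts = Counter()
    for line in listing.splitlines():
        fields = line.split()
        # Expect at least: "40:", "0b120113", "add", ...
        if len(fields) >= 3 and fields[0].endswith(":"):
            counts[fields[2]] += 1
    return counts


sample = """\
40: 0b120113 add w19, w8, w18
44: 0b0d0134 add w20, w9, w13
48: 4ad365ce eor w14, w14, w19, ror #25
[...]
"""
```

Lines that do not match the expected shape, such as `[...]`, are skipped, so the same function can be run over a whole pasted loop body from either compiler.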
Attached scheduling.txt
(194655 bytes, text/plain): -debug-only=misched output
Attached scheduling-salsa.txt
(401027 bytes, text/plain): correct schedule dump
(In reply to Andrew Trick from comment #9)
> It's scheduling bottom up, and after scheduling two `rotates`, decides to
> avoid increasing register pressure by scheduling the `add`. Supposedly,
> that's ok because the machine is dual-issue and the instructions only have
> one cycle latency.
>
I also tried to disable register pressure as input to the scheduler and the
result I got is very similar. Snippet:
[...]
40: 0b120113 add w19, w8, w18
44: 0b0d0134 add w20, w9, w13
48: 4ad365ce eor w14, w14, w19, ror #25
4c: 4ad464a5 eor w5, w5, w20, ror #25
50: 0b1201d3 add w19, w14, w18
54: 0b0d00b4 add w20, w5, w13
58: 4ad35d6b eor w11, w11, w19, ror #23
5c: 0b020195 add w21, w12, w2
60: 4ad45e10 eor w16, w16, w20, ror #23
64: 0b0e0173 add w19, w11, w14
68: 0b050214 add w20, w16, w5
6c: 4ad56484 eor w4, w4, w21, ror #25
70: 4ad34d08 eor w8, w8, w19, ror #19
74: 0b0f0153 add w19, w10, w15
78: 0b020095 add w21, w4, w2
7c: 4ad44d29 eor w9, w9, w20, ror #19
80: 4ad364e7 eor w7, w7, w19, ror #25
84: 0b100133 add w19, w9, w16
88: 4ad55cc6 eor w6, w6, w21, ror #23
8c: 0b0400d5 add w21, w6, w4
90: 4ad339ad eor w13, w13, w19, ror #14
94: 0b0f00f3 add w19, w7, w15
98: 0b0b0114 add w20, w8, w11
9c: 4ad54d8c eor w12, w12, w21, ror #19
a0: 4ad35e31 eor w17, w17, w19, ror #23
a4: 4ad43a52 eor w18, w18, w20, ror #14
a8: 0b060194 add w20, w12, w6
ac: 0b070233 add w19, w17, w7
b0: 4ad43842 eor w2, w2, w20, ror #14
b4: 4ad34d4a eor w10, w10, w19, ror #19
b8: 0b1200f4 add w20, w7, w18
bc: 0b110153 add w19, w10, w17
c0: 4ad46529 eor w9, w9, w20, ror #25
c4: 4ad339ef eor w15, w15, w19, ror #14
c8: 0b120133 add w19, w9, w18
cc: 0b0d01d4 add w20, w14, w13
d0: 4ad35cc6 eor w6, w6, w19, ror #23
d4: 4ad4658c eor w12, w12, w20, ror #25
[...]
In the correct_schedule dump, it looks like the pre-RA scheduler is scheduling
65 instructions in 66 cycles.
Scheduling SU(63) %vreg175<def> = EORWrs %vreg105, %vreg136, 206;
GPR32:%vreg175,%vreg105,%vreg136
Ready @2c
** ScheduleDAGMILive::schedule picking next node
SU(55) A53UnitALU=3c
SU(47) A53UnitALU=3c
SU(39) A53UnitALU=3c
checkHazard thinks that the ALU is reserved until the next cycle. Like there's
only one ALU.
Try with the latest LLVM if you can. Then I think you could create a very small
.ll test case to expose this problem. Sorry I'm not available to debug it.
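The "only one ALU" observation can be made concrete with a small resource-reservation sketch. It assumes independent, single-cycle instructions that each occupy one ALU for one cycle; the function name and the unit counts are illustrative and are not taken from the A53 scheduling model.

```python
def schedule_length(num_instrs, num_units):
    """Cycles needed to issue independent single-cycle instructions
    when each one reserves a single unit for exactly one cycle."""
    next_free = [0] * num_units  # first free cycle of each unit
    last = 0
    for _ in range(num_instrs):
        # Pick the unit that becomes free earliest.
        unit = min(range(num_units), key=lambda k: next_free[k])
        start = next_free[unit]
        next_free[unit] = start + 1
        last = max(last, start + 1)
    return last
```

With one unit, 65 instructions need 65 cycles, close to the 66 cycles reported in the dump; with two units the same count would fit in 33 cycles. Real code has dependencies, so these figures are only lower bounds, but the gap shows what a hazard check that treats the ALU as a single resource would cost.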
(In reply to Andrew Trick from comment #12)
> In the correct_schedule dump, it looks like the pre-RA scheduler is
> scheduling 65 instructions in 66 cycles.
>
> Scheduling SU(63) %vreg175<def> = EORWrs %vreg105, %vreg136, 206;
> GPR32:%vreg175,%vreg105,%vreg136
> Ready @2c
>
> ** ScheduleDAGMILive::schedule picking next node
> SU(55) A53UnitALU=3c
> SU(47) A53UnitALU=3c
> SU(39) A53UnitALU=3c
>
> checkHazard thinks that the ALU is reserved until the next cycle. Like
> there's only one ALU.
>
> Try with the latest LLVM if you can. Then I think you could create a very
> small .ll test case to expose this problem. Sorry I'm not available to debug
> it.
Thanks Andrew.
This is the latest LLVM ('ish)
$ ./clang --version
clang version 5.0.0 (https://github.com/llvm-mirror/clang
31801a78220872a1dcee9b261328ce7239eaa7e9) (https://github.com/llvm-mirror/llvm
39984d5813cf79959c014093f5b9a3c71a6d7272)
Target: aarch64-unknown-linux-gnu
Thread model: posix
InstalledDir: /home/davide/compilers/clang-install/bin/.
I'll try to reduce it to something manageable and debug. Sorry I can't be more
helpful, the scheduler is not my cup of tea (I just happened to notice this
problem while looking at something else).
Created attachment 18180 dag dump pre-scheduling
This is the salsa20 benchmark from the test suite (SingleSource). I'm not sure whether the model can be improved or whether this is a general issue with the instruction scheduler heuristics.
Passing -O3 -mcpu=cortex-a53 -mtune=cortex-a53, LLVM generates the following code for the hot loop (subset of instructions):
while gcc 7:
The latter results in many more stalls and a ~20% runtime regression. The SelectionDAG for the BB pre-scheduling and the initial IR are attached.