Quuxplusone / LLVMBugzillaTest


Poor instruction scheduling for salsa20 cipher hot loop #31411

Open Quuxplusone opened 7 years ago

Quuxplusone commented 7 years ago
Bugzilla Link PR32439
Status NEW
Importance P enhancement
Reported by Davide Italiano (ditaliano@apple.com)
Reported on 2017-03-27 13:51:23 -0700
Last modified on 2017-03-29 14:52:57 -0700
Version trunk
Hardware PC All
CC atrick@apple.com, efriedma@quicinc.com, llvm-bugs@lists.llvm.org, matze@braunis.de, mcrosier@codeaurora.org, quentin.colombet@gmail.com, t.p.northover@gmail.com
Fixed by commit(s)
Attachments salsabasicblock.png (911996 bytes, image/png)
salsair.ll (9472 bytes, text/plain)
scheduling.txt (194655 bytes, text/plain)
scheduling-salsa.txt (401027 bytes, text/plain)
Blocks
Blocked by
See also

Created attachment 18180 dag dump pre-scheduling

This is the salsa20 benchmark from the testsuite (SingleSource). I'm not sure whether the scheduling model can be improved or whether this is a general issue with the instruction scheduler heuristics.

Passing -O3 -mcpu=cortex-a53 -mtune=cortex-a53, LLVM generates the following code for the hot loop (subset of instructions):

  400638:       0b100106        add     w6, w8, w16
  40063c:       0b0d0127        add     w7, w9, w13
  400640:       4ac66400        eor     w0, w0, w6, ror #25
  400644:       0b120166        add     w6, w11, w18
  400648:       4ac76463        eor     w3, w3, w7, ror #25
  40064c:       4ac66442        eor     w2, w2, w6, ror #25
  400650:       0b100006        add     w6, w0, w16
  400654:       0b0d0067        add     w7, w3, w13
  400658:       4ac65d8c        eor     w12, w12, w6, ror #23
  40065c:       0b120046        add     w6, w2, w18
  400660:       4ac75def        eor     w15, w15, w7, ror #23
  400664:       4ac65c84        eor     w4, w4, w6, ror #23
  400668:       0b000186        add     w6, w12, w0
  40066c:       0b0301e7        add     w7, w15, w3
  400670:       4ac64d08        eor     w8, w8, w6, ror #19
  400674:       0b020086        add     w6, w4, w2
  400678:       4ac74d29        eor     w9, w9, w7, ror #19
  40067c:       4ac64d6b        eor     w11, w11, w6, ror #19

while gcc 7 generates:

  400688:       0b020175        add     w21, w11, w2
  40068c:       0b040214        add     w20, w16, w4
  400690:       0b050233        add     w19, w17, w5
  400694:       0b030192        add     w18, w12, w3
  400698:       4ad56508        eor     w8, w8, w21, ror #25
  40069c:       4ad464e7        eor     w7, w7, w20, ror #25
  4006a0:       4ad364c6        eor     w6, w6, w19, ror #25
  4006a4:       4ad26529        eor     w9, w9, w18, ror #25
  4006a8:       0b0b0115        add     w21, w8, w11
  4006ac:       0b1000f4        add     w20, w7, w16
  4006b0:       0b1100d3        add     w19, w6, w17
  4006b4:       0b0c0132        add     w18, w9, w12
  4006b8:       4ad55d4a        eor     w10, w10, w21, ror #23
  4006bc:       4ad45dad        eor     w13, w13, w20, ror #23
  4006c0:       4ad35def        eor     w15, w15, w19, ror #23
  4006c4:       4ad25dce        eor     w14, w14, w18, ror #23

The former results in many more stalls and a ~20% runtime regression. The SelectionDAG for the basic block (pre-scheduling) and the initial IR are attached.

Quuxplusone commented 7 years ago

Attached salsabasicblock.png (911996 bytes, image/png): dag dump pre-scheduling

Quuxplusone commented 7 years ago

Attached salsair.ll (9472 bytes, text/plain): input IR

Quuxplusone commented 7 years ago

For easier viewing, on Godbolt:

https://godbolt.org/g/8XWHkg

Please note that gcc generates the same sequence (four adds followed by four eors, grouped by ror shift amount) even without tuning, so maybe there's a missing transform in LLVM that groups these together. Thoughts?

Quuxplusone commented 7 years ago

Do you know why the LLVM version is slower?

I mean, from a high-level view (I didn't actually look at the dependencies and such), I would have expected the OOO engine to perform well in both cases.

Quuxplusone commented 7 years ago

Oh, I see that there are differences in the ror modifiers. Does that change the dependency graph?

Quuxplusone commented 7 years ago

Cortex-A53 also doesn't have an out-of-order engine (I think); it's the little core.

Quuxplusone commented 7 years ago
(In reply to Quentin Colombet from comment #4)
> Do you know why the LLVM version is slower?
>
> I mean from a high level view (meaning I actually didn't look at the
> dependencies and such), I would have expected the OOO engine to perform well
> in both cases.

It does on an out-of-order engine (checked), but there are still in-order processors
available (the same would be true for ARM, e.g. Cortex-A7, which suffers from
the same stalling problem). I measured with a CPU profiler on Linux, and the two
runs show a substantial difference in the number of stalls.
Quuxplusone commented 7 years ago

Thanks. That explains it :).

Quuxplusone commented 7 years ago

It's scheduling bottom-up, and after scheduling two rotates, it decides to avoid increasing register pressure by scheduling the add. Presumably that's OK because the machine is dual-issue and the instructions have only one-cycle latency.

In the LLVM version, there are 9 adds + 9 eors instead of 8. So it's off to a bad start.

Also, in the gcc version, I don't see any cyclic dependencies between the adds and the eors. If you don't have OOO, that's why you have stalls.

Quuxplusone commented 7 years ago

Attached scheduling.txt (194655 bytes, text/plain): -debug-only=misched output

Quuxplusone commented 7 years ago

Attached scheduling-salsa.txt (401027 bytes, text/plain): correct schedule dump

Quuxplusone commented 7 years ago
(In reply to Andrew Trick from comment #9)
> It's scheduling bottom up, and after scheduling two `rotates`, decides to
> avoid increasing register pressure by scheduling the `add`. Supposedly,
> that's ok because the machine is dual-issue and the instructions only have
> one cycle latency.
>

I also tried disabling register pressure as an input to the scheduler, and the
result I got is very similar (snippet):
[...]
  40:   0b120113        add     w19, w8, w18
  44:   0b0d0134        add     w20, w9, w13
  48:   4ad365ce        eor     w14, w14, w19, ror #25
  4c:   4ad464a5        eor     w5, w5, w20, ror #25
  50:   0b1201d3        add     w19, w14, w18
  54:   0b0d00b4        add     w20, w5, w13
  58:   4ad35d6b        eor     w11, w11, w19, ror #23
  5c:   0b020195        add     w21, w12, w2
  60:   4ad45e10        eor     w16, w16, w20, ror #23
  64:   0b0e0173        add     w19, w11, w14
  68:   0b050214        add     w20, w16, w5
  6c:   4ad56484        eor     w4, w4, w21, ror #25
  70:   4ad34d08        eor     w8, w8, w19, ror #19
  74:   0b0f0153        add     w19, w10, w15
  78:   0b020095        add     w21, w4, w2
  7c:   4ad44d29        eor     w9, w9, w20, ror #19
  80:   4ad364e7        eor     w7, w7, w19, ror #25
  84:   0b100133        add     w19, w9, w16
  88:   4ad55cc6        eor     w6, w6, w21, ror #23
  8c:   0b0400d5        add     w21, w6, w4
  90:   4ad339ad        eor     w13, w13, w19, ror #14
  94:   0b0f00f3        add     w19, w7, w15
  98:   0b0b0114        add     w20, w8, w11
  9c:   4ad54d8c        eor     w12, w12, w21, ror #19
  a0:   4ad35e31        eor     w17, w17, w19, ror #23
  a4:   4ad43a52        eor     w18, w18, w20, ror #14
  a8:   0b060194        add     w20, w12, w6
  ac:   0b070233        add     w19, w17, w7
  b0:   4ad43842        eor     w2, w2, w20, ror #14
  b4:   4ad34d4a        eor     w10, w10, w19, ror #19
  b8:   0b1200f4        add     w20, w7, w18
  bc:   0b110153        add     w19, w10, w17
  c0:   4ad46529        eor     w9, w9, w20, ror #25
  c4:   4ad339ef        eor     w15, w15, w19, ror #14
  c8:   0b120133        add     w19, w9, w18
  cc:   0b0d01d4        add     w20, w14, w13
  d0:   4ad35cc6        eor     w6, w6, w19, ror #23
  d4:   4ad4658c        eor     w12, w12, w20, ror #25
[...]
Quuxplusone commented 7 years ago
In the correct_schedule dump, it looks like the pre-RA scheduler is scheduling
65 instructions in 66 cycles.

Scheduling SU(63) %vreg175<def> = EORWrs %vreg105, %vreg136, 206;
GPR32:%vreg175,%vreg105,%vreg136
  Ready @2c

** ScheduleDAGMILive::schedule picking next node
  SU(55) A53UnitALU=3c
  SU(47) A53UnitALU=3c
  SU(39) A53UnitALU=3c

checkHazard thinks that the ALU is reserved until the next cycle, as if there's
only one ALU.

Try with the latest LLVM if you can. Then I think you could create a very small
.ll test case to expose this problem. Sorry I'm not available to debug it.
Quuxplusone commented 7 years ago
(In reply to Andrew Trick from comment #12)
> In the correct_schedule dump, it looks like the pre-RA scheduler is
> scheduling 65 instructions in 66 cycles.
>
> Scheduling SU(63) %vreg175<def> = EORWrs %vreg105, %vreg136, 206;
> GPR32:%vreg175,%vreg105,%vreg136
>   Ready @2c
>
> ** ScheduleDAGMILive::schedule picking next node
>   SU(55) A53UnitALU=3c
>   SU(47) A53UnitALU=3c
>   SU(39) A53UnitALU=3c
>
> checkHazard thinks that the ALU is reserved until the next cycle. Like
> there's only one ALU.
>
> Try with the latest LLVM if you can. Then I think you could create a very
> small .ll test case to expose this problem. Sorry I'm not available to debug
> it.

Thanks Andrew.

This is the latest LLVM ('ish):

$ ./clang --version
clang version 5.0.0 (https://github.com/llvm-mirror/clang
31801a78220872a1dcee9b261328ce7239eaa7e9) (https://github.com/llvm-mirror/llvm
39984d5813cf79959c014093f5b9a3c71a6d7272)
Target: aarch64-unknown-linux-gnu
Thread model: posix
InstalledDir: /home/davide/compilers/clang-install/bin/.

I'll try to reduce this to something manageable and debug it. Sorry I can't be
more helpful; the scheduler is not my cup of tea (I just happened to notice this
problem while looking at something else).