[llvm-mca] bottleneck-analysis result conflict with timeline result on aarch64


Bugzilla Link	PR47380
Status	NEW
Importance	P normal
Reported by	Eliana Xie(谢洁) (eliana.x@huawei.com)
Reported on	2020-09-01 03:28:01 -0700
Last modified on	2021-03-04 05:29:27 -0800
Version	trunk
Hardware	Other Linux
CC	andrea.dibiagio@gmail.com, lebedev.ri@gmail.com, llvm-bugs@lists.llvm.org, llvm-dev@redking.me.uk, matthew.davis@sony.com
Fixed by commit(s)
Attachments
Blocks
Blocked by
See also

LLVM (http://llvm.org/):
  LLVM version 12.0.0git
  DEBUG build with assertions.
  Default target: aarch64-unknown-linux-gnu
  Host CPU: tsv110

## bottleneck-analysis
command:

../bin/llvm-mca -mcpu=tsv110 -dispatch=4 -mtriple=aarch64-unkonw-linux-gnu -iterations=10 --bottleneck-analysis float_test.s

result:

Critical sequence based on the simulation:

              Instruction                                 Dependency Information
 +----< 0.    adrp  x6, #0
 +----> 1.    add   x7, x6, #2864                     ## REGISTER dependency:  x6
 |      2.    fmov  d6, #1.00000000
 +----> 3.    ldp   d0, d4, [x7, #296]                ## REGISTER dependency:  x7
 |      4.    ldr   d7, [x7, #312]
 |      5.    ldr   d19, [x7, #320]
 +----> 6.    fmul  d16, d2, d0                       ## REGISTER dependency:  d0
 +----> 7.    frinta    d5, d16                           ## REGISTER dependency:  d16
 +----> 8.    fcvtas    x8, d16                           ## RESOURCE interference:  A57UnitX [ probability: 10% ]
 |      9.    fsub  d17, d16, d5
 +----> 10.   and   x9, x8, #0x1f                     ## REGISTER dependency:  x8
 |      11.   fmadd d18, d4, d17, d7
 |      12.   fmadd d20, d19, d17, d6
 |      13.   fmul  d21, d17, d17
 +----> 14.   ldr   x10, [x7, x9, lsl #3]             ## REGISTER dependency:  x9
 |      15.   add   x11, x10, x8, lsl #47
 |      16.   fmov  d23, x11
 |      17.   fmadd d22, d18, d21, d20
 |      18.   fmul  d24, d22, d23
 |      19.   fcvt  s0, d24
 |      20.   ldr   x14, [x1, #4056]
 |      21.   ldr   x2, [sp, #24]
 +----> 22.   ldr   x15, [x14]                        ## RESOURCE interference:  A57UnitL [ probability: 20% ]
 +----> 23.   eor   x15, x2, x15                      ## REGISTER dependency:  x15

## timeline

command:

../bin/llvm-mca -mcpu=tsv110 -dispatch=4 -mtriple=aarch64-unkonw-linux-gnu -iterations=1 --timeline float_test.s

result:

Timeline view:
                    0123456789          0123456789          012
Index     0123456789          0123456789          0123456789

[0,0]     DeER .    .    .    .    .    .    .    .    .    . .   adrp  x6, #0
[0,1]     D=eER.    .    .    .    .    .    .    .    .    . .   add   x7, x6, #2864
[0,2]     DeeeER    .    .    .    .    .    .    .    .    . .   fmov  d6, #1.00000000
[0,3]     .D=eeeeeER.    .    .    .    .    .    .    .    . .   ldp   d0, d4, [x7, #296]
[0,4]     .D==eeeeeER    .    .    .    .    .    .    .    . .   ldr   d7, [x7, #312]
[0,5]     .D===eeeeeER   .    .    .    .    .    .    .    . .   ldr   d19, [x7, #320]
[0,6]     . D=====eeeeeER.    .    .    .    .    .    .    . .   fmul  d16, d2, d0
[0,7]     . D==========eeeeeER.    .    .    .    .    .    . .   frinta    d5, d16
[0,8]     . D==========eeeeeeeeeeER.    .    .    .    .    . .   fcvtas    x8, d16
[0,9]     .  D==============eeeeeER.    .    .    .    .    . .   fsub  d17, d16, d5
[0,10]    .  D===================eER    .    .    .    .    . .   and   x9, x8, #0x1f
[0,11]    .  D===================eeeeeeeeeER .    .    .    . .   fmadd d18, d4, d17, d7
[0,12]    .  D===================eeeeeeeeeER .    .    .    . .   fmadd d20, d19, d17, d6
[0,13]    .   D===================eeeeeE---R .    .    .    . .   fmul  d21, d17, d17
[0,14]    .   D===================eeeeE----R .    .    .    . .   ldr   x10, [x7, x9, lsl #3]
[0,15]    .   D=======================eeE--R .    .    .    . .   add   x11, x10, x8, lsl #47
[0,16]    .    D========================eeeeeER   .    .    . .   fmov  d23, x11
[0,17]    .    D==========================eeeeeeeeeER  .    . .   fmadd d22, d18, d21, d20
[0,18]    .    D===================================eeeeeER  . .   fmul  d24, d22, d23
[0,19]    .    D========================================eeeeeER   fcvt  s0, d24
[0,20]    .    .DeeeeE----------------------------------------R   ldr   x14, [x1, #4056]
[0,21]    .    .D=eeeeE---------------------------------------R   ldr   x2, [sp, #24]
[0,22]    .    .D====eeeeE------------------------------------R   ldr   x15, [x14]
[0,23]    .    .D========eE-----------------------------------R   eor   x15, x2, x15

The floating-point instructions below should be in the critical sequence,
because of their FSU resource and register dependencies:

 |      17.   fmadd d22, d18, d21, d20
 |      18.   fmul  d24, d22, d23
 |      19.   fcvt  s0, d24
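
One way to put numbers on what the timeline shows is to count the `=` cycles (dispatched, waiting to execute) in each row. The sketch below is illustrative only: `ROW` and `wait_cycles` are hypothetical helper names, not part of llvm-mca, and the regular expression assumes that each timeline row sits on a single line and that the instruction retires inside the printed window.

```python
import re

# One timeline row: "[iteration,index]", the cycle lane, then the instruction.
# Lane symbols as documented for llvm-mca's timeline view:
#   D dispatched, = waiting to execute, e executing, E executed,
#   - executed and waiting to retire, R retired.
ROW = re.compile(r"^\[(\d+),(\d+)\]\s+[. ]*(D[=eE-]*R)[. ]*(\S.*)$")


def wait_cycles(timeline_text):
    """Return (index, instruction, number of '=' cycles) for each parsed row."""
    rows = []
    for line in timeline_text.splitlines():
        match = ROW.match(line)
        if match:
            instruction = " ".join(match.group(4).split())  # collapse padding
            rows.append((int(match.group(2)), instruction, match.group(3).count("=")))
    return rows


# Rows 17-19 copied from the timeline in this report.
SAMPLE = """\
[0,17]    .    D==========================eeeeeeeeeER  .    . .   fmadd d22, d18, d21, d20
[0,18]    .    D===================================eeeeeER  . .   fmul  d24, d22, d23
[0,19]    .    D========================================eeeeeER   fcvt  s0, d24
"""

for index, instruction, waited in wait_cycles(SAMPLE):
    print(f"{index:>2}  waited {waited:>2} cycles  {instruction}")
```

A large count shows that an instruction sat in the scheduler for a long time. On its own it does not show that the dependency limited the issue throughput.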

The bottleneck analysis is not a "critical-path" analysis. The analysis is
conducted at simulation time; it is based purely on the observation of
so-called "pressure increase" events, usually generated by a Scheduler
component. Pressure events are generated in only two situations:
1.  Hardware pipeline utilisation could be increased if instructions weren't
subject to data dependencies.
2.  There are instructions ready to execute. However, the pipelines are fully
booked, and the number of instructions dispatched during that cycle was greater
than the number of instructions issued on the underlying pipes.

Essentially, point 1 is about data dependencies limiting the issue throughput.
Point 2 is instead about pipeline resources being unavailable, which leaves the
issue rate too low (even though the instructions are free from data
dependencies).
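
The two situations can be pictured with a toy per-cycle classifier. This is only a sketch of the description above, not llvm-mca's actual implementation: the function `classify_cycle`, its parameters, and the returned labels are all hypothetical names chosen for illustration.

```python
def classify_cycle(ready, blocked_on_data, free_pipes, dispatched, issued):
    """Toy classification of one simulated cycle (hypothetical model).

    ready           -- instructions with operands available but not yet issued
    blocked_on_data -- instructions still waiting on a data dependency
    free_pipes      -- pipelines left idle during this cycle
    dispatched      -- instructions dispatched to the scheduler this cycle
    issued          -- instructions issued to the pipelines this cycle
    """
    # Situation 1: idle pipelines exist, but the waiting instructions are
    # held back by data dependencies.
    if free_pipes > 0 and blocked_on_data > 0:
        return "data-dependency pressure"
    # Situation 2: instructions are ready, every pipeline is booked, and
    # more instructions were dispatched than issued in this cycle.
    if ready > 0 and free_pipes == 0 and dispatched > issued:
        return "resource pressure"
    return "no pressure event"


# Idle pipes while instructions wait on operands.
print(classify_cycle(ready=0, blocked_on_data=3, free_pipes=2, dispatched=4, issued=1))
# Ready instructions, but the pipes are full and dispatch outpaces issue.
print(classify_cycle(ready=2, blocked_on_data=0, free_pipes=0, dispatched=4, issued=2))
# A dependency exists, yet the pipes are busy and issue keeps up with dispatch.
print(classify_cycle(ready=0, blocked_on_data=1, free_pipes=0, dispatched=2, issued=2))
```

In this model the third call reports no pressure event even though one instruction is waiting on a dependency, because the pipelines stay busy and the issue rate keeps up with dispatch.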

Therefore, not all data dependencies are necessarily seen as "problematic"
for the purpose of this analysis. Only those that limit the issue throughput
are problematic.

Back to your example: it may be that those dependencies are not problematic
during the first ten iterations of the loop. Those may introduce problems if
the number of iterations is increased.

Your timeline only shows that there are data dependencies. Nothing more. The
effects of those dependencies on the throughput may only become apparent if
you increase the number of iterations to something more than 10.
During that short simulation, the scheduler was probably still able to extract
enough ILP and feed the underlying pipes.
Over time, problematic dependencies would induce an increase in back-pressure
on the scheduler buffers, eventually leading to compulsory stalls. It may be
that 10 iterations weren't enough to reach that critical point.

Generally speaking, when doing bottleneck analysis (or throughput analysis in
general), it is strongly advised to use a large number of iterations. If
possible, I recommend sticking with the default (i.e. 100 iterations) unless
there are compelling reasons for doing it differently.
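
To compare the report at several iteration counts, the command lines can be generated as in the sketch below. `mca_command` is a hypothetical helper, and it assumes `llvm-mca` is on the `PATH` (the report uses `../bin/llvm-mca`) and that the input file is `float_test.s`. The flags themselves are the ones used earlier in this report.

```python
def mca_command(asm_file, iterations=100, mcpu="tsv110",
                mtriple="aarch64-unknown-linux-gnu", dispatch=4):
    """Build an llvm-mca bottleneck-analysis command line as an argv list."""
    return [
        "llvm-mca",                   # assumed to be on PATH
        f"-mcpu={mcpu}",
        f"-mtriple={mtriple}",
        f"-dispatch={dispatch}",
        f"-iterations={iterations}",  # 100 is llvm-mca's default
        "-bottleneck-analysis",
        asm_file,
    ]


# Print one command per iteration count; run them in a shell to compare reports.
for count in (10, 100, 1000):
    print(" ".join(mca_command("float_test.s", iterations=count)))
```

The list form can also be passed directly to `subprocess.run` if the comparison needs to be scripted.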
