Quuxplusone / LLVMBugzillaTest


[mca] Multiple reservation station handling #41277

Open Quuxplusone opened 5 years ago

Quuxplusone commented 5 years ago
Bugzilla Link PR42307
Status NEW
Importance P enhancement
Reported by Roman Lebedev (lebedev.ri@gmail.com)
Reported on 2019-06-18 09:40:01 -0700
Last modified on 2019-06-18 13:41:52 -0700
Version trunk
Hardware PC Linux
CC andrea.dibiagio@gmail.com, llvm-bugs@lists.llvm.org, llvm-dev@redking.me.uk, matthew.davis@sony.com
Fixed by commit(s)
Attachments patch.patch (172929 bytes, text/plain)
Blocks
Blocked by
See also
Created attachment 22119
X86ScheduleBarcelona.td

Currently llvm-mca *seems* to only handle the case where there
is a single reservation station shared by a number of execution pipes.

That is more or less not the case for -mcpu=amdfam10h.
References:
* AMD Software Optimization Guide for AMD Family 10h and 12h Processors
  https://support.amd.com/TechDocs/40546.pdf
  Appendix A Microarchitecture of AMD Family 10h and 12h Processors
* https://www.realworldtech.com/barcelona/
* I think this is more or less stated in Agner's guide as well, though not explicitly

Each of the 6 pipes (3 integer, 3 FP) has its own scheduler with its own
reservation station.
But as you can see from test/tools/llvm-mca/X86/scheduler-queue-usage.s,
MCA currently only seems to recognize this:

def BnInt : ProcResGroup<[BnInt0, BnInt1, BnInt2]> {
  let BufferSize = 24;
}

^ only the BnInt queue is ever used.

A simple
  def BnI0 : ProcResGroup<[BnInt0]> {
    let BufferSize = 24;
  }

isn't recognized, nor is

  def BnInt0 : ProcResource<1> {
    let BufferSize = 8;
  }

Am I missing something truly obvious here?
Quuxplusone commented 5 years ago

Attached patch.patch (172929 bytes, text/plain): X86ScheduleBarcelona.td

Quuxplusone commented 5 years ago
According to the scheduling model,
XOR32rr is associated with scheduling class ID #782.

That ID identifies a MCSchedClassDesc (in BarcelonaModelSchedClasses[]) which
declares 3 writes.
  - BnALU012,  6cy
  - BnInt,     1cy
  - BnInt012,  1cy

Of these resources, only BnInt declares a buffer of 24 entries (according to
BarcelonaModelProcResources[]).

That's why mca only reports BnInt as consumed in the scheduler-stats view.

In your particular case, BnInt is a composition of BnI0, BnI1 and BnI2. Each of
those is also implemented as a scheduler with 24 resources.

It is unclear how to accurately model the buffer consumption for your
particular case, since buffer resources are consumed at the dispatch stage
and not at the issue stage. At the dispatch stage we still don't know
whether the XOR32rr will be sent to BnI0, BnI1, or BnI2.
It is only when the instruction reaches the issue stage that the simulated
reservation station BnInt selects one from the set { BnI0, BnI1, BnI2 }.
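The dispatch/issue distinction can be sketched with a toy model (purely illustrative Python, not llvm-mca code; the names BnInt/BnI0/BnI1/BnI2 and the 24-entry size come from the discussion above, and the simple in-order pipe selection is an assumption):

```python
# Toy sketch: buffer entries are claimed at dispatch, but the concrete pipe
# (BnI0/BnI1/BnI2) is only chosen at issue.

class ReservationStation:
    """A shared scheduler buffer in front of several execution pipes."""
    def __init__(self, name, size, pipes):
        self.name = name
        self.size = size          # e.g. 24 entries for BnInt
        self.pipes = pipes        # e.g. ["BnI0", "BnI1", "BnI2"]
        self.entries = []         # instructions waiting to issue
        self.busy = {p: None for p in pipes}

    def dispatch(self, inst):
        # Buffer consumption happens here: we only know the *group*,
        # not which inner pipe will eventually execute `inst`.
        if len(self.entries) >= self.size:
            return False          # dispatch stall: buffer full
        self.entries.append(inst)
        return True

    def issue(self):
        # Only now is a concrete pipe selected from the pool.
        issued = []
        for pipe in self.pipes:
            if self.busy[pipe] is None and self.entries:
                inst = self.entries.pop(0)
                self.busy[pipe] = inst
                issued.append((inst, pipe))
        return issued

rs = ReservationStation("BnInt", size=24, pipes=["BnI0", "BnI1", "BnI2"])
for i in range(4):
    rs.dispatch(f"xor{i}")
rs.issue()   # three xors issue to the three pipes; one stays buffered
```

The point of the sketch is that `dispatch()` never touches `self.pipes`: the group buffer is debited before any pipe decision exists, which is exactly why a per-pipe buffer cannot be charged at that stage.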

The model simulated by the machine scheduler in LLVM is much simpler. There
is no dispatch stage, and buffered resources are not correctly modeled
(except for the special cases where BufferSize=[0|1]).

From an llvm-mca point of view, the behavior of a (potentially buffered)
group of buffered resources is undefined.

There are two options:
 1) we keep everything as it is now.
 2) we change how the dispatch logic works in llvm-mca.

About point 2)
It requires that we potentially change how buffers are allocated. It also means
that we potentially pre-assign resources to specific pipelines at dispatch
stage.

For example, if an instruction consumes a resource group, and that group
contains only buffered resources, the inner resource is pre-assigned at
dispatch stage rather than at issue stage.
It would fix the issue that you have encountered with XOR32rr. However, it
introduces a potentially annoying limitation: by pre-assigning opcodes to
specific resource buffers, we force the instruction to execute on a
pipeline selected at the dispatch stage. That gives less flexibility:
normally the pipeline would be selected at the issue stage from the
original pool of resources.
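The flexibility loss can be shown with a small sketch of what dispatch-stage pre-assignment could look like (toy Python, not actual or proposed llvm-mca code; the round-robin assignment policy and the single-cycle stall scenario are assumptions for illustration):

```python
# Toy sketch of option 2: the inner resource is pre-assigned at dispatch
# (here: round-robin), so an instruction may wait on "its" pipe even while
# another pipe in the group is free.
from itertools import cycle

pipes = ["BnI0", "BnI1", "BnI2"]
rr = cycle(pipes)

# With pre-assignment, each pipe gets its own queue, and an instruction is
# bound to a pipe when it is dispatched.
queues = {p: [] for p in pipes}
for i in range(4):
    queues[next(rr)].append(f"xor{i}")

# Suppose BnI0 is busy this cycle (e.g. an older op still holds it).
busy = {"BnI0"}
issued = [q.pop(0) for p, q in queues.items() if p not in busy and q]

# xor3 was pre-assigned to BnI0 and cannot issue this cycle, even though
# under issue-stage selection it could have gone to BnI1 or BnI2.
```

Under issue-stage selection (as in the current model), xor3 would simply have been steered to whichever of BnI1/BnI2 was free; the pre-assignment makes the simulated schedule strictly less flexible.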

Even assuming that we want to model that behavior: what should the
semantics be for the case where a group contains some (but not all)
buffered resources?

It is an interesting topic. You could say that a similar problem exists in
other models too. In Jaguar, the ALU pipes are served by a unified
reservation station (even though in practice the schedulers are
distributed).
In practice, these details rarely matter when doing throughput analysis. In
my experience I have never observed a wrong prediction caused by the lack
of an accurate description of these distributed schedulers.
Quuxplusone commented 5 years ago
(In reply to Andrea Di Biagio from comment #3)
> It is unclear how to accurately model the buffer consumption for your
> particular case since buffer resources are consumed at dispatch stage and
> not at issue stage. At dispatch stage still we don't know if the XOR32rr
> will be sent to BnI0 or BnI1 or BnI2.
> [...]
> Normally the pipeline would be selected at issue stage from the original
> pool of resources.

Hmm. Maybe I'm reading too much into it?

SOG:
"Early decoding produces three macro-ops per cycle from either path.
The outputs of both decoders are multiplexed together and passed to
the next stage in the pipeline, the instruction control unit. "

"The instruction control unit takes the three macro-ops that are produced
during each cycle from the early decoders and places them in a centralized,
fixed-issue reorder buffer. For AMD Family 12h processors, this buffer is
organized into 28 lines of three macro-ops each."

"The integer execution pipeline is organized to match the three macro-op
dispatch pipes in the ICU as shown in Figure 10."

"The floating-point scheduler handles register renaming and has a dedicated
42-entry scheduler buffer organized as 14 lines of three macro-ops each"

Agner:
"The instructions are distributed between the three pipelines right after
the fetch stage. In simple cases, the instructions stay in each their
pipeline all the way to retirement."

"3. Pick/Scan. Can buffer up to 7 instructions. Distributes three instructions
into the three decoder pipelines. The following stages are all split into
three parallel pipes."

"6. Pack. Up to six macro-operations generated from the decoders are arranged
into lines of three macro-operations for the three execution pipelines."

^ Alternatively, is something like that not being modelled?
