Quuxplusone / LLVMBugzillaTest


[mca] Multiple reservation station handling #41277

Open Quuxplusone opened 5 years ago

Quuxplusone commented 5 years ago
Bugzilla Link PR42307
Status NEW
Importance P enhancement
Reported by Roman Lebedev (lebedev.ri@gmail.com)
Reported on 2019-06-18 09:40:01 -0700
Last modified on 2019-06-18 13:41:52 -0700
Version trunk
Hardware PC Linux
CC andrea.dibiagio@gmail.com, llvm-bugs@lists.llvm.org, llvm-dev@redking.me.uk, matthew.davis@sony.com
Fixed by commit(s)
Attachments patch.patch (172929 bytes, text/plain)
Blocks
Blocked by
See also
Created attachment 22119
X86ScheduleBarcelona.td

Currently llvm-mca *seems* to only handle the case where there
is a single reservation station shared by a number of execution pipes.

That is more or less not the case for -mcpu=amdfam10h.
References:
* AMD Software Optimization Guide for AMD Family 10h and 12h Processors
  https://support.amd.com/TechDocs/40546.pdf
  Appendix A Microarchitecture of AMD Family 10h and 12h Processors
* https://www.realworldtech.com/barcelona/
* I think this is more or less stated in Agner's guide as well, though not explicitly

Each of the 6 pipes (3 integer, 3 FP) has its own scheduler with its own
reservation station.
But as you can see from test/tools/llvm-mca/X86/scheduler-queue-usage.s,
MCA currently only seems to recognize this:

def BnInt : ProcResGroup<[BnInt0, BnInt1, BnInt2]> {
  let BufferSize = 24;
}

^ only the BnInt queue is ever used.

A simple
  def BnI0 : ProcResGroup<[BnInt0]> {
    let BufferSize = 24;
  }

isn't recognized, nor is

  def BnInt0 : ProcResource<1> {
    let BufferSize = 8;
  }

Am I missing something truly obvious here?
Quuxplusone commented 5 years ago

Attached patch.patch (172929 bytes, text/plain): X86ScheduleBarcelona.td

Quuxplusone commented 5 years ago
According to the scheduling model,
XOR32rr is associated with scheduling class ID #782.

That ID identifies a MCSchedClassDesc (in BarcelonaModelSchedClasses[]) which
declares 3 writes.
  - BnALU012,  6cy
  - BnInt,     1cy
  - BnInt012,  1cy

Of these resources, only BnInt declares a buffer of 24 entries (according to
BarcelonaModelProcResources[]).

That's why mca only reports BnInt as consumed in the scheduler-stats view.

In your particular case, BnInt is a composition of BnI0, BnI1 and BnI2. Each of
those is also implemented as a scheduler with 24 resources.

It is unclear how to accurately model the buffer consumption for your
particular case, since buffer resources are consumed at the dispatch stage
and not at the issue stage. At the dispatch stage we still don't know
whether the XOR32rr will be sent to BnI0, BnI1, or BnI2.
It is only when the instruction reaches the issue stage that the simulated
reservation station BnInt selects one from the set { BnI0, BnI1, BnI2 }.
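The dispatch/issue distinction can be sketched with a toy model (purely illustrative Python, not llvm-mca code; the names BnInt/BnI0/BnI1/BnI2 and the 24-entry size come from the discussion above, and the simple in-order pipe selection is an assumption):

```python
# Toy sketch: buffer entries are claimed at dispatch, but the concrete pipe
# (BnI0/BnI1/BnI2) is only chosen at issue.

class ReservationStation:
    """A shared scheduler buffer in front of several execution pipes."""
    def __init__(self, name, size, pipes):
        self.name = name
        self.size = size          # e.g. 24 entries for BnInt
        self.pipes = pipes        # e.g. ["BnI0", "BnI1", "BnI2"]
        self.entries = []         # instructions waiting to issue
        self.busy = {p: None for p in pipes}

    def dispatch(self, inst):
        # Buffer consumption happens here: we only know the *group*,
        # not which inner pipe will eventually execute `inst`.
        if len(self.entries) >= self.size:
            return False          # dispatch stall: buffer full
        self.entries.append(inst)
        return True

    def issue(self):
        # Only now is a concrete pipe selected from the pool.
        issued = []
        for pipe in self.pipes:
            if self.busy[pipe] is None and self.entries:
                inst = self.entries.pop(0)
                self.busy[pipe] = inst
                issued.append((inst, pipe))
        return issued

rs = ReservationStation("BnInt", size=24, pipes=["BnI0", "BnI1", "BnI2"])
for i in range(4):
    rs.dispatch(f"xor{i}")
rs.issue()   # three xors issue to the three pipes; one stays buffered
```

The point of the sketch is that `dispatch()` never touches `self.pipes`: the group buffer is debited before any pipe decision exists, which is exactly why a per-pipe buffer cannot be charged at that stage.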

The model simulated by the machine scheduler in LLVM is much simpler. There
is no dispatch stage, and buffered resources are not correctly modeled
(except for the special cases where BufferSize=[0|1]).

From an llvm-mca point of view, the behavior of a (potentially buffered)
group of buffered resources is undefined.

There are two options:
 1) we keep everything as it is now.
 2) we change how the dispatch logic works in llvm-mca.

About point 2)
It requires that we potentially change how buffers are allocated. It also means
that we potentially pre-assign resources to specific pipelines at dispatch
stage.

For example, if an instruction consumes a resource group, and that group
contains only buffered resources, the inner resource is pre-assigned at
dispatch stage rather than at issue stage.
It would fix the issue that you have encountered with XOR32rr. However, it
introduces a potentially annoying limitation: by pre-assigning opcodes to
specific resource buffers, we force the instruction to execute on a
pipeline selected at the dispatch stage. That gives less flexibility:
normally the pipeline would be selected at the issue stage from the
original pool of resources.
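The flexibility loss can be shown with a small sketch of what dispatch-stage pre-assignment could look like (toy Python, not actual or proposed llvm-mca code; the round-robin assignment policy and the single-cycle stall scenario are assumptions for illustration):

```python
# Toy sketch of option 2: the inner resource is pre-assigned at dispatch
# (here: round-robin), so an instruction may wait on "its" pipe even while
# another pipe in the group is free.
from itertools import cycle

pipes = ["BnI0", "BnI1", "BnI2"]
rr = cycle(pipes)

# With pre-assignment, each pipe gets its own queue, and an instruction is
# bound to a pipe when it is dispatched.
queues = {p: [] for p in pipes}
for i in range(4):
    queues[next(rr)].append(f"xor{i}")

# Suppose BnI0 is busy this cycle (e.g. an older op still holds it).
busy = {"BnI0"}
issued = [q.pop(0) for p, q in queues.items() if p not in busy and q]

# xor3 was pre-assigned to BnI0 and cannot issue this cycle, even though
# under issue-stage selection it could have gone to BnI1 or BnI2.
```

Under issue-stage selection (as in the current model), xor3 would simply have been steered to whichever of BnI1/BnI2 was free; the pre-assignment makes the simulated schedule strictly less flexible.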

Even assuming that we want to model that behavior: what should the
semantics be for the case where a group contains some (but not all)
buffered resources?

It is an interesting topic. You could say that a similar problem exists in
other models too. In Jaguar, the ALU pipes are served by a unified
reservation station (even though in practice the schedulers are
distributed).
In practice, these details rarely matter when doing throughput analysis. In
my experience I have never observed a wrong prediction caused by the lack
of an accurate description of these distributed schedulers.
Quuxplusone commented 5 years ago
(In reply to Andrea Di Biagio from comment #3)
> It is unclear how to accurately model the buffer consumption for your
> particular case since buffer resources are consumed at dispatch stage and
> not at issue stage. At dispatch stage still we don't know if the XOR32rr
> will be sent to BnI0 or BnI1 or BnI2.
> [...]
> Normally the pipeline would be selected at issue stage from the original
> pool of resources.

Hmm. Maybe I'm reading too much into it?

SOG:
"Early decoding produces three macro-ops per cycle from either path.
The outputs of both decoders are multiplexed together and passed to
the next stage in the pipeline, the instruction control unit. "

"The instruction control unit takes the three macro-ops that are produced
during each cycle from the early decoders and places them in a centralized,
fixed-issue reorder buffer. For AMD Family 12h processors, this buffer is
organized into 28 lines of three macro-ops each."

"The integer execution pipeline is organized to match the three macro-op
dispatch pipes in the ICU as shown in Figure 10."

"The floating-point scheduler handles register renaming and has a dedicated
42-entry scheduler buffer organized as 14 lines of three macro-ops each"

Agner:
"The instructions are distributed between the three pipelines right after
the fetch stage. In simple cases, the instructions stay in each their
pipeline all the way to retirement."

"3. Pick/Scan. Can buffer up to 7 instructions. Distributes three instructions
into the three decoder pipelines. The following stages are all split into
three parallel pipes."

"6. Pack. Up to six macro-operations generated from the decoders are arranged
into lines of three macro-operations for the three execution pipelines."

^ Alternatively, is something like that not being modelled?
