Quuxplusone opened 5 years ago
Attached patch.patch (172929 bytes, text/plain): X86ScheduleBarcelona.td
According to the scheduling model, XOR32rr is associated with scheduling
class ID #782.
That ID identifies an MCSchedClassDesc (in BarcelonaModelSchedClasses[])
which declares three writes:
- BnALU012, 6cy
- BnInt, 1cy
- BnInt012, 1cy
Of these resources, only BnInt declares a buffer of 24 entries (according to
BarcelonaModelProcResources[]).
That's why mca only reports BnInt as consumed in the scheduler-stats view.
In your particular case, BnInt is a group composed of BnI0, BnI1, and BnI2.
Each of those is also implemented as a scheduler with a 24-entry buffer.
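To make that concrete, here is a minimal sketch in C++ (the struct below is
made up for illustration, not llvm-mca's MCWriteProcResEntry; only the
resource names and numbers come from the model described above) of the three
writes and why only BnInt counts as buffered:

#include <cstdio>
#include <vector>

// Hypothetical encoding of XOR32rr's three writes.
struct WriteRes {
  const char *Resource;
  int Cycles;     // resource cycles consumed by this write
  int BufferSize; // entries in the declared buffer; -1 = no buffer declared
};

int main() {
  std::vector<WriteRes> XOR32rrWrites = {
      {"BnALU012", 6, -1}, // no buffer declared
      {"BnInt",    1, 24}, // group over BnI0/BnI1/BnI2; 24-entry buffer
      {"BnInt012", 1, -1}, // no buffer declared
  };
  // Only resources that declare a buffer are reported by scheduler-stats:
  for (const WriteRes &W : XOR32rrWrites)
    if (W.BufferSize > 0)
      std::printf("buffered: %s (%d entries)\n", W.Resource, W.BufferSize);
  return 0;
}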
It is unclear how to accurately model the buffer consumption for your
particular case, since buffer resources are consumed at the dispatch stage
and not at the issue stage. At dispatch time we still don't know whether
XOR32rr will be sent to BnI0, BnI1, or BnI2.
It is only when the instruction reaches the issue stage that the simulated
reservation station BnInt selects one pipe from the set { BnI0, BnI1, BnI2 }.
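As a toy illustration of that ordering (a standalone C++ sketch, not
llvm-mca's actual scheduler code; it assumes one group buffer over three
pipes):

#include <array>

// Toy reservation station for BnInt: the group buffer entry is taken at
// dispatch, but the concrete pipe (BnI0/BnI1/BnI2) is chosen only at issue.
struct GroupRS {
  static constexpr int BufferSize = 24;
  int EntriesUsed = 0;
  std::array<bool, 3> PipeBusy{}; // BnI0, BnI1, BnI2

  // Dispatch stage: reserve one entry in the group buffer. At this point
  // there is no way to tell which of the three sub-buffers to charge.
  bool dispatch() {
    if (EntriesUsed == BufferSize)
      return false; // dispatch stall: the buffer is full
    ++EntriesUsed;
    return true;
  }

  // Issue stage: only now is a concrete pipe selected from the group.
  int issue() {
    for (int I = 0; I < 3; ++I)
      if (!PipeBusy[I]) {
        PipeBusy[I] = true;
        --EntriesUsed; // the group entry is released on issue
        return I;
      }
    return -1; // no pipe available this cycle
  }
};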
The model simulated by the machine scheduler in LLVM is much simpler. There
is no dispatch stage, and buffered resources are not correctly modeled
(except for the special cases where BufferSize=[0|1]).
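For reference, here is how a simulator might classify those cases (my
paraphrase of the BufferSize semantics documented in
llvm/include/llvm/Target/TargetSchedule.td; the enum below is made up):

// Hypothetical classification; only BufferSize=0 and BufferSize=1 get
// special handling in the machine scheduler, per the comment above.
enum class ResourceKind {
  Unbuffered, // BufferSize = 0: consumed in-order, stalls on conflict
  Reserved,   // BufferSize = 1: held until the write completes (serializing)
  Buffered    // BufferSize > 1: out-of-order queue; not precisely modeled
};

inline ResourceKind classify(int BufferSize) {
  if (BufferSize == 0)
    return ResourceKind::Unbuffered;
  if (BufferSize == 1)
    return ResourceKind::Reserved;
  return ResourceKind::Buffered;
}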
From an llvm-mca point of view, the behavior of a (potentially buffered)
group of buffered resources is undefined.
There are two options:
1) we keep everything as it is now.
2) we change how the dispatch logic works in llvm-mca.
About point 2): it requires changing how buffers are allocated, and it
potentially means pre-assigning instructions to specific pipelines at the
dispatch stage.
For example, if an instruction consumes a resource group, and that group
contains only buffered resources, then the inner resource would be
pre-assigned at the dispatch stage rather than at the issue stage.
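One possible shape for that pre-assignment (just a sketch of option 2, not
an actual patch; the least-loaded heuristic is an arbitrary choice):

#include <array>

// Sketch: when every member of the consumed group is itself buffered, pick
// the member pipeline at dispatch and charge its buffer entry right away,
// instead of charging only the group buffer and deferring the choice.
struct SubScheduler {
  int BufferSize = 24;
  int EntriesUsed = 0;
};

// Returns the pre-assigned pipe index, or -1 on a dispatch stall.
inline int dispatchPreAssign(std::array<SubScheduler, 3> &Pipes) {
  int Best = -1;
  for (int I = 0; I < 3; ++I) // least-loaded heuristic (arbitrary)
    if (Pipes[I].EntriesUsed < Pipes[I].BufferSize &&
        (Best < 0 || Pipes[I].EntriesUsed < Pipes[Best].EntriesUsed))
      Best = I;
  if (Best >= 0)
    ++Pipes[Best].EntriesUsed; // buffer entry consumed at dispatch
  return Best;                 // instruction is now pinned to this pipe
}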
This would fix the issue you encountered with XOR32rr. However, it
introduces a potentially annoying limitation: by pre-assigning opcodes to
specific resource buffers we force the instruction to execute on a pipeline
selected at the dispatch stage, which gives less flexibility. Normally the
pipeline would be selected at the issue stage from the original pool of
resources.
Even assuming that we want to model that behavior: what should the semantics
be for the case where a group contains some (but not all) buffered resources?
It is an interesting topic. You could say that a similar problem exists in
other models too. In Jaguar, the ALU pipes are served by a unified
reservation station (even though in practice the schedulers are distributed).
In practice, these details rarely matter when doing throughput analysis. In
my experience, I have never observed a wrong prediction whose error was due
to the lack of an accurate description of these distributed schedulers.
(In reply to Andrea Di Biagio from comment #3)
> [...]
> Normally the pipeline would be selected at issue stage from the original
> pool of resources.
Hmm. Maybe I'm reading too much into it?
From AMD's Software Optimization Guide (SOG):
"Early decoding produces three macro-ops per cycle from either path.
The outputs of both decoders are multiplexed together and passed to
the next stage in the pipeline, the instruction control unit. "
"The instruction control unit takes the three macro-ops that are produced
during each cycle from the early decoders and places them in a centralized,
fixed-issue reorder buffer. For AMD Family 12h processors, this buffer is
organized into 28 lines of three macro-ops each."
"The integer execution pipeline is organized to match the three macro-op
dispatch pipes in the ICU as shown in Figure 10."
"The floating-point scheduler handles register renaming and has a dedicated
42-entry scheduler buffer organized as 14 lines of three macro-ops each"
From Agner Fog's microarchitecture manual:
"The instructions are distributed between the three pipelines right after
the fetch stage. In simple cases, the instructions stay in each their
pipeline all the way to retirement."
"3. Pick/Scan. Can buffer up to 7 instructions. Distributes three instructions
into the three decoder pipelines. The following stages are all split into
three parallel pipes."
"6. Pack. Up to six macro-operations generated from the decoders are arranged
into lines of three macro-operations for the three execution pipelines."
^ Alternatively, is something like that not being modelled?