Open Quuxplusone opened 7 years ago
Attached file_32863.txt
(2468 bytes, text/plain): int-float-test-cases.c
A conservative default for how many instructions ago a register must have last
been written to be considered cold would be the ROB size of whatever -mtune= is
tuning for. That's the size of the out-of-order execution window: once the core
has issued an instruction, everything more than a ROB's worth of instructions
older than it has already retired (retirement happens in-order, to support
precise exceptions).
The ROB size on Intel Skylake is 224 fused-domain uops. Every instruction
decodes to at least one uop (except macro-fused compare-and-branch), even if it
doesn't need an execution unit. 224 instructions would be a reasonable
approximation, too (except that rep stosb and other micro-coded insns should
count as many more, at least 1 per 32B of data movement).
Low-power architectures have *much* smaller ROB sizes. Older uarches have
smaller ROBs, too. Intel Sandybridge was 168 entries
(http://www.realworldtech.com/sandy-bridge/5/).
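As a rough illustration of the ROB-size heuristic above, here is a minimal C sketch that treats a register as cold once at least one ROB's worth of instructions has been issued since its last write. The function names and the exact weighting of micro-coded string instructions are hypothetical, chosen only for illustration; the 224 and 168 entry counts are the ones quoted above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* ROB sizes quoted in this report (fused-domain uops). */
enum { ROB_SKYLAKE = 224, ROB_SANDYBRIDGE = 168 };

/* Hypothetical weight of one instruction, in ROB entries.
 * Ordinary instructions count as 1. A micro-coded block move such as
 * rep stosb is assumed to cost at least 1 entry per 32 bytes moved. */
size_t insn_weight(bool is_rep_string, size_t bytes_moved)
{
    if (!is_rep_string)
        return 1;
    size_t w = (bytes_moved + 31) / 32;   /* round up to whole 32B chunks */
    return w ? w : 1;
}

/* A register is considered cold if at least rob_entries worth of
 * instructions have been issued since it was last written. */
bool reg_is_cold(size_t weight_since_write, size_t rob_entries)
{
    assert(rob_entries > 0);
    return weight_since_write >= rob_entries;
}
```

With these definitions, a register last written 200 instructions ago would count as cold when tuning for Sandybridge but not when tuning for Skylake.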
A reasonable default for -mtune=generic with some future-proofing might be
300.
In *most* cases, the register we pick won't be the one that has the most
stalls. But unless we want to predict (or use profiling info to find) which
loads will cache-miss often, and which dependency chains will be bottlenecks,
using a shorter default could cut down the effective out-of-order window for a
given instruction. OTOH, if the compile-time cost scales with the number, we
can probably pick a lower number (like 64) without serious impact in *most*
cases. Or we could skip full dep-chain tracking entirely and just look for
loop-invariant registers for code inside loops.
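To make the compile-time trade-off concrete, here is a sketch of a bounded backward scan that picks the candidate register whose last write is furthest away. Everything in it (`struct insn`, `pick_coldest_reg`, `NUM_REGS`) is a hypothetical simplification, not how any existing compiler represents instructions; the point is only that the work is proportional to the scan limit.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_REGS 16   /* assumption: x86-64 integer registers */

/* Hypothetical, simplified instruction: only the destination matters here. */
struct insn {
    int dest_reg;     /* register written, or -1 if none */
};

/* Scan backwards from position pos (exclusive) over at most `limit`
 * instructions and return the candidate whose last write is furthest
 * away. A register not written inside the window gets distance
 * limit + 1, i.e. it is treated as cold. Cost is O(limit).
 * Candidates must be valid register numbers in [0, NUM_REGS). */
int pick_coldest_reg(const struct insn *code, size_t pos, size_t limit,
                     const int *candidates, size_t n_candidates)
{
    size_t dist[NUM_REGS];
    assert(n_candidates > 0);

    for (int r = 0; r < NUM_REGS; r++)
        dist[r] = limit + 1;              /* not seen in the window */

    for (size_t d = 1; d <= limit && d <= pos; d++) {
        int r = code[pos - d].dest_reg;
        if (r >= 0 && r < NUM_REGS && dist[r] > d)
            dist[r] = d;                  /* first hit is the most recent write */
    }

    int best = candidates[0];
    for (size_t i = 1; i < n_candidates; i++)
        if (dist[candidates[i]] > dist[best])
            best = candidates[i];         /* ties keep the earlier candidate */
    return best;
}
```

A limit of 300 or 64 would be passed as `limit`; a smaller value makes the scan cheaper but lets more registers look equally cold.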
----
If anyone cares about Intel Nehalem and earlier, tracking register dependencies
and how recently-written they are could allow code-gen to minimize register-
read-port stalls. e.g. group together instructions that read the same cold
register so they might issue in the same group. Or if given the choice to
increment a loop counter before/after using it in an addressing mode, increment
it before if tuning for those old architectures. (That kind of thing can hurt
other CPUs by requiring more out-of-order resources to find as much ILP,
though, so don't do this by default.)
For register-read port stalls, we'd be worried about the best/average case of
execution throughput, and Agner Fog says a register can become "cold" as few as
5 cycles after execution on Nehalem.
This is different from avoiding false dependencies, where we're worried about
the worst-case of stalls or loop-carried dependencies on the register we want
to use.
Tracking dep chains would also let the compiler make better decisions about whether to use the copy produced by a MOV, or whether to modify the original. I forget how well clang does with this, but integer MOV reg,reg has non-zero latency on architectures before Intel IvB and AMD Zen. Vector MOVDQA xmm,xmm has zero latency on Bulldozer and IvB and later.
Anyway, making a copy for later use and then modifying the original may shorten latency chains in some cases for some CPUs. I definitely see compiler output that copies a register and then uses that as the input to something else, instead of using the original. I assume this behaviour was baked-in when Intel P6-family register-read port limitations made it desirable.
I don't think there's any downside for the number of physical-register-file entries used. The CPU can free them once no future instructions need them, if they're not part of the architectural state (i.e. the state after the last retired instruction), so I don't think it's worse to make more architectural registers be "recently-written" for most designs.
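The copy-versus-original choice can be put as a toy latency calculation. The sketch below assumes a single dependent operation after the copy and takes MOV latency as a parameter (1 on CPUs without move elimination, 0 with it); the function names are hypothetical and the model ignores throughput and register pressure.

```c
#include <assert.h>

/* Toy model: cycles from the source register being ready until the
 * modified value is ready.
 *
 *   modify the copy:      mov tmp, src ; add tmp, 1   (add waits for mov)
 *   modify the original:  mov tmp, src ; add src, 1   (add and mov are independent)
 */
int latency_modify_copy(int mov_latency, int op_latency)
{
    assert(mov_latency >= 0 && op_latency >= 0);
    return mov_latency + op_latency;
}

int latency_modify_original(int mov_latency, int op_latency)
{
    assert(mov_latency >= 0 && op_latency >= 0);
    (void)mov_latency;   /* the copy is off the critical path */
    return op_latency;
}
```

With a 1-cycle MOV and a 1-cycle ADD, modifying the copy gives a 2-cycle chain while modifying the original gives 1 cycle; with a zero-latency MOV the two are equal in this model.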