feature rq: track "cold" vector registers for use as don't-care sources to avoid false dependencies

Quuxplusone commented 7 years ago


Bugzilla Link	PR32863
Status	NEW
Importance	P enhancement
Reported by	Peter Cordes (peter@cordes.ca)
Reported on	2017-05-01 01:36:27 -0700
Last modified on	2017-05-01 07:00:34 -0700
Version	trunk
Hardware	PC All
CC	llvm-bugs@lists.llvm.org, llvm-dev@redking.me.uk, spatel+llvm@rotateright.com
Fixed by commit(s)
Attachments	`file_32863.txt` (2468 bytes, text/plain)
Blocks
Blocked by
See also

(This is a summary / rewrite of what I wrote while having this idea on an old
closed bug: https://bugs.llvm.org/show_bug.cgi?id=22024#c11.  See that for some
Haswell perf analysis of scalar int->FP conversion.)

x86 has several cases of inconvenient input-dependencies, either for scalar
stuff in vector regs or for stuff like generating a vector of all-ones on CPUs
that don't recognize PCMPEQD same,same as independent of its inputs.  The usual
solution is to break dependencies with pxor same,same before doing something,
or to guess / hope that a register unused by this function is safe to use.

But with AVX 3-operand instructions, we can use a different strategy:  reuse a
known-safe register as the don't-care input without destroying it.

Such a register doesn't have to have been xor-zeroed; it can be holding a loop-
invariant constant.  Or we can vpxor one such register and reuse it for the
rest of the function (or until we make a function call, which could return with
OOO execution still chewing through a long dep chain on that register).

The use cases where having a safe read-only register helps include:

 * vcvtsi2ss/sd %r64,%merge_into, %xmm destination  # badly-designed instruction
 * vsqrtss     (mem),%merge_into, %xmm
 * vpcmpeqd    %same,%same, %dest    # false dep on KNL / Silvermont
 * vcmptrueps  %same,%same, %ymm     # splat -1 without AVX2.  false dep on all known uarches
 * Maybe some weird shuffle use-cases?

The most important / common one by far is int->float conversion, due to Intel's
short-sighted design of SSE, and decision to keep that behaviour in the AVX
versions.  (good for consistency, bad for performance).  Anyway, hoisting a
VXORPS out of a loop that includes a vcvtsi2sd is an obvious win.
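
To make the merge behaviour concrete, here is a plain-C model of what vcvtsi2sd does to its destination register.  This is only a sketch: `xmm_model` and `model_vcvtsi2sd` are names invented for illustration, not intrinsics or anything from LLVM.

```c
#include <stdint.h>

/* Hypothetical model of an xmm register viewed as two doubles. */
typedef struct {
    double lo;  /* element 0: the scalar result lives here */
    double hi;  /* element 1: "don't care" for scalar code */
} xmm_model;

/* Models  vcvtsi2sd %r64, %merge, %dst :
 * the low element is the converted integer, the high element is copied
 * from the merge source.  That copy is the input dependency on `merge`. */
xmm_model model_vcvtsi2sd(xmm_model merge, int64_t src)
{
    xmm_model dst;
    dst.lo = (double)src;  /* the part scalar code actually wants */
    dst.hi = merge.hi;     /* carried over, so `merge` must be ready first */
    return dst;
}
```

Because `dst.hi` comes from the merge operand, the instruction has to wait for that operand even when only `dst.lo` is ever used.  Choosing a register that is known to be ready as `merge` is what avoids the stall.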

clang already sort of does this for int->float conversions: AFAICT it picks a
register unused in the function, and gambles that it is cold.  This is a
reasonable strategy, but it falls apart under register pressure.  (And more
sophisticated tracking can also avoid gambling that a caller or callee didn't
leave a register at the end of a long dep chain independent from the int->float
conversion we're doing.  e.g. near the end of a dep chain that includes a cache-
miss or a loop accumulator.)  Although perhaps this gamble is still worth the
code-size savings from leaving out a lot of vxorps instructions.

If you use up all the xmm regs with constants, then clang will put a vxorps-
zeroing instruction into the loop and then replace it with a constant.  Better
would be to simply use one of the constants as the merge-dest for vcvtsi2sd.
The x86-64 SysV ABI (and I assume other ABIs) allows passing scalar
float/double args with garbage (not zeros) in the high bytes, and this is
already something that happens in practice, so there's no reason to worry about
not "cleaning up" the results of this before making function calls.

See this test-case on godbolt for clang trunk 301740, gcc8 20170429, icc17, and
MSVC CL19 2017.  With 16 constants needed, clang keeps two of them in memory
since it uses two scratch regs for no reason/benefit.  (And has a vxorps in the
loop).

See also https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80571 which I reported
just about the int->float part of this issue.

-----

Related cases: non-read-only, with a false dep on the output: tracking not-
recently-modified registers can let us pick a not-recently-modified register to
clobber, and/or decide whether to use a vpxor-zeroing instruction depending on
how long ago the last modification was of the register we picked.  e.g. after
one loop, before another loop, all the dead constant registers from the
finished loop are safe to read.

* all of the above without AVX, where dst=src2.

* vpternlogd  $0xff,any,any, src3/dst  # zmm splat -1: false dep on the dst
which is also a 3rd source reg.   All 3 vectors are inputs, so we need a stale
reg we can clobber (or a vpxor dep-breaker).  Hardware could avoid this by
treating imm8=0xFF as a special case, but neither KNL nor skylake-avx512 does.
(I checked skylake-avx512 on a google-compute-engine VM: definitely a false
dep: it runs about twice as fast when adding a vpxor to the loop.  Appears to
be something like 1c latency, one per 0.5c throughput).

vsqrtss with a register (not memory) source can use  src,src,dest with AVX,
avoiding the false dependency that src,dest,dest has.  (clang4.0 gets this
right; 3.9.1 and earlier are like gcc and use vsqrtss %xmm1,%xmm0,%xmm0.  ICC
uses vsqrtss %xmm1,%xmm1,%xmm1 and then vmovaps.)  int->float conversion with
vcvtsi2ss can't use this trick because the source operands aren't both vector
regs.

---

If we had such a readiness/coldness/dep-chain tracking infrastructure,
_mm_undefined_ps() could take advantage of it to make a good choice for which
dead register to pick.  (And whether to dep-break it.)  This is useful for
things like a horizontal-sum function that wants to use MOVHLPS to avoid extra
MOVAPS instructions when extracting the high half of a vector with only SSE2.
(With SSE3, MOVSHDUP is a great first-step as an FP copy+shuffle.  Then you can
use the original __m128 C variable as a destination for MOVHLPS, since it's
from earlier in the same dep chain but dead now.)

---------

That reminds me: for instructions that do have a real source (like sqrtss), an
output dependency on a register that had to be ready earlier in the same dep
chain is always safe.

e.g. if we want to keep around a*b and sqrtf(a*b), we can do this without any
ill effects from sqrtss's dependency on its output:

  mulss    %xmm0, %xmm1
  sqrtss   %xmm1, %xmm0    # xmm1 being ready means xmm0 is also ready
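
The reasoning can be checked with a toy timing model.  This is a sketch under simplifying assumptions: `result_ready` is a made-up helper, and it pretends an instruction starts as soon as all of its inputs (including a merged-into destination) are ready, ignoring ports and throughput.

```c
/* Hypothetical helper: cycle at which an instruction's result is ready.
 *   src_ready : cycle when the real source operand is ready
 *   dst_ready : cycle when the old destination value is ready
 *               (the output dependency of a merging instruction)
 *   latency   : instruction latency in cycles (illustrative numbers only)
 */
long result_ready(long src_ready, long dst_ready, long latency)
{
    /* Execution can't start until the later of the two inputs is ready. */
    long start = (src_ready > dst_ready) ? src_ready : dst_ready;
    return start + latency;
}
```

Say xmm0 is ready at cycle 10, xmm1 at cycle 3, and mulss takes 4 cycles (an invented latency): xmm1 becomes ready at `result_ready(10, 3, 4)`, which is 14.  For the sqrtss, `result_ready(14, 10, 12)` and `result_ready(14, 0, 12)` both give 26, so the output dependency on xmm0 costs nothing, since xmm0 had to be ready before xmm1 could be.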

Quuxplusone commented 7 years ago

Attached file_32863.txt (2468 bytes, text/plain): int-float-test-cases.c

Quuxplusone commented 7 years ago

A conservative default for how many instructions ago a register must have last
been written to be considered cold would be the ROB size of whatever -mtune= is
tuning for.  That's the size of the out-of-order execution window.  By the time
the core issues an instruction, everything more than a ROB's worth of uops older
than it has already retired (because retirement happens in order on all designs,
to support precise exceptions).

The ROB size on Intel Skylake is 224 fused-domain uops.  Every instruction
decodes to at least one (except macro-fused compare-and-branch), even if they
don't need an execution unit.  224 instructions would be a reasonable
approximation, too (except that rep stosb and other micro-coded insns should
count as many more, at least 1 per 32B of data movement).

Low-power architectures have *much* smaller ROB sizes.  Older uarches have
smaller ROBs, too.  Intel Sandybridge was 168 entries.
(http://www.realworldtech.com/sandy-bridge/5/).

A reasonable default for -mtune=generic with some future-proofing might be
300.  In *most* cases, the register we pick won't be the one that has the
most stalls.  But unless we want to predict (or use profiling info to find)
which loads will cache-miss often, and which dependency-chains will be
bottlenecks, using a shorter default could cut down the effective out-of-order
window for a given instruction.  OTOH, if the compile-time cost scales with the
number, we can probably pick a lower number (like 64) without serious impact in
*most* cases.  Or not get nearly as fancy as full dep-chain tracking, and just
look for loop-invariant registers for code inside loops.
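
A minimal sketch of the threshold idea, in plain C rather than anything resembling LLVM's real data structures.  All names here (`cold_tracker`, `tracker_pick_cold`, and so on) are hypothetical, and distance is counted in instructions as a stand-in for fused-domain uops.

```c
#include <assert.h>

#define NUM_VEC_REGS 16  /* xmm0..xmm15; an assumption (AVX-512 has 32) */

/* Hypothetical tracking state: for each vector register, the index of the
 * instruction that last wrote it, counted from function entry. */
typedef struct {
    long last_write[NUM_VEC_REGS];
} cold_tracker;

/* At function entry nothing is known, so conservatively pretend the caller
 * wrote every register at instruction index 0. */
void tracker_init(cold_tracker *t)
{
    for (int r = 0; r < NUM_VEC_REGS; r++)
        t->last_write[r] = 0;
}

/* Record that the instruction at insn_index writes register reg. */
void tracker_note_write(cold_tracker *t, int reg, long insn_index)
{
    assert(reg >= 0 && reg < NUM_VEC_REGS);
    t->last_write[reg] = insn_index;
}

/* Return the least-recently-written register whose last write is at least
 * `threshold` instructions before `now`, or -1 if no register qualifies
 * (the caller would then fall back to a vxorps dep-breaker). */
int tracker_pick_cold(const cold_tracker *t, long now, long threshold)
{
    int best = -1;
    long best_age = -1;
    for (int r = 0; r < NUM_VEC_REGS; r++) {
        long age = now - t->last_write[r];
        if (age >= threshold && age > best_age) {
            best = r;
            best_age = age;
        }
    }
    return best;
}
```

With a threshold of 224, nothing is cold 100 instructions into a function, so `tracker_pick_cold(&t, 100, 224)` returns -1 right after `tracker_init`.  A real implementation would also have to treat a function call as a write to every register, and for the clobbering cases it would need liveness information on top of this.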

----

If anyone cares about Intel Nehalem and earlier, tracking register dependencies
and how recently-written they are could allow code-gen to minimize register-
read-port stalls.  e.g. group together instructions that read the same cold
register so they might issue in the same group.  Or if given the choice to
increment a loop counter before/after using it in an addressing mode, increment
it before if tuning for those old architectures.  (That kind of thing can hurt
other CPUs by requiring more out-of-order resources to find as much ILP,
though, so don't do this by default.)

For register-read port stalls, we'd be worried about the best/average case of
execution throughput, and Agner Fog says a register can become "cold" as few as
5 cycles after execution on Nehalem.

This is different from avoiding false dependencies, where we're worried about
the worst-case of stalls or loop-carried dependencies on the register we want
to use.

Quuxplusone commented 7 years ago

Tracking dep chains would also let the compiler make better decisions about whether to use the copy produced by a MOV, or whether to modify the original. I forget how well clang does with this, but integer MOV reg,reg has non-zero latency on architectures before Intel IvB and AMD Zen. Vector MOVDQA xmm,xmm has zero latency on Bulldozer and IvB and later.

Anyway, making a copy for later use and then modifying the original may shorten latency chains in some cases for some CPUs. I definitely see compiler output that copies a register and then uses that as the input to something else, instead of using the original. I assume this behaviour was baked-in when Intel P6-family register-read port limitations made it desirable.

I don't think there's any downside for the number of physical-register-file entries used. The CPU can free them once no future instructions need them, if they're not part of the architectural state (i.e. the state after the last retired instruction), so I don't think it's worse to make more architectural registers be "recently-written" for most designs.

Quuxplusone / LLVMBugzillaTest

feature rq: track "cold" vector registers for use as don't-care sources to avoid false dependencies #31835