Quuxplusone / LLVMBugzillaTest


_mm_movehl_ps pessimized to movaps/shufpd instead of movhlps #26490

Open Quuxplusone opened 8 years ago

Quuxplusone commented 8 years ago
Bugzilla Link PR26491
Status NEW
Importance P normal
Reported by Peter Cordes (peter@cordes.ca)
Reported on 2016-02-05 06:43:07 -0800
Last modified on 2018-09-14 15:22:20 -0700
Version 3.7
Hardware PC Linux
CC craig.topper@gmail.com, david.l.kreitzer@intel.com, hjl.tools@gmail.com, llvm-bugs@lists.llvm.org, llvm-dev@redking.me.uk, spatel+llvm@rotateright.com, zia.ansari@intel.com
Fixed by commit(s)
Attachments
Blocks
Blocked by
See also PR25808, PR22024, PR27573, PR38929
clang sometimes uses a different instruction than the "normal" one for the
intrinsic.  Usually this is an improvement, but not in this case, where it ends
up wasting a movaps and using a bigger instruction.

float hsum_ps(__m128 v) {
    __m128 tmp  = _mm_movehdup_ps(v);     // dest of MOVSHDUP is write-only
    __m128 sums = _mm_add_ps(v, tmp);
    tmp         = _mm_movehl_ps(tmp, sums);
    sums        = _mm_add_ss(sums, tmp);
    return  _mm_cvtss_f32(sums);
}

compiles to (clang 3.7.1 -O3 -march=core2):

        movshdup        xmm1, xmm0      # xmm1 = xmm0[1,1,3,3]
        addps   xmm1, xmm0
        movaps  xmm0, xmm1
        shufpd  xmm0, xmm0, 1           # xmm0 = xmm0[1,0]
        addss   xmm0, xmm1
        ret

movhlps is very compact: 0F 12 /r, vs. shufpd being a 5-byte insn.  At least
it doesn't use shufps, which is slow on Core2.  A similar sequence after a loop
produced a movaps/unpckhpd instead of the movhlps.

What I was hoping for (and which gcc produces) is the obvious literal
translation from intrinsics to instructions:

        movshdup xmm1, xmm0
        addps   xmm0, xmm1
        movhlps xmm1, xmm0
        addss   xmm0, xmm1
        ret

I used movshdup first because its destination is write-only.  I can avoid a
movaps without needing the compiler to keep some other variable live so I can
movhlps into it.  (And so I could put it in a function that only takes one
parameter.)  I can't use _mm_undefined_ps() without risk of denormals or NaN
slowdowns.
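
For reference, a sketch of the _mm_undefined_ps() variant I'm avoiding
(hsum_ps_undef is just an illustrative name; it assumes a compiler that
provides _mm_undefined_ps()):

#include <immintrin.h>

// The undefined upper lanes fed into ADDPS could hold denormals or NaNs
// and slow the packed add down on some CPUs.
float hsum_ps_undef(__m128 v) {
    __m128 hi   = _mm_movehl_ps(_mm_undefined_ps(), v); // low half = high half of v
    __m128 sums = _mm_add_ps(v, hi);                     // upper lanes add garbage
    __m128 odd  = _mm_movehdup_ps(sums);                 // [sums1, sums1, sums3, sums3]
    return _mm_cvtss_f32(_mm_add_ss(sums, odd));         // v0+v1+v2+v3
}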
Quuxplusone commented 8 years ago
[cc'ing Simon for shuffle lowering expertise]

The fact that the source uses _mm_movehl_ps() -> movhlps is lost immediately in
the clang headers because we translate this operation to an LLVM IR
shufflevector:

static __inline__ __m128 __DEFAULT_FN_ATTRS
_mm_movehl_ps(__m128 __a, __m128 __b)
{
  return __builtin_shufflevector(__a, __b, 6, 7, 2, 3);
}

We've had some discussion about whether this is the right approach or not
because it means that even unoptimized code may not match what the programmer
expected based on the intrinsics used.

But with optimization on, the goal is definitely to have the compiler produce
something equivalent or better than the specified x86 intrinsic.

This is the IR heading to the backend:

define float @hsum_ps(<4 x float> %v) #0 {
entry:
  %shuffle.i = shufflevector <4 x float> %v, <4 x float> undef, <4 x i32> <i32 1, i32 1, i32 3, i32 3>
  %add.i15 = fadd <4 x float> %shuffle.i, %v
  %vecext.i13 = extractelement <4 x float> %add.i15, i32 2
  %vecext1.i = extractelement <4 x float> %add.i15, i32 0
  %add.i = fadd float %vecext1.i, %vecext.i13
  ret float %add.i
}
Quuxplusone commented 8 years ago

Probably a better statement of this bug is that not using movhlps is a missed optimization for any case where it could be used. At 3 bytes, it's always at least as short as any other vector insn, and it's fast on every CPU in Agner Fog's tables.

According to Agner Fog's tables, movhlps is faster than shufpd on old CPUs that only have 64-bit vector hardware (AMD Bobcat, K8, Intel Pentium-M). Otherwise they're always the same.

We absolutely want to use it when the source expressed the same shuffle with _mm_shuffle_ps() or _mm_shuffle_pd().

It's especially useful without AVX, where it can save a mov instruction. (Although doing that with a typical _mm_shuffle_ps(same,same, imm8); something_ps(); _mm_cvtss_f32() is very tricky: Even though some elements have their results unused, they need to have valid FP data that won't slow down the packed instruction with denormals, NaNs, or division by zero.)
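
For example, this source-level spelling of the same data movement should also get
lowered to movhlps when only the low element is consumed afterwards (sketch;
high_plus_low is just an illustrative name):

#include <xmmintrin.h>

// _mm_shuffle_ps(v, v, _MM_SHUFFLE(3, 2, 3, 2)) produces [v2, v3, v2, v3]:
// the low half is exactly what movhlps would put there.
static inline float high_plus_low(__m128 v) {
    __m128 hi = _mm_shuffle_ps(v, v, _MM_SHUFFLE(3, 2, 3, 2));
    return _mm_cvtss_f32(_mm_add_ss(v, hi));   // v0 + v2
}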


Perhaps recognizing multiple variations on the whole horizontal-sum pattern would be useful, since it's often expressed with hadd_ps (see the sketch below), or by storing to an array and summing. On current CPUs, the only advantage of 2x haddps is x86 code-size. On modern Intel CPUs with a uop cache, extra uops aren't always just a local slowdown: they could evict other code somewhere else.

Recognizing a horizontal-sum pattern would let LLVM output whatever the optimal sequence is. (I think it's probably this one I've come up with, for targets that support SSE3 but not AVX. This might still win for code-size with AVX, given the lack of immediate bytes).
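
For reference, a sketch of the hadd_ps spelling of the same sum (requires SSE3;
hsum_ps_hadd is just an illustrative name):

#include <pmmintrin.h>

// Same horizontal sum with 2x HADDPS: smaller x86 code, but more uops than
// the movshdup/movhlps sequence on current CPUs.
float hsum_ps_hadd(__m128 v) {
    __m128 h = _mm_hadd_ps(v, v);   // [v0+v1, v2+v3, v0+v1, v2+v3]
    h = _mm_hadd_ps(h, h);          // every element = v0+v1+v2+v3
    return _mm_cvtss_f32(h);
}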

Quuxplusone commented 8 years ago
(In reply to comment #2)

> Perhaps recognizing multiple variations on the whole horizontal-sum pattern
> would be useful, since it's often expressed with hadd_ps, or by storing to
> an array and summing.

Yes - there's work going on currently to make this better. Please see bug 25808
and related bugs. The IR pattern here is similar to what we're hoping to match
in that bug, but there are 2 possible complications:

1. Your code already has one vector fadd; we'll need to be flexible enough to
match this case.

2. Your code is FP; I'm not sure if we can do the transform without -ffast-math.
Quuxplusone commented 8 years ago
Looking at this now.

I thought this was just a side effect of the shuffle widening, but it turns out
we don't lower to MOVHLPS/MOVLHPS even on SSE1-only hardware. We do combine to
them, but that only works for unary shuffles where we're splatting lower/upper
halves. (Note: I'm wanting to improve target shuffle combining but this is a
long way off right now - D16683 is an early step).

One of the reasons for poor support is probably that they can't fold loads (which
I think is the source of their smaller encoding?), so it'd be necessary to handle
custom memory folding. We might be able to match some of the domain equivalents
(unpck*pd/punpck*qdq) as well while we're at it.
Quuxplusone commented 8 years ago
(In reply to comment #4)
> One of the reasons for poor support is probably that they [MOVHLPS/MOVLHPS]
> can't fold loads (which I think is the source of their smaller encoding?).

No, see the couple paragraphs about that below.

> so it'd be necessary
> to handle custom memory folding. We might be able to match some of the
> domain equivalents (unpck*pd/punpck*qdq) as well while we're at it.

Yeah, you'll need custom folding handling for MOVHLPS/MOVLHPS.  They *can*
actually work as loads, but the memory-operand form of the same opcode is
listed as a separate instruction.  Instead of an alignment-required m128 that
it only reads part of (like [p]unpck*), they use an m64 memory operand (and
thus alignment is not required, even without AVX).  You have to add a
displacement of 8 if you want the high half of a vector in memory.

   0F 12 /r MOVHLPS xmm1, xmm2   merge a new lower half from high(xmm2)
   0F 12 /r MOVLPS  xmm, m64     merge a new lower half from memory
   0F 13 /r MOVLPS  m64, xmm     Same as a movsd store (but possibly strange latency on Bulldozer?  see last section)

   0F 16 /r MOVLHPS xmm1, xmm2   merge a new high half from low(xmm2)
   0F 16 /r MOVHPS  xmm, m64     merge a new high half from memory
   0F 17 /r MOVHPS  m64, xmm     store upper half

66 0F 16 /r MOVHPD  xmm, m64   Identical to movhps, don't use
  same for the other PD versions: same data movement and CPUs don't care.
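
At the source level those m64 forms are exposed through intrinsics; a sketch
(the helper names are mine):

#include <xmmintrin.h>

// The m64 memory forms; no 16-byte alignment required.
__m128 merge_low_from_mem (__m128 v, const __m64 *p) { return _mm_loadl_pi(v, p); } // MOVLPS xmm, m64
__m128 merge_high_from_mem(__m128 v, const __m64 *p) { return _mm_loadh_pi(v, p); } // MOVHPS xmm, m64

// Grabbing the high half of a __m128 stored in memory is the same load with
// a displacement of 8 bytes.
__m128 low_from_high_half(__m128 v, const __m128 *vec_in_mem) {
    const __m64 *hi = (const __m64 *)((const char *)vec_in_mem + 8);
    return _mm_loadl_pi(v, hi);   // MOVLPS xmm, [mem+8]
}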

AFAICT, unpcklpd xmm,xmm does exactly the same thing as movlhps, but takes one
more byte in the non-VEX encoding (because it needs a 66 prefix for PD).  This
may be why there is no movlhpd / movhlpd (reg,reg).

unpckhpd is like movhlps, but stores the merged result in the other operand.
(So with AVX, you can use either one and just reverse the order of the two
source operands.)
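
In intrinsic terms (sketch, my helper name; the casts are free):

#include <emmintrin.h>

// Same data movement as _mm_movehl_ps(a, b), written with the PD unpack;
// note the operand order is reversed, as described above.
static inline __m128 movehl_via_unpckhpd(__m128 a, __m128 b) {
    return _mm_castpd_ps(_mm_unpackhi_pd(_mm_castps_pd(b), _mm_castps_pd(a)));
}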

AFAIK, there are no CPUs that care whether you use PS or PD shuffles between PS
or PD calculations.  There's only extra bypass delay when using integer
shuffles between FP instructions.  (Or on AMD and Core2, all shuffles are in
the integer domain so there's always bypass delay.)  SHUFPS is a 2-register
shuffle that you can use on FP or integer data with no penalty (on any HW that
Agner Fog has tested).  IDK if LLVM wants to take advantage of that, since it
might leave some people confused.

Some CPUs (i.e. Nehalem) care whether you use movaps or movdqa for reg,reg
moves, but not for load/store.  AMD has bypass delays, but not for movaps/dqa
because they're done at register-rename time (like Intel IvB).

----

The short encoding for movhlps is because they hadn't used up the 0F XX coding
space for packed-single yet, and it has no immediate: SSE1 PS instructions are
just 2B opcodes + mod/rm [+immediate], so the common case is 3B.  (Ignoring SIB
byte for 2-reg addressing modes, REX prefix, displacement, and segment
overrides for thread-local, since these are fixed overheads for any insn).

*PD double and P* integer instructions up to SSE3 are 3B opcodes, plus the same
up-to-3B for operands.   SSSE3 and SSE4 use 4B opcodes (66 0F 38/3A XX coding
spaces).  SSSE3 / SSE4 insns in AVX have to use 3B long-VEX prefixes even when
their operands only use low registers (that don't need any REX bits).

It appears that 0F 17 and 0F 13 (the store versions) with reg,reg operands are
illegal instructions, so they wasted some coding space here.  It's pretty hard
to use for anything else, though, since splitting the reg,reg and reg,mem
versions of an instruction between different opcodes would waste decoder
transistors.

----

If Agner Fog's tables are correct, Bulldozer/Piledriver have 1c higher latency
for a movlps store, vs. a movsd store.  He can only measure the combined
store/reload latency, and says that load and store timings have to split the
total round trip, so maybe he just tested differently with those insns.  Still,
it's odd.  Agner's table lists those insns as the same on other CPUs I looked
at, including Steamroller.  IDK why AMD wouldn't just decode those two opcodes
to the same internal m-op.  AMD does apparently tag FPU data in vector regs
with invisible extra data for float vs. double, but Agner says the penalty is
only observed when the instruction reading the data is an FP calculation (e.g.
addps / addpd.)  There's a separate "store" domain, so data has to leave the FP
domain on the way to a store anyway.

So, point is, movlps stores might be a good choice instead of movsd, to save an
instruction byte.  It has fewer prefixes as part of the opcode, which helps
Atom/Silvermont's decoders that choke on insns with more than 3 prefixes.

Note that the load version of movsd zeros the rest of the reg, but movlps
merges.
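
In intrinsic terms the load-behaviour difference looks like this (sketch, my
names):

#include <emmintrin.h>

// MOVSD as a load zeroes the upper element; the MOVLPD/MOVLPS form merges
// into whatever was already in the register.
__m128d load_zeroing(const double *p)               { return _mm_load_sd(p); }       // [*p, 0.0]
__m128d load_merging(__m128d keep, const double *p) { return _mm_loadl_pd(keep, p); } // [*p, keep1]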
Quuxplusone commented 8 years ago
Also, is the following a new bug, or another symptom of shuffle
work-in-progress:

float add_in_wrong_order(__m128 v) {
    __m128 shuf = _mm_movehl_ps(v, v);
    __m128 sums = _mm_add_ps(shuf, v);  // Using _ss doesn't help
    return        _mm_cvtss_f32(sums);
}

with clang 3.7.1 -xc -O3 -Wall -fverbose-asm -march=haswell -ffast-math -mno-avx
compiles to (godbolt: http://goo.gl/sV9tMR)

        movaps  xmm1, xmm0
        movhlps xmm1, xmm1              # with add_ss, this is shufpd x1,x1, 1
        addps   xmm1, xmm0
        movaps  xmm0, xmm1

I've seen this failure to generate the result in the desired register before in
clang output, but didn't get around to reporting it.  I forget if I've seen
this with integer registers, or just with non-AVX vectors.

---

Would it be possible for LLVM to see that _mm_cvtss_f32 is only taking the low
element, and propagate that back to the inputs to add_ps?  Probably not, since
you still need to potentially fault on a signalling NaN in elements 1..3 if the
MXCSR has that enabled.  Still, doesn't -ffast-math mean you *don't* have to
care what happens to NaNs?

--------

As long as you don't cause slowdowns from doing FP calcs on garbage data,
-ffast-math lets you do a lot.  I think this sequence would be valid for that
source, with -ffast-math:

     movhlps    xmm1, xmm0   # upper half is garbage
     addss      xmm0, xmm1   # so don't use it

There's a false dependency on old contents of xmm1.  When inlined, you can use
any long-dead register, or better: one that had to be ready at some point
before v was ready.

gcc likes to xor-zero registers to avoid false dependencies (e.g. for popcnt if
it doesn't want to overwrite the src).
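
At the source level, one way to express that is to hand movhlps a freshly
xor-zeroed register (sketch only; whether the backend keeps the xor-zeroing is
up to it, and the function name is mine):

#include <xmmintrin.h>

// A zeroed second input avoids the false dependency and guarantees the
// unused upper lanes hold harmless FP values (0.0).
float add_high_to_low(__m128 v) {
    __m128 hi = _mm_movehl_ps(_mm_setzero_ps(), v); // low half = high half of v
    return _mm_cvtss_f32(_mm_add_ss(v, hi));        // v0 + v2
}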
Quuxplusone commented 8 years ago

Thanks for the update Peter, I've put a patch up for review to fix movlhps/movhlps lowering and load folding on SSE1 targets:

http://reviews.llvm.org/D16956

There are various minor tweaks that could be done (domains and commutation come to mind) in follow-up patches before tackling the target shuffle combine improvements necessary for SSE2+ targets.

Regarding your other points (add_in_wrong_order and store patterns) - please can you raise separate bugs? It can get crowded otherwise....

Quuxplusone commented 8 years ago
(In reply to comment #7)
> Regarding your other points (add_in_wrong_order and store patterns) - please
> can you raise separate bugs? It can get crowded otherwise....

Thanks for the guidance on whether that was a separate bug or not.  Reported as
https://llvm.org/bugs/show_bug.cgi?id=26515

The closest I could find to an existing similar bug is
https://llvm.org/bugs/show_bug.cgi?id=15705, but that's not a case where the
extra mov is just from the wrong choice of operand order for a commutative op.
Quuxplusone commented 8 years ago
(In reply to comment #6)
> gcc likes to xor-zero registers to avoid false dependencies (e.g. for popcnt
> if it doesn't want to overwrite the src).

LLVM does this too in some cases (ref: hasPartialRegUpdate() and
hasUndefRegUpdate() in X86InstrInfo.cpp).

This also came up in the discussion of bug 22024.
Quuxplusone commented 8 years ago

Candidate Fix: https://reviews.llvm.org/D23027

Quuxplusone commented 8 years ago
Committed in https://reviews.llvm.org/rL279430

Current output:

        movshdup %xmm0, %xmm1    # xmm1 = xmm0[1,1,3,3]
        addps    %xmm0, %xmm1
        movaps   %xmm1, %xmm0
        movhlps  %xmm0, %xmm0    # xmm0 = xmm0[1,1]
        addss    %xmm1, %xmm0
        retq
Quuxplusone commented 6 years ago
Current output:
        movshdup {{.*#+}} xmm1 = xmm0[1,1,3,3]
        addps %xmm0, %xmm1
        movaps %xmm1, %xmm0
        unpckhpd {{.*#+}} xmm0 = xmm0[1],xmm1[1]
        addss %xmm1, %xmm0
        retq

Adding Craig, who just put up D52109 to deal with this type of unwanted move -
this bug's test is in haddsub-3.ll, but unfortunately it doesn't seem to be
fixed by D52109.
Quuxplusone commented 6 years ago

The issue here is sort of weird because we're actually recognizing that the movhlps only needs one of its inputs. InstCombine made it an extractelement in IR and the addss became a scalar add. It happens that our extractelement lowering here turns it back into movhlps/unpckhpd with both inputs the same. The extra move is impossible to get rid of in register allocation because the movhlps/unpckhpd has no dependency on the movshdup output.