Some 2-vector vector select identity shuffles may be better represented as moves #40403


Quuxplusone commented 5 years ago
Bugzilla Link PR41433
Status NEW
Importance P enhancement
Reported by Roman Lebedev (lebedev.ri@gmail.com)
Reported on 2019-04-08 13:41:12 -0700
Last modified on 2019-04-08 14:31:42 -0700
Version trunk
Hardware PC Linux
CC craig.topper@gmail.com, daan@dsprenkels.com, llvm-bugs@lists.llvm.org, llvm-dev@redking.me.uk, spatel+llvm@rotateright.com
Fixed by commit(s)
Attachments
Blocks
Blocked by
See also PR41429, PR39603
Split off from https://bugs.llvm.org/show_bug.cgi?id=41429

https://godbolt.org/z/_n1ggH

#include <immintrin.h>

// Copies only elements 2 and 3 of the 4 x i64 vector, i.e. the high
// 128 bits; the subscripting relies on Clang's vector extensions.
void example(__m256i * __restrict__ dest, const __m256i * __restrict__ a) {
    (*dest)[2] = (*a)[2];
    (*dest)[3] = (*a)[3];
}

Here we never touch the low half of dest; we only replace the high half
of dest (elements 2 and 3, i.e. bytes 16-31) with the high half of `a`.

The naive asm could be:

  vmovaps ymm0, ymmword ptr [rdi]
  vblendps ymm0, ymm0, ymmword ptr [rsi], 240 # ymm0 = ymm0[0,1,2,3],mem[4,5,6,7]
  vmovaps ymmword ptr [rdi], ymm0
  vzeroupper
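
For reference, a source-level equivalent of that blend sequence, written with
AVX intrinsics (just a sketch; the function name is made up, and the casts
through __m256 are only there to reach vblendps, since the integer blend
intrinsic would require AVX2):

#include <immintrin.h>

void example_blend(__m256i * __restrict__ dest, const __m256i * __restrict__ a) {
    __m256 d = _mm256_castsi256_ps(_mm256_load_si256(dest));
    __m256 s = _mm256_castsi256_ps(_mm256_load_si256(a));
    // imm8 = 0xF0 (240): keep dest's low four lanes, take a's high four.
    __m256 r = _mm256_blend_ps(d, s, 0xF0);
    _mm256_store_si256(dest, _mm256_castps_si256(r));
}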

But we can also produce:

        vmovaps xmm0, xmmword ptr [rsi + 16]
        vmovaps xmmword ptr [rdi + 16], xmm0
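
And the source-level equivalent of the two-instruction version, touching only
the upper 16 bytes (again a sketch with a made-up name; the +1 pointer
arithmetic assumes the usual 32-byte-aligned __m256i layout, so the
half-vector accesses stay 16-byte aligned):

#include <immintrin.h>

void example_half(__m256i * __restrict__ dest, const __m256i * __restrict__ a) {
    // Load and store only the high 128-bit half, i.e. bytes 16..31.
    __m128i hi = _mm_load_si128((const __m128i *)a + 1);
    _mm_store_si128((__m128i *)dest + 1, hi);
}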

I'm not quite sure what the exact criteria are for when this transform is
profitable, though.
Quuxplusone commented 5 years ago

This should be very good for x86 perf (replacing 2 ymm memops + a ymm shuffle with 2 xmm memops). Probably helps other targets too.

If we solve bug 41429, we'll have the vector select in the form of a shuffle in IR. That seems close enough to the scalar select-store patterns to consider handling it as a generic optimization (DSE? EarlyCSE? see bug 39603), since it always eliminates ops?
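
For comparison, the scalar select-store shape being alluded to might look like this (a hypothetical illustration, not code taken from either bug): the select's false arm just stores back the value that was already there, so the whole thing reduces to a conditional plain store and the load becomes removable.

// Scalar analogue of the identity-shuffle store: the false arm of the
// select re-stores *p's current value, so this is equivalent to
// "if (c) *p = x;" and the load of *p can be eliminated.
void select_store(int *p, int x, int c) {
    *p = c ? x : *p;
}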

If that's too much of a stretch, then we either have to enhance the SDAG load/store splitting/combining or write a custom pass to do this.