Quuxplusone / LLVMBugzillaTest


_mm256_shuffle_ps(_,_,0x44) compiles to vunpcklpd with -march=icelake-client #43083

Open Quuxplusone opened 4 years ago

Quuxplusone commented 4 years ago
Bugzilla Link PR44113
Status NEW
Importance P enhancement
Reported by Volker Hirthammer (DrTroll@gmx.de)
Reported on 2019-11-22 04:38:43 -0800
Last modified on 2019-11-23 04:24:20 -0800
Version 9.0
Hardware PC All
CC blitzrakete@gmail.com, craig.topper@gmail.com, dgregor@apple.com, erik.pilkington@gmail.com, llvm-bugs@lists.llvm.org, llvm-dev@redking.me.uk, peter@cordes.ca, richard-llvm@metafoo.co.uk, spatel+llvm@rotateright.com
See also PR44125
In a recent Stack Overflow discussion (link:
https://stackoverflow.com/questions/58954801/avx-equivalent-for-mm-movelh-ps)
we found out that Clang compiles the intrinsic _mm256_shuffle_ps(_,_,0x44) to
vunpcklpd. This is a valid optimization for Skylake and other processors on
which shuffles and unpacks have identical throughput. But as the Stack
Overflow user Peter Cordes mentioned in his answer, Ice Lake processors have
higher throughput for shuffles than for unpacks:

unpack:
https://www.uops.info/table.html?search=vunpcklp&cb_lat=on&cb_tp=on&cb_uops=on&cb_ports=on&cb_SKL=on&cb_ICL=on&cb_measurements=on&cb_iaca30=on&cb_doc=on&cb_avx=on&cb_avx2=on

shuffle:
https://www.uops.info/table.html?search=vshufp&cb_lat=on&cb_tp=on&cb_uops=on&cb_ports=on&cb_SKL=on&cb_ICL=on&cb_measurements=on&cb_iaca30=on&cb_doc=on&cb_avx=on&cb_avx2=on

Therefore, replacing the shuffle is counterproductive on an Ice Lake
processor. Even with -march=icelake-client, Clang replaces the shuffle with
vunpcklpd:

https://godbolt.org/z/MqtJXY

The same goes for _mm256_shuffle_ps(_,_,0xee), which compiles to vunpckhpd.
Quuxplusone commented 4 years ago

Unfortunately we have very poor support for target/scheduler-model specific instruction selection/combining other than what is based on feature flags.

I'm not sure we want to go the route of FeaturePreferSHUFPvsUNPCK etc.

What could work in this specific case is a way to remove the bitcast from <8 x float> to <4 x double> for the unpack and the bitcast back again, instead scaling the shuffle mask so the combine would recognise the shufps pattern, but that's not likely to generalize well. Incidentally, it could help us with the AVX512-specific TODO we currently have that prevents shuffles of different scalar sizes because they interfere with mask predicates.

In conclusion, this is going to be annoying to fix satisfactorily but I'll have a go.

Quuxplusone commented 4 years ago

Also, I don't think anyone has contributed icelake scheduler models yet - the targets still reference the skylake models.

Quuxplusone commented 4 years ago

Weird. The integer vpunpcklqdq and vpunpckldq are on ports 1 and 5 on icelake according to uops.info. So it's just the FP versions that are crippled?

Quuxplusone commented 4 years ago

vpermilpd/ps with immediate are also only port 1, but pshufd is on port 1 and 5?

Quuxplusone commented 4 years ago
I think we should just go ahead and emit code that will run faster on IceLake
across the board.  When AVX is available we're at most saving 1 or 2 bytes of
code size (the immediate and size of VEX prefix) by using things like vunpcklpd
vs vshufps.

Ideally we'd still save that code size with -march=znver1 or whatever
(especially non-Intel tuning; earlier Intel tuning is more justified in caring
about ICL), but until we can make shuffle optimization target-dependent, it
might be best to make it care about ICL.

For legacy SSE there are reasons (like Core 2 Merom, and K8) to avoid "slow
shuffles" by using unpcklpd instead of shufps, and also some instructions can
allow avoiding a movdqa or movaps, e.g. pshufd is a copy-and-shuffle.  But with
AVX all those reasons go away; other than KNL, it doesn't matter very much
which shuffle you use, except for lane-crossing on AMD pre Zen2 being slow.

That was true until IceLake, so running well on IceLake can be the deciding
factor over code size, for the benefit of making more future-proof binaries.
If we start now, IceLake will still be widespread by the time clang/LLVM
versions that do this, and their resulting binaries, are widespread.

(In reply to Craig Topper from comment #4)
> vpermilpd/ps with immediate are also only port 1, but pshufd is on port 1
> and 5?

I'm seeing vpermilps/d as port5 only on ICL, on uops.info.
https://www.uops.info/table.html?search=vpermilp&cb_lat=on&cb_tp=on&cb_uops=on&cb_ports=on&cb_ICL=on&cb_measurements=on&cb_iaca30=on&cb_avx=on&cb_avx2=on&cb_avx512=on

p1 only wouldn't make sense; the vector ALUs on that port are shut down when
512-bit uops are in flight (at least on SKX, and it looks like also on ICL
based on FMA throughput).  Only p1's integer ALUs are still active in that
case. (Including all the usual 1c-latency stuff like add; p1 has the only
integer ALU that runs 3-cycle-latency uops like slow-LEA, imul, popcnt / lzcnt,
and pdep.)

Anyway, it looks like ICL has an extra shuffle unit on port 1 for "common" 128
and 256-bit in-lane shuffles.  No clue why it can't run the same data movement
via a different uop (implicit vs. immediate control).

Maybe the logic to turn uops into shuffle-control signals is actually a burden,
and keeping that simpler in the port1 bonus shuffle ALU might have been part of
keeping it lighter weight (as well as not handling lane-crossing or 512-bit
shuffles at all).

The FP vs. integer difference might make sense: there are 2 separate forwarding
networks for vector data, SIMD-integer and FP.  This is why bypass delays
exist.  But it limits the combinatorial explosion of possible muxing that's
required.  The port 5 shuffle unit needs to be on both networks (instead of
having separate shuffle ALUs for integer shuffles vs. FP shuffles) but that's
expensive and maybe they didn't do it completely for all of port 1.  But the
fact that it can run shufps xmm/ymm presumably means at least some of it is
connected to FP forwarding networks.
Quuxplusone commented 4 years ago

I mistyped that about vpermilps/pd. I meant port 5. I think I had "one port" in my head at the same time and ended up writing "port 1".