WebAssembly / simd

Branch of the spec repo scoped to discussion of SIMD in WebAssembly

Inefficient x64 codegen for splat #191

Open abrown opened 4 years ago

abrown commented 4 years ago

splat has 2- to 3-instruction lowerings in cranelift and v8. I believe the "splat all ones" and "splat all zeroes" cases have a single-instruction lowering in both engines, but it is unfortunate that all other splat values incur a multi-instruction overhead, especially since splat would seem to be a high-use instruction.
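
For reference, a minimal sketch of the typical SSE2 lowering, written as a C intrinsic (the helper name is hypothetical; the exact sequence varies by engine):

```c
#include <emmintrin.h> /* SSE2 */

/* i32x4.splat from a GPR: SSE2 code generators typically emit
 *   movd   xmm0, edi        ; GPR -> low lane of an XMM register
 *   pshufd xmm0, xmm0, 0x00 ; broadcast lane 0 to all four lanes
 * i.e. two instructions, versus a single `dup v0.4s, w0` on AArch64. */
__m128i splat_i32x4(int x) {
    return _mm_set1_epi32(x);
}
```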

tlively commented 4 years ago

Since splat is a high-use instruction, are there different semantics that would cover most of its uses and also have better codegen? Or would simplifying the codegen for splat just lead to proportionally more complex user code to regain the current functionality?

dtig commented 4 years ago

This is very specifically an Intel ISA quirk, because pshufd/pshufw/pshufb all have different semantics. In the specific case that you linked for i16x8.splat, the pshufw instruction only operates on 64-bit operands, not 128-bit operands, so there are a few different ways to synthesize this, but AFAIK they will all need at least two instructions. Apart from the V8 implementation linked, I suspect you would be looking at some combination of pshuflw, pshufhw, and/or pshufd, and possibly a move to an XMM register, depending on the engine implementation.
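
For illustration, a sketch of one such synthesis, assuming an SSE2 baseline (hypothetical helper name; this mirrors what C compilers commonly emit for _mm_set1_epi16 applied to a variable):

```c
#include <emmintrin.h> /* SSE2 */

/* i16x8.splat: with no 128-bit pshufw, a typical SSE2 sequence is
 *   movd    xmm0, edi        ; move the GPR into an XMM register
 *   pshuflw xmm0, xmm0, 0x00 ; replicate word 0 across the low 64 bits
 *   pshufd  xmm0, xmm0, 0x00 ; broadcast the low dword to all lanes
 * (pshufb with an all-zero mask is an SSSE3 alternative, at the cost
 * of loading a mask constant). */
__m128i splat_i16x8(short x) {
    return _mm_set1_epi16(x);
}
```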

Is there something specific you would like to propose to mitigate this, apart from getting rid of a high-value operation? If not, and this is more about highlighting a code generation issue, I'm not sure anything can actually be done about it given the different semantics for different bit widths on x64.

abrown commented 4 years ago

> This is very specifically an Intel ISA quirk because pshufd/pshufw/pshufb all have different semantics

I don't think the key issue is actually the different semantics; it's that these instructions can't address scalar registers directly and are forced to MOV or PINSR* first to get the value into a vector register before shuffling. I have been looking at VBROADCAST, VPERM*, VSHUFF*, etc., but I don't see a way to address a GPR directly as on ARM. I suspect that this is impossible, but perhaps there is some trick that I'm not yet aware of.
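
For completeness: later extensions do add a GPR-addressing form. A sketch, assuming AVX-512VL is available (hypothetical helper name):

```c
#include <immintrin.h>

/* With EVEX encoding (AVX-512VL), vpbroadcastd accepts a GPR source:
 *   vpbroadcastd xmm0, edi   ; one instruction, GPR -> all four lanes
 * Compiled with e.g. `clang -O2 -mavx512vl`, _mm_set1_epi32 lowers to
 * exactly that; without AVX-512 there is no GPR-source form. */
__m128i splat_i32x4_avx512(int x) {
    return _mm_set1_epi32(x);
}
```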

dtig commented 4 years ago

The different semantics are an issue for the specific i16x8.splat that you linked code to, but I agree that the additional mov/pinsr* instruction for splats is harder to get rid of even for memory operands, because neither instruction can load/insert into an XMM register directly from memory; the same applies to the load+splat instructions as well (for pre-AVX* codegen).
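
A sketch of the memory-operand case (hypothetical helper; assuming an SSE baseline versus AVX1):

```c
#include <immintrin.h>

/* A 32-bit load+splat, pre-AVX: two instructions, e.g.
 *   movss  xmm0, [rdi]       ; load 32 bits into the low lane
 *   shufps xmm0, xmm0, 0x00  ; broadcast lane 0 to all four lanes
 * With AVX1, `vbroadcastss xmm0, [rdi]` does it in one (memory source
 * only; a register source requires AVX2). */
__m128 load_splat_f32x4(const float *p) {
    return _mm_set1_ps(*p);
}
```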

dtig commented 4 years ago

There doesn't seem to be anything actionable here, so closing this issue - please reopen if you have suggestions for more we can do here.

abrown commented 4 years ago

Can I get permissions to re-open this? I think the actionable part is to document the possible lowerings that improve the situation in the "implementor's guide" document (do we have one yet?). Specifically on x86, this high-use instruction can be:

dtig commented 4 years ago

Not sure if you need permissions to reopen as the original author of the issue, but reopening. This was previously discussed at a meeting (03/06), and there was an action item for the Intel folks who were discussing this at the meeting to follow up with PRs/issues to decide where this document should live and what form it should take.