Open abrown opened 4 years ago
Since splat is a high-use instruction, is there a different semantics that would cover most of its uses and also have better codegen? Or would simplifying the codegen for splats just lead to proportionally more complex user code to regain their current functionality?
This is very specifically an Intel ISA quirk because `pshufd`/`pshufw`/`pshufb` all have different semantics. In the specific case that you linked for `i16x8.splat`, the `pshufw` instruction only operates on 64-bit operands and not 128-bit operands, so there are a few different ways to synthesize this, but AFAIK they will all need at least two instructions. There are ways of doing this apart from the V8 implementation linked, but I suspect you would be looking at some combination of `pshuflw`, `pshufhw`, and/or `pshufd`, and possibly a move to an XMM register depending on the engine implementation.
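For concreteness, a typical SSE2-level lowering of `i16x8.splat` from a GPR looks something like this (register choices are illustrative, not from any particular engine):

```asm
; i16x8.splat with the scalar in eax (SSE2 sketch)
movd    xmm0, eax          ; GPR -> XMM, value lands in lane 0
pshuflw xmm0, xmm0, 0      ; replicate word 0 across the low 4 words
pshufd  xmm0, xmm0, 0      ; replicate the low dword across all 4 dwords
```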
Is there something specific you would like to propose to mitigate this, apart from getting rid of a high-value operation? If not, and this is more about highlighting a code-generation issue, I'm not sure anything can actually be done about it given the different semantics for the different bit widths on x64.
> This is very specifically an Intel ISA quirk because `pshufd`/`pshufw`/`pshufb` all have different semantics
I don't think the key is actually the different semantics; it's that these instructions can't address scalar registers directly and are forced to `MOV` or `PINSR*` first to get the value into a vector register in order to then shuffle. I have been looking around at `VBROADCAST`, `VPERM*`, `VSHUFF*`, etc., but I don't see a way to address a GPR directly as on ARM. I suspect that this is impossible, but perhaps there is some trick that I'm not yet aware of.
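To illustrate the asymmetry I mean, here is the same `i16x8.splat` on both ISAs (a sketch; register allocation is hypothetical):

```asm
; AArch64: one instruction, and the GPR source is addressed directly
dup     v0.8h, w0

; x64 (pre-AVX-512): the GPR must first be moved into a vector register
movd    xmm0, eax
pshuflw xmm0, xmm0, 0
pshufd  xmm0, xmm0, 0
```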
The different semantics are an issue for the specific `i16x8.splat` that you linked code to, but I agree that the additional `mov`/`pinsr*` instruction for splats is harder to get rid of even for memory operands, because the shuffle instructions can't load/insert a scalar into XMM registers from memory. The same applies to the load+splat instructions as well (for pre-AVX* codegen).
There doesn't seem to be anything actionable here, so closing this issue - please reopen if you have suggestions for more we can do here.
Can I get permissions to re-open this? I think the actionable part is to document the possible lowerings that improve the situation in the "implementor's guide" document (do we have one yet?). Specifically on x86, this high-use instruction can be:

- `MOVD` + `V[P]BROADCAST*` in AVX2
- `V[P]BROADCAST*` in AVX2 when the value to splat can be determined to be from a load operation
- `V[P]BROADCAST*` in various flavors of AVX512

Not sure if you need permissions to reopen as the original author for the issue, but reopening. This was previously discussed at a meeting (03/06), and there was an AI for the Intel folks who were discussing this at the meeting to follow up with PRs/issues to decide where this document should live, and what form it should take.
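To make those lowerings concrete, here are hedged sketches for the `i16x8.splat` case (registers and the memory address are illustrative; other lane widths use the corresponding `vpbroadcast*` form):

```asm
; AVX2, value in a GPR: still two instructions
vmovd        xmm0, eax
vpbroadcastw xmm0, xmm0

; AVX2, value known to come from a load: the load folds into one instruction
vpbroadcastw xmm0, word ptr [rdi]

; AVX-512BW (EVEX encoding): the broadcast can take the GPR source directly
vpbroadcastw xmm0, eax
```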
`splat` has 2- to 3-instruction lowerings in Cranelift and V8. I believe the "splat all ones" and "splat all zeroes" cases are a single-instruction lowering on both platforms, but it is unfortunate that other values of `splat` will incur a multi-instruction overhead, especially since `splat` would seem to be a high-use instruction.
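For reference, the single-instruction constant cases mentioned above are the usual x86 idioms (a sketch; both have no dependency on the previous register contents):

```asm
pxor    xmm0, xmm0    ; splat of all zeroes
pcmpeqd xmm0, xmm0    ; splat of all ones (compare-equal with itself)
```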