Closed (by Maratyszcza, 3 years ago)
I think this might be an accidental omission - I didn't actually know that SSE2 supports this natively, and I didn't have use cases for 64-bit masks. So it didn't occur to me to propose that because I thought SSE2 lowering would have to scalarize.
This is implemented in LLVM (but not Binaryen) as __builtin_wasm_bitmask_i64x2 and will be available in Emscripten in a few hours. I don't know if we need benchmarking for this instruction or not, though.
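For anyone who wants to try it from C/C++ once that lands, usage would look roughly like the sketch below. It assumes a recent Emscripten with -msimd128; the __builtin_wasm_bitmask_i64x2 name is the one quoted above, and the wasm_i64x2_bitmask wrapper is assumed to follow the naming pattern of the other bitmask intrinsics in wasm_simd128.h, so check your header version.

```c
/* Illustrative sketch only; compile with a recent Emscripten and -msimd128. */
#include <wasm_simd128.h>
#include <stdint.h>

uint32_t high_bits_of_i64x2(v128_t v) {
  /* Collects the top bit of each 64-bit lane into bits 1:0 of the result. */
  return wasm_i64x2_bitmask(v); /* wraps __builtin_wasm_bitmask_i64x2 */
}
```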
I'd like to propose an alternative Arm64 mapping:
ushr Vtmp.2d, Vx.2d, #63
mov Xy, Vtmp.d[0]
mov Xtmp, Vtmp.d[1]
add Wy, Wy, Wtmp, lsl #1
The main advantage of this sequence is that the middle two instructions are independent, so they can execute in parallel. In fact, an essentially scalarized version might execute even faster (and requires only one temporary register), but it is one instruction longer:
mov Xy, Vx.d[0]
mov Xtmp, Vx.d[1]
lsr Xy, Xy, #63
lsr Xtmp, Xtmp, #63
add Wy, Wy, Wtmp, lsl #1
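For reference, here is a rough C rendering of the first sequence using ACLE NEON intrinsics (an illustrative sketch only; the helper name is mine, and an engine would of course emit the assembly above directly):

```c
#include <arm_neon.h>
#include <stdint.h>

static inline uint32_t i64x2_bitmask_ushr(uint64x2_t x) {
  uint64x2_t top = vshrq_n_u64(x, 63);       /* ushr Vtmp.2d, Vx.2d, #63 */
  uint64_t lane0 = vgetq_lane_u64(top, 0);   /* mov Xy, Vtmp.d[0]        */
  uint64_t lane1 = vgetq_lane_u64(top, 1);   /* mov Xtmp, Vtmp.d[1]      */
  return (uint32_t)(lane0 + (lane1 << 1));   /* add Wy, Wy, Wtmp, lsl #1 */
}
```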
Forgot to say that this is prototyped in v8 (x64) https://chromium.googlesource.com/v8/v8/+/ceee7cfe7260152fd90c66657b8476b9d3a8b915
@akirilov-arm would you suggest a similar mapping for ARMv7 as well?
@ngzhian Something like this (tmp2 = tmp * 2, tmp3 = tmp * 2 + 1):
vshr.u64 Qtmp, Qx, #63
vmov.32 Ry, Dtmp2[0]
vmov.32 Rtmp, Dtmp3[0]
add Ry, Ry, Rtmp, lsl #1
Note that I haven't tested the sequence and I am also not sure about its performance characteristics - extra latency may crop up due to the SIMD & FP register overlapping rules in AArch32.
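For completeness, a rough intrinsics rendering of that sequence (untested, like the assembly above; the helper name is mine, and lane 2 of the reinterpreted q register corresponds to Dtmp3[0]):

```c
#include <arm_neon.h>
#include <stdint.h>

static inline uint32_t i64x2_bitmask_neon32(uint64x2_t x) {
  uint32x4_t top = vreinterpretq_u32_u64(vshrq_n_u64(x, 63)); /* vshr.u64 Qtmp, Qx, #63 */
  uint32_t lane0 = vgetq_lane_u32(top, 0);   /* vmov.32 Ry, Dtmp2[0]     */
  uint32_t lane1 = vgetq_lane_u32(top, 2);   /* vmov.32 Rtmp, Dtmp3[0]   */
  return lane0 + (lane1 << 1);               /* add Ry, Ry, Rtmp, lsl #1 */
}
```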
Prototyped on arm64 as well
Introduction
This is a proposal to add a new variant of the existing bitmask instruction. The new variant extracts the highest bit of each of the two 64-bit lanes of a SIMD vector into a 32-bit integer. This variant was left out of #201 without any discussion (maybe @zeux knows why), but it would be useful both for orthogonality of the instruction set and for efficiency: x86 supports this operation natively since SSE2, and on ARM it can be emulated more efficiently than the other bitmask variants.
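For clarity, the intended semantics can be written as the following scalar reference (a sketch rather than a normative definition; the function name is mine):

```c
#include <stdint.h>

/* y = i64x2.bitmask(x): bit 0 of the result is the sign bit of lane 0,
   bit 1 is the sign bit of lane 1, and all remaining bits are zero. */
static inline uint32_t i64x2_bitmask_ref(const int64_t lanes[2]) {
  uint32_t mask = 0;
  mask |= (uint32_t)((uint64_t)lanes[0] >> 63);      /* high bit of lane 0 */
  mask |= (uint32_t)((uint64_t)lanes[1] >> 63) << 1; /* high bit of lane 1 */
  return mask;
}
```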
Applications
Mapping to Common Instruction Sets
This section illustrates how the new WebAssembly instruction can be lowered on common instruction sets. However, these patterns are provided only for convenience; compliant WebAssembly implementations do not have to follow the same code generation patterns.
x86/x86-64 processors with AVX instruction set
y = i64x2.bitmask(x) is lowered to VMOVMSKPD reg_y, xmm_x
x86/x86-64 processors with SSE2 instruction set
y = i64x2.bitmask(x) is lowered to MOVMSKPD reg_y, xmm_x
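In intrinsics terms the SSE2 lowering is a single instruction; the sketch below is illustrative (the cast and function name are mine), since an engine emits MOVMSKPD directly:

```c
#include <emmintrin.h> /* SSE2 */
#include <stdint.h>

static inline uint32_t i64x2_bitmask_sse2(__m128i x) {
  /* MOVMSKPD gathers the top bit of each 64-bit lane into bits 1:0. */
  return (uint32_t)_mm_movemask_pd(_mm_castsi128_pd(x));
}
```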
ARM64 processors
y = i64x2.bitmask(x) is lowered to:
SQXTN Vtmp.2S, Vx.2D
USHR Vtmp.2S, Vtmp.2S, 31
USRA Dtmp, Dtmp, 31
FMOV Wy, Stmp
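The same sequence, sketched with ACLE NEON intrinsics (illustrative only; the helper name and reinterpret casts are mine):

```c
#include <arm_neon.h>
#include <stdint.h>

static inline uint32_t i64x2_bitmask_aarch64(int64x2_t x) {
  int32x2_t narrow = vqmovn_s64(x);                                /* SQXTN: narrowing keeps the sign   */
  uint32x2_t bits  = vshr_n_u32(vreinterpret_u32_s32(narrow), 31); /* USHR #31: 0 or 1 in each lane     */
  uint64x1_t acc   = vreinterpret_u64_u32(bits);
  acc = vsra_n_u64(acc, acc, 31);                                  /* USRA #31: folds lane 1 into bit 1 */
  return (uint32_t)vget_lane_u64(acc, 0);                          /* FMOV Wy, Stmp: take low 32 bits   */
}
```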
ARMv7 processors with NEON instruction set
y = i64x2.bitmask(x) is lowered to:
VQMOVN.S64 Dtmp, Qx
VSHR.U32 Dtmp, Dtmp, 31
VSRA.U64 Dtmp, Dtmp, 31
VMOV.32 Ry, Dtmp[0]