Maratyszcza commented 4 years ago

Introduction

This is proposal to add new variant of existing bitmask instruction. The new variant extracts the highest bit of the two 64-bit lanes in a SIMD vector into an 32-bit integer. This variant was left out of #201 without any discussion (maybe @zeux knows why), but would be useful both for orthogonality of the instruction set and for efficiency: x86 natively supports this instruction since SSE2, and on ARM is can be emulated more efficiently than other bitmask variants.

Applications

Mapping to Common Instruction Sets

This section illustrates how the new WebAssembly instructions can be lowered on common instruction sets. However, these patterns are provided only for convenience, compliant WebAssembly implementations do not have to follow the same code generation patterns.

x86/x86-64 processors with AVX instruction set

i64x2.bitmask
- y = i64x2.bitmask(x) is lowered to VMOVMSKPD reg_y, xmm_x

x86/x86-64 processors with SSE2 instruction set

i64x2.bitmask
- y = i64x2.bitmask(x) is lowered to MOVMSKPD reg_y, xmm_x

ARM64 processors

i64x2.bitmask
- y = i64x2.bitmask(x) is lowered to:
- SQXTN Vtmp.2S, Vx.2D
- USHR Vtmp.2S, Vtmp.2S, 31
- USRA Dtmp, Dtmp, 31
- FMOV Wy, Stmp

ARMv7 processors with NEON instruction set

i64x2.bitmask
- y = i64x2.bitmask(x) is lowered to:
- VQMOVN.S64 Dtmp, Qx
- VSHR.U32 Dtmp, Dtmp, 31
- VSRA.U64 Dtmp, Dtmp, 31
- VMOV.32 Ry, Dtmp[0]

zeux commented 4 years ago

I think this might be an accidental omission - I didn't actually know that SSE2 supports this natively, and I didn't have use cases for 64-bit masks. So it didn't occur to me to propose that because I thought SSE2 lowering would have to scalarize.

tlively commented 4 years ago

This is implemented in LLVM (but not Binaryen) as __builtin_wasm_bitmask_i64x2 and will be available in Emscripten in a few hours. I don't know if we need benchmarking for this instruction or not, though.

akirilov-arm commented 4 years ago

I'd like to propose an alternative Arm64 mapping:

ushr Vtmp.2d, Vx.2d, #63
mov Xy, Vtmp.d[0]
mov Xtmp, Vtmp.d[1]
add Wy, Wy, Wtmp, lsl #1

The main advantage of this sequence is that the middle 2 instructions are independent, so they can execute in parallel. In fact, the essentially scalarized version might execute even faster (and require only 1 temporary register), but is 1 instruction longer:

mov Xy, Vx.d[0]
mov Xtmp, Vx.d[1]
lsr Xy, Xy, #63
lsr Xtmp, Xtmp, #63
add Wy, Wy, Wtmp, lsl #1

ngzhian commented 4 years ago

Forgot to say that this is prototyped in v8 (x64) https://chromium.googlesource.com/v8/v8/+/ceee7cfe7260152fd90c66657b8476b9d3a8b915

ngzhian commented 3 years ago

@akirilov-arm would you suggest a similar mapping for ARMv7 as well?

akirilov-arm commented 3 years ago

@ngzhian Something like this (tmp2 = tmp * 2, tmp3 = tmp * 2 + 1):

vshrq.u64 Qtmp, Qx, #63
vmov.32 Ry, Dtmp2[0]
vmov.32 Rtmp, Dtmp3[0]
add Ry, Ry, Rtmp, lsl #1

Note that I haven't tested the sequence and I am also not sure about its performance characteristics - extra latency may crop up due to the SIMD & FP register overlapping rules in AArch32.