What are the instructions being proposed?

I propose a relaxed version of the Saturating Rounding Q-format Multiplication i16x8.q15mulr_sat_s introduced in WebAssembly/simd#365. I suggest i16x8.q15mulr_s as the tentative name for the relaxed instruction.

What are the semantics of these instructions?

i16x8.q15mulr_sat_s implements multiplication of fixed-point numbers in Q15 format (see WebAssembly/simd#365 for details). The multiplication overflows if and only if both inputs are INT16_MIN, and the x86 SSSE3 and ARM NEON instructions differ in how they handle this case: the x86 version wraps around, while the ARM version saturates. The WebAssembly SIMD instruction i16x8.q15mulr_sat_s standardized on the ARM overflow semantics, which requires additional overflow checks on x86. However, because the case of both inputs being INT16_MIN is rare, and can often be guaranteed never to happen by the higher-level structure of an algorithm, a relaxed version that allows either overflow behavior would improve performance on x86.

The proposed i16x8.q15mulr_s Relaxed SIMD instruction computes the lane-wise rounded multiplication of Q15 numbers, and allows for either saturation or wrap-around behavior in the overflow case (where both inputs are INT16_MIN).
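As an illustration of these semantics, here is a minimal scalar sketch of one lane in C. The function names are hypothetical and not part of any proposal text. It assumes the usual Q15 rounding formula (a * b + 2^14) >> 15 and models the two permitted overflow results side by side.

```c
#include <stdint.h>

/* Rounded Q15 product kept in 32 bits: (a * b + 2^14) >> 15.
 * Assumption: >> on a negative int32_t is an arithmetic shift. This is
 * implementation-defined in C, but holds for mainstream compilers such as
 * GCC and Clang. */
static int32_t q15mulr_wide(int16_t a, int16_t b) {
    return ((int32_t)a * (int32_t)b + 0x4000) >> 15;
}

/* Saturating behavior (i16x8.q15mulr_sat_s, ARM) for one lane. */
int16_t q15mulr_sat_lane(int16_t a, int16_t b) {
    int32_t r = q15mulr_wide(a, b);
    return (int16_t)(r > INT16_MAX ? INT16_MAX : r);
}

/* Wrap-around behavior (x86 PMULHRSW) for one lane. */
int16_t q15mulr_wrap_lane(int16_t a, int16_t b) {
    int32_t r = q15mulr_wide(a, b);
    /* r leaves the int16_t range only as +32768, which wraps to INT16_MIN. */
    return (int16_t)(r > INT16_MAX ? INT16_MIN : r);
}

/* Returns 1 if `result` is a permitted relaxed result for inputs a and b. */
int q15mulr_relaxed_allows(int16_t a, int16_t b, int16_t result) {
    return result == q15mulr_sat_lane(a, b) ||
           result == q15mulr_wrap_lane(a, b);
}
```

For every input pair other than (INT16_MIN, INT16_MIN), the two models return the same value, so the relaxed instruction is fully deterministic outside that single case.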

How will these instructions be implemented?

x86/x86-64 processors with AVX instruction set

y = i16x8.q15mulr_s(a, b) is lowered to VPMULHRSW xmm_y, xmm_a, xmm_b

x86/x86-64 processors with SSSE3 instruction set

y = i16x8.q15mulr_s(a, b) is lowered to MOVDQA xmm_y, xmm_a + PMULHRSW xmm_y, xmm_b

x86/x86-64 processors with SSE2 instruction set

y = i16x8.q15mulr_s(a, b) (y is NOT a and y is NOT b) is lowered to
- MOVDQA xmm_y, xmm_a
- MOVDQA xmm_tmp, xmm_a
- PMULLW xmm_y, xmm_b
- PMULHW xmm_tmp, xmm_b
- PSRLW xmm_y, 14
- PADDW xmm_tmp, xmm_tmp
- PAVGW xmm_y, wasm_i16x8_splat(0)
- PADDW xmm_y, xmm_tmp
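
To make the arithmetic behind the SSE2 sequence easier to follow, the sketch below emulates it for a single 16-bit lane in portable C. It is not the actual lowering, and the helper names are hypothetical. Each step is annotated with the instruction it stands in for.

```c
#include <stdint.h>

/* Reinterpret a 16-bit pattern as a signed value without relying on
 * implementation-defined narrowing conversions. */
static int16_t u16_to_i16(uint16_t u) {
    return (u < 0x8000u) ? (int16_t)u : (int16_t)((int32_t)u - 65536);
}

/* One lane of the SSE2 lowering, step by step. */
int16_t sse2_q15mulr_lane(int16_t a, int16_t b) {
    /* Full 32-bit signed product, viewed as raw bits. */
    uint32_t p = (uint32_t)((int32_t)a * (int32_t)b);

    uint16_t y   = (uint16_t)(p & 0xFFFFu); /* PMULLW: low 16 bits        */
    uint16_t tmp = (uint16_t)(p >> 16);     /* PMULHW: high 16 bits       */

    y   = (uint16_t)(y >> 14);              /* PSRLW y, 14: product bits 15..14 */
    tmp = (uint16_t)(tmp + tmp);            /* PADDW tmp, tmp: high << 1, wraps */
    y   = (uint16_t)((y + 0u + 1u) >> 1);   /* PAVGW y, 0: (y + 1) >> 1   */
    y   = (uint16_t)(y + tmp);              /* PADDW y, tmp: wraps mod 2^16 */

    return u16_to_i16(y);
}
```

Because the final additions wrap modulo 2^16, this sequence yields INT16_MIN when both inputs are INT16_MIN, matching the PMULHRSW behavior rather than the saturating one.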

ARM64 processors

y = i16x8.q15mulr_s(a, b) is lowered to SQRDMULH Vy.8H, Va.8H, Vb.8H

ARMv7 processors with NEON instruction set

y = i16x8.q15mulr_s(a, b) is lowered to VQRDMULH.S16 Qy, Qa, Qb

Reference lowering through the WebAssembly SIMD128 instruction set

y = i16x8.q15mulr_s(a, b) is lowered as y = i16x8.q15mulr_sat_s(a, b)

How does behavior differ across processors? What new fingerprinting surfaces will be exposed?

When both inputs are INT16_MIN, x86/x86-64 will produce INT16_MIN, while ARM/ARM64 will produce INT16_MAX. x86/x86-64 can already be distinguished from ARM/ARM64 by NaN behavior, so this instruction does not add a new fingerprinting surface.
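
The observable difference can be sketched as a one-lane probe in C. All names here are hypothetical, and the two `probe_*_like` functions are scalar stand-ins for the hardware behaviors described above, not calls into any real API.

```c
#include <stdint.h>

typedef int16_t (*q15mulr_lane_fn)(int16_t, int16_t);

typedef enum { Q15_SATURATES, Q15_WRAPS, Q15_UNEXPECTED } q15_overflow_kind;

/* Classify an implementation by its result for the only overflowing input. */
q15_overflow_kind classify_q15_overflow(q15mulr_lane_fn mul) {
    int16_t r = mul(INT16_MIN, INT16_MIN);
    if (r == INT16_MAX) return Q15_SATURATES; /* ARM/ARM64-like  */
    if (r == INT16_MIN) return Q15_WRAPS;     /* x86/x86-64-like */
    return Q15_UNEXPECTED;
}

/* Shared rounding step. Assumption: >> on a negative int32_t is an
 * arithmetic shift, as on GCC and Clang. */
static int32_t probe_round_q15(int16_t a, int16_t b) {
    return ((int32_t)a * (int32_t)b + 0x4000) >> 15;
}

/* Stand-in for the saturating behavior. */
int16_t probe_arm_like(int16_t a, int16_t b) {
    int32_t r = probe_round_q15(a, b);
    return (int16_t)(r > INT16_MAX ? INT16_MAX : r);
}

/* Stand-in for the wrap-around behavior. */
int16_t probe_x86_like(int16_t a, int16_t b) {
    int32_t r = probe_round_q15(a, b);
    return (int16_t)(r > INT16_MAX ? INT16_MIN : r);
}
```

A single evaluation with both inputs equal to INT16_MIN is enough to tell the two behaviors apart, and no other input pair reveals anything.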

Relaxed Rounding Q-format Multiplication #40