What are the instructions being proposed?
I propose a relaxed version of the Saturating Rounding Q-format Multiplication i16x8.q15mulr_sat_s introduced in WebAssembly/simd#365. I suggest i16x8.q15mulr_s as the tentative name for the relaxed instruction.
What are the semantics of these instructions?
i16x8.q15mulr_sat_s implements the mathematical operation of multiplication of fixed-point numbers in Q15 format (see WebAssembly/simd#365 for details). The multiplication overflows if and only if both inputs are INT16_MIN, and the x86 SSSE3 and ARM NEON instructions differ in how they handle this situation: the x86 version wraps around while the ARM version saturates. The WebAssembly SIMD instruction i16x8.q15mulr_sat_s standardized on the ARM overflow semantics, which requires additional overflow checks on x86. However, since the case of both inputs being INT16_MIN is rare, and can often be ruled out by the higher-level structure of an algorithm, a relaxed version that allows either overflow behavior would help performance on x86.
The proposed i16x8.q15mulr_s Relaxed SIMD instruction computes the lane-wise rounded multiplication of Q15 numbers, and allows for either saturation or wrap-around behavior in the overflow case (where both inputs are INT16_MIN).
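To make the two permitted overflow behaviors concrete, here is a minimal scalar sketch of a single lane in C. It is an illustration only, not normative text: the helper names (q15mulr_exact, q15mulr_sat, q15mulr_wrap, q15mulr_relaxed_allows) are hypothetical, and the code assumes that right-shifting a negative int32_t is an arithmetic shift, as it is on mainstream compilers.

```c
#include <assert.h>
#include <stdint.h>

/* Rounded Q15 product before narrowing: (a * b + 2^14) >> 15.
   Assumption: >> on a negative int32_t is an arithmetic shift. */
static int32_t q15mulr_exact(int16_t a, int16_t b) {
    return ((int32_t)a * (int32_t)b + 0x4000) >> 15;
}

/* Reinterpret 16 bits as a signed value without relying on
   implementation-defined narrowing conversions. */
static int16_t u16_to_i16(uint16_t v) {
    return (int16_t)(v >= 0x8000u ? (int32_t)v - 65536 : (int32_t)v);
}

/* Saturating variant (ARM behavior, i16x8.q15mulr_sat_s). Only the upper
   bound can be exceeded: the exact result never drops below -32767. */
static int16_t q15mulr_sat(int16_t a, int16_t b) {
    int32_t r = q15mulr_exact(a, b);
    return (int16_t)(r > INT16_MAX ? INT16_MAX : r);
}

/* Wrapping variant (x86 PMULHRSW behavior): keep the low 16 bits. */
static int16_t q15mulr_wrap(int16_t a, int16_t b) {
    return u16_to_i16((uint16_t)(uint32_t)q15mulr_exact(a, b));
}

/* A result is acceptable for the relaxed instruction if it matches
   either variant; the two differ only for a == b == INT16_MIN. */
static int q15mulr_relaxed_allows(int16_t a, int16_t b, int16_t result) {
    return result == q15mulr_sat(a, b) || result == q15mulr_wrap(a, b);
}

/* Self-check of the single overflow case. */
static void q15mulr_check_overflow_case(void) {
    assert(q15mulr_sat(INT16_MIN, INT16_MIN) == INT16_MAX);
    assert(q15mulr_wrap(INT16_MIN, INT16_MIN) == INT16_MIN);
}
```

For every input pair other than (INT16_MIN, INT16_MIN) the two variants return the same value, so code that can rule out that pair observes no difference between them.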
How will these instructions be implemented?
x86/x86-64 processors with AVX instruction set
y = i16x8.q15mulr_s(a, b) is lowered to VPMULHRSW xmm_y, xmm_a, xmm_b
x86/x86-64 processors with SSSE3 instruction set
y = i16x8.q15mulr_s(a, b) is lowered to MOVDQA xmm_y, xmm_a + PMULHRSW xmm_y, xmm_b
x86/x86-64 processors with SSE2 instruction set
y = i16x8.q15mulr_s(a, b) (y is NOT a and y is NOT b) is lowered to
MOVDQA xmm_y, xmm_a
MOVDQA xmm_tmp, xmm_a
PMULLW xmm_y, xmm_b
PMULHW xmm_tmp, xmm_b
PSRLW xmm_y, 14
PADDW xmm_tmp, xmm_tmp
PAVGW xmm_y, wasm_i16x8_splat(0)
PADDW xmm_y, xmm_tmp
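As a sanity check of the SSE2 sequence above, the following C sketch emulates it for a single 16-bit lane and compares it against the wrap-around definition (a * b + 2^14) >> 15 truncated to 16 bits. The function names are hypothetical and the emulation is my own reading of the instruction semantics (PSRLW as a logical shift, PAVGW as (x + y + 1) >> 1 on unsigned lanes), so treat it as a sketch rather than a proof.

```c
#include <assert.h>
#include <stdint.h>

/* Reinterpret 16 bits as a signed lane value. */
static int16_t lane_to_i16(uint16_t v) {
    return (int16_t)(v >= 0x8000u ? (int32_t)v - 65536 : (int32_t)v);
}

/* Wrap-around reference: low 16 bits of (a * b + 2^14) >> 15,
   i.e. bits 15..30 of the 32-bit sum. */
static int16_t q15mulr_wrap_ref(int16_t a, int16_t b) {
    uint32_t sum = (uint32_t)((int32_t)a * (int32_t)b + 0x4000);
    return lane_to_i16((uint16_t)((sum >> 15) & 0xFFFFu));
}

/* One lane of the SSE2 lowering, step by step. */
static int16_t q15mulr_sse2_lane(int16_t a, int16_t b) {
    uint32_t prod = (uint32_t)((int32_t)a * (int32_t)b);
    uint16_t y   = (uint16_t)(prod & 0xFFFFu);   /* PMULLW y, b   (low half)  */
    uint16_t tmp = (uint16_t)(prod >> 16);       /* PMULHW tmp, b (high half) */
    y   = (uint16_t)(y >> 14);                   /* PSRLW y, 14               */
    tmp = (uint16_t)(tmp + tmp);                 /* PADDW tmp, tmp (wraps)    */
    y   = (uint16_t)((y + 0 + 1) >> 1);          /* PAVGW y, zero             */
    y   = (uint16_t)(y + tmp);                   /* PADDW y, tmp (wraps)      */
    return lane_to_i16(y);
}

/* Count mismatches over a strided sweep of all input pairs.
   step = 1 is exhaustive (2^32 pairs); larger steps run faster. */
static long q15mulr_sse2_count_mismatches(int step) {
    long mismatches = 0;
    assert(step > 0);
    for (int32_t a = INT16_MIN; a <= INT16_MAX; a += step) {
        for (int32_t b = INT16_MIN; b <= INT16_MAX; b += step) {
            if (q15mulr_sse2_lane((int16_t)a, (int16_t)b) !=
                q15mulr_wrap_ref((int16_t)a, (int16_t)b)) {
                mismatches++;
            }
        }
    }
    return mismatches;
}
```

The PSRLW/PAVGW pair computes the rounding carry from the low half of the product, ((lo >> 14) + 1) >> 1, which is then added to twice the high half. Both inputs equal to INT16_MIN yield INT16_MIN, matching the wrap-around behavior of PMULHRSW.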
ARM64 processors
y = i16x8.q15mulr_s(a, b) is lowered to SQRDMULH Vy.8H, Va.8H, Vb.8H
ARMv7 processors with NEON instruction set
y = i16x8.q15mulr_s(a, b) is lowered to VQRDMULH.S16 Qy, Qa, Qb
Reference lowering through the WAsm SIMD128 instruction set
y = i16x8.q15mulr_s(a, b) is lowered as y = i16x8.q15mulr_sat_s(a, b)
How does behavior differ across processors? What new fingerprinting surfaces will be exposed?
When both inputs are INT16_MIN, x86/x86-64 will produce INT16_MIN, while ARM/ARM64 will produce INT16_MAX. x86/x86-64 can already be distinguished from ARM/ARM64 based on NaN behavior, so this instruction doesn't add any new fingerprinting surfaces.
What use cases are there?