Closed — Maratyszcza closed this issue 3 years ago
I don't think this meets the bar for inclusion. The codegen is not great, and half of the use cases are SIMD libraries which expose such instructions (they don't use them themselves).
It is expected that most uses of 64-bit integer operations come through either high-level wrappers or auto-vectorization: there are usually more efficient ways to do computations within narrower data types, but they are ISA-specific (e.g. on ARM NEON we may use saturated 32-bit arithmetic, but that is not portable to x86). Thus it is mainly code that trades some performance for portability (through high-level wrapper libraries or through auto-vectorization) that uses 64-bit arithmetic.
IMO lowering on recent-ish systems isn't bad: 4 instructions on SSE4.2, 3 instructions on ARMv7 NEON, 2 instructions on ARM64 and AVX. Without specialized `i64x2.min_s`/`i64x2.max_s` instructions, but with `i64x2.gt_s`, we'd have the same 2/3 instructions on ARM64/ARMv7+NEON, but 6+ instructions on SSE4.2 and 4 instructions on AVX (because they'd have to use `v128.bitselect` instead of `[V]PBLENDVB`).
Adding a preliminary vote for the inclusion of i64x2 signed min/max operations to the SIMD proposal below. Please vote with:

- 👍 For including i64x2 signed min/max operations
- 👎 Against including i64x2 signed min/max operations
The community group unanimously decided against including these instructions in the 1/29/21 meeting (#429).
Introduction
This is a proposal to add 64-bit variants of the existing `min_s` and `max_s` instructions. Only x86 processors with AVX512 natively support these instructions, but ARMv7 NEON, ARM64, and x86 with SSE4.2 or AVX can efficiently emulate them with 2-4 instructions.

Applications
Mapping to Common Instruction Sets
This section illustrates how the new WebAssembly instructions can be lowered on common instruction sets. However, these patterns are provided only for convenience; compliant WebAssembly implementations do not have to follow the same code generation patterns.
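Before the per-ISA lowerings, it may help to pin down what every lowering below must compute. The following is a minimal scalar sketch (function and helper names are illustrative, not from the proposal text): each 64-bit lane is interpreted as a signed two's-complement integer and the min/max is taken lane-wise.

```python
# Scalar reference for the proposed i64x2.min_s / i64x2.max_s semantics.
# A v128 is modeled as a list of two 64-bit lane patterns; results are
# returned as signed Python integers.

def to_signed64(x):
    """Interpret a 64-bit pattern as a signed two's-complement integer."""
    x &= (1 << 64) - 1
    return x - (1 << 64) if x >= (1 << 63) else x

def i64x2_min_s(a, b):
    return [min(to_signed64(x), to_signed64(y)) for x, y in zip(a, b)]

def i64x2_max_s(a, b):
    return [max(to_signed64(x), to_signed64(y)) for x, y in zip(a, b)]

# The 0x8000...0000 pattern is INT64_MIN, not a large unsigned value:
print(i64x2_min_s([0, 2**63], [5, 1]))  # → [0, -9223372036854775808]
```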
x86/x86-64 processors with AVX512F and AVX512VL instruction sets
`y = i64x2.min_s(a, b)` is lowered to `VPMINSQ xmm_y, xmm_a, xmm_b`

`y = i64x2.max_s(a, b)` is lowered to `VPMAXSQ xmm_y, xmm_a, xmm_b`
x86/x86-64 processors with AVX instruction set
`y = i64x2.min_s(a, b)` (`y` is not `a` and `y` is not `b`) is lowered to:

```
VPCMPGTQ xmm_y, xmm_a, xmm_b
VPBLENDVB xmm_y, xmm_a, xmm_b, xmm_y
```

`y = i64x2.max_s(a, b)` (`y` is not `a` and `y` is not `b`) is lowered to:

```
VPCMPGTQ xmm_y, xmm_a, xmm_b
VPBLENDVB xmm_y, xmm_b, xmm_a, xmm_y
```
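The two-instruction compare-and-blend pattern (the same idea the ARM64 `CMGT`+`BSL` lowering uses) can be modeled lane by lane. This is an illustrative sketch only: a v128 is a list of two 64-bit lane patterns, and the helper names are made up, not any real API.

```python
# Compare+blend lowering, modeled per lane: a signed greater-than compare
# produces an all-ones/all-zeros 64-bit mask, which then selects between
# the two inputs.

MASK64 = (1 << 64) - 1

def signed(x):
    return x - (1 << 64) if x & (1 << 63) else x

def cmpgt_s64(a, b):
    """VPCMPGTQ-style compare: all-ones mask where a > b (signed)."""
    return [MASK64 if signed(x) > signed(y) else 0 for x, y in zip(a, b)]

def blend(on_zero, on_ones, mask):
    """VPBLENDVB with a lane-wide mask: pick on_ones where mask is set."""
    return [((m & o) | (~m & z)) & MASK64
            for z, o, m in zip(on_zero, on_ones, mask)]

def i64x2_min_s(a, b):
    return blend(a, b, cmpgt_s64(a, b))   # a > b ? b : a

def i64x2_max_s(a, b):
    return blend(b, a, cmpgt_s64(a, b))   # a > b ? a : b
```

Note that `VPBLENDVB` blends byte-by-byte on each byte's sign bit, but because the compare mask is all-ones or all-zeros across the whole lane, the byte-granular blend coincides with the lane-granular select modeled here.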
x86/x86-64 processors with SSE4.2 instruction set
`y = i64x2.min_s(a, b)` (`y` is not `b` and `a`/`b`/`y` are not in `xmm0`) is lowered to:

```
MOVDQA xmm0, xmm_a
MOVDQA xmm_y, xmm_a
PCMPGTQ xmm0, xmm_b
PBLENDVB xmm_y, xmm_b
```

`y = i64x2.max_s(a, b)` (`y` is not `a` and `a`/`b`/`y` are not in `xmm0`) is lowered to:

```
MOVDQA xmm0, xmm_a
MOVDQA xmm_y, xmm_b
PCMPGTQ xmm0, xmm_b
PBLENDVB xmm_y, xmm_a
```
x86/x86-64 processors with SSE4.1 instruction set
Based on this answer by user aqrit on Stack Overflow
`y = i64x2.min_s(a, b)` (`y` is not `a` and `y` is not `b` and `a`/`b`/`y` are not in `xmm0`) is lowered to:

```
MOVDQA xmm0, xmm_b
MOVDQA xmm_y, xmm_a
PSUBQ xmm0, xmm_a
PCMPEQD xmm_y, xmm_b
PAND xmm0, xmm_y
MOVDQA xmm_y, xmm_a
PCMPGTD xmm_y, xmm_b
POR xmm0, xmm_y
MOVDQA xmm_y, xmm_a
PSHUFD xmm0, xmm0, 0xF5
PBLENDVB xmm_y, xmm_b
```

`y = i64x2.max_s(a, b)` (`y` is not `a` and `y` is not `b` and `a`/`b`/`y` are not in `xmm0`) is lowered to:

```
MOVDQA xmm0, xmm_b
MOVDQA xmm_y, xmm_a
PSUBQ xmm0, xmm_a
PCMPEQD xmm_y, xmm_b
PAND xmm0, xmm_y
MOVDQA xmm_y, xmm_a
PCMPGTD xmm_y, xmm_b
POR xmm0, xmm_y
MOVDQA xmm_y, xmm_b
PSHUFD xmm0, xmm0, 0xF5
PBLENDVB xmm_y, xmm_a
```
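The non-obvious part of this sequence is how a signed 64-bit `a > b` mask is assembled from 32-bit operations only. Per lane: `PCMPGTD` handles the case where the high halves differ, and when they are equal, the high word of `PSUBQ`'s `b - a` is either all-zeros or all-ones (zero minus the borrow out of the low half), so `(sub & eq) | gt` is always a canonical mask whose high word `PSHUFD 0xF5` broadcasts across the lane. A one-lane sketch (helper names illustrative), cross-checked against a true signed 64-bit compare:

```python
# One-lane model of the PSUBQ / PCMPEQD / PCMPGTD / PSHUFD 0xF5 mask trick.
import itertools

M32, M64 = (1 << 32) - 1, (1 << 64) - 1

def s32(x): return x - (1 << 32) if x & (1 << 31) else x
def s64(x): return x - (1 << 64) if x & (1 << 63) else x

def gt_mask_via_32bit(a, b):
    """Return the 64-bit (a > b) mask built from 32-bit ops only."""
    hi_a, hi_b = a >> 32, b >> 32
    sub_hi = ((b - a) & M64) >> 32                 # high word of PSUBQ b, a
    eq_hi = M32 if hi_a == hi_b else 0             # high word of PCMPEQD a, b
    gt_hi = M32 if s32(hi_a) > s32(hi_b) else 0    # high word of PCMPGTD a, b
    hi = (sub_hi & eq_hi) | gt_hi                  # all-zeros or all-ones
    return (hi << 32) | hi                         # PSHUFD 0xF5: broadcast it

# Cross-check on boundary values, including mixed-sign and equal-high cases:
vals = [0, 1, 2, M64, M64 - 1, 1 << 63, (1 << 63) - 1,
        0x00000001_FFFFFFFF, 0xFFFFFFFF_00000001]
for a, b in itertools.product(vals, repeat=2):
    assert gt_mask_via_32bit(a, b) == (M64 if s64(a) > s64(b) else 0)
```

Because the resulting mask is canonical per lane, it works both for the byte-granular `PBLENDVB` here and for the `PAND`/`PANDN`/`POR` select in the SSE2 variant.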
x86/x86-64 processors with SSE2 instruction set
Based on this answer by user aqrit on Stack Overflow
`y = i64x2.min_s(a, b)` (`y` is not `a` and `y` is not `b`) is lowered to:

```
MOVDQA xmm_y, xmm_b
MOVDQA xmm_tmp, xmm_a
PSUBQ xmm_y, xmm_a
PCMPEQD xmm_tmp, xmm_b
PAND xmm_y, xmm_tmp
MOVDQA xmm_tmp, xmm_a
PCMPGTD xmm_tmp, xmm_b
POR xmm_y, xmm_tmp
MOVDQA xmm_tmp, xmm_b
PSHUFD xmm_y, xmm_y, 0xF5
PAND xmm_tmp, xmm_y
PANDN xmm_y, xmm_a
POR xmm_y, xmm_tmp
```

`y = i64x2.max_s(a, b)` (`y` is not `a` and `y` is not `b`) is lowered to:

```
MOVDQA xmm_y, xmm_b
MOVDQA xmm_tmp, xmm_a
PSUBQ xmm_y, xmm_a
PCMPEQD xmm_tmp, xmm_b
PAND xmm_y, xmm_tmp
MOVDQA xmm_tmp, xmm_a
PCMPGTD xmm_tmp, xmm_b
POR xmm_y, xmm_tmp
MOVDQA xmm_tmp, xmm_a
PSHUFD xmm_y, xmm_y, 0xF5
PAND xmm_tmp, xmm_y
PANDN xmm_y, xmm_b
POR xmm_y, xmm_tmp
```
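The only difference from the SSE4.1 variant is the tail: without `PBLENDVB`, the select is done with the classic and/and-not/or idiom. A one-lane sketch (the helper name is illustrative) of what the final `PAND`/`PANDN`/`POR` triple computes:

```python
# Bit-select on one 64-bit lane, as done by the SSE2 tail:
#   PAND  tmp, mask   -> mask & on_ones
#   PANDN mask, other -> ~mask & on_zero
#   POR               -> combine the two halves
# With a canonical (all-ones/all-zeros) mask this picks one whole input.

M64 = (1 << 64) - 1

def bitselect(on_zero, on_ones, mask):
    return ((mask & on_ones) | (~mask & on_zero)) & M64

print(hex(bitselect(0x1111, 0x2222, M64)))  # mask set   → 0x2222
print(hex(bitselect(0x1111, 0x2222, 0)))    # mask clear → 0x1111
```

This is also exactly the semantics of WebAssembly's `v128.bitselect`, which is why the earlier comment notes that a `gt_s`-plus-`bitselect` fallback would cost two extra instructions here compared to a blend.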
ARM64 processors
`y = i64x2.min_s(a, b)` (`y` is not `a` and `y` is not `b`) is lowered to:

```
CMGT Vy.2D, Va.2D, Vb.2D
BSL Vy.16B, Vb.16B, Va.16B
```

`y = i64x2.max_s(a, b)` (`y` is not `a` and `y` is not `b`) is lowered to:

```
CMGT Vy.2D, Va.2D, Vb.2D
BSL Vy.16B, Va.16B, Vb.16B
```
ARMv7 processors with NEON instruction set
Based on this answer by user aqrit on Stack Overflow
`y = i64x2.min_s(a, b)` (`y` is not `a` and `y` is not `b`) is lowered to:

```
VQSUB.S64 Qy, Qb, Qa
VSHR.S64 Qy, Qy, #63
VBSL Qy, Qb, Qa
```

`y = i64x2.max_s(a, b)` (`y` is not `a` and `y` is not `b`) is lowered to:

```
VQSUB.S64 Qy, Qb, Qa
VSHR.S64 Qy, Qy, #63
VBSL Qy, Qa, Qb
```
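The NEON trick here: a plain `b - a` can overflow and flip sign (e.g. `INT64_MIN - INT64_MAX` wraps to `+1`), but the saturating subtract `VQSUB.S64` clamps instead, preserving the sign of the true difference. The arithmetic shift by 63 then smears that sign bit into a `b < a` select mask for `VBSL`. A one-lane sketch with illustrative helper names:

```python
# One-lane model of the ARMv7 NEON min/max lowering.

INT64_MIN, INT64_MAX = -(1 << 63), (1 << 63) - 1

def vqsub_s64(x, y):
    """VQSUB.S64: signed saturating subtract on one lane."""
    return max(INT64_MIN, min(INT64_MAX, x - y))

def min_s64_neon(a, b):                   # a, b: signed ints in int64 range
    mask = vqsub_s64(b, a) >> 63          # VSHR.S64 #63: -1 if b < a, else 0
    return b if mask else a               # VBSL Qy, Qb, Qa

def max_s64_neon(a, b):
    mask = vqsub_s64(b, a) >> 63          # same mask, operands swapped below
    return a if mask else b               # VBSL Qy, Qa, Qb

# Overflow case a plain subtract would get wrong:
print(min_s64_neon(INT64_MAX, INT64_MIN))  # → -9223372036854775808
```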