Adding a preliminary vote for the inclusion of i64x2 unsigned comparison operations to the SIMD proposal below. Please vote with:

👍 For including i64x2 unsigned comparison operations
👎 Against including i64x2 unsigned comparison operations
Closing as per #436.
Introduction
This is a proposal to add 64-bit variants of the existing `gt_u`, `lt_u`, `ge_u`, and `le_u` instructions. These instructions are natively supported on ARM64 and on x86-64 with the XOP extension, but on other instruction sets they need to be emulated. On SSE4.2 the emulation costs 5-6 instructions, but on older SSE extensions and on ARMv7 NEON the emulation cost is more significant.

Applications
Mapping to Common Instruction Sets
This section illustrates how the new WebAssembly instructions can be lowered on common instruction sets. However, these patterns are provided only for convenience; compliant WebAssembly implementations do not have to follow the same code generation patterns.
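For reference, here is a minimal scalar sketch of the intended per-lane semantics (my own illustration, not text from the proposal): each comparison treats the two 64-bit lanes as unsigned integers and produces an all-ones lane for true and an all-zeros lane for false.

```c
#include <stdint.h>

/* Hypothetical scalar reference for i64x2.gt_u and i64x2.ge_u:
   each of the two 64-bit lanes is compared as an unsigned integer,
   and the result lane is all-ones if the comparison holds,
   all-zeros otherwise. lt_u and le_u follow the same pattern. */
typedef struct { uint64_t lane[2]; } v128_i64x2;

static v128_i64x2 i64x2_gt_u(v128_i64x2 a, v128_i64x2 b) {
    v128_i64x2 y;
    for (int i = 0; i < 2; i++)
        y.lane[i] = (a.lane[i] > b.lane[i]) ? UINT64_MAX : 0;
    return y;
}

static v128_i64x2 i64x2_ge_u(v128_i64x2 a, v128_i64x2 b) {
    v128_i64x2 y;
    for (int i = 0; i < 2; i++)
        y.lane[i] = (a.lane[i] >= b.lane[i]) ? UINT64_MAX : 0;
    return y;
}
```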
x86/x86-64 processors with AVX512F, AVX512DQ, and AVX512VL instruction sets
`y = i64x2.gt_u(a, b)` is lowered to:

```
VPCMPUQ k_tmp, xmm_a, xmm_b, 6
VPMOVM2Q xmm_y, k_tmp
```

`y = i64x2.lt_u(a, b)` is lowered to:

```
VPCMPUQ k_tmp, xmm_a, xmm_b, 1
VPMOVM2Q xmm_y, k_tmp
```

`y = i64x2.ge_u(a, b)` is lowered to:

```
VPCMPUQ k_tmp, xmm_a, xmm_b, 5
VPMOVM2Q xmm_y, k_tmp
```

`y = i64x2.le_u(a, b)` is lowered to:

```
VPCMPUQ k_tmp, xmm_a, xmm_b, 2
VPMOVM2Q xmm_y, k_tmp
```
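In intrinsics terms, this lowering corresponds roughly to the sketch below (my own illustration, assuming a C compiler with AVX512F, AVX512DQ, and AVX512VL support; the function name is made up):

```c
#include <immintrin.h>

/* Sketch of i64x2.gt_u on AVX512VL+DQ: the unsigned compare produces
   a mask register, which is then expanded back into a vector of
   all-ones / all-zeros 64-bit lanes. */
static __m128i i64x2_gt_u_avx512(__m128i a, __m128i b) {
    __mmask8 m = _mm_cmpgt_epu64_mask(a, b);  /* VPCMPUQ, predicate 6 */
    return _mm_movm_epi64(m);                 /* VPMOVM2Q */
}
```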
x86/x86-64 processors with XOP instruction set
`y = i64x2.gt_u(a, b)` is lowered to:

```
VPCOMGTUQ xmm_y, xmm_a, xmm_b
```

`y = i64x2.lt_u(a, b)` is lowered to:

```
VPCOMLTUQ xmm_y, xmm_a, xmm_b
```

`y = i64x2.ge_u(a, b)` is lowered to:

```
VPCOMGEUQ xmm_y, xmm_a, xmm_b
```

`y = i64x2.le_u(a, b)` is lowered to:

```
VPCOMLEUQ xmm_y, xmm_a, xmm_b
```
x86/x86-64 processors with AVX instruction set
`y = i64x2.gt_u(a, b)` (`y` is not `b`) is lowered to:

```
VMOVDQA xmm_tmp, [wasm_i64x2_splat(0x8000000000000000)]
VPXOR xmm_y, xmm_a, xmm_tmp
VPXOR xmm_tmp, xmm_b, xmm_tmp
VPCMPGTQ xmm_y, xmm_y, xmm_tmp
```

`y = i64x2.lt_u(a, b)` (`y` is not `b`) is lowered to:

```
VMOVDQA xmm_tmp, [wasm_i64x2_splat(0x8000000000000000)]
VPXOR xmm_y, xmm_a, xmm_tmp
VPXOR xmm_tmp, xmm_b, xmm_tmp
VPCMPGTQ xmm_y, xmm_tmp, xmm_y
```

`y = i64x2.ge_u(a, b)` (`y` is not `b`) is lowered to:

```
VMOVDQA xmm_tmp, [wasm_i64x2_splat(0x8000000000000000)]
VPXOR xmm_y, xmm_a, xmm_tmp
VPXOR xmm_tmp, xmm_b, xmm_tmp
VPCMPGTQ xmm_y, xmm_tmp, xmm_y
VPXOR xmm_y, xmm_y, [wasm_i32x4_splat(0xFFFFFFFF)]
```

`y = i64x2.le_u(a, b)` (`y` is not `b`) is lowered to:

```
VMOVDQA xmm_tmp, [wasm_i64x2_splat(0x8000000000000000)]
VPXOR xmm_y, xmm_a, xmm_tmp
VPXOR xmm_tmp, xmm_b, xmm_tmp
VPCMPGTQ xmm_y, xmm_y, xmm_tmp
VPXOR xmm_y, xmm_y, [wasm_i32x4_splat(0xFFFFFFFF)]
```
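These sequences rely on x86 having only a signed 64-bit compare (VPCMPGTQ): XORing both operands with 0x8000000000000000 flips the sign bit, which maps unsigned order onto signed order. A rough C intrinsics sketch of the gt_u case (my own illustration, not part of the proposal; the function name is hypothetical):

```c
#include <immintrin.h>
#include <stdint.h>

/* Sketch of i64x2.gt_u via the sign-bias trick: after XORing both
   operands with the 64-bit sign bit, a signed greater-than compare
   gives the unsigned result. Compiled with AVX enabled this becomes
   VPXOR + VPCMPGTQ. */
static __m128i i64x2_gt_u_avx(__m128i a, __m128i b) {
    const __m128i bias = _mm_set1_epi64x(INT64_MIN);  /* 0x8000000000000000 per lane */
    __m128i a_biased = _mm_xor_si128(a, bias);
    __m128i b_biased = _mm_xor_si128(b, bias);
    return _mm_cmpgt_epi64(a_biased, b_biased);       /* VPCMPGTQ */
}
```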
x86/x86-64 processors with SSE4.2 instruction set
`y = i64x2.gt_u(a, b)` (`y` is not `a` and `y` is not `b`) is lowered to:

```
MOVDQA xmm_y, [wasm_i64x2_splat(0x8000000000000000)]
MOVDQA xmm_tmp, xmm_y
PXOR xmm_y, xmm_a
PXOR xmm_tmp, xmm_b
PCMPGTQ xmm_y, xmm_tmp
```

`y = i64x2.lt_u(a, b)` (`y` is not `a` and `y` is not `b`) is lowered to:

```
MOVDQA xmm_y, [wasm_i64x2_splat(0x8000000000000000)]
MOVDQA xmm_tmp, xmm_y
PXOR xmm_y, xmm_b
PXOR xmm_tmp, xmm_a
PCMPGTQ xmm_y, xmm_tmp
```

`y = i64x2.ge_u(a, b)` (`y` is not `a` and `y` is not `b`) is lowered to:

```
MOVDQA xmm_y, [wasm_i64x2_splat(0x8000000000000000)]
MOVDQA xmm_tmp, xmm_y
PXOR xmm_y, xmm_b
PXOR xmm_tmp, xmm_a
PCMPGTQ xmm_y, xmm_tmp
PXOR xmm_y, [wasm_i32x4_splat(0xFFFFFFFF)]
```

`y = i64x2.le_u(a, b)` (`y` is not `a` and `y` is not `b`) is lowered to:

```
MOVDQA xmm_y, [wasm_i64x2_splat(0x8000000000000000)]
MOVDQA xmm_tmp, xmm_y
PXOR xmm_y, xmm_a
PXOR xmm_tmp, xmm_b
PCMPGTQ xmm_y, xmm_tmp
PXOR xmm_y, [wasm_i32x4_splat(0xFFFFFFFF)]
```
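The same sign-bias trick is used here, with ge_u and le_u additionally complementing the result of the corresponding strict compare. A rough intrinsics sketch of the ge_u case under those assumptions (my own illustration; the function name is hypothetical):

```c
#include <immintrin.h>
#include <stdint.h>

/* Sketch of i64x2.ge_u on SSE4.2: a >= b is computed as NOT(a < b),
   where a < b is the sign-biased signed compare with the operands
   swapped. The final XOR with all-ones is the NOT. */
static __m128i i64x2_ge_u_sse42(__m128i a, __m128i b) {
    const __m128i bias = _mm_set1_epi64x(INT64_MIN);       /* 0x8000000000000000 */
    __m128i lt = _mm_cmpgt_epi64(_mm_xor_si128(b, bias),
                                 _mm_xor_si128(a, bias));  /* mask of a < b */
    return _mm_xor_si128(lt, _mm_set1_epi32(-1));          /* NOT(a < b) = a >= b */
}
```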
x86/x86-64 processors with SSE2 instruction set
Based on this answer by user aqrit on Stack Overflow
`y = i64x2.gt_u(a, b)` (`y` is not `a` and `y` is not `b`) is lowered to:

```
MOVDQA xmm_tmp, xmm_b
MOVDQA xmm_y, xmm_b
PSUBQ xmm_tmp, xmm_a
PXOR xmm_y, xmm_a
PANDN xmm_y, xmm_tmp
MOVDQA xmm_tmp, xmm_b
PANDN xmm_tmp, xmm_a
POR xmm_y, xmm_tmp
PSRAD xmm_y, 31
PSHUFD xmm_y, xmm_y, 0xF5
```

`y = i64x2.lt_u(a, b)` (`y` is not `a` and `y` is not `b`) is lowered to:

```
MOVDQA xmm_tmp, xmm_a
MOVDQA xmm_y, xmm_a
PSUBQ xmm_tmp, xmm_b
PXOR xmm_y, xmm_b
PANDN xmm_y, xmm_tmp
MOVDQA xmm_tmp, xmm_a
PANDN xmm_tmp, xmm_b
POR xmm_y, xmm_tmp
PSRAD xmm_y, 31
PSHUFD xmm_y, xmm_y, 0xF5
```

`y = i64x2.ge_u(a, b)` (`y` is not `a` and `y` is not `b`) is lowered to:

```
MOVDQA xmm_tmp, xmm_a
MOVDQA xmm_y, xmm_a
PSUBQ xmm_tmp, xmm_b
PXOR xmm_y, xmm_b
PANDN xmm_y, xmm_tmp
MOVDQA xmm_tmp, xmm_a
PANDN xmm_tmp, xmm_b
POR xmm_y, xmm_tmp
PSRAD xmm_y, 31
PSHUFD xmm_y, xmm_y, 0xF5
PXOR xmm_y, [wasm_i32x4_splat(0xFFFFFFFF)]
```

`y = i64x2.le_u(a, b)` (`y` is not `a` and `y` is not `b`) is lowered to:

```
MOVDQA xmm_tmp, xmm_b
MOVDQA xmm_y, xmm_b
PSUBQ xmm_tmp, xmm_a
PXOR xmm_y, xmm_a
PANDN xmm_y, xmm_tmp
MOVDQA xmm_tmp, xmm_b
PANDN xmm_tmp, xmm_a
POR xmm_y, xmm_tmp
PSRAD xmm_y, 31
PSHUFD xmm_y, xmm_y, 0xF5
PXOR xmm_y, [wasm_i32x4_splat(0xFFFFFFFF)]
```
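Per 64-bit lane, the gt_u sequence evaluates the borrow-out expression (~(a ^ b) & (b - a)) | (~b & a), whose high bit is set exactly when a > b as unsigned integers; PSRAD and PSHUFD then smear that bit across the whole lane. A scalar sketch of the same formula, for illustration only:

```c
#include <stdint.h>

/* Scalar sketch of the SSE2 trick: the high bit of
   (~(a ^ b) & (b - a)) | (~b & a) is the borrow out of b - a,
   i.e. it is set exactly when a > b (unsigned). The vector code
   smears this bit over the 64-bit lane with PSRAD (per-32-bit
   arithmetic shift) followed by PSHUFD (copy the high half
   over the low half). */
static uint64_t scalar_gt_u_mask(uint64_t a, uint64_t b) {
    uint64_t v = (~(a ^ b) & (b - a)) | (~b & a);
    return (v >> 63) ? UINT64_MAX : 0;  /* all-ones if a > b, else 0 */
}
```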
ARM64 processors
`y = i64x2.gt_u(a, b)` is lowered to:

```
CMHI Vy.2D, Va.2D, Vb.2D
```

`y = i64x2.lt_u(a, b)` is lowered to:

```
CMHI Vy.2D, Vb.2D, Va.2D
```

`y = i64x2.ge_u(a, b)` is lowered to:

```
CMHS Vy.2D, Va.2D, Vb.2D
```

`y = i64x2.le_u(a, b)` is lowered to:

```
CMHS Vy.2D, Vb.2D, Va.2D
```
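For illustration, the corresponding arm_neon.h intrinsics (my own sketch, assuming an AArch64 target) are expected to map to single CMHI/CMHS instructions:

```c
#include <arm_neon.h>

/* Sketch of the AArch64 mapping: unsigned 64-bit lane compares are
   native, so each WebAssembly instruction is a single NEON compare. */
static uint64x2_t i64x2_gt_u_arm64(uint64x2_t a, uint64x2_t b) {
    return vcgtq_u64(a, b);   /* CMHI Vy.2D, Va.2D, Vb.2D */
}

static uint64x2_t i64x2_ge_u_arm64(uint64x2_t a, uint64x2_t b) {
    return vcgeq_u64(a, b);   /* CMHS Vy.2D, Va.2D, Vb.2D */
}
```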
ARMv7 processors with NEON instruction set
`y = i64x2.gt_u(a, b)` is lowered to:

```
VQSUB.U64 Qy, Qa, Qb
VCGT.U32 Qy, Qy, 0
VREV64.32 Qtmp, Qy
VORR Qy, Qy, Qtmp
```

`y = i64x2.lt_u(a, b)` is lowered to:

```
VQSUB.U64 Qy, Qb, Qa
VCGT.U32 Qy, Qy, 0
VREV64.32 Qtmp, Qy
VORR Qy, Qy, Qtmp
```

`y = i64x2.ge_u(a, b)` is lowered to:

```
VQSUB.U64 Qy, Qb, Qa
VCEQ.I32 Qy, Qy, 0
VREV64.32 Qtmp, Qy
VAND Qy, Qy, Qtmp
```

`y = i64x2.le_u(a, b)` is lowered to:

```
VQSUB.U64 Qy, Qa, Qb
VCEQ.I32 Qy, Qy, 0
VREV64.32 Qtmp, Qy
VAND Qy, Qy, Qtmp
```
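The idea behind these sequences: the unsigned saturating difference is zero exactly when the minuend is not larger, each 32-bit half of the result is then tested against zero, and combining a half with its VREV64-swapped neighbour extends the test to the whole 64-bit lane. A rough intrinsics sketch of the ge_u case (my own illustration; assumes an ARMv7 NEON compiler and arm_neon.h, with a hypothetical function name):

```c
#include <arm_neon.h>

/* Sketch of i64x2.ge_u on ARMv7 NEON: the saturating difference
   b - a is zero exactly when a >= b; each 32-bit half is tested
   for zero, and ANDing a half with its VREV64.32-swapped neighbour
   turns the two half-masks into a full 64-bit lane mask. */
static uint64x2_t i64x2_ge_u_neon(uint64x2_t a, uint64x2_t b) {
    uint64x2_t diff = vqsubq_u64(b, a);                       /* VQSUB.U64 */
    uint32x4_t zero = vceqq_u32(vreinterpretq_u32_u64(diff),
                                vdupq_n_u32(0));              /* VCEQ.I32 ..., #0 */
    uint32x4_t swapped = vrev64q_u32(zero);                   /* VREV64.32 */
    return vreinterpretq_u64_u32(vandq_u32(zero, swapped));   /* VAND */
}
```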