Closed: Maratyszcza closed this pull request 3 years ago
I guess with #416 (likely to happen), this PR can be simplified to just i64x2.all_true.
Removed i64x2.any_true
I think this PR is similar to #368: a reduction from many lanes to a scalar. Since its i32x4 version, `i32x4.all_true`, has a lowering as unfortunate on x86 as this one, why not consider both together? I would argue that the long sequences that `*.all_true` introduces are potholes in the spec--not giant cliffs, but still unexpected performance issues in a spec designed for near-native performance. But since I doubt we would agree to remove `*.all_true` altogether in favor of the `*.bitmask` instructions, this i64x2.all_true version makes sense.
@abrown, it's a reduction, but my expectation here would be that all_true is frequently used in a control context, i.e. as the argument to a branch, not captured for its value. In that case there's no reason for the final `setnz` / `movzxbl` pair, and the final code doesn't look all that bad IMO.
Yes, under that assumption it is not as bad (still not great, though). I guess you are also assuming that engines would be looking for and applying this type of optimization? I think I would as well but not sure about everyone else.
Yes, I would assume engines would optimize boolean evaluation for control even for SIMD (SpiderMonkey does, anyway).
Curious about the SSE4.1 lowering actually. Why not

```
PXOR tmp, tmp
PCMPEQQ tmp, src
PMOVMSKB dest, tmp
TEST dest, dest
SETZ dest
MOVZXBL dest, dest
```
@lars-t-hansen Both sequences use the same number of instructions, but the SSE4.1 sequence in the proposal has a shorter dependency chain. Also, `TEST dest, dest` should be `CMP dest, 0xFFFF` with a 4-byte immediate.
Ideally it would be great to test this, but that is just my opinion.
Approving as this was voted on as of #419. @tlively could you confirm that the assigned opcode looks good for the tools?
0xfdc3 collides with i16x8.extadd_pairwise_i8x16_u in V8 currently; I guess I should change extadd (since I picked that opcode for prototyping).
Probably best to err on the side of caution and make the opcode TBD, then.
Think the last step of the ARMv7 lowering is incorrect: `NEG Ry, Ry` should be `ADD Ry, Ry, #1`. IIUC Ry will either be -1 or 0, and the result we want is 0 and 1 respectively.
Also, the sequence doesn't seem to work for i64x2(0, -1); I think all the `vpmin` should be `vpmax`. Here's what I cooked up that seems to pass the tests:

```
vpmax.u32 tmp_lo, src_lo, src_hi
vceq.i32  tmp_lo, tmp_lo, #0
vpmax.u32 tmp_lo, tmp_lo, tmp_lo
vmov.32   dst, tmp_lo[0]
add       dst, dst, #1
```

Still 5 instructions, lmk if you figure out something better.
@Maratyszcza The ARM64 instruction sequence is wrong too - here's a corrected version:

```
CMEQ Vtmp.2d, Vx.2d, #0
UMAXV Stmp, Vtmp.4s
FMOV Wy, Stmp
ADD Wy, Wy, #1
```

However, the following is an improved sequence, since it avoids the reduction:

```
CMEQ Vtmp.2d, Vx.2d, #0
ADDP Dtmp, Vtmp.2d
FCMP Dtmp, Dtmp
CSET Xy, eq
```

Finally, possibly the best alternative, which is, unfortunately, an instruction longer:

```
FMOV Xy, Dx
FMOV Xtmp, Vx.d[1]
CMP Xy, #0
CCMP Xtmp, #0, #4, ne
CSET Xy, ne
```
In reference to https://github.com/WebAssembly/simd/pull/415#issuecomment-758884709

Also, `TEST dest, dest` should be `CMP dest, 0xFFFF` with a 4-byte immediate.
@Maratyszcza, I don't think that's right, if what you're getting at is correctness and not performance. My sequence compares the input to zero, meaning the resulting bitmask will be nonzero in any lane that is zero; hence the test succeeds if the bitmask is zero everywhere, and `setz` captures that.
Introduction

This is a proposal to add a 64-bit variant of the existing all_true instruction. This instruction complements the set of double-precision (`f64`) and 64-bit integer (`i64`) SIMD operations and allows checking for special conditions. It appears that these instructions were accidentally omitted when the double-precision SIMD operations were added back to the spec in #101. Note that despite `i64x2` in the instruction names, they are agnostic to the interpretation of SIMD lanes and work for `f64x2` as well.

Mapping to Common Instruction Sets
This section illustrates how the new WebAssembly instructions can be lowered on common instruction sets. However, these patterns are provided only for convenience; compliant WebAssembly implementations do not have to follow the same code generation patterns.
x86/x86-64 processors with AVX instruction set
`y = i64x2.all_true(x)` is lowered to:

```
VPXOR xmm_tmp1, xmm_tmp1, xmm_tmp1
VPCMPEQD xmm_tmp2, xmm_tmp2, xmm_tmp2
VPCMPEQQ xmm_tmp1, xmm_tmp1, xmm_x
VPTEST xmm_tmp1, xmm_tmp2
SETNZ reg_y
MOVZXBL reg_y, reg_y
```
x86/x86-64 processors with SSE4.1 instruction set
`y = i64x2.all_true(x)` is lowered to:

```
PXOR xmm_tmp1, xmm_tmp1
PCMPEQD xmm_tmp2, xmm_tmp2
PCMPEQQ xmm_tmp1, xmm_x
PTEST xmm_tmp1, xmm_tmp2
SETNZ reg_y
MOVZXBL reg_y, reg_y
```
x86/x86-64 processors with SSE2 instruction set
`y = i64x2.all_true(x)` is lowered to:

```
PXOR xmm_tmp1, xmm_tmp1
PCMPEQD xmm_tmp1, xmm_x
PSHUFD xmm_tmp2, xmm_tmp1, 0xB1
PAND xmm_tmp1, xmm_tmp2
PMOVMSKB reg_y, xmm_tmp1
CMP reg_y, 0xFFFF
SETE reg_y
MOVZXBL reg_y, reg_y
```
ARM64 processors
`y = i64x2.all_true(x)` is lowered to:

```
CMEQ Vtmp.2D, Vx.2D, 0
UMINV Stmp, Vtmp.4S
UMOV Ry, Vtmp.S[0]
NEG Ry, Ry
```
ARMv7 processors with NEON instruction set
`y = i64x2.all_true(x)` is lowered to:

```
VCEQ.I32 Qtmp, Qx, 0
VPMIN.U32 Dtmp_lo, Dtmp_lo, Dtmp_hi
VPMIN.U32 Dtmp_lo, Dtmp_lo, Dtmp_lo
VMOV Ry, Dtmp_lo[0]
NEG Ry, Ry
```