WebAssembly / simd

Branch of the spec repo scoped to discussion of SIMD in WebAssembly

i64x2.all_true instructions #415

Closed Maratyszcza closed 3 years ago

Maratyszcza commented 3 years ago

Introduction

This is a proposal to add a 64-bit variant of the existing all_true instruction. This instruction complements the set of double-precision (f64) and 64-bit integer (i64) SIMD operations and allows checking for special conditions. It appears that these instructions were accidentally omitted when double-precision SIMD was added back to the spec in #101. Note that despite the i64x2 in the instruction names, they are agnostic to the interpretation of SIMD lanes and work for f64x2 as well.
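
For concreteness, here is a scalar model of the semantics (a sketch of my own, not normative spec text): the result is 1 when every 64-bit lane is nonzero and 0 otherwise, regardless of whether the lanes are viewed as integers or doubles.

#include <stdint.h>

/* Scalar model of i64x2.all_true: 1 if both 64-bit lanes are nonzero, else 0.
   Only "all bits zero" versus "not all bits zero" matters per lane, which is
   why the instruction is lane-interpretation agnostic. */
static int i64x2_all_true_scalar(const uint64_t lanes[2]) {
    return (lanes[0] != 0) && (lanes[1] != 0);
}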

Mapping to Common Instruction Sets

This section illustrates how the new WebAssembly instructions can be lowered on common instruction sets. However, these patterns are provided only for convenience; compliant WebAssembly implementations do not have to follow the same code generation patterns.

x86/x86-64 processors with AVX instruction set

x86/x86-64 processors with SSE4.1 instruction set
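
The lowering listings from the PR are not reproduced in this extract. Purely as an illustration (my sketch, not the proposed sequence), an equivalent check written with SSE4.1 intrinsics could look like the code below; built with AVX enabled, the same intrinsics map to the VEX-encoded forms, so it covers the AVX case above as well.

#include <smmintrin.h>  // SSE4.1: PCMPEQQ / PTEST

// Sketch of an i64x2.all_true-style check: 1 if both 64-bit lanes of v are nonzero.
static inline int i64x2_all_true_sse41(__m128i v) {
    __m128i eqz = _mm_cmpeq_epi64(v, _mm_setzero_si128()); // lane = all-ones iff that lane of v is 0
    return _mm_testz_si128(eqz, eqz);                      // 1 iff eqz is all zeros, i.e. no lane of v is 0
}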

x86/x86-64 processors with SSE2 instruction set
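
Again as an illustration rather than the PR's listing, an SSE2-only variant has no 64-bit compare and has to synthesize the per-lane "is zero" mask from 32-bit operations:

#include <emmintrin.h>  // SSE2

// Sketch: without PCMPEQQ, combine 32-bit compares into a per-64-bit-lane mask.
static inline int i64x2_all_true_sse2(__m128i v) {
    __m128i eqz32 = _mm_cmpeq_epi32(v, _mm_setzero_si128());           // 32-bit lanes compared with 0
    __m128i swap  = _mm_shuffle_epi32(eqz32, _MM_SHUFFLE(2, 3, 0, 1)); // swap the halves of each 64-bit lane
    __m128i eqz64 = _mm_and_si128(eqz32, swap);                        // 64-bit lane all-ones iff that lane of v is 0
    return _mm_movemask_pd(_mm_castsi128_pd(eqz64)) == 0;              // no sign bit set => no zero lane
}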

ARM64 processors
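
For ARM64, an intrinsics-level equivalent (again my illustration, not the sequence the PR proposed; corrections to that sequence are discussed further down) could be:

#include <arm_neon.h>

// Sketch: each lane of eqz is all-ones iff that 64-bit lane of v is zero; reduce with an unsigned max.
static inline int i64x2_all_true_arm64(uint64x2_t v) {
    uint64x2_t eqz = vceqzq_u64(v);                      // CMEQ ..., #0
    return vmaxvq_u32(vreinterpretq_u32_u64(eqz)) == 0;  // UMAXV over 4x32 lanes; 0 iff no lane of v was zero
}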

ARMv7 processors with NEON instruction set
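
ARMv7 NEON has no 64-bit compare, so an illustrative intrinsics-level check (my sketch, not the PR's listing; a corrected assembly sequence is discussed further down the thread) can reduce each 64-bit lane with pairwise 32-bit operations:

#include <arm_neon.h>

// Sketch: a 64-bit lane is nonzero iff the max of its two 32-bit halves is nonzero;
// all_true holds iff the smaller of the two per-lane maxima is still nonzero.
static inline int i64x2_all_true_armv7(uint64x2_t v) {
    uint32x4_t v32 = vreinterpretq_u32_u64(v);
    uint32x2_t per_lane = vpmax_u32(vget_low_u32(v32), vget_high_u32(v32)); // VPMAX.U32
    uint32x2_t reduced  = vpmin_u32(per_lane, per_lane);                    // VPMIN.U32, answer in lane 0
    return vget_lane_u32(reduced, 0) != 0;
}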

ngzhian commented 3 years ago

I guess with #416 (likely to happen), this PR can be simplified to just i64x2.all_true.

Maratyszcza commented 3 years ago

Removed i64x2.any_true

abrown commented 3 years ago

I think this PR is similar to #368: a reduction from many lanes to a scalar. Since its i32x4 version, i32x4.all_true, has a lowering as unfortunate on x86 as this one, why not consider both together? I would argue that the long sequences that *.all_true introduces are potholes in the spec--not giant cliffs, but still unexpected performance issues in a spec designed for near-native performance. But since I doubt we would agree to remove *.all_true altogether in favor of the *.bitmask instructions, this i64x2.all_true version makes sense.
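
For readers weighing *.all_true against *.bitmask, here is a scalar sketch (my illustration, not spec text) of how the two relate for the i64x2 shape: all_true is recoverable from bitmask plus a lane-wise compare with zero.

#include <stdint.h>

// i64x2.bitmask packs the sign bit (bit 63) of each lane into an integer.
static int i64x2_bitmask(const uint64_t v[2]) {
    return (int)((v[0] >> 63) | ((v[1] >> 63) << 1));
}

// A lane-wise "== 0" sets all bits (including the sign bit) in every zero lane,
// so all lanes are nonzero exactly when the bitmask of that comparison is 0.
static int i64x2_all_true_via_bitmask(const uint64_t v[2]) {
    uint64_t eqz[2] = { v[0] == 0 ? ~UINT64_C(0) : 0, v[1] == 0 ? ~UINT64_C(0) : 0 };
    return i64x2_bitmask(eqz) == 0;
}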

lars-t-hansen commented 3 years ago

@abrown, it's a reduction, but my expectation here would be that all_true is frequently used in a control context, i.e. as the argument to a branch, not captured for its value. In that case there's no need for the final setnz / movzxbl pair, and the resulting code doesn't look all that bad IMO.
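
As a small illustration of that usage pattern (assuming an SSE4.1 target; the helper names here are hypothetical), when the result only feeds a branch a compiler can test the flags from PTEST directly, so no SETcc/MOVZX materialization is needed:

#include <smmintrin.h>

void consume_pair(void);  // hypothetical fast-path work, for illustration only

// If both 64-bit lanes of v are nonzero, take the fast path; the condition
// can compile to PCMPEQQ + PTEST + a conditional jump, with no scalar 0/1 value.
void process_if_all_true(__m128i v) {
    __m128i eqz = _mm_cmpeq_epi64(v, _mm_setzero_si128());
    if (_mm_testz_si128(eqz, eqz)) {
        consume_pair();
    }
}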

abrown commented 3 years ago

Yes, under that assumption it is not as bad (still not great, though). I guess you are also assuming that engines would be looking for and applying this type of optimization? I think I would as well but not sure about everyone else.

lars-t-hansen commented 3 years ago

Yes, I would assume engines would optimize boolean evaluation for control even for SIMD (SpiderMonkey does, anyway).

Curious about the SSE4.1 lowering, actually. Why not:

PXOR tmp, tmp           ; tmp = 0
PCMPEQQ tmp, src        ; each 64-bit lane of tmp = all-ones iff that lane of src is 0
PMOVMSKB dest, tmp      ; 16-bit byte mask of tmp: nonzero iff some lane of src was 0
TEST dest, dest
SETZ dest               ; dest = 1 iff the mask is 0, i.e. no lane of src was 0
MOVZXBL dest, dest

Maratyszcza commented 3 years ago

@lars-t-hansen Both sequences use the same number of instructions, but the SSE4.1 sequence in the proposal has a shorter dependency chain. Also, TEST dest, dest should be CMP dest, 0xFFFF with a 4-byte immediate.

penzn commented 3 years ago

Ideally it would be great to test this, but that is just my opinion.

dtig commented 3 years ago

Approving, as this was voted on in #419. @tlively could you confirm that the assigned opcode looks good for the tools?

ngzhian commented 3 years ago

0xfdc3 collides with i16x8.extadd_pairwise_i8x16_u in V8 currently; I guess I should change the extadd opcode (since I picked that one for prototyping).

tlively commented 3 years ago

Probably best to err on the side of caution and make the opcode TBD, then.

ngzhian commented 3 years ago

I think the last step of the ARMv7 lowering is incorrect: NEG Ry, Ry should be ADD Ry, Ry, #1.

IIUC Ry will either be -1 or 0, and the result we want is 0 and 1 respectively.

Also, the sequence doesn't seem to work for i64x2(0, -1); I think all the vpmin should be vpmax. Here's what I cooked up that seems to pass the tests:

vpmax.u32 tmp_low, src_low, src_high   @ per 64-bit lane: max of its two 32-bit halves, nonzero iff the lane is nonzero
vceq.i32  tmp_low, tmp_low, #0         @ lane = all-ones iff the corresponding 64-bit lane was zero
vpmax.u32 tmp_low, tmp_low, tmp_low    @ lane 0 = all-ones iff any 64-bit lane was zero
vmov.32   dst, tmp_low[0]              @ dst = -1 if any lane was zero, else 0
add       dst, dst, #1                 @ map -1 -> 0, 0 -> 1

Still 5 instructions, lmk if you figure out something better.

akirilov-arm commented 3 years ago

@Maratyszcza The ARM64 instruction sequence is wrong too - here's a corrected version:

CMEQ Vtmp.2d, Vx.2d, #0
UMAXV Stmp, Vtmp.4s
FMOV Wy, Stmp
ADD Wy, Wy, #1

However, the following is an improved sequence, since it avoids the reduction:

CMEQ Vtmp.2d, Vx.2d, #0
ADDP Dtmp, Vtmp.2d
FCMP Dtmp, Dtmp
CSET Xy, eq

Finally, possibly the best alternative, which is, unfortunately, an instruction longer:

FMOV Xy, Dx
FMOV Xtmp, Vx.d[1]
CMP Xy, #0
CCMP Xtmp, #0, #4, ne
CSET Xy, ne

lars-t-hansen commented 3 years ago

In reference to https://github.com/WebAssembly/simd/pull/415#issuecomment-758884709

> Also, TEST dest, dest should be CMP dest, 0xFFFF with a 4-byte immediate.

@Maratyszcza, I don't think that's right, if what you're getting at is correctness and not performance. My sequence compares the input to zero, so the resulting byte mask has bits set for every lane that is zero; hence the TEST succeeds if the mask is zero everywhere, and SETZ captures that.