Maratyszcza closed 3 years ago
When re-introducing some 64x2 instructions in #101, the majority voted for the option which did not include comparisons. I did not find many use cases requiring those instructions (happy to be corrected). That vote was slightly biased since it only gave 3 options, but no one voiced strong opinions against omitting the comparison instructions. What has changed since then?
This PR only includes signed instructions, so it seems like we only want a subset of instructions that lower reasonably well. What about the asymmetry of Wasm SIMD?
We discussed the issue of missing 64-bit forms in several meetings, and I have an action item to create PRs for missing instruction forms and document how they could lower to native ISAs. The primary concern with missing 64-bit forms is the resulting non-orthogonality of the WebAssembly SIMD instruction set. I grouped similar instructions into one PR, but there will be more PRs coming for other instructions.
The discussion from the meeting was to not include the instructions solely based on the symmetry of the instruction set - as @ngzhian pointed out, this was previously discussed and the community voted against including a majority of these unless there was a compelling reason to do so. To avoid redoing work that we've already done before - are there code examples where these are particularly useful? IIRC, @ngzhian evaluated several benchmarks, and only included the ones that would be lowered efficiently and were being used in real-world use cases. For the more commonly used 32x4, 16x8, and 8x16 types I agree with the symmetry argument - but for 64x2 types I'm not sure it holds.
My takeaway from the meeting was that if we want to revisit this discussion, it should be based on use cases in code. If there are efficient lowerings across architectures, and we can prove that they will be useful in real world code bases, then we can include them, but I would lean against opcode space bloat for 64x2 operations.
Special case not explicitly covered above:
If the JIT can detect that the two operands are identical, it should always call the cheapest method for returning zero regardless of whether or not the underlying instruction is used.
(corrected to be accurate for cmpgt)
@ngzhian @dtig
I think some of the driving motivation here is that it really looks like the instruction set will be finalized within a meeting or two. If that's the case, do we want to push forward a standard that doesn't have ordering operators on 64 bit? I don't think unsigned operations were left out intentionally.
> are there code examples where these are particularly useful? ... My takeaway from the meeting was that if we want to revisit this discussion, it should be based on use cases in code. If there are efficient lowerings across architectures, and we can prove that they will be useful in real world code bases, then we can include them, but I would lean against opcode space bloat for 64x2 operations.
Thanks for explaining your concerns. Added links to applications.
As for opcode bloat, I don't think it would be a concern as most 64-bit ops (although not the compare instructions) got reserved opcodes during the last renumbering.
> As for opcode bloat, I don't think it would be a concern as most 64-bit ops (although not the compare instructions) got reserved opcodes during the last renumbering.
Most of them have already been taken up by other prototype ops, with these instructions I think we will most certainly exceed 256 instructions.
> Most of them have already been taken up by other prototype ops, with these instructions I think we will most certainly exceed 256 instructions.
Even so, why is it a concern?
I don't necessarily think the instructions in this PR are expensive (though others in the same subfamily would be; also, I agree that we have made a decision not to include those), but I want to make a point about application examples. Sorry, I've been a broken record on this lately, and I should probably stop :)
On a semi-personal note, I am quite familiar with the "legacy" Flang compiler (the first example), and it is very unlikely to be targeting Wasm at any point, at the very least because it is going to be deprecated in favor of the similarly-named project already in the LLVM source tree. However, I am not sure how similar functionality would be implemented then.
Going further down the list of examples, some of those are obviously AVX, so while it is technically possible to port them, for this proposal in its current state it would mean going from AVX to SSE, which might not have enough parallelism. On the other hand, AVX examples would be very important for any "bigger" SIMD proposals, like flexible vectors.
It is possible to find examples of intrinsics' use by doing a GitHub search (example), but by themselves those are not Wasm examples yet - there might be other issues in the way of getting good performance, and they might be impossible to port efficiently for reasons other than SIMD. Probably more importantly, why do we need to chase exact instruction sequences? I thought there was a consensus that adding something for symmetry or to match native is not a goal.
- libpgmath support library for Flang compiler
- Vector Packet Processing library
- ByteSlice engine for column-store databases
- MemFusion query aggregation engine
- parasail Pairwise Sequence Alignment library
- xsimd SIMD wrapper library
- Enoki vectorization & differentiation library
- KFR DSP library
- fit-diffusion-model code for fitting diffusion models to MRI data
> The discussion from the meeting was to not include the instructions solely based on the symmetry of the instruction set - as @ngzhian pointed out, this was previously discussed and the community voted against including a majority of these unless there was a compelling reason to do so
I would agree with @dtig's comment above and would extend it a bit: if we were to add these signed comparisons but not #414 (unsigned comparisons, with worse lowerings for x86), wouldn't this make the spec less orthogonal? That said, having a patchwork of instructions that directly map to the supported ISAs would actually make sense to me, even though it seems to contradict the orthogonality intent of a bunch of these i64x2 PRs. Are others OK with that type of non-symmetry: merge this but not #414? I think I would be.
I remember that historically we have been somewhat skeptical about 64x2 instructions, as they handle just two elements, and the cost can add up quickly when the lowering is not very good. I think it would be great to get perf data for the non-trivial lowerings.
I am at least somewhat comfortable with merging in "non-orthogonal" fashion, though :) This applies to #414 as well.
Today's meeting raised concerns about non-wrapper use-cases for these instructions, so I'd like to point out some examples:
I don't think we should do a blanket search for intrinsics and call them use cases. E.g. the PyTorch example is in `vec256_int.h`. The .NET runtime example also requires AVX2. Many of these won't be trivial to port to Wasm, even with these instructions.
To put it another way, most (if not all) other instructions proposed have strong use cases in that XNNPACK will use them once those instructions are standardized, and XNNPACK benchmarks indicate performance improvements. This list of examples does not meet the same bar.
> I don't think we should do a blanket search for intrinsics and call them use cases. E.g. the PyTorch example is in `vec256_int.h`. The .NET runtime example also requires AVX2.
x86 SIMD extensions are so fragmented that many projects optimize only for a few of them. E.g. PyTorch has explicit vectorization only for AVX2 (thus `vec256_int.h`), bitonic sort in the .NET Runtime - only for AVX2 and AVX512, Google Highway targets SSE2, AVX2, and AVX512, Microsoft ONNX Runtime - SSE2, SSE4.1, AVX, and AVX2. It isn't that other x86 SIMD extensions are not useful - but developers are fatigued by writing multiple versions of the same algorithm. WAsm SIMD should strive to avoid this fragmentation and deliver an orthogonal set of instructions that could suit a wide range of applications.
> Many of these won't be trivial to port to Wasm, even with these instructions.
PyTorch vector primitives have a port to POWER VSX, which is a 128-bit SIMD extension similar to WAsm SIMD. E.g. here's the signed 64-bit comparison for greater-than. I don't see why PyTorch couldn't use equivalent WAsm SIMD instructions if they were available.
> WAsm SIMD should strive to avoid this fragmentation and deliver an orthogonal set of instructions that could suit a wide range of applications.
That's not our only goal here. Portable performance is important too. It would be frustrating for developers to discover that their code is slow for some subset of users, and learn that signed comparisons are much slower on older hardware. It's harder in this case since they won't be able to easily tell the slow group from the fast group, short of some sort of timing detection.
> WAsm SIMD should strive to avoid this fragmentation and deliver an orthogonal set of instructions that could suit a wide range of applications.
I am not saying we shouldn't, I am saying SIMD v1 doesn't need to. This SIMD proposal is not the end, it's only the beginning.
> I don't see why PyTorch couldn't use equivalent WAsm SIMD instructions if they were available.
They could. If PyTorch is targeting Wasm SIMD, and missing these instructions will hinder their work, then it's a stronger argument. Your work on instructions like load/store lane, benchmarked on XNNPACK, shows obvious wins for inclusion, and is a real-world, immediately relevant use case. The rest of the examples given here, less so.
> That's not our only goal here. Portable performance is important too. It would be frustrating for developers to discover that their code is slow for some subset of users, and learn that signed comparisons are much slower on older hardware. It's harder in this case since they won't be able to easily tell the slow group from the fast group, short of some sort of timing detection.
I agree in principle, but it is important to quantify which subset of users might experience poor performance. The proposed instructions lower to 1-3 instructions on x86 with SSE4.2, ARM with NEON, and ARM64. The only problematic cases are on x86 CPUs which support SSE4.1, but not SSE4.2. These are Intel Core 2 Duo/Quad CPUs on the 45 nm process (earlier Intel Core 2 processors on 65 nm didn't support SSE4.1, and later Nehalem-generation processors support SSE4.2). From the Wikipedia list the latest among these processors is the Core 2 Quad Q9500, released in January 2010, and long discontinued. Is this processor a good fit for WebAssembly SIMD? I doubt it: per Agner Fog, on these processors unaligned loads are internally decomposed into 4 microoperations and unaligned stores into 9 microoperations, so any SIMD code would likely perform worse than scalar (unless the code never uses full 128-bit SIMD loads/stores).
So, in general, I agree that portable performance is important and should be our goal. However, IMO performance portability to decade-old processors that are not suited for WebAssembly SIMD anyway does not weigh much toward this goal.
> If PyTorch is targeting Wasm SIMD, and that missing these instructions will hinder their work, then it's a stronger argument.
Parts of PyTorch were ported to Emscripten. SIMD is not there yet, though.
A small correction, Highway targets SSE4.1 and others, but not SSE2, indeed because there are way too many combinations. I agree with @ngzhian that performance portability is important but also with @Maratyszcza that <SSE4.2 is getting quite old and less relevant.
My opinion on orthogonality has shifted a bit recently, I wanted to detect a sign change and if so, flip all bits - also for i64. Without i64.shr nor i64.gt_s nor signselect that would have been difficult :)
Now 6 ops does look a bit scary from the point of view of performance cliffs, but the alternative of going scalar also includes a store + either a store-to-load forwarding stall (from storing one i64 half then loading i64x2) or pinsrq (2 cycles, surprisingly enough). Thus adding at least gt_s seems reasonably efficient, and does allow us and apps to take advantage of newer instruction sets.
> The only problematic cases are on x86 CPUs which support SSE4.1, but not SSE4.2.
Thanks for the detailed breakdown. I recall @lars-t-hansen mentioned some metrics he saw regarding SSE4.1, it was in the lower percentages. I looked at Chrome's numbers, ~10% of clients don't have at least SSE4.2. This number will surely go down with time, but it's not insignificant.
> Without i64.shr nor i64.gt_s nor signselect that would have been difficult :)
We have i64.shr. You need all 3 of the above?
> Thus adding at least gt_s seems reasonably efficient, and does allow us and apps to take advantage of newer instruction sets.
If we add gt_s, lt_s comes for free, since we can swap the operands. So that will make this group look slightly more complete.
Adding a preliminary vote for the inclusion of i64x2 signed comparison operations to the SIMD proposal below. Please vote with -

👍 For including i64x2 signed comparison operations
👎 Against including i64x2 signed comparison operations
IMO, PyTorch is one of those examples that has much larger issues than SIMD support, as Python is not really running in Wasm today.
Libraries like PyTorch, NumPy, and the Flang RTL would at least require their "entry" language to be compiled to Wasm (NumPy stands out as it also requires Fortran). That's why I personally don't think those are valid examples of apps by themselves - code running on top of them would be, which makes using them for our purposes even more far-fetched.
@penzn I ported CPython to Emscripten even before it was cool, before WebAssembly, and today embedding Python in WAsm binaries does not raise eyebrows. In PyTorch the 64-bit comparison intrinsics are used in ATen, its Python-independent part that is used e.g. in mobile deployments. You don't need Python to use ATen or even to run NN inference on a PyTorch model.
No doubt it is possible to compile it, but what about performance - what does it use for GC and how well does that work?
CPython objects are reference-counted
Introduction

This is a proposal to add 64-bit variants of the existing `gt_s`, `lt_s`, `ge_s`, and `le_s` instructions. ARM64 and x86 (since SSE4.2) natively support the `i64x2.gt_s` instruction, and on ARMv7 NEON it can be efficiently emulated with 3-4 instructions. The `i64x2.lt_s` instruction is equivalent to `i64x2.gt_s` with reversed order of input operands. `i64x2.le_s` and `i64x2.ge_s` are equivalent to a binary NOT operation applied to the results of `i64x2.gt_s` and `i64x2.lt_s` respectively.

Applications
Mapping to Common Instruction Sets
This section illustrates how the new WebAssembly instructions can be lowered on common instruction sets. However, these patterns are provided only for convenience; compliant WebAssembly implementations do not have to follow the same code generation patterns.
x86/x86-64 processors with AVX512F and AVX512VL instruction sets
- `y = i64x2.ge_s(a, b)` is lowered to:
  - `VPCMPGTQ xmm_y, xmm_b, xmm_a`
  - `VPTERNLOGQ xmm_y, xmm_y, xmm_y, 0x55`
- `y = i64x2.le_s(a, b)` is lowered to:
  - `VPCMPGTQ xmm_y, xmm_a, xmm_b`
  - `VPTERNLOGQ xmm_y, xmm_y, xmm_y, 0x55`
x86/x86-64 processors with XOP instruction set
- `y = i64x2.ge_s(a, b)` is lowered to `VPCOMGEQ xmm_y, xmm_a, xmm_b`
- `y = i64x2.le_s(a, b)` is lowered to `VPCOMLEQ xmm_y, xmm_a, xmm_b`
x86/x86-64 processors with AVX instruction set
- `y = i64x2.gt_s(a, b)` is lowered to `VPCMPGTQ xmm_y, xmm_a, xmm_b`
- `y = i64x2.lt_s(a, b)` is lowered to `VPCMPGTQ xmm_y, xmm_b, xmm_a`
- `y = i64x2.ge_s(a, b)` is lowered to:
  - `VPCMPGTQ xmm_y, xmm_b, xmm_a`
  - `VPXOR xmm_y, xmm_y, [wasm_i64x2_splat(-1)]`
- `y = i64x2.le_s(a, b)` is lowered to:
  - `VPCMPGTQ xmm_y, xmm_a, xmm_b`
  - `VPXOR xmm_y, xmm_y, [wasm_i64x2_splat(-1)]`
x86/x86-64 processors with SSE4.2 instruction set
- `y = i64x2.gt_s(a, b)` (`y` is not `b`) is lowered to `MOVDQA xmm_y, xmm_a` + `PCMPGTQ xmm_y, xmm_b`
- `y = i64x2.lt_s(a, b)` (`y` is not `a`) is lowered to `MOVDQA xmm_y, xmm_b` + `PCMPGTQ xmm_y, xmm_a`
- `y = i64x2.ge_s(a, b)` (`y` is not `a`) is lowered to:
  - `MOVDQA xmm_y, xmm_b`
  - `PCMPGTQ xmm_y, xmm_a`
  - `PXOR xmm_y, [wasm_i64x2_splat(-1)]`
- `y = i64x2.le_s(a, b)` (`y` is not `b`) is lowered to:
  - `MOVDQA xmm_y, xmm_a`
  - `PCMPGTQ xmm_y, xmm_b`
  - `PXOR xmm_y, [wasm_i64x2_splat(-1)]`
x86/x86-64 processors with SSE2 instruction set
Based on this answer by user aqrit on Stack Overflow
- `y = i64x2.gt_s(a, b)` (`y` is not `a` and `y` is not `b`) is lowered to:
  - `MOVDQA xmm_y, xmm_b`
  - `MOVDQA xmm_tmp, xmm_a`
  - `PSUBQ xmm_y, xmm_a`
  - `PCMPEQD xmm_tmp, xmm_b`
  - `PAND xmm_y, xmm_tmp`
  - `MOVDQA xmm_tmp, xmm_a`
  - `PCMPGTD xmm_tmp, xmm_b`
  - `POR xmm_y, xmm_tmp`
  - `PSHUFD xmm_y, xmm_y, 0xF5`
- `y = i64x2.lt_s(a, b)` (`y` is not `a` and `y` is not `b`) is lowered to:
  - `MOVDQA xmm_y, xmm_a`
  - `MOVDQA xmm_tmp, xmm_b`
  - `PSUBQ xmm_y, xmm_b`
  - `PCMPEQD xmm_tmp, xmm_a`
  - `PAND xmm_y, xmm_tmp`
  - `MOVDQA xmm_tmp, xmm_b`
  - `PCMPGTD xmm_tmp, xmm_a`
  - `POR xmm_y, xmm_tmp`
  - `PSHUFD xmm_y, xmm_y, 0xF5`
- `y = i64x2.ge_s(a, b)` (`y` is not `a` and `y` is not `b`) is lowered to:
  - `MOVDQA xmm_y, xmm_a`
  - `MOVDQA xmm_tmp, xmm_b`
  - `PSUBQ xmm_y, xmm_b`
  - `PCMPEQD xmm_tmp, xmm_a`
  - `PAND xmm_y, xmm_tmp`
  - `MOVDQA xmm_tmp, xmm_b`
  - `PCMPGTD xmm_tmp, xmm_a`
  - `POR xmm_y, xmm_tmp`
  - `PSHUFD xmm_y, xmm_y, 0xF5`
  - `PXOR xmm_y, [wasm_i64x2_splat(-1)]`
- `y = i64x2.le_s(a, b)` (`y` is not `a` and `y` is not `b`) is lowered to:
  - `MOVDQA xmm_y, xmm_b`
  - `MOVDQA xmm_tmp, xmm_a`
  - `PSUBQ xmm_y, xmm_a`
  - `PCMPEQD xmm_tmp, xmm_b`
  - `PAND xmm_y, xmm_tmp`
  - `MOVDQA xmm_tmp, xmm_a`
  - `PCMPGTD xmm_tmp, xmm_b`
  - `POR xmm_y, xmm_tmp`
  - `PSHUFD xmm_y, xmm_y, 0xF5`
  - `PXOR xmm_y, [wasm_i64x2_splat(-1)]`
ARM64 processors
- `y = i64x2.gt_s(a, b)` is lowered to `CMGT Vy.2D, Va.2D, Vb.2D`
- `y = i64x2.lt_s(a, b)` is lowered to `CMGT Vy.2D, Vb.2D, Va.2D`
- `y = i64x2.ge_s(a, b)` is lowered to `CMGE Vy.2D, Va.2D, Vb.2D`
- `y = i64x2.le_s(a, b)` is lowered to `CMGE Vy.2D, Vb.2D, Va.2D`
ARMv7 processors with NEON instruction set
Based on this answer by user aqrit on Stack Overflow
- `y = i64x2.gt_s(a, b)` is lowered to:
  - `VQSUB.S64 Qy, Qb, Qa`
  - `VSHR.S64 Qy, Qy, #63`
- `y = i64x2.lt_s(a, b)` is lowered to:
  - `VQSUB.S64 Qy, Qa, Qb`
  - `VSHR.S64 Qy, Qy, #63`
- `y = i64x2.ge_s(a, b)` is lowered to:
  - `VQSUB.S64 Qy, Qa, Qb`
  - `VSHR.S64 Qy, Qy, #63`
  - `VMVN Qy, Qy`
- `y = i64x2.le_s(a, b)` is lowered to:
  - `VQSUB.S64 Qy, Qb, Qa`
  - `VSHR.S64 Qy, Qy, #63`
  - `VMVN Qy, Qy`