Maratyszcza closed 3 years ago
When re-introducing some 64x2 instructions in #101, the majority voted for the option which did not include comparisons. I did not find many use cases requiring those instructions (happy to be corrected). That vote was slightly biased since it only gave 3 options, but no one voiced strong opinions against omitting the comparison instructions. What has changed since then?
This PR only includes signed instructions, so it seems like we only want a subset of instructions that lower reasonably well. What about the asymmetry of Wasm SIMD?
We discussed the issue of missing 64-bit forms in several meetings, and I have an action item to create PRs for missing instruction forms and document how they could lower to native ISAs. The primary concern with missing 64-bit forms is the resulting non-orthogonality of the WebAssembly SIMD instruction set. I grouped similar instructions into one PR, but there will be more PRs coming for other instructions.
The discussion from the meeting was to not include the instructions solely based on the symmetry of the instruction set - as @ngzhian pointed out, this was previously discussed and the community voted against including a majority of these unless there was a compelling reason to do so. To avoid redoing work that we've already done before - are there code examples where these are particularly useful? IIRC, @ngzhian evaluated several benchmarks, and only included the ones that would be lowered efficiently and were being used in real-world use cases. For the more commonly used 32x4, 16x8, and 8x16 types I agree with the symmetry argument - but for 64x2 types I'm not sure it holds.
My takeaway from the meeting was that if we want to revisit this discussion, it should be based on use cases in code. If there are efficient lowerings across architectures, and we can prove that they will be useful in real world code bases, then we can include them, but I would lean against opcode space bloat for 64x2 operations.
Special case not explicitly covered above:
If the JIT can detect that the two operands are identical, it should always call the cheapest method for returning zero regardless of whether or not the underlying instruction is used.
(corrected to be accurate for cmpgt)
@ngzhian @dtig
I think some of the driving motivation here is that it really looks like the instruction set will be finalized within a meeting or two. If that's the case, do we want to push forward a standard that doesn't have ordering operators on 64 bit? I don't think unsigned operations were left out intentionally.
> are there code examples where these are particularly useful? ... My takeaway from the meeting was that if we want to revisit this discussion, it should be based on use cases in code. If there are efficient lowerings across architectures, and we can prove that they will be useful in real world code bases, then we can include them, but I would lean against opcode space bloat for 64x2 operations.
Thanks for explaining your concerns. Added links to applications.
As for opcode bloat, I don't think it would be a concern as most 64-bit ops (although not the compare instructions) got reserved opcodes during the last renumbering.
> As for opcode bloat, I don't think it would be a concern as most 64-bit ops (although not the compare instructions) got reserved opcodes during the last renumbering.
Most of them have already been taken up by other prototype ops, with these instructions I think we will most certainly exceed 256 instructions.
> Most of them have already been taken up by other prototype ops, with these instructions I think we will most certainly exceed 256 instructions.
Even so, why is it a concern?
I don't necessarily think the instructions in this PR are expensive (though others in the same subfamily would be; also, I agree that we have made a decision not to include those), but I want to make a point about application examples. Sorry, I've been a broken record on this lately, and I should probably stop :)
On a semi-personal note, I am quite familiar with the "legacy" Flang compiler (the first example), and it is very unlikely to be targeting Wasm at any point, at the very least because it is going to be deprecated in favor of the similarly-named project already in the LLVM source tree. However, I am not sure how similar functionality would be implemented then.
Going further down the list of examples, some of those are obviously AVX, so while it is technically possible to port them, for this proposal in its current state it would mean going from AVX to SSE, which might not have enough parallelism. On the other hand, AVX examples would be very important for any "bigger" SIMD proposals, like flexible vectors.
It is possible to find examples of intrinsics' use by doing a GitHub search (example), but by themselves those are not Wasm examples yet - there might be other issues in the way of getting good performance, and they might be impossible to port efficiently for reasons other than SIMD. Probably more importantly, why do we need to chase exact instruction sequences? I thought there was a consensus that adding something for symmetry or to match native is not a goal.
- libpgmath support library for Flang compiler
- Vector Packet Processing library
- ByteSlice engine for column-store databases
- MemFusion query aggregation engine
- parasail Pairwise Sequence Alignment library
- xsimd SIMD wrapper library
- Enoki vectorization & differentiation library
- KFR DSP library
- fit-diffusion-model code for fitting diffusion models to MRI data
> The discussion from the meeting was to not include the instructions solely based on the symmetry of the instruction set - as @ngzhian pointed out, this was previously discussed and the community voted against including a majority of these unless there was a compelling reason to do so
I would agree with @dtig's comment above and would extend it a bit: if we were to add these signed comparisons but not #414 (unsigned comparisons, with worse lowerings for x86), wouldn't this make the spec less orthogonal? That said, having a patchwork of instructions that directly map to the supported ISAs would actually make sense to me, even though it seems to contradict the orthogonality intent of a bunch of these i64x2 PRs. Are others OK with that type of non-symmetry: merge this but not #414? I think I would be.
I remember that historically we have been somewhat skeptical about 64x2 instructions, as they handle just two elements, and the cost can add up quickly when the lowering is not very good. I think it would be great to get perf data for the non-trivial lowerings.
I am at least somewhat comfortable with merging in "non-orthogonal" fashion, though :) This applies to #414 as well.
Today's meeting raised concerns about non-wrapper use-cases for these instructions, so I'd like to point out some examples:
I don't think we should do a blanket search for intrinsics and call them use cases. E.g. the PyTorch example is in `vec256_int.h`. The .NET runtime example also requires AVX2. Many of these won't be trivial to port to Wasm, even with these instructions.
To put it another way, most (if not all) other instructions proposed have strong use cases in that XNNPACK will use them once those instructions are standardized, and XNNPACK benchmarks indicate performance improvements. This list of examples does not meet the same bar.
> I don't think we should do a blanket search for intrinsics and call them use cases. E.g. the PyTorch example is in `vec256_int.h`. The .NET runtime example also requires AVX2.
x86 SIMD extensions are so fragmented that many projects optimize only for a few of them. E.g. PyTorch has explicit vectorization only for AVX2 (thus `vec256_int.h`), bitonic sort in the .NET Runtime - only for AVX2 and AVX512, Google Highway targets SSE2, AVX2, and AVX512, Microsoft ONNX Runtime - SSE2, SSE4.1, AVX, and AVX2. It isn't that other x86 SIMD extensions are not useful - but developers are fatigued by writing multiple versions of the same algorithm. WAsm SIMD should strive to avoid this fragmentation and deliver an orthogonal set of instructions that could suit a wide range of applications.
> Many of these won't be trivial to port to Wasm, even with these instructions.
PyTorch vector primitives have a port to POWER VSX, which is a 128-bit SIMD extension similar to WAsm SIMD. E.g. here's the signed 64-bit comparison for greater-than. I don't see why PyTorch couldn't use equivalent WAsm SIMD instructions if they were available.
> WAsm SIMD should strive to avoid this fragmentation and deliver an orthogonal set of instructions that could suit a wide range of applications.
That's not our only goal here. Portable performance is important too. It would be frustrating for developers to discover that their code is slow for some subset of users, and learn that signed comparisons are much slower on older hardware. It's harder in this case since they won't be able to easily tell the slow group from the fast group, short of some sort of timing detection.
> WAsm SIMD should strive to avoid this fragmentation and deliver an orthogonal set of instructions that could suit a wide range of applications.
I am not saying we shouldn't, I am saying SIMD v1 doesn't need to. This SIMD proposal is not the end, it's only the beginning.
> I don't see why PyTorch couldn't use equivalent WAsm SIMD instructions if they were available.
They could. If PyTorch is targeting Wasm SIMD, and missing these instructions will hinder their work, then it's a stronger argument. Your work on instructions like load/store lane, benchmarked on XNNPACK, shows obvious wins for inclusion, and is a real-world, immediately relevant use case. The rest of the examples given here, less so.
> That's not our only goal here. Portable performance is important too. It would be frustrating for developers to discover that their code is slow for some subset of users, and learn that signed comparisons are much slower on older hardware. It's harder in this case since they won't be able to easily tell the slow group from the fast group, short of some sort of timing detection.
I agree in principle, but it is important to quantify which subset of users might experience poor performance. The proposed instructions lower to 1-3 instructions on x86 with SSE4.2, ARM with NEON, and ARM64. The only problematic cases are on x86 CPUs which support SSE4.1, but not SSE4.2. These are Intel Core 2 Duo/Quad CPUs on the 45 nm process (earlier Intel Core 2 processors on 65 nm didn't support SSE4.1, and later Nehalem-generation processors support SSE4.2). From the Wikipedia list the latest among these processors is the Core 2 Quad Q9500, released in January 2010, and long discontinued. Is this processor a good fit for WebAssembly SIMD? I doubt it: per Agner Fog, on these processors unaligned loads are internally decomposed into 4 microoperations and unaligned stores into 9 microoperations, so any SIMD code would likely perform worse than scalar (unless the code never uses full 128-bit SIMD loads/stores).
So, in general, I agree that portable performance is important and should be our goal. However, IMO performance portability to decade-old processors that are not suited for WebAssembly SIMD anyway does not weigh much toward this goal.
> If PyTorch is targeting Wasm SIMD, and that missing these instructions will hinder their work, then it's a stronger argument.
Parts of PyTorch were ported to Emscripten. SIMD is not there yet, though.
A small correction, Highway targets SSE4.1 and others, but not SSE2, indeed because there are way too many combinations. I agree with @ngzhian that performance portability is important but also with @Maratyszcza that <SSE4.2 is getting quite old and less relevant.
My opinion on orthogonality has shifted a bit recently, I wanted to detect a sign change and if so, flip all bits - also for i64. Without i64.shr nor i64.gt_s nor signselect that would have been difficult :)
Now 6 ops does look a bit scary from the point of view of performance cliffs, but the alternative of going scalar also includes a store + either a store-to-load forwarding stall (from storing one i64 half then loading i64x2) or pinsrq (2 cycles, surprisingly enough). Thus adding at least gt_s seems reasonably efficient, and does allow us and apps to take advantage of newer instruction sets.
> The only problematic cases are on x86 CPUs which support SSE4.1, but not SSE4.2.
Thanks for the detailed breakdown. I recall @lars-t-hansen mentioned some metrics he saw regarding SSE4.1, it was in the lower percentages. I looked at Chrome's numbers, ~10% of clients don't have at least SSE4.2. This number will surely go down with time, but it's not insignificant.
> Without i64.shr nor i64.gt_s nor signselect that would have been difficult :)
We have i64.shr. You need all 3 of the above?
> Thus adding at least gt_s seems reasonably efficient, and does allow us and apps to take advantage of newer instruction sets.
If we add gt_s, lt_s comes for free, since we can swap the operands. So that will make this group look slightly more complete.
Adding a preliminary vote for the inclusion of i64x2 signed comparison operations to the SIMD proposal below. Please vote with -

👍 For including i64x2 signed comparison operations
👎 Against including i64x2 signed comparison operations
IMO, PyTorch is one of those examples that has much larger issues than SIMD support, as Python is not really running in Wasm today.
Libraries like PyTorch, NumPy, and the Flang RTL would at least require their "entry" language to be compiled to Wasm (NumPy stands out as it also requires Fortran). That's why I personally don't think those are valid examples of apps by themselves - code running on top of them would be, which makes using them for our purposes even more far-fetched.
@penzn I ported CPython to Emscripten even before it was cool, before WebAssembly, and today embedding Python in WAsm binaries does not raise eyebrows. In PyTorch the 64-bit comparison intrinsics are used in ATen, its Python-independent part that is used e.g. in mobile deployments. You don't need Python to use ATen or even to run NN inference on a PyTorch model.
No doubt it is possible to compile it, but what about performance - what does it use for GC and how well does that work?
CPython objects are reference-counted
Introduction

This is a proposal to add 64-bit variants of the existing `gt_s`, `lt_s`, `ge_s`, and `le_s` instructions. ARM64 and x86 (since SSE4.2) natively support the `i64x2.gt_s` instruction, and on ARMv7 NEON it can be efficiently emulated with 3-4 instructions. The `i64x2.lt_s` instruction is equivalent to `i64x2.gt_s` with reversed order of input operands. `i64x2.le_s` and `i64x2.ge_s` are equivalent to a binary NOT operation applied to the results of `i64x2.gt_s` and `i64x2.lt_s` respectively.

Applications
Mapping to Common Instruction Sets
This section illustrates how the new WebAssembly instructions can be lowered on common instruction sets. However, these patterns are provided only for convenience; compliant WebAssembly implementations do not have to follow the same code generation patterns.
x86/x86-64 processors with AVX512F and AVX512VL instruction sets
- `y = i64x2.ge_s(a, b)` is lowered to:
  - `VPCMPGTQ xmm_y, xmm_b, xmm_a`
  - `VPTERNLOGQ xmm_y, xmm_y, xmm_y, 0x55`
- `y = i64x2.le_s(a, b)` is lowered to:
  - `VPCMPGTQ xmm_y, xmm_a, xmm_b`
  - `VPTERNLOGQ xmm_y, xmm_y, xmm_y, 0x55`
x86/x86-64 processors with XOP instruction set
- `y = i64x2.ge_s(a, b)` is lowered to `VPCOMGEQ xmm_y, xmm_a, xmm_b`
- `y = i64x2.le_s(a, b)` is lowered to `VPCOMLEQ xmm_y, xmm_a, xmm_b`
x86/x86-64 processors with AVX instruction set
- `y = i64x2.gt_s(a, b)` is lowered to `VPCMPGTQ xmm_y, xmm_a, xmm_b`
- `y = i64x2.lt_s(a, b)` is lowered to `VPCMPGTQ xmm_y, xmm_b, xmm_a`
- `y = i64x2.ge_s(a, b)` is lowered to:
  - `VPCMPGTQ xmm_y, xmm_b, xmm_a`
  - `VPXOR xmm_y, xmm_y, [wasm_i64x2_splat(-1)]`
- `y = i64x2.le_s(a, b)` is lowered to:
  - `VPCMPGTQ xmm_y, xmm_a, xmm_b`
  - `VPXOR xmm_y, xmm_y, [wasm_i64x2_splat(-1)]`
x86/x86-64 processors with SSE4.2 instruction set
- `y = i64x2.gt_s(a, b)` (`y` is not `b`) is lowered to `MOVDQA xmm_y, xmm_a` + `PCMPGTQ xmm_y, xmm_b`
- `y = i64x2.lt_s(a, b)` (`y` is not `a`) is lowered to `MOVDQA xmm_y, xmm_b` + `PCMPGTQ xmm_y, xmm_a`
- `y = i64x2.ge_s(a, b)` (`y` is not `a`) is lowered to:
  - `MOVDQA xmm_y, xmm_b`
  - `PCMPGTQ xmm_y, xmm_a`
  - `PXOR xmm_y, [wasm_i64x2_splat(-1)]`
- `y = i64x2.le_s(a, b)` (`y` is not `b`) is lowered to:
  - `MOVDQA xmm_y, xmm_a`
  - `PCMPGTQ xmm_y, xmm_b`
  - `PXOR xmm_y, [wasm_i64x2_splat(-1)]`
x86/x86-64 processors with SSE2 instruction set
Based on this answer by user aqrit on Stack Overflow
- `y = i64x2.gt_s(a, b)` (`y` is not `a` and `y` is not `b`) is lowered to:
  - `MOVDQA xmm_y, xmm_b`
  - `MOVDQA xmm_tmp, xmm_a`
  - `PSUBQ xmm_y, xmm_a`
  - `PCMPEQD xmm_tmp, xmm_b`
  - `PAND xmm_y, xmm_tmp`
  - `MOVDQA xmm_tmp, xmm_a`
  - `PCMPGTD xmm_tmp, xmm_b`
  - `POR xmm_y, xmm_tmp`
  - `PSHUFD xmm_y, xmm_y, 0xF5`
- `y = i64x2.lt_s(a, b)` (`y` is not `a` and `y` is not `b`) is lowered to:
  - `MOVDQA xmm_y, xmm_a`
  - `MOVDQA xmm_tmp, xmm_b`
  - `PSUBQ xmm_y, xmm_b`
  - `PCMPEQD xmm_tmp, xmm_a`
  - `PAND xmm_y, xmm_tmp`
  - `MOVDQA xmm_tmp, xmm_b`
  - `PCMPGTD xmm_tmp, xmm_a`
  - `POR xmm_y, xmm_tmp`
  - `PSHUFD xmm_y, xmm_y, 0xF5`
- `y = i64x2.ge_s(a, b)` (`y` is not `a` and `y` is not `b`) is lowered to:
  - `MOVDQA xmm_y, xmm_a`
  - `MOVDQA xmm_tmp, xmm_b`
  - `PSUBQ xmm_y, xmm_b`
  - `PCMPEQD xmm_tmp, xmm_a`
  - `PAND xmm_y, xmm_tmp`
  - `MOVDQA xmm_tmp, xmm_b`
  - `PCMPGTD xmm_tmp, xmm_a`
  - `POR xmm_y, xmm_tmp`
  - `PSHUFD xmm_y, xmm_y, 0xF5`
  - `PXOR xmm_y, [wasm_i64x2_splat(-1)]`
- `y = i64x2.le_s(a, b)` (`y` is not `a` and `y` is not `b`) is lowered to:
  - `MOVDQA xmm_y, xmm_b`
  - `MOVDQA xmm_tmp, xmm_a`
  - `PSUBQ xmm_y, xmm_a`
  - `PCMPEQD xmm_tmp, xmm_b`
  - `PAND xmm_y, xmm_tmp`
  - `MOVDQA xmm_tmp, xmm_a`
  - `PCMPGTD xmm_tmp, xmm_b`
  - `POR xmm_y, xmm_tmp`
  - `PSHUFD xmm_y, xmm_y, 0xF5`
  - `PXOR xmm_y, [wasm_i64x2_splat(-1)]`
ARM64 processors
- `y = i64x2.gt_s(a, b)` is lowered to `CMGT Vy.2D, Va.2D, Vb.2D`
- `y = i64x2.lt_s(a, b)` is lowered to `CMGT Vy.2D, Vb.2D, Va.2D`
- `y = i64x2.ge_s(a, b)` is lowered to `CMGE Vy.2D, Va.2D, Vb.2D`
- `y = i64x2.le_s(a, b)` is lowered to `CMGE Vy.2D, Vb.2D, Va.2D`
ARMv7 processors with NEON instruction set
Based on this answer by user aqrit on Stack Overflow
- `y = i64x2.gt_s(a, b)` is lowered to:
  - `VQSUB.S64 Qy, Qb, Qa`
  - `VSHR.S64 Qy, Qy, #63`
- `y = i64x2.lt_s(a, b)` is lowered to:
  - `VQSUB.S64 Qy, Qa, Qb`
  - `VSHR.S64 Qy, Qy, #63`
- `y = i64x2.ge_s(a, b)` is lowered to:
  - `VQSUB.S64 Qy, Qa, Qb`
  - `VSHR.S64 Qy, Qy, #63`
  - `VMVN Qy, Qy`
- `y = i64x2.le_s(a, b)` is lowered to:
  - `VQSUB.S64 Qy, Qb, Qa`
  - `VSHR.S64 Qy, Qy, #63`
  - `VMVN Qy, Qy`