Summarizing previous polls, criteria for including SIMD operations

Following up from the last CG meeting, the summarized set of entry requirements for SIMD instruction groups are as follows:

SIMD groups of instructions need to be identified, and justified (by data) separately
Performance wins should be positive across multiple relevant arches within an instruction group
Individual operations within an instruction group might be neutral/negative on some architectures, but the aggregate group should be a win
The strength of performance evidence required increases when an op is clearly negative on some arch.

Open questions:

Are there other items that are missing in this list?
If this list is sufficient, can this live as a separate Requirements.md file in this repository?

Separately, there were polls on the performance, and usefulness of Integer SIMD opcodes. The takeaway as I see it is that the Integer SIMD operations have been justified by data, and can be included as a part of the SIMD Spec text, is this accurate? If not, why not?

Below are the relevant snippets from previous meeting notes:

Results from CG Meeting 05/2017 (Meeting notes):

Poll: In order for SIMD to go forward as part of WebAssembly, distinct SIMD groups of instructions (justified by data) need to be identified and justified separately. Tentatively these categories may be similar to the groups on Jakob's slides. Instructions might be in multiple groups.

SF	F	N	A	SA
14	3	1	0	0

Poll: Performance wins should be positive across multiple relevant arches within an instruction group. Individual operations within an instruction group might be neutral / negative on some arches, but the aggregate group should be a win. The strength of performance evidence required increases when an op is clearly negative on some arch.

SF	F	N	A	SA
3	14	2	2	0

Results from CG Meeting 05/2017 (Meeting notes):

Does the WebP port demonstrate sufficient performance consistency to obviate concern around performance cliffs?

SF	F	N	A	SA
5	8	3	4	0

Does the WebP port demonstrate sufficient performance gain to justify adding integer SIMD opcodes (assuming there are no performance cliffs)?

SF	F	N	A	SA
9	7	3	1	0

Do the set of integer opcodes present in the portable simd port of WebP demonstrate a useful constellation of opcodes? It’s a subset of opcodes already.

SF	F	N	A	SA
9	5	0	2	0

What are the "multiple relevant arches" ?

FWIW, some SIMD operations aren't necessarily more performant than their scalar counter-parts, but are often necessary and faster than moving vectors to memory to perform scalar operations. For example, extracting a couple of lanes from a vector is often slower than just doing the same for an in memory array. However, these operations are necessary to build SIMD algorithms, where the whole algorithm performs better than the scalar version.

Pretty much the whole API exposed here (and some of the extensions in PRs) can be used from Rust already via the packed_simd crate, and it contains ~10 "real-world" examples that implement an algorithm each to solve one problem, both using SIMD and using scalar code (often also a third multi-threaded+simd implementation is also available).

These examples compile via LLVM to arm32, arm32+v7+NEON, arm64, arm64+ASIMD, (x86 | x86_64) + SSE | SSE2 | ... | AVX-512, ppc, ppc+ AltiVec, pp64le, pp64le+ VSX, mips, mips + MSA, Sparc64, SystemZ, wasm32, (wip) wasm32 + simd128 (whatever level of support LLVM trunk has for this)...

One could take a look at which portable SIMD operations are used by these examples, which algorithms they enable, and how much better do the vectorized algorithms perform in some "relevant architectures".

In particular, that the library compiles to wasm32 as is, could be an advantage to use these examples to make sure that the wasm implementations perform properly.

What are the "multiple relevant arches" ?

FWIW, some SIMD operations aren't necessarily more performant than their scalar counter-parts, but are often necessary and faster than moving vectors to memory to perform scalar operations. For example, extracting a couple of lanes from a vector is often slower than just doing the same for an in memory array. However, these operations are necessary to build SIMD algorithms, where the whole algorithm performs better than the scalar version.

IIRC, the context for "multiple relevant arches" is that the performance benefits should be positive across architectures, i.e. if a set of operations is fast on IA32/X64, but causes a performance degradation on ARM/ARM32 then there would need to be a stronger justification for including such a group of operations.

Agreed that some SIMD operations aren't necessarily performant on their own, but that is what the second poll tries to address, that individual operations that are not necessarily performant can be included as long as the aggregate of the instruction group they are included in is a net positive.

Pretty much the whole API exposed here (and some of the extensions in PRs) can be used from Rust already via the packed_simd crate, and it contains ~10 "real-world" examples that implement an algorithm each to solve one problem, both using SIMD and using scalar code (often also a third multi-threaded+simd implementation is also available).

These examples compile via LLVM to arm32, arm32+v7+NEON, arm64, arm64+ASIMD, (x86 | x86_64) + SSE | SSE2 | ... | AVX-512, ppc, ppc+ AltiVec, pp64le, pp64le+ VSX, mips, mips + MSA, Sparc64, SystemZ, wasm32, (wip) wasm32 + simd128 (whatever level of support LLVM trunk has for this)...

One could take a look at which portable SIMD operations are used by these examples, which algorithms they enable, and how much better do the vectorized algorithms perform in some "relevant architectures".

In particular, that the library compiles to wasm32 as is, could be an advantage to use these examples to make sure that the wasm implementations perform properly.

IIRC, the context for "multiple relevant arches" is that the performance benefits should be positive across architectures, i.e. if a set of operations is fast on IA32/X64, but causes a performance degradation on ARM/ARM32 then there would need to be a stronger justification for including such a group of operations.

I find calling this a "performance degradation" a bit strange.

If WASM adds an operation that on x64 lowers to a native instruction, but arm64 lacks an instruction for that, then, on arm64, the machine code generator will need to lower it to a series of scalar instructions.

If we don't add the operation, then on both x64 and arm64 the scalar instructions will be present in the WASM binary. In the worst case, for arm64, there should not be any performance difference because the scalar code lowers to the same machine code. However, even if arm64 does not have an instruction for this, a machine code generator could generate better code by lowering a single instruction "optimally" to scalar code, than by just lowering scalar code directly, if it lacks optimizations to pattern match what the code is trying to do.

So what I think it is meant here by "performance degradation" is not that performance can actually degrade with respect to the scalar code, because AFAICT, it can't.

What we mean is that, if choosing the instruction instead of the scalar code on x64 gives the program an Nx speedup, then doing the same on all other archs should give the scalar code a similar speedup (instead of only a slight speed up, or no speed up).

What we mean is that, if choosing the instruction instead of the scalar code on x64 gives the program an Nx speedup, then doing the same on all other archs should give the scalar code a similar speedup (instead of only a slight speed up, or no speed up).

FWIW I think that this is super tricky to show on an instruction for instruction basis.

For example, I was replying to @tlively here https://github.com/WebAssembly/simd/pull/27, and I came up with a small godbolt snippet that showed the performance of the min/max lane-wise integer ops: https://gcc.godbolt.org/z/jtMusP

I had two synthetic benchmarks in mind:

write the element-wise min/max of two arrays to a third one, or
compute the min/max element in an array,

and decided to go for the first one. LLVM MCA suggest that for i32x4 the vectorized code uses 143 cycles, while the scalar code uses 216 cycles. This shows "an speedup". Will it be the same on ARM/MIPS/PowerPC/RISCV/...WASM? Probably not. Will it be the same in practice, when one actuall runs the synthetic benchmark? Definitely not, the synthetic benchmark is memory bound, even if one chooses I8x16 it won't be easy to measure.

We are trying to analyse the intrinsics we add on at least three axes: usefulness, architecture support, and performance.

Trying to evaluate performance at the level I showed above is meaningless, particularly if we consider architecture support to be an orthogonal evaluation axis. If an architecture natively supports an intrinsic, chances are that this is because it can do more in less time. So IMO "architecture support" should already justify this part of the performance axis.

We could arguably evaluate performance in how the intrinsic makes more complex algorithms faster. But then we need to find a more complex algorithm, which will use many other intrinsics, and quantify the effect of replacing one with scalar code. Chances are that the difference will be small, because it is the sum of all intrinsics that make the algorithm faster, and dropping a single one doesn't make it particularly slower. The "usefulness" evaluation axis is better at quantifying this, that trying to do any performance studies.

The "usefulness" and "architecture support" evaluation criteria are objective and orthogonal to each other, and we should trust that if vendors provide silicon to perform a particular operation, it's because that's worth its cost.

"Performance" is to vague to be useful, and probably linearly dependent to both of them. Even if an important architecture does not have silicon for an operation, "performance degradation" is not a useful criteria to use there. On that architecture, because there is no silicon for the operation, the scalar code and the code using the WASM intrinsic should be equally performant on a sufficiently good WASM compiler. In practice the WASM using the intrinsic might be faster, but that only makes the criterium even less useful.

The axes of evaluating operations - usefulness, architecture support, and performance I agree with at a high level. But there are a couple of things that are being conflated, and I want to separate them out the way I see them.

Performance evaluation against scalar

When evaluating which operations can be added, having a clear performance win over scalar is a good indicator. As long as this can clearly be demonstrated at least on one architecture, and the same set of operations does not impact performance negatively compared to scalar then IMO this is still a good set of operations.

Performance Cliffs

SIMD instructions have a large surface areas, and some have documented performance cliffs, and this is something to be careful about. So if a set of instructions introduces a non-deterministic performance cliff then this is a good reason to push back on introducing that as a part of this proposal.

For a performance feature, I don't think evaluating on performance is too vague to be useful. I'm not sure I understand what you are trying to get at with the last bit, so if this doesn't sufficiently address your concerns, I'm happy to take another stab at addressing them.

Also, mentioning here for completeness that the set of criteria I started with were voted on by the CG members, and if you do have other criteria that are missing, and think should be added, this is something that can brought to a poll at an online or in-person CG meeting.

As discussed today: when incorporating this text into the spec, can you please make it clear the degree to which this is a guideline, to make it clear that (for instance) a substantial practical speedup (for real-world use cases) on some subset architectures that's neutral on others is acceptable? That understanding seems to vary between people, and hasn't been documented explicitly, only carried via discussion. I'd like to avoid the impression some have that justifying an operation requires comparable speedups on all architectures.

@dtig

Performance evaluation against scalar

As long as this can clearly be demonstrated at least on one architecture, and the same set of operations does not impact performance negatively compared to scalar then IMO this is still a good set of operations.

I understand this as follows. We can consider adding a new SIMD operation if:

on at least one architecture, the SIMD operation is more efficient than a scalar polyfill,
on all other architectures, the SIMD operation (polyfilled or not) is not slower than the scalar polyfill.

That is, if you have some scalar code that can be replaced by a single WASM SIMD operation, this code is not allowed to perform worse on any architecture, but it is allowed to not perform any faster.

Is this what you meant?

Performance Cliffs

From my understanding of the above, I still do not know what you mean by a "Performance Cliff", a cliff with respect to what ? If we have a SIMD operation, either an architecture supports it natively, in which case it should perform faster than a scalar polyfill, or the architecture does not support it natively, in which case it gets implemented as a scalar polyfill. Either way, performance can't be worse than that of the scalar polyfill, so at least under these criteria, "performance cliffs" are impossible. For a performance cliff to exist, an architecture would need to support a SIMD operation natively, the native instruction has to be slower than the scalar polyfill, and the WASM machine code generator needs to pick up the slower native intrinsic over the scalar polyfill (this could be an interesting trade-off when optimizing for code-size vs performance).

WebAssembly / simd

Summarizing previous polls, criteria for including SIMD operations #37

Performance evaluation against scalar