Remove integer SIMD not-equals instructions

Maratyszcza commented 4 years ago

Currently WebAssembly SIMD specification includes integer Not Equals instructions. These instructions are problematic, as neither x86 SSE4.1 nor ARM NEON include equivalents of these instructions. In practice, software developers almost always can replace expressions using these Not Equals instructions into equivalent expressions using Equals instructions, and the latter would perform better due to direct mapping on native SIMD instruction sets. In case where Not Equals predicate is desired, it can be emulated as e.g. i32x4.not(i32x4.eq(a, b)) and would result in the same native code generation as the i32x4.ne(a, b).

penzn commented 4 years ago

What are the benefits of this? I don't see that two ops replacement would be faster than current single operation. In addition to that a single operation reduces code size and is also closer to what Wasm has (i32.ne) outside of SIMD.

Maratyszcza commented 4 years ago

The benefit is to motivate developers (both software developers coding in intrinsics and compiler developers) to use eq instructions rather than ne instructions. eq instructions map to a single instruction on both x86 and ARM64, and ne instructions map to multiple instructions (literally the equivalent of eq + v128.not) on both, so removing ne instructions motivates more efficient instruction selection.

Note that unlike SIMD, in scalar code i32.ne and i32.eq are equally efficient, so scalar codes are not susceptible to this problem.

penzn commented 4 years ago

The benefit is to motivate developers (both software developers coding in intrinsics and compiler developers) to use eq instructions rather than ne instructions. eq instructions map to a single instruction on both x86 and ARM64, and ne instructions map to multiple instructions (literally the equivalent of eq + v128.not) on both, so removing ne instructions motivates more efficient instruction selection.

Isn't the resulting machine instruction sequence the same for, let's say, i32x4.not(i32x4.eq(a, b)) and i32x4.ne(a, b)?

Maratyszcza commented 4 years ago

Isn't the resulting machine instruction sequence the same for i32x4.not(i32x4.eq(a, b)) and i32x4.ne(a, b)?

It is. My point is:

Most expressions using i32x4.ne(a, b) can be rewritten with i32x4.eq(a, b)
After such rewrite the expression would be more efficient on both ARM and x86.
Thus, we should remove i32x4.ne(a, b) to motivate such rewrites. In the rare cases where users can't/don't want to rewrite the expressions, they can use v128.not(i32x4.eq(a, b)).

tlively commented 4 years ago

It sounds like the rewrites your thinking of, @Maratyszcza, are larger rewrites of the surrounding algorithms than just the simple rewrite of replacing i32x4.ne with v128.not; i32x4.eq. Do you have an example to illustrate this?

jan-wassenberg commented 3 years ago

I agree with the general principle of encouraging apps to use only what's efficient. An example would be changing the algorithm from bits = movemask(a != b); bits == 0 ? to bits = movemask(a == b); bits != 0 ?. Apps can often propagate the negation and avoid the extra not.

Maratyszcza commented 3 years ago

Here are two more examples:

v128.bitselect(a, b, i32x4.ne(x, y)) can be rewritten as v128.bitselect(b, a, i32x4.eq(x, y))
v128.and(a, i32.ne(x, y)) can be rewritten as v128.andnot(a, i32.eq(x, y))

In both cases a rewrite completely removes the SIMD Not operation from generated code

tlively commented 3 years ago

Ok, this seems reasonable to me. If we didn't already have the i32x4.ne instruction, it certainly wouldn't meet the bar for inclusion, and it seems silly to keep it in just because it's the status quo.

Note that removing it would be a breaking change for some OT users, but we explicitly need to not consider that as a downside in making a decision.

Maratyszcza commented 3 years ago

The first step would be to stop generating these instructions in LLVM, and it is not blocked by anything.

tlively commented 3 years ago

I'd like to hear what @ngzhian and @binji think of this before making any changes to LLVM.

ngzhian commented 3 years ago

I don't see a strong benefit in removing:

in terms of not encouraging users to use these instructions, SIMD users understand hardware, and usually delve into underlying implementation (in this case V8), and will actively avoid slow operations, the presence of these not-equal instructions is not a stumbling block
some of the instructions SIMD has is meant to be symmetrical to Wasm scalar instructions, and I think not-equal is one of them
we already have instructions that don't map to single operations on supported archs, so having this around is not a big issue
we can maybe rely on toolchain to perform some of the translations suggested by Marat to not use not-equals

(However I am not a SIMD writer, so my opinions should be listened to with much lower priority than the other experts on this thread.)

That said, I also agree with Thomas that, if we did not have this instruction, and we were looking at this individually, we might not have added it, given the bar (post opcode renumbering when we slowed down additions).

So in summary, is not-equals a big footgun? IMO, no. Is it painful enough that we should remove it? IMO, no as well. But if we do remove it I will not miss it particularly. I am also interested in how much friction it will cause existing users of SIMD:

how many will need to recompile?
users of SIMD intrinsics, how much rewrite as to happen? (though less of a concern with SIMD being experimental, any unnecessary churn should be avoided.)

lemaitre commented 3 years ago

Here is my take on the i32x4.ne instruction. The bit negation is ultra fast and will almost never change the actual performance of a code, especially considering that comparisons usually appear much less often than arithmetic. If a user end up being in those really rare situation where it matters, they will be clever enough to do the change on their own.

Also, removing it will incentive people to come up with workarounds that might just be wrong. This is for example the case with @jan-wassenberg snippet

from bits = movemask(a != b); bits == 0 to bits = movemask(a == b); bits != 0

The actual test should be bits = movemask(a == b); bits == 0b1111 for it to be equivalent to the the original test.

And finally, removing i32x4.ne punches a hole inside the consistency of the SIMD instruction set. For all these reasons, I think it would be better to keep it.

Maratyszcza commented 3 years ago

IMO it is far from evident that there's such a difference in performance between ne and eq instructions, and most SIMD developers would expect them to perform similarly (I, for one, did expect equivalent performance until I looked into native ISA manuals). The situation is exacerbated by x86 not having a SIMD Not instruction. As a result, ne WAsm SIMD instructions on x86/x86-64 lower to 3 native instructions and clobber a temporary register.

With WAsm SIMD being a performance feature, it should discourage such inefficient solutions.

jan-wassenberg commented 3 years ago

Agreed.

@lemaitre thanks for pointing out the logic error. I think providing both == and != actually makes such errors more likely, because it incentivizes such transforms. Someone may realize (only after looking deep into the manuals or generated code) that != is best avoided, and then change the code at a late stage. If != were not provided at all, written-from-scratch SIMD code would not encounter this problem.

Maratyszcza commented 3 years ago

Stumbled upon a previous discussion of not-equals by @zeux here:

I've ran into this in context of LLVM strength-reducing one to another: it will replace i32x4.gt(value, 0) with i32x4.ne(value, 0) if it knows value is non-negative, which has a slight penalty on the codegen.

zeux commented 3 years ago

Ok, this seems reasonable to me. If we didn't already have the i32x4.ne instruction, it certainly wouldn't meet the bar for inclusion, and it seems silly to keep it in just because it's the status quo.

I'm wondering if this is true.

Multiple times throughout the development of the SIMD extension we've discussed the tradeoffs - when there's an instruction that logically belongs to the ISA, but has suboptimal lowering on current architectures, should we still include it?

While I disagree that the penalty for emitting "not" is insignificant (it can have a noticeable perf impact on real workloads; of course "noticeable" means, say, 5% or less, but when you use SIMD I'd expect some users like myself to want to tune the code to get everything out of the hardware), I think the following considerations apply:

Trying to synthesize "not equals" from "equals" as a user isn't going to help anybody here
This instruction is a natural part of the instruction set; it doesn't make a lot of sense to provide == without !=
Removing this breaks uniform treatment of types for basic arithmetics, and probably will force clang to synthesize not-equals anyhow when for example using the vector extensions for C/C++.
During auto-vectorization, compiler will have to synthesize not+eq when vectorizing some loops, and then rely on optimizations that propagate not through the expression to try to get rid of it

So my general observation is that in this proposal we have consistently said the same thing:

If an operation is fundamental to the type and you'd naturally expect this to be available in a general purpose instruction set, we will include this in the proposal even if lowering is suboptimal.

Case in point: f32->i32 conversions, byte shifts, probably others (there should be a host of issues in this repository by now).

In my view we should make sure LLVM doesn't strength-reduce > 0 to != 0 and in general has a cost model where != is slower than all other comparison ops, but it doesn't have to mean the operation is omitted entirely, and isn't a spec change anyway.

ngzhian commented 3 years ago

I think the confusion from missing NE might be greater than benefits we get from leaving it out. Many developers won't know (toolchain hides it), but we also have many authors using intrinsics. A missing NE could take more effort to explain why we left it out, compared to putting it there, given that emulating NE costs about the same as NE codegen itself.

Regarding education, we can have documentation beside the intrinsics header file that notes the potential performance problem with NE instructions. This is probably the most visible place for intrinsic users.

tlively commented 3 years ago

I agree that the principle of least surprise more strongly supports keeping it rather than taking it out. Keeping it reduces surprise among those who read the docs, which is hopefully a large group of people in the long run.Taking it out reduces surprise among those who use an poorly tuned compiler or hand-write assembly without reading the docs, which is hopefully small group of people in the long run.

jlb6740 commented 3 years ago

Quick comment in support of a principle of design here that motivates removing this (particularly responding to the earlier comments). The number one goal I think for WASM SIMD and all of WASM is near native performance and the best way to achieve that is by mapping directly to what the hardware gives you. In this case neither x86 nor Arm supports not equal directly which means a less efficient lowering as has been discussed. Symmetry is clearly desired and encouraged and an important goal for the spec but performance is the ultimate goal. In this case the hardware isn't symmetric (neither x86 nor arm) and the hardware is the lead to follow and the blueprint to map to. Obviously the points from @ngzhian, @tlively, etc for keeping (not(not removing)) are more than fair, with the impact of already being implemented/shipped being the biggest point that resonates. However, the principle @Maratyszcza express for removing here is spot on.

ngzhian commented 3 years ago

In this case neither x86 nor Arm supports not equal directly which means a less efficient lowering as has been discussed.

Should we look at other instructions where codegen on both architectures are poor? There are a number of problematic instructions (https://github.com/zeux/wasm-simd/blob/master/Instructions.md for an overview, and https://github.com/WebAssembly/simd/issues/created_by/abrown for a specifc instructions, the titles all say x64, some of them are similarly poor for arm, such as all_true, any_true.)

jlb6740 commented 3 years ago

In this case neither x86 nor Arm supports not equal directly which means a less efficient lowering as has been discussed.

Should we look at other instructions where codegen on both architectures are poor? There are a number of problematic instructions (https://github.com/zeux/wasm-simd/blob/master/Instructions.md for an overview, and https://github.com/WebAssembly/simd/issues/created_by/abrown for a specifc instructions, the titles all say x64, some of them are similarly poor for arm, such as all_true, any_true.)

@ngzhian It's a good question. I think the reasoning for removing not-equals is a sound one that should be applied generally. I think the hardware provides the blueprint for the instructions desired and when all hardware is saying the same thing, that blueprint is even more solid. If getting it right is important and particular when the instructions put in will be legacy I do think its fair to consider revisiting and potentially holding back any instruction that could lead to toolchains and app developers baking in inefficient translations to hardware. This particular instruction of not-equals doesn't sound too controversial to consider or even too painful to hold off on, but there is an argument both for removing and against and so there are of course others instructions that would be even more controversial to consider. Now finding those candidate instructions to review is not too hard as the lists you point out from @zeux and @abrown contain them but I think the question is, is there significant appetite to revisit. Personally I do support doing that pass but understand that may not necessarily be a shared opinion.

arunetm commented 3 years ago

IMO there is value in allowing the developers & toolchains to see lowering inefficiencies so that spec itself plays a role in helping to avoid generating inefficient Wasm code in the first place. Documenting the cases might not be sufficient as it opens up proper maintenance concerns and there are no guarantees that developers and tool writers will always use them appropriately.

I think the primary expectation of a user of Wasm SIMD is performance, convenience, and symmetry expectations are secondary and only 'nice to haves'. I would assume that an average simd developer must be familiar with few of these inconveniences that exist in the native world as well. Taking a dependency on the toolchains and runtimes to optimize or workaround inefficiencies will be a slippery slope as the developers will be unpleasantly surprised if they happen to use any sub-optimial toolchains and runtimes (beyond emscripten and V8) in the future.

I am in favor of this proposed change by @Maratyszcza. Agree with @jib6740 that hardware support is the best indicator of an operation being best suited for fulfilling the developer expectation of performance from Wasm SIMD. For operations that are highly useful with less than ideal hardware support, they are great candidates to drive the future evolution of hardware support and can be included.

As we have seen, this specific scenario of NE is not too severe compared to some of the other open issues. There is value in taking a closer look at the hardware support and real-world use cases requirement criteria and consistency to apply them across the known open Issues/proposals we have now.

zeux commented 3 years ago

Here's a list of instructions that aren't supported well natively by either ARM64 or x64, based on v8 implementation:

v8x16.shuffle i64x2.mul i32x4.ne, i16x8.ne, i8x16.ne (this PR) All integer shifts with a non-constant right hand side (due to specified behavior mismatch wrt arch details) All any_true or all_true instructions

Here's a list of instructions that have lowering on Intel architectures (pre-AVX512, haven't analyzed if that changes any of these) that is expensive to the point of being problematic:

All unsigned integer comparisons 8-byte shifts f32x4.min/max i32x4.trunc_sat_f32x4_s i32x4.trunc_sat_f32x4_u

If we consistently apply the hardware support principle, we should remove all of the above. I'm not sure I understand why removing not-equals specifically and keeping all of the above makes sense. The impact of a single bad shuffle dwarfs that of a not-equal comparison by a fair margin.

arunetm commented 3 years ago

Here's a list of instructions that aren't supported well natively by either ARM64 or x64, based on v8 implementation:

v8x16.shuffle i64x2.mul i32x4.ne, i16x8.ne, i8x16.ne (this PR) All integer shifts with a non-constant right hand side (due to specified behavior mismatch wrt arch details) All any_true or all_true instructions

Here's a list of instructions that have lowering on Intel architectures (pre-AVX512, haven't analyzed if that changes any of these) that is expensive to the point of being problematic:

All unsigned integer comparisons 8-byte shifts f32x4.min/max i32x4.trunc_sat_f32x4_s i32x4.trunc_sat_f32x4_u

I think there is value in assessing these on a case-by-case basis. An instruction with popular real world use-cases is a good candidate for future hw support considerations and has a better chance of becoming efficient as hw evolves. Instructions with no real-world uses will remain perf disadvantaged as there are not many incentives to support them and I suggest removing them from the MVP. It's not feasible to remove an op after being standardized, but they can always be reconsidered as future extensions.

I'm not sure I understand why removing not-equals specifically and keeping all of the above makes sense. The impact of a single bad shuffle dwarfs that of a not-equal comparison by a fair margin.

Agree. We already have unresolved issues for almost all of these hw inefficient instructions. If we can establish a consensus on the criteria clarifications including the hw support requirements, it will help to resolve those issues as well. We have the 'Inclusion criteria' topic on agenda for the next sync https://github.com/WebAssembly/simd/issues/203

ngzhian commented 3 years ago

Working on top of Arseny's comment, I collated the "problematic" instructions as of V8 HEAD today, with their codegen instruction counts on x64 and ARM64. (Note that the instruction counts shown are worse case, so in the case of shifts, we assume non-immediate shifts, so we need 4 instructions rather than 1.)

Some highlights:

in the category of "> 1 instruction on both arch", then there are 55 instructions to examine
opcodes with more than > 5 instructions on either arch (13)

instruction	x64	arm64
i8x16.shr_s	9	4
i8x16.shr_u	9	4
i64x2.shr_s	6	4
i64x2.mul	10	7
f32x4.min	8	1
f32x4.max	9	1
f64x2.min	8	1
f64x2.max	9	1
i32x4.trunc_sat_f32x4_s	8	1
i32x4.trunc_sat_f32x4_u	14	1
f32x4.convert_i32x4_u	8	1
i8x16.bitmask	1	7

opcodes with > 3 instructions on either arch (exclude those with > 5 since those are in the table above) (26)

instruction	x64	arm64
i8x16.gt_u	4	1
i16x8.lt_u	4	1
i16x8.gt_u	4	1
i32x4.lt_u	4	1
i32x4.gt_u	4	1
i16x8.shl	4	3
i16x8.shr_s	4	3
i16x8.shr_u	4	3
i32x4.shl	4	3
i32x4.shr_s	4	3
i32x4.shr_u	4	3
i64x2.shl	4	3
i64x2.shr_u	4	3
i8x16.any_true	3	4
i16x8.any_true	3	4
i32x4.any_true	3	4
i8x16.all_true	5	4
i16x8.all_true	5	4
i32x4.all_true	5	4
v8x16.swizzle	4	1
i64x2.bitmask	1	4
i32x4.bitmask	1	5
i16x8.bitmask	3	5
i8x16.bitmask	1	7

If we were to tackle things in terms of severity:

the first thing we would probably want to do is remove i64x2.mul, it is horrible on both.
shifts are tricky, if we only allow immediates, we get better codegen, 1 in most cases, but I don't think we can. I did a grep on some local codebases that use SSE intrinsics: for shift left I could find only 1 usage of non-immediate shift, _mm_sll_epi32, all other usages were _mm_slli_*, but for shift right, both immediate and non-immediate shifts are heavily used.
any_true/all_true. I'm not sure if there is a simple way to check for existing usages, and am not sure if they have good workarounds if we remove them. It is possible that with #201 and also #124 we can.
unsigned integer comparison, which is lacking many unsigned comparisons until AVX-512 (i think).
float-int conversions, bad on x64.
others, like swizzle, 4 on x64, not the worst.

I haven't looked too closely at the outstanding PRs that suggest new instructions, I suppose we should have consensus on requirements and expectations before we look at them with fresh lenses. A couple that stand out are: #379 (>8 on x64 until AVX512), #350 (>1 on both arch for 3 out of 8 suggestion instructions), #247 (>7 on x64).

There is an outstanding issue that V8 cannot put 128-bit constants into memory to make use of rip-relative loads, and so must always materialize masks using SIMD instructions. I eyeballed cases where we are doing that, and some of those instructions (integer negation, float abs an neg) don't make it to the top of "worst offender" list. So I think this discussion can ignore whether this particular V8 limitation.

I probably missed something, lmk so I can update the sheet. I hope this is useful data to guide our future discussions.

[0] load_splat might be surprising, on ARM, the addressing mode is not able to handle what memarg immediate + i32 offset that Wasm has for memory loads, so we need an additional add. [1] All the comparisons here can be improved to 3 instructions if V8 gains the ability to load -1 constant from memory. [2] The suggestion of pmin/pmax #122 is meant to workaround this asymmetry. [3] Omitted from this list are shifts with non-immediate shift values. Those make the codegen instantly bad, from 1 to 4 (x64), and 1 to 3 (arm64), but i8x16 shifts are the worst of this bunch.

Maratyszcza commented 3 years ago

Thank you for the detailed analysis, @ngzhian. However, I think the number of instructions generated in V8 doesn't tell the whole story.

First, we need to consider how expensive it would be to emulate these operations with other WAsm SIMD instructions if we take the corresponding instruction out of the spec. E.g. wasm_i8x16_gt_u(a, b) would have to be emulated via wasm_i8x16_gt_s(a - 0x80, b - 0x80), resulting in the same 4 instructions it takes now on x86 SSE4, but now on all architectures. It is as if we penalized ARM and recent x86 CPUs with AVX512 for not having this instruction in SSE4.

Secondly, due to limitation on loading 128-bit constants in V8, it often prefers longer instruction sequences to avoid loading constants. As far as I understand, this limitation is likely to be removed in the future as WAsm SIMD support in V8 gets more mature. Thus, we should avoid making decisions grounded in temporary limitations of a single WAsm engine. I think a better way to compare instruction complexity is through unrestricted assembly implementation like I do in my PRs.

IMO instructions in WebAssembly SIMD specification should be strict Pareto-improvements over their polyfill on the curve of (ARM64, x86-64 SSE4) performance, but it doesn't mean that they necessarily lower to a single native instruction on either architecture. all_true/any_true instructions are good examples: the lower to multiple native instructions on both ARM and x86, but these sequences are still better then their emulation with other WAsm SIMD instructions in lieu of all_true/any_true. Removal of integer not-equals instruction, suggested in this PR, satisfy this requirement: without i32x4.ne, developers would use v128.not(i32x4.eq(a, b)), which performs exactly the same, thus addition of i32x4.ne to the spec in NOT a Pareto-improvement.

arunetm commented 3 years ago

First, we need to consider how expensive it would be to emulate these operations with other WAsm SIMD instructions if we take the corresponding instruction out of the spec. E.g. wasm_i8x16_gt_u(a, b) would have to be emulated via wasm_i8x16_gt_s(a - 0x80, b - 0x80), resulting in the same 4 instructions it takes now on x86 SSE4, but now on all architectures. It is as if we penalized ARM and recent x86 CPUs with AVX512 for not having this instruction in SSE4.

I agree with this argument but only when the difference in instruction count is below a threshold that raises arch-specific perf cliffs (a threshold we can see group consensus on). I disagree that the spirit of avoiding instructions without good (2+) hw support is penalizing few architectures. On the contrary, it helps developers to avoid inefficient code that will run faster only on certain devices.

If an op emulation takes a significantly higher number of instructions on one platform vs. other and if used in a perf critical control path, it could easily tank perf on certain devices and surprise developers with loss of perf or even regressions by using Wasm SIMD. This runs the risk of affecting the usability of a performance-oriented feature like simd. These operations are safely included in the spec with some sort of feature detection providing developers a way to get around such perf cliffs and will be better candidates for post mvp additions. This re-emphasizes the usefulness for conditional sections and feature detection proposals.

AVX512 availability is very limited and there is a high likelihood that a significant % of platforms running Wasm simd code won't be able to use them in the future too. So support in AVX512 is not a good indicator of architecture support.

lemaitre commented 3 years ago

I agree with this argument but only when the difference in instruction count is below a threshold that raises arch-specific perf cliffs (a threshold we can see group consensus on). I disagree that the spirit of avoiding instructions without good (2+) hw support is penalizing few architectures. On the contrary, it helps developers to avoid inefficient code that will run faster only on certain devices.

I think "perf cliff" is irrelevant and to some extent unavoidable. For instance SQRTPD has a throughput⁻¹ of 32 cycles on Nehalem (SSE4), where it is 6 cycles on Skylake X. That is more than a factor 5 between the two, much more than any other "discrepancies" highlighted in this thread/repo. Different instructions have been optimized differently between architectures and there is nothing we can do about it.

To me, the only interested metric is: is it beneficial on at least one common architecture while not being detrimental on all the others? This quote from @Maratyszcza sums up this idea fairly well:

instructions in WebAssembly SIMD specification should be strict Pareto-improvements over their polyfill on the curve of (ARM64, x86-64 SSE4) performance

AVX512 availability is very limited and there is a high likelihood that a significant % of platforms running Wasm simd code won't be able to use them in the future too. So support in AVX512 is not a good indicator of architecture support.

Many middle-end Intel laptops recently released have AVX512 support, so it will be a significant proportion in the future.

Also, I often see comments about SSE4, but according to the number of AVX2 machines in the wild, I think WASM should be optimized for AVX2 and just be compatible with SSE4 (ie: not optimized for). It should also be optimized for AARCH64, of course.

akirilov-arm commented 3 years ago

* the first thing we would probably want to do is remove i64x2.mul, it is horrible on both.

Just a quick note about AArch64 - since people are considering different SIMD ISA extensions on x86 (e.g. AVX-512), it makes sense to do the same for AArch64. i64x2.mul is implementable with a single SVE2 instruction or 2 or 3 SVE instructions, depending on whether a move can be elided.

SVE(2) helps with the implementation of some of the other operations, in particular bitmask extraction; not NE comparisons, however.

Maratyszcza commented 3 years ago

Closing as it was decided in the 10/16/2020 meeting to keep the SIMD Not Equals instructions for consistency with scalar instructions.

WebAssembly / simd

Remove integer SIMD not-equals instructions #351