Codegen of i16x8.relaxed_laneselect

alexcrichton commented 1 year ago

Currently in https://github.com/WebAssembly/relaxed-simd/issues/17 it's suggested that the x86_64 lowering of the i16x8.relaxed_laneselect instruction is pblendvb, and this appears to be what v8 does today. In https://github.com/WebAssembly/relaxed-simd/pull/115 (plus the current Overview.md), however, the english prose for the definition of this instruction is:

Relaxed lane selection is deterministic when all bits are set or unset in the mask. Otherwise depending on the host, either only the top bit is examined, or all bits are examined (i.e. it becomes a bit select).

I don't believe, though, that the pblendvb instruction correctly implements these semantics because lane selection mask 0x0080 that's neither 0x0000 or 0xffff and the high bit is zero, meaning that according to the spec the output should be the element in the b vector. The pblendvb instruction works at the byte-level, though, so one byte will be chosen from the a vector and one will be chosen from the b vector.

I think that this is also an issue with v8's lowering of the i{32x4,64x2}.relaxed_laneselect since they all go through pblendvb right now, although the suggestion in https://github.com/WebAssembly/relaxed-simd/issues/17 I think would work with blendvp{s,d} instead.

Maratyszcza commented 1 year ago

This is a good point! The intention is to allow i16x8.relaxed_laneselect to lower to pblendvb.

alexcrichton commented 1 year ago

I'm not intimately familiar with the motivations, but if the intention is to basically give users a choice of bsl on AArch64 and pblendvb on x86_64, then would it make sense to remove the i16x8, i32x4, and i64x2 versions? The current Overview.md I think is correct for the i8x16 case when pblendvb is used as an implementation and if the other types use the same lowering on v8 then the other types may not be necessary?

dtig commented 1 year ago

This maybe a bug in the baseline compiler, thanks for flagging. Here's how the optimizing compiler handles it.

alexcrichton commented 1 year ago

Oh! Sorry I naively assumed they'd be similar, should have checked both!

For your link @dtig though if I'm reading that right it's only different in using blendvp{s,d} for i{32x4,64x2}.relaxed_laneselect, but for i16x8 it's still using pblendvb which suffers from this issue?

dtig commented 1 year ago

Oh! Sorry I naively assumed they'd be similar, should have checked both!

This is the right assumption, and should be true in most cases except for where we emit extra instructions because we can't specify register constraints in the baseline compiler. The basic functionality or the variants of the instructions should be the same. Hopefully there aren't many other cases where they don't match like they should. :)

For your link @dtig though if I'm reading that right it's only different in using blendvp{s,d} for i{32x4,64x2}.relaxed_laneselect, but for i16x8 it's still using pblendvb which suffers from this issue?

You're correct that I was only responding to the i{32x4,64x2}.relaxed_laneselect bit. Working through the example in the OP, for the case where all the bits are set (deterministic output), everything works as expected, but for the case you outlined, pblendvb wouldn't work as detailed in the prose spec. It would be simple to massage the mask to only examine the top lane bit for the i16x8 case to adhere to the spec. But from @Maratyszcza message here, maybe the prose may not be indicative of the intent? My interpretation is that the result is only guaranteed for 0x0000 or 0xffff, and for no other values of the mask. I'm fine with either formulation, but we should pick one for consistency.

ngzhian commented 1 year ago

Think this will get pretty ugly: we will have to special case i16x8, and have it check specifically for 0x0080 to 0x00FF, and return a mix of the top and bottom half. For reference, the spec for relaxed lane select is here: https://www.ngzhian.com/relaxed-simd/core/exec/numerics.html#xref-exec-numerics-op-relaxed-lane-mathrm-relaxed-lane-n-i-1-i-2-i-3

Maratyszcza commented 1 year ago

We could change the text as follows: "Relaxed lane selection is deterministic when all bits are set or unset in the mask. Otherwise depending on the host, either only the top bit of a lane is examined, only the top bit of each byte is examined, or all bits are examined (i.e. it becomes a bit select)."

This change would allow using PBLENDVB for 16-bit laneselect (also allow using PBLENDVB for other laneselects, but that option is not interesting).

alexcrichton commented 1 year ago

Would it be possibility to drop the i16x8.relaxed_laneselect intrinsic entirely? If the goal is to get the generated instruction on x86_64 to be pblendvb then that's already achievable with i8x16.relaxed_laneselect. If the spec wording is updated to allow i16x8.relaxed_laneselect to have this implementation then there's no need for the instruction itself as other instructions already generate equivalent lowerings.

ngzhian commented 1 year ago

I'm not opposed to removing i16x8.relaxed_laneselect, but it makes the instruction set a bit asymmetrical. Having to explain why we don't have that instruction is also weird:

"Why don't we have i16x8.relaxed_laneselect?"
we want it to be pblendvb, and we already have a i8x16.relaxed_laneselect
"isn't pblendvb a valid implementation of i32x4.relaxed_laneselect then? why do we have i32x4.relaxed_laneselect?"
because i32x4.relaxed_laneselect is implemented as blendvps

So maybe we want to change the wording, but keep all instructions, to allow for some implementation wiggle room.

alexcrichton commented 1 year ago

I don't disagree that symmetry is nice, but when it comes to wasm simd I think the ship might have already sailed on that? For example i64x2 unsigned comparisons are not present, there's no extadd_pairwise for i64x2 outputs, there's no i32x4.narrow_*, etc. I haven't been intimately involved with SIMD historically but it seems like the instruction set is strongly driven by concrete "what's in hardware and what do people need" motivations rather than symmetrical "let's hope hardware makes this fast" wishes. Given that, at least from my own (possibly naive) perspective, a missing instruction like this is expected and normal. Or, in other words, explaining missing instructions seems like something that's already required and having one more hole seems like it's almost expected rather than rationale to keep something.

Alternatively why not go one step further? If the goal is to get access to pblendvb on x86_64 why have i{32x4,64x2}.relaxed_laneselect? If there are concrete use cases which want blendvp{s,d} it seems reasonable to have them, but if they were themselves purely added for symmetry with i8x16 and the only original goal was pblendvb then it seems like a possibility is to only have the i8x16 variant.

ngzhian commented 1 year ago

We should probably discuss this at the next sync, unfortunately I'm OOO that day. (And Marat is also OOO). @penzn will you be around to run the sync this Friday, we should bring this up for discussion.

Okay I just checked, XNNPack has some code (not in production yet) that checks if i32x4.laneselect is implemented using blendvps (i.e. only checks top bit). So it is still useful (e.g. detectsign of floats)

penzn commented 1 year ago

Sorry for the late reply. We should schedule a sync I think.

penzn commented 1 year ago

Back to the main topic, I think there is a process angle to this. Feature has passed into phase 4 pending final spec review, and according to 'phases' doc this should mean that only "minor cosmetic changes should occur". I am not sure where is the line for that, but changing semantics or dropping instructions may be pushing it. This also might be entirely up to the Working Group, who should be in control according to the doc. If we are considering either of the changes we should probably bring this up at a WG meeting.

alexcrichton commented 1 year ago

Personally at least I feel that there should still be room for changes. This proposal is not yet at phase 4, although I understand there was a provisional vote of sorts that everyone was ok with it moving to phase 4 with the spec text in place. It's perhaps worth noting, though, that I discovered this issue after said provisional vote.

I personally feel that this is an issue that was discovered late in the process because this proposal progressed relatively quickly throughout the stages process. I think that's ok, but at the same time I would at least hope that the process won't be wielded in such a way to codify what might otherwise be considered a mistake in the official specification.

penzn commented 1 year ago

Yes, I am also not sure how the provisional status of the vote plays into all of this. I don't think this would stop us making the change if the change is needed - I am just trying to understand what is the 'official' process for it right now.

By the way, there might be a way to fix just the i16x8 lowering. I have an idea, but trying to see if there are better ways to do it, will share it today or tomorrow.

To be super clear, I think it is important to resolve this issue.

penzn commented 1 year ago

To use pblendvb for i16x8.laneselect it just needs to be prepended by pcmpgtw on the mask and all-zeros value: if 16-bit value in the mask is less than zero then corresponding lane would be set to all 1s, which would be sufficient for pblendvb to pick it up. This adds at least one extra instruction though.

For SSE this would be

xorps xmm0 xmm0
pcmpgtw xmm0 $laneselect_mask
pblendvb $a $b

Two operand blend implicitly uses xmm0, that's why this pattern only adds one extra instruction. In three operand mode it would probably be two extra operations. Another way I can think of doing this is shift right + or, but that seems to be just a tad longer.

alexcrichton commented 1 year ago

For me I still don't fully understand the motivation for the existence of this instruction. It seems to be like there's three possible motivations, and forgive me for possible erroneously extrapolating here and please correct me if I'm wrong.

There's more performant lowering than the deterministic alternative v128.bitselect on some platforms. That's the case for i8x16, i32x4, and i64x2 variants of relaxed_laneselect, or so I assume given a count-the-instructions metric. This isn't at least obviously the case for i16x8 though where v128.bitselect is a pand + pandn + por combo. I'll admit though that I know count-the-instructions metrics are not always accurate, so is the combo you're proposing @penzn noticeably better than the v128.bitselect combo?
It's symmetrical to have an i16x8 variant when the other variants are present. I mentioned above how I don't personally agree with this, but that's not to say that this argument can't be made. I'm not sure if this symmetry angle has been used to motivation other instructions for example.
It's a pain to remove instructions at this stage of the proposal. I don't want to diminish the work necessary to remove it or sweep it under the rug, I do think it's a real cost to eat. That being said if this is the motivation for keeping it then I do think it's worth being honest that this is the case.

My personal take is that none of those motivations justify the existence of the instruction, but I recognize I may very well be missing something crucial! (and that I'm just one person and this is a pretty minor issue all things concerned)

ngzhian commented 1 year ago

At the SIMD sync today notes we decided that a spec text tweak (https://github.com/WebAssembly/relaxed-simd/issues/125#issuecomment-1536472262) will be made, and all instructions kept.

Motivation is to keep the instruction lower optimal (single instruction for all 4 lane select).

penzn commented 1 year ago

Looks like the group is leaning towards changing the spec, though personally would prefer (not super strongly) either removing the operation or changing the lowering, which have more or less the same effect.

The argument in favor of changing semantics is that it would result in lowest instruction count (not exactly single instruction for SEE, but lowest nonetheless). However as you pointed out this allows for implementing everything via pblendvb eliminating dedicated 32-bit and 64-bit lowerings, which would require another detection step on top of x86 vs Arm to be used correctly (and XNNPACK already does that as @ngzhian pointed out above). My take here is that detection is may not a big deal for 'large volume' users, but can deter the smaller ones, that's why personally I don't like this approach.

is the combo you're proposing @penzn noticeably better than the v128.bitselect combo?

Slightly or about the same. I see where you are going with this, I don't know if there is a restriction anywhere preventing one type of relaxed_laneselect implemented one as blend and another - as bitselect. This would definitely throw a wrench into how user code would understand what it is getting in this instruction if that is the case.

3. It's a pain to remove instructions at this stage of the proposal. I don't want to diminish the work necessary to remove it or sweep it under the rug, I do think it's a real cost to eat.

Agree, the intention was to go with it if that is necessary. During the discussion we decided that removing is still OK since we are not in phase 4.

Probably also worth mentioning that strict, sign based, laneselect (aka x86's blend) is useful for vectorization of conditionals like if (a > 0) { b = a; } else { b = c; }. It can be implemented in a just two instructions on Arm and is friendly towards use in compiler-generated code.

Maratyszcza commented 1 year ago

Probably also worth mentioning that strict, sign based, laneselect (aka x86's blend) is useful for vectorization of conditionals like if (a > 0) { b = a; } else { b = c; }. It can be implemented in a just two instructions on Arm and is friendly towards use in compiler-generated code.

It was proposed as WebAssembly/simd#124, but rejected by the WG.

penzn commented 1 year ago

One more thought.

In my view the tradeoffs of changing to byte-wise select are between these two alternatives:

Some implementation will opt for blend bytes across the board
In those cases the other three types will have to be implemented by the user as compare + laneselect (after testing the behavior), resulting in one extra instruction added to execution of all three
Application developers can also skip the other laneselects for clarity sake, though that is somewhat debatable (same effect)

vs

16-bit laneselect gets an extra instruction, or get dropped; either way leads to to compare + select implementation for those who really want it

The tradeoff is really between handicapping three operations, but only sometimes, vs handicapping only one, but always. The effect is hard to quantify ahead of time - in the best case scenario there would be no overhead, but this requires both runtime with the most efficient implementation and application that would do all the testing to detect that. I think in practice the results are going to drift away from this best case scenario and we are probably are not going to be a lot better off than only affecting i16x8. This is isn't a very strong opinion, as I don't really have a way to measure this in advance.

Maratyszcza commented 1 year ago

I don't see why there're tradeoffs at all. The guarantee provided by relaxed laneselect is that if the mask is produced by a SIMD comparison (or boolean operations on results of SIMD comparisons), then bitselect on the mask can be replaced with relaxed laneselect of the corresponding element size. Note that this guarantee works regardless of how relaxed laneselect is implemented - as bitselect, checking top bit of each byte, or checking top bit of each lane - and allows for generation of the optimal single-instruction lowering across all element types on both x86 and ARM.

penzn commented 1 year ago

The guarantee provided by relaxed laneselect is that if the mask is produced by a SIMD comparison (or boolean operations on results of SIMD comparisons), then bitselect on the mask can be replaced with relaxed laneselect of the corresponding element size.

I'm a bit confused, is there an assumption that input mask to laneselect is always the result of element-wise comparsion? Was that the original intention with this operation? I guess this might be the source of misunderstanding :)

In case of mask produced via SIMD compare you are absolutely right - it would always be single instruction, since they the lane values are either all 0's or all 1's. However the comparison isn't necessary for sign-based select if if we are comparing to zero (it would be necessary for 16-bit lanes since there is no such operation though).

Maratyszcza commented 1 year ago

I'm a bit confused, is there an assumption that input mask to laneselect is always the result of element-wise comparsion? Was that the original intention with this operation?

The input mask to laneselect doesn't have to be the result of element-wise comparison, but this was the use-case targeted by these instructions.

Removing comparison for sign-based select is essentially a hack. It is made possible by some implementations of the relaxed laneselect, but it is not portable and not the target use-case.

penzn commented 1 year ago

Removing comparison for sign-based select is essentially a hack.

One person's hack might be other person's compiler optimization 😄

In addition to comparison to zero, on x86 it is possible to produce all a[i] < b[i] ? c[i] : d[i] without compares, just using subtraction and blend ("sign" select), by doing blend (a - b), c, d, with exception of 16-bit values where comparison is necessary. This is a rather straight-forward strength reduction optimization.

I've check some compilers on this, it sounds like HPC oriented compilers do that (wouldn't vouch for all though).

Edit: On a more conservative note, it should be very easy to detect that a mask to bitselect came from a compare and lower that bitselect to one of the blend instructions on x86. That would be a completely safe optimization that would produce exactly the same code as this instruction.

Maratyszcza commented 1 year ago

What would be the benefit of replacing compare with subtraction? On most microarchitectures, compares and subtractions have exactly the same execution characteristics (latency, throughput, pipe/port use).

penzn commented 1 year ago

Sorry, edited the post above before seeing the reply. You might be right about comparison vs subtraction, let me dig into it a bit more.

On the other hand, just producing a blend for bitselect when mask is a product of comparison is a reasonable conservative optimization. At first glance it seems like it would produce the same code relaxed laneselect.

penzn commented 1 year ago

I've implemented detecting compare+bitselect in V8 (it is out of date, but can be rebased). Probably some parts of the patch can be improved, but the machinery to detect the case is there and is well within 'limited optimizations' approach proposed as part of original SIMD proposal. Can we replace relaxed laneselect with this kind of optimization?

WebAssembly / relaxed-simd

Codegen of i16x8.relaxed_laneselect #125