WebAssembly / flexible-vectors

Vector operations for WebAssembly
https://webassembly.github.io/flexible-vectors/

Special shuffles to rearrange lanes #30

Open penzn opened 3 years ago

penzn commented 3 years ago

Specialized shuffles to zip or unzip lanes. #28 proposes interleave and concat, which roughly correspond to Arm's zip and unzip instructions. In short, interleave/zip takes the odd or even lanes from two vectors and interleaves them in the output. Concat/unzip is the reverse operation: the odd or even lanes from each source are placed together in the destination.

The closest x86 equivalent to zip/interleave is unpack, but it takes adjacent lanes from the source operands instead of odd or even ones. It is called unpack because, when used with a vector of zeros, it is the opposite of "pack", which reduces lane sizes with signed or unsigned saturation.
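
As a rough illustration (mine, not from the thread; the function name is made up), SSE unpack interleaves the adjacent low-half lanes of both operands, which is not the same as the even/odd interleave described above:

#include <emmintrin.h>

// Illustration only: unpacklo interleaves the adjacent low-half bytes,
//   unpacklo(a, b) = {a0, b0, a1, b1, ..., a7, b7}
// whereas an even-lane interleave as described above would give
//   {a0, b0, a2, b2, ..., a14, b14}.
__m128i unpack_low_bytes(__m128i a, __m128i b) {
  return _mm_unpacklo_epi8(a, b);
}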

The obvious takeaway is that these operations exist, but they are quite different on the two major platforms. The less obvious question is how to marry the two approaches.

lemaitre commented 3 years ago

Here is a summary table of the proposed instructions and their support for different architectures.

| WASM | SSE | AVX2 | AVX512 | Neon | SVE |
|------|-----|------|--------|------|-----|
| vec.v8.interleave_even/odd | general shuffle | general shuffle | general shuffle | vtrn1/2q_s/u8 | svtrn1/2_s/u8 |
| vec.v16.interleave_even/odd | general shuffle | general shuffle | general shuffle | vtrn1/2q_s/u16 | svtrn1/2_s/u16 |
| vec.v32.interleave_even/odd | general shuffle | general shuffle | general shuffle | vtrn1/2q_s/u32 | svtrn1/2_s/u32 |
| vec.v64.interleave_even/odd | _mm_unpacklo/hi_epi64 | _mm256_unpacklo/hi_epi64 | _mm512_unpacklo/hi_epi64 | vtrn1/2q_s/u64 | svtrn1/2_s/u64 |
| vec.v8.interleave_low/high | _mm_unpacklo/hi_epi8 | general shuffle | general shuffle | vzip1/2q_s/u8 | svzip1/2_s/u8 |
| vec.v16.interleave_low/high | _mm_unpacklo/hi_epi16 | general shuffle | general shuffle | vzip1/2q_s/u16 | svzip1/2_s/u16 |
| vec.v32.interleave_low/high | _mm_unpacklo/hi_epi32 | general shuffle | general shuffle | vzip1/2q_s/u32 | svzip1/2_s/u32 |
| vec.v64.interleave_low/high | _mm_unpacklo/hi_epi64 | general shuffle | general shuffle | vzip1/2q_s/u64 | svzip1/2_s/u64 |
| vec.v8.concat_even/odd | general shuffle | general shuffle | general shuffle | vuzp1/2q_s/u8 | svuzp1/2_s/u8 |
| vec.v16.concat_even/odd | general shuffle | general shuffle | general shuffle | vuzp1/2q_s/u16 | svuzp1/2_s/u16 |
| vec.v32.concat_even/odd | general shuffle | general shuffle | general shuffle | vuzp1/2q_s/u32 | svuzp1/2_s/u32 |
| vec.v64.concat_even/odd | general shuffle | general shuffle | general shuffle | vuzp1/2q_s/u64 | svuzp1/2_s/u64 |

Note that some "general shuffles" can be implemented quite efficiently without needing a _mm512_permutex2var_epi8 (or its equivalent on earlier architectures). For instance, "interleave even|odd" on 32-bit elements can be implemented using a single _mmX_shuffle_epi32 followed by a _mmX_blend_epi32 (or even a single _mmX_shuffle_ps).
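
For example, here is a sketch of that shuffle+blend lowering for 32-bit "interleave even" on AVX2 (my illustration, not part of the proposal; the odd variant is analogous with the opposite shift pattern):

#include <immintrin.h>

// interleave_even on 32-bit lanes: result = {a0, b0, a2, b2, a4, b4, a6, b6}
__m256i interleave_even_32(__m256i a, __m256i b) {
  // duplicate the even 32-bit lanes of b into the odd positions
  __m256i b_dup = _mm256_shuffle_epi32(b, _MM_SHUFFLE(2, 2, 0, 0));
  // keep a in the even positions, take the duplicated b in the odd ones
  return _mm256_blend_epi32(a, b_dup, 0xAA);
}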

The main issue here is that AVX and AVX512 are modeled around "nested" SIMD, where most swizzle operations are defined in terms of 128-bit swizzles. Thus, even though the SSE "unpacklo|hi" operations directly match "interleave low|high", this match breaks for larger vectors. The same problem exists for narrowing operations like _mmX_pack(u)s_epiN.

I don't have a good solution to solve this issue.

jan-wassenberg commented 3 years ago

Concat low/high might also be useful (e.g. to combine two half registers of demoted/narrowed values).
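
For what it's worth, a rough sketch of what that could look like on SSE, assuming the two narrowed results each occupy the low half of a register (the name and the half-register convention are illustrative, not from the proposal):

#include <emmintrin.h>

// "concat low" of two half registers: take the low 64 bits of each input
// and place them side by side in one full 128-bit register.
__m128i concat_low_halves(__m128i half_a, __m128i half_b) {
  return _mm_unpacklo_epi64(half_a, half_b);
}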

For PSHUFB, 128-bit blocks can actually be helpful, whereas palignr is much less useful because of them. The interleaves seem like a middle ground: not great, but also not crippled.

I suppose we could argue that trying to impose general shuffle on x86 would be very expensive, whereas the others can easily emulate it using their general shuffle+iota+masking?

lemaitre commented 3 years ago

I thought about adding "concat low/high", but I'm not sure what its use case would be as we would mostly not deal with half registers. Since the length of a register is unknown at compile time, half registers will usually not arise naturally from the algorithms.

Also, if you look at #27, you will see that I proposed shuffles inside 128-bit blocks (vec.v128.shuffle and vec.v128.swizzle). This is intended to make the upgrade from 128-bit algorithms easier while keeping good efficiency on x86.

Concerning the "general shuffles" for x86, I would like to point out that some of them can be implemented efficiently (most of the v32 and v64 ones, I would say). Also, I think that "general shuffles" would not be that bad in the grand scheme of things.

jan-wassenberg commented 3 years ago

I'm not sure what its use case would be as we would mostly not deal with half registers.

We see this when demoting one register from u16 to u8 etc. I suppose requiring two inputs would mostly prevent that, but what about f32 -> i8, would we always need 4 inputs?

This is intended to make the upgrade from 128-bit algorithms easier while keeping good efficiency on x86.

Great!

lemaitre commented 3 years ago

I suppose requiring two inputs would mostly prevent that, but what about f32 -> i8, would we always need 4 inputs?

My point of view was that in cases where you don't need 2 (or 4) inputs, you just pass zeros for the unused ones.

jan-wassenberg commented 3 years ago

Yes, that would work. I'm curious whether you are interested in reducing code changes when porting scalar code to SIMD?

When allowing half vectors, the code for demoting is basically the same as the scalar code after replacing array accesses with Load/Store. With 2 inputs, it at least requires an extra parameter, or encourages unrolling the loop 2x. It is also less efficient on some architectures.

I am not saying we should disallow 2:1 entirely, but it's not clear to me that it is the best option, if there is only going to be one.

lemaitre commented 3 years ago

When allowing half vectors, the code for demoting is basically the same as the scalar code after replacing array accesses with Load/Store. With 2 inputs, it at least requires an extra parameter, or encourages unrolling the loop 2x. It is also less efficient on some architectures.

For memory accesses, I think we should provide narrowing/widening ones that always deal with full vectors. For in-register conversions, it is true that some (most?) architectures would have an overhead for 2:1. However, this overhead would usually be less than what is gained from loop unrolling (because you would keep the full parallelism). So all in all, I think encouraging loop unrolling is a good thing.

Also, if we provide 2:1 conversions and pass a zero, a smart-ish WASM engine could detect it and do a 1:½ conversion. The other way around would require much more "smartness" from the engine.

penzn commented 3 years ago

I think the value of these instructions is a bit lower if they just map to general shuffles, though there might be situations where that is inevitable.

Concerning the "general shuffles" for x86, I would like to point out that some of them can be implemented efficiently (most of the v32 and v64 ones, I would say). Also, I think that "general shuffles" would not be that bad in the grand scheme of things.

Yes, 32-bit and 64-bit have separate instructions, which makes it easier, as you mentioned above. This is a useful distinction, as those instructions are cheaper (even though still "general") than the byte-wise versions.

The main issue here is that AVX and AVX512 are modeled around "nested" SIMD, where most swizzle operations are defined in terms of 128-bit swizzles. Thus, even though the SSE "unpacklo|hi" operations directly match "interleave low|high", this match breaks for larger vectors. The same problem exists for narrowing operations like _mmX_pack(u)s_epiN.

True - I wonder what the lowering would be if we took the nested approach for this set of operations, though it probably would not be the most efficient on Arm.

lemaitre commented 3 years ago

True - I wonder what the lowering would be if we took the nested approach for this set of operations, though it probably would not be the most efficient on Arm.

The "nested" approach is efficiently implementable with vec.i8x16.shuffle (in a single operation), so I'm not sure we would need dedicated operations. On the contrary, the proposed vec.vX.interleave|concat_odd|even|low|half operations are not easily implementable using LUT2, especially to make it efficient on architectures with native support.

It seems that all vec.vX.interleave_even|odd are easily implementable using a lane-wise shift and a select:

// vec.v8.interleave_even: result = {a0, b0, a2, b2, ...}
__m512i vec_v8_interleave_even(__m512i a, __m512i b) {
  b = _mm512_slli_epi16(b, 8);             // move the even bytes of b into the odd positions
  __mmask64 mask = 0x5555555555555555ull;  // mask selecting the even byte positions
  return _mm512_mask_mov_epi8(b, mask, a); // even bytes from a, odd bytes from shifted b
}
// vec.v8.interleave_odd: result = {a1, b1, a3, b3, ...}
__m512i vec_v8_interleave_odd(__m512i a, __m512i b) {
  a = _mm512_srli_epi16(a, 8);             // move the odd bytes of a into the even positions
  __mmask64 mask = 0x5555555555555555ull;  // mask selecting the even byte positions
  return _mm512_mask_mov_epi8(b, mask, a); // even bytes from shifted a, odd bytes from b
}

vec.vX.interleave_low|high and vec.vX.concat_even|odd are more complex to implement and I currently don't have the time to explore them.
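
For reference, here is one possible shape of the "general shuffle" fallback for a 32-bit concat_even on AVX-512 (my sketch, not something worked out in the thread): a two-source variable permute driven by a constant index vector.

#include <immintrin.h>

// concat_even on 32-bit lanes: the even lanes of a, followed by the even lanes of b
__m512i concat_even_32(__m512i a, __m512i b) {
  const __m512i idx = _mm512_setr_epi32(0, 2, 4, 6, 8, 10, 12, 14,
                                        16, 18, 20, 22, 24, 26, 28, 30);
  // indices 0..15 pick from a, 16..31 pick from b
  return _mm512_permutex2var_epi32(a, idx, b);
}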

jan-wassenberg commented 3 years ago

@lemaitre FYI here is another example where half vectors can help: https://github.com/riscv/riscv-v-spec/pull/657/files

lemaitre commented 3 years ago

@jan-wassenberg I understand this example, but I'm not convinced. To me, the best way would be to introduce an inner loop that processes the larger elements within each outer iteration. That way, you could still have an integer LMUL for the small datatype (no parallelism lost), while keeping a reasonable LMUL for the large datatype (no spill code).

Such an example in SSE would look like this (equivalent to LMUL=1 for both):

void add_ref(long N, ...) {
  for (long I = 0; I < N; I += 16) {
    // one full vector of 16 8-bit elements per outer iteration
    __m128i vc_a = _mm_load_si128((__m128i*)(&c_a[I]));
    __m128i vc_b = _mm_load_si128((__m128i*)(&c_b[I]));
    __m128i vc_c = _mm_add_epi8(vc_a, vc_b);
    _mm_store_si128((__m128i*)(&c_c[I]), vc_c);
    // inner loop covers the matching 64-bit elements, two per 128-bit vector
    for (long i = I; i < I+16; i += 2) {
      __m128i vl_a = _mm_load_si128((__m128i*)(&l_a[i]));
      __m128i vl_b = _mm_load_si128((__m128i*)(&l_b[i]));
      __m128i vl_c = _mm_add_epi64(vl_a, vl_b);
      _mm_store_si128((__m128i*)(&l_c[i]), vl_c);
      ...
      __m128i vl_m = _mm_load_si128((__m128i*)(&l_m[i]));
      vl_m = _mm_add_epi64(_mm_add_epi64(vl_m, vl_c), ...);
      _mm_store_si128((__m128i*)(&l_m[i]), vl_m);
    }
  }
}

penzn commented 3 years ago

Sorry it took me this long to reply.

The "nested" approach is efficiently implementable with vec.i8x16.shuffle (in a single operation), so I'm not sure we would need dedicated operations.

Would not vec.i8x16.shuffle produce a generic shuffle op? I guess there are ways to detect special forms by checking the mask, like it is currently done with simd128. On a related note, what is the difference between vec.i8x16.swizzle and LUT1, is it overflow?

To be honest, I don't think implied mask detection in i8x16.shuffle was a really good idea - it creates a layer of "hidden" decisions that runtimes and toolchains have to collectively navigate. On the other hand, here we would have an even tougher job with the variations between platforms.

lemaitre commented 3 years ago

Would not vec.i8x16.shuffle produce a generic shuffle op?

It depends on the actual mask and arch, but if the mask corresponds to a 32-bit shuffle on AVX2, a _mm256_shuffle_ps can be used, for instance. The idea is that if you have a shuffle that does not cross 128-bit lane boundaries, you can most likely use more specialized instructions on x86 (ones that do not need an extra constant and/or do not have the 2-cycle latency penalty).
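
As a concrete illustration (mine, not a mandated lowering): a constant vec.i8x16.shuffle mask that, say, reverses the four 32-bit elements of each 128-bit block could be lowered on AVX2 to a single immediate shuffle rather than a byte-wise permute with a loaded constant:

#include <immintrin.h>

// reverse the four 32-bit elements inside each 128-bit block
__m256i reverse_u32_per_block(__m256i v) {
  return _mm256_shuffle_epi32(v, _MM_SHUFFLE(0, 1, 2, 3));
}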

On 128-bit archs, vec.i8x16.shuffle ensures the WASM engine actually sees that the shuffle is constant (instead of having to detect it through constant propagation). For wider archs like SVE or RISC-V V, those instructions would have no real benefit and would fall back to the generic shuffle.

The other benefit of vec.i8x16.shuffle is that if you port 128-bit SIMD code to flexible vectors, you can keep the 128-bit block logic but process multiple blocks at once. Some algorithms would still require an extra layer to "communicate" between blocks, but I tend to think this gymnastics is easier than redesigning the whole algorithm from scratch.

On a related note, what is the difference between vec.i8x16.swizzle and LUT1, is it overflow?

vec.i8x16.swizzle performs a (1-input) shuffle inside 128-bit blocks. It is really just calling i8x16.swizzle on all v128 elements of the vector (think of "sub-SIMD" or nested SIMD). This means the elements cannot cross 128-bit lane boundaries. In AVX2, the instructions _mm256_permutevar_ps and _mm256_shuffle_epi8 have exactly these semantics.
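
For instance, a possible AVX2 lowering of this block-local swizzle could look like the sketch below (my assumption about the out-of-range rule: indices >= 16 produce zero, as with i8x16.swizzle in simd128; the function name is made up):

#include <immintrin.h>

// swizzle inside 128-bit blocks: each output byte is selected from the same
// 128-bit half as its index, with out-of-range indices yielding zero.
__m256i swizzle_v128_blocks(__m256i v, __m256i idx) {
  // pshufb only zeroes a byte when bit 7 of its index is set, so saturate-add
  // 0x70: indices 0..15 stay selectable, anything >= 16 saturates into 0x80..0xFF
  __m256i adj = _mm256_adds_epu8(idx, _mm256_set1_epi8(0x70));
  return _mm256_shuffle_epi8(v, adj);
}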

However, LUT1 is not limited to 128-bit blocks: elements can be fetched from anywhere within the source vector. Another distinction is that LUT1 is not limited to 8-bit elements but also supports wider element types. Overflow behavior might also differ, but this has to be fleshed out.

To be honest, I don't think implied mask detection in i8x16.shuffle was a really good idea - it creates a layer of "hidden" decisions that runtimes and toolchains have to collectively navigate. On the other hand, here we would have an even tougher job with the variations between platforms.

We are on the same page here. You might remember that I wanted more shuffle instructions to make life easier for WASM engines, but this never caught on.

The key to a performant runtime is to preserve as much semantics as possible from the source code down to the WASM engine, and this is done by having more specialized instructions, not fewer. The instructions vec.i8x16.shuffle, vec.i8x16.swizzle, and vec.S.interleave|concat_low|high|even|odd definitely fall into this category.