penzn opened 3 years ago
Here is a summary table of the proposed instructions and their support for different architectures.
WASM | SSE | AVX2 | AVX512 | Neon | SVE |
---|---|---|---|---|---|
`vec.v8.interleave_even/odd` | general shuffle | general shuffle | general shuffle | `vtrn1/2q_s/u8` | `svtrn1/2_s/u8` |
`vec.v16.interleave_even/odd` | general shuffle | general shuffle | general shuffle | `vtrn1/2q_s/u16` | `svtrn1/2_s/u16` |
`vec.v32.interleave_even/odd` | general shuffle | general shuffle | general shuffle | `vtrn1/2q_s/u32` | `svtrn1/2_s/u32` |
`vec.v64.interleave_even/odd` | `_mm_unpacklo/hi_epi64` | `_mm256_unpacklo/hi_epi64` | `_mm512_unpacklo/hi_epi64` | `vtrn1/2q_s/u64` | `svtrn1/2_s/u64` |
`vec.v8.interleave_low/high` | `_mm_unpacklo/hi_epi8` | general shuffle | general shuffle | `vzip1/2q_s/u8` | `svzip1/2_s/u8` |
`vec.v16.interleave_low/high` | `_mm_unpacklo/hi_epi16` | general shuffle | general shuffle | `vzip1/2q_s/u16` | `svzip1/2_s/u16` |
`vec.v32.interleave_low/high` | `_mm_unpacklo/hi_epi32` | general shuffle | general shuffle | `vzip1/2q_s/u32` | `svzip1/2_s/u32` |
`vec.v64.interleave_low/high` | `_mm_unpacklo/hi_epi64` | general shuffle | general shuffle | `vzip1/2q_s/u64` | `svzip1/2_s/u64` |
`vec.v8.concat_even/odd` | general shuffle | general shuffle | general shuffle | `vuzp1/2q_s/u8` | `svuzp1/2_s/u8` |
`vec.v16.concat_even/odd` | general shuffle | general shuffle | general shuffle | `vuzp1/2q_s/u16` | `svuzp1/2_s/u16` |
`vec.v32.concat_even/odd` | general shuffle | general shuffle | general shuffle | `vuzp1/2q_s/u32` | `svuzp1/2_s/u32` |
`vec.v64.concat_even/odd` | general shuffle | general shuffle | general shuffle | `vuzp1/2q_s/u64` | `svuzp1/2_s/u64` |
Note that some "general shuffles" can be implemented quite efficiently without needing a `_mm512_permutex2var_epi8` (or its equivalent on earlier architectures). For instance, "interleave even|odd" on 32-bit elements can be implemented using a single `_mmX_shuffle_epi32` followed by a `_mmX_blend_epi32` (or even a single `_mmX_shuffle_ps`).
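As a concrete 128-bit illustration of that shuffle+blend lowering (a sketch, not from the proposal: the function name is mine, and SSE4.1's `_mm_blend_epi16` stands in for the per-dword blend, since `_mm_blend_epi32` requires AVX2):

```c
#include <immintrin.h>

// Sketch of vec.v32.interleave_even on one 128-bit block:
// result = [a0, b0, a2, b2]. The name is illustrative only.
__attribute__((target("sse4.1")))
static __m128i interleave_even_v32(__m128i a, __m128i b) {
    // Duplicate b's even lanes into the odd positions: t = [b0, b0, b2, b2]
    __m128i t = _mm_shuffle_epi32(b, _MM_SHUFFLE(2, 2, 0, 0));
    // Keep dword lanes 0 and 2 from a, take lanes 1 and 3 from t
    // (imm 0xCC selects 16-bit words 2,3 and 6,7, i.e. dword lanes 1 and 3).
    return _mm_blend_epi16(a, t, 0xCC);
}
```

The odd variant is symmetric: broadcast a's odd lanes down (`_MM_SHUFFLE(3, 3, 1, 1)`) and blend with b.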
The main issue we have here is that AVX and AVX512 are modeled around "nested" SIMD, where most swizzle operations are defined in terms of 128-bit swizzles. Thus, even though the SSE "unpacklo|hi" operations directly match "interleave low|high", this match breaks for larger vectors. The same problem exists for narrowing operations like `_mmX_pack(u)s_epiN`.
I don't have a good solution to this issue.
Concat low/high might also be useful (e.g. to combine two half registers of demoted/narrowed values).
For `PSHUFB`, 128-bit blocks can actually be helpful, whereas `palignr` is much less useful because of them. Interleave seems a middle ground: not great, but also not crippled.
I suppose we could argue that imposing a general shuffle on x86 would be very expensive, whereas the other architectures can easily emulate it using their general shuffle plus iota and masking?
I thought about adding "concat low/high", but I'm not sure what its use case would be, as we would mostly not deal with half registers. Since the length of a register is unknown at compile time, half registers will usually not appear naturally in algorithms.
Also, if you look at #27, you will see that I proposed shuffles inside 128-bit blocks (`vec.v128.shuffle` and `vec.v128.swizzle`).
This is intended to make upgrading from 128-bit algorithms easier, with good efficiency on x86.
Concerning the "general shuffles" for x86, I would like to point out that some of them can be implemented efficiently on x86 (most of `v32` and `v64`, I would say).
Also, I think that "general shuffles" would not be that bad in the grand scheme of things.
> I'm not sure what its use case would be, as we would mostly not deal with half registers.

We see this when demoting one register from u16 to u8, etc. I suppose requiring two inputs would mostly prevent that, but what about f32 -> i8? Would we always need 4 inputs?
> This is intended to make upgrading from 128-bit algorithms easier, with good efficiency on x86.

Great!
> I suppose requiring two inputs would mostly prevent that, but what about f32 -> i8? Would we always need 4 inputs?

My view was that when you don't need 2 (or 4) inputs, you pass zeros for the unused inputs.
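For a concrete 2:1 example in SSE2 (a sketch; the names are mine, not from the proposal): narrowing i16 lanes to u8 with unsigned saturation, where a caller with only one live input passes zeros for the other:

```c
#include <immintrin.h>

// Sketch of a 2:1 narrowing conversion (i16 -> u8, unsigned saturation).
// The names are illustrative only.
static __m128i demote_2to1_i16_u8(__m128i a, __m128i b) {
    // Low 8 result bytes come from a, high 8 from b, each saturated to [0, 255].
    return _mm_packus_epi16(a, b);
}

// When only one input vector is live, pass zeros as the second input;
// the upper half of the result is then all zeros.
static __m128i demote_1_i16_u8(__m128i a) {
    return demote_2to1_i16_u8(a, _mm_setzero_si128());
}
```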
Yes, that would work. I'm curious whether you are interested in reducing code changes when porting scalar code to SIMD?
When allowing half vectors, the code for demoting is basically the same as scalar code after replacing array accesses with Load/Store. With 2 inputs, it at least requires an extra parameter, or encourages unrolling the loop 2x. It is also less efficient on some architectures.
I am not saying we should disallow 2:1 entirely, but it's not clear to me that it is the best option if there is only going to be one.
> When allowing half vectors, the code for demoting is basically the same as scalar code after replacing array accesses with Load/Store. With 2 inputs, it at least requires an extra parameter, or encourages unrolling the loop 2x. It is also less efficient on some architectures.

For memory accesses, I think we should provide narrowing/widening variants that always deal with full vectors. For in-register conversions, it is true that some (most?) architectures would incur an overhead for 2:1. However, this overhead would usually be smaller than what is gained from loop unrolling (because you keep the full parallelism). So all in all, I think encouraging the loop unrolling is a good thing.
Also, if we provide 2:1 conversions and a zero is passed, a smart-ish WASM engine could detect it and perform a 1:½ conversion. The other way around would require much more "smartness" from the engine.
I think the value of these instructions is somewhat lower if they just map to general shuffles, though there might be situations where that is inevitable.
> Concerning the "general shuffles" for x86, I would like to point out that some of them can be implemented efficiently on x86 (most of `v32` and `v64`, I would say). Also, I think that "general shuffles" would not be that bad in the grand scheme of things.

Yes, 32-bit and 64-bit have separate instructions, which makes it easier, as you mentioned above. This is a useful distinction, as those instructions are cheaper (even though still "general") than the byte-wise versions.
> The main issue we have here is that AVX and AVX512 are modeled around "nested" SIMD, where most swizzle operations are defined in terms of 128-bit swizzles. Thus, even though the SSE "unpacklo|hi" operations directly match "interleave low|high", this match breaks for larger vectors. The same problem exists for narrowing operations like `_mmX_pack(u)s_epiN`.

True. I wonder what the lowering would be if we took the nested approach for this set of operations, though it probably would not be the most efficient on Arm.
> True. I wonder what the lowering would be if we took the nested approach for this set of operations, though it probably would not be the most efficient on Arm.
The "nested" approach is efficiently implementable with `vec.i8x16.shuffle` (in a single operation), so I'm not sure we would need dedicated operations.
On the contrary, the proposed `vec.vX.interleave|concat_odd|even|low|half` operations are not easily implementable using LUT2, especially if they are to be efficient on architectures with native support.
It seems that all `vec.vX.interleave_even|odd` are easily implementable using a lane-wise shift and a select:
```c
// The WASM names are not valid C identifiers, hence the underscores.
__m512i vec_v8_interleave_even(__m512i a, __m512i b) {
  b = _mm512_slli_epi16(b, 8);            // move even bytes of b into odd positions
  __mmask64 mask = 0x5555555555555555ull; // mask for even byte positions
  return _mm512_mask_mov_epi8(b, mask, a);
}

__m512i vec_v8_interleave_odd(__m512i a, __m512i b) {
  a = _mm512_srli_epi16(a, 8);            // move odd bytes of a into even positions
  __mmask64 mask = 0x5555555555555555ull; // mask for even byte positions
  return _mm512_mask_mov_epi8(b, mask, a);
}
```
`vec.vX.interleave_low|high` and `vec.vX.concat_even|odd` are more complex to implement, and I currently don't have the time to explore them.
@lemaitre FYI here is another example where half vectors can help: https://github.com/riscv/riscv-v-spec/pull/657/files
@jan-wassenberg I understand this example, but I'm not convinced. To me, the best way would be to introduce an inner loop that processes the larger elements within each iteration. That way, you could still have an integral LMUL for the small datatype (no parallelism lost), while keeping a reasonable LMUL for the large datatype (no spill code).
Such an example in SSE would look like this (equivalent to LMUL=1 for both):
```c
void add_ref(long N, ...) {
  for (long I = 0; I < N; I += 16) {
    __m128i vc_a = _mm_load_si128((__m128i*)(&c_a[I]));
    __m128i vc_b = _mm_load_si128((__m128i*)(&c_b[I]));
    __m128i vc_c = _mm_add_epi8(vc_a, vc_b);
    _mm_store_si128((__m128i*)(&c_c[I]), vc_c);

    for (long i = I; i < I+16; i += 2) {
      __m128i vl_a = _mm_load_si128((__m128i*)(&l_a[i]));
      __m128i vl_b = _mm_load_si128((__m128i*)(&l_b[i]));
      __m128i vl_c = _mm_add_epi64(vl_a, vl_b);
      _mm_store_si128((__m128i*)(&l_c[i]), vl_c);
      ...
      __m128i vl_m = _mm_load_si128((__m128i*)(&l_m[i]));
      vl_m = _mm_add_epi64(_mm_add_epi64(vl_m, vl_c), ...);
      _mm_store_si128((__m128i*)(&l_m[i]), vl_m);
    }
  }
}
```
Sorry it took me this long to reply.
> The "nested" approach is efficiently implementable with `vec.i8x16.shuffle` (in a single operation), so I'm not sure we would need dedicated operations.
Would `vec.i8x16.shuffle` not produce a generic shuffle op? I guess there are ways to detect special forms by checking the mask, like it is currently done with simd128. On a related note, what is the difference between `vec.i8x16.swizzle` and LUT1? Is it overflow?
To be honest, I don't think implied mask detection in `i8x16.shuffle` was a really good idea - it creates a layer of "hidden" decisions that runtimes and the toolchain have to collectively navigate. On the other hand, here we would have an even tougher job, with variations between platforms.
> Would `vec.i8x16.shuffle` not produce a generic shuffle op?

It depends on the actual mask and architecture; if the mask corresponds to a 32-bit shuffle on AVX2, for instance, a `_mm256_shuffle_ps` can be used.
The idea is that if you have a shuffle that does not cross 128-bit lane boundaries, you can most likely use more specialized instructions on x86 (ones that do not need an extra constant, and/or do not have the 2-cycle latency penalty).
On 128-bit architectures, `vec.i8x16.shuffle` ensures the WASM engine actually sees that the shuffle is constant (instead of trying to detect it through constant propagation).
For wider architectures like SVE or RISC-V V, these instructions would have no real benefit and would fall back to the generic shuffle.
The other benefit of `vec.i8x16.shuffle` is that when porting 128-bit SIMD code to flexible vectors, you can keep the 128-bit block logic but process multiple blocks at once. Some algorithms would still require an extra layer to "communicate" between blocks, but I tend to think this gymnastics is easier than redesigning the whole algorithm from scratch.
> On a related note, what is the difference between `vec.i8x16.swizzle` and LUT1? Is it overflow?
`vec.i8x16.swizzle` performs a (1-input) shuffle inside 128-bit blocks. It is really just calling `i8x16.swizzle` on all `v128` elements of the vector (think of "sub-SIMD" or nested SIMD).
This means the elements cannot cross 128-bit lane boundaries.
In AVX2, the instructions `_mm256_permutevar_ps` and `_mm256_shuffle_epi8` have such semantics.
However, LUT1 is not limited to 128-bit blocks: elements can be fetched from anywhere within the source vector. Another distinction is that LUT1 is not limited to 8-bit elements, but also supports wider elements. Overflow behavior might also differ, but this has to be fleshed out.
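A scalar model of that block-wise semantics (my reading of the discussion, not normative): each 128-bit block is swizzled independently, and out-of-range indices produce zero, as in `i8x16.swizzle`:

```c
#include <stddef.h>
#include <stdint.h>

// Scalar model of the proposed vec.i8x16.swizzle: every 16-byte block of v
// is shuffled by the corresponding 16-byte block of idx; indices >= 16
// yield 0 (the i8x16.swizzle out-of-range rule). LUT1, by contrast, could
// index across the whole vector and support wider element types.
static void swizzle_blocks(const uint8_t *v, const uint8_t *idx,
                           uint8_t *out, size_t nblocks) {
    for (size_t blk = 0; blk < nblocks; blk++)
        for (size_t i = 0; i < 16; i++) {
            uint8_t j = idx[16 * blk + i];
            out[16 * blk + i] = (j < 16) ? v[16 * blk + j] : 0;
        }
}
```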
> To be honest, I don't think implied mask detection in `i8x16.shuffle` was a really good idea - it creates a layer of "hidden" decisions that runtimes and the toolchain have to collectively navigate. On the other hand, here we would have an even tougher job, with variations between platforms.
We are on the same page here. You might remember that I wanted more shuffle instructions to make WASM engines' job easier, but this never caught on.
The key to a performant runtime is to keep as much semantics as possible from the source code down to the WASM engine. And this is done by having more specialized instructions, not fewer.
The instructions `vec.i8x16.shuffle`, `vec.i8x16.swizzle`, and `vec.S.interleave|concat_low|high|even|odd` definitely fall into this category.
Specialized shuffles to zip or unzip lanes. #28 proposes interleave and concat, which roughly correspond to Arm's zip and unzip instructions. In short, interleave/zip takes odd or even lanes from two vectors and interleaves them in the output. Concat/unzip is the reverse operation: odd or even lanes from each of the sources are placed together in the destination.
The closest x86 analog to zip/interleave is unpack, but it takes adjacent lanes from the source operands instead of odd or even lanes. It is called unpack because, when used with a vector of zeros, it is the opposite of "pack", which reduces lane sizes with signed or unsigned saturation.
The obvious takeaway is that these operations exist, but they are quite different on the two major platforms. The less obvious question is how to marry the two approaches.
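To make the correspondence concrete, here is a scalar reference (one block of `n` lanes; my mapping of the proposal's names onto Arm's instructions) for the three 2-source permutes discussed in this thread. Within a single 128-bit block, x86's unpacklo/hi compute the same result as zip1/zip2:

```c
#include <stddef.h>

// zip1 / interleave_low: interleave the low halves of a and b.
static void zip1_ref(const int *a, const int *b, int *r, size_t n) {
    for (size_t i = 0; i < n / 2; i++) { r[2*i] = a[i]; r[2*i + 1] = b[i]; }
}

// uzp1 / concat_even: even lanes of a, then even lanes of b.
static void uzp1_ref(const int *a, const int *b, int *r, size_t n) {
    for (size_t i = 0; i < n / 2; i++) { r[i] = a[2*i]; r[n/2 + i] = b[2*i]; }
}

// trn1 / interleave_even: even lanes of a and b, interleaved.
static void trn1_ref(const int *a, const int *b, int *r, size_t n) {
    for (size_t i = 0; i < n / 2; i++) { r[2*i] = a[2*i]; r[2*i + 1] = b[2*i]; }
}
```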