WebAssembly / flexible-vectors

Vector operations for WebAssembly
https://webassembly.github.io/flexible-vectors/

Broadcast lane #29

Open penzn opened 3 years ago

penzn commented 3 years ago

Proposed in #28 (originally #27). It is different from the existing splat, since it broadcasts a lane from a vector input rather than a scalar, and it takes an index to select which element to broadcast:

Gets a single lane from a vector and broadcasts it to the entire vector. idx is interpreted modulo the cardinal of the vector.

  • vec.v8.splat_lane(v: vec.v8, idx: i32) -> vec.v8
  • vec.v16.splat_lane(v: vec.v16, idx: i32) -> vec.v16
  • vec.v32.splat_lane(v: vec.v32, idx: i32) -> vec.v32
  • vec.v64.splat_lane(v: vec.v64, idx: i32) -> vec.v64
  • vec.v128.splat_lane(v: vec.v128, idx: i32) -> vec.v128
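The modulo-index semantics above can be sketched as a scalar reference implementation in C (a minimal sketch; the function name is hypothetical and `LANES` stands in for the vector's cardinal, here fixed at 4 as if for a 128-bit vector of 32-bit lanes):

```c
#include <stdint.h>
#include <stddef.h>

#define LANES 4  /* assumed: 128-bit vector of 32-bit lanes, for illustration */

/* Hypothetical scalar reference for vec.v32.splat_lane: every lane of the
 * result receives lane (idx mod LANES) of the input vector. */
void splat_lane_v32(const uint32_t v[LANES], uint32_t idx, uint32_t out[LANES]) {
    uint32_t src = v[idx % LANES];   /* idx is interpreted modulo the cardinal */
    for (size_t i = 0; i < LANES; i++)
        out[i] = src;
}
```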

On x86, broadcast instructions first appear in AVX (for 32-bit floating-point elements; AVX2 adds integer variants), but the x86 variants don't take an index and only broadcast the first element of the source. A general-purpose shuffle would need to be used to emulate this on SSE, which is not great (definitely slower than a specialized version). Also, taking a variable index would turn this into a general-purpose shuffle on AVX+ as well.

lemaitre commented 3 years ago

It should be noted that vec.vX.splat_lane is not a substitute for vec.S.splat (which takes a scalar); it is a complement. Yes, in general, vec.vX.splat_lane would be implemented using 1-input shuffles.

I have 2 use cases for this instruction:

akirilov-arm commented 3 years ago

This operation is not that easy to implement with Neon and SVE either, because of the variable index. It is basically equivalent to an extract_lane followed by an indexed DUP (index 0), for example. If the index is known at compile time and, in the case of SVE, has a suitable value (fits within the first 16 bytes), then it can be reduced to just the DUP.

lemaitre commented 3 years ago

Actually, x86 SIMD has even more limitations, but overall, I don't think it's that bad.

First, I assume that WASM engines will actually do proper constant folding and thus detect when the index is "compile-time" known. This is not quite pattern matching, though (and is actually simpler and more robust, I assume).

Also, for 8-bit types, it is not that hard to implement on all archs: DUP the index into a vector, and use this vector in a TBL. For larger types, a bit more index manipulation is required because TBL (and its equivalents) have byte granularity.
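For the larger-type case, the extra index manipulation amounts to expanding one lane index into a per-byte shuffle control, as consumed by TBL on Arm or PSHUFB on x86. A minimal C sketch (the function name is hypothetical; 4 lanes of 32 bits assumed):

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical sketch: build a byte-granularity shuffle control from a 32-bit
 * lane index. Result byte i selects source byte (idx*4 + i%4), so every
 * 32-bit lane of the control repeats the four source bytes of the chosen lane. */
void make_splat_control_v32(uint32_t idx, uint8_t control[16]) {
    idx %= 4;                        /* index modulo the cardinal (4 lanes) */
    for (size_t i = 0; i < 16; i++)
        control[i] = (uint8_t)(idx * 4 + i % 4);
}
```

Feeding this control to a TBL/PSHUBF-style byte shuffle then performs the splat for any lane width that is a multiple of a byte.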

akirilov-arm commented 3 years ago

Also, for 8-bit types, it is also not that hard to implement on all archs: DUP the index into a vector, and use this vector in a TBL.

I am not sure if in that case I am comfortable with the idea of using TBL for SVE (or a transfer from a general-purpose to a vector register), so what I have in mind is:

cntb    x1                      // x1 = number of 8-bit lanes in a vector
udiv    w2, w0, w1              // w2 = idx / lanes
msub    w1, w1, w2, w0          // w1 = idx - w2 * lanes, i.e. idx % lanes
whilelo p0.b, wzr, w1           // predicate covering the first (idx % lanes) lanes
lasta   b1, p0, z0.b            // extract the element after the last active lane
dup     z1.b, z1.b[0]           // broadcast lane 0 of z1 to all lanes

The first 3 instructions would be the same for the TBL approach as well (they take care of getting the index into the proper range).
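What those first 3 instructions compute can be sketched in C (a hypothetical helper; msub computes the remainder from the quotient, since SVE has no integer modulo instruction):

```c
#include <stdint.h>

/* Sketch of the cntb/udiv/msub sequence: reduce a lane index modulo the
 * vector's lane count. udiv gives the quotient; msub folds it back into
 * the remainder, idx - q*lanes == idx % lanes. */
uint32_t wrap_index(uint32_t idx, uint32_t lanes) {
    uint32_t q = idx / lanes;    /* udiv */
    return idx - q * lanes;      /* msub */
}
```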

lemaitre commented 3 years ago

Oh, I missed that lasta could put the result into a vector register (the first lane, I assume). Then yes, it is probably a good way to do it.

However, this trick is only doable in SVE, and certainly not on x86. So all in all, the "TBL"-like implementation will still be good for those archs.

Also, as the result of cntb is constant during execution, it could be hardcoded during translation, and for a power-of-2 lane count, those 3 instructions become a single one.
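The power-of-2 reduction relies on the identity idx % n == idx & (n - 1) when n is a power of two, so the udiv/msub pair collapses to one AND. A minimal sketch (hypothetical helper name):

```c
#include <stdint.h>

/* Sketch: with a known power-of-2 lane count, the index wrap becomes a
 * single AND, since idx % n == idx & (n - 1) for n a power of two. */
uint32_t wrap_index_pow2(uint32_t idx, uint32_t lanes_pow2) {
    return idx & (lanes_pow2 - 1);
}
```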

akirilov-arm commented 3 years ago

I don't assume JIT compilation in my analysis, and I use vector length-agnostic code generation; in fact, AOT compilation of WebAssembly doesn't sound like something completely out of the ordinary (still not common, though). Yes, that's arguably quite conservative, but it's usually straightforward to simplify once the assumptions are known.