WebAssembly / flexible-vectors

Vector operations for WebAssembly
https://webassembly.github.io/flexible-vectors/
Other
46 stars 6 forks source link

Masked lanes SIMD operations #66

Open yurydelendik opened 1 month ago

yurydelendik commented 1 month ago

(per conversion from 2025-07-19 SIMD meeting filing it here)

Currently LLVM toolchain generates non-efficiently-translatable-to-native shuffle operations. The shuffle operation is doing right thing, but it is really hard to produce an efficient translation to native instructions. For example (1):

    i8x16.shuffle 0 0 0 0 1 0 0 0 2 0 0 0 3 0 0 0
    v128.const i32x4 0x00000001 0x00000001 0x00000001 0x00000001
    v128.and

The shuffle operation will be encoded ad multiple instructions including loading additional constants. But it is possible to encode it more efficiently if it is known that some lanes will be discarded in the final value, e.g. we can place any lane in 'x' places 0 x x x 1 x x x 2 x x x 3 x x x and now it is possible to use couple of zero extend instructions instead. I guess the code is automatically generated by a auto-vectorizer which chose 0 as arbitrary lane.

More interesting operations:

2)

    i8x16.shuffle 0 8 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    v128.store16_lane align=1 0

3)

    i8x16.shuffle 12 13 14 15 0 0 0 0 0 0 0 0 0 0 0 0
    i16x8.extend_low_i8x16_u
    i32x4.extend_low_i16x8_u

or

    i8x16.shuffle 8 9 10 11 12 13 14 15 0 1 0 1 0 1 0 1
    i32x4.extend_low_i16x8_u

4)

    local.get 3
    local.get 3
    local.get 3
    i8x16.shuffle 8 9 10 11 12 13 14 15 0 0 0 0 0 0 0 0
    i8x16.min_u
    local.tee 3
    local.get 3
    local.get 3
    i8x16.shuffle 4 5 6 7 0 0 0 0 0 0 0 0 0 0 0 0
    i8x16.min_u
    local.tee 3
    local.get 3
    local.get 3
    i8x16.shuffle 2 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    i8x16.min_u
    local.tee 3
    local.get 3
    local.get 3
    i8x16.shuffle 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    i8x16.min_u
    i8x16.extract_lane_u 0

The common thing that if it is known that some zero-lane references are not important, it is easier to select a more performant instructions for i8x16.shuffle compiled code. If this burden falls on a toolchain/auto-vectorizer, the selected shuffle may "prefer" one CPU.

The snippets are taken from https://cdn.jsdelivr.net/npm/onnxruntime-web@1.17.1/dist/ort-wasm-simd.jsep.wasm

ppenzin commented 1 month ago

@yurydelendik are there sources for this code? +1 on representing discarded lanes.

The shuffle mask in the (1) pattern is 8-bit to 32-bit zero extend. On x86 that can be OK, but Arm doesn't have an instruction like this. Extra twist is that only the lowest bits of each 32-bit lane are preserved by and, but I can't figure out if that can help optimizing this sequence in a cross-patform way right now.

(2) seems to be equivalent to 8->16 extend followed by store lane, since all other lanes are ignored anyway. The question is what did original program intend to do here.

(3) is also 8->32 extend, I wonder why it doesn't produce the same pattern in 1 and 3. Alternative pattern is 16->32 extend (hence 0 1 for unused lanes), though for example x86 doesn't have a single instruction for this operation (Arm should, I believe)

(4) is a horizontal 8-bit unsigned min, these byte-wise shuffles are very expensive if this is lowered literally. I think we should consider not producing some of reductions like this.

@sparker-arm's recent change aimed at stabilizing patterns like (4) so that they can be pattern-matched in the engine, though I am not sure matching this long of a pattern is going to be easy.

Two potential follow-ups: representing discarded lanes and adding horizontal reductions.

sparker-arm commented 1 month ago

The snippets are taken from https://cdn.jsdelivr.net/npm/onnxruntime-web@1.17.1/dist/ort-wasm-simd.jsep.wasm

I was wondering where I could look for some reduction examples, thanks!

though I am not sure matching this long of a pattern is going to be easy.

I currently have patch in review for V8 that performs generic pattern matching for reductions and it's only a few hundred LOCs, even for i8x16. But there's still the inherent problem of relying on a specific, although sensible, shuffle pattern. I think adding horizontal operations is really the only long-term solution.