
[SIMD] wasm_v8x16_shuffle fails to generate i8x16.shuffle with the same indices #9340

Closed: huningxin closed this issue 3 years ago

huningxin commented 5 years ago

Test case

v_load_deinterleave is an OpenCV.js universal intrinsic used by popular image processing kernels, for example the color conversion cvtColor.

Inspired by its SSE2 implementation, the proposed WASM SIMD implementation looks like this:

// Compile with e.g. emcc -O2 -msimd128
#include <wasm_simd128.h>

inline v128_t wasm_unpacklo_i8x16(v128_t a, v128_t b) {
    // expect punpcklbw
    return wasm_v8x16_shuffle(a, b, 0,16,1,17,2,18,3,19,4,20,5,21,6,22,7,23);
}

inline v128_t wasm_unpackhi_i64x2(v128_t a, v128_t b) {
    // expect punpckhqdq
    return wasm_v8x16_shuffle(a, b, 8,9,10,11,12,13,14,15,24,25,26,27,28,29,30,31);
}

__attribute__((noinline))
inline void v_load_deinterleave(const unsigned char* ptr, __u8x16& a, __u8x16& b, __u8x16& c)
{
    v128_t t00 = wasm_v128_load(ptr);
    v128_t t01 = wasm_v128_load(ptr + 16);
    v128_t t02 = wasm_v128_load(ptr + 32);

    v128_t t10 = wasm_unpacklo_i8x16(t00, wasm_unpackhi_i64x2(t01, t01));
    v128_t t11 = wasm_unpacklo_i8x16(wasm_unpackhi_i64x2(t00, t00), t02);
    v128_t t12 = wasm_unpacklo_i8x16(t01, wasm_unpackhi_i64x2(t02, t02));

    v128_t t20 = wasm_unpacklo_i8x16(t10, wasm_unpackhi_i64x2(t11, t11));
    v128_t t21 = wasm_unpacklo_i8x16(wasm_unpackhi_i64x2(t10, t10), t12);
    v128_t t22 = wasm_unpacklo_i8x16(t11, wasm_unpackhi_i64x2(t12, t12));

    v128_t t30 = wasm_unpacklo_i8x16(t20, wasm_unpackhi_i64x2(t21, t21));
    v128_t t31 = wasm_unpacklo_i8x16(wasm_unpackhi_i64x2(t20, t20), t22);
    v128_t t32 = wasm_unpacklo_i8x16(t21, wasm_unpackhi_i64x2(t22, t22));

    a = wasm_unpacklo_i8x16(t30, wasm_unpackhi_i64x2(t31, t31));
    b = wasm_unpacklo_i8x16(wasm_unpackhi_i64x2(t30, t30), t32);
    c = wasm_unpacklo_i8x16(t31, wasm_unpackhi_i64x2(t32, t32));
}
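
For reference, here is a minimal sketch of how this routine might be exercised on interleaved RGB data (this driver is an illustration added for clarity, not part of the original test case; it assumes the definitions above and compilation with emcc -O2 -msimd128):

#include <stdio.h>

int main() {
    unsigned char rgb[48];                            // 16 interleaved R,G,B triplets
    for (int i = 0; i < 16; ++i) {
        rgb[3 * i + 0] = (unsigned char)(i);          // R plane values 0..15
        rgb[3 * i + 1] = (unsigned char)(i + 64);     // G plane values 64..79
        rgb[3 * i + 2] = (unsigned char)(i + 128);    // B plane values 128..143
    }

    __u8x16 r, g, b;
    v_load_deinterleave(rgb, r, g, b);

    // Each output vector should now hold one deinterleaved plane.
    printf("r[0]=%u g[0]=%u b[0]=%u\n", (unsigned)r[0], (unsigned)g[0], (unsigned)b[0]);
    return 0;
}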

Expected result

In particular, we expect wasm_v8x16_shuffle(a, b, 0,16,1,17,2,18,3,19,4,20,5,21,6,22,7,23) to generate i8x16.shuffle 0 16 1 17 2 18 3 19 4 20 5 21 6 22 7 23, which matches the kX64S8x16UnpackLow pattern in V8 and produces a punpcklbw instruction. Similarly, we expect wasm_v8x16_shuffle(a, b, 8,9,10,11,12,13,14,15,24,25,26,27,28,29,30,31) to generate i8x16.shuffle 8 9 10 11 12 13 14 15 24 25 26 27 28 29 30 31, which matches the kX64S64x2UnpackHigh pattern in V8 and produces a punpckhqdq instruction.

Actual result

However, it turns out that Emscripten compiles the above code to different v8x16.shuffle ops that fail to match these patterns, which leads V8 to generate slow pshufb instructions with memory operands.

The wasm-dis output is

(func $v_load_deinterleave\28unsigned\20char\20const*\2c\20unsigned\20char\20vector\5b16\5d&\2c\20unsigned\20char\20vector\5b16\5d&\2c\20unsigned\20char\20vector\5b16\5d&\29 (; 10 ;) (type $7) (param $0 i32) (param $1 i32) (param $2 i32) (param $3 i32)
  (local $4 v128)
  (local $5 v128)
  (local $6 v128)
  (local $7 v128)
  (local $8 v128)
  (v128.store
   (local.get $1)
   (v8x16.shuffle 0 16 1 17 2 18 3 19 4 20 5 21 6 22 7 23
    (local.tee $7
     (v8x16.shuffle 0 16 1 17 2 18 3 19 4 20 5 21 6 22 7 23
      (local.tee $8
       (v8x16.shuffle 0 16 1 17 2 18 3 19 4 20 5 21 6 22 7 23
        (local.tee $6
         (v8x16.shuffle 0 24 1 25 2 26 3 27 4 28 5 29 6 30 7 31
          (local.tee $4
           (v128.load align=1
            (local.get $0)
           )
          )
          (local.tee $5
           (v128.load offset=16 align=1
            (local.get $0)
           )
          )
         )
        )
        (v8x16.shuffle 8 9 10 11 12 13 14 15 0 0 0 0 0 0 0 0
         (local.tee $4
          (v8x16.shuffle 8 16 9 17 10 18 11 19 12 20 13 21 14 22 15 23
           (local.get $4)
           (local.tee $7
            (v128.load offset=32 align=1
             (local.get $0)
            )
           )
          )
         )
         (local.get $4)
        )
       )
      )
      (v8x16.shuffle 8 9 10 11 12 13 14 15 0 0 0 0 0 0 0 0
       (local.tee $6
        (v8x16.shuffle 8 16 9 17 10 18 11 19 12 20 13 21 14 22 15 23
         (local.get $6)
         (local.tee $5
          (v8x16.shuffle 0 24 1 25 2 26 3 27 4 28 5 29 6 30 7 31
           (local.get $5)
           (local.get $7)
          )
         )
        )
       )
       (local.get $4)
      )
     )
    )
    (v8x16.shuffle 8 9 10 11 12 13 14 15 0 0 0 0 0 0 0 0
     (local.tee $5
      (v8x16.shuffle 8 16 9 17 10 18 11 19 12 20 13 21 14 22 15 23
       (local.get $8)
       (local.tee $4
        (v8x16.shuffle 0 16 1 17 2 18 3 19 4 20 5 21 6 22 7 23
         (local.get $4)
         (v8x16.shuffle 8 9 10 11 12 13 14 15 0 0 0 0 0 0 0 0
          (local.get $5)
          (local.get $4)
         )
        )
       )
      )
     )
     (local.get $4)
    )
   )
  )
  (v128.store
   (local.get $2)
   (v8x16.shuffle 8 16 9 17 10 18 11 19 12 20 13 21 14 22 15 23
    (local.get $7)
    (local.tee $4
     (v8x16.shuffle 0 16 1 17 2 18 3 19 4 20 5 21 6 22 7 23
      (local.get $6)
      (v8x16.shuffle 8 9 10 11 12 13 14 15 0 0 0 0 0 0 0 0
       (local.get $4)
       (local.get $4)
      )
     )
    )
   )
  )
  (v128.store
   (local.get $3)
   (v8x16.shuffle 0 16 1 17 2 18 3 19 4 20 5 21 6 22 7 23
    (local.get $5)
    (v8x16.shuffle 8 9 10 11 12 13 14 15 0 0 0 0 0 0 0 0
     (local.get $4)
     (local.get $4)
    )
   )
  )
 )

In particular, v8x16.shuffle 8 9 10 11 12 13 14 15 0 0 0 0 0 0 0 0, v8x16.shuffle 8 16 9 17 10 18 11 19 12 20 13 21 14 22 15 23, and v8x16.shuffle 0 24 1 25 2 26 3 27 4 28 5 29 6 30 7 31 lead V8 to generate slow pshufb instructions with memory operands.

tlively commented 5 years ago

This is a tricky issue. This transformation is being applied in LLVM's final optimizations before code generation. The shuffle masks are being combined to reduce the total number of shuffle instructions, but LLVM does not realize that by reducing the number of shuffles it will ultimately produce worse code. To allow your code to work, we will probably have to implement a new wasm-specific shuffle builtin function that will be opaque to such optimizations. That new builtin function would then be used to implement wasm_v8x16_shuffle.
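
Roughly, the idea is something like the following (a sketch only; the builtin name __builtin_wasm_shuffle_v8x16, its signature, and the __i8x16 vector type are assumptions for illustration, and the real interface is whatever the LLVM/Emscripten patches end up defining):

// Hypothetical: an opaque target builtin that LLVM's generic shuffle
// combining cannot see through, used to implement the unpack helper.
static __inline__ v128_t wasm_unpacklo_i8x16_opaque(v128_t a, v128_t b) {
  return (v128_t)__builtin_wasm_shuffle_v8x16(
      (__i8x16)a, (__i8x16)b,
      0, 16, 1, 17, 2, 18, 3, 19, 4, 20, 5, 21, 6, 22, 7, 23);
}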

I also experimented with solving this problem using inline (web)assembly in the C code, but that had some issues. Essentially the inline assembly was too opaque to LLVM, so the output module was not valid.

Out of curiosity, are you implementing code paths tuned to ARM instructions as well?

huningxin commented 5 years ago

@tlively, thanks for your comments. Basically, I need a way to generate an exact WASM op, e.g. v8x16.shuffle with a specific mask, but it turns out wasm_v8x16_shuffle doesn't guarantee that. Do you know of any other way? You also mentioned that inline WebAssembly is not the right answer, correct?

Out of curiosity, are you implementing code paths tuned to ARM instructions as well?

OpenCV's NEON optimization uses vld3q_s8 to implement v_load_deinterleave, but I didn't find an equivalent WASM SIMD op. (Did I miss anything?) So I'd like to leverage the shuffle ops first, but I am currently blocked on finding a way to generate exact WASM SIMD shuffle ops.
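
For comparison, the NEON path is roughly the following (a sketch assuming <arm_neon.h>; vld3q_u8 is the unsigned variant of the vld3q_s8 mentioned above):

#include <arm_neon.h>

// vld3q_u8 loads 48 bytes and deinterleaves them into three 16-byte vectors
// in a single instruction (LD3 on AArch64).
inline void v_load_deinterleave_neon(const unsigned char* ptr,
                                     uint8x16_t& a, uint8x16_t& b, uint8x16_t& c)
{
    uint8x16x3_t v = vld3q_u8(ptr);
    a = v.val[0];
    b = v.val[1];
    c = v.val[2];
}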

tlively commented 5 years ago

@huningxin. Currently there is no way to do this. I am working on a fix upstream in LLVM and Emscripten so that the code you've written will lower as expected, so you won't need to do anything except update emscripten once the fix is in. I will update this issue with my progress.

huningxin commented 5 years ago

Great, thanks @tlively.

tlively commented 5 years ago

The LLVM/clang patch is https://reviews.llvm.org/D66983 and the corresponding emscripten patch is #9350.

gnzlbg commented 5 years ago

In particular, we expect wasm_v8x16_shuffle(a, b, 0,16,1,17,2,18,3,19,4,20,5,21,6,22,7,23) could generate i8x16.shuffle 0 16 1 17 2 18 3 19 4 20 5 21 6 22 7 23

Since LLVM does not guarantee that this happens, and this is not really guaranteed for any hardware intrinsic in any programming language (e.g. the Intel SSE or ARM NEON ones), this expectation is incorrect. If you need this guarantee, the language feature you are looking for is called "inline assembly", which guarantees that the exact WASM you write is what the backend emits.

What the hardware intrinsics guarantee is that they have those semantics, in this case, the same semantics as performing that shuffle. For example, if LLVM evaluated your code at compile time and emitted no WASM instructions for it at all, you wouldn't be opening this issue 😆

So what's happening here is that LLVM is optimizing two shuffle instructions into one, making your program faster and smaller for the WASM abstract machine. The issue is that V8 is dropping the ball by generating bad code for it. A different WASM->NATIVE code generator could do better, and if you open a V8 bug, maybe a future V8 version could do better too.

Does this make sense to you?

tlively commented 5 years ago

Well, combining the shuffles does make the program smaller, but without reasoning about the underlying platform it is impossible to know whether the program would become faster or slower. The WebAssembly specification does not make any mention of the relative expected performance of different instructions. While combining instructions often leads to speedups across platforms, that is clearly not the case here. So LLVM is doing this shuffle combine based on unfounded assumptions and probably should not be doing it.

gnzlbg commented 5 years ago

Well, combining the shuffles does make the program smaller, but without reasoning about the underlying platform it is impossible to know whether the program would become faster or slower.

The underlying platform is WASM; we can think of V8, WABT, etc. as different "WASM CPUs", but that might not be sufficient, since V8's performance depends drastically on whether it is generating code for x86 or ARM.

The WebAssembly specification does not make any mention of the relative expected performance of different instructions

Neither do most ISAs. Compilers use heuristics for this.

So LLVM is doing this shuffle combine based on unfounded assumptions and probably should not be doing it.

The combined operation is twice as fast as the uncombined operation on WABT, for example.

gnzlbg commented 5 years ago

I suppose this means you disagree that this is a V8 performance bug, then? Is there a V8 bug report that concluded that?

tlively commented 5 years ago

There is no V8 bug report I know of for this issue. Whether V8 would consider lowering a single shuffle into a combination of multiple shuffles is an interesting question that I do not know the answer to, but I'm skeptical of calling it a bug if they don't because there is a design decision to be made about how much code to put into optimizing shuffles.

gnzlbg commented 5 years ago

It would be fair for V8 to decide that generating sub-optimal x86 instructions for this mask is an acceptable trade-off for their engine. But that answer would already convey that they consider the current performance of the code that LLVM generates acceptable.

huningxin commented 5 years ago

Since LLVM does not guarantee that this happens, and this is not really guaranteed for any hardware intrinsic on any programming language (e.g. the Intel SEE or ARM NEON ones), this expectation is incorrect.

My expectation comes from using Intel SSE intrinsics that indicate the corresponding instruction. For example, _mm_shuffle_epi8 indicates pshufb.
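
For instance, a byte shuffle like the following (a small illustration, not taken from the OpenCV code) is documented with pshufb as its instruction:

#include <tmmintrin.h>   // SSSE3

// Reverses the 16 bytes of v; the intrinsics guide lists pshufb for
// _mm_shuffle_epi8.
static inline __m128i reverse_bytes(__m128i v) {
    const __m128i mask = _mm_set_epi8(0, 1, 2, 3, 4, 5, 6, 7,
                                      8, 9, 10, 11, 12, 13, 14, 15);
    return _mm_shuffle_epi8(v, mask);
}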

gnzlbg commented 5 years ago

@huningxin

My expectation comes from using Intel SSE intrinsics that indicate the corresponding instruction. For example, _mm_shuffle_epi8 indicates pshufb.

Those instructions are "examples" of what the compiler might generate; not even the Intel compiler commits to generating them. For example, the intrinsics documentation for _mm_add_epi32 lists paddd as its instruction, but this example:

#include <emmintrin.h>

__m128i foo() { return _mm_add_epi32(_mm_set1_epi32(1), _mm_set1_epi32(1)); }

does not generate a padd in the Intel compiler, or clang, or gcc, or any other (proof: https://gcc.godbolt.org/z/a7Zvl5). So if the claim is that the intrinsics documentation requires that instruction, then all of those compilers have a bug.

The intrinsics documentation has an "operational semantics" specification of what an intrinsic does, and that's the only thing the spec requires (in this case, that it adds two vectors). Which instruction the compiler then uses for that, is up to the compiler. In this case, the instruction can be constant folded, so no padd instruction is necessary.

As a user of those compilers, if one of them generated a padd here I would file a compiler bug, because the result can be computed at compile time and the instruction is unnecessary. In the same vein, if LLVM were to generate multiple shuffles here, I would file a compiler bug as well.

Intel, GCC, Clang, ... offer an "inline assembly" feature that lets you tell the compiler: "generate a padd here", and then these optimizations are not performed. So if those are the semantics you need, then the intrinsics are not the correct feature to use, and you should use inline assembly instead.
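
As a concrete illustration (a sketch using GCC/Clang extended asm on x86, not something from this thread), inline assembly is what pins the exact instruction:

#include <emmintrin.h>

// Forces a paddd regardless of what the optimizer could otherwise do
// (e.g. constant folding).
static inline __m128i add_epi32_forced(__m128i a, __m128i b) {
    __asm__("paddd %1, %0" : "+x"(a) : "x"(b));
    return a;
}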

For most users, however, the intrinsics are the right feature to use, because they are a performance feature that benefits from compiler optimizations, and most users want that.

V8 being slow here is unfortunate, but disabling LLVM optimizations or giving the intrinsics "inline assembly" semantics is not the answer. V8 has all the information required to generate optimal code, so the ideal fix is for it to do so. That isn't necessarily trivial, but unless your WASM module consists almost exclusively of SIMD shuffles, I can't imagine generating optimal code here impacting real-world WASM modules negatively. It might well be that they prefer not to introduce the required complexity into their engine, and that might be fair, but then, if you need good WASM SIMD performance, you should use a different engine or accept V8's trade-offs.

Does that make sense?

huningxin commented 4 years ago

I happened to measure the slowdown of v_load_deinterleave by comparing builds with and without @tlively's patch:

The LLVM/clang patch is https://reviews.llvm.org/D66983 and the corresponding emscripten patch is #9350.

The WASM test case is https://github.com/huningxin/emscripten-wasm-test/blob/master/v_load_deinterleave_wasm.cc. I also measured native SSE performance with the same version of clang that Emscripten uses. The SSE test case is https://github.com/huningxin/emscripten-wasm-test/blob/master/v_load_deinterleave_sse.cc.

Here are the results on my machine with an i7-8700 CPU @ 3.2GHz. The WASM tests were run on V8 7.8.0 @8567ac2.

compiler                 elapsed time (seconds)
vanilla emcc 1.38.42     7.927000
patched emcc 1.38.42     1.01000
clang 10.0.0 @edfaee0    0.919862

It turns out to be a 7.9x slowdown for me.
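
For context, the measurement amounts to a loop of the following shape (an illustration only; the actual benchmark code is in the linked v_load_deinterleave_*.cc files):

#include <stdio.h>
#include <time.h>

int main() {
    static unsigned char buf[48 * 1024];                  // interleaved input
    for (unsigned i = 0; i < sizeof(buf); ++i) buf[i] = (unsigned char)i;

    __u8x16 a, b, c;
    unsigned checksum = 0;                                // keeps the calls live
    clock_t start = clock();
    for (int iter = 0; iter < 100000; ++iter) {
        for (unsigned i = 0; i + 48 <= sizeof(buf); i += 48) {
            v_load_deinterleave(buf + i, a, b, c);
            checksum += (unsigned)a[0] + (unsigned)b[0] + (unsigned)c[0];
        }
    }
    double secs = (double)(clock() - start) / CLOCKS_PER_SEC;
    printf("elapsed: %f s (checksum %u)\n", secs, checksum);
    return 0;
}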

omnisip commented 4 years ago

That's significant.

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because there has been no activity in the past year. It will be closed automatically if no further activity occurs in the next 30 days. Feel free to re-open at any time if this issue is still relevant.

tlively commented 3 years ago

This was fixed upstream.