This is a tricky issue. This transformation is being applied in LLVM's final optimizations before code generation. The shuffle masks are being combined to reduce the total number of shuffle instructions, but LLVM does not realize that by reducing the number of shuffles it will ultimately produce worse code. To allow your code to work, we will probably have to implement a new wasm-specific shuffle builtin function that will be opaque to such optimizations. That new builtin function would then be used to implement `wasm_v8x16_shuffle`.
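For context, the reason a generic LLVM optimization applies here at all is that the intrinsic currently expands to Clang's generic shuffle builtin. A paraphrased sketch (not the literal contents of `wasm_simd128.h`; `__u8x16` stands in for the header's internal vector typedef):

```c
/* Paraphrased: wasm_v8x16_shuffle currently lowers to Clang's generic
   __builtin_shufflevector, which LLVM's shuffle combiner is free to
   rewrite. The new wasm-specific builtin would replace this expansion
   with one that is opaque to those optimizations. */
#define wasm_v8x16_shuffle(a, b, c0, c1, c2, c3, c4, c5, c6, c7,           \
                           c8, c9, c10, c11, c12, c13, c14, c15)           \
  ((v128_t)__builtin_shufflevector((__u8x16)(a), (__u8x16)(b),             \
                                   c0, c1, c2, c3, c4, c5, c6, c7,         \
                                   c8, c9, c10, c11, c12, c13, c14, c15))
```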
I also experimented with solving this problem using inline (web)assembly in the C code, but that had some issues. Essentially the inline assembly was too opaque to LLVM, so the output module was not valid.
Out of curiosity, are you implementing code paths tuned to ARM instructions as well?
@tlively, thanks for your comments. Basically, I need a way to generate an exact WASM op, e.g. `v8x16.shuffle` with a certain mask. However, it turns out `wasm_v8x16_shuffle` doesn't guarantee that. Do you know of any other way? You also mentioned inline WebAssembly is not the right answer, is it?
> Out of curiosity, are you implementing code paths tuned to ARM instructions as well?
According to the NEON optimization of OpenCV, it uses `vld3q_s8` to implement `v_load_deinterleave`. However, I didn't find an equivalent WASM SIMD op. (Did I miss anything?) So I'd leverage the shuffle ops first, but I am now blocked on finding a way to generate exact WASM SIMD shuffle ops.
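For reference, a minimal sketch of the NEON path mentioned above (a hypothetical helper, not OpenCV's actual code):

```c
#include <arm_neon.h>

/* Deinterleave 48 packed bytes (e.g. RGBRGB...) into three vectors.
   vld3q_s8 maps to a single vld3 instruction on ARM; WASM SIMD has no
   direct equivalent, hence the shuffle-based approach. */
void load_deinterleave_neon(const int8_t* ptr,
                            int8x16_t* a, int8x16_t* b, int8x16_t* c) {
  int8x16x3_t v = vld3q_s8(ptr);
  *a = v.val[0];
  *b = v.val[1];
  *c = v.val[2];
}
```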
@huningxin. Currently there is no way to do this. I am working on a fix upstream in LLVM and Emscripten so that the code you've written will lower as expected, so you won't need to do anything except update emscripten once the fix is in. I will update this issue with my progress.
great, thanks @tlively .
The LLVM/clang patch is https://reviews.llvm.org/D66983 and the corresponding emscripten patch is #9350.
> In particular, we expect `wasm_v8x16_shuffle(a, b, 0,16,1,17,2,18,3,19,4,20,5,21,6,22,7,23)` could generate `i8x16.shuffle 0 16 1 17 2 18 3 19 4 20 5 21 6 22 7 23`
Since LLVM does not guarantee that this happens, and this is not really guaranteed for any hardware intrinsic in any programming language (e.g. the Intel SSE or ARM NEON ones), this expectation is incorrect. If you need this guarantee, the language feature you are looking for is called "inline assembly", which guarantees that the exact WASM you write will be generated by the backend.
What the hardware intrinsics guarantee is that they have those semantics, in this case, the same as those of performing that shuffle. For example, if by using the intrinsics LLVM were able to evaluate your code at compile time and emit no WASM instructions for this at all, you wouldn't be opening this issue 😆
So what's happening here is that LLVM is optimizing two shuffle instructions into one, making your program faster and smaller for the WASM abstract machine. The issue is that V8 is dropping the ball by generating bad code for it. A different WASM->NATIVE code generator could do better, and if you open a V8 bug, maybe a future V8 version could do better too.
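For illustration only (a hypothetical pair of shuffles, not the exact code from the test case), this is the kind of fold in question:

```c
#include <wasm_simd128.h>

/* Move the high half of `a` down, then interleave it with `b`. LLVM may
   fold both calls into a single v8x16.shuffle with the combined mask
   8,16,9,17,...,15,23, which no longer matches V8's unpack patterns and
   is lowered to slower pshufb-based code instead. */
v128_t interleave_high_bytes(v128_t a, v128_t b) {
  v128_t hi = wasm_v8x16_shuffle(a, a, 8, 9, 10, 11, 12, 13, 14, 15,
                                       8, 9, 10, 11, 12, 13, 14, 15);
  return wasm_v8x16_shuffle(hi, b, 0, 16, 1, 17, 2, 18, 3, 19,
                                   4, 20, 5, 21, 6, 22, 7, 23);
}
```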
Does this make sense to you?
Well, combining the shuffles does make the program smaller, but without reasoning about the underlying platform it is impossible to know whether the program would become faster or slower. The WebAssembly specification does not make any mention of the relative expected performance of different instructions. While combining instructions often leads to speedups across platforms, that is clearly not the case here. So LLVM is doing this shuffle combine based on unfounded assumptions and probably should not be doing it.
> Well, combining the shuffles does make the program smaller, but without reasoning about the underlying platform it is impossible to know whether the program would become faster or slower.
The underlying platform is WASM. We can think of V8, WABT, etc. as different "WASM CPUs", but even that might not be sufficient, since V8's performance depends drastically on whether it is generating code for x86 or ARM.
> The WebAssembly specification does not make any mention of the relative expected performance of different instructions
Neither do most ISAs. Compilers use heuristics for this.
> So LLVM is doing this shuffle combine based on unfounded assumptions and probably should not be doing it.
The combined operation is twice as fast as the uncombined operation on WABT, for example.
I suppose this means you disagree that this is a V8 performance bug, then? Is there a V8 bug report for this that concluded that?
There is no V8 bug report I know of for this issue. Whether V8 would consider lowering a single shuffle into a combination of multiple shuffles is an interesting question that I do not know the answer to, but I'm skeptical of calling it a bug if they don't because there is a design decision to be made about how much code to put into optimizing shuffles.
It would be fair for V8 to decide that generating sub-optimal x86 instructions for this mask is an acceptable trade-off for their engine. But that answer would already convey that they consider the current performance of the code that LLVM generates acceptable.
> Since LLVM does not guarantee that this happens, and this is not really guaranteed for any hardware intrinsic in any programming language (e.g. the Intel SSE or ARM NEON ones), this expectation is incorrect.
My expectation comes from using Intel SSE intrinsics that indicate the corresponding instruction. For example, `_mm_shuffle_epi8` indicates `pshufb`.
@huningxin
> My expectation comes from using Intel SSE intrinsics that indicate the corresponding instruction. For example, `_mm_shuffle_epi8` indicates `pshufb`.
Those instructions are "examples" of what the compiler might generate. Not even the Intel compiler commits to generating that. For example, `_mm_add_epi32` says on that website that its instruction is a `padd`, but this example:
__m128i foo() { return _mm_add_epi32(_mm_set1_epi32(1), _mm_set1_epi32(1)); }
does not generate a `padd` in the Intel compiler, or clang, or gcc, or any other (proof: https://gcc.godbolt.org/z/a7Zvl5). So if the claim is that the intrinsic documentation requires that, then all those compilers have a bug.
The intrinsics documentation has an "operational semantics" specification of what an intrinsic does, and that's the only thing the spec requires (in this case, that it adds two vectors). Which instruction the compiler then uses for that is up to the compiler. In this case, the operation can be constant folded, so no `padd` instruction is necessary.
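In other words (a sketch of the equivalence, not actual compiler output), the example above behaves as if it were written as:

```c
#include <emmintrin.h>

/* After constant folding, foo() is equivalent to returning the precomputed
   vector {2, 2, 2, 2}; no padd is needed at runtime. */
__m128i foo_folded() { return _mm_set1_epi32(2); }
```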
As a user of those compilers, if they were to generate a `padd` here, I would file a compiler bug, because constant folding the operation away is clearly the better output. In the same vein, if LLVM were to generate multiple shuffles here, I would file a compiler bug as well.
Intel, GCC, Clang, ... offer an "inline assembly" feature that lets you tell the compiler "generate a `padd` here", and then these optimizations are not performed. So if those are the semantics you need, then the intrinsics are not the correct feature to use, and you should use inline assembly instead.
For most users, however, the intrinsics are the right feature to use, because it's a performance feature that benefits from compiler optimizations, and most users want that.
V8 being slow here is unfortunate, but disabling LLVM optimizations or turning the intrinsics into "inline assembly" semantics is not the answer. V8 has all the information required to generate optimal code, so the ideal fix is for it to just do so. That isn't necessarily trivial, but unless your WASM module is comprised almost exclusively of SIMD shuffles, I can't imagine how generating optimal code would impact real-world WASM modules negatively. It might well be that they prefer to not introduce the complexity required to do that in their engine, and that might be fair, but then if you need good WASM SIMD performance, you should use a different engine, or accept V8's trade-offs.
Does that make sense?
I happened to measure the slowdown of `v_load_deinterleave` by applying @tlively's patch:
> The LLVM/clang patch is https://reviews.llvm.org/D66983 and the corresponding emscripten patch is #9350.
The wasm test case is https://github.com/huningxin/emscripten-wasm-test/blob/master/v_load_deinterleave_wasm.cc. I also measured the native SSE performance with the same version of clang as in emscripten. The SSE test case is https://github.com/huningxin/emscripten-wasm-test/blob/master/v_load_deinterleave_sse.cc.
Here are the results on my machine with an i7-8700 CPU @ 3.2GHz. The wasm test is run on V8 7.8.0 @8567ac2.
| compiler | elapsed time (seconds) |
|---|---|
| vanilla emcc 1.38.42 | 7.927000 |
| patched emcc 1.38.42 | 1.01000 |
| clang 10.0.0 @edfaee0 | 0.919862 |
It turns out to be a 7.9X slowdown for me.
That's significant.
This issue has been automatically marked as stale because there has been no activity in the past year. It will be closed automatically if no further activity occurs in the next 30 days. Feel free to re-open at any time if this issue is still relevant.
This was fixed upstream.
Test case
`v_load_deinterleave` is an OpenCV.js universal intrinsic that is used by popular image processing kernels, for example color conversion (`cvtColor`). Inspired by its SSE2 implementation, the proposed WASM SIMD implementation is built on `v8x16.shuffle` ops, along the lines of the sketch below.
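A minimal sketch of that proposal (an assumed reconstruction, not the actual OpenCV.js code), using the two shuffles referenced in the expected result:

```c
#include <wasm_simd128.h>

/* One deinterleave step built from two shuffles, using the masks from the
   "Expected result" section below. */
void deinterleave_step(v128_t a, v128_t b, v128_t* lo, v128_t* hi) {
  /* expected to match V8's kX64S8x16UnpackLow pattern (punpcklbw) */
  *lo = wasm_v8x16_shuffle(a, b, 0, 16, 1, 17, 2, 18, 3, 19,
                                 4, 20, 5, 21, 6, 22, 7, 23);
  /* expected to match V8's kX64S64x2UnpackHigh pattern (punpckhqdq) */
  *hi = wasm_v8x16_shuffle(a, b, 8, 9, 10, 11, 12, 13, 14, 15,
                                 24, 25, 26, 27, 28, 29, 30, 31);
}
```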
Expected result
In particular, we expect `wasm_v8x16_shuffle(a, b, 0,16,1,17,2,18,3,19,4,20,5,21,6,22,7,23)` could generate `i8x16.shuffle 0 16 1 17 2 18 3 19 4 20 5 21 6 22 7 23`, which matches the kX64S8x16UnpackLow pattern in V8 and generates a `punpcklbw` instruction. Similarly, we expect `wasm_v8x16_shuffle(a, b, 8,9,10,11,12,13,14,15,24,25,26,27,28,29,30,31)` could generate `i8x16.shuffle 8 9 10 11 12 13 14 15 24 25 26 27 28 29 30 31`, which matches the kX64S64x2UnpackHigh pattern in V8 and generates a `punpckhqdq` instruction.
Actual result
However, it turns out emscripten compiles the above code to different `v8x16.shuffle` wasm ops that fail to match those patterns and lead V8 to generate slow `pshufb` instructions with memory operands.
In particular, the wasm-dis output contains `v8x16.shuffle 8,9,10,11,12,13,14,15,0,0,0,0,0,0,0,0`, `v8x16.shuffle 8,16,9,17,10,18,11,19,12,20,13,21,14,22,15,23`, and `v8x16.shuffle 0,24,1,25,2,26,3,27,4,28,5,29,6,30,7,31`, which lead V8 to generate slow `pshufb` with memory operands.