WebAssembly / simd

Branch of the spec repo scoped to discussion of SIMD in WebAssembly

Extended pairwise addition instructions #380

Closed · Maratyszcza closed this 3 years ago

Maratyszcza commented 3 years ago

Introduction

This PR introduces an extended pairwise addition operation that computes extended (widened) sums of adjacent pairs of lanes of a SIMD vector, producing a SIMD vector with half as many lanes of twice the width. This operation arises naturally when doing partial reductions in fixed-point algorithms, maps nicely to a single instruction on ARM, and can be simulated with just 1-4 instructions on SSE4+ x86.
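For reference, the four operations in this PR, written here with the instruction names that later shipped in the spec (the PR text itself may use different mnemonics), together with their per-lane semantics:

        i16x8.extadd_pairwise_i8x16_s   ; y[i] = sext(x[2*i]) + sext(x[2*i+1])
        i16x8.extadd_pairwise_i8x16_u   ; y[i] = zext(x[2*i]) + zext(x[2*i+1])
        i32x4.extadd_pairwise_i16x8_s   ; y[i] = sext(x[2*i]) + sext(x[2*i+1])
        i32x4.extadd_pairwise_i16x8_u   ; y[i] = zext(x[2*i]) + zext(x[2*i+1])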

Applications

Mapping to Common Instruction Sets

This section illustrates how the new WebAssembly instructions can be lowered on common instruction sets. These patterns are provided only for convenience; compliant WebAssembly implementations do not have to follow the same code generation patterns.

x86/x86-64 processors with XOP instruction set
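Each of the four operations maps to a single XOP instruction. A sketch (assuming xmm_x holds the input and xmm_y receives the result; not necessarily the exact sequences proposed in the PR):

        vphaddbw  xmm_y, xmm_x   ; i16x8.extadd_pairwise_i8x16_s
        vphaddubw xmm_y, xmm_x   ; i16x8.extadd_pairwise_i8x16_u
        vphaddwd  xmm_y, xmm_x   ; i32x4.extadd_pairwise_i16x8_s
        vphadduwd xmm_y, xmm_x   ; i32x4.extadd_pairwise_i16x8_u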

x86/x86-64 processors with AVX instruction set
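A plausible AVX lowering uses the multiply-by-one trick with vpmaddubsw/vpmaddwd. A sketch, reusing the register names and splat-constant notation from the snippets later in this thread (not necessarily the exact sequences proposed in the PR):

        ; i16x8.extadd_pairwise_i8x16_s (vpmaddubsw treats its first source as unsigned)
        vmovdqa    xmm_tmp, [wasm_i8x16_splat(0x01)]
        vpmaddubsw xmm_y, xmm_tmp, xmm_x

        ; i16x8.extadd_pairwise_i8x16_u (the unsigned input multiplies signed ones)
        vpmaddubsw xmm_y, xmm_x, [wasm_i8x16_splat(0x01)]

        ; i32x4.extadd_pairwise_i16x8_s
        vpmaddwd   xmm_y, xmm_x, [wasm_i16x8_splat(1)]

        ; i32x4.extadd_pairwise_i16x8_u (bias to signed, madd, then un-bias the pair sums)
        vpxor      xmm_y, xmm_x, [wasm_i16x8_splat(0x8000)]
        vpmaddwd   xmm_y, xmm_y, [wasm_i16x8_splat(1)]
        vpaddd     xmm_y, xmm_y, [wasm_i32x4_splat(0x00010000)]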

x86/x86-64 processors with SSSE3 instruction set
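With SSSE3, the 8-bit variants can use pmaddubsw without clobbering the input; the 16-bit variants fall back to the SSE2 sequences below. A sketch (same conventions and caveats as above):

        ; i16x8.extadd_pairwise_i8x16_s (the destination holds the unsigned ones)
        movdqa    xmm_y, [wasm_i8x16_splat(0x01)]
        pmaddubsw xmm_y, xmm_x

        ; i16x8.extadd_pairwise_i8x16_u (the destination holds the unsigned input)
        movdqa    xmm_y, xmm_x
        pmaddubsw xmm_y, [wasm_i8x16_splat(0x01)]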

x86/x86-64 processors with SSE2 instruction set
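Without pmaddubsw, the 8-bit variants can be emulated with mask-and-shift sequences analogous to the 16-bit snippets later in this thread, and pmaddwd covers the signed 16-bit case. A sketch (same conventions and caveats as above):

        ; i16x8.extadd_pairwise_i8x16_u
        movdqa  xmm_y, xmm_x
        movdqa  xmm_tmp, [wasm_i16x8_splat(0x00ff)]
        pand    xmm_tmp, xmm_y   ; even (low) byte lanes, zero-extended
        psrlw   xmm_y, 8         ; odd (high) byte lanes, zero-extended
        paddw   xmm_y, xmm_tmp

        ; i16x8.extadd_pairwise_i8x16_s
        movdqa  xmm_tmp, xmm_x
        movdqa  xmm_y, xmm_x
        psllw   xmm_tmp, 8       ; move even lanes to the high byte
        psraw   xmm_tmp, 8       ; sign-extend even lanes
        psraw   xmm_y, 8         ; sign-extend odd lanes
        paddw   xmm_y, xmm_tmp

        ; i32x4.extadd_pairwise_i16x8_s
        movdqa  xmm_y, xmm_x
        pmaddwd xmm_y, [wasm_i16x8_splat(1)]

        ; i32x4.extadd_pairwise_i16x8_u: see the mask-and-shift snippet later in this thread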

ARM64 processors
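On ARM64 each operation is a single instruction. A sketch (v_x input, v_y output):

        saddlp v_y.8h, v_x.16b   // i16x8.extadd_pairwise_i8x16_s
        uaddlp v_y.8h, v_x.16b   // i16x8.extadd_pairwise_i8x16_u
        saddlp v_y.4s, v_x.8h    // i32x4.extadd_pairwise_i16x8_s
        uaddlp v_y.4s, v_x.8h    // i32x4.extadd_pairwise_i16x8_u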

ARMv7 processors with NEON instruction set
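Likewise a single instruction on ARMv7 NEON. A sketch (q_x input, q_y output):

        vpaddl.s8  q_y, q_x   @ i16x8.extadd_pairwise_i8x16_s
        vpaddl.u8  q_y, q_x   @ i16x8.extadd_pairwise_i8x16_u
        vpaddl.s16 q_y, q_x   @ i32x4.extadd_pairwise_i16x8_s
        vpaddl.u16 q_y, q_x   @ i32x4.extadd_pairwise_i16x8_u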

omnisip commented 3 years ago

@Maratyszcza You beat me to it! :-) I was just about finished with my own proposal on this. ;-)

omnisip commented 3 years ago

@Maratyszcza This might be a little bit more performant for the i16x8->i32x4 variant. It's also going to work on SSE2.

Unsigned:

        movdqa  xmm_x, xmm0                              ; copy the input so xmm0 is not clobbered
        movdqa  xmm_tmp, [wasm_i32x4_splat(0x0000ffff)]  ; mask for the low 16 bits of each dword
        pand    xmm_tmp, xmm_x                           ; even (low) 16-bit lanes, zero-extended
        psrld   xmm_x, 16                                ; odd (high) 16-bit lanes, zero-extended
        paddd   xmm_x, xmm_tmp                           ; pairwise sums in 32-bit lanes

Signed:

        movdqa  xmm_tmp, xmm0     ; copy the input
        movdqa  xmm_out, xmm0     ; second copy, so xmm0 is not clobbered
        pslld   xmm_tmp, 16       ; move even 16-bit lanes to the high half of each dword
        psrad   xmm_tmp, 16       ; arithmetic shift back: even lanes, sign-extended
        psrad   xmm_out, 16       ; odd lanes, sign-extended
        paddd   xmm_out, xmm_tmp  ; pairwise sums in 32-bit lanes

Maratyszcza commented 3 years ago

@omnisip Your snippets destroy the input x value, which WAsm engines are not allowed to do in general.

omnisip commented 3 years ago

@omnisip Your snippets destroy the input x value, which WAsm engines are not allowed to do in general.

Got it. I corrected the examples above and added a movdqa to prevent clobbering an input register.

ngzhian commented 3 years ago

https://crrev.com/c/2513872 prototypes this for arm64.

Maratyszcza commented 3 years ago

I evaluated this proposal on the O(N^3) part of the 3-Point Correlation computation, described here. The use of extended pairwise addition instructions is guarded by the USE_EXTPADD macro. Results on ARM64 systems are presented below:

| Processor (Device) | USE_EXTPADD=0 | USE_EXTPADD=1 | Speedup |
|---|---|---|---|
| Snapdragon 670 (Pixel 3a) | 188 us | 164 us | 15% |
| Exynos 8895 (Galaxy S8) | 125 us | 114 us | 10% |

As evidenced by the above results, even though the algorithm is not particularly rich in pairwise summations, the extended pairwise addition instructions bring a noticeable speedup.

Maratyszcza commented 3 years ago

Results on x86-64 systems are presented below:

| Processor (Device) | USE_EXTPADD=0 | USE_EXTPADD=1 | Speedup |
|---|---|---|---|
| AMD Pro A10-8700B | 93 us | 93 us | 0% |
| AMD A4-7210 | 303 us | 300 us | 1% |
| Intel Xeon W-2135 | 66 us | 66 us | 0% |
| Intel Celeron N3060 | 396 us | 396 us | 0% |

Unlike the ARM64 results, the x86-64 results show mostly unchanged performance. However, x86-64 performance is currently hindered by suboptimal loading of constants in V8 and can be expected to improve in the future.

penzn commented 3 years ago

I have concerns with both the use cases and the speedup. From my point of view, needing to implement code from an unrelated paper raises some questions about the feasibility of the use cases listed above. Secondly, in addition to being flat on one platform, this shows low double-digit microsecond gains on the other, while the whole run is well under half a millisecond on both; I don't think this would make a noticeable difference in production (and with runs this short I am not sure we can make accurate measurements). I think we ought to hold instructions to somewhat higher standards for inclusion this late in the game.

Maratyszcza commented 3 years ago

@penzn I disagree that N-point correlation is not a suitable workload for evaluation, but I nonetheless ran another experiment, within XNNPACK, to evaluate these instructions. In the experiment I added two new variants of the 8-bit fixed-point GEMM/IGEMM microkernels: one using Extended Multiplication instructions in combination with the proposed Extended Pairwise Addition instructions, and another using the Extended Multiplication instructions in combination with the widening and addition instructions (not equivalent to extended pairwise addition in general, but it works in this particular case and is cheaper than emulating the extended pairwise addition operation). In these microkernels the extended pairwise addition instructions are in the hot loop, so their effect on end-to-end performance is more pronounced than in the N-point correlation kernels. Results for end-to-end inference performance are presented in the table below:

| Processor (Device) | NN Model | Time with widen+add | Time with extended pairwise addition | Speedup |
|---|---|---|---|---|
| AMD PRO A10-8700B | MN v1 | 130 ms | 84 ms | 55% |
| AMD PRO A10-8700B | MN v2 | 80 ms | 56 ms | 43% |
| AMD A4-7210 | MN v1 | 323 ms | 232 ms | 39% |
| AMD A4-7210 | MN v2 | 203 ms | 155 ms | 31% |
| Intel Xeon W-2135 | MN v1 | 91 ms | 46 ms | 98% |
| Intel Xeon W-2135 | MN v2 | 55 ms | 31 ms | 77% |
| Intel Celeron N3060 | MN v1 | 420 ms | 431 ms | -3% |
| Intel Celeron N3060 | MN v2 | 266 ms | 267 ms | 0% |
| Qualcomm Snapdragon 670 (Pixel 3a) | MN v1 | 203 ms | 137 ms | 48% |
| Qualcomm Snapdragon 670 (Pixel 3a) | MN v2 | 135 ms | 101 ms | 34% |
| Samsung Exynos 8895 (Galaxy S8) | MN v1 | 224 ms | 116 ms | 93% |
| Samsung Exynos 8895 (Galaxy S8) | MN v2 | 145 ms | 89 ms | 63% |

Maratyszcza commented 3 years ago

The source code changes for the last experiment can be seen here.

penzn commented 3 years ago

@penzn I disagree that N-point correlation is not a suitable workload for evaluation

I don't think I ever said that - it is a fine test; what I meant is that it is something different from the use cases listed in the PR description. XNNPACK is a much better example - thank you for that!

dtig commented 3 years ago

Adding a preliminary vote for the inclusion of extended pairwise addition operations in the SIMD proposal below. Please vote with:

👍 for including extended pairwise addition operations
👎 against including extended pairwise addition operations

fbarchard commented 3 years ago

In native ARM I've used paddl and padal a lot. They're very fast and useful. On Intel I've used the trick of vpmaddubsw with constants of 1 to achieve the same; that instruction is very fast, even with the redundant multiply. Ideally we'd have the accumulate version, but paddl is still useful, followed by an add. An example of using it on an image is a 2x2 downsample (see the sketch below): paddl adds values horizontally in pairs; then add the results of the two rows and rounding-shift by 2 to get the average of the 4 pixels. It can also be used to accumulate values, such as popcounts.
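A minimal NEON sketch of the 2x2 downsample described above, assuming q0 and q1 each hold 16 bytes from two adjacent rows:

        vpaddl.u8 q0, q0        @ horizontal pair sums of row 0 (16-bit lanes)
        vpaddl.u8 q1, q1        @ horizontal pair sums of row 1
        vadd.i16  q0, q0, q1    @ 2x2 block sums
        vrshr.u16 q0, q0, #2    @ rounding shift by 2: average of the 4 pixels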

penzn commented 3 years ago

I am not 100% convinced - there are drops in performance on some platforms, plus these are, again, short microkernels.

Maratyszcza commented 3 years ago

plus this is, again, short microkernels.

The XNNPACK results are end-to-end

penzn commented 3 years ago

The XNNPACK results are end-to-end

@Maratyszcza sorry, "microkernels" threw me off, you did mention they were end to end.

Maratyszcza commented 3 years ago

Updated end-to-end results in XNNPACK on x86-64 to reflect the recent optimizations in V8.

ngzhian commented 3 years ago

The XNNPACK benchmark only uses __builtin_wasm_extadd_pairwise_i16x8_s_i32x4; I think we can probably expect similar (good) speedups from the other 3 instructions. I see examples of the other 3 instructions in the listed use cases too.

Side note: the naming of this intrinsic seems off? Should it be __builtin_wasm_extadd_pairwise_i32x4_i16x8_s instead? (dst shape first).

Maratyszcza commented 3 years ago

The XNNPACK benchmark only uses __builtin_wasm_extadd_pairwise_i16x8_s_i32x4; I think we can probably expect similar (good) speedups from the other 3 instructions. I see examples of the other 3 instructions in the listed use cases too.

The 3-Point Correlation experiment used __builtin_wasm_extadd_pairwise_i8x16_u_i16x8 and __builtin_wasm_extadd_pairwise_i16x8_u_i32x4.

Side note: the naming of this intrinsic seems off? Should it be __builtin_wasm_extadd_pairwise_i32x4_i16x8_s instead? (dst shape first).

cc @tlively

tlively commented 3 years ago

Don't worry about that name. The __builtin_wasm_* functions are not intended to be user-facing and follow their own weird naming conventions. We will add properly named user-facing intrinsics eventually if we approve these instructions.