Is the i8x16 variant of popcnt widely used in the implementations of these applications? If so, can you share any implementation references that use it on their critical path? The emulation on IA for this instruction will be very expensive, considering that AVX or AVX512 availability is not the baseline for the SIMD128 MVP. If we have good evidence that this instruction variant is necessary for real-world implementations, I suggest we consider it in SIMD post-MVP.
This seems to be a risky addition to the current MVP given its non-trivial architecture support.
The applications are insensitive to the element size of the population count instruction: the maximum output for byte inputs is 8, which enables enough accumulation using packed 8-bit SIMD to amortize the overhead of converting to a wider data type. I chose 8-bit elements for the instruction as it is the most portable, but another option would be to support all variants (8/16/32/64-bit). This repository by @WojciechMula provides many implementations for population count, and the Larq compute engine uses the 8-bit population count on the critical path.
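As a rough sketch of what this accumulation pattern could look like (my own illustration, not taken from any of the implementations linked above; it assumes the wasm_simd128.h intrinsics and the __builtin_wasm_popcnt_i8x16 builtin that appears later in this thread):

#include <wasm_simd128.h>
#include <stdint.h>
#include <stddef.h>

// Rough sketch: count the set bits in n 16-byte blocks. Per-byte counts (at most
// 8 per lane) are accumulated in 8-bit lanes for up to 31 iterations (31 * 8 = 248
// fits in a byte) before being widened, amortizing the conversion to wider lanes.
static uint32_t popcount_bytes_sketch(const uint8_t* data, size_t n) {
  const v128_t mask_00FF = wasm_i32x4_const(0x00FF00FF, 0x00FF00FF, 0x00FF00FF, 0x00FF00FF);
  const v128_t mask_FFFF = wasm_i32x4_const(0x0000FFFF, 0x0000FFFF, 0x0000FFFF, 0x0000FFFF);
  v128_t acc32 = wasm_i32x4_const(0, 0, 0, 0);
  while (n != 0) {
    size_t block = n < 31 ? n : 31;
    n -= block;
    v128_t acc8 = wasm_i32x4_const(0, 0, 0, 0);
    do {
      acc8 = wasm_i8x16_add(acc8, __builtin_wasm_popcnt_i8x16(wasm_v128_load(data)));
      data += 16;
    } while (--block != 0);
    // Widen the 8-bit partial sums to 16 and then 32 bits (shift + mask).
    const v128_t acc16 = wasm_i16x8_add(wasm_u16x8_shr(acc8, 8), wasm_v128_and(acc8, mask_00FF));
    acc32 = wasm_i32x4_add(acc32, wasm_i32x4_add(wasm_u32x4_shr(acc16, 16), wasm_v128_and(acc16, mask_FFFF)));
  }
  // Horizontal reduction of the four 32-bit lanes.
  acc32 = wasm_i32x4_add(acc32, wasm_v32x4_shuffle(acc32, acc32, 2, 3, 0, 1));
  return wasm_i32x4_extract_lane(acc32, 0) + wasm_i32x4_extract_lane(acc32, 1);
}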
I suggest we follow the agreed-upon criteria for evaluating instructions at Stage 3, and if they are demonstrably faster than emulation sequences on both x86-64 and ARM64, accept them into the SIMD spec.
Another key application is computational biology and genetics, with, as an example, two biologists exploring the popcount rabbit hole:
> I suggest we follow the agreed-upon criteria for evaluating instructions at Stage 3, and if they are demonstrably faster than emulation sequences on both x86-64 and ARM64, accept them into the SIMD spec.
Published results on this algorithm are not clear cut - it has negative speedups on smaller inputs when compiled natively. Wasm use of this instruction would be even slower, so a good starting point would be some description of why this should produce any speedup at all.
> a good starting point would be some description of why this should produce any speedup at all.
We're removing two PADDUSB instructions, and related constant loads, compared to the emulation of Population Count in WebAssembly SIMD through the i8x16.swizzle operation.
> We're removing two PADDUSB instructions, and related constant loads, compared to the emulation of Population Count in WebAssembly SIMD through the i8x16.swizzle operation.
If I remember correctly, the criteria is speedup against scalar, not prior SIMD instructions.
> If I remember correctly, the criteria is speedup against scalar, not prior SIMD instructions.
I don't think we ever discussed the intended baseline, but de facto all previous proposals were evaluated against the baseline of prior SIMD instructions. I don't mind evaluating against both SIMD and scalar baselines in this case.
@tlively and @zeux can correct me, but I thought we evaluated new instructions against scalar, the logic being that it does not make sense to add SIMD instructions that are slower than scalar. Also, in the comment you are referencing there are the following criteria:
- hard/expensive to emulate in absence of spec change
- well supported on 2+ major archs
- useful in 2+ important use cases
I am not sure having this complex of a lowering on x86 makes it "well supported" on the platform (edit: the point is that SIMD popcount is known to be not well supported on x86).
Since prototyping seems to be underway, I am curious how we are going to run this. The use cases here are a number of research papers (I personally can't access the chemistry one), and a Wikipedia article. I have to admit I did not read the entirety of the papers, but at a first glance I don't see code we can actually run.
My plan is to implement one or more algorithms built on top of Population Count (either binary GEMM or N-point correlation basecase) and evaluate with dedicated Population Count instruction and with its simulation using baseline SIMD.
> My plan is to implement one or more algorithms built on top of Population Count (either binary GEMM or N-point correlation basecase) and evaluate with dedicated Population Count instruction and with its simulation using baseline SIMD.
That sounds great! @Maratyszcza Just for reference, in addition to the binary GEMM you linked to, we also have an indirect binary convolution implementation similar to XNNPack available in larq/compute-engine, which might be easier to experiment with for this analysis. Feel free to reach out to us if you run into any trouble.
This instruction has been landed in both LLVM and Binaryen, so it should be ready to use from tip-of-tree Emscripten in a few hours. The builtin function to use is __builtin_wasm_popcnt_i8x16.
To evaluate this proposal I implemented scalar and SIMD versions of the O(N^3) part of the 3-Point Correlation computation. This part is responsible for most of the time spent in the 3-Point Correlation computation. The algorithm follows this paper and is loosely based on the reference code for it. The code is specialized for 128 points.
Here's the scalar version:
uint32_t ThreePointCorrelationsScalar64Bit(
    const uint64_t pairwise_ab[2 * 3 * 128],
    const uint64_t pairwise_ac[2 * 3 * 128],
    const uint64_t pairwise_bc[2 * 3 * 128])
{
  uint32_t count = 0;
  for (size_t i = 128; i != 0; i--) {
    uint64_t sat_ab0_hi = pairwise_ab[1];
    uint64_t sat_ab1_hi = pairwise_ab[3];
    uint64_t sat_ab2_hi = pairwise_ab[5];
    const uint64_t sat_ac0_lo = pairwise_ac[0];
    const uint64_t sat_ac0_hi = pairwise_ac[1];
    const uint64_t sat_ac1_lo = pairwise_ac[2];
    const uint64_t sat_ac1_hi = pairwise_ac[3];
    const uint64_t sat_ac2_lo = pairwise_ac[4];
    const uint64_t sat_ac2_hi = pairwise_ac[5];
    pairwise_ac += 6;
    const uint64_t* restrict bc = pairwise_bc;
    for (size_t j = 64; j != 0; j--) {
      const uint64_t sat_ab0_bcast = (uint64_t) ((int64_t) sat_ab0_hi >> 63);
      const uint64_t sat_ab1_bcast = (uint64_t) ((int64_t) sat_ab1_hi >> 63);
      const uint64_t sat_ab2_bcast = (uint64_t) ((int64_t) sat_ab2_hi >> 63);
      const uint64_t sat_bc0_lo = bc[0];
      const uint64_t sat_bc0_hi = bc[1];
      const uint64_t sat_bc1_lo = bc[2];
      const uint64_t sat_bc1_hi = bc[3];
      const uint64_t sat_bc2_lo = bc[4];
      const uint64_t sat_bc2_hi = bc[5];
      bc += 6;
      count += (uint32_t) __builtin_popcountll(
        (sat_ab0_bcast & ((sat_ac1_lo & sat_bc2_lo) | (sat_ac2_lo & sat_bc1_lo))) |
        (sat_ab1_bcast & ((sat_ac0_lo & sat_bc2_lo) | (sat_ac2_lo & sat_bc0_lo))) |
        (sat_ab2_bcast & ((sat_ac0_lo & sat_bc1_lo) | (sat_ac1_lo & sat_bc0_lo))));
      count += (uint32_t) __builtin_popcountll(
        (sat_ab0_bcast & ((sat_ac1_hi & sat_bc2_hi) | (sat_ac2_hi & sat_bc1_hi))) |
        (sat_ab1_bcast & ((sat_ac0_hi & sat_bc2_hi) | (sat_ac2_hi & sat_bc0_hi))) |
        (sat_ab2_bcast & ((sat_ac0_hi & sat_bc1_hi) | (sat_ac1_hi & sat_bc0_hi))));
      sat_ab0_hi += sat_ab0_hi;
      sat_ab1_hi += sat_ab1_hi;
      sat_ab2_hi += sat_ab2_hi;
    }
    uint64_t sat_ab0_lo = pairwise_ab[0];
    uint64_t sat_ab1_lo = pairwise_ab[2];
    uint64_t sat_ab2_lo = pairwise_ab[4];
    pairwise_ab += 6;
    for (size_t j = 64; j != 0; j--) {
      const uint64_t sat_ab0_bcast = (uint64_t) ((int64_t) sat_ab0_lo >> 63);
      const uint64_t sat_ab1_bcast = (uint64_t) ((int64_t) sat_ab1_lo >> 63);
      const uint64_t sat_ab2_bcast = (uint64_t) ((int64_t) sat_ab2_lo >> 63);
      const uint64_t sat_bc0_lo = bc[0];
      const uint64_t sat_bc0_hi = bc[1];
      const uint64_t sat_bc1_lo = bc[2];
      const uint64_t sat_bc1_hi = bc[3];
      const uint64_t sat_bc2_lo = bc[4];
      const uint64_t sat_bc2_hi = bc[5];
      bc += 6;
      count += (uint32_t) __builtin_popcountll(
        (sat_ab0_bcast & ((sat_ac1_lo & sat_bc2_lo) | (sat_ac2_lo & sat_bc1_lo))) |
        (sat_ab1_bcast & ((sat_ac0_lo & sat_bc2_lo) | (sat_ac2_lo & sat_bc0_lo))) |
        (sat_ab2_bcast & ((sat_ac0_lo & sat_bc1_lo) | (sat_ac1_lo & sat_bc0_lo))));
      count += (uint32_t) __builtin_popcountll(
        (sat_ab0_bcast & ((sat_ac1_hi & sat_bc2_hi) | (sat_ac2_hi & sat_bc1_hi))) |
        (sat_ab1_bcast & ((sat_ac0_hi & sat_bc2_hi) | (sat_ac2_hi & sat_bc0_hi))) |
        (sat_ab2_bcast & ((sat_ac0_hi & sat_bc1_hi) | (sat_ac1_hi & sat_bc0_hi))));
      sat_ab0_lo += sat_ab0_lo;
      sat_ab1_lo += sat_ab1_lo;
      sat_ab2_lo += sat_ab2_lo;
    }
  }
  return count;
}
And here's the SIMD version with the Population Count instruction:
#define USE_X86_WORKAROUND 0
#define USE_EXTPADD 0
uint32_t ThreePointCorrelationsSIMDPopcount(
    const uint64_t pairwise_ab[2 * 3 * 128],
    const uint64_t pairwise_ac[2 * 3 * 128],
    const uint64_t pairwise_bc[2 * 3 * 128])
{
#if !USE_EXTPADD
  const v128_t mask_00FF00FF = wasm_i32x4_const(0x00FF00FF, 0x00FF00FF, 0x00FF00FF, 0x00FF00FF);
  const v128_t mask_0000FFFF = wasm_i32x4_const(0x0000FFFF, 0x0000FFFF, 0x0000FFFF, 0x0000FFFF);
#endif
  v128_t count_simd32 = wasm_i32x4_const(0, 0, 0, 0);
  for (size_t i = 128; i != 0; i--) {
    const v128_t sat_ac0 = wasm_v128_load(pairwise_ac);
    const v128_t sat_ac1 = wasm_v128_load(pairwise_ac + 2);
    const v128_t sat_ac2 = wasm_v128_load(pairwise_ac + 4);
    pairwise_ac += 6;
    const uint64_t* restrict bc = pairwise_bc;
    v128_t sat_ab0_hi = wasm_v64x2_load_splat(pairwise_ab + 1);
    v128_t sat_ab1_hi = wasm_v64x2_load_splat(pairwise_ab + 3);
    v128_t sat_ab2_hi = wasm_v64x2_load_splat(pairwise_ab + 5);
    v128_t count_simd16 = wasm_i16x8_const(0, 0, 0, 0, 0, 0, 0, 0);
    for (size_t j = 64; j != 0; j--) {
#if USE_X86_WORKAROUND
      const v128_t sat_ab0_bcast = wasm_i32x4_shr(wasm_v32x4_shuffle(sat_ab0_hi, sat_ab0_hi, 1, 1, 3, 3), 31);
      const v128_t sat_ab1_bcast = wasm_i32x4_shr(wasm_v32x4_shuffle(sat_ab1_hi, sat_ab1_hi, 1, 1, 3, 3), 31);
      const v128_t sat_ab2_bcast = wasm_i32x4_shr(wasm_v32x4_shuffle(sat_ab2_hi, sat_ab2_hi, 1, 1, 3, 3), 31);
#else
      const v128_t sat_ab0_bcast = wasm_i64x2_shr(sat_ab0_hi, 63);
      const v128_t sat_ab1_bcast = wasm_i64x2_shr(sat_ab1_hi, 63);
      const v128_t sat_ab2_bcast = wasm_i64x2_shr(sat_ab2_hi, 63);
#endif
      const v128_t sat_bc0 = wasm_v128_load(bc);
      const v128_t sat_bc1 = wasm_v128_load(bc + 2);
      const v128_t sat_bc2 = wasm_v128_load(bc + 4);
      bc += 6;
      const v128_t bitmask =
        wasm_v128_or(
          wasm_v128_or(
            wasm_v128_and(sat_ab0_bcast, wasm_v128_or(wasm_v128_and(sat_ac1, sat_bc2), wasm_v128_and(sat_ac2, sat_bc1))),
            wasm_v128_and(sat_ab1_bcast, wasm_v128_or(wasm_v128_and(sat_ac0, sat_bc2), wasm_v128_and(sat_ac2, sat_bc0)))),
          wasm_v128_and(sat_ab2_bcast, wasm_v128_or(wasm_v128_and(sat_ac0, sat_bc1), wasm_v128_and(sat_ac1, sat_bc0))));
      const v128_t count_simd8 = __builtin_wasm_popcnt_i8x16(bitmask);
#if USE_EXTPADD
      count_simd16 = wasm_i16x8_add(count_simd16, __builtin_wasm_extadd_pairwise_i8x16_u_i16x8(count_simd8));
#else
      count_simd16 = wasm_i16x8_add(count_simd16, wasm_u16x8_shr(count_simd8, 8));
      count_simd16 = wasm_i16x8_add(count_simd16, wasm_v128_and(count_simd8, mask_00FF00FF));
#endif
      sat_ab0_hi = wasm_i64x2_shl(sat_ab0_hi, 1);
      sat_ab1_hi = wasm_i64x2_shl(sat_ab1_hi, 1);
      sat_ab2_hi = wasm_i64x2_shl(sat_ab2_hi, 1);
    }
    v128_t sat_ab0_lo = wasm_v64x2_load_splat(pairwise_ab);
    v128_t sat_ab1_lo = wasm_v64x2_load_splat(pairwise_ab + 2);
    v128_t sat_ab2_lo = wasm_v64x2_load_splat(pairwise_ab + 4);
    pairwise_ab += 6;
    for (size_t j = 64; j != 0; j--) {
#if USE_X86_WORKAROUND
      const v128_t sat_ab0_bcast = wasm_i32x4_shr(wasm_v32x4_shuffle(sat_ab0_lo, sat_ab0_lo, 1, 1, 3, 3), 31);
      const v128_t sat_ab1_bcast = wasm_i32x4_shr(wasm_v32x4_shuffle(sat_ab1_lo, sat_ab1_lo, 1, 1, 3, 3), 31);
      const v128_t sat_ab2_bcast = wasm_i32x4_shr(wasm_v32x4_shuffle(sat_ab2_lo, sat_ab2_lo, 1, 1, 3, 3), 31);
#else
      const v128_t sat_ab0_bcast = wasm_i64x2_shr(sat_ab0_lo, 63);
      const v128_t sat_ab1_bcast = wasm_i64x2_shr(sat_ab1_lo, 63);
      const v128_t sat_ab2_bcast = wasm_i64x2_shr(sat_ab2_lo, 63);
#endif
      const v128_t sat_bc0 = wasm_v128_load(bc);
      const v128_t sat_bc1 = wasm_v128_load(bc + 2);
      const v128_t sat_bc2 = wasm_v128_load(bc + 4);
      bc += 6;
      const v128_t bitmask =
        wasm_v128_or(
          wasm_v128_or(
            wasm_v128_and(sat_ab0_bcast, wasm_v128_or(wasm_v128_and(sat_ac1, sat_bc2), wasm_v128_and(sat_ac2, sat_bc1))),
            wasm_v128_and(sat_ab1_bcast, wasm_v128_or(wasm_v128_and(sat_ac0, sat_bc2), wasm_v128_and(sat_ac2, sat_bc0)))),
          wasm_v128_and(sat_ab2_bcast, wasm_v128_or(wasm_v128_and(sat_ac0, sat_bc1), wasm_v128_and(sat_ac1, sat_bc0))));
      const v128_t count_simd8 = __builtin_wasm_popcnt_i8x16(bitmask);
#if USE_EXTPADD
      count_simd16 = wasm_i16x8_add(count_simd16, __builtin_wasm_extadd_pairwise_i8x16_u_i16x8(count_simd8));
#else
      count_simd16 = wasm_i16x8_add(count_simd16, wasm_u16x8_shr(count_simd8, 8));
      count_simd16 = wasm_i16x8_add(count_simd16, wasm_v128_and(count_simd8, mask_00FF00FF));
#endif
      sat_ab0_lo = wasm_i64x2_shl(sat_ab0_lo, 1);
      sat_ab1_lo = wasm_i64x2_shl(sat_ab1_lo, 1);
      sat_ab2_lo = wasm_i64x2_shl(sat_ab2_lo, 1);
    }
#if USE_EXTPADD
    count_simd32 = wasm_i32x4_add(count_simd32, __builtin_wasm_extadd_pairwise_i16x8_u_i32x4(count_simd16));
#else
    count_simd32 = wasm_i32x4_add(count_simd32, wasm_u32x4_shr(count_simd16, 16));
    count_simd32 = wasm_i32x4_add(count_simd32, wasm_v128_and(count_simd16, mask_0000FFFF));
#endif
  }
  count_simd32 = wasm_i32x4_add(count_simd32,
      wasm_v32x4_shuffle(count_simd32, count_simd32, 2, 3, 0, 1));
  return wasm_i32x4_extract_lane(count_simd32, 0) + wasm_i32x4_extract_lane(count_simd32, 1);
}
Additionally, I evaluated SIMD versions without the SIMD Population Count instruction. One version simulates SIMD Population Count using the i8x16.swizzle instruction:
const v128_t table = wasm_i8x16_const(0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4);
const v128_t mask_0F0F0F0F = wasm_i32x4_const(0x0F0F0F0F, 0x0F0F0F0F, 0x0F0F0F0F, 0x0F0F0F0F);
...
const v128_t count_simd8 = wasm_i8x16_add(
wasm_v8x16_swizzle(table, wasm_v128_and(bitmask, mask_0F0F0F0F)),
wasm_v8x16_swizzle(table, wasm_u8x16_shr(bitmask, 4)));
Another version simulates SIMD Population Count using the HACKMEM algorithm:
const v128_t mask_55555555 = wasm_i32x4_const(0x55555555, 0x55555555, 0x55555555, 0x55555555);
const v128_t mask_33333333 = wasm_i32x4_const(0x33333333, 0x33333333, 0x33333333, 0x33333333);
const v128_t mask_0F0F0F0F = wasm_i32x4_const(0x0F0F0F0F, 0x0F0F0F0F, 0x0F0F0F0F, 0x0F0F0F0F);
...
const v128_t count_simd2 = wasm_i8x16_sub(bitmask, wasm_v128_and(wasm_u8x16_shr(bitmask, 1), mask_55555555));
const v128_t count_simd4 = wasm_i8x16_add(wasm_v128_and(count_simd2, mask_33333333),
wasm_v128_and(wasm_u8x16_shr(count_simd2, 2), mask_33333333));
const v128_t count_simd8 = wasm_v128_and(wasm_i8x16_add(count_simd4, wasm_u8x16_shr(count_simd4, 4)), mask_0F0F0F0F);
Performance on ARM64 devices is presented below:
Implementation | Time on Snapdragon 670 (Pixel 3a) | Time on Exynos 8895 (Galaxy S8)
---|---|---
64-bit scalar | 441 us | 466 us
WAsm SIMD with HACKMEM | 325 us | 207 us
WAsm SIMD with i8x16.swizzle | 274 us | 169 us
WAsm SIMD with i8x16.popcount | 188 us | 125 us
The version with the i8x16.popcount instruction is 1.4-1.5X faster than the fastest SIMD version without i8x16.popcount, and 2.3-3.7X faster than the scalar version.
In all SIMD versions on x86 I use USE_X86_WORKAROUND=1 to work around poor code generation for wasm_i64x2_shr(v, 63) in V8. Performance on x86-64 systems is presented below:
Implementation | Time on AMD PRO A10-8700B | Time on AMD A4-7210 | Time on Intel Xeon W-2135 | Time on Intel Celeron N3060
---|---|---|---|---
64-bit scalar | 161 us | 375 us | 77 us | 302 us
WAsm SIMD with HACKMEM | 141 us | 376 us | 66 us | 307 us
WAsm SIMD with i8x16.swizzle | 103 us | 317 us | 55 us | 350 us
WAsm SIMD with i8x16.popcount | 91 us | 267 us | 54 us | 316 us
Performance on Intel Celeron N3060 is poor for two reasons:
- The PSHUFB instruction on the Silvermont cores in Celeron N3060 is very slow, with both latency and reciprocal throughput of 5 cycles (per Agner Fog). It would be worth detecting these cores in V8 and generating a HACKMEM-like implementation for them.
- Performance on Intel Xeon is presumably poor due to the overhead of reloading the lookup table and the 0x0F mask. Because of suboptimal loading of constants in V8, it generates 5 more instructions than the necessary minimum.
Still, despite the overhead related to loading constants in V8, which would presumably get fixed later on, the i8x16.popcount version wins on 3 out of 4 tested systems.
> In all SIMD versions on x86 I use USE_X86_WORKAROUND=1 to work around poor code generation for wasm_i64x2_shr(v, 63) in V8.
This illustrates the point I've been trying to make for some time about the feasibility of building and measuring larger apps - shaving off a few instructions by adding a more specialized operation only pays off if there are no other bottlenecks. That's why a speedup on a larger app is not only more convincing, but may be necessary - it would reduce the likelihood of running into issues like this. I suspect some of the other recently added operations would suffer from "somebody else's slowdown" if run on all of the listed use cases (from the respective PR descriptions).
Also, some of the issues from https://github.com/WebAssembly/simd/pull/380#issuecomment-763958780 would apply here as well.
Re-benchmarked performance on Celeron N3060 after the recent fix for inefficient i8x16.popcnt on SSE and updated the results table.
Adding a preliminary vote for the inclusion of the population count operation in the SIMD proposal below. Please vote with:
👍 For including population count
👎 Against including population count
Updated Applications section with links to code. Also, I plan to re-run the benchmarks once https://chromium-review.googlesource.com/c/v8/v8/+/2643242 is merged.
To summarize, why I think it important to have this instruction vs emulating this operation on top of existing WAsm SIMD instructions:
- The most efficient emulation of i8x16.popcnt would be to extract the top and bottom nibbles of each byte, and use i8x16.swizzle as a 16-entry table lookup. i8x16.swizzle lowers to two instructions: [V]PADDUSB to guarantee WAsm SIMD semantics for out-of-range indices and [V]PSHUFB for the actual table lookup. Having a built-in i8x16.popcount instruction in WAsm SIMD enables the code generator to eliminate the [V]PADDUSB instruction from the lowering because the indices (the top and bottom nibbles of the input) are known to be in-range.
- x86-64 processors with AVX512-BITALG have a native VPOPCNTB instruction, so i8x16.popcount would lower to a single instruction.
- On ARM/ARM64, i8x16.popcount would lower to a single instruction compared to 6 instructions for the emulation code.

FWIW my vote is more "neutral" than "against" but I wasn't sure how to reflect it - I've removed my original downvote to not be misleading - but my concerns are:
The upsides are that Intel seems to be alone in not supporting this operation, and eventually AVX512 will correct that.
I think this is, frankly, an exotic operation rarely found in practice - there are surely a few algorithms which would benefit from it (just like with anything else), but they are quite specialized and obscure. Also, despite this being present in libraries, we don't have a path to running wasm apps with this operation yet.
I don't think orthogonality with scalar Wasm should be the motivation - scalar popcnt exists due to its widespread availability in hardware.
I've implemented a number of HammingDistance functions with Neon, SSSE3 & AVX2. The first step is a vector popcnt that produces a count per element. Neon and AVX512 have that instruction. SSE/AVX work almost as well using shufb. You'll eventually need a horizontal or paired add, but you can unroll a few vpopcnt's before the sum needs to be widened. If we have use cases for popcnts I'm in favor of a dedicated instruction.
- Scalar Wasm includes 32 and 64-bit popcnt but this one is 8-bit, which feels somewhat one-off. Is this because wider widths don't have interesting practical uses or because of suboptimal lowering?
ARM and ARM64 support only 8-bit SIMD Population Count. Wider Population Count instructions could be emulated through extended pairwise addition instructions (SADDL/UADDL), which are partially exposed in #380.
Both the HACKMEM and PSHUFB methods on pre-AVX512 x86 produce a per-byte Population Count. 16- and 32-bit Population Count could be emulated using the instructions from the extended pairwise addition proposal #380. 64-bit Population Count could be emulated more efficiently, though, by doing PSADBW with a zero register.
To sum up, wider-element Population Count instructions are not exposed because they don't have an efficient lowering, and their inefficient lowering can be expressed using the more general extended pairwise addition instructions.
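For illustration, a hypothetical sketch of such wider-element emulation, using the extended pairwise addition builtins that appear in the benchmark code above (the 64-bit/PSADBW-style variant is omitted; builtin names are assumed from this thread):

#include <wasm_simd128.h>

// Hypothetical 16-bit and 32-bit per-lane population counts built from
// i8x16.popcnt plus the extended pairwise additions proposed in #380.
static inline v128_t popcnt_u16x8_sketch(v128_t v) {
  return __builtin_wasm_extadd_pairwise_i8x16_u_i16x8(__builtin_wasm_popcnt_i8x16(v));
}

static inline v128_t popcnt_u32x4_sketch(v128_t v) {
  return __builtin_wasm_extadd_pairwise_i16x8_u_i32x4(popcnt_u16x8_sketch(v));
}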
- The Intel lowering is pretty sad and that's after using rip-relative loads which I think v8 used to not support (although this may have changed)
V8 now does memory loads, albeit not yet rip-relative (it needs an extra instruction to set up a general-purpose register for the load address).
@Maratyszcza Thanks, this helps. So it looks like 8-bit popcnt is the only width that's practical to run natively on ARM and the emulation on Intel is more expensive for larger widths as well. Wrt Intel lowering it looks like this is similar to some other instructions that we've standardized in this proposal where it's not 1-1, but there's no good substitute.
IOW if your algorithm requires popcnt, there's no good alternative and as such it's not obvious that the perf. "cliff" is an issue...
(It's somewhat surprising [to me] that on ARM the delta between emulation and native execution isn't as high as I would expect, although maybe the throughput/latency characteristics of cnt aren't perfect...)
> I think this is, frankly, an exotic operation rarely found in practice - there are surely a few algorithms which would benefit from it (just like with anything else), but they are quite specialized and obscure.
Consider a dot product primitive: sum(A[i] * B[i]). When A[i] and B[i] are both binary, multiplication gets replaced with the boolean AND, and sum gets replaced with POPCOUNT. Thus, any algorithm involving dot product or matrix multiplication that is tolerant to precision reduction can benefit from the SIMD Population Count instruction. These include the most popular machine learning methods, e.g. neural networks, support vector machines, linear regression, and kernel methods.
Moreover, it is unnecessary to reduce precision down to a single bit: dot products of multi-bit elements can be processed bit-by-bit. The complexity is quadratic in the number of bits, but @ajtulloch found that it still outperforms direct integer computations for 3-bit elements (i.e. 9x the cost of single-bit computation).
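As a concrete illustration (a minimal scalar sketch, not code from any of the libraries discussed here), a 1-bit dot product over packed 64-bit words reduces to AND plus Population Count:

#include <stdint.h>
#include <stddef.h>

// sum(A[i] * B[i]) over single-bit elements packed into n 64-bit words
// becomes a popcount of the bitwise AND.
static uint32_t binary_dot_product(const uint64_t* a, const uint64_t* b, size_t n) {
  uint32_t sum = 0;
  for (size_t i = 0; i < n; i++) {
    sum += (uint32_t) __builtin_popcountll(a[i] & b[i]);
  }
  return sum;
}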
> Also, despite this being present in libraries, we don't have a path to running wasm apps with this operation yet.
IIRC, OpenCV has a WAsm SIMD port in the works.
> I don't think orthogonality with scalar Wasm should be the motivation - scalar popcnt exists due to its widespread availability in hardware.
This is a good point for including SIMD Population Count too, as it has wider availability than scalar Population Count: ARM/ARM64 devices feature a hardware SIMD Population Count instruction, but no scalar Population Count instruction.
> This is a good point for including SIMD Population Count too, as it has wider availability than scalar Population Count: ARM/ARM64 devices feature a hardware SIMD Population Count instruction, but no scalar Population Count instruction.
In the interest of being precise, for Intel it's the opposite - any CPU that supports AVX512BITALG supports POPCNT but the reverse is not true.
FWIW SVE and RISC-V V have popcnt instructions for masks, and on RVV that extends to whole registers (by taking LMUL=8).
Updated the table with x86-64 results following the merge of additional i8x16.popcount optimizations in V8.
As an implementor I'm pretty neutral on this and I'm not able to properly judge whether the instruction is useful or not - several concrete uses are presented above, while it's a tough job to argue against it except in the abstract ("exotic operation", "few applications"). On ARM64 it's a no-brainer to include it. The Intel AVX512 implementation may not see wide use given the dearth of consumer devices with AVX512, but even the ssse3 implementation beats the alternatives (except on Celeron), if only barely.
AVX512 availability on future Intel client platforms is not guaranteed and we should not make any decisions in a hurry at this point on the assumption that IA implementation will eventually become efficient with AVX512 support.
"Alder Lake Intel Hybrid Technology will not support Intel® AVX-512. ISA features such as Intel® AVX, AVX-VNNI, Intel® AVX2, and UMONITOR/UMWAIT/TPAUSE are supported." Ref: https://software.intel.com/content/dam/develop/external/us/en/documents-tps/architecture-instruction-set-extensions-programming-reference.pdf
@arunetm To be clear: the i8x16.popcnt instruction improves performance on x86-64 systems with SSSE3 compared to emulating this operation on top of the existing i8x16.swizzle WAsm SIMD instruction. The fact that i8x16.popcnt gets a further speedup on AVX512-BITALG is just a cherry on top of that.
Sorry if I missed something, the thread is getting long.
> Consider a dot product primitive: sum(A[i] * B[i]). When A[i] and B[i] are both binary, multiplication gets replaced with the boolean AND, and sum gets replaced with POPCOUNT. Thus, any algorithm involving dot product or matrix multiplication that is tolerant to precision reduction can benefit from the SIMD Population Count instruction.
This is not helping the generality argument, as outside of machine learning this would be viewed as an esoteric take on those three operations - which doesn't make it wrong by any measure, it's just that precision requirements outside of ML make this infeasible. To my best knowledge, bio and chemistry algorithms would use it in a much more conventional way (as its name implies). However, I don't think that just this being niche is a good enough argument for not including it - I am not convinced the examples and the 3-point correlation performance prove that we really need this.
As an aside and a background, among the 3-point correlation implementations the fastest one on Celeron is the scalar one (which is in and of itself problematic). Also, Xeon performance is far from great (it passes the bar technically, though). So firstly, I don't think adding one more SIMD operation that is slower than scalar is beneficial, as instead of two inefficient ways to write something we give users three (even if a new problematic operation is faster than an existing problematic operation). Surely, the existing alternatives (swizzle and HACKMEM) are not portable across x86 chips, but this operation won't be portable either.
Given that, I am trying to understand why we actually need a popcount. I understand that all code examples are Neon because that is the native ISA where it exists, but despite this, those applications (OpenCV, etc) still run on x86. I don't think any of the applications simulate popcount on x86 either, which means that the problem of porting them is not equivalent to arriving at a SIMD popcount on x86.
On the data side, we know from the 3-point example that simulating this instruction does not work on some of the lower end x86 chips, while using an alternative sequence of instructions works on Arm. Frankly, I think the overall solution for porting code which uses Neon's popcount would neither be an SSE/AVX clone nor a Neon clone - it is probably somewhere in between.
> The PSHUFB instruction on the Silvermont cores in Celeron N3060 is very slow, with both latency and reciprocal throughput of 5 cycles (per Agner Fog). It would be worth detecting these cores in V8 and generating a HACKMEM-like implementation for them.
I think we should investigate a solution like this instead of adding a new instruction.
> Given that, I am trying to understand why we actually need a popcount. I understand that all code examples are Neon because that is the native ISA where it exists, but despite this, those applications (OpenCV, etc) still run on x86. I don't think any of the applications simulate popcount on x86 either, which means that the problem of porting them is not equivalent to arriving at a SIMD popcount on x86.
OpenCV uses SIMD population count on all platforms, and does simulate it on x86. The SSE version uses a HACKMEM-like algorithm, the AVX version uses the VPSHUFB instruction for a 16-entry table lookup (similar to i8x16.swizzle in WAsm SIMD), and the AVX512 version uses the hardware SIMD population count instruction when available.
Moreover, OpenCV has a port to WebAssembly SIMD, which simulates the SIMD population count too! The implementation uses a HACKMEM-like algorithm: it predates the i8x16.swizzle instruction in WAsm SIMD, so HACKMEM was the only option at the time.
> The PSHUFB instruction on the Silvermont cores in Celeron N3060 is very slow, with both latency and reciprocal throughput of 5 cycles (per Agner Fog). It would be worth detecting these cores in V8 and generating a HACKMEM-like implementation for them.
>
> I think we should investigate a solution like this instead of adding a new instruction.
A solution necessitates adding a new instruction. WAsm engines have the information (microarchitecture ID) to generate alternative lowering, but this information can't be passed to WAsm because of anti-fingerprinting considerations. Thus, a perfect work-around is possible only by abstracting the code-generation for this operation away from the WAsm application into the WAsm engine.
So OpenCV already works and uses the HACKMEM solution - this still does not convince me we need a new instruction for popcount. I don't think we need to add another inefficient way to perform the same operation, this time codified as an instruction. I also don't think we need to add another instruction with the caveat that it is slower than scalar on some platforms.
I investigated the performance issues on the Celeron N3060, and it turns out there's a lot of performance variance between V8 versions. E.g. on the latest V8 build, the SIMD versions outperform scalar:
Implementation | Time on Intel Celeron N3060
---|---
64-bit scalar | 340 us
WAsm SIMD with HACKMEM | 344 us
WAsm SIMD with i8x16.swizzle | 350 us
WAsm SIMD with i8x16.popcount | 316 us
I also tried to replace the SSE implementation of i8x16.popcount with the following (SSE2 implementation based on HACKMEM):
__ xorps(tmp, tmp);
__ pavgb(tmp, src);
__ Move(dst, src);
__ andps(tmp,
         __ ExternalReferenceAsOperand(
             ExternalReference::address_of_wasm_i8x16_splat_0x55()));
__ psubb(dst, tmp);
Operand splat_0x33 = __ ExternalReferenceAsOperand(
    ExternalReference::address_of_wasm_i8x16_splat_0x33());
__ movaps(tmp, dst);
__ andps(dst, splat_0x33);
__ psrlw(tmp, 2);
__ andps(tmp, splat_0x33);
__ paddb(dst, tmp);
__ movaps(tmp, dst);
__ psrlw(dst, 4);
__ paddb(dst, tmp);
__ andps(dst,
         __ ExternalReferenceAsOperand(
             ExternalReference::address_of_wasm_i8x16_splat_0x0f()));
The resulting V8 build further improves the i8x16.popcount microkernel timing: 316 us -> 304 us.
The latest v8 build has this optimization https://chromium.googlesource.com/v8/v8/+/173d660849c1e4f43b15b67321b074a89b73f9ce so that's probably why it got slightly faster.
The optimization for Silvermont (and similar Atom-like processors) is merged in https://chromium.googlesource.com/v8/v8/+/71fc222f4930d7f9425e45ee1b2111bee742f20e
We voted on this instruction at the 1/29/21 meeting with these results: SF 1, F 6, N 3, A 2, SA 1. This is closer to the border between clear consensus and clear lack of consensus than many other votes we've taken, but I am inclined to interpret this vote as consensus to include the instruction. Here are the considerations leading me to this decision:
I believe the suggested lowering for SSE2 is incorrect. @Maratyszcza
PAVGB is (a + b + 1) >> 1 (it rounds up!), so one would want to compute x & (0x55 << 1) first and then PAVGB, or else follow the original formula of (x >> 1) & 0x55.
Chromium also seems affected by this. @ngzhian
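A small scalar self-check of the point above (my own illustration, not from the thread): per byte, PAVGB with a zero operand computes (x + 1) >> 1, which rounds up, while masking with 0xAA (i.e. 0x55 << 1) before the average reproduces the original (x >> 1) & 0x55 term exactly:

#include <assert.h>
#include <stdint.h>

int main(void) {
  for (uint32_t x = 0; x < 256; x++) {
    const uint32_t wanted = (x >> 1) & 0x55;            // HACKMEM first-step term
    const uint32_t avg    = (x + 0 + 1) >> 1;           // PAVGB(0, x): rounds up
    const uint32_t fixed  = ((x & 0xAA) + 0 + 1) >> 1;  // PAVGB(0, x & 0xAA)
    assert(fixed == wanted);             // the masked variant always matches
    if (x == 1) assert(avg != wanted);   // x = 1 gives 1 vs 0: the off-by-one
  }
  return 0;
}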
@aqrit Good catch, updated the PR description and filed issue 11591 for V8.
@aqrit On a side note, would you be interested in participating in WAsm SIMD meetings? The next one will be on April 9 (see WebAssembly/flexible-vectors#31 for details).
Thanks for the report, this is fixed in v8 in https://crrev.com/c/2780218, should go out in canary tomorrow.
Introduction
This PR introduces a SIMD variant of the Population Count operation. This operation counts the number of bits set to 1 and is commonly used in algorithms involving the Hamming distance between two binary vectors. Population Count is represented in the scalar WebAssembly instruction set and in many native SIMD instruction sets. E.g. the ARM NEON and AVX512BITALG extensions natively support the 8-bit lane variant introduced in this proposal. Implementation in SSSE3 is less trivial, but still more efficient than emulation with the i8x16.swizzle instruction due to the elimination of the PADDUSB instruction and the direct lowering of the table lookups into PSHUFB. Moreover, SIMD Population Count can be efficiently implemented on SSE2-level processors, where i8x16.swizzle-based emulation would have been scalarized.
Applications
Mapping to Common Instruction Sets
This section illustrates how the new WebAssembly instructions can be lowered on common instruction sets. However, these patterns are provided only for convenience; compliant WebAssembly implementations do not have to follow the same code generation patterns.
x86/x86-64 processors with AVX512BITALG and AVX512VL instruction sets
y = i8x16.popcnt(x) is lowered to VPOPCNTB xmm_y, xmm_x
x86/x86-64 processors with AVX instruction set
y = i8x16.popcnt(x) is lowered to:
VMOVDQA xmm_tmp0, [wasm_i8x16_splat(0x0F)]
VPANDN xmm_tmp1, xmm_tmp0, xmm_x
VPAND xmm_y, xmm_tmp0, xmm_x
VMOVDQA xmm_tmp0, [wasm_i8x16_const(0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4)]
VPSRLW xmm_tmp1, xmm_tmp1, 4
VPSHUFB xmm_y, xmm_tmp0, xmm_y
VPSHUFB xmm_tmp1, xmm_tmp0, xmm_tmp1
VPADDB xmm_y, xmm_y, xmm_tmp1
x86/x86-64 processors with SSSE3 instruction set
y = i8x16.popcnt(x) is lowered to:
MOVDQA xmm_tmp0, [wasm_i8x16_splat(0x0F)]
MOVDQA xmm_tmp1, xmm_x
PAND xmm_tmp1, xmm_tmp0
PANDN xmm_tmp0, xmm_x
MOVDQA xmm_y, [wasm_i8x16_const(0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4)]
PSRLW xmm_tmp0, 4
PSHUFB xmm_y, xmm_tmp1
MOVDQA xmm_tmp1, [wasm_i8x16_const(0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4)]
PSHUFB xmm_tmp1, xmm_tmp0
PADDB xmm_y, xmm_tmp1
x86/x86-64 processors with SSE2 instruction set
y = i8x16.popcnt(x) is lowered to (based on @WojciechMula's algorithm here):
MOVDQA xmm_tmp, xmm_x
PSRLW xmm_tmp, 1
MOVDQA xmm_y, xmm_x
PAND xmm_tmp, [wasm_i8x16_splat(0x55)]
PSUBB xmm_y, xmm_tmp
MOVDQA xmm_tmp, xmm_y
PAND xmm_y, [wasm_i8x16_splat(0x33)]
PSRLW xmm_tmp, 2
PAND xmm_tmp, [wasm_i8x16_splat(0x33)]
PADDB xmm_y, xmm_tmp
MOVDQA xmm_tmp, xmm_y
PSRLW xmm_tmp, 4
PADDB xmm_y, xmm_tmp
PAND xmm_y, [wasm_i8x16_splat(0x0F)]
ARM64 processors
y = i8x16.popcnt(x) is lowered to CNT Vy.16B, Vx.16B
ARMv7 processors with NEON instruction set
y = i8x16.popcnt(x) is lowered to VCNT.8 Qy, Qx
POWER processors with POWER 2.07+ instruction set and VMX
y = i8x16.popcnt(x) is lowered to VPOPCNTB VRy, VRx
MIPS processors with MSA instruction set
y = i8x16.popcnt(x) is lowered to PCNT.B Wy, Wx
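For readers who want to try the SSE2 lowering above natively, here is a sketch of the same algorithm written as C intrinsics (my own transcription, assuming <emmintrin.h>; it is not part of the proposal text):

#include <emmintrin.h>

// Per-byte population count of a 128-bit vector, mirroring the SSE2 lowering above.
static inline __m128i popcnt_i8x16_sse2(__m128i x) {
  const __m128i m55 = _mm_set1_epi8(0x55);
  const __m128i m33 = _mm_set1_epi8(0x33);
  const __m128i m0f = _mm_set1_epi8(0x0F);
  // y = x - ((x >> 1) & 0x55): counts in 2-bit fields
  __m128i y = _mm_sub_epi8(x, _mm_and_si128(_mm_srli_epi16(x, 1), m55));
  // y = (y & 0x33) + ((y >> 2) & 0x33): counts in 4-bit fields
  y = _mm_add_epi8(_mm_and_si128(y, m33),
                   _mm_and_si128(_mm_srli_epi16(y, 2), m33));
  // y = (y + (y >> 4)) & 0x0F: per-byte counts
  return _mm_and_si128(_mm_add_epi8(y, _mm_srli_epi16(y, 4)), m0f);
}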