Is the i8x16 variant of popcnt widely used in the implementations of these applications? If so, can you share any implementation references that use it on their critical path? The emulation on IA for this instruction will be very expensive, considering that AVX or AVX512 availability is not the baseline for the SIMD128 MVP. If we have good evidence that this instruction variant is necessary for real-world implementations, I suggest we consider it in SIMD post-MVP.
This seems to be a risky addition to the current MVP given its non-trivial architecture support.
The applications are insensitive to the element size of the population count instruction: the maximum output for byte inputs is 8, which enables enough accumulation using packed 8-bit SIMD to amortize the overhead of converting to a wider data type. I chose 8-bit elements for the instruction as it is the most portable, but another option would be to support all variants (8/16/32/64-bit). This repository by @WojciechMula provides many implementations for population count, and the Larq compute engine uses the 8-bit population count on the critical path.
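As a rough sketch of what this accumulation pattern could look like (my own illustration, not taken from any of the implementations linked above; it assumes the wasm_simd128.h intrinsics and the __builtin_wasm_popcnt_i8x16 builtin that appears later in this thread):

#include <wasm_simd128.h>
#include <stdint.h>
#include <stddef.h>

// Rough sketch: count the set bits in n 16-byte blocks. Per-byte counts (at most
// 8 per lane) are accumulated in 8-bit lanes for up to 31 iterations (31 * 8 = 248
// fits in a byte) before being widened, amortizing the conversion to wider lanes.
static uint32_t popcount_bytes_sketch(const uint8_t* data, size_t n) {
  const v128_t mask_00FF = wasm_i32x4_const(0x00FF00FF, 0x00FF00FF, 0x00FF00FF, 0x00FF00FF);
  const v128_t mask_FFFF = wasm_i32x4_const(0x0000FFFF, 0x0000FFFF, 0x0000FFFF, 0x0000FFFF);
  v128_t acc32 = wasm_i32x4_const(0, 0, 0, 0);
  while (n != 0) {
    size_t block = n < 31 ? n : 31;
    n -= block;
    v128_t acc8 = wasm_i32x4_const(0, 0, 0, 0);
    do {
      acc8 = wasm_i8x16_add(acc8, __builtin_wasm_popcnt_i8x16(wasm_v128_load(data)));
      data += 16;
    } while (--block != 0);
    // Widen the 8-bit partial sums to 16 and then 32 bits (shift + mask).
    const v128_t acc16 = wasm_i16x8_add(wasm_u16x8_shr(acc8, 8), wasm_v128_and(acc8, mask_00FF));
    acc32 = wasm_i32x4_add(acc32, wasm_i32x4_add(wasm_u32x4_shr(acc16, 16), wasm_v128_and(acc16, mask_FFFF)));
  }
  // Horizontal reduction of the four 32-bit lanes.
  acc32 = wasm_i32x4_add(acc32, wasm_v32x4_shuffle(acc32, acc32, 2, 3, 0, 1));
  return wasm_i32x4_extract_lane(acc32, 0) + wasm_i32x4_extract_lane(acc32, 1);
}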
I suggest we follow the agreed-upon criteria for evaluating instructions at Stage 3, and if they are demonstrably faster than emulation sequences on both x86-64 and ARM64, accept them into the SIMD spec.
Another key application is computational biology and genetics, with, as an example, two biologists exploring the popcount rabbit hole:
> I suggest we follow the agreed-upon criteria for evaluating instructions at Stage 3, and if they are demonstrably faster than emulation sequences on both x86-64 and ARM64, accept them into the SIMD spec.
Published results on this algorithm are not clear cut - it has negative speedups on smaller inputs when compiled natively. Wasm use of this instruction would be even slower, so a good starting point would be some description of why this should produce any speedup at all.
> a good starting point would be some description of why this should produce any speedup at all.
We're removing two PADDUSB instructions, and related constant loads, compared to the emulation of Population Count in WebAssembly SIMD through the i8x16.swizzle operation.
> We're removing two PADDUSB instructions, and related constant loads, compared to the emulation of Population Count in WebAssembly SIMD through the i8x16.swizzle operation.
If I remember correctly, the criteria is speedup against scalar, not prior SIMD instructions.
> If I remember correctly, the criteria is speedup against scalar, not prior SIMD instructions.
I don't think we ever discussed the intended baseline, but de facto all previous proposals were evaluated against the baseline of prior SIMD instructions. I don't mind evaluating against both SIMD and scalar baselines in this case.
@tlively and @zeux can correct me, but I thought we evaluated new instructions against scalar, the logic being that it does not make sense to add SIMD instructions that are slower than scalar. Also, in the comment you are referencing there are the following criteria:
- hard/expensive to emulate in absence of spec change
- well supported on 2+ major archs
- useful in 2+ important use cases
I am not sure having this complex of a lowering on x86 makes it "well supported" on the platform (edit: the point is that SIMD popcount is known to be not well supported on x86).
Since prototyping seems to be underway, I am curious how we are going to run this. The use cases here are a number of research papers (I personally can't access the chemistry one), and a Wikipedia article. I have to admit I did not read the entirety of the papers, but at a first glance I don't see code we can actually run.
My plan is to implement one or more algorithms built on top of Population Count (either binary GEMM or N-point correlation basecase) and evaluate with dedicated Population Count instruction and with its simulation using baseline SIMD.
> My plan is to implement one or more algorithms built on top of Population Count (either binary GEMM or N-point correlation basecase) and evaluate with dedicated Population Count instruction and with its simulation using baseline SIMD.
That sounds great! @Maratyszcza Just for reference, in addition to the binary GEMM you linked to, we also have an indirect binary convolution implementation similar to XNNPack available in larq/compute-engine, which might be easier to experiment with for this analysis. Feel free to reach out to us if you run into any trouble.
This instruction has been landed in both LLVM and Binaryen, so it should be ready to use from tip-of-tree Emscripten in a few hours. The builtin function to use is __builtin_wasm_popcnt_i8x16.
To evaluate this proposal I implemented scalar and SIMD versions of the O(N^3) part of the 3-Point Correlation computation. This part is responsible for most of the time spent in the 3-Point Correlation computation. The algorithm follows this paper and is loosely based on the reference code for it. The code is specialized for 128 points.
Here's the scalar version:
uint32_t ThreePointCorrelationsScalar64Bit(
    const uint64_t pairwise_ab[2 * 3 * 128],
    const uint64_t pairwise_ac[2 * 3 * 128],
    const uint64_t pairwise_bc[2 * 3 * 128])
{
  uint32_t count = 0;
  for (size_t i = 128; i != 0; i--) {
    uint64_t sat_ab0_hi = pairwise_ab[1];
    uint64_t sat_ab1_hi = pairwise_ab[3];
    uint64_t sat_ab2_hi = pairwise_ab[5];
    const uint64_t sat_ac0_lo = pairwise_ac[0];
    const uint64_t sat_ac0_hi = pairwise_ac[1];
    const uint64_t sat_ac1_lo = pairwise_ac[2];
    const uint64_t sat_ac1_hi = pairwise_ac[3];
    const uint64_t sat_ac2_lo = pairwise_ac[4];
    const uint64_t sat_ac2_hi = pairwise_ac[5];
    pairwise_ac += 6;
    const uint64_t* restrict bc = pairwise_bc;
    for (size_t j = 64; j != 0; j--) {
      const uint64_t sat_ab0_bcast = (uint64_t) ((int64_t) sat_ab0_hi >> 63);
      const uint64_t sat_ab1_bcast = (uint64_t) ((int64_t) sat_ab1_hi >> 63);
      const uint64_t sat_ab2_bcast = (uint64_t) ((int64_t) sat_ab2_hi >> 63);
      const uint64_t sat_bc0_lo = bc[0];
      const uint64_t sat_bc0_hi = bc[1];
      const uint64_t sat_bc1_lo = bc[2];
      const uint64_t sat_bc1_hi = bc[3];
      const uint64_t sat_bc2_lo = bc[4];
      const uint64_t sat_bc2_hi = bc[5];
      bc += 6;
      count += (uint32_t) __builtin_popcountll(
        (sat_ab0_bcast & ((sat_ac1_lo & sat_bc2_lo) | (sat_ac2_lo & sat_bc1_lo))) |
        (sat_ab1_bcast & ((sat_ac0_lo & sat_bc2_lo) | (sat_ac2_lo & sat_bc0_lo))) |
        (sat_ab2_bcast & ((sat_ac0_lo & sat_bc1_lo) | (sat_ac1_lo & sat_bc0_lo))));
      count += (uint32_t) __builtin_popcountll(
        (sat_ab0_bcast & ((sat_ac1_hi & sat_bc2_hi) | (sat_ac2_hi & sat_bc1_hi))) |
        (sat_ab1_bcast & ((sat_ac0_hi & sat_bc2_hi) | (sat_ac2_hi & sat_bc0_hi))) |
        (sat_ab2_bcast & ((sat_ac0_hi & sat_bc1_hi) | (sat_ac1_hi & sat_bc0_hi))));
      sat_ab0_hi += sat_ab0_hi;
      sat_ab1_hi += sat_ab1_hi;
      sat_ab2_hi += sat_ab2_hi;
    }
    uint64_t sat_ab0_lo = pairwise_ab[0];
    uint64_t sat_ab1_lo = pairwise_ab[2];
    uint64_t sat_ab2_lo = pairwise_ab[4];
    pairwise_ab += 6;
    for (size_t j = 64; j != 0; j--) {
      const uint64_t sat_ab0_bcast = (uint64_t) ((int64_t) sat_ab0_lo >> 63);
      const uint64_t sat_ab1_bcast = (uint64_t) ((int64_t) sat_ab1_lo >> 63);
      const uint64_t sat_ab2_bcast = (uint64_t) ((int64_t) sat_ab2_lo >> 63);
      const uint64_t sat_bc0_lo = bc[0];
      const uint64_t sat_bc0_hi = bc[1];
      const uint64_t sat_bc1_lo = bc[2];
      const uint64_t sat_bc1_hi = bc[3];
      const uint64_t sat_bc2_lo = bc[4];
      const uint64_t sat_bc2_hi = bc[5];
      bc += 6;
      count += (uint32_t) __builtin_popcountll(
        (sat_ab0_bcast & ((sat_ac1_lo & sat_bc2_lo) | (sat_ac2_lo & sat_bc1_lo))) |
        (sat_ab1_bcast & ((sat_ac0_lo & sat_bc2_lo) | (sat_ac2_lo & sat_bc0_lo))) |
        (sat_ab2_bcast & ((sat_ac0_lo & sat_bc1_lo) | (sat_ac1_lo & sat_bc0_lo))));
      count += (uint32_t) __builtin_popcountll(
        (sat_ab0_bcast & ((sat_ac1_hi & sat_bc2_hi) | (sat_ac2_hi & sat_bc1_hi))) |
        (sat_ab1_bcast & ((sat_ac0_hi & sat_bc2_hi) | (sat_ac2_hi & sat_bc0_hi))) |
        (sat_ab2_bcast & ((sat_ac0_hi & sat_bc1_hi) | (sat_ac1_hi & sat_bc0_hi))));
      sat_ab0_lo += sat_ab0_lo;
      sat_ab1_lo += sat_ab1_lo;
      sat_ab2_lo += sat_ab2_lo;
    }
  }
  return count;
}
And here's the SIMD version with the Population Count instruction:
#define USE_X86_WORKAROUND 0
#define USE_EXTPADD 0
uint32_t ThreePointCorrelationsSIMDPopcount(
    const uint64_t pairwise_ab[2 * 3 * 128],
    const uint64_t pairwise_ac[2 * 3 * 128],
    const uint64_t pairwise_bc[2 * 3 * 128])
{
#if !USE_EXTPADD
  const v128_t mask_00FF00FF = wasm_i32x4_const(0x00FF00FF, 0x00FF00FF, 0x00FF00FF, 0x00FF00FF);
  const v128_t mask_0000FFFF = wasm_i32x4_const(0x0000FFFF, 0x0000FFFF, 0x0000FFFF, 0x0000FFFF);
#endif
  v128_t count_simd32 = wasm_i32x4_const(0, 0, 0, 0);
  for (size_t i = 128; i != 0; i--) {
    const v128_t sat_ac0 = wasm_v128_load(pairwise_ac);
    const v128_t sat_ac1 = wasm_v128_load(pairwise_ac + 2);
    const v128_t sat_ac2 = wasm_v128_load(pairwise_ac + 4);
    pairwise_ac += 6;
    const uint64_t* restrict bc = pairwise_bc;
    v128_t sat_ab0_hi = wasm_v64x2_load_splat(pairwise_ab + 1);
    v128_t sat_ab1_hi = wasm_v64x2_load_splat(pairwise_ab + 3);
    v128_t sat_ab2_hi = wasm_v64x2_load_splat(pairwise_ab + 5);
    v128_t count_simd16 = wasm_i16x8_const(0, 0, 0, 0, 0, 0, 0, 0);
    for (size_t j = 64; j != 0; j--) {
#if USE_X86_WORKAROUND
      const v128_t sat_ab0_bcast = wasm_i32x4_shr(wasm_v32x4_shuffle(sat_ab0_hi, sat_ab0_hi, 1, 1, 3, 3), 31);
      const v128_t sat_ab1_bcast = wasm_i32x4_shr(wasm_v32x4_shuffle(sat_ab1_hi, sat_ab1_hi, 1, 1, 3, 3), 31);
      const v128_t sat_ab2_bcast = wasm_i32x4_shr(wasm_v32x4_shuffle(sat_ab2_hi, sat_ab2_hi, 1, 1, 3, 3), 31);
#else
      const v128_t sat_ab0_bcast = wasm_i64x2_shr(sat_ab0_hi, 63);
      const v128_t sat_ab1_bcast = wasm_i64x2_shr(sat_ab1_hi, 63);
      const v128_t sat_ab2_bcast = wasm_i64x2_shr(sat_ab2_hi, 63);
#endif
      const v128_t sat_bc0 = wasm_v128_load(bc);
      const v128_t sat_bc1 = wasm_v128_load(bc + 2);
      const v128_t sat_bc2 = wasm_v128_load(bc + 4);
      bc += 6;
      const v128_t bitmask =
        wasm_v128_or(
          wasm_v128_or(
            wasm_v128_and(sat_ab0_bcast, wasm_v128_or(wasm_v128_and(sat_ac1, sat_bc2), wasm_v128_and(sat_ac2, sat_bc1))),
            wasm_v128_and(sat_ab1_bcast, wasm_v128_or(wasm_v128_and(sat_ac0, sat_bc2), wasm_v128_and(sat_ac2, sat_bc0)))),
          wasm_v128_and(sat_ab2_bcast, wasm_v128_or(wasm_v128_and(sat_ac0, sat_bc1), wasm_v128_and(sat_ac1, sat_bc0))));
      const v128_t count_simd8 = __builtin_wasm_popcnt_i8x16(bitmask);
#if USE_EXTPADD
      count_simd16 = wasm_i16x8_add(count_simd16, __builtin_wasm_extadd_pairwise_i8x16_u_i16x8(count_simd8));
#else
      count_simd16 = wasm_i16x8_add(count_simd16, wasm_u16x8_shr(count_simd8, 8));
      count_simd16 = wasm_i16x8_add(count_simd16, wasm_v128_and(count_simd8, mask_00FF00FF));
#endif
      sat_ab0_hi = wasm_i64x2_shl(sat_ab0_hi, 1);
      sat_ab1_hi = wasm_i64x2_shl(sat_ab1_hi, 1);
      sat_ab2_hi = wasm_i64x2_shl(sat_ab2_hi, 1);
    }
    v128_t sat_ab0_lo = wasm_v64x2_load_splat(pairwise_ab);
    v128_t sat_ab1_lo = wasm_v64x2_load_splat(pairwise_ab + 2);
    v128_t sat_ab2_lo = wasm_v64x2_load_splat(pairwise_ab + 4);
    pairwise_ab += 6;
    for (size_t j = 64; j != 0; j--) {
#if USE_X86_WORKAROUND
      const v128_t sat_ab0_bcast = wasm_i32x4_shr(wasm_v32x4_shuffle(sat_ab0_lo, sat_ab0_lo, 1, 1, 3, 3), 31);
      const v128_t sat_ab1_bcast = wasm_i32x4_shr(wasm_v32x4_shuffle(sat_ab1_lo, sat_ab1_lo, 1, 1, 3, 3), 31);
      const v128_t sat_ab2_bcast = wasm_i32x4_shr(wasm_v32x4_shuffle(sat_ab2_lo, sat_ab2_lo, 1, 1, 3, 3), 31);
#else
      const v128_t sat_ab0_bcast = wasm_i64x2_shr(sat_ab0_lo, 63);
      const v128_t sat_ab1_bcast = wasm_i64x2_shr(sat_ab1_lo, 63);
      const v128_t sat_ab2_bcast = wasm_i64x2_shr(sat_ab2_lo, 63);
#endif
      const v128_t sat_bc0 = wasm_v128_load(bc);
      const v128_t sat_bc1 = wasm_v128_load(bc + 2);
      const v128_t sat_bc2 = wasm_v128_load(bc + 4);
      bc += 6;
      const v128_t bitmask =
        wasm_v128_or(
          wasm_v128_or(
            wasm_v128_and(sat_ab0_bcast, wasm_v128_or(wasm_v128_and(sat_ac1, sat_bc2), wasm_v128_and(sat_ac2, sat_bc1))),
            wasm_v128_and(sat_ab1_bcast, wasm_v128_or(wasm_v128_and(sat_ac0, sat_bc2), wasm_v128_and(sat_ac2, sat_bc0)))),
          wasm_v128_and(sat_ab2_bcast, wasm_v128_or(wasm_v128_and(sat_ac0, sat_bc1), wasm_v128_and(sat_ac1, sat_bc0))));
      const v128_t count_simd8 = __builtin_wasm_popcnt_i8x16(bitmask);
#if USE_EXTPADD
      count_simd16 = wasm_i16x8_add(count_simd16, __builtin_wasm_extadd_pairwise_i8x16_u_i16x8(count_simd8));
#else
      count_simd16 = wasm_i16x8_add(count_simd16, wasm_u16x8_shr(count_simd8, 8));
      count_simd16 = wasm_i16x8_add(count_simd16, wasm_v128_and(count_simd8, mask_00FF00FF));
#endif
      sat_ab0_lo = wasm_i64x2_shl(sat_ab0_lo, 1);
      sat_ab1_lo = wasm_i64x2_shl(sat_ab1_lo, 1);
      sat_ab2_lo = wasm_i64x2_shl(sat_ab2_lo, 1);
    }
#if USE_EXTPADD
    count_simd32 = wasm_i32x4_add(count_simd32, __builtin_wasm_extadd_pairwise_i16x8_u_i32x4(count_simd16));
#else
    count_simd32 = wasm_i32x4_add(count_simd32, wasm_u32x4_shr(count_simd16, 16));
    count_simd32 = wasm_i32x4_add(count_simd32, wasm_v128_and(count_simd16, mask_0000FFFF));
#endif
  }
  count_simd32 = wasm_i32x4_add(count_simd32,
      wasm_v32x4_shuffle(count_simd32, count_simd32, 2, 3, 0, 1));
  return wasm_i32x4_extract_lane(count_simd32, 0) + wasm_i32x4_extract_lane(count_simd32, 1);
}
Additionally, I evaluated SIMD versions without the SIMD Population Count instruction. One version simulates SIMD Population Count using the i8x16.swizzle instruction:
const v128_t table = wasm_i8x16_const(0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4);
const v128_t mask_0F0F0F0F = wasm_i32x4_const(0x0F0F0F0F, 0x0F0F0F0F, 0x0F0F0F0F, 0x0F0F0F0F);
...
const v128_t count_simd8 = wasm_i8x16_add(
wasm_v8x16_swizzle(table, wasm_v128_and(bitmask, mask_0F0F0F0F)),
wasm_v8x16_swizzle(table, wasm_u8x16_shr(bitmask, 4)));
Another version simulates SIMD Population Count using the HACKMEM algorithm:
const v128_t mask_55555555 = wasm_i32x4_const(0x55555555, 0x55555555, 0x55555555, 0x55555555);
const v128_t mask_33333333 = wasm_i32x4_const(0x33333333, 0x33333333, 0x33333333, 0x33333333);
const v128_t mask_0F0F0F0F = wasm_i32x4_const(0x0F0F0F0F, 0x0F0F0F0F, 0x0F0F0F0F, 0x0F0F0F0F);
...
const v128_t count_simd2 = wasm_i8x16_sub(bitmask, wasm_v128_and(wasm_u8x16_shr(bitmask, 1), mask_55555555));
const v128_t count_simd4 = wasm_i8x16_add(wasm_v128_and(count_simd2, mask_33333333),
wasm_v128_and(wasm_u8x16_shr(count_simd2, 2), mask_33333333));
const v128_t count_simd8 = wasm_v128_and(wasm_i8x16_add(count_simd4, wasm_u8x16_shr(count_simd4, 4)), mask_0F0F0F0F);
Performance on ARM64 devices is presented below:
Implementation | Time on Snapdragon 670 (Pixel 3a) | Time on Exynos 8895 (Galaxy S8)
---|---|---
64-bit scalar | 441 us | 466 us
WAsm SIMD with HACKMEM | 325 us | 207 us
WAsm SIMD with i8x16.swizzle | 274 us | 169 us
WAsm SIMD with i8x16.popcount | 188 us | 125 us
The version with the i8x16.popcount instruction is 1.4-1.5X faster than the fastest SIMD version without i8x16.popcount, and 2.3-3.7X faster than the scalar version.
In all SIMD versions on x86 I use USE_X86_WORKAROUND=1 to work around poor code generation for wasm_i64x2_shr(v, 63) in V8. Performance on x86-64 systems is presented below:
Implementation | Time on AMD PRO A10-8700B | Time on AMD A4-7210 | Time on Intel Xeon W-2135 | Time on Intel Celeron N3060
---|---|---|---|---
64-bit scalar | 161 us | 375 us | 77 us | 302 us
WAsm SIMD with HACKMEM | 141 us | 376 us | 66 us | 307 us
WAsm SIMD with i8x16.swizzle | 103 us | 317 us | 55 us | 350 us
WAsm SIMD with i8x16.popcount | 91 us | 267 us | 54 us | 316 us
Performance on Intel Celeron N3060 is poor for two reasons:
- The PSHUFB instruction on the Silvermont cores in Celeron N3060 is very slow, with both latency and reciprocal throughput of 5 cycles (per Agner Fog). It would be worth detecting these cores in V8 and generating a HACKMEM-like implementation for them.
- Performance on Intel Xeon is presumably poor due to the overhead of reloading the lookup table and the 0x0F mask. Because of suboptimal loading of constants in V8, it generates 5 more instructions than the necessary minimum.
Still, despite the overhead related to loading constants in V8, which would presumably get fixed later on, the i8x16.popcount version wins on 3 out of 4 tested systems.
> In all SIMD versions on x86 I use USE_X86_WORKAROUND=1 to work around poor code generation for wasm_i64x2_shr(v, 63) in V8.
This illustrates the point I've been trying to make for some time about the feasibility of building and measuring larger apps - shaving off a few instructions by adding a more specialized operation only pays off if there are no other bottlenecks. That's why a speedup on a larger app is not only more convincing, but may be necessary - it would reduce the likelihood of running into issues like this. I suspect some of the other recently added operations would suffer from "somebody else's slowdown" if run on all of the listed use cases (from the respective PR descriptions).
Also, some of the issues from https://github.com/WebAssembly/simd/pull/380#issuecomment-763958780 would apply here as well.
Re-benchmarked performance on Celeron N3060 after the recent fix for inefficient i8x16.popcnt on SSE and updated the results table.
Adding a preliminary vote for the inclusion of the population count operation in the SIMD proposal below. Please vote with:
👍 For including population count
👎 Against including population count
Updated Applications section with links to code. Also, I plan to re-run the benchmarks once https://chromium-review.googlesource.com/c/v8/v8/+/2643242 is merged.
To summarize, why I think it important to have this instruction vs emulating this operation on top of existing WAsm SIMD instructions:
- The most efficient emulation of i8x16.popcnt would be to extract the top and bottom nibbles of each byte, and use i8x16.swizzle as a 16-entry table lookup. i8x16.swizzle lowers to two instructions: [V]PADDUSB to guarantee WAsm SIMD semantics for out-of-range indices and [V]PSHUFB for the actual table lookup. Having a built-in i8x16.popcount instruction in WAsm SIMD enables the code generator to eliminate the [V]PADDUSB instruction from the lowering because the indices (the top and bottom nibbles of the input) are known to be in-range.
- x86-64 processors with AVX512-BITALG have a native VPOPCNTB instruction, so i8x16.popcount would lower to a single instruction.
- On ARM/ARM64, i8x16.popcount would lower to a single instruction compared to 6 instructions for the emulation code.

FWIW my vote is more "neutral" than "against" but I wasn't sure how to reflect it - I've removed my original downvote to not be misleading - but my concerns are:
The upsides are that Intel seems to be alone in not supporting this operation, and eventually AVX512 will correct that.
I think this is, frankly, an exotic operation rarely found in practice - there are surely a few algorithms which would benefit from it (just like with anything else), but they are quite specialized and obscure. Also, despite this being present in libraries, we don't have a path to running wasm apps with this operation yet.
I don't think orthogonality with scalar Wasm should be the motivation - scalar popcnt exists due to its widespread availability in hardware.
I've implemented a number of HammingDistance functions with Neon, SSSE3 & AVX2. The first step is a vector popcnt that produces a count per element. Neon and AVX512 have that instruction. SSE/AVX work almost as well using shufb. You'll eventually need a horizontal or paired add, but you can unroll a few vpopcnt's before the sum needs to be widened. If we have use cases for popcnts I'm in favor of a dedicated instruction.
- Scalar Wasm includes 32 and 64-bit popcnt but this one is 8-bit, which feels somewhat one-off. Is this because wider widths don't have interesting practical uses or because of suboptimal lowering?
ARM and ARM64 support only 8-bit SIMD Population Count. Wider Population Count instructions could be emulated through extended pairwise addition instructions (SADDL/UADDL), which are partially exposed in #380.
Both the HACKMEM and PSHUFB methods on pre-AVX512 x86 produce a per-byte Population Count. 16- and 32-bit Population Count could be emulated using the instructions from the extended pairwise addition proposal #380. 64-bit Population Count could be emulated more efficiently, though, by doing PSADBW with a zero register.
To sum up, wider-element Population Count instructions are not exposed because they don't have an efficient lowering, and their inefficient lowering can be expressed using the more general extended pairwise addition instructions.
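For illustration, a hypothetical sketch of such wider-element emulation, using the extended pairwise addition builtins that appear in the benchmark code above (the 64-bit/PSADBW-style variant is omitted; builtin names are assumed from this thread):

#include <wasm_simd128.h>

// Hypothetical 16-bit and 32-bit per-lane population counts built from
// i8x16.popcnt plus the extended pairwise additions proposed in #380.
static inline v128_t popcnt_u16x8_sketch(v128_t v) {
  return __builtin_wasm_extadd_pairwise_i8x16_u_i16x8(__builtin_wasm_popcnt_i8x16(v));
}

static inline v128_t popcnt_u32x4_sketch(v128_t v) {
  return __builtin_wasm_extadd_pairwise_i16x8_u_i32x4(popcnt_u16x8_sketch(v));
}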
- The Intel lowering is pretty sad and that's after using rip-relative loads which I think v8 used to not support (although this may have changed)
V8 now does memory loads, albeit not yet rip-relative (it needs an extra instruction to set up a general-purpose register for the load address).
@Maratyszcza Thanks, this helps. So it looks like 8-bit popcnt is the only width that's practical to run natively on ARM and the emulation on Intel is more expensive for larger widths as well. Wrt Intel lowering it looks like this is similar to some other instructions that we've standardized in this proposal where it's not 1-1, but there's no good substitute.
IOW if your algorithm requires popcnt, there's no good alternative and as such it's not obvious that the perf. "cliff" is an issue...
(It's somewhat surprising [to me] that on ARM the delta between emulation and native execution isn't as high as I would expect, although maybe the throughput/latency characteristics of cnt aren't perfect...)
> I think this is, frankly, an exotic operation rarely found in practice - there are surely a few algorithms which would benefit from it (just like with anything else), but they are quite specialized and obscure.
Consider a dot product primitive: sum(A[i] * B[i]). When A[i] and B[i] are both binary, multiplication gets replaced with the boolean AND, and sum gets replaced with POPCOUNT. Thus, any algorithm involving dot product or matrix multiplication that is tolerant to precision reduction can benefit from the SIMD Population Count instruction. These include the most popular machine learning methods, e.g. neural networks, support vector machines, linear regression, and kernel methods.
Moreover, it is unnecessary to reduce precision down to a single bit: dot products of multi-bit elements can be processed bit-by-bit. The complexity is quadratic in the number of bits, but @ajtulloch found that it still outperforms direct integer computations for 3-bit elements (i.e. 9x the cost of single-bit computation).
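As a concrete illustration (a minimal scalar sketch, not code from any of the libraries discussed here), a 1-bit dot product over packed 64-bit words reduces to AND plus Population Count:

#include <stdint.h>
#include <stddef.h>

// sum(A[i] * B[i]) over single-bit elements packed into n 64-bit words
// becomes a popcount of the bitwise AND.
static uint32_t binary_dot_product(const uint64_t* a, const uint64_t* b, size_t n) {
  uint32_t sum = 0;
  for (size_t i = 0; i < n; i++) {
    sum += (uint32_t) __builtin_popcountll(a[i] & b[i]);
  }
  return sum;
}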
> Also, despite this being present in libraries, we don't have a path to running wasm apps with this operation yet.
IIRC, OpenCV has a WAsm SIMD port in the works.
> I don't think orthogonality with scalar Wasm should be the motivation - scalar popcnt exists due to its widespread availability in hardware.
This is a good point for including SIMD Population Count too, as it has wider availability than scalar Population Count: ARM/ARM64 devices feature a hardware SIMD Population Count instruction, but no scalar Population Count instruction.
> This is a good point for including SIMD Population Count too, as it has wider availability than scalar Population Count: ARM/ARM64 devices feature a hardware SIMD Population Count instruction, but no scalar Population Count instruction.
In the interest of being precise, for Intel it's the opposite - any CPU that supports AVX512BITALG supports POPCNT but the reverse is not true.
FWIW SVE and RISC-V V have popcnt instructions for masks, and on RVV that extends to whole registers (by taking LMUL=8).
Updated the table with x86-64 results following the merge of additional i8x16.popcount optimizations in V8.
As an implementor I'm pretty neutral on this and I'm not able to properly judge whether the instruction is useful or not - several concrete uses are presented above, while it's a tough job to argue against it except in the abstract ("exotic operation", "few applications"). On ARM64 it's a no-brainer to include it. The Intel AVX512 implementation may not see wide use given the dearth of consumer devices with AVX512, but even the ssse3 implementation beats the alternatives (except on Celeron), if only barely.
AVX512 availability on future Intel client platforms is not guaranteed and we should not make any decisions in a hurry at this point on the assumption that IA implementation will eventually become efficient with AVX512 support.
"Alder Lake Intel Hybrid Technology will not support Intel® AVX-512. ISA features such as Intel® AVX, AVX-VNNI, Intel® AVX2, and UMONITOR/UMWAIT/TPAUSE are supported." Ref: https://software.intel.com/content/dam/develop/external/us/en/documents-tps/architecture-instruction-set-extensions-programming-reference.pdf
@arunetm To be clear: the i8x16.popcnt instruction improves performance on x86-64 systems with SSSE3 compared to emulating this operation on top of the existing i8x16.swizzle WAsm SIMD instruction. The fact that i8x16.popcnt gets a further speedup on AVX512-BITALG is just a cherry on top of that.
Sorry if I missed something, the thread is getting long.
> Consider a dot product primitive: sum(A[i] * B[i]). When A[i] and B[i] are both binary, multiplication gets replaced with the boolean AND, and sum gets replaced with POPCOUNT. Thus, any algorithm involving dot product or matrix multiplication that is tolerant to precision reduction can benefit from the SIMD Population Count instruction.
This is not helping the generality argument, as outside of machine learning this would be viewed as an esoteric take on those three operations - which doesn't make it wrong by any measure, it's just that precision requirements outside of ML make this infeasible. To my best knowledge, bio and chemistry algorithms would use it in a much more conventional way (as its name implies). However, I don't think that just this being niche is a good enough argument for not including it - I am not convinced the examples and the 3-point correlation performance prove that we really need this.
As an aside and a background, among the 3-point correlation implementations the fastest one on Celeron is the scalar one (which is in and of itself problematic). Also, Xeon performance is far from great (it passes the bar technically, though). So firstly, I don't think adding one more SIMD operation that is slower than scalar is beneficial, as instead of two inefficient ways to write something we give users three (even if a new problematic operation is faster than an existing problematic operation). Surely, the existing alternatives (swizzle and HACKMEM) are not portable across x86 chips, but this operation won't be portable either.
Given that, I am trying to understand why we actually need a popcount. I understand that all code examples are Neon because that is the native ISA where it exists, but despite this, those applications (OpenCV, etc) still run on x86. I don't think any of the applications simulate popcount on x86 either, which means that the problem of porting them is not equivalent to arriving at a SIMD popcount on x86.
On the data side, we know from the 3-point example that simulating this instruction does not work on some of the lower end x86 chips, while using an alternative sequence of instructions works on Arm. Frankly, I think the overall solution for porting code which uses Neon's popcount would neither be an SSE/AVX clone nor a Neon clone - it is probably somewhere in between.
> The PSHUFB instruction on the Silvermont cores in Celeron N3060 is very slow, with both latency and reciprocal throughput of 5 cycles (per Agner Fog). It would be worth detecting these cores in V8 and generating a HACKMEM-like implementation for them.
I think we should investigate a solution like this instead of adding a new instruction.
> Given that, I am trying to understand why we actually need a popcount. I understand that all code examples are Neon because that is the native ISA where it exists, but despite this, those applications (OpenCV, etc) still run on x86. I don't think any of the applications simulate popcount on x86 either, which means that the problem of porting them is not equivalent to arriving at a SIMD popcount on x86.
OpenCV uses SIMD population count on all platforms, and does simulate it on x86. The SSE version uses a HACKMEM-like algorithm, the AVX version uses the VPSHUFB instruction for a 16-entry table lookup (similar to i8x16.swizzle in WAsm SIMD), and the AVX512 version uses the hardware SIMD population count instruction when available.
Moreover, OpenCV has a port to WebAssembly SIMD, which simulates the SIMD population count too! The implementation uses a HACKMEM-like algorithm: it predates the i8x16.swizzle instruction in WAsm SIMD, so HACKMEM was the only option at the time.
> The PSHUFB instruction on the Silvermont cores in Celeron N3060 is very slow, with both latency and reciprocal throughput of 5 cycles (per Agner Fog). It would be worth detecting these cores in V8 and generating a HACKMEM-like implementation for them.
>
> I think we should investigate a solution like this instead of adding a new instruction.
A solution necessitates adding a new instruction. WAsm engines have the information (microarchitecture ID) to generate alternative lowering, but this information can't be passed to WAsm because of anti-fingerprinting considerations. Thus, a perfect work-around is possible only by abstracting the code-generation for this operation away from the WAsm application into the WAsm engine.
So OpenCV already works and uses the HACKMEM solution - this still does not convince me we need a new instruction for popcount. I don't think we need to add another inefficient way to perform the same operation, this time codified as an instruction. I also don't think we need to add another instruction with the caveat that it is slower than scalar on some platforms.
I investigated the performance issues on the Celeron N3060, and it turns out there's a lot of performance variance between V8 versions. E.g. on the latest V8 build, the SIMD versions outperform scalar:
Implementation | Time on Intel Celeron N3060
---|---
64-bit scalar | 340 us
WAsm SIMD with HACKMEM | 344 us
WAsm SIMD with i8x16.swizzle | 350 us
WAsm SIMD with i8x16.popcount | 316 us
I also tried to replace the SSE implementation of i8x16.popcount with the following (SSE2 implementation based on HACKMEM):
__ xorps(tmp, tmp);
__ pavgb(tmp, src);
__ Move(dst, src);
__ andps(tmp,
         __ ExternalReferenceAsOperand(
             ExternalReference::address_of_wasm_i8x16_splat_0x55()));
__ psubb(dst, tmp);
Operand splat_0x33 = __ ExternalReferenceAsOperand(
    ExternalReference::address_of_wasm_i8x16_splat_0x33());
__ movaps(tmp, dst);
__ andps(dst, splat_0x33);
__ psrlw(tmp, 2);
__ andps(tmp, splat_0x33);
__ paddb(dst, tmp);
__ movaps(tmp, dst);
__ psrlw(dst, 4);
__ paddb(dst, tmp);
__ andps(dst,
         __ ExternalReferenceAsOperand(
             ExternalReference::address_of_wasm_i8x16_splat_0x0f()));
The resulting V8 build further improves the i8x16.popcount microkernel timing: 316 us -> 304 us.
The latest v8 build has this optimization https://chromium.googlesource.com/v8/v8/+/173d660849c1e4f43b15b67321b074a89b73f9ce so that's probably why it got slightly faster.
The optimization for Silvermont (and similar Atom-like processors) is merged in https://chromium.googlesource.com/v8/v8/+/71fc222f4930d7f9425e45ee1b2111bee742f20e
We voted on this instruction at the 1/29/21 meeting with these results: SF 1, F 6, N 3, A 2, SA 1. This is closer to the border between clear consensus and clear lack of consensus than many other votes we've taken, but I am inclined to interpret this vote as consensus to include the instruction. Here are the considerations leading me to this decision:
I believe the suggested lowering for SSE2 is incorrect. @Maratyszcza
PAVGB is (a + b + 1) >> 1 (it rounds up!), so one would want to compute x & (0x55 << 1) first and then PAVGB, or else follow the original formula of (x >> 1) & 0x55.
Chromium also seems affected by this. @ngzhian
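A small scalar self-check of the point above (my own illustration, not from the thread): per byte, PAVGB with a zero operand computes (x + 1) >> 1, which rounds up, while masking with 0xAA (i.e. 0x55 << 1) before the average reproduces the original (x >> 1) & 0x55 term exactly:

#include <assert.h>
#include <stdint.h>

int main(void) {
  for (uint32_t x = 0; x < 256; x++) {
    const uint32_t wanted = (x >> 1) & 0x55;            // HACKMEM first-step term
    const uint32_t avg    = (x + 0 + 1) >> 1;           // PAVGB(0, x): rounds up
    const uint32_t fixed  = ((x & 0xAA) + 0 + 1) >> 1;  // PAVGB(0, x & 0xAA)
    assert(fixed == wanted);             // the masked variant always matches
    if (x == 1) assert(avg != wanted);   // x = 1 gives 1 vs 0: the off-by-one
  }
  return 0;
}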
@aqrit Good catch, updated the PR description and filed issue 11591 for V8.
@aqrit On a side note, would you be interested in participating in WAsm SIMD meetings? The next one will be on April 9 (see WebAssembly/flexible-vectors#31 for details).
Thanks for the report, this is fixed in v8 in https://crrev.com/c/2780218, should go out in canary tomorrow.
Introduction
This PR introduces a SIMD variant of the Population Count operation. This operation counts the number of bits set to 1 and is commonly used in algorithms involving the Hamming distance between two binary vectors. Population Count is represented in the scalar WebAssembly instruction set and in many native SIMD instruction sets. E.g. the ARM NEON and AVX512BITALG extensions natively support the 8-bit lane variant introduced in this proposal. Implementation in SSSE3 is less trivial, but still more efficient than emulation with the i8x16.swizzle instruction due to the elimination of the PADDUSB instruction and the direct lowering of the table lookups into PSHUFB. Moreover, SIMD Population Count can be efficiently implemented on SSE2-level processors, where i8x16.swizzle-based emulation would have been scalarized.
Applications
Mapping to Common Instruction Sets
This section illustrates how the new WebAssembly instructions can be lowered on common instruction sets. However, these patterns are provided only for convenience; compliant WebAssembly implementations do not have to follow the same code generation patterns.
x86/x86-64 processors with AVX512BITALG and AVX512VL instruction sets
y = i8x16.popcnt(x) is lowered to VPOPCNTB xmm_y, xmm_x
x86/x86-64 processors with AVX instruction set
y = i8x16.popcnt(x) is lowered to:
VMOVDQA xmm_tmp0, [wasm_i8x16_splat(0x0F)]
VPANDN xmm_tmp1, xmm_tmp0, xmm_x
VPAND xmm_y, xmm_tmp0, xmm_x
VMOVDQA xmm_tmp0, [wasm_i8x16_const(0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4)]
VPSRLW xmm_tmp1, xmm_tmp1, 4
VPSHUFB xmm_y, xmm_tmp0, xmm_y
VPSHUFB xmm_tmp1, xmm_tmp0, xmm_tmp1
VPADDB xmm_y, xmm_y, xmm_tmp1
x86/x86-64 processors with SSSE3 instruction set
y = i8x16.popcnt(x) is lowered to:
MOVDQA xmm_tmp0, [wasm_i8x16_splat(0x0F)]
MOVDQA xmm_tmp1, xmm_x
PAND xmm_tmp1, xmm_tmp0
PANDN xmm_tmp0, xmm_x
MOVDQA xmm_y, [wasm_i8x16_const(0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4)]
PSRLW xmm_tmp0, 4
PSHUFB xmm_y, xmm_tmp1
MOVDQA xmm_tmp1, [wasm_i8x16_const(0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4)]
PSHUFB xmm_tmp1, xmm_tmp0
PADDB xmm_y, xmm_tmp1
x86/x86-64 processors with SSE2 instruction set
y = i8x16.popcnt(x) is lowered to (based on @WojciechMula's algorithm here):
MOVDQA xmm_tmp, xmm_x
PSRLW xmm_tmp, 1
MOVDQA xmm_y, xmm_x
PAND xmm_tmp, [wasm_i8x16_splat(0x55)]
PSUBB xmm_y, xmm_tmp
MOVDQA xmm_tmp, xmm_y
PAND xmm_y, [wasm_i8x16_splat(0x33)]
PSRLW xmm_tmp, 2
PAND xmm_tmp, [wasm_i8x16_splat(0x33)]
PADDB xmm_y, xmm_tmp
MOVDQA xmm_tmp, xmm_y
PSRLW xmm_tmp, 4
PADDB xmm_y, xmm_tmp
PAND xmm_y, [wasm_i8x16_splat(0x0F)]
ARM64 processors
y = i8x16.popcnt(x) is lowered to CNT Vy.16B, Vx.16B
ARMv7 processors with NEON instruction set
y = i8x16.popcnt(x) is lowered to VCNT.8 Qy, Qx
POWER processors with POWER 2.07+ instruction set and VMX
y = i8x16.popcnt(x) is lowered to VPOPCNTB VRy, VRx
MIPS processors with MSA instruction set
y = i8x16.popcnt(x) is lowered to PCNT.B Wy, Wx
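For readers who want to try the SSE2 lowering above natively, here is a sketch of the same algorithm written as C intrinsics (my own transcription, assuming <emmintrin.h>; it is not part of the proposal text):

#include <emmintrin.h>

// Per-byte population count of a 128-bit vector, mirroring the SSE2 lowering above.
static inline __m128i popcnt_i8x16_sse2(__m128i x) {
  const __m128i m55 = _mm_set1_epi8(0x55);
  const __m128i m33 = _mm_set1_epi8(0x33);
  const __m128i m0f = _mm_set1_epi8(0x0F);
  // y = x - ((x >> 1) & 0x55): counts in 2-bit fields
  __m128i y = _mm_sub_epi8(x, _mm_and_si128(_mm_srli_epi16(x, 1), m55));
  // y = (y & 0x33) + ((y >> 2) & 0x33): counts in 4-bit fields
  y = _mm_add_epi8(_mm_and_si128(y, m33),
                   _mm_and_si128(_mm_srli_epi16(y, 2), m33));
  // y = (y + (y >> 4)) & 0x0F: per-byte counts
  return _mm_and_si128(_mm_add_epi8(y, _mm_srli_epi16(y, 4)), m0f);
}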