WebAssembly / simd

Branch of the spec repo scoped to discussion of SIMD in WebAssembly

movemask instruction #131

Closed zeux closed 4 years ago

zeux commented 5 years ago

SSE2 exposes the movemask instruction (pmovmskb), which extracts the most significant bit from each of the 16 bytes of a vector and returns a 16-bit integer with all the bits combined.

This instruction is very useful for certain types of processing. For example, when performing byte-wise processing on strings, such as scanning a string for a specific character (memchr), pmovmskb can be used to produce a mask of the occurrences of the character in the string; bsf/bsr (ctz/clz in WebAssembly terms) can then be used to quickly iterate over the bits set in this mask (see "Regular expression search" in https://zeux.io/2019/04/20/qgrep-internals/ as one example).
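
As a concrete illustration, here's a minimal SSE2 sketch of that memchr-style scan (the function and its alignment/size assumptions are illustrative, not taken from the linked post):

#include <emmintrin.h> // SSE2
#include <stddef.h>

// Scan a 16-byte-aligned buffer whose size is a multiple of 16 for the first occurrence of c.
static const char* find_byte_sse2(const char* data, size_t size, char c)
{
    __m128i needle = _mm_set1_epi8(c);
    for (size_t i = 0; i < size; i += 16)
    {
        __m128i chunk = _mm_load_si128((const __m128i*)(data + i));
        int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(chunk, needle));
        if (mask)
            return data + i + __builtin_ctz(mask); // bsf/ctz: index of the first matching byte
    }
    return NULL;
}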

It is also used in fast integer decoding in my vertex data decompressor; in its absence I have to emulate it using scalar math, see https://github.com/zeux/meshoptimizer/blob/master/src/vertexcodec.cpp#L712 - on x64 using the same fallback results in a ~15% performance penalty to the overall benchmark despite the fact that the instruction is not dominating the execution cost otherwise.

On x86/x64, movemask directly maps to pmovmskb (SSE2).

On PowerPC, movemask can be implemented with the vbpermq instruction, typically either as lvsl+vector shift+vbpermq or as load+vbpermq.

On NEON, movemask isn't available natively, but on AArch64 it can be easily synthesized with horizontal adds - you take the mask, replace each byte that has its high bit set with the power of two corresponding to its byte index (this takes a couple of vector instructions), and use vaddv_u8 on each half of the vector. On ARMv7 with NEON you can emulate the two vaddv_u8 with three vpadd_u8, so the cost is still somewhat reasonable (6 vector instructions plus a couple of scalar instructions to build the 16-bit mask from two 8-bit halves).
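
A minimal sketch of that ARMv7 path (the helper name is an assumption; this mirrors the well-known NEON movemask idiom described above rather than code from this thread):

#include <arm_neon.h>
#include <stdint.h>

// ARMv7 NEON: one AND plus three pairwise adds reduce the per-byte bits down
// to one byte per 8-lane half.
static uint16_t armv7_movemask_u8(uint8x16_t mask)
{
    static const uint8_t byte_mask_data[16] = {1, 2, 4, 8, 16, 32, 64, 128, 1, 2, 4, 8, 16, 32, 64, 128};

    uint8x16_t masked = vandq_u8(mask, vld1q_u8(byte_mask_data));

    uint8x8_t sum1 = vpadd_u8(vget_low_u8(masked), vget_high_u8(masked));
    uint8x8_t sum2 = vpadd_u8(sum1, sum1);
    uint8x8_t sum3 = vpadd_u8(sum2, sum2);

    // lane 0 holds the mask for bytes 0-7, lane 1 the mask for bytes 8-15
    return (uint16_t)(vget_lane_u8(sum3, 0) | (vget_lane_u8(sum3, 1) << 8));
}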

I'm not sure what the emulation strategy would be on MIPS / RISC-V.

I wanted to file this to get a sense of whether this instruction clears the bar, given the balance of "performance cliffs" it would create on the various architectures.

I'm generally happy with WASM SIMD, but the problem with movemask is that there are no other SIMD instructions in WASM SIMD that provide a reasonable emulation path (in particular, no horizontal adds - they are a bit less exotic than vbpermq). Of course horizontal adds also have a non-trivial cost on various architectures, including x64, so emulating movemask through horizontal adds is bound to result in worse performance on x64 compared to a natively supported instruction.

arunetm commented 5 years ago

Operations that are useful and hard to emulate are high-value additions to the wasm simd feature set. Since "reasonable cost" can often be subjective, perf cliffs are ideally quantified through architecture-specific performance data when needed. Unless we already have a consensus here to add movemask, any reason not to should be justified with data on the emulation penalty.

sunfishcode commented 5 years ago

This is one of the things that motivated the all_true and any_true functions in the current proposal, as those cover two common uses of movmskps; by being slightly higher-level, they give non-x86 architectures more flexibility.

In theory, clang could support existing code that uses movmskps intrinsics in some cases by pattern-matching a movmskps intrinsic followed by a comparison of the result with 0 to an all_true or any_true.

This approach could be extended by adding instructions such as first_true and last_true, which could handle the bsr/bsf use cases, and num_trues or other things for other use cases. This isn't a perfect approach, because there will always be some use cases that clang can't pattern-match into all_true/any_true/first_true/last_true/etc.; however, it has the advantage that for the use cases clang can pattern-match, it makes things simpler for non-x86 architectures.
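
As a rough illustration of the pattern-matching idea (the wasm_simd128.h intrinsic names below come from current Emscripten headers and are used here only as an assumption about what the patterns would lower to):

#include <wasm_simd128.h>
#include <stdbool.h>

// Idioms a compiler could recognize and rewrite:
//   movemask(v) == 0xffff  ->  all lanes true
//   movemask(v) != 0       ->  any lane true
static bool all_bytes_equal(v128_t a, v128_t b)
{
    // Instead of emulating movemask and comparing against 0xffff,
    // emit the higher-level instruction directly.
    return wasm_i8x16_all_true(wasm_i8x16_eq(a, b));
}

static bool any_byte_equal(v128_t a, v128_t b)
{
    return wasm_v128_any_true(wasm_i8x16_eq(a, b));
}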

gnzlbg commented 5 years ago

I tend to agree with @sunfishcode in that many use cases could be better served by other instructions.

It is unclear how much value a vNxM.msb_bitmask instruction would add on its own. However, if we had a vNxM.bitmask_select instruction, then the motivation for a vNxM.msb_bitmask instruction becomes much stronger, so it might be worth it to consider both as a "pair" of instructions to add.

zeux commented 5 years ago

Just to be clear - I agree that all_true / any_true are good specific versions of movemask. They solve most of the use cases, and they can be implemented more efficiently on many architectures.

The issue is that sometimes, they aren't enough and you need movemask specifically. When you do need movemask, it's not clear how to efficiently synthesize it out of existing WASM SIMD instructions.

@gnzlbg can you clarify what bitmask_select would do? Is the idea that it would be similar to movemask but with an immediate bit location specified, so that bitmask_select(7) is equivalent to msb_bitmask?

gnzlbg commented 5 years ago

Right now we have a v128.bitselect(v1: v128, v2: v128, c: v128) -> v128 instruction, where the mask is 128 bits wide. For a v8x16 the mask only needs to be 16 bits wide if the intent is to select 8-bit-wide vector lanes, so we could add a v8x16.bitselect(v1: v8x16, v2: v8x16, c: i16) -> v128 instruction that selects the values of the 16 vector lanes according to the bits of the i16.

If we had such an instruction, it would be very useful to also add a way to compute such masks from a vector efficiently, e.g., a v8x16.bitmask(x: v8x16) -> i16 instruction. Such an instruction would take one bit from each vector lane and pack the bits into an integer, e.g., zero-extending the result. If the bit taken is the MSB of each lane, the instruction can be lowered to the movmsk variants on x86.

That would mean that one can write v128.bitselect(a, b, v8x16.eq(c, d)) or v8x16.bitselect(a, b, v8x16.bitmask(v8x16.eq(c, d))), and one might be better than the other depending on the underlying hardware and on how much the machine code generator optimizes the resulting code.
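
A scalar sketch of the proposed semantics, assuming the bit taken is the MSB of each lane (names and the lane-array representation are purely illustrative):

#include <stdint.h>

// v8x16.bitmask(x) -> i16: take the MSB of each of the 16 byte lanes and
// pack them into bits 0..15 of the result (zero-extended).
static uint16_t v8x16_bitmask_ref(const uint8_t lanes[16])
{
    uint16_t mask = 0;
    for (int i = 0; i < 16; ++i)
        mask |= (uint16_t)(lanes[i] >> 7) << i;
    return mask;
}

// v8x16.bitselect(v1, v2, c) -> v128: lane i comes from v1 if bit i of c is
// set, otherwise from v2.
static void v8x16_bitselect_ref(uint8_t out[16], const uint8_t v1[16],
                                const uint8_t v2[16], uint16_t c)
{
    for (int i = 0; i < 16; ++i)
        out[i] = ((c >> i) & 1) ? v1[i] : v2[i];
}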

mratsim commented 5 years ago

I personally use movemask to implement a fast clamp (restricting the input to a [min, max] bound). The min/max vector intrinsics are very slow and have a carried dependency. Casting to int and using movemask allows me to check at very little cost whether those intrinsics are needed or whether I have to follow the slow path: https://github.com/numforge/laser/blob/e660eeeb723426e80a7b1187864323d85527d18c/laser/primitives/simd_math/exp_log_sse2.nim#L41

Usage: I use it to implement fast vectorized exponentiation (10x faster than math.h), and fast clamping brings noticeable benefits. Also, in machine learning it is common to clamp inputs or parameters (see "gradient clipping").
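
Not the linked Nim implementation - just a hedged SSE sketch of that fast-path check, with illustrative names:

#include <emmintrin.h> // SSE2 (the comparisons and movemask_ps are SSE1)

// Returns nonzero if any lane of x falls outside [lo, hi], i.e. the slow
// clamping path is needed; otherwise the clamp can be skipped entirely.
static int needs_clamp_ps(__m128 x, __m128 lo, __m128 hi)
{
    __m128 below = _mm_cmplt_ps(x, lo);
    __m128 above = _mm_cmpgt_ps(x, hi);
    return _mm_movemask_ps(_mm_or_ps(below, above)) != 0;
}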

zeux commented 5 years ago

@mratsim Note that the use case you have can be solved using the i32x4.any_true instruction.

zeux commented 5 years ago

Through experimentation I've discovered a better way to emulate movemask, at least given a 64-bit target platform. It requires fewer scalar instructions at the cost of doing most math in "64-bit SIMD":

static void wasmMoveMask(v128_t mask, unsigned char& mask0, unsigned char& mask1)
{
    // Interleave 32-bit lanes so the low 64-bit half holds bytes 0-3/8-11
    // and the high half holds bytes 4-7/12-15.
    v128_t mask_0 = wasmx_shuffle_v32x4(mask, 0, 2, 1, 3);

    // TODO: when Chrome supports v128.const we can try doing vectorized and?
    // Keep one distinct bit per byte (1/2/4/8 in one half, 16/32/64/128 in the other).
    uint64_t mask_1a = wasm_i64x2_extract_lane(mask_0, 0) & 0x0804020108040201ull;
    uint64_t mask_1b = wasm_i64x2_extract_lane(mask_0, 1) & 0x8040201080402010ull;

    // Fold the bits together so byte 0 accumulates the mask for bytes 0-7
    // and byte 4 accumulates the mask for bytes 8-15.
    uint64_t mask_2 = mask_1a | mask_1b;
    uint64_t mask_4 = mask_2 | (mask_2 >> 16);
    uint64_t mask_8 = mask_4 | (mask_4 >> 8);

    mask0 = uint8_t(mask_8);
    mask1 = uint8_t(mask_8 >> 32);
}

This is still much more expensive than a single instruction on x64, but is starting to approach NEON emulation cost.

I think the general sentiment here is that movemask emulation cost on various SIMD architectures is imbalanced enough that it's not an obvious candidate for the cross-platform SIMD subset. So with that in mind I'm going to close this for now; we can reopen it if more data is presented in favor of native support.

AndrewScheidecker commented 5 years ago

I think the general sentiment here is that movemask emulation cost on various SIMD architectures is imbalanced enough that it's not an obvious candidate for the cross-platform SIMD subset. So with that in mind I'm going to close this for now; we can reopen it if more data is presented in favor of native support.

I'd like to see how this wasmMoveMask -> WASM -> NEON compares to neonMoveMask -> NEON (which is what a hypothetical emulated i8x16.ltz_mask instruction could do).

I think we should not add an instruction that is too slow to be useful on NEON targets, but for your application at least, it seems like an emulated instruction would still beat what you can do without it.

zeux commented 5 years ago

@AndrewScheidecker The NEON emulation path looks like this (AArch64 variant):

static void neonMoveMask(uint8x16_t mask, unsigned char& mask0, unsigned char& mask1)
{
    static const unsigned char byte_mask_data[16] = {1, 2, 4, 8, 16, 32, 64, 128, 1, 2, 4, 8, 16, 32, 64, 128};

    uint8x16_t byte_mask = vld1q_u8(byte_mask_data);
    uint8x16_t masked = vandq_u8(mask, byte_mask);

    mask0 = vaddv_u8(vget_low_u8(masked));
    mask1 = vaddv_u8(vget_high_u8(masked));
}

I would expect this to be faster than the scalar emulation above. However, for other architectures (MIPS/RISCV) I'm not sure what the codegen strategy even should be (use the scalar path?).
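
One plausible scalar lowering - a sketch, not anything mandated by the proposal: treat the vector as two 64-bit halves and gather the per-byte sign bits with a multiply.

#include <stdint.h>

// Gather the sign bit of each byte in a 64-bit half into 8 contiguous bits.
// The multiply is carry-free: every partial product lands at a distinct bit
// position, and the bits we want accumulate in the top byte.
static uint32_t movemask_half(uint64_t half)
{
    return (uint32_t)(((half & 0x8080808080808080ull) * 0x0002040810204081ull) >> 56);
}

static uint32_t scalar_movemask_u8(uint64_t lo, uint64_t hi)
{
    return movemask_half(lo) | (movemask_half(hi) << 8);
}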

MaxGraey commented 4 years ago

@zeux It seems only movemask_epi8 is hard to emulate on NEON; the rest could be emulated on AArch64 as:

uint16_t movemask_epu16(v128_t val) {
  const uint16x8_t mask = { 1, 2, 4, 8, 16, 32, 64, 128 };
  return vaddvq_u16(vandq_u16((uint16x8_t)val, mask));
}

uint32_t movemask_epu32(v128_t val) {
  const uint32x4_t mask = { 1, 2, 4, 8 };
  return vaddvq_u32(vandq_u32((uint32x4_t)val, mask));
} 

uint64_t movemask_epu64(v128_t val) {
  const uint64x2_t mask = { 1, 2 };
  return vaddvq_u64(vandq_u64((uint64x2_t)val, mask));
}

For 32-bit ARM:

uint32_t movemask_epu32(v128_t val) {
  const uint32x4_t mask = { 1, 2, 4, 8 };
  const uint32x4_t av = vandq_u32((uint32x4_t)val, mask);
  const uint32x4_t xv = vextq_u32(av, av, 2);
  const uint32x4_t ov = vorrq_u32(av, xv);
  return vgetq_lane_u32(vorrq_u32(ov, vextq_u32(ov, ov, 3)), 0);
}

jan-wassenberg commented 4 years ago

Hi! Brief introduction: I've had fun with SIMD since 2002, currently working on image compression (JPEG XL) and hashing at Google.

Glad to see this discussion, we also have a use case for 'get bits from mask': comparing four things in parallel and interpreting the resulting 4 bits as a number 0-15 (for traversing a decision tree).

I understand that performance portability is a concern. Looks like not supporting this means applications that want it spend ~11 operations (wasmMoveMask), whereas ARM emulation would involve only 5. Isn't that a win?

  // here 'masked' is the byte mask ANDed with {1, 2, ..., 128, 1, 2, ..., 128}, as in neonMoveMask above
  const uint8x8_t x2 = vget_low_u8(vpaddq_u8(masked, masked));
  const uint8x8_t x4 = vpadd_u8(x2, x2);
  const uint8x8_t x8 = vpadd_u8(x4, x4);
  return vreinterpret_u16_u8(x8)[0];

Ironically, I've thought that multi-operation intrinsics might be counterproductive if people use them more than necessary, but that only applies when there are reasonable alternatives. all_true/any_true are nice, but we've seen several use cases where they are insufficient.

BTW, how about a count_true function that takes masks (FF..FF or 0 per lane) and returns the number of lanes that are not 0? On x86 this would be movemask+popcnt; on ARMv8/v7 it would be as described here, but ANDing with 1 instead of 1,2,4,8...
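
A hedged sketch of what that count_true could look like on the two targets (the function name and guards are placeholders; the two paths follow the description above):

#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>

// x86: count lanes whose comparison result is all-ones via movemask + popcnt.
static int count_true_u8(__m128i mask)
{
    return __builtin_popcount(_mm_movemask_epi8(mask));
}
#elif defined(__aarch64__)
#include <arm_neon.h>

// AArch64: AND each lane with 1 and horizontally add.
static int count_true_u8(uint8x16_t mask)
{
    return vaddvq_u8(vandq_u8(mask, vdupq_n_u8(1)));
}
#endif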

zeux commented 4 years ago

Since there's ongoing discussion on this I'll reopen this for now.

@MaxGraey Thanks for bringing this up, this is true - I only considered 8-bit movemask because that's what I need usually. If something does get standardized it would make sense to at least standardize the 32-bit variant because that's also commonly useful.

@jan-wassenberg Yeah, I think it is the case that when you have to use a movemask, it's going to be beneficial to have it as part of the spec, even if the lowering is inefficient - you can't implement it better yourself. So on these grounds I would prefer that the spec adds this instruction.

However, an alternative side of this is something I've frequently encountered with the current SIMD spec: if you aren't sure you have to use a movemask (if you haven't exhausted alternative implementation strategies), it's really tempting to use it, and if there's a performance cliff it's going to be hard to find. This might be endemic to the idea of portable SIMD, but we can try to minimize this.

The last few examples I've run into are v8x16.shuffle, which produces really poor fallback code in v8 when it fails to pattern-match to an optimized variant (although the codegen can be vastly improved in theory, see https://bugs.chromium.org/p/v8/issues/detail?id=10117), and f32x4.max, which sounds like a great idea for clamping a value to 0 (max(splat(0.f), v4)) but is actually really bad on x64 because of the NaN handling strategy, so it's much faster to use something like and(v4, cmp_ge(v4, splat(0.f))).
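
A small sketch of those two clamp-to-zero strategies using current wasm_simd128.h intrinsic names (an assumption; the thread predates these exact names):

#include <wasm_simd128.h>

// Clamp negative lanes to zero via f32x4.max: simple, but on x64 the
// NaN-propagation rules of the wasm instruction make the lowering expensive.
static v128_t clamp0_max(v128_t v)
{
    return wasm_f32x4_max(wasm_f32x4_splat(0.0f), v);
}

// Alternative: keep lanes that compare >= 0 and zero out the rest;
// this lowers to a compare plus an AND on common architectures.
static v128_t clamp0_cmp_and(v128_t v)
{
    return wasm_v128_and(v, wasm_f32x4_ge(v, wasm_f32x4_splat(0.0f)));
}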

So there's something to be said about not exposing movemask, since that makes it less likely that, when more efficient alternative implementations exist, movemask is chosen and results in a cliff on a less widely used architecture. For example, movemask_u8(v) == 0xffff would be a suboptimal way to implement all_true(v).

jan-wassenberg commented 4 years ago

I agree that it helps to include all data types. @MaxGraey has shown they are efficient on ARM, and for u16 on x86 we can use the 8-bit path after _mm_packs_epi16(x, 0).
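
For concreteness, a minimal sketch of that u16 path on SSE2 (the wrapper name is illustrative):

#include <emmintrin.h> // SSE2

// 8-lane u16 movemask: pack the 0/0xFFFF mask lanes down to bytes with
// signed saturation, then reuse the byte movemask; the result is 8 bits.
static int movemask_epu16_sse2(__m128i mask)
{
    return _mm_movemask_epi8(_mm_packs_epi16(mask, _mm_setzero_si128()));
}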

@zeux I totally agree about the "moral hazard" (and was recently advocating the same position). Perhaps one compromise would be to include movemask but have some kind of naming that indicates this is potentially slow/exotic? In Highway we put such operations in an ext namespace.

Interesting about f32x4.max. Would a zero_if_negative operation help? That could use blendvps on x86, or your implementation on other platforms.

zeux commented 4 years ago

zero_if_negative is a bit too specific; one alternative implementation strategy that would help here is signselect which is proposed in #124.

jan-wassenberg commented 4 years ago

Oh, hadn't seen that one yet, thanks. signselect indeed looks like a good replacement for zero_if_negative.

juj commented 4 years ago

The lack/slowness of emulating movemask came up recently in our tests as well.

Also visible in Emscripten support for SSE 1 and its synthetic microbenchmark.

+1 for adding the movemask instruction to Wasm SIMD. In general, I prefer the direction of adding direct instructions rather than higher-level pseudoinstructions that do not exist in any hardware SIMD instruction set, because

a) cross-compiling native code to pseudoinstructions will not be possible,
b) adding the direct instructions takes fewer instructions as well - N pseudoinstructions to cover the use cases of a single real instruction would take up more opcode space, and could still leave out an important use case, and
c) people already know the direct instructions, but would need to invest developer time to rewrite known algorithms for a new virtual instruction set - and may still get a suboptimal result.

If Wasm SIMD MVP/v1 is to be like a set intersection of SSE and NEON, philosophically it would be strongly preferable for v2 to look more like a set union of SSE and NEON, as opposed to v2 becoming a "fantasy SIMD" instruction set that would try to catch high level use cases with virtual instructions that do not exist in any relevant hardware.

wuxb45 commented 3 years ago

@AndrewScheidecker The NEON emulation path looks like this (AArch64 variant):

static void neonMoveMask(uint8x16_t mask, unsigned char& mask0, unsigned char& mask1)
{
  static const unsigned char byte_mask_data[16] = {1, 2, 4, 8, 16, 32, 64, 128, 1, 2, 4, 8, 16, 32, 64, 128};

  uint8x16_t byte_mask = vld1q_u8(byte_mask_data);
  uint8x16_t masked = vandq_u8(mask, byte_mask);

  mask0 = vaddv_u8(vget_low_u8(masked));
  mask1 = vaddv_u8(vget_high_u8(masked));
}

I would expect this to be faster than the scalar emulation above. However, for other architectures (MIPS/RISCV) I'm not sure what the codegen strategy even should be (use the scalar path?).

This movemask_u8 could be slightly faster (1% faster on a Raspberry Pi 4):

uint32_t aarch64_movemask_u8(uint8x16_t v)
{
    static const uint8x16_t idx = {0, 8, 1, 9, 2, 10, 3, 11, 4, 12, 5, 13, 6, 14, 7, 15};
    static const uint16x8_t mbits = {0x0101, 0x0202, 0x0404, 0x0808, 0x1010, 0x2020, 0x4040, 0x8080};
    const uint8x16_t perm = vqtbl1q_u8(v, idx); // reorder
    return vaddvq_u16(vandq_u16(vreinterpretq_u16_u8(perm), mbits));
}

zeux commented 3 years ago

@wuxb45 fwiw I've tested this variant on one of my native ARM benchmarks and it looks like it's actually slightly slower on Amazon Graviton2. So perhaps the delta varies based on the specific CPU used.