Maratyszcza commented 3 years ago

Introduction

As discussed in #348, WebAssembly SIMD specification is functionally incomplete w.r.t double-precision computations as it doesn't cover many basic C/C++ operations, particularly conversions between double type and other types. Currently any code that use these operations would be de-vectorized (see examples in emmintrin.h header in Emscripten). Double-precision computations are extensively used in scientific computing, and providing a complete set of double-precision is a must to bring scientific computing applications to the Web. In the foreseeable future WebAssembly SIMD will be the only avenue for high-performance scientific computing in the Web, as neither WebGL nor WebGPU support double-precision computations.

This PR introduce the minimal set of conversion operations to fill this gap.

Applications

Conversions from double to int are used when storing data in quantized representation and in table-based algorithms (need to real indices to int index for table lookup).
Conversions from int to double are used when loading data in quantized representation and in random number generation (the raw output of RNGs is typically int, but numerically algorithms need real values).
Conversions between double and float are used in mixed-precision algorithms, where less sensitive parts of the application are computed in single precision and the more sensitive parts - in double precision. E.g. iterative algorithms may first run in single precision to convergence, and then starting with the single-precision solution run double precision iterations.

Use-cases in OpenCV

Mapping to Common Instruction Sets

This section illustrates how the new WebAssembly instructions can be lowered on common instruction sets. However, these patterns are provided only for convenience, compliant WebAssembly implementations do not have to follow the same code generation patterns.

x86/x86-64 processors with AVX512F + AVX512VL instruction sets

f64x2.convert_low_i32x4_u
- y = f64x2.convert_low_i32x4_u(x) is lowered to VCVTUDQ2PD xmm_y, xmm_x
i32x4.trunc_sat_f64x2_u_zero
- y = i32x4.trunc_sat_f64x2_u_zero(x) (y is NOT x) is lowered to:
- VXORPD xmm_y, xmm_y, xmm_y
- VMAXPD xmm_y, xmm_x, xmm_y
- VCVTTPD2UDQ xmm_y, xmm_y

x86/x86-64 processors with AVX instruction set

f64x2.convert_low_i32x4_s
- y = f64x2.convert_low_i32x4_s(x) is lowered to VCVTDQ2PD xmm_y, xmm_x
f64x2.convert_low_i32x4_u
- y = f64x2.convert_low_i32x4_u(x) is lowered to:
- VUNPCKLPS xmm_y, xmm_x, [wasm_i32x4_splat(0x43300000)]
- VSUBPD xmm_y, xmm_y, [wasm_f64x2_splat(0x1.0p+52)]
i32x4.trunc_sat_f64x2_s_zero
- y = i32x4.trunc_sat_f64x2_s_zero(x) (y is NOT x) is lowered to:
- VCMPEQPD xmm_y, xmm_x, xmm_x
- VANDPD xmm_y, [wasm_f64x2_splat(2147483647.0)]
- VMINPD xmm_y, xmm_x, xmm_y
- VCVTTPD2DQ xmm_y, xmm_y
i32x4.trunc_sat_f64x2_u_zero
- y = i32x4.trunc_sat_f64x2_u_zero(x) is lowered to:
- VXORPD xmm_tmp, xmm_tmp, xmm_tmp
- VMAXPD xmm_y, xmm_x, xmm_tmp
- VMINPD xmm_y, xmm_y, [wasm_f64x2_splat(4294967295.0)]
- VROUNDPD xmm_y, xmm_y, 0x0B
- VADDPD xmm_y, xmm_y, [wasm_f64x2_splat(0x1.0p+52)]
- VSHUFPS xmm_y, xmm_y, xmm_xmp, 0x88
f32x4.demote_f64x2_zero
- y = f32x4.demote_f64x2_zero(x) is lowered to VCVTPD2PS xmm_y, xmm_x
f64x2.promote_low_f32x4
- y = f64x2.promote_low_f32x4(x) is lowered to VCVTPS2PD xmm_y, xmm_x

x86/x86-64 processors with SSE4.1 instruction set

i32x4.trunc_sat_f64x2_u_zero
- y = i32x4.trunc_sat_f64x2_u_zero(x) is lowered to:
- MOVAPD xmm_y, xmm_x
- XORPD xmm_tmp, xmm_tmp
- MAXPD xmm_y, xmm_tmp
- MINPD xmm_y, [wasm_f64x2_splat(4294967295.0)]
- ROUNDPD xmm_y, xmm_y, 0x0B
- ADDPD xmm_y, [wasm_f64x2_splat(0x1.0p+52)]
- SHUFPS xmm_y, xmm_xmp, 0x88

x86/x86-64 processors with SSE2 instruction set

f64x2.convert_low_i32x4_s
- y = f64x2.convert_low_i32x4_s(x) is lowered to CVTDQ2PD xmm_y, xmm_x
f64x2.convert_low_i32x4_u
- y = f64x2.convert_low_i32x4_u(x) is lowered to:
- MOVAPS xmm_y, xmm_x
- UNPCKLPS xmm_y, [wasm_i32x4_splat(0x43300000)]
- SUBPD xmm_y, [wasm_f64x2_splat(0x1.0p+52)]
i32x4.trunc_sat_f64x2_s_zero
- y = i32x4.trunc_sat_f64x2_s_zero(x) is lowered to:
- XORPS xmm_tmp, xmm_tmp
- CMPEQPD xmm_tmp, xmm_x
- MOVAPS xmm_y, xmm_x
- ANDPS xmm_tmp, [wasm_f64x2_splat(2147483647.0)]
- MINPD xmm_y, xmm_tmp
- CVTTPD2DQ xmm_y, xmm_y
f32x4.demote_f64x2_zero
- y = f32x4.demote_f64x2_zero(x) is lowered to CVTPD2PS xmm_y, xmm_x
f64x2.promote_low_f32x4
- y = f64x2.promote_low_f32x4(x) is lowered to CVTPS2PD xmm_y, xmm_x

ARM64 processors

f64x2.convert_low_i32x4_s
- y = f64x2.convert_low_i32x4_s(x) is lowered to:
- SSHLL Vy.2D, Vx.2S, 0
- SCVTF Vy.2D, Vy.2D
f64x2.convert_low_i32x4_u
- y = f64x2.convert_low_i32x4_u(x) is lowered to:
- USHLL Vy.2D, Vx.2S, 0
- UCVTF Vy.2D, Vy.2D
i32x4.trunc_sat_f64x2_s_zero
- y = i32x4.trunc_sat_f64x2_s_zero(x) is lowered to:
- FCVTZS Vy.2D, Vx.2D
- SQXTN Vy.2S, Vy.2D
i32x4.trunc_sat_f64x2_u_zero
- y = i32x4.trunc_sat_f64x2_u_zero(x) is lowered to:
- FCVTZU Vy.2D, Vx.2D
- UQXTN Vy.2S, Vy.2D
f32x4.demote_f64x2_zero
- y = f32x4.demote_f64x2_zero(x) is lowered to FCVTN Vy.2S, Vx.2D
f64x2.promote_low_f32x4
- y = f64x2.promote_low_f32x4(x) is lowered to FCVTL Vy.2D, Vx.2S

ARMv7 processors with NEON

f64x2.convert_low_i32x4_s
- y = f64x2.convert_low_i32x4_s(x) (x in q0-q7) is lowered to:
- VCVT.F64.S32 Dy_lo, Sx_ll
- VCVT.F64.S32 Dy_hi, Sx_lh
f64x2.convert_low_i32x4_u
- y = f64x2.convert_low_i32x4_u(x) (x in q0-q7) is lowered to:
- VCVT.F64.U32 Dy_lo, Sx_ll
- VCVT.F64.U32 Dy_hi, Sx_lh
i32x4.trunc_sat_f64x2_s_zero
- y = i32x4.trunc_sat_f64x2_s_zero(x) (y in q0-q7) is lowered to:
- VCVT.S32.F64 Sy_ll, Dx_lo
- VCVT.S32.F64 Sy_lh, Dx_lo
- VMOV.I32 Dy_hi, 0
i32x4.trunc_sat_f64x2_u_zero
- y = i32x4.trunc_sat_f64x2_u_zero(x) (y in q0-q7) is lowered to:
- VCVT.U32.F64 Sy_ll, Dx_lo
- VCVT.U32.F64 Sy_lh, Dx_lo
- VMOV.I32 Dy_hi, 0
f32x4.demote_f64x2_zero
- y = f32x4.demote_f64x2_zero(x) is lowered to:
- VCVT.F32.F64 Sy_ll, Dx_lo
- VCVT.F32.F64 Sy_lh, Dx_hi
- VMOV.I32 Dy_hi, 0
f64x2.promote_low_f32x4
- y = f64x2.promote_low_f32x4(x) is lowered to:
- VCVT.F64.F32 Dy_lo, Sx_ll
- VCVT.F64.F32 Dy_hi, Sx_lh

ngzhian commented 3 years ago

How does ARM codegen look like? Similar to ARM64? Or complicated enough that we will likely do runtime calls?

Maratyszcza commented 3 years ago

Similar to ARM64, but with additional constraints on register allocation (only the first 8 NEON registers are accessible as S FPU registers).

ngzhian commented 3 years ago

Do you have more specific pointers to code that will benefit from this? And also plans for benchmarking if this is prototyped? I would also like to hear from more people whether this is useful or not.

Maratyszcza commented 3 years ago

These instructions are commonly used in scientific computing, but it is hard to provide code pointers as explicit vectorization in scientific computing codes is rare, they usually rely on compiler auto-vectorization of code in C/C++ and FORTRAN. My plan for evaluating these is to implement XorShift RNG with mapping of output to [0.0, 1.0]-range double-precision numbers.

jan-wassenberg commented 3 years ago

@Maratyszcza

implement XorShift RNG with mapping of output to [0.0, 1.0]-range double-precision numbers.

FYI there is code for the integer portion in https://gitlab.com/wg1/jpeg-xl/-/blob/master/lib/jxl/xorshift128plus-inl.h :)

dtig commented 3 years ago

Compiler auto-vectorization performance is somewhat nebulous, especially for cross platform performance for i64x2 operations. The last time we tried this IIRC we didn't see any benefit from autovectorizing. Perhaps one way to make a decision on many of these i64x2 operations is to evaluate a couple of sample scientific applications that might use these operations and see if the generated code would actually map to these operations.

@Maratyszcza (or anyone else) Are there any open source applications that we can evaluate in addition to the jpeg-xl link above?

jan-wassenberg commented 3 years ago

I'd like to express support for this. JPEG XL promotes floats to double and the Highway math library requires f64 -> i32 conversions.

ngzhian commented 3 years ago

Promote and demote maps to native nicely. The others, specially f64x2->i32x4 looks bad. We shouldn't encourage the use of these.

jan-wassenberg commented 3 years ago

f64x2->i32x4 looks bad. We shouldn't encourage the use of these.

Can you help me understand your concern? Math libraries need floatN -> intM to do bitwise ops (shifting exponent into place for LoadExp). For N=64 (double), x86 only provides M=32. This works out OK for ARM as well. double->int64 would be much more expensive on x86. Do you see a better alternative?

ngzhian commented 3 years ago

I'm only pointing out that the codegen for i32x4.trunc_sat_f64x2_s_zero looks bad, it doesn't map well to native instruction at all. I don't know what LoadExp is. For math libraries, would they choose to use this instruction given that it has such poor mapping? Or would they come up with workarounds for the specific use case? The obvious alternative is to scalarize, it would be slower, but these SIMD instructions are slow too.

Maratyszcza commented 3 years ago

i32x4.trunc_sat_f64x2_s_zero lowers to multiple instructions on x86 because it has to simulate WAsm semantics for out-of-bounds values. Scalar instructions would have to simulate it too, one lane at a time.

ngzhian commented 3 years ago

For i32x4.trunc_sat_f64x2_s_zero SSE2 lowering, I think the first instruction is wrong:

XORPS xmm_tmp, xmm_tmp
CMPEQPD xmm_tmp, xmm_x
...

This is a compare with 0, should it be

MOVAPS xmm_tmp, xmm_x
CMPEQPD xmm_tmp, xmm_tmp
...

?

ngzhian commented 3 years ago

Any suggested lowering for SSE2 i32x4.trunc_sat_f64x2_u_zero, translate the AVX to non-AVX?

dtig commented 3 years ago

Adding a preliminary vote for the inclusion of double-precision conversion operations to the SIMD proposal below. Please vote with -

👍 For including double-precision conversion operations 👎 Against including double-precision conversion operations

penzn commented 3 years ago

I don't think using a poor SIMD sequence instead of a poor scalar sequence is really a solution, plus we have no performance data on this. I don't think this instruction is ready to be included.

Maratyszcza commented 3 years ago

@penzn I'm working on performance data. Expect it later today.

Maratyszcza commented 3 years ago

To evaluate this proposal I implemented SIMD and semi-SIMD (mostly SIMD, but using scalar conversions) versions of the mixed-precision dot product operating where inputs are single-precision floats, but accumulators and outputs are double-precision.

Here's the SIMD version:

double DotProd(size_t n, const float* x, const float* y) {
  assert(n != 0);
  assert(n % 4 == 0);

  v128_t vacc_lo = wasm_f64x2_const(0.0, 0.0);
  v128_t vacc_hi = wasm_f64x2_const(0.0, 0.0);
  do {
    v128_t vx = wasm_v128_load(x);
    x += 4;

    v128_t vy = wasm_v128_load(y);
    y += 4;

    vacc_lo = wasm_f64x2_add(vacc_lo,
      wasm_f64x2_mul(
        __builtin_wasm_promote_low_f32x4_f64x2(vx),
        __builtin_wasm_promote_low_f32x4_f64x2(vy)));
    vx = wasm_v32x4_shuffle(vx, vx, 2, 3, 0, 1);
    vy = wasm_v32x4_shuffle(vy, vy, 2, 3, 0, 1);
    vacc_hi = wasm_f64x2_add(vacc_hi,
      wasm_f64x2_mul(
        __builtin_wasm_promote_low_f32x4_f64x2(vx),
        __builtin_wasm_promote_low_f32x4_f64x2(vy)));

    n -= 4;
  } while (n != 0);
  const v128_t vacc = wasm_f64x2_add(vacc_lo, vacc_hi);
  return wasm_f64x2_extract_lane(vacc, 0) + wasm_f64x2_extract_lane(vacc, 1);
}

And here's the semi-SIMD version:

double DotProd(size_t n, const float* x, const float* y) {
  assert(n != 0);
  assert(n % 4 == 0);

  v128_t vacc_lo = wasm_f64x2_const(0.0, 0.0);
  v128_t vacc_hi = wasm_f64x2_const(0.0, 0.0);
  do {
    const v128_t vx = wasm_v128_load(x);
    x += 4;

    const v128_t vy = wasm_v128_load(y);
    y += 4;

    vacc_lo = wasm_f64x2_add(vacc_lo,
      wasm_f64x2_mul(
        wasm_f64x2_make(
          (double) wasm_f32x4_extract_lane(vx, 0),
          (double) wasm_f32x4_extract_lane(vx, 1)),
        wasm_f64x2_make(
          (double) wasm_f32x4_extract_lane(vy, 0),
          (double) wasm_f32x4_extract_lane(vy, 1))));
    vacc_hi = wasm_f64x2_add(vacc_hi,
      wasm_f64x2_mul(
        wasm_f64x2_make(
          (double) wasm_f32x4_extract_lane(vx, 2),
          (double) wasm_f32x4_extract_lane(vx, 3)),
        wasm_f64x2_make(
          (double) wasm_f32x4_extract_lane(vy, 2),
          (double) wasm_f32x4_extract_lane(vy, 3))));

    n -= 4;
  } while (n != 0);
  const v128_t vacc = wasm_f64x2_add(vacc_lo, vacc_hi);
  return wasm_f64x2_extract_lane(vacc, 0) + wasm_f64x2_extract_lane(vacc, 1);
}

So far V8 implements the double-precision conversion instructions only on x86-64 (ARM64 implementation is in review), and the results on x86-64 systems are presented below:

Implementation	Performance, scalar conversions	Performance, SIMD conversions	Speedup
AMD PRO A10-8700B	4.55 GB/s	9.65 GB/s	2.1X
AMD A4-7210	1.37 GB/s	4.94 GB/s	3.6X
Intel Xeon W-2135	5.01 GB/s	21.6 GB/s	4.3X
Intel Celeron N3060	1.29 GB/s	4.03 GB/s	3.1X

penzn commented 3 years ago

The version using conversion looks great, but for the semi-SIMD version would not it be faster to construct f64x2 from individual array elements (instead of first loading an array chunk into a f32x4 and then breaking it down and then recreating as f64x2)?

Maratyszcza commented 3 years ago

@penzn I tried a version with scalar loads:

double DotProd(size_t n, const float* x, const float* y) {
  assert(n != 0);
  assert(n % 4 == 0);

  v128_t vacc_lo = wasm_f64x2_const(0.0, 0.0);
  v128_t vacc_hi = wasm_f64x2_const(0.0, 0.0);
  do {
    const float vx0 = x[0];
    const float vx1 = x[1];
    const float vy0 = y[0];
    const float vy1 = y[1];

    vacc_lo = wasm_f64x2_add(vacc_lo,
      wasm_f64x2_mul(
        wasm_f64x2_make((double) vx0, (double) vx1),
        wasm_f64x2_make((double) vy0, (double) vy1)));

    const float vx2 = x[2];
    const float vx3 = x[3];
    x += 4;

    const float vy2 = y[2];
    const float vy3 = y[3];
    y += 4;

    vacc_hi = wasm_f64x2_add(vacc_hi,
      wasm_f64x2_mul(
        wasm_f64x2_make((double) vx2, (double) vx3),
        wasm_f64x2_make((double) vy2, (double) vy3)));

    n -= 4;
  } while (n != 0);
  const v128_t vacc = wasm_f64x2_add(vacc_lo, vacc_hi);
  return wasm_f64x2_extract_lane(vacc, 0) + wasm_f64x2_extract_lane(vacc, 1);
}

Performance is presented below:

Implementation	Performance, scalar loads & conversions	Performance, SIMD conversions	Speedup
AMD PRO A10-8700B	4.77 GB/s	9.65 GB/s	2.0X
AMD A4-7210	1.52 GB/s	4.94 GB/s	3.3X
Intel Xeon W-2135	6.52 GB/s	21.6 GB/s	3.3X
Intel Celeron N3060	1.31 GB/s	4.03 GB/s	3.1X

ngzhian commented 3 years ago

Some good speedups with f64x2.promote_low_f32x4. We also have a use case in OpenCV. The lowering is fantastic as well, so I am supportive of this instruction. However, this PR is a set of 6 different instructions. Some of them look very different in terms of lowering, and don't have use cases listed. Will it be useful to split this instruction out (maybe promote and demote)?

Maratyszcza commented 3 years ago

ARM64 implementation was merged in V8, and the performance benchmarks on ARM64 devices are below:

Processor (Device)	Performance, scalar loads & conversions	Performance, SIMD conversions	Speedup
Qualcomm Snapdragon 670 (Pixel 3a)	3.05 GB/s	5.34 GB/s	1.8X
Samsung Exynos 8895 (Galaxy S8)	3.38 GB/s	7.14 GB/s	2.1X

Maratyszcza commented 3 years ago

I implemented one more algorithm of generating double-precision random numbers in [-1, 1] range using XorShift random number generator. I prepared two implementations: SIMD-all-the-way (using the proposed conversion instructions) and SIMD-Xoshift-only with scalar conversions and stores. The SIMD-all-the-way version is presented below:

void GenerateUniform(
    struct rng rng[restrict static 1],
    size_t size,
    double output[restrict static 1])
{
  assert(size != 0);
  assert(size % 4 == 0);

  const v128_t offset = wasm_i32x4_const(2147483648, 2147483648, 2147483648, 2147483648);
  const v128_t scale = wasm_f64x2_const(0x1.00000002p-31, 0x1.00000002p-31);
  v128_t state = wasm_v128_load(&rng->state);
  do {
    state = wasm_v128_xor(state, wasm_i32x4_shl(state, 13));
    state = wasm_v128_xor(state, wasm_u32x4_shr(state, 17));
    state = wasm_v128_xor(state, wasm_i32x4_shl(state, 5));

    v128_t value = wasm_i32x4_sub(state, offset);
    wasm_v128_store(output, wasm_f64x2_mul(__builtin_wasm_convert_low_s_i32x4_f64x2(value), scale));
    value = wasm_v32x4_shuffle(value, value, 2, 3, 0, 1);
    wasm_v128_store(output + 2, wasm_f64x2_mul(__builtin_wasm_convert_low_s_i32x4_f64x2(value), scale));
    output += 4;

    size -= 4;
  } while (size != 0);
  wasm_v128_store(&rng->state, state);
}

The version is SIMD Xorshift only is below:

void GenerateUniform(
    struct rng rng[restrict static 1],
    size_t size,
    double output[restrict static 1])
{
  assert(size != 0);
  assert(size % 4 == 0);

  const v128_t offset = wasm_i32x4_const(2147483648, 2147483648, 2147483648, 2147483648);
  const double scale = 0x1.00000002p-31;
  v128_t state = wasm_v128_load(&rng->state);
  do {
    state = wasm_v128_xor(state, wasm_i32x4_shl(state, 13));
    state = wasm_v128_xor(state, wasm_u32x4_shr(state, 17));
    state = wasm_v128_xor(state, wasm_i32x4_shl(state, 5));

    const v128_t value = wasm_i32x4_sub(state, offset);
    output[0] = scale * (double) wasm_i32x4_extract_lane(value, 0);
    output[1] = scale * (double) wasm_i32x4_extract_lane(value, 1);
    output[2] = scale * (double) wasm_i32x4_extract_lane(value, 2);
    output[3] = scale * (double) wasm_i32x4_extract_lane(value, 3);
    output += 4;

    size -= 4;
  } while (size != 0);
  wasm_v128_store(&rng->state, state);
}

Here're the performance results:

Processor (Device)	Performance, scalar conversions & stores	Performance, SIMD-all-the-way	Speedup
AMD PRO A10-8700B	4.68 GB/s	5.76 GB/s	1.2X
AMD A4-7210	2.54 GB/s	4.77 GB/s	1.9X
Intel Xeon W-2135	10.4 GB/s	15.8 GB/s	1.5X
Intel Celeron N3060	1.41 GB/s	2.16 GB/s	1.5X
Qualcomm Snapdragon 670 (Pixel 3a)	2.54 GB/s	2.85 GB/s	1.1X
Samsung Exynos 8895 (Galaxy S8)	1.49 GB/s	3.38 GB/s	2.3X

jan-wassenberg commented 3 years ago

For later relaxed semantics of floating-point to integer, I think it would suffice to have, on x86:

truncated=cvttps
return truncated ^ BroadcastSign(AndNot(original, truncated))

(We're fixing up 0x80..00 to 0x7F..FF if the sign changed)

Maratyszcza commented 3 years ago

For relaxed semantics my preference would be to directly map to CVTTPD2DQ. In relaxed semantics the out-of-bounds behavior doesn't need to be the same across architectures.

ngzhian commented 3 years ago

At the risk of making you do more work @Maratyszcza, what do you think about splitting up promote/demote from this PR? From the meeting earlier, I got the sense that most of us feel pretty comfortable about promote/demote. We can probably get that in without much discussion.

f64x2.converti32x4{s,u} maybe should be split out as well, these require less discussion than truncsat (better lowering), and also already have use cases and benchmarks (at least the signed onces).

Having trunc_sat out on its own will let us have a more focused discussion. The risk here I guess is orthogonality. But grouping all these instructions together risk us losing some useful instructions because of objections to poor lowering of others.

Maratyszcza commented 3 years ago

@ngzhian These instructions were voted on together, so I'd rather have them merged together. f64x2.convert_i32x4_{s,u} don't really have an alternative, aside from scalarization which has the same lowering issues, and if we have these instructions in the spec we can at least fix inefficient lowerings in Fast SIMD.

Besides, I evaluated performance on this code for evaluation of log2(x) which is reused across several GitHub projects. Here's the full-SIMD version (with SIMD conversions):

void remez5_0_log2_simd(const double *inputs, double* outputs, size_t num) {
  assert(num != 0);
  assert(num % 4 == 0);

  const v128_t offset = wasm_i32x4_const(1023, 1023, 0, 0);

  do {
    v128_t k1, k2;
    v128_t p1, p2;
    v128_t a1, a2;
    v128_t xmm0, xmm1;
    v128_t x1, x2;

    /* Load four double values. */
    xmm0 = wasm_f64x2_const(7.09782712893383996843e2, 7.09782712893383996843e2);
    xmm1 = wasm_f64x2_const(-7.08396418532264106224e2, -7.08396418532264106224e2);
    x1 = wasm_v128_load(inputs);
    x2 = wasm_v128_load(inputs+2);
    inputs += 4;

    x1 = wasm_f64x2_pmin(xmm0, x1);
    x2 = wasm_f64x2_pmin(xmm0, x2);
    x1 = wasm_f64x2_pmax(xmm1, x1);
    x2 = wasm_f64x2_pmax(xmm1, x2);

    /* a = x / log2; */
    xmm0 = wasm_f64x2_const(1.4426950408889634073599, 1.4426950408889634073599);
    xmm1 = wasm_f64x2_const(0.0, 0.0);
    a1 = wasm_f64x2_mul(x1, xmm0);
    a2 = wasm_f64x2_mul(x2, xmm0);

    /* k = (int)floor(a); p = (float)k; */
    p1 = wasm_f64x2_lt(a1, xmm1);
    p2 = wasm_f64x2_lt(a2, xmm1);
    xmm0 = wasm_f64x2_const(1.0, 1.0);
    p1 = wasm_v128_and(p1, xmm0);
    p2 = wasm_v128_and(p2, xmm0);
    a1 = wasm_f64x2_sub(a1, p1);
    a2 = wasm_f64x2_sub(a2, p2);
    k1 = __builtin_wasm_trunc_saturate_zero_s_f64x2_i32x4(a1);
    k2 = __builtin_wasm_trunc_saturate_zero_s_f64x2_i32x4(a2);
    p1 = __builtin_wasm_convert_low_s_i32x4_f64x2(k1);
    p2 = __builtin_wasm_convert_low_s_i32x4_f64x2(k2);

    /* x -= p * log2; */
    xmm0 = wasm_f64x2_const(6.93145751953125E-1, 6.93145751953125E-1);
    xmm1 = wasm_f64x2_const(1.42860682030941723212E-6, 1.42860682030941723212E-6);
    a1 = wasm_f64x2_mul(p1, xmm0);
    a2 = wasm_f64x2_mul(p2, xmm0);
    x1 = wasm_f64x2_sub(x1, a1);
    x2 = wasm_f64x2_sub(x2, a2);
    a1 = wasm_f64x2_mul(p1, xmm1);
    a2 = wasm_f64x2_mul(p2, xmm1);
    x1 = wasm_f64x2_sub(x1, a1);
    x2 = wasm_f64x2_sub(x2, a2);

    /* Compute e^x using a polynomial approximation. */
    xmm0 = wasm_f64x2_const(1.185268231308989403584147407056378360798378534739e-2, 1.185268231308989403584147407056378360798378534739e-2);
    xmm1 = wasm_f64x2_const(3.87412011356070379615759057344100690905653320886699e-2, 3.87412011356070379615759057344100690905653320886699e-2);
    a1 = wasm_f64x2_mul(x1, xmm0);
    a2 = wasm_f64x2_mul(x2, xmm0);
    a1 = wasm_f64x2_add(a1, xmm1);
    a2 = wasm_f64x2_add(a2, xmm1);

    xmm0 = wasm_f64x2_const(0.16775408658617866431779970932853611481292418818223, 0.16775408658617866431779970932853611481292418818223);
    xmm1 = wasm_f64x2_const(0.49981934577169208735732248650232562589934399402426, 0.49981934577169208735732248650232562589934399402426);
    a1 = wasm_f64x2_mul(a1, x1);
    a2 = wasm_f64x2_mul(a2, x2);
    a1 = wasm_f64x2_add(a1, xmm0);
    a2 = wasm_f64x2_add(a2, xmm0);
    a1 = wasm_f64x2_mul(a1, x1);
    a2 = wasm_f64x2_mul(a2, x2);
    a1 = wasm_f64x2_add(a1, xmm1);
    a2 = wasm_f64x2_add(a2, xmm1);

    xmm0 = wasm_f64x2_const(1.00001092396453942157124178508842412412025643386873, 1.00001092396453942157124178508842412412025643386873);
    xmm1 = wasm_f64x2_const(0.99999989311082729779536722205742989232069120354073, 0.99999989311082729779536722205742989232069120354073);
    a1 = wasm_f64x2_mul(a1, x1);
    a2 = wasm_f64x2_mul(a2, x2);
    a1 = wasm_f64x2_add(a1, xmm0);
    a2 = wasm_f64x2_add(a2, xmm0);
    a1 = wasm_f64x2_mul(a1, x1);
    a2 = wasm_f64x2_mul(a2, x2);
    a1 = wasm_f64x2_add(a1, xmm1);
    a2 = wasm_f64x2_add(a2, xmm1);

    /* p = 2^k; */
    k1 = wasm_i32x4_add(k1, offset);
    k2 = wasm_i32x4_add(k2, offset);
    k1 = wasm_i32x4_shl(k1, 20);
    k2 = wasm_i32x4_shl(k2, 20);
    k1 = wasm_v32x4_shuffle(k1, k1, 1, 3, 0, 2);
    k2 = wasm_v32x4_shuffle(k2, k2, 1, 3, 0, 2);
    p1 = k1;
    p2 = k2;

    /* a *= 2^k. */
    a1 = wasm_f64x2_mul(a1, p1);
    a2 = wasm_f64x2_mul(a2, p2);

    /* Store the results. */
    wasm_v128_store(outputs, a1);
    wasm_v128_store(outputs+2, a2);
    outputs += 4;

    num -= 4;
  } while (num != 0);
}

Here's the semi-SIMD version (scalar conversions, rest is SIMD):

void remez5_0_log2_semisimd(const double *inputs, double* outputs, size_t num) {
  assert(num != 0);
  assert(num % 4 == 0);

  const v128_t offset = wasm_i32x4_const(1023, 1023, 0, 0);

  do {
    v128_t k1, k2;
    v128_t p1, p2;
    v128_t a1, a2;
    v128_t xmm0, xmm1;
    v128_t x1, x2;

    /* Load four double values. */
    xmm0 = wasm_f64x2_const(7.09782712893383996843e2, 7.09782712893383996843e2);
    xmm1 = wasm_f64x2_const(-7.08396418532264106224e2, -7.08396418532264106224e2);
    x1 = wasm_v128_load(inputs);
    x2 = wasm_v128_load(inputs+2);
    inputs += 4;

    x1 = wasm_f64x2_pmin(xmm0, x1);
    x2 = wasm_f64x2_pmin(xmm0, x2);
    x1 = wasm_f64x2_pmax(xmm1, x1);
    x2 = wasm_f64x2_pmax(xmm1, x2);

    /* a = x / log2; */
    xmm0 = wasm_f64x2_const(1.4426950408889634073599, 1.4426950408889634073599);
    xmm1 = wasm_f64x2_const(0.0, 0.0);
    a1 = wasm_f64x2_mul(x1, xmm0);
    a2 = wasm_f64x2_mul(x2, xmm0);

    /* k = (int)floor(a); p = (float)k; */
    p1 = wasm_f64x2_lt(a1, xmm1);
    p2 = wasm_f64x2_lt(a2, xmm1);
    xmm0 = wasm_f64x2_const(1.0, 1.0);
    p1 = wasm_v128_and(p1, xmm0);
    p2 = wasm_v128_and(p2, xmm0);
    a1 = wasm_f64x2_sub(a1, p1);
    a2 = wasm_f64x2_sub(a2, p2);
    const int k1_lo = (int) wasm_f64x2_extract_lane(a1, 0);
    const int k1_hi = (int) wasm_f64x2_extract_lane(a1, 1);
    const int k2_lo = (int) wasm_f64x2_extract_lane(a2, 0);
    const int k2_hi = (int) wasm_f64x2_extract_lane(a2, 1);
    k1 = wasm_i32x4_make(k1_lo, k1_hi, 0, 0);
    k2 = wasm_i32x4_make(k2_lo, k2_hi, 0, 0);
    p1 = wasm_f64x2_make((double) k1_lo, (double) k1_hi);
    p2 = wasm_f64x2_make((double) k2_lo, (double) k2_hi);

    /* x -= p * log2; */
    xmm0 = wasm_f64x2_const(6.93145751953125E-1, 6.93145751953125E-1);
    xmm1 = wasm_f64x2_const(1.42860682030941723212E-6, 1.42860682030941723212E-6);
    a1 = wasm_f64x2_mul(p1, xmm0);
    a2 = wasm_f64x2_mul(p2, xmm0);
    x1 = wasm_f64x2_sub(x1, a1);
    x2 = wasm_f64x2_sub(x2, a2);
    a1 = wasm_f64x2_mul(p1, xmm1);
    a2 = wasm_f64x2_mul(p2, xmm1);
    x1 = wasm_f64x2_sub(x1, a1);
    x2 = wasm_f64x2_sub(x2, a2);

    /* Compute e^x using a polynomial approximation. */
    xmm0 = wasm_f64x2_const(1.185268231308989403584147407056378360798378534739e-2, 1.185268231308989403584147407056378360798378534739e-2);
    xmm1 = wasm_f64x2_const(3.87412011356070379615759057344100690905653320886699e-2, 3.87412011356070379615759057344100690905653320886699e-2);
    a1 = wasm_f64x2_mul(x1, xmm0);
    a2 = wasm_f64x2_mul(x2, xmm0);
    a1 = wasm_f64x2_add(a1, xmm1);
    a2 = wasm_f64x2_add(a2, xmm1);

    xmm0 = wasm_f64x2_const(0.16775408658617866431779970932853611481292418818223, 0.16775408658617866431779970932853611481292418818223);
    xmm1 = wasm_f64x2_const(0.49981934577169208735732248650232562589934399402426, 0.49981934577169208735732248650232562589934399402426);
    a1 = wasm_f64x2_mul(a1, x1);
    a2 = wasm_f64x2_mul(a2, x2);
    a1 = wasm_f64x2_add(a1, xmm0);
    a2 = wasm_f64x2_add(a2, xmm0);
    a1 = wasm_f64x2_mul(a1, x1);
    a2 = wasm_f64x2_mul(a2, x2);
    a1 = wasm_f64x2_add(a1, xmm1);
    a2 = wasm_f64x2_add(a2, xmm1);

    xmm0 = wasm_f64x2_const(1.00001092396453942157124178508842412412025643386873, 1.00001092396453942157124178508842412412025643386873);
    xmm1 = wasm_f64x2_const(0.99999989311082729779536722205742989232069120354073, 0.99999989311082729779536722205742989232069120354073);
    a1 = wasm_f64x2_mul(a1, x1);
    a2 = wasm_f64x2_mul(a2, x2);
    a1 = wasm_f64x2_add(a1, xmm0);
    a2 = wasm_f64x2_add(a2, xmm0);
    a1 = wasm_f64x2_mul(a1, x1);
    a2 = wasm_f64x2_mul(a2, x2);
    a1 = wasm_f64x2_add(a1, xmm1);
    a2 = wasm_f64x2_add(a2, xmm1);

    /* p = 2^k; */
    k1 = wasm_i32x4_add(k1, offset);
    k2 = wasm_i32x4_add(k2, offset);
    k1 = wasm_i32x4_shl(k1, 20);
    k2 = wasm_i32x4_shl(k2, 20);
    k1 = wasm_v32x4_shuffle(k1, k1, 1, 3, 0, 2);
    k2 = wasm_v32x4_shuffle(k2, k2, 1, 3, 0, 2);
    p1 = k1;
    p2 = k2;

    /* a *= 2^k. */
    a1 = wasm_f64x2_mul(a1, p1);
    a2 = wasm_f64x2_mul(a2, p2);

    /* Store the results. */
    wasm_v128_store(outputs, a1);
    wasm_v128_store(outputs+2, a2);
    outputs += 4;

    num -= 4;
  } while (num != 0);
}

Here're the performance results:

Processor (Device)	Performance, scalar conversions	Performance, SIMD conversions	Speedup
AMD PRO A10-8700B	604 MB/s	1428 MB/s	2.4X
AMD A4-7210	76.4 MB/s	89.7 MB/s	1.2X
Intel Xeon W-2135	291 MB/s	355 MB/s	1.2X
Intel Celeron N3060	102 MB/s	130 MB/s	1.3X
Qualcomm Snapdragon 670 (Pixel 3a)	414 MB/s	640 MB/s	1.5X
Samsung Exynos 8895 (Galaxy S8)	389 GB/s	487 GB/s	2.3X

dtig commented 3 years ago

Merging as per #429 and discussion above.

jlb6740 commented 3 years ago

@Maratyszcza @ngzhian Hi .. I am lowering f64x2.convert_low_i32x4_u for sse2 in a backend. I see this conversion here:

f64x2.convert_low_i32x4_u
y = f64x2.convert_low_i32x4_u(x) is lowered to:
MOVAPS xmm_y, xmm_x
UNPCKLPS xmm_y, [wasm_i32x4_splat(0x43300000)]
SUBPD xmm_y, [wasm_f64x2_splat(0x1.0p+52)]

I am trying to understand the magic here. Do you a reference where there is more detail or have an explaination. I checked out v8's implementation here: https://source.chromium.org/chromium/chromium/src/+/master:v8/src/codegen/x64/macro-assembler-x64.cc;l=2406;drc=92a46414765f4873251cf9ecefbef130648e2af8 where it is mentioned that

 // dst = [ src_low, 0x43300000, src_high, 0x4330000 ];
 // 0x43300000'00000000 is a special double where the significand bits
 // precisely represents all uint32 numbers.

But I am not really sure what this implies as far as how this conversion works. Any comments?

Maratyszcza commented 3 years ago

@jbl 0x43300000 is the high 32 bits of 0x1.0p+52. ULP(0x1.0p+52) == 1.0, so the whole number after unpack is 0x1.0p+52 + double(x)

jlb6740 commented 3 years ago

@Maratyszcza Hi, thanks so much. I got it now! The "magic" I was missing was is in the subtract.

WebAssembly / simd

Double-precision conversion instructions #383

Introduction

Applications

Use-cases in OpenCV

Mapping to Common Instruction Sets

x86/x86-64 processors with AVX512F + AVX512VL instruction sets

x86/x86-64 processors with AVX instruction set

x86/x86-64 processors with SSE4.1 instruction set

x86/x86-64 processors with SSE2 instruction set

ARM64 processors

ARMv7 processors with NEON