Maratyszcza closed this issue 3 years ago.
What does ARM codegen look like? Similar to ARM64? Or complicated enough that we will likely do runtime calls?
Similar to ARM64, but with additional constraints on register allocation (only the first 8 NEON registers are accessible as S FPU registers).
Do you have more specific pointers to code that will benefit from this? And also plans for benchmarking if this is prototyped? I would also like to hear from more people whether this is useful or not.
These instructions are commonly used in scientific computing, but it is hard to provide code pointers: explicit vectorization is rare in scientific computing code, which usually relies on compiler auto-vectorization of C/C++ and FORTRAN. My plan for evaluating these is to implement a XorShift RNG with mapping of the output to [0.0, 1.0]-range double-precision numbers.
@Maratyszcza

> implement XorShift RNG with mapping of output to [0.0, 1.0]-range double-precision numbers.
FYI there is code for the integer portion in https://gitlab.com/wg1/jpeg-xl/-/blob/master/lib/jxl/xorshift128plus-inl.h :)
Compiler auto-vectorization performance is somewhat nebulous, especially for cross platform performance for i64x2 operations. The last time we tried this IIRC we didn't see any benefit from autovectorizing. Perhaps one way to make a decision on many of these i64x2 operations is to evaluate a couple of sample scientific applications that might use these operations and see if the generated code would actually map to these operations.
@Maratyszcza (or anyone else) Are there any open source applications that we can evaluate in addition to the jpeg-xl link above?
I'd like to express support for this. JPEG XL promotes floats to double and the Highway math library requires f64 -> i32 conversions.
Promote and demote map to native instructions nicely. The others, especially f64x2->i32x4, look bad. We shouldn't encourage the use of these.
> f64x2->i32x4 looks bad. We shouldn't encourage the use of these.
Can you help me understand your concern? Math libraries need floatN -> intM to do bitwise ops (shifting exponent into place for LoadExp). For N=64 (double), x86 only provides M=32. This works out OK for ARM as well. double->int64 would be much more expensive on x86. Do you see a better alternative?
I'm only pointing out that the codegen for i32x4.trunc_sat_f64x2_s_zero looks bad; it doesn't map well to native instructions at all.
I don't know what LoadExp is. For math libraries, would they choose to use this instruction given that it has such poor mapping? Or would they come up with workarounds for the specific use case?
The obvious alternative is to scalarize; it would be slower, but these SIMD instructions are slow too.
i32x4.trunc_sat_f64x2_s_zero lowers to multiple instructions on x86 because it has to simulate WAsm semantics for out-of-bounds values. Scalar instructions would have to simulate it too, one lane at a time.
For the i32x4.trunc_sat_f64x2_s_zero SSE2 lowering, I think the first instruction is wrong:
XORPS xmm_tmp, xmm_tmp
CMPEQPD xmm_tmp, xmm_x
...
This is a compare with 0, should it be
MOVAPS xmm_tmp, xmm_x
CMPEQPD xmm_tmp, xmm_tmp
...
?
Any suggested lowering for SSE2 i32x4.trunc_sat_f64x2_u_zero? Translate the AVX sequence to non-AVX?
Adding a preliminary vote for the inclusion of double-precision conversion operations to the SIMD proposal below. Please vote with:

👍 For including double-precision conversion operations
👎 Against including double-precision conversion operations
I don't think using a poor SIMD sequence instead of a poor scalar sequence is really a solution, plus we have no performance data on this. I don't think this instruction is ready to be included.
@penzn I'm working on performance data. Expect it later today.
To evaluate this proposal I implemented SIMD and semi-SIMD (mostly SIMD, but using scalar conversions) versions of a mixed-precision dot product where inputs are single-precision floats, but accumulators and outputs are double-precision.
Here's the SIMD version:
double DotProd(size_t n, const float* x, const float* y) {
  assert(n != 0);
  assert(n % 4 == 0);
  v128_t vacc_lo = wasm_f64x2_const(0.0, 0.0);
  v128_t vacc_hi = wasm_f64x2_const(0.0, 0.0);
  do {
    v128_t vx = wasm_v128_load(x);
    x += 4;
    v128_t vy = wasm_v128_load(y);
    y += 4;
    vacc_lo = wasm_f64x2_add(vacc_lo,
      wasm_f64x2_mul(
        __builtin_wasm_promote_low_f32x4_f64x2(vx),
        __builtin_wasm_promote_low_f32x4_f64x2(vy)));
    vx = wasm_v32x4_shuffle(vx, vx, 2, 3, 0, 1);
    vy = wasm_v32x4_shuffle(vy, vy, 2, 3, 0, 1);
    vacc_hi = wasm_f64x2_add(vacc_hi,
      wasm_f64x2_mul(
        __builtin_wasm_promote_low_f32x4_f64x2(vx),
        __builtin_wasm_promote_low_f32x4_f64x2(vy)));
    n -= 4;
  } while (n != 0);
  const v128_t vacc = wasm_f64x2_add(vacc_lo, vacc_hi);
  return wasm_f64x2_extract_lane(vacc, 0) + wasm_f64x2_extract_lane(vacc, 1);
}
And here's the semi-SIMD version:
double DotProd(size_t n, const float* x, const float* y) {
  assert(n != 0);
  assert(n % 4 == 0);
  v128_t vacc_lo = wasm_f64x2_const(0.0, 0.0);
  v128_t vacc_hi = wasm_f64x2_const(0.0, 0.0);
  do {
    const v128_t vx = wasm_v128_load(x);
    x += 4;
    const v128_t vy = wasm_v128_load(y);
    y += 4;
    vacc_lo = wasm_f64x2_add(vacc_lo,
      wasm_f64x2_mul(
        wasm_f64x2_make(
          (double) wasm_f32x4_extract_lane(vx, 0),
          (double) wasm_f32x4_extract_lane(vx, 1)),
        wasm_f64x2_make(
          (double) wasm_f32x4_extract_lane(vy, 0),
          (double) wasm_f32x4_extract_lane(vy, 1))));
    vacc_hi = wasm_f64x2_add(vacc_hi,
      wasm_f64x2_mul(
        wasm_f64x2_make(
          (double) wasm_f32x4_extract_lane(vx, 2),
          (double) wasm_f32x4_extract_lane(vx, 3)),
        wasm_f64x2_make(
          (double) wasm_f32x4_extract_lane(vy, 2),
          (double) wasm_f32x4_extract_lane(vy, 3))));
    n -= 4;
  } while (n != 0);
  const v128_t vacc = wasm_f64x2_add(vacc_lo, vacc_hi);
  return wasm_f64x2_extract_lane(vacc, 0) + wasm_f64x2_extract_lane(vacc, 1);
}
So far V8 implements the double-precision conversion instructions only on x86-64 (ARM64 implementation is in review), and the results on x86-64 systems are presented below:
Implementation | Performance, scalar conversions | Performance, SIMD conversions | Speedup
---|---|---|---
AMD PRO A10-8700B | 4.55 GB/s | 9.65 GB/s | 2.1X
AMD A4-7210 | 1.37 GB/s | 4.94 GB/s | 3.6X
Intel Xeon W-2135 | 5.01 GB/s | 21.6 GB/s | 4.3X
Intel Celeron N3060 | 1.29 GB/s | 4.03 GB/s | 3.1X
The version using conversion looks great, but for the semi-SIMD version, wouldn't it be faster to construct the f64x2 from individual array elements (instead of first loading an array chunk into an f32x4, then breaking it down and recreating it as f64x2)?
@penzn I tried a version with scalar loads:
double DotProd(size_t n, const float* x, const float* y) {
  assert(n != 0);
  assert(n % 4 == 0);
  v128_t vacc_lo = wasm_f64x2_const(0.0, 0.0);
  v128_t vacc_hi = wasm_f64x2_const(0.0, 0.0);
  do {
    const float vx0 = x[0];
    const float vx1 = x[1];
    const float vy0 = y[0];
    const float vy1 = y[1];
    vacc_lo = wasm_f64x2_add(vacc_lo,
      wasm_f64x2_mul(
        wasm_f64x2_make((double) vx0, (double) vx1),
        wasm_f64x2_make((double) vy0, (double) vy1)));
    const float vx2 = x[2];
    const float vx3 = x[3];
    x += 4;
    const float vy2 = y[2];
    const float vy3 = y[3];
    y += 4;
    vacc_hi = wasm_f64x2_add(vacc_hi,
      wasm_f64x2_mul(
        wasm_f64x2_make((double) vx2, (double) vx3),
        wasm_f64x2_make((double) vy2, (double) vy3)));
    n -= 4;
  } while (n != 0);
  const v128_t vacc = wasm_f64x2_add(vacc_lo, vacc_hi);
  return wasm_f64x2_extract_lane(vacc, 0) + wasm_f64x2_extract_lane(vacc, 1);
}
Performance is presented below:
Implementation | Performance, scalar loads & conversions | Performance, SIMD conversions | Speedup
---|---|---|---
AMD PRO A10-8700B | 4.77 GB/s | 9.65 GB/s | 2.0X
AMD A4-7210 | 1.52 GB/s | 4.94 GB/s | 3.3X
Intel Xeon W-2135 | 6.52 GB/s | 21.6 GB/s | 3.3X
Intel Celeron N3060 | 1.31 GB/s | 4.03 GB/s | 3.1X
Some good speedups with f64x2.promote_low_f32x4. We also have a use case in OpenCV. The lowering is great as well, so I am supportive of this instruction. However, this PR is a set of 6 different instructions. Some of them look very different in terms of lowering and don't have use cases listed. Would it be useful to split some instructions out (maybe promote and demote)?
The ARM64 implementation has been merged in V8; performance benchmarks on ARM64 devices are below:
Processor (Device) | Performance, scalar loads & conversions | Performance, SIMD conversions | Speedup
---|---|---|---
Qualcomm Snapdragon 670 (Pixel 3a) | 3.05 GB/s | 5.34 GB/s | 1.8X
Samsung Exynos 8895 (Galaxy S8) | 3.38 GB/s | 7.14 GB/s | 2.1X
I implemented one more algorithm: generating double-precision random numbers in the [-1, 1] range using a XorShift random number generator. I prepared two implementations: SIMD-all-the-way (using the proposed conversion instructions) and SIMD-XorShift-only (scalar conversions and stores). The SIMD-all-the-way version is presented below:
void GenerateUniform(
    struct rng rng[restrict static 1],
    size_t size,
    double output[restrict static 1])
{
  assert(size != 0);
  assert(size % 4 == 0);
  const v128_t offset = wasm_i32x4_const(2147483648, 2147483648, 2147483648, 2147483648);
  const v128_t scale = wasm_f64x2_const(0x1.00000002p-31, 0x1.00000002p-31);
  v128_t state = wasm_v128_load(&rng->state);
  do {
    state = wasm_v128_xor(state, wasm_i32x4_shl(state, 13));
    state = wasm_v128_xor(state, wasm_u32x4_shr(state, 17));
    state = wasm_v128_xor(state, wasm_i32x4_shl(state, 5));
    v128_t value = wasm_i32x4_sub(state, offset);
    wasm_v128_store(output, wasm_f64x2_mul(__builtin_wasm_convert_low_s_i32x4_f64x2(value), scale));
    value = wasm_v32x4_shuffle(value, value, 2, 3, 0, 1);
    wasm_v128_store(output + 2, wasm_f64x2_mul(__builtin_wasm_convert_low_s_i32x4_f64x2(value), scale));
    output += 4;
    size -= 4;
  } while (size != 0);
  wasm_v128_store(&rng->state, state);
}
The SIMD-XorShift-only version is below:
void GenerateUniform(
    struct rng rng[restrict static 1],
    size_t size,
    double output[restrict static 1])
{
  assert(size != 0);
  assert(size % 4 == 0);
  const v128_t offset = wasm_i32x4_const(2147483648, 2147483648, 2147483648, 2147483648);
  const double scale = 0x1.00000002p-31;
  v128_t state = wasm_v128_load(&rng->state);
  do {
    state = wasm_v128_xor(state, wasm_i32x4_shl(state, 13));
    state = wasm_v128_xor(state, wasm_u32x4_shr(state, 17));
    state = wasm_v128_xor(state, wasm_i32x4_shl(state, 5));
    const v128_t value = wasm_i32x4_sub(state, offset);
    output[0] = scale * (double) wasm_i32x4_extract_lane(value, 0);
    output[1] = scale * (double) wasm_i32x4_extract_lane(value, 1);
    output[2] = scale * (double) wasm_i32x4_extract_lane(value, 2);
    output[3] = scale * (double) wasm_i32x4_extract_lane(value, 3);
    output += 4;
    size -= 4;
  } while (size != 0);
  wasm_v128_store(&rng->state, state);
}
Here are the performance results:
Processor (Device) | Performance, scalar conversions & stores | Performance, SIMD-all-the-way | Speedup
---|---|---|---
AMD PRO A10-8700B | 4.68 GB/s | 5.76 GB/s | 1.2X
AMD A4-7210 | 2.54 GB/s | 4.77 GB/s | 1.9X
Intel Xeon W-2135 | 10.4 GB/s | 15.8 GB/s | 1.5X
Intel Celeron N3060 | 1.41 GB/s | 2.16 GB/s | 1.5X
Qualcomm Snapdragon 670 (Pixel 3a) | 2.54 GB/s | 2.85 GB/s | 1.1X
Samsung Exynos 8895 (Galaxy S8) | 1.49 GB/s | 3.38 GB/s | 2.3X
For later relaxed semantics of floating-point to integer conversion, I think it would suffice to have, on x86:

truncated = cvttps(original)
return truncated ^ BroadcastSign(AndNot(original, truncated))

(We're fixing up 0x80..00 to 0x7F..FF if the sign changed.)
For relaxed semantics my preference would be to directly map to CVTTPD2DQ. In relaxed semantics the out-of-bounds behavior doesn't need to be the same across architectures.
At the risk of making you do more work @Maratyszcza, what do you think about splitting up promote/demote from this PR? From the meeting earlier, I got the sense that most of us feel pretty comfortable about promote/demote. We can probably get that in without much discussion.
f64x2.convert_i32x4_{s,u} maybe should be split out as well; these require less discussion than trunc_sat (better lowering), and also already have use cases and benchmarks (at least the signed ones).
Having trunc_sat out on its own will let us have a more focused discussion. The risk here, I guess, is orthogonality. But grouping all these instructions together risks losing some useful instructions because of objections to the poor lowering of others.
@ngzhian These instructions were voted on together, so I'd rather have them merged together. f64x2.convert_i32x4_{s,u} don't really have an alternative, aside from scalarization, which has the same lowering issues, and if we have these instructions in the spec we can at least fix inefficient lowerings in Fast SIMD.
Besides, I evaluated performance of this code for evaluating log2(x), which is reused across several GitHub projects. Here's the full-SIMD version (with SIMD conversions):
void remez5_0_log2_simd(const double *inputs, double* outputs, size_t num) {
  assert(num != 0);
  assert(num % 4 == 0);
  const v128_t offset = wasm_i32x4_const(1023, 1023, 0, 0);
  do {
    v128_t k1, k2;
    v128_t p1, p2;
    v128_t a1, a2;
    v128_t xmm0, xmm1;
    v128_t x1, x2;
    /* Load four double values. */
    xmm0 = wasm_f64x2_const(7.09782712893383996843e2, 7.09782712893383996843e2);
    xmm1 = wasm_f64x2_const(-7.08396418532264106224e2, -7.08396418532264106224e2);
    x1 = wasm_v128_load(inputs);
    x2 = wasm_v128_load(inputs+2);
    inputs += 4;
    x1 = wasm_f64x2_pmin(xmm0, x1);
    x2 = wasm_f64x2_pmin(xmm0, x2);
    x1 = wasm_f64x2_pmax(xmm1, x1);
    x2 = wasm_f64x2_pmax(xmm1, x2);
    /* a = x / log2; */
    xmm0 = wasm_f64x2_const(1.4426950408889634073599, 1.4426950408889634073599);
    xmm1 = wasm_f64x2_const(0.0, 0.0);
    a1 = wasm_f64x2_mul(x1, xmm0);
    a2 = wasm_f64x2_mul(x2, xmm0);
    /* k = (int)floor(a); p = (float)k; */
    p1 = wasm_f64x2_lt(a1, xmm1);
    p2 = wasm_f64x2_lt(a2, xmm1);
    xmm0 = wasm_f64x2_const(1.0, 1.0);
    p1 = wasm_v128_and(p1, xmm0);
    p2 = wasm_v128_and(p2, xmm0);
    a1 = wasm_f64x2_sub(a1, p1);
    a2 = wasm_f64x2_sub(a2, p2);
    k1 = __builtin_wasm_trunc_saturate_zero_s_f64x2_i32x4(a1);
    k2 = __builtin_wasm_trunc_saturate_zero_s_f64x2_i32x4(a2);
    p1 = __builtin_wasm_convert_low_s_i32x4_f64x2(k1);
    p2 = __builtin_wasm_convert_low_s_i32x4_f64x2(k2);
    /* x -= p * log2; */
    xmm0 = wasm_f64x2_const(6.93145751953125E-1, 6.93145751953125E-1);
    xmm1 = wasm_f64x2_const(1.42860682030941723212E-6, 1.42860682030941723212E-6);
    a1 = wasm_f64x2_mul(p1, xmm0);
    a2 = wasm_f64x2_mul(p2, xmm0);
    x1 = wasm_f64x2_sub(x1, a1);
    x2 = wasm_f64x2_sub(x2, a2);
    a1 = wasm_f64x2_mul(p1, xmm1);
    a2 = wasm_f64x2_mul(p2, xmm1);
    x1 = wasm_f64x2_sub(x1, a1);
    x2 = wasm_f64x2_sub(x2, a2);
    /* Compute e^x using a polynomial approximation. */
    xmm0 = wasm_f64x2_const(1.185268231308989403584147407056378360798378534739e-2, 1.185268231308989403584147407056378360798378534739e-2);
    xmm1 = wasm_f64x2_const(3.87412011356070379615759057344100690905653320886699e-2, 3.87412011356070379615759057344100690905653320886699e-2);
    a1 = wasm_f64x2_mul(x1, xmm0);
    a2 = wasm_f64x2_mul(x2, xmm0);
    a1 = wasm_f64x2_add(a1, xmm1);
    a2 = wasm_f64x2_add(a2, xmm1);
    xmm0 = wasm_f64x2_const(0.16775408658617866431779970932853611481292418818223, 0.16775408658617866431779970932853611481292418818223);
    xmm1 = wasm_f64x2_const(0.49981934577169208735732248650232562589934399402426, 0.49981934577169208735732248650232562589934399402426);
    a1 = wasm_f64x2_mul(a1, x1);
    a2 = wasm_f64x2_mul(a2, x2);
    a1 = wasm_f64x2_add(a1, xmm0);
    a2 = wasm_f64x2_add(a2, xmm0);
    a1 = wasm_f64x2_mul(a1, x1);
    a2 = wasm_f64x2_mul(a2, x2);
    a1 = wasm_f64x2_add(a1, xmm1);
    a2 = wasm_f64x2_add(a2, xmm1);
    xmm0 = wasm_f64x2_const(1.00001092396453942157124178508842412412025643386873, 1.00001092396453942157124178508842412412025643386873);
    xmm1 = wasm_f64x2_const(0.99999989311082729779536722205742989232069120354073, 0.99999989311082729779536722205742989232069120354073);
    a1 = wasm_f64x2_mul(a1, x1);
    a2 = wasm_f64x2_mul(a2, x2);
    a1 = wasm_f64x2_add(a1, xmm0);
    a2 = wasm_f64x2_add(a2, xmm0);
    a1 = wasm_f64x2_mul(a1, x1);
    a2 = wasm_f64x2_mul(a2, x2);
    a1 = wasm_f64x2_add(a1, xmm1);
    a2 = wasm_f64x2_add(a2, xmm1);
    /* p = 2^k; */
    k1 = wasm_i32x4_add(k1, offset);
    k2 = wasm_i32x4_add(k2, offset);
    k1 = wasm_i32x4_shl(k1, 20);
    k2 = wasm_i32x4_shl(k2, 20);
    k1 = wasm_v32x4_shuffle(k1, k1, 1, 3, 0, 2);
    k2 = wasm_v32x4_shuffle(k2, k2, 1, 3, 0, 2);
    p1 = k1;
    p2 = k2;
    /* a *= 2^k. */
    a1 = wasm_f64x2_mul(a1, p1);
    a2 = wasm_f64x2_mul(a2, p2);
    /* Store the results. */
    wasm_v128_store(outputs, a1);
    wasm_v128_store(outputs+2, a2);
    outputs += 4;
    num -= 4;
  } while (num != 0);
}
Here's the semi-SIMD version (scalar conversions, rest is SIMD):
void remez5_0_log2_semisimd(const double *inputs, double* outputs, size_t num) {
  assert(num != 0);
  assert(num % 4 == 0);
  const v128_t offset = wasm_i32x4_const(1023, 1023, 0, 0);
  do {
    v128_t k1, k2;
    v128_t p1, p2;
    v128_t a1, a2;
    v128_t xmm0, xmm1;
    v128_t x1, x2;
    /* Load four double values. */
    xmm0 = wasm_f64x2_const(7.09782712893383996843e2, 7.09782712893383996843e2);
    xmm1 = wasm_f64x2_const(-7.08396418532264106224e2, -7.08396418532264106224e2);
    x1 = wasm_v128_load(inputs);
    x2 = wasm_v128_load(inputs+2);
    inputs += 4;
    x1 = wasm_f64x2_pmin(xmm0, x1);
    x2 = wasm_f64x2_pmin(xmm0, x2);
    x1 = wasm_f64x2_pmax(xmm1, x1);
    x2 = wasm_f64x2_pmax(xmm1, x2);
    /* a = x / log2; */
    xmm0 = wasm_f64x2_const(1.4426950408889634073599, 1.4426950408889634073599);
    xmm1 = wasm_f64x2_const(0.0, 0.0);
    a1 = wasm_f64x2_mul(x1, xmm0);
    a2 = wasm_f64x2_mul(x2, xmm0);
    /* k = (int)floor(a); p = (float)k; */
    p1 = wasm_f64x2_lt(a1, xmm1);
    p2 = wasm_f64x2_lt(a2, xmm1);
    xmm0 = wasm_f64x2_const(1.0, 1.0);
    p1 = wasm_v128_and(p1, xmm0);
    p2 = wasm_v128_and(p2, xmm0);
    a1 = wasm_f64x2_sub(a1, p1);
    a2 = wasm_f64x2_sub(a2, p2);
    const int k1_lo = (int) wasm_f64x2_extract_lane(a1, 0);
    const int k1_hi = (int) wasm_f64x2_extract_lane(a1, 1);
    const int k2_lo = (int) wasm_f64x2_extract_lane(a2, 0);
    const int k2_hi = (int) wasm_f64x2_extract_lane(a2, 1);
    k1 = wasm_i32x4_make(k1_lo, k1_hi, 0, 0);
    k2 = wasm_i32x4_make(k2_lo, k2_hi, 0, 0);
    p1 = wasm_f64x2_make((double) k1_lo, (double) k1_hi);
    p2 = wasm_f64x2_make((double) k2_lo, (double) k2_hi);
    /* x -= p * log2; */
    xmm0 = wasm_f64x2_const(6.93145751953125E-1, 6.93145751953125E-1);
    xmm1 = wasm_f64x2_const(1.42860682030941723212E-6, 1.42860682030941723212E-6);
    a1 = wasm_f64x2_mul(p1, xmm0);
    a2 = wasm_f64x2_mul(p2, xmm0);
    x1 = wasm_f64x2_sub(x1, a1);
    x2 = wasm_f64x2_sub(x2, a2);
    a1 = wasm_f64x2_mul(p1, xmm1);
    a2 = wasm_f64x2_mul(p2, xmm1);
    x1 = wasm_f64x2_sub(x1, a1);
    x2 = wasm_f64x2_sub(x2, a2);
    /* Compute e^x using a polynomial approximation. */
    xmm0 = wasm_f64x2_const(1.185268231308989403584147407056378360798378534739e-2, 1.185268231308989403584147407056378360798378534739e-2);
    xmm1 = wasm_f64x2_const(3.87412011356070379615759057344100690905653320886699e-2, 3.87412011356070379615759057344100690905653320886699e-2);
    a1 = wasm_f64x2_mul(x1, xmm0);
    a2 = wasm_f64x2_mul(x2, xmm0);
    a1 = wasm_f64x2_add(a1, xmm1);
    a2 = wasm_f64x2_add(a2, xmm1);
    xmm0 = wasm_f64x2_const(0.16775408658617866431779970932853611481292418818223, 0.16775408658617866431779970932853611481292418818223);
    xmm1 = wasm_f64x2_const(0.49981934577169208735732248650232562589934399402426, 0.49981934577169208735732248650232562589934399402426);
    a1 = wasm_f64x2_mul(a1, x1);
    a2 = wasm_f64x2_mul(a2, x2);
    a1 = wasm_f64x2_add(a1, xmm0);
    a2 = wasm_f64x2_add(a2, xmm0);
    a1 = wasm_f64x2_mul(a1, x1);
    a2 = wasm_f64x2_mul(a2, x2);
    a1 = wasm_f64x2_add(a1, xmm1);
    a2 = wasm_f64x2_add(a2, xmm1);
    xmm0 = wasm_f64x2_const(1.00001092396453942157124178508842412412025643386873, 1.00001092396453942157124178508842412412025643386873);
    xmm1 = wasm_f64x2_const(0.99999989311082729779536722205742989232069120354073, 0.99999989311082729779536722205742989232069120354073);
    a1 = wasm_f64x2_mul(a1, x1);
    a2 = wasm_f64x2_mul(a2, x2);
    a1 = wasm_f64x2_add(a1, xmm0);
    a2 = wasm_f64x2_add(a2, xmm0);
    a1 = wasm_f64x2_mul(a1, x1);
    a2 = wasm_f64x2_mul(a2, x2);
    a1 = wasm_f64x2_add(a1, xmm1);
    a2 = wasm_f64x2_add(a2, xmm1);
    /* p = 2^k; */
    k1 = wasm_i32x4_add(k1, offset);
    k2 = wasm_i32x4_add(k2, offset);
    k1 = wasm_i32x4_shl(k1, 20);
    k2 = wasm_i32x4_shl(k2, 20);
    k1 = wasm_v32x4_shuffle(k1, k1, 1, 3, 0, 2);
    k2 = wasm_v32x4_shuffle(k2, k2, 1, 3, 0, 2);
    p1 = k1;
    p2 = k2;
    /* a *= 2^k. */
    a1 = wasm_f64x2_mul(a1, p1);
    a2 = wasm_f64x2_mul(a2, p2);
    /* Store the results. */
    wasm_v128_store(outputs, a1);
    wasm_v128_store(outputs+2, a2);
    outputs += 4;
    num -= 4;
  } while (num != 0);
}
Here are the performance results:
Processor (Device) | Performance, scalar conversions | Performance, SIMD conversions | Speedup
---|---|---|---
AMD PRO A10-8700B | 604 MB/s | 1428 MB/s | 2.4X
AMD A4-7210 | 76.4 MB/s | 89.7 MB/s | 1.2X
Intel Xeon W-2135 | 291 MB/s | 355 MB/s | 1.2X
Intel Celeron N3060 | 102 MB/s | 130 MB/s | 1.3X
Qualcomm Snapdragon 670 (Pixel 3a) | 414 MB/s | 640 MB/s | 1.5X
Samsung Exynos 8895 (Galaxy S8) | 389 MB/s | 487 MB/s | 1.3X
Merging as per #429 and discussion above.
@Maratyszcza @ngzhian Hi, I am lowering f64x2.convert_low_i32x4_u for SSE2 in a backend. I see this conversion here:
f64x2.convert_low_i32x4_u
y = f64x2.convert_low_i32x4_u(x) is lowered to:
MOVAPS xmm_y, xmm_x
UNPCKLPS xmm_y, [wasm_i32x4_splat(0x43300000)]
SUBPD xmm_y, [wasm_f64x2_splat(0x1.0p+52)]
I am trying to understand the magic here. Do you have a reference with more detail, or an explanation? I checked out V8's implementation here: https://source.chromium.org/chromium/chromium/src/+/master:v8/src/codegen/x64/macro-assembler-x64.cc;l=2406;drc=92a46414765f4873251cf9ecefbef130648e2af8 where it is mentioned that
// dst = [ src_low, 0x43300000, src_high, 0x43300000 ];
// 0x43300000'00000000 is a special double where the significand bits
// precisely represents all uint32 numbers.
But I am not really sure what this implies as far as how this conversion works. Any comments?
@jbl 0x43300000 is the high 32 bits of 0x1.0p+52. ULP(0x1.0p+52) == 1.0, so the whole number after unpack is 0x1.0p+52 + double(x).
@Maratyszcza Hi, thanks so much. I got it now! The "magic" I was missing is in the subtract.
Introduction
As discussed in #348, the WebAssembly SIMD specification is functionally incomplete w.r.t. double-precision computations, as it doesn't cover many basic C/C++ operations, particularly conversions between the double type and other types. Currently any code that uses these operations would be de-vectorized (see examples in the emmintrin.h header in Emscripten). Double-precision computations are extensively used in scientific computing, and providing a complete set of double-precision operations is a must to bring scientific computing applications to the Web. In the foreseeable future WebAssembly SIMD will be the only avenue for high-performance scientific computing on the Web, as neither WebGL nor WebGPU supports double-precision computations. This PR introduces the minimal set of conversion operations to fill this gap.
Applications
Use-cases in OpenCV
f32x4.demote_f64x2_zero
f64x2.promote_low_f32x4
f64x2.convert_low_i32x4_s
Mapping to Common Instruction Sets
This section illustrates how the new WebAssembly instructions can be lowered on common instruction sets. However, these patterns are provided only for convenience; compliant WebAssembly implementations do not have to follow the same code generation patterns.
x86/x86-64 processors with AVX512F + AVX512VL instruction sets

y = f64x2.convert_low_i32x4_u(x) is lowered to:
VCVTUDQ2PD xmm_y, xmm_x

y = i32x4.trunc_sat_f64x2_u_zero(x) (y is NOT x) is lowered to:
VXORPD xmm_y, xmm_y, xmm_y
VMAXPD xmm_y, xmm_x, xmm_y
VCVTTPD2UDQ xmm_y, xmm_y

x86/x86-64 processors with AVX instruction set

y = f64x2.convert_low_i32x4_s(x) is lowered to:
VCVTDQ2PD xmm_y, xmm_x

y = f64x2.convert_low_i32x4_u(x) is lowered to:
VUNPCKLPS xmm_y, xmm_x, [wasm_i32x4_splat(0x43300000)]
VSUBPD xmm_y, xmm_y, [wasm_f64x2_splat(0x1.0p+52)]

y = i32x4.trunc_sat_f64x2_s_zero(x) (y is NOT x) is lowered to:
VCMPEQPD xmm_y, xmm_x, xmm_x
VANDPD xmm_y, [wasm_f64x2_splat(2147483647.0)]
VMINPD xmm_y, xmm_x, xmm_y
VCVTTPD2DQ xmm_y, xmm_y

y = i32x4.trunc_sat_f64x2_u_zero(x) is lowered to:
VXORPD xmm_tmp, xmm_tmp, xmm_tmp
VMAXPD xmm_y, xmm_x, xmm_tmp
VMINPD xmm_y, xmm_y, [wasm_f64x2_splat(4294967295.0)]
VROUNDPD xmm_y, xmm_y, 0x0B
VADDPD xmm_y, xmm_y, [wasm_f64x2_splat(0x1.0p+52)]
VSHUFPS xmm_y, xmm_y, xmm_tmp, 0x88

y = f32x4.demote_f64x2_zero(x) is lowered to:
VCVTPD2PS xmm_y, xmm_x

y = f64x2.promote_low_f32x4(x) is lowered to:
VCVTPS2PD xmm_y, xmm_x

x86/x86-64 processors with SSE4.1 instruction set

y = i32x4.trunc_sat_f64x2_u_zero(x) is lowered to:
MOVAPD xmm_y, xmm_x
XORPD xmm_tmp, xmm_tmp
MAXPD xmm_y, xmm_tmp
MINPD xmm_y, [wasm_f64x2_splat(4294967295.0)]
ROUNDPD xmm_y, xmm_y, 0x0B
ADDPD xmm_y, [wasm_f64x2_splat(0x1.0p+52)]
SHUFPS xmm_y, xmm_tmp, 0x88

x86/x86-64 processors with SSE2 instruction set

y = f64x2.convert_low_i32x4_s(x) is lowered to:
CVTDQ2PD xmm_y, xmm_x

y = f64x2.convert_low_i32x4_u(x) is lowered to:
MOVAPS xmm_y, xmm_x
UNPCKLPS xmm_y, [wasm_i32x4_splat(0x43300000)]
SUBPD xmm_y, [wasm_f64x2_splat(0x1.0p+52)]

y = i32x4.trunc_sat_f64x2_s_zero(x) is lowered to:
XORPS xmm_tmp, xmm_tmp
CMPEQPD xmm_tmp, xmm_x
MOVAPS xmm_y, xmm_x
ANDPS xmm_tmp, [wasm_f64x2_splat(2147483647.0)]
MINPD xmm_y, xmm_tmp
CVTTPD2DQ xmm_y, xmm_y

y = f32x4.demote_f64x2_zero(x) is lowered to:
CVTPD2PS xmm_y, xmm_x

y = f64x2.promote_low_f32x4(x) is lowered to:
CVTPS2PD xmm_y, xmm_x

ARM64 processors

y = f64x2.convert_low_i32x4_s(x) is lowered to:
SSHLL Vy.2D, Vx.2S, 0
SCVTF Vy.2D, Vy.2D

y = f64x2.convert_low_i32x4_u(x) is lowered to:
USHLL Vy.2D, Vx.2S, 0
UCVTF Vy.2D, Vy.2D

y = i32x4.trunc_sat_f64x2_s_zero(x) is lowered to:
FCVTZS Vy.2D, Vx.2D
SQXTN Vy.2S, Vy.2D

y = i32x4.trunc_sat_f64x2_u_zero(x) is lowered to:
FCVTZU Vy.2D, Vx.2D
UQXTN Vy.2S, Vy.2D

y = f32x4.demote_f64x2_zero(x) is lowered to:
FCVTN Vy.2S, Vx.2D

y = f64x2.promote_low_f32x4(x) is lowered to:
FCVTL Vy.2D, Vx.2S

ARMv7 processors with NEON

y = f64x2.convert_low_i32x4_s(x) (x in q0-q7) is lowered to:
VCVT.F64.S32 Dy_lo, Sx_ll
VCVT.F64.S32 Dy_hi, Sx_lh

y = f64x2.convert_low_i32x4_u(x) (x in q0-q7) is lowered to:
VCVT.F64.U32 Dy_lo, Sx_ll
VCVT.F64.U32 Dy_hi, Sx_lh

y = i32x4.trunc_sat_f64x2_s_zero(x) (y in q0-q7) is lowered to:
VCVT.S32.F64 Sy_ll, Dx_lo
VCVT.S32.F64 Sy_lh, Dx_hi
VMOV.I32 Dy_hi, 0

y = i32x4.trunc_sat_f64x2_u_zero(x) (y in q0-q7) is lowered to:
VCVT.U32.F64 Sy_ll, Dx_lo
VCVT.U32.F64 Sy_lh, Dx_hi
VMOV.I32 Dy_hi, 0

y = f32x4.demote_f64x2_zero(x) is lowered to:
VCVT.F32.F64 Sy_ll, Dx_lo
VCVT.F32.F64 Sy_lh, Dx_hi
VMOV.I32 Dy_hi, 0

y = f64x2.promote_low_f32x4(x) is lowered to:
VCVT.F64.F32 Dy_lo, Sx_ll
VCVT.F64.F32 Dy_hi, Sx_lh