WebAssembly / simd

Branch of the spec repo scoped to discussion of SIMD in WebAssembly

Constant Pool Optimization on X64 and Related Benchmarks #498

Open omnisip opened 3 years ago

omnisip commented 3 years ago

Hi everyone,

@ngzhian has encouraged me to share the discussion we're having on building an x64 constant pool and related optimizations in V8, since it might be helpful to other people implementing compilers. Specifically, we're putting together a design document that shows how a constant pool will improve the performance of select operations -- the first being shuffle. However, it seems unlikely that shuffle would be the only operation to benefit. There are probably a handful of other cases where a constant intermediate is reused more than once -- for instance, calling the extended multiply i32x4->i64x2 in a loop.

If you have specific benchmarks or use cases that might be well served by a constant pool and/or constant optimization, please post them here. I would love to include them, along with your input, in the design proposal.
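As a sketch of the extended-multiply case mentioned above (the function name and scalar form are mine, not V8's), here is i32x4->i64x2 extended multiply written out in plain C. A SIMD lowering of this loop on x64 typically needs lane-rearranging constants, which a constant pool would let the compiler load once instead of rebuilding every iteration:

```c
#include <stdint.h>
#include <stddef.h>

// Hypothetical scalar model of an i32x4 -> i64x2 extended-multiply loop:
// each iteration widens two 32-bit inputs and multiplies them into a
// 64-bit result.
void extmul_i32_to_i64(const int32_t *a, const int32_t *b,
                       int64_t *out, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        out[i] = (int64_t)a[i] * (int64_t)b[i];
    }
}
```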

lemaitre commented 3 years ago

One function where I needed constants is the fast reciprocal square root.

I basically use the "fast inverse square root" from Quake, extended to double precision and with faster convergence. This function requires 4 constants.

#include <immintrin.h>  // AVX2 + FMA intrinsics

__m256d rsqrt(__m256d x) {
  const __m256i vmagic = _mm256_set1_epi64x(0x5fe6eb50c7b537a9ull);
  const __m256d v35over16 = _mm256_set1_pd(35./16.);
  const __m256d v21over16 = _mm256_set1_pd(21./16.);
  const __m256d v5over16  = _mm256_set1_pd(5./16.);

  __m256d a, y;
  // Initial estimate: the bit-level trick, extended to double precision
  y = _mm256_castsi256_pd(_mm256_sub_epi64(vmagic, _mm256_srli_epi64(_mm256_castpd_si256(x), 1)));

  // Householder's iterations (quartic convergence):
  // y *= 35/16 - a*(35/16 - a*(21/16 - a*5/16)), with a = x*y*y
  a = _mm256_mul_pd(_mm256_mul_pd(y, y), x);
  y = _mm256_mul_pd(_mm256_fnmadd_pd(_mm256_fnmadd_pd(_mm256_fnmadd_pd(a, v5over16, v21over16), a, v35over16), a, v35over16), y);
  a = _mm256_mul_pd(_mm256_mul_pd(y, y), x);
  y = _mm256_mul_pd(_mm256_fnmadd_pd(_mm256_fnmadd_pd(_mm256_fnmadd_pd(a, v5over16, v21over16), a, v35over16), a, v35over16), y);

  return y;
}

If you are interested in more details, you can look at my paper in JSA.
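For readers without AVX at hand, here is a scalar sketch of the same scheme (my transcription, using the same magic constant and the same degree-3 Householder polynomial), which makes the four constants easy to check:

```c
#include <stdint.h>
#include <string.h>

// Scalar model of the vector routine above (hypothetical name):
// bit-trick initial estimate plus two Householder iterations.
double rsqrt_scalar(double x) {
    const uint64_t magic = 0x5fe6eb50c7b537a9ull;
    const double c35 = 35.0 / 16.0, c21 = 21.0 / 16.0, c5 = 5.0 / 16.0;

    // Initial estimate: the bit-level trick, extended to double precision.
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);
    bits = magic - (bits >> 1);
    double y;
    memcpy(&y, &bits, sizeof y);

    // Two Householder iterations (quartic convergence per step).
    for (int i = 0; i < 2; ++i) {
        double a = y * y * x;
        y *= c35 - a * (c35 - a * (c21 - a * c5));
    }
    return y;
}
```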

omnisip commented 3 years ago

@lemaitre This could be a really good use case. Do you have a pre-built benchmark we can use?

I'd love to see the wasm code generation and the native assembly generation.

lemaitre commented 3 years ago

The benchmark I had at the time is very large, but this was part of a Cholesky factorization kernel on tiny matrices (like 3x3 or 5x5). In fact, to evaluate the performance of this code, you don't even need actual data, as there are no branches and all the instructions used are constant-time.

If you are interested, I could generate a tiny benchmark for this code later. But for now, I'm a bit busy.

omnisip commented 3 years ago

Possibly. The goal here is to find cases where there is a performance penalty for not being able to reuse constants. In my case, there is no efficient way to build the constant for the shuffle mask, so it has to be manipulated and regenerated repeatedly. For something like extended multiply, it could be a loop multiplying by a constant value that has to be shuffled for architectural reasons.
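To make the penalty concrete, here is an illustrative SSE2 sketch (not V8's actual lowering): clearing the sign bit of each double requires a 128-bit mask constant that x64 cannot encode as an immediate. Without a constant pool, a JIT that cannot reference memory constants must materialize the mask with several instructions at every compiled use; with one, it becomes a single RIP-relative load shared by all uses.

```c
#include <emmintrin.h>  // SSE2, baseline on x86-64
#include <stddef.h>

// Vector fabs over pairs of doubles: AND with a sign-clearing mask.
// The mask is exactly the kind of wide constant a pool would hoist.
void vec_fabs(const double *src, double *dst, size_t n2) {
    const __m128d mask =
        _mm_castsi128_pd(_mm_set1_epi64x(0x7fffffffffffffffLL));
    for (size_t i = 0; i < n2; ++i) {
        __m128d v = _mm_loadu_pd(src + 2 * i);
        _mm_storeu_pd(dst + 2 * i, _mm_and_pd(v, mask));
    }
}
```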