game-stop closed this issue 7 years ago
start with sqrt function using SSE, e.g.:
For a single value, using the intrinsic _mm_sqrt_ps is twice as slow as the math library's sqrt(double).
Finally, _mm_sqrt_sd seems to be the way to go:
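#include <math.h>
#if defined(HAVE_ASM_SSE2)
#include <emmintrin.h>
/* Scalar double-precision square root using the SSE2 sqrtsd instruction. */
static inline double sqrt_sd(double value) {
    double ret;
    __m128d v = _mm_load_sd(&value);        /* load value into the low lane */
    _mm_store_sd(&ret, _mm_sqrt_sd(v, v));  /* sqrt of the low lane only */
    return ret;
}
#define fast_sqrt(res, x) ((res) = sqrt_sd(x))
#else
/* Fallback: plain libm sqrt when SSE2 is not available. */
#define fast_sqrt(res, x) ((res) = sqrt(x))
#endif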
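For reference, a minimal sketch of how the macro would be used from an effect's inner loop. The helper name distance2d is hypothetical, and HAVE_ASM_SSE2 is derived here from the compiler's __SSE2__ macro only so the snippet builds on its own; in the real tree the build system is assumed to provide it.

```c
#include <math.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#define HAVE_ASM_SSE2 1 /* assumption: stand-in for the build system's define */
#endif

#if defined(HAVE_ASM_SSE2)
static inline double sqrt_sd(double value) {
    double ret;
    __m128d v = _mm_load_sd(&value);
    _mm_store_sd(&ret, _mm_sqrt_sd(v, v));
    return ret;
}
#define fast_sqrt(res, x) ((res) = sqrt_sd(x))
#else
#define fast_sqrt(res, x) ((res) = sqrt(x))
#endif

/* Hypothetical helper: Euclidean length of (dx, dy), the kind of
 * per-pixel computation an FX inner loop might perform. */
static double distance2d(double dx, double dy) {
    double d;
    fast_sqrt(d, dx * dx + dy * dy);
    return d;
}
```

Both branches compute a correctly rounded IEEE square root, so switching between them should not change results, only speed.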
Added a lookup table for sqrt (color values), added _mm_sqrt_sd, removed fsqrt.
Instead of using fast_sqrt (defined in libvje/effects/common.h), create SIMD-optimized code for all FX that call fast_sqrt in the inner loop.
Start with the sqrt function using SSE, e.g.:
in fisheye_malloc(), where the polar mapping is created, use #ifdef HAVE_ASM_SSE and fill the polar_map with 4 values at each iteration.
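A rough sketch of what "4 values at each iteration" could look like. The actual contents of polar_map in fisheye_malloc() are not shown in this issue, so this assumes each entry is simply the pixel's distance from the image centre, stored as float; fill_polar_map and sqrt_ss are hypothetical names, and __SSE__ stands in for HAVE_ASM_SSE.

```c
#include <math.h>
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* Scalar float sqrt, used for the leftover pixels of each row. */
static inline float sqrt_ss(float v) {
#if defined(__SSE__)
    return _mm_cvtss_f32(_mm_sqrt_ss(_mm_set_ss(v)));
#else
    return sqrtf(v);
#endif
}

/* Assumed layout: polar_map[y * w + x] = distance of (x, y) from the centre. */
static void fill_polar_map(float *polar_map, int w, int h) {
    const float cx = 0.5f * (float)w;
    const float cy = 0.5f * (float)h;
    for (int y = 0; y < h; y++) {
        const float dy = (float)y - cy;
        float *row = polar_map + (size_t)y * (size_t)w;
        int x = 0;
#if defined(__SSE__)
        const __m128 vdy2 = _mm_set1_ps(dy * dy);
        const __m128 vcx = _mm_set1_ps(cx);
        for (; x + 4 <= w; x += 4) {
            /* _mm_set_ps takes its arguments from the highest lane down */
            __m128 vx = _mm_set_ps((float)(x + 3), (float)(x + 2),
                                   (float)(x + 1), (float)x);
            __m128 dx = _mm_sub_ps(vx, vcx);
            __m128 r = _mm_sqrt_ps(_mm_add_ps(_mm_mul_ps(dx, dx), vdy2));
            _mm_storeu_ps(row + x, r); /* four map entries per iteration */
        }
#endif
        /* Tail: widths that are not a multiple of 4, or no SSE at all. */
        for (; x < w; x++) {
            const float dx = (float)x - cx;
            row[x] = sqrt_ss(dx * dx + dy * dy);
        }
    }
}
```

This is where _mm_sqrt_ps pays off: the four lanes all carry useful work, unlike the single-value case above.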
Then, optimize the color_distance function to iterate over at least 4 CbCr values at once for all FX that use it (rgbkey, alphaselect2, etc.). Do not forget to multiply the final result by 255 before storing.
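A possible shape for the 4-wide version, as a sketch only. The real color_distance signature and normalisation are not shown here, so this assumes a plain Euclidean distance in the CbCr plane, normalised by the largest possible distance and then scaled by 255 before storing. The names color_distance_1, color_distance_4x and MAX_CBCR_DIST are hypothetical, and the rounding (add 0.5, then truncate) is a choice made for this sketch.

```c
#include <math.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Assumed normalisation: largest CbCr distance, sqrt(255^2 + 255^2). */
#define MAX_CBCR_DIST 360.62445f

static inline float sqrt_ss(float v) {
#if defined(__SSE2__)
    return _mm_cvtss_f32(_mm_sqrt_ss(_mm_set_ss(v)));
#else
    return sqrtf(v);
#endif
}

/* Scalar version, also used for the tail of the SIMD loop. */
static uint8_t color_distance_1(uint8_t cb, uint8_t cr,
                                uint8_t kcb, uint8_t kcr) {
    const float dcb = (float)cb - (float)kcb;
    const float dcr = (float)cr - (float)kcr;
    float d = sqrt_ss(dcb * dcb + dcr * dcr) * (1.0f / MAX_CBCR_DIST);
    if (d > 1.0f) d = 1.0f;
    return (uint8_t)(d * 255.0f + 0.5f); /* scale by 255 before storing */
}

/* Distance of n CbCr pairs to the key colour (kcb, kcr), 4 per iteration. */
static void color_distance_4x(const uint8_t *cb, const uint8_t *cr,
                              uint8_t kcb, uint8_t kcr,
                              uint8_t *out, int n) {
    int i = 0;
#if defined(__SSE2__)
    const __m128 vkcb = _mm_set1_ps((float)kcb);
    const __m128 vkcr = _mm_set1_ps((float)kcr);
    const __m128 vinv = _mm_set1_ps(1.0f / MAX_CBCR_DIST);
    const __m128 vone = _mm_set1_ps(1.0f);
    const __m128 v255 = _mm_set1_ps(255.0f);
    const __m128 vhalf = _mm_set1_ps(0.5f);
    for (; i + 4 <= n; i += 4) {
        __m128 vcb = _mm_set_ps((float)cb[i + 3], (float)cb[i + 2],
                                (float)cb[i + 1], (float)cb[i]);
        __m128 vcr = _mm_set_ps((float)cr[i + 3], (float)cr[i + 2],
                                (float)cr[i + 1], (float)cr[i]);
        __m128 dcb = _mm_sub_ps(vcb, vkcb);
        __m128 dcr = _mm_sub_ps(vcr, vkcr);
        __m128 d = _mm_sqrt_ps(_mm_add_ps(_mm_mul_ps(dcb, dcb),
                                          _mm_mul_ps(dcr, dcr)));
        d = _mm_min_ps(_mm_mul_ps(d, vinv), vone);    /* normalise, clamp */
        d = _mm_add_ps(_mm_mul_ps(d, v255), vhalf);   /* * 255, round */
        int32_t tmp[4];
        _mm_storeu_si128((__m128i *)tmp, _mm_cvttps_epi32(d));
        for (int k = 0; k < 4; k++)
            out[i + k] = (uint8_t)tmp[k];
    }
#endif
    for (; i < n; i++)
        out[i] = color_distance_1(cb[i], cr[i], kcb, kcr);
}
```

Loading the bytes through _mm_set_ps keeps the sketch short; a production version would more likely load and unpack the chroma bytes with integer SSE2 instructions.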