sse optimizations

game-stop commented 8 years ago

instead of using fast_sqrt (defined in libvje/effects/common.h), create SIMD optimized code for all FX that call fast_sqrt in the inner loop

start with a sqrt function using SSE, e.g.:

#include <xmmintrin.h>
float tmpvec[4] = { 0 };
__m128 in  = _mm_load1_ps( &input );
__m128 res = _mm_sqrt_ps( in );
_mm_storeu_ps( tmpvec, res ); /* unaligned store: tmpvec is not guaranteed to be 16-byte aligned */
output = tmpvec[0];

in fisheye_malloc(), where the polar mapping is created, use #ifdef HAVE_ASM_SSE and fill polar_map with 4 values per iteration

then, optimize the color_distance function to process at least 4 CbCr values at once for all FX that use it (rgbkey, alphaselect2, etc.); do not forget to multiply the final result by 255 before storing

/*
 * originally from http://gc-films.com/chromakey.html
 */
static inline double color_distance( uint8_t Cb, uint8_t Cr, int Cbk, int Crk, double dA, double dB )
{
        double tmp = 0.0; 
        fast_sqrt( tmp, (Cbk - Cb) * (Cbk-Cb) + (Crk - Cr) * (Crk - Cr) );

        if( tmp < dA ) { /* near color key == bg */
            return 0.0; /* near */
        }
        if( tmp < dB ) { /* middle region */
            return (tmp - dA)/(dB - dA); /* distance to key color */
        }
        return 1.0; /* far from color key == fg */
}
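A rough sketch of what a 4-wide version could look like, assuming single precision is acceptable and dB > dA. color_distance4() and its out parameter are hypothetical names, not existing veejay code. The two branches of the scalar version collapse into a clamp to [0,1], and the result is scaled by 255 before storing.

```c
#include <math.h>
#include <stdint.h>

#if defined(HAVE_ASM_SSE) || defined(__SSE__)
#include <xmmintrin.h>
#define USE_SSE_DIST 1
#endif

/* Hypothetical 4-wide color_distance(): writes 0..255 into out[0..3].
 * Assumes dB > dA. */
static inline void color_distance4(const uint8_t *Cb, const uint8_t *Cr,
                                   int Cbk, int Crk, float dA, float dB,
                                   uint8_t *out)
{
#ifdef USE_SSE_DIST
    /* widen four chroma samples to float lanes */
    __m128 cb = _mm_setr_ps(Cb[0], Cb[1], Cb[2], Cb[3]);
    __m128 cr = _mm_setr_ps(Cr[0], Cr[1], Cr[2], Cr[3]);
    __m128 db = _mm_sub_ps(_mm_set1_ps((float)Cbk), cb);
    __m128 dr = _mm_sub_ps(_mm_set1_ps((float)Crk), cr);
    __m128 dist = _mm_sqrt_ps(_mm_add_ps(_mm_mul_ps(db, db),
                                         _mm_mul_ps(dr, dr)));

    /* (dist - dA) / (dB - dA), clamped to [0,1]: near -> 0, far -> 1 */
    __m128 t = _mm_div_ps(_mm_sub_ps(dist, _mm_set1_ps(dA)),
                          _mm_set1_ps(dB - dA));
    t = _mm_min_ps(_mm_max_ps(t, _mm_setzero_ps()), _mm_set1_ps(1.0f));

    /* scale to 0..255 before storing */
    float tmp[4];
    _mm_storeu_ps(tmp, _mm_mul_ps(t, _mm_set1_ps(255.0f)));
    for (int i = 0; i < 4; i++)
        out[i] = (uint8_t)tmp[i];
#else
    /* scalar fallback with the same clamp */
    for (int i = 0; i < 4; i++) {
        float db = (float)(Cbk - Cb[i]), dr = (float)(Crk - Cr[i]);
        float t = (sqrtf(db * db + dr * dr) - dA) / (dB - dA);
        t = t < 0.0f ? 0.0f : (t > 1.0f ? 1.0f : t);
        out[i] = (uint8_t)(t * 255.0f);
    }
#endif
}
```

A caller would walk the Cb and Cr planes in steps of 4 and handle any leftover pixels with the scalar color_distance().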

d-j-a-y commented 7 years ago

> start with sqrt function using SSE,e.g:

For a single value, using the intrinsic _mm_sqrt_ps is twice as slow as the math library's sqrt(double).

In the end, _mm_sqrt_sd seems to be the way to go:

#include <math.h>
#if defined(HAVE_ASM_SSE2)
#include <emmintrin.h>
/* scalar double-precision square root via the SSE2 sqrtsd instruction */
static inline double sqrt_sd(double value) {
    double ret;
    __m128d v = _mm_load_sd(&value);
    _mm_store_sd(&ret, _mm_sqrt_sd(v, v));
    return ret;
}
#define fast_sqrt( res,x ) (res) = sqrt_sd(x)
#else
#define fast_sqrt( res,x ) (res) = sqrt(x)
#endif

game-stop commented 7 years ago

added a lookup table for sqrt (color values), added _mm_sqrt_sd, removed fsqrt

game-stop / veejay

sse optimizations #113