alecazam commented 4 years ago

I turned off SSE optimizations in the encoder, but the SSE implementations are very poor substitutes for the high-precision calls. The rcp/rsqrt results have to be refined at least twice with Newton-Raphson iteration and some mads to match the equivalent system call. At least that's what I had to do with the float4 calls. The simd calls are fast (around 3-4 SSE ops), but care must be taken: the raw estimate is only good to about 12 bits, and it is the iteration that recovers the rest. These are also typically best done on a full float4 operator.

Also consider porting these to Neon with SSE2Neon.h. The Neon versions of these estimates are also very low precision and need similar iteration.
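For reference, the Newton-Raphson refinement described above is usually written as follows. This is the textbook form, not necessarily what any particular library does. Here $a$ is the input, $x_n$ approximates $1/a$, and $y_n$ approximates $1/\sqrt{a}$; these symbols are only for illustration.

$$
x_{n+1} = x_n\,(2 - a\,x_n), \qquad y_{n+1} = \tfrac{1}{2}\,y_n\,(3 - a\,y_n^2)
$$

Writing the relative errors as $e_n = 1 - a\,x_n$ and $y_n = (1 + \delta_n)/\sqrt{a}$, one step gives

$$
e_{n+1} = e_n^2, \qquad \delta_{n+1} = -\tfrac{3}{2}\,\delta_n^2 - \tfrac{1}{2}\,\delta_n^3
$$

so the number of correct bits roughly doubles per step until float rounding takes over. Each step is a couple of multiplies and a subtract, which is where the "mads" come from.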

static inline float rsqrt(float val) {
#if (ASTCENC_SSE >= 20) && USE_SCALAR_SSE
    // FIXME: setting val = 99 causes a crash, which it really shouldn't.
    return _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(val)));
#else
    return 1.0f / std::sqrt(val);
#endif
}

static inline float recip(float val) {
#if (ASTCENC_SSE >= 20) && USE_SCALAR_SSE
    return _mm_cvtss_f32(_mm_rcp_ss(_mm_set_ss(val)));
#else
    return 1.0f / val;
#endif
}
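As a rough illustration of the refinement being asked for, here is a minimal scalar sketch. The names `rsqrt_estimate`, `recip_estimate`, `rsqrt_refined`, `recip_refined`, `rel_error` and the `SKETCH_HAVE_SSE` macro are hypothetical and not part of astc-encoder. The step count defaults to two to follow the comment above. On targets without SSE the sketch falls back to the exact scalar computation.

```cpp
#include <cmath>

// Detect SSE so the sketch still builds on non-x86 targets (assumption:
// the compiler defines one of these macros when SSE is available).
#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
    #include <xmmintrin.h>
    #define SKETCH_HAVE_SSE 1
#else
    #define SKETCH_HAVE_SSE 0
#endif

// Hypothetical helper: raw low-precision 1/sqrt(val) estimate.
static inline float rsqrt_estimate(float val)
{
#if SKETCH_HAVE_SSE
    return _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(val)));
#else
    return 1.0f / std::sqrt(val);
#endif
}

// Hypothetical helper: raw low-precision 1/val estimate.
static inline float recip_estimate(float val)
{
#if SKETCH_HAVE_SSE
    return _mm_cvtss_f32(_mm_rcp_ss(_mm_set_ss(val)));
#else
    return 1.0f / val;
#endif
}

// Refine the estimate with Newton-Raphson: y = y * (1.5 - 0.5 * val * y * y).
static inline float rsqrt_refined(float val, int steps = 2)
{
    float y = rsqrt_estimate(val);
    for (int i = 0; i < steps; i++)
    {
        y = y * (1.5f - 0.5f * val * y * y);
    }
    return y;
}

// Refine the estimate with Newton-Raphson: x = x * (2 - val * x).
static inline float recip_refined(float val, int steps = 2)
{
    float x = recip_estimate(val);
    for (int i = 0; i < steps; i++)
    {
        x = x * (2.0f - val * x);
    }
    return x;
}

// Relative error of a float approximation against a double reference.
static inline double rel_error(float approx, double exact)
{
    return std::fabs((static_cast<double>(approx) - exact) / exact);
}
```

Comparing `rel_error(rsqrt_estimate(v), 1.0 / std::sqrt((double)v))` with the same call on `rsqrt_refined(v)` shows how much the iteration recovers. Note that the refinement turns an input of zero or infinity into NaN (the step multiplies zero by infinity), so those inputs would need separate handling.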

solidpixel commented 4 years ago

These are only there for test purposes; I wouldn't try to use them in production. Even if you solve the precision issues, USE_SCALAR_SSE is hard-defined to zero, because in any real-world use the naive scalar use of SIMD costs more than it saves: values have to be bounced across to the SIMD registers and back.

alecazam commented 4 years ago

Here are some reference implementations of rcp/rsqrt for Neon and SSE, if you find them helpful. They do the Newton-Raphson iteration. With SSE2Neon, astcencoder could be SIMD-optimized for both Neon and SSE while only having to use the SSE intrinsics. I haven't looked to see whether SSE2Neon has these refined calls or only exposes the low-precision intrinsics.

https://github.com/alecazam/sseneonmath
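For readers who want the shape of such a routine without following the link, here is a minimal four-wide sketch of the same idea. It is not taken from the linked repository. The `float4_sketch` type and the `recip4` / `rsqrt4` names are hypothetical, two refinement steps are assumed, and a plain scalar loop stands in where SSE is not available.

```cpp
#include <cmath>

#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
    #include <xmmintrin.h>
    #define SKETCH_HAVE_SSE 1
#else
    #define SKETCH_HAVE_SSE 0
#endif

// Hypothetical four-lane value type used only for this illustration.
struct float4_sketch
{
    float v[4];
};

// Four reciprocals at once: hardware estimate plus two Newton-Raphson steps.
static inline float4_sketch recip4(float4_sketch a)
{
    float4_sketch r;
#if SKETCH_HAVE_SSE
    const __m128 va = _mm_loadu_ps(a.v);
    const __m128 two = _mm_set1_ps(2.0f);
    __m128 x = _mm_rcp_ps(va);
    for (int i = 0; i < 2; i++)
    {
        // x = x * (2 - a * x)
        x = _mm_mul_ps(x, _mm_sub_ps(two, _mm_mul_ps(va, x)));
    }
    _mm_storeu_ps(r.v, x);
#else
    for (int i = 0; i < 4; i++)
    {
        r.v[i] = 1.0f / a.v[i];
    }
#endif
    return r;
}

// Four reciprocal square roots at once, refined the same way.
static inline float4_sketch rsqrt4(float4_sketch a)
{
    float4_sketch r;
#if SKETCH_HAVE_SSE
    const __m128 va = _mm_loadu_ps(a.v);
    const __m128 half = _mm_set1_ps(0.5f);
    const __m128 three_halves = _mm_set1_ps(1.5f);
    __m128 y = _mm_rsqrt_ps(va);
    for (int i = 0; i < 2; i++)
    {
        // y = y * (1.5 - 0.5 * a * y * y)
        const __m128 ayy = _mm_mul_ps(va, _mm_mul_ps(y, y));
        y = _mm_mul_ps(y, _mm_sub_ps(three_halves, _mm_mul_ps(half, ayy)));
    }
    _mm_storeu_ps(r.v, y);
#else
    for (int i = 0; i < 4; i++)
    {
        r.v[i] = 1.0f / std::sqrt(a.v[i]);
    }
#endif
    return r;
}
```

Working on all four lanes keeps the data in SIMD registers for the whole computation, which avoids the scalar-to-SIMD transfer cost mentioned earlier in the thread.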

ARM-software / astc-encoder

Use of SSE ops in ASTCEncoder are low precision. #162
