Closed: alecazam closed this issue 4 years ago
These are only there for test purposes; I wouldn't try to use them in production. Even if you solve the precision issues, USE_SCALAR_SSE
is hard-coded to zero because, for any real-world use, naive scalar use of SIMD costs more than it saves due to the need to move values into and out of SIMD registers.
Here are some reference implementations of rcp/rsqrt for Neon and SSE, if you find them helpful. They do the Newton-Raphson iteration. With SSE2Neon, you could have astcencoder SIMD-optimized for both SSE and Neon while only writing the SSE intrinsics. I haven't looked to see whether it has these calls or only exposes the low-precision intrinsics.
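For reference, the Newton-Raphson refinement mentioned above can be written out as below. This is the standard textbook form rather than something taken from the linked implementations; the symbols are introduced here for illustration, with $a$ the input, $x_n$ and $y_n$ successive estimates of $1/a$ and $1/\sqrt{a}$, and $\varepsilon$ the relative error of the current estimate.

```latex
\begin{aligned}
x_{n+1} &= x_n\,(2 - a\,x_n), &
x_n = \frac{1-\varepsilon}{a} &\;\Rightarrow\; x_{n+1} = \frac{1-\varepsilon^{2}}{a} \\
y_{n+1} &= y_n\left(\tfrac{3}{2} - \tfrac{1}{2}\,a\,y_n^{2}\right), &
y_n = \frac{1+\varepsilon}{\sqrt{a}} &\;\Rightarrow\; y_{n+1} = \frac{1 - \tfrac{3}{2}\varepsilon^{2} - \tfrac{1}{2}\varepsilon^{3}}{\sqrt{a}}
\end{aligned}
```

Each step roughly squares the relative error, so the number of correct bits about doubles per iteration until float rounding takes over.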
I turned off SSE optimizations in the encoder, but the implementations are very poor substitutes for high-precision calls. The rcp/rsqrt calls have to be refined at least twice with Newton-Raphson iteration and some mads (multiply-adds) to match the precision of the equivalent system call. At least that's what I had to do with the float4 calls. The SIMD calls are fast (around 3-4 SSE ops), but care must be taken, or you only get about 12-bit precision from the estimate. Also, these are typically best done on a full float4.
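As a rough illustration of the refinement described above, here is a minimal sketch of a float4 reciprocal and reciprocal square root: the SSE estimate intrinsics followed by two Newton-Raphson steps. The names `recip4`, `rsqrt4`, `recip_step`, `rsqrt_step`, and the `SKETCH_HAVE_SSE` macro are hypothetical and not part of astcenc. The scalar fallback is there only so the sketch builds on non-x86 targets.

```cpp
#include <cmath>

#if defined(__SSE__) || defined(_M_X64) || defined(_M_AMD64)
#include <xmmintrin.h>
#define SKETCH_HAVE_SSE 1
#else
#define SKETCH_HAVE_SSE 0
#endif

#if SKETCH_HAVE_SSE
// One Newton-Raphson step for 1/a: x' = x * (2 - a * x).
static inline __m128 recip_step(__m128 a, __m128 x) {
    return _mm_mul_ps(x, _mm_sub_ps(_mm_set1_ps(2.0f), _mm_mul_ps(a, x)));
}

// One Newton-Raphson step for 1/sqrt(a): y' = y * (1.5 - 0.5 * a * y * y).
static inline __m128 rsqrt_step(__m128 a, __m128 y) {
    __m128 half_a_yy = _mm_mul_ps(_mm_mul_ps(_mm_set1_ps(0.5f), a),
                                  _mm_mul_ps(y, y));
    return _mm_mul_ps(y, _mm_sub_ps(_mm_set1_ps(1.5f), half_a_yy));
}
#endif

// Hypothetical helper: refined reciprocal of four floats.
// Inputs of 0 or infinity become NaN in the refinement (0 * inf),
// so callers that can see such values need to handle them separately.
void recip4(const float in[4], float out[4]) {
#if SKETCH_HAVE_SSE
    __m128 a = _mm_loadu_ps(in);
    __m128 x = _mm_rcp_ps(a);      // low-precision hardware estimate
    x = recip_step(a, x);          // first refinement
    x = recip_step(a, x);          // second refinement
    _mm_storeu_ps(out, x);
#else
    for (int i = 0; i < 4; ++i) out[i] = 1.0f / in[i];
#endif
}

// Hypothetical helper: refined reciprocal square root of four floats.
// The same caveat about 0 and infinity applies.
void rsqrt4(const float in[4], float out[4]) {
#if SKETCH_HAVE_SSE
    __m128 a = _mm_loadu_ps(in);
    __m128 y = _mm_rsqrt_ps(a);    // low-precision hardware estimate
    y = rsqrt_step(a, y);          // first refinement
    y = rsqrt_step(a, y);          // second refinement
    _mm_storeu_ps(out, y);
#else
    for (int i = 0; i < 4; ++i) out[i] = 1.0f / std::sqrt(in[i]);
#endif
}
```

Working on all four lanes at once keeps the values in SIMD registers, which is the case where the estimate-plus-refinement approach is most likely to pay off.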
Also consider porting these to Neon with SSE2Neon.h. The Neon versions of these are also very low precision, and require similar iteration.
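For comparison, a native Neon version of the same idea might look like the sketch below. It uses the estimate intrinsics (`vrecpeq_f32`, `vrsqrteq_f32`) together with the dedicated step intrinsics (`vrecpsq_f32`, `vrsqrtsq_f32`). The function names and the `SKETCH_HAVE_NEON` macro are hypothetical, and the choice of two refinement steps is an assumption based on the Neon estimates being good to only roughly 8 bits. The scalar fallback exists only so the sketch compiles on non-ARM targets.

```cpp
#include <cmath>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define SKETCH_HAVE_NEON 1
#else
#define SKETCH_HAVE_NEON 0
#endif

#if SKETCH_HAVE_NEON
// Refined 1/a. vrecpsq_f32(a, x) returns (2 - a * x), so multiplying
// by x gives one Newton-Raphson step.
static inline float32x4_t recip_refined(float32x4_t a) {
    float32x4_t x = vrecpeq_f32(a);            // low-precision estimate
    x = vmulq_f32(x, vrecpsq_f32(a, x));       // first refinement
    x = vmulq_f32(x, vrecpsq_f32(a, x));       // second refinement
    return x;
}

// Refined 1/sqrt(a). vrsqrtsq_f32(p, q) returns (3 - p * q) / 2, so
// passing p = a * y and q = y gives the step factor (3 - a * y * y) / 2.
static inline float32x4_t rsqrt_refined(float32x4_t a) {
    float32x4_t y = vrsqrteq_f32(a);           // low-precision estimate
    y = vmulq_f32(y, vrsqrtsq_f32(vmulq_f32(a, y), y));  // first refinement
    y = vmulq_f32(y, vrsqrtsq_f32(vmulq_f32(a, y), y));  // second refinement
    return y;
}
#endif

// Hypothetical wrapper: refined reciprocal of four floats.
void neon_recip4(const float in[4], float out[4]) {
#if SKETCH_HAVE_NEON
    vst1q_f32(out, recip_refined(vld1q_f32(in)));
#else
    for (int i = 0; i < 4; ++i) out[i] = 1.0f / in[i];
#endif
}

// Hypothetical wrapper: refined reciprocal square root of four floats.
void neon_rsqrt4(const float in[4], float out[4]) {
#if SKETCH_HAVE_NEON
    vst1q_f32(out, rsqrt_refined(vld1q_f32(in)));
#else
    for (int i = 0; i < 4; ++i) out[i] = 1.0f / std::sqrt(in[i]);
#endif
}
```

Whether SSE2Neon maps the SSE estimate intrinsics onto these calls with refinement, or onto the raw estimates, would need to be checked in its source.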
static inline float rsqrt(float val) {
#if (ASTCENC_SSE >= 20) && USE_SCALAR_SSE
#else
#endif
}
static inline float recip(float val) {
#if (ASTCENC_SSE >= 20) && USE_SCALAR_SSE
#else
#endif
}