precision issue with rsqrt/sqrt/rcp/div

shord commented 2 years ago

Hi, I have been developing with SSE for many years and have now started porting some code to native ARM. sse2neon works great and I'm very happy with it, but I noticed some precision incompatibilities. To my knowledge there are only two inexact instructions in SSE: rsqrt and rcp (the ps and ss versions), both with 11-bit accuracy. NEON has vrsqrteq_f32 and vrecpeq_f32, which have 8-bit accuracy (roughly 1%), and vrsqrtsq_f32 and vrecpsq_f32, which are used to improve that accuracy. One Newton-Raphson step gets about 16 correct bits, and two steps seem to reach the full 24 bits.

_mm_rsqrt_ps currently does no Newton-Raphson iterations with SSE2NEON_PRECISE_SQRT=0 and two with SSE2NEON_PRECISE_SQRT=1; it should do one. Right now the options are 8 bits or 24 bits, not the 16 bits that would match the expected 11 bits on Intel.

_mm_sqrt_ps is expected to be exact: it should do two Newton-Raphson iterations on the non-__aarch64__ path and use vsqrtq_f32 when __aarch64__ is available. Currently it is fine on __aarch64__ with the default SSE2NEON_PRECISE_SQRT=0, but it returns only 8 bits of accuracy on non-__aarch64__ architectures. With SSE2NEON_PRECISE_SQRT=1 it uses the long vrsqrteq_f32 + 2*vrsqrtsq_f32 sequence even when __aarch64__ is available.

_mm_rcp_ps is correct with the default SSE2NEON_PRECISE_DIV off, and does a pointless additional step with SSE2NEON_PRECISE_DIV on.

_mm_div_ps is in the same situation as _mm_sqrt_ps: it is expected to be exact, and should use vdivq_f32 on __aarch64__ if available, or do two Newton-Raphson iterations otherwise.

The unit tests need to be updated too: 0.1% for rsqrt/rcp, and probably result*FLT_EPS*(maybe 2-3) for div/sqrt.
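A rough sketch of what such tolerance checks could look like in C is below. The helper names are hypothetical and are not taken from the sse2neon test suite; the 0.1% bound and the factor of 3 come from the suggestion above, and FLT_EPSILON from <float.h> is assumed to be what FLT_EPS refers to.

```c
#include <float.h>
#include <math.h>
#include <stdbool.h>

/* True if |result - expected| <= rel_tol * |result|.
 * Identical values (including matching infinities) always pass;
 * NaN never passes because every comparison with NaN is false. */
static bool within_rel(float result, float expected, float rel_tol)
{
    if (result == expected)
        return true;
    return fabsf(result - expected) <= rel_tol * fabsf(result);
}

/* Approximate instructions (rsqrt/rcp): allow 0.1% relative error. */
static bool check_estimate(float result, float expected)
{
    return within_rel(result, expected, 1e-3f);
}

/* Operations expected to be exact (div/sqrt): allow a few epsilons. */
static bool check_exact_op(float result, float expected)
{
    return within_rel(result, expected, 3.0f * FLT_EPSILON);
}
```

A test would then compare each lane of the NEON result against a scalar reference such as `1.0f / sqrtf(x)` or `a / b`, using `check_estimate` for rsqrt/rcp and `check_exact_op` for div/sqrt.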

An option to use lower precision could be added, but it shouldn't be on by default, and it probably needs to be more fine-grained, for example rsqrt_8_BITS, rcp_8_BITS, div_16_bits, and sqrt16_bits. But I don't see them being that useful.

Here is some code from Skia that seems to do it correctly: https://chromium.googlesource.com/skia/+/chrome/m69/src/opts/SkNx_neon.h. For reference, see https://chromium.googlesource.com/skia/+/chrome/m69/src/opts/SkNx_sse.h

jserv commented 1 year ago

See WebRTC's vector_math.h for a possible implementation.

jserv commented 1 year ago

Since Armv8.2, the FRSQRTE (Floating-point Reciprocal Square Root Estimate) instruction is provided to calculate an approximate reciprocal square root for each vector element in the source SIMD and FP register.

DLTcollab / sse2neon

precision issue with rsqrt/sqrt/rcp/div #526