Hi,
I have been developing with SSE for many, many years, and have now started porting some code to native ARM.
sse2neon works great, and I'm very happy with it, but I realized there are some precision incompatibilities.
To my knowledge there are only 2 inexact instructions in SSE: rsqrt and rcp (ps and ss versions), both with about 11 bits of accuracy.
NEON has vrsqrteq_f32 and vrecpeq_f32, which give about 8 bits of accuracy (error under 1%), and vrsqrtsq_f32 and vrecpsq_f32, which are used to improve accuracy. One Newton-Raphson step gets about 16 correct bits, and two steps seem to get to the full 24 bits.
_mm_rsqrt_ps currently does no NR iterations with SSE2NEON_PRECISE_SQRT=0 and two with SSE2NEON_PRECISE_SQRT=1; it should do one. So right now the options are 8 bits or 24, not the 16 that would cover the expected 11 bits on Intel.
_mm_sqrt_ps is expected to be exact: it should do 2 Newton-Raphson iterations in the non-__aarch64__ path, and use the __aarch64__ vsqrtq_f32 if available. Currently it is OK on __aarch64__ with the default SSE2NEON_PRECISE_SQRT=0, but returns only 8 bits of accuracy on non-__aarch64__ archs. With SSE2NEON_PRECISE_SQRT=1 it will use the long vrsqrteq_f32 + 2*vrsqrtsq_f32 sequence even when __aarch64__ is available.
_mm_rcp_ps is correct with the default SSE2NEON_PRECISE_DIV off, and does a pointless additional step with SSE2NEON_PRECISE_DIV on.
_mm_div_ps is in the same situation as _mm_sqrt_ps: it is expected to be exact, and should use the __aarch64__ vdivq_f32 if available, or do 2 Newton-Raphson iterations.
The unit tests need to be updated too: 0.1% for rsqrt/rcp, and probably result * FLT_EPSILON (maybe 2-3 times that) for div/sqrt.
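As a sketch of those two tolerances, the checks could be written as small helpers like these. The names approx_ok and exact_ok are made up for illustration and are not part of the existing sse2neon test suite; k is the "maybe 2-3" multiplier.

```c
#include <float.h>
#include <math.h>
#include <stdbool.h>

/* rsqrt/rcp: accept results within 0.1% of the reference. */
static bool approx_ok(float got, float want) {
    return fabsf(got - want) <= 1e-3f * fabsf(want);
}

/* div/sqrt: accept results within k * FLT_EPSILON * |reference|. */
static bool exact_ok(float got, float want, float k) {
    return fabsf(got - want) <= k * FLT_EPSILON * fabsf(want);
}
```

A relative bound is used in both so the same check works across the whole input range rather than only near 1.0.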
An option to use lower precision can be added, but it shouldn't be on by default, and it probably needs to be more fine-grained, for example rsqrt_8_BITS, rcp_8_BITS, div_16_bits, and sqrt16_bits. But I don't see them being that useful.
Here is some code from Skia that seems to do it correctly: https://chromium.googlesource.com/skia/+/chrome/m69/src/opts/SkNx_neon.h and, for reference, see https://chromium.googlesource.com/skia/+/chrome/m69/src/opts/SkNx_sse.h