DLTcollab / sse2neon

A translator from Intel SSE intrinsics to Arm/Aarch64 NEON implementation
MIT License
1.31k stars 208 forks source link

Optimizations disabled in `_MM_SET_ROUNDING_MODE` #648

Closed jmather-sesi closed 3 weeks ago

jmather-sesi commented 1 month ago

_MM_SET_ROUNDING_MODE is declared as FORCE_INLINE_OPTNONE, which lowers performance. Unless I'm missing something, wouldn't it be better to define the r union as volatile instead?

howjmay commented 1 month ago

@jserv I think maybe we should remove all the FORCE_INLINE_OPTNONE and using OPTNONE for the failed tests. Warning will be added in the root README to inform users about the potential optimization issues.

How do you think about it

jmather-sesi commented 1 month ago

Looking a bit deeper, I'm not entirely sure if _MM_SET_ROUNDING_MODE is working as expected. I'm working on a math library, and all of the numerical tests pass on x86. On Arm however, if I call _mm_setcsr(*flag) provided by sse2neon, the tests fail. If I instead call fesetexceptflag(flag, FE_ALL_EXCEPT);, it works as expected.

jserv commented 1 month ago

I think maybe we should remove all the FORCE_INLINE_OPTNONE and using OPTNONE for the failed tests. Warning will be added in the root README to inform users about the potential optimization issues.

The OPTNONE changes introduced in #638 have resulted in a known performance regression. To mitigate this, we should maintain the enforcement of static inline for _MM_SET_ROUNDING_MODE where possible. Additionally, it is time to re-evaluate the use of the volatile keyword in conjunction with our supported compilers.

howjmay commented 1 month ago

I think maybe we should remove all the FORCE_INLINE_OPTNONE and using OPTNONE for the failed tests. Warning will be added in the root README to inform users about the potential optimization issues.

The OPTNONE changes introduced in #638 have resulted in a known performance regression. To mitigate this, we should maintain the enforcement of static inline for _MM_SET_ROUNDING_MODE where possible.

Yes. OPTNONE is the keyword that we use in the tests. So removing FORCE_INLINE_OPTNONE and replace it with OPTNONE shifts the responsibility of avoiding optimization issue to the developers side, and mitigate the performance change we made in the https://github.com/DLTcollab/sse2neon/pull/638

howjmay commented 1 month ago

I will check fesetexceptflag with different optimization options too