DLTcollab / sse2neon

A translator from Intel SSE intrinsics to Arm/Aarch64 NEON implementation
MIT License

_mm_dp_ps does not always match x86_64 #595

Closed: ThiagoIze closed this issue 1 year ago

ThiagoIze commented 1 year ago

The _mm_dp_ps implementation matches x86_64 exactly for the 0xFF and 0x7F masks, but every other mask takes a code path that uses Kahan summation to gain a little extra precision. As a result, changing only which lanes the result is written to can change the computed value slightly, which is surprising and likely unwanted. It also means the result does not match what x86_64 produces.
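To illustrate why the two approaches can disagree, here is a minimal scalar sketch. It is not the sse2neon code: `dot4_plain` and `dot4_kahan` are hypothetical helper names, the plain version sums left to right, and the sketch assumes float arithmetic is evaluated in IEEE-754 binary32 without fast-math optimizations.

```c
/* Plain dot product: one rounding per multiply and one per add. */
static float dot4_plain(const float a[4], const float b[4])
{
    float sum = 0.0f;
    for (int i = 0; i < 4; i++)
        sum += a[i] * b[i];
    return sum;
}

/* Kahan-compensated dot product: c carries the rounding error of the
 * previous add and feeds it back into the next term. */
static float dot4_kahan(const float a[4], const float b[4])
{
    float sum = 0.0f;
    float c = 0.0f;
    for (int i = 0; i < 4; i++) {
        float y = a[i] * b[i] - c; /* next term, corrected by the lost bits */
        float t = sum + y;         /* low-order bits of y may be lost here */
        c = (t - sum) - y;         /* recover what was lost */
        sum = t;
    }
    return sum;
}
```

With a = {1.0f, 0x1p-24f, 0x1p-25f, 0x1p-25f} and b = {1.0f, 1.0f, 1.0f, 1.0f}, the plain sum stays at 1.0f because each small term is rounded away, while the Kahan sum returns 1.0f + 0x1p-23f. A difference of this kind between code paths is what makes the result depend on the mask.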

Note that the Intel® SSE4 Programming Reference describes this instruction as producing the same result as a standard, non-Kahan implementation:

Dot Product operations are not specified in IEEE-754. When neither FTZ nor DAZ are enabled, the dot product instructions resemble sequences of IEEE-754 multiplies and adds (with rounding at each stage)
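A scalar model of that description could look like the sketch below. `dp_ps_ref` is a hypothetical name, and the summation order (lane0 + lane1) + (lane2 + lane3) is my reading of the DPPS pseudocode in Intel's documentation, so treat it as an assumption. The high four bits of the mask select which products take part, and the low four bits select which output lanes receive the sum.

```c
#include <stdint.h>

/* Scalar reference sketch for _mm_dp_ps(a, b, imm8).
 * Every multiply and every add is rounded separately, no compensation. */
static void dp_ps_ref(const float a[4], const float b[4], uint8_t imm8,
                      float out[4])
{
    float prod[4];
    for (int i = 0; i < 4; i++) {
        /* bits 4..7: include lane i in the dot product, else use 0 */
        prod[i] = (imm8 & (0x10 << i)) ? a[i] * b[i] : 0.0f;
    }

    /* assumed pairwise order: (p0 + p1) + (p2 + p3) */
    float lo = prod[0] + prod[1];
    float hi = prod[2] + prod[3];
    float sum = lo + hi;

    for (int i = 0; i < 4; i++) {
        /* bits 0..3: write the sum to lane i, else write 0 */
        out[i] = (imm8 & (0x01 << i)) ? sum : 0.0f;
    }
}
```

In this model the low four bits only decide where the sum is stored, so 0xF1 and 0xFF yield the same value in every lane they write. Depending on the compiler, a flag such as -ffp-contract=off may be needed to stop the multiplies and adds from being fused, which would remove one of the rounding steps.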

This enhanced-precision code path should either be removed (my preference, since it also simplifies the code) or be made optional behind an SSE2NEON_PRECISE_* define.

Cuda-Chen commented 1 year ago

Hi @ThiagoIze, I vote for "enhanced precision code path should be removed" so that users are not confused. Let me make some changes and then create a PR to solve this.