_mm_dp_ps does not always match x86_64

The `_mm_dp_ps` implementation matches x86_64 exactly for the 0xFF and 0x7F modes, but every other mode uses Kahan summation to gain a little extra precision. As a result, changing only which lanes the result is written to can produce slightly different values, which is surprising and most likely unwanted. It also means the output does not match what x86_64 produces.

Note that, according to the Intel® SSE4 Programming Reference, this instruction should produce the same result as a standard (non-Kahan) implementation that rounds after each multiply and add:

> Dot Product operations are not specified in IEEE-754. When neither FTZ nor DAZ are enabled, the dot product instructions resemble sequences of IEEE-754 multiplies and adds (with rounding at each stage).
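To illustrate why the two approaches can disagree, here is a minimal scalar sketch. It is not the sse2neon code: `dp_pairwise` and `dp_kahan` are hypothetical helper names, the `(p0 + p1) + (p2 + p3)` summation order is my reading of Intel's pseudocode for DPPS, and the Kahan variant is a textbook lane-by-lane version that may differ from what the library does. Only the upper four bits of `imm8` (the multiply mask) are modelled; the broadcast mask in the lower bits is ignored.

```c
#include <assert.h>

/* Plain multiply/add model: each product and each sum is rounded to float.
 * ASSUMPTION: sums are formed as (p0 + p1) + (p2 + p3).
 * volatile temporaries keep the compiler from fusing or widening steps. */
static float dp_pairwise(const float a[4], const float b[4], int imm8)
{
    assert((imm8 & ~0xFF) == 0);
    volatile float p[4];
    for (int i = 0; i < 4; i++)
        p[i] = (imm8 & (0x10 << i)) ? a[i] * b[i] : 0.0f;
    volatile float s01 = p[0] + p[1];
    volatile float s23 = p[2] + p[3];
    volatile float r = s01 + s23;
    return r;
}

/* Textbook Kahan (compensated) summation over the selected products.
 * ASSUMPTION: lanes are accumulated in order 0..3. */
static float dp_kahan(const float a[4], const float b[4], int imm8)
{
    assert((imm8 & ~0xFF) == 0);
    volatile float sum = 0.0f, c = 0.0f;
    for (int i = 0; i < 4; i++) {
        if (!(imm8 & (0x10 << i)))
            continue;                 /* lane excluded by the mask */
        volatile float prod = a[i] * b[i];
        volatile float y = prod - c;  /* apply the running compensation */
        volatile float t = sum + y;
        volatile float d = t - sum;   /* the part of y that actually landed */
        c = d - y;                    /* rounding error lost in this step */
        sum = t;
    }
    return sum;
}
```

With `a = {1e8f, 1.0f, -1e8f, 1.0f}`, `b = {1.0f, 1.0f, 1.0f, 1.0f}` and `imm8 = 0xF0`, the pairwise version returns `0.0f` while the Kahan version returns `1.0f`, so code that falls onto the Kahan path sees a different answer from the same inputs.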

Either this enhanced-precision code path should be removed (my preference, since it also simplifies the code), or it should be made optional behind an `SSE2NEON_PRECISE_*` define.

DLTcollab / sse2neon, issue #595