Closed CarolEidt closed 4 years ago
I guess that this implementation was only chosen as a micro-optimization of sorts, with the idea that using division here would generally be slower than multiplying by the inverse. It's also possible we borrowed the optimization from the XNA source code, which would often have been running on the Xbox 360, where the optimization was more important (since it was more resource-constrained and had a PowerPC architecture). For us and our current CPU targets, it's probably not a very important optimization to keep.
It's also possible we borrowed the optimization from the XNA source code
Yes, XNA did this. DirectX Math also does this indirectly, via the rsqrtps SSE instruction.
For us and our current CPU targets, it's probably not a very important optimization to keep.
Well, divss is a rather expensive instruction at 11 cycles latency vs 4 for mulss. At the same time the overall performance of Normalize isn't great anyway because it uses sqrtpd instead of rsqrtps (15 cycles vs 4 cycles). So it's probably OK to remove the optimization, at least we'd get a more precise implementation instead of the current one which is neither precise nor fast.
But I think what matters most is that the 2 code paths are identical in this regard. Sometimes differences are unavoidable when different floating point computations are used but this one seems unwarranted.
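To make the discrepancy concrete, here is a small C sketch (not taken from the library or from the test in question) that compares direct division with multiplication by the reciprocal. The helper names are hypothetical. It assumes float arithmetic is evaluated in single precision (as on x86-64 with SSE or on ARM64) and that no fast-math flags are in effect.

```c
#include <stdint.h>
#include <string.h>

/* Direct division: one correctly rounded IEEE 754 operation. */
static float div_direct(float x, float d) {
    return x / d;
}

/* Multiplication by the reciprocal: two rounded operations.
 * volatile keeps the reciprocal stored as a float, so it is
 * rounded before the multiply. */
static float div_by_reciprocal(float x, float d) {
    volatile float inv = 1.0f / d;
    return x * inv;
}

/* Distance in units in the last place (ULP) between two finite,
 * positive floats, computed from their bit patterns. */
static int32_t ulp_distance(float a, float b) {
    int32_t ia, ib;
    memcpy(&ia, &a, sizeof ia);
    memcpy(&ib, &b, sizeof ib);
    return ia > ib ? ia - ib : ib - ia;
}
```

Under those assumptions, `div_direct(41.0f, 41.0f)` is exactly `1.0f`, while `div_by_reciprocal(41.0f, 41.0f)` lands one ULP below it, because `1.0f / 41.0f` is rounded before the multiply. Power-of-two divisors such as `4.0f` have an exact reciprocal, so both paths agree for them.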
it uses sqrtpd instead of rsqrtps (15 cycles vs 4 cycles)
I am assuming this is due to the use of Math.Sqrt(double), right? Given that MathF.Sqrt(float) is available, we could potentially avoid that.
I am assuming this is due to the use of Math.Sqrt(double), right? Given that MathF.Sqrt(float) is available, we could potentially avoid that.
MathF.Sqrt would give you sqrtss which is slightly faster (12 cycles of latency instead of 15-16) and avoids the need for float->double and double->float conversions (which aren't exactly cheap either). I'm not sure if there's something currently that could give you rsqrtps, that would have to be a JIT intrinsic.
The divide implementation for the fixed-size vectors uses multiplication by inverse, which doesn't yield the same result as divide. This affects the result in the test https://github.com/dotnet/coreclr/blob/master/tests/src/JIT/Regression/JitBlue/GitHub_8220/GitHub_8220.cs, which was modified to allow a variance of ~1 ULP, but which should in theory yield the same result for the two cases.