Vector2, Vector3 and Vector4 divide accuracy

dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.

https://docs.microsoft.com/dotnet/core/

MIT License

15.2k stars 4.72k forks source link

Vector2, Vector3 and Vector4 divide accuracy #19582

Closed CarolEidt closed 4 years ago

CarolEidt commented 7 years ago

The divide implementation for the fixed-size vectors uses multiplication by inverse, which doesn't yield the same result as divide. This affects the result in the test https://github.com/dotnet/coreclr/blob/master/tests/src/JIT/Regression/JitBlue/GitHub_8220/GitHub_8220.cs, which was modified to allow a variance of ~1 ULP, but which should in theory yield the same result for the two cases.

mellinoe commented 7 years ago

I guess that this implementation was only chosen as a micro-optimization of sorts, with the idea that using division here would generally be slower than multiplying by the inverse. It's also possible we borrowed the optimization from the XNA source code, which would have often been running on the xbox 360 where the optimization was more important (since it was more resource-constrained and had a PowerPC arch). For us and our current CPU targets, it's probably not a very important optimization to keep.

mikedn commented 7 years ago

It's also possible we borrowed the optimization from the XNA source code

Yes, XNA did this. DirectX Math also does this indirectly, via the rsqrtps SSE instruction.

For us and our current CPU targets, it's probably not a very important optimization to keep.

Well, divss is a rather expensive instruction at 11 cycles latency vs 4 for mulss. At the same time the overall performance of Normalize isn't great anyway because it uses sqrtpd instead of rsqrtps (15 cycles vs 4 cycles). So it's probably OK to remove the optimization, at least we'd get a more precise implementation instead of the current one which is neither precise nor fast.

But I think what matters most is that the 2 code paths are identical in this regard. Sometimes differences are unavoidable when different floating point computations are used but this one seems unwarranted.

mellinoe commented 7 years ago

it uses sqrtpd instead of rsqrtps (15 cycles vs 4 cycles)

I am assuming this is due to the use of Math.Sqrt(double), right? Given that MathF.Sqrt(float) is available, we could potentially avoid that.

mikedn commented 7 years ago

I am assuming this is due to the use of Math.Sqrt(double), right? Given that MathF.Sqrt(float) is available, we could potentially avoid that.

MathF.Sqrt would give you sqrtss which is slightly faster (12 cycles of latency instead of 15-16) and avoids the need for float->double and double->float conversions (which aren't exactly cheap either). I'm not sure if there's something currently that could give you rsqrtps, that would have to be a JIT intrinsic.