dotnet / runtime


TensorPrimitive: Consider to optimize integer divisions #105204

Open huoyaoyuan opened 1 month ago

huoyaoyuan commented 1 month ago

By default, TensorPrimitives delegates simple operators to vector intrinsics. This is fine for most operations, but integer division (IDIV) is an exception.

First, most (if not all) ISAs lack a vector instruction for integer division. I've checked AVX512/Avx2 and Sve/AdvSimd and couldn't find one. As a result, our vector intrinsics fall back to a software implementation. On my CPU with AVX2, this is about 2.5x slower than a naive for-loop computing int[1024] / int (scalar).

Second, when every element is divided by the same divisor, there is the widely used preinv algorithm, which precomputes an inverse of the divisor once and turns each division into a cheaper multiplication. That multiplication can be vectorized on various ISAs.
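To illustrate the idea, here is a minimal scalar sketch for unsigned 32-bit values. `PreinvSketch` and its members are hypothetical names, not part of TensorPrimitives, and this is not necessarily the form an actual implementation would take. It assumes the divisor is fixed for the whole span: it precomputes ceil(2^64 / d) once and then takes the high 64 bits of a 64x64-bit product for each element. Signed `int` division would additionally need sign handling, which is omitted here.

```csharp
using System;

// Hypothetical helper for illustration only; not part of TensorPrimitives.
public static class PreinvSketch
{
    // Precompute magic = ceil(2^64 / d). Valid for d >= 2 only:
    // d == 1 would overflow to 0, and d == 0 has no inverse.
    public static ulong ComputeMagic(uint d)
    {
        if (d < 2) throw new ArgumentOutOfRangeException(nameof(d));
        return ulong.MaxValue / d + 1;
    }

    // floor(n / d), computed as the high 64 bits of magic * n.
    // This is exact for every 32-bit n when magic came from ComputeMagic(d).
    public static uint Divide(uint n, ulong magic)
    {
        ulong high = Math.BigMul(magic, (ulong)n, out ulong _);
        return (uint)high;
    }

    // Divide every element of x by the same divisor d.
    public static void DivideAll(ReadOnlySpan<uint> x, uint d, Span<uint> destination)
    {
        if (destination.Length < x.Length)
            throw new ArgumentException("Destination is too short.", nameof(destination));
        if (d == 0) throw new DivideByZeroException();
        if (d == 1) { x.CopyTo(destination); return; }

        ulong magic = ComputeMagic(d); // one real division, done once
        for (int i = 0; i < x.Length; i++)
            destination[i] = Divide(x[i], magic); // multiply only, no division
    }
}
```

The per-element work is a single widening multiply, which is the part that maps onto vector multiply instructions. The precomputation costs one hardware division, so the approach pays off only when the same divisor is reused across many elements.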

I'm not sure whether integer division is popular enough to justify this optimization. But we should at least disable DivideOperator.Vectorizable for integer types, because the vectorized path ends up using the software implementation.

dotnet-policy-service[bot] commented 1 month ago

Tagging subscribers to this area: @dotnet/area-system-numerics-tensors See info in area-owners.md if you want to be subscribed.

jeffhandley commented 1 month ago

@michaelgsharp / @tannergooding -- Can the two of you chat about this please and decide:

  1. Do we want to tackle this in .NET 9 since it's in TensorPrimitives and not Tensor<T>?
  2. Which of you can take the assignment?

tannergooding commented 1 month ago

This is an optimization, not a correctness fix, and it is a fairly involved change (especially with the relevant perf testing).

Given that TensorPrimitives has been stable since .NET 8, I'd leave this as is and optimize it for .NET 10 instead.

michaelgsharp commented 1 month ago

I agree with Tanner that we should push the majority of this back to .NET 10. It's trivial to disable the vectorization for the int cases, and that will give us a few wins, so we should do that part in .NET 9. That will still leave many cases running suboptimally, and we should tackle those in .NET 10.

tannergooding commented 1 month ago

Moving to .NET 10. I have put up #106288 to avoid vectorization for types that aren't float or double. I also called out cases where a manual for loop is likely to remain faster until .NET 10 (particularly when the divisor is a constant).