dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

TensorPrimitives improvements in .NET 9.0 #93286

Open stephentoub opened 11 months ago

stephentoub commented 11 months ago

Regardless of any additional types we may want to add to System.Numerics.Tensors, we would like to expand the set of APIs exposed on the TensorPrimitives static class in a few ways (beyond the work done in .NET 8 in https://github.com/dotnet/runtime/issues/92219):

We plan to update the System.Numerics.Tensors package alongside .NET 8 servicing releases. When there are bug fixes and performance improvements only, the patch number part of the version will be incremented. When there are new APIs added, the minor version will be bumped. For guidance on how we bump minor/major package versions, see this example.

ghost commented 11 months ago

Tagging subscribers to this area: @dotnet/area-system-numerics-tensors See info in area-owners.md if you want to be subscribed.

Issue Details
Regardless of any additional types we may want to add to `System.Numerics.Tensors`, we would like to expand the set of APIs exposed on the `TensorPrimitives` static class in a few ways:

- Additional operations from `Math{F}` that don't currently have representation on `TensorPrimitives`, e.g. `CopySign`, `Reciprocal{Sqrt}{Estimate}`, `Sqrt`, `Ceiling`, `Floor`, `Truncate`, `Log10`, `Log(x, y)` (with `y` as both span and scalar), `Pow(x, y)` (with `y` as both span and scalar), `Cbrt`, `IEEERemainder`, `Acos`, `Acosh`, `Cos`, `Asin`, `Asinh`, `Sin`, `Atan`. [This unmerged commit](https://github.com/dotnet/runtime/commit/ada9b18f16ab6c248fe10deedb22404802334309) has a sketch, but it's out of date with improvements that have been made to the library since, and all of the operations should be vectorized.
- Additional operations defined in the numerical interfaces that don't currently have representation on `TensorPrimitives`, e.g. `BitwiseAnd`, `BitwiseOr`, `BitwiseXor`, `Exp10`, `Exp10M1`, `Exp2`, `Exp2M1`, `ExpM1`, `Atan2`, `Atan2Pi`, `ILogB`, `Lerp`, `ScaleB`, `Round`, `Log10P1`, `Log2P1`, `LogP1`, `Hypot`, `RootN`, `AcosPi`, `AsinPi`, `AtanPi`, `CosPi`, `SinPi`, `TanPi`
- Additional operations defined in BLAS / LAPACK that don't currently have representation on `TensorPrimitives`
- Additional operations that would enable completely removing the internal `CpuMath` class from ML.NET, e.g. `Add` (with indices), `AddScale` (with indices), `DotProductSparse`, `MatrixTimesSource`, `ScaleAdd` improvement via `AddMultiply` or `MultipleAdd` overloads, `SdcaL1UpdateDense`, `SdcaL1UpdateSparse`, and `ZeroMatrixItems` (might exist in System.Memory)
- Generic overloads of all relevant methods, constrained to the appropriate numerical types

Concrete proposal to follow.
Author: stephentoub
Assignees: -
Labels: `api-suggestion`, `area-System.Numerics.Tensors`
Milestone: 9.0.0
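For context, the span-based shape of the existing .NET 8 `TensorPrimitives` surface that this proposal extends looks roughly like the following sketch; the input values are purely illustrative:

```csharp
using System;
using System.Numerics.Tensors; // System.Numerics.Tensors NuGet package

float[] x = { 1f, 2f, 3f, 4f };
float[] y = { 4f, 3f, 2f, 1f };
float[] dest = new float[4];

// Element-wise operation: dest[i] = x[i] + y[i], vectorized internally.
TensorPrimitives.Add(x, y, dest);

// Horizontal reduction: dot product of the two spans.
float dot = TensorPrimitives.Dot(x, y);
Console.WriteLine(dot); // 1*4 + 2*3 + 3*2 + 4*1 = 20
```

The proposed additions (e.g. `CopySign`, `Hypot`, `Atan2`) would follow this same span-in/span-out pattern, plus generic overloads constrained to the numeric interfaces.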
Szer commented 11 months ago

Could you please elaborate on the advantages of having these APIs in a BCL rather than in a specialized NuGet package (like numpy in Python)? This could provide a valuable perspective for further discussion.

stephentoub commented 11 months ago

Could you please elaborate on the advantages of having these APIs in a BCL rather than in a specialized NuGet package

It is a NuGet package today; it's currently not part of netcoreapp. If it were to be pulled into netcoreapp as well, it would be because we'd be using it from elsewhere in netcoreapp, e.g. from APIs like Enumerable.Average, BitArray.And, ManagedWebSocket.ApplyMask, etc., which we very well may do in the future (that has no impact on it continuing to be available as a NuGet package).

xoofx commented 11 months ago

Hey @stephentoub,

Would it be possible to expose the low level parts of the API instead of only providing Span versions?

e.g.

```csharp
public static Vector128<float> Log2(Vector128<float> value);
public static Vector256<float> Log2(Vector256<float> value);
public static Vector512<float> Log2(Vector512<float> value);
// ...etc.
```

I did that for a prototype of a similar API and it's working great. One reason to expose these APIs is that you can actually build higher-level functions on top of them (e.g. for tensors, the whole zoo of activation functions) and then build the Span versions on top of those.

These APIs can then be used for other kinds of custom Span batching (not related to tensors), where the packing of the vector is different (e.g. 4×float chunked as xxxx, yyyy, zzzz).
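The layering described here could be sketched as follows. The `Vector128<float>` overload of `Log2` is the hypothetical vector-level API from the comment above; its body here is a scalar stand-in (a real implementation would be fully vectorized), and the span version is built on top of it with a scalar tail:

```csharp
using System;
using System.Runtime.Intrinsics;

static class Log2Sketch
{
    // Stand-in for the proposed vector-level kernel. A real implementation
    // would use SIMD math rather than per-lane MathF.Log2 calls.
    static Vector128<float> Log2(Vector128<float> v) =>
        Vector128.Create(
            MathF.Log2(v.GetElement(0)),
            MathF.Log2(v.GetElement(1)),
            MathF.Log2(v.GetElement(2)),
            MathF.Log2(v.GetElement(3)));

    // Span version layered on the vector kernel, as described above.
    public static void Log2(ReadOnlySpan<float> src, Span<float> dst)
    {
        int i = 0;
        for (; i <= src.Length - Vector128<float>.Count; i += Vector128<float>.Count)
            Log2(Vector128.Create(src.Slice(i))).CopyTo(dst.Slice(i));

        for (; i < src.Length; i++) // scalar tail for leftover elements
            dst[i] = MathF.Log2(src[i]);
    }
}
```

The same pattern works for any custom packing: the caller decides how spans map onto vectors, and the vector-level kernel stays reusable.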

tannergooding commented 11 months ago

Would it be possible to expose the low level parts of the API instead of only providing Span versions?

Yes, but it needs to be its own proposal, cover all 5 vector types (Vector, Vector64/128/256/512), and consider whether it's applicable to Vector2/3/4 as well.

xoofx commented 11 months ago

Yes, but it needs to be its own proposal and cover all 5 vector types (Vector, Vector64/128/256/512)

Cool, I will try to write something.

xoofx commented 11 months ago

Would it be possible to expose the low level parts of the API instead of only providing Span versions?

Follow-up, created the proposal #93513

msedi commented 10 months ago

@stephentoub:

If it were to be pulled into netcoreapp as well, it would be because we'd be using it from elsewhere in netcoreapp

If brought into the BCL, wouldn't it make sense to rename TensorPrimitives to, let's say, ArrayMath, VectorMath, or VectorPrimitives? Tensor seems a bit exaggerated for what it does, namely doing some math on arrays.

tannergooding commented 10 months ago

@msedi that would be a breaking change. Additionally, the intent is to expand it to the full set of BLAS support, so Tensor is a very apt and appropriate name that was already scrutinized, reviewed, and approved by API review.

msedi commented 10 months ago

@tannergooding: Sure, you're right; I was just under the impression that there could be something more primitive. The tensor is, let's say, something higher level, whereas the vector/array methods are on a lower level. But I'm completely fine with it as long as I know where to find it.

BTW, when looking at the code and the effort behind TensorPrimitives: are there any efforts such that the JIT will some day manage to do the SIMD unfolding for us?

tannergooding commented 10 months ago

the JIT will some day manage to do the SIMD unfolding for us?

The JIT is unlikely to get auto-vectorization in the near future, as such support is complex and quite expensive to do. Additionally, outside of particular domains, such support does not often light up, and it has a measurable impact on real-world apps even less frequently. Especially for small workloads it can often have the opposite effect and slow down your code. In the domains where it does light up, and particularly where it would be beneficial, you are often going to get better perf by writing your own SIMD code directly.

It is therefore my opinion that our efforts would be better spent providing APIs from the BCL that provide this acceleration for you, such as all the APIs on Span<T>, accelerating LINQ, the new APIs on TensorPrimitives, etc. It may likewise be beneficial to expose some SIMD infrastructure helpers like we've defined internally for TensorPrimitives; that is, expose some public form of InvokeSpanSpanIntoSpan and friends, which would allow developers to worry only about providing the inner kernel and to have the rest of the SIMD logic (leading/trailing elements, alignment, unrolling, etc.) handled internally. Efforts like ISimdVector<TSelf, T> also fit the bill of making it simpler for devs to write SIMD code.
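The helper shape mentioned above is internal to TensorPrimitives today; a simplified public sketch might look like the following. The name `InvokeSpanSpanIntoSpan` follows the internal helper referenced in the comment, but the interface and everything else here are illustrative (real code would also handle unrolling, alignment, and the wider vector sizes):

```csharp
using System;
using System.Runtime.Intrinsics;

// Hypothetical kernel contract: the caller supplies only the scalar and
// vector forms of the element-wise operation.
interface IBinaryKernel
{
    float Invoke(float x, float y);
    Vector128<float> Invoke(Vector128<float> x, Vector128<float> y);
}

static class TensorHelpersSketch
{
    // Drives the kernel over full vectors, then finishes the trailing
    // elements with the scalar form. Unrolling/alignment omitted for brevity.
    public static void InvokeSpanSpanIntoSpan<TKernel>(
        ReadOnlySpan<float> x, ReadOnlySpan<float> y, Span<float> dst, TKernel kernel)
        where TKernel : struct, IBinaryKernel
    {
        int i = 0;
        for (; i <= x.Length - Vector128<float>.Count; i += Vector128<float>.Count)
        {
            kernel.Invoke(Vector128.Create(x.Slice(i)), Vector128.Create(y.Slice(i)))
                  .CopyTo(dst.Slice(i));
        }
        for (; i < x.Length; i++)
            dst[i] = kernel.Invoke(x[i], y[i]);
    }
}

// Example kernel: element-wise addition.
struct AddKernel : IBinaryKernel
{
    public float Invoke(float x, float y) => x + y;
    public Vector128<float> Invoke(Vector128<float> x, Vector128<float> y) => x + y;
}
```

With such a helper public, a developer implementing a new element-wise operation would write only the two `Invoke` bodies and get the span plumbing for free.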

msedi commented 10 months ago

@tannergooding: Thanks for the info. That makes sense. For our case, we wrote source generators to generate all the array primitives, currently with Vector, but I wanted to benchmark against your implementations. I assume yours is better ;-)

tannergooding commented 1 month ago

Remaining work is for .NET 10