[API Proposal]: FMA for Vector2/3/4 specifically

BreyerW commented 1 year ago

Background and motivation

Currently FMA is exposed for primitives (double and float) and for full-blown SIMD vectors, but AFAIK there is nothing for the convenience primitives Vector2/3/4, which sit between them. FMA isn't only about perf (on hardware with built-in scalar or SIMD FMA, of course) but also about avoiding intermediate rounding.

API Proposal

namespace System.Numerics;

public struct Vector2
{
+    public static Vector2 MultiplyAddEstimate(Vector2 x, Vector2 y, Vector2 z);
}

public struct Vector3
{
+    public static Vector3 MultiplyAddEstimate(Vector3 x, Vector3 y, Vector3 z);
}

public struct Vector4
{
+    public static Vector4 MultiplyAddEstimate(Vector4 x, Vector4 y, Vector4 z);
}

Under the hood, Vector2 could use Vector64<float> or MathF.FusedMultiplyAdd, whichever is applicable and faster. Vector3, I imagine, would likely widen to Vector128<float> with the last element set to 0, since that element is discarded on return, while Vector4 would be used as-is as a Vector128<float>.

The software fallback would be a simple component-wise (a * b) + c for perf reasons, hence the Estimate suffix: the fallback rounds the product before the addition, so its rounding behaviour can differ from a fused operation (for example, for very large components).
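The widening and fallback described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not how the runtime would implement it: `VectorFmaSketch` and its methods are hypothetical names, and the dispatch simply checks the existing `Fma.IsSupported` and `AdvSimd.IsSupported` flags before falling back to `(x * y) + z`.

```csharp
using System;
using System.Numerics;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.Arm;
using System.Runtime.Intrinsics.X86;

// Hypothetical helper class; nothing here is an existing System.Numerics API.
public static class VectorFmaSketch
{
    public static Vector4 MultiplyAddEstimate(Vector4 x, Vector4 y, Vector4 z)
    {
        if (Fma.IsSupported)
        {
            // x86/x64 FMA3: computes (x * y) + z with a single rounding per lane.
            return Fma.MultiplyAdd(x.AsVector128(), y.AsVector128(), z.AsVector128()).AsVector4();
        }

        if (AdvSimd.IsSupported)
        {
            // Arm AdvSimd takes the addend first: addend + (left * right).
            return AdvSimd.FusedMultiplyAdd(z.AsVector128(), x.AsVector128(), y.AsVector128()).AsVector4();
        }

        // Software fallback: the product is rounded before the addition.
        return (x * y) + z;
    }

    public static Vector3 MultiplyAddEstimate(Vector3 x, Vector3 y, Vector3 z)
    {
        // Widen to four lanes with a zero in the last element, then drop it again.
        Vector4 r = MultiplyAddEstimate(new Vector4(x, 0f), new Vector4(y, 0f), new Vector4(z, 0f));
        return new Vector3(r.X, r.Y, r.Z);
    }

    public static void Main()
    {
        Vector3 fma = MultiplyAddEstimate(Vector3.UnitX, Vector3.UnitY, Vector3.UnitZ);
        Console.WriteLine(fma == Vector3.UnitZ); // True: every lane is exact on all paths
    }
}
```

Routing Vector3 through the Vector4 path keeps one dispatch point; whether that is faster than three scalar operations is something only measurement could settle.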

API Usage

var x = Vector3.UnitX;
var y = Vector3.UnitY;
var z = Vector3.UnitZ;

var fma = Vector3.MultiplyAddEstimate(x, y, z);

Alternative Designs

One alternative would be to provide a platform-agnostic SIMD FMA (which would currently use S.R.I.X86.Fma and S.R.I.Arm plus a software fallback under the hood), at which point hand-rolling FMA for Vector2/3/4 wouldn't be too bad.

Another alternative is to hand-roll FMA yourself for each component, but that gets uglier the more components there are, and adding SIMD FMA support for perf makes it even worse, especially since there's no platform-agnostic SIMD FMA AFAIK.

Risks

The Estimate behaviour could be surprising when hardware support for FMA differs, but that's mostly a documentation exercise, and the Estimate suffix already signals that this is not exactly FusedMultiplyAdd.

ghost commented 1 year ago

Tagging subscribers to this area: @dotnet/area-system-numerics
See info in area-owners.md if you want to be subscribed.

Issue Details

### Background and motivation

Currently FMA is exposed for primitives (`double` and `float`) and for full-blown SIMD vectors, but AFAIK there is nothing for the convenience primitives `Vector2/3/4`, which sit between them. FMA isn't only about perf (on hardware with built-in scalar or SIMD FMA, of course) but also about avoiding intermediate rounding.

### API Proposal

```csharp
namespace System.Numerics;

public struct Vector2
{
+    public static Vector2 FusedMultiplyAdd(Vector2 x, Vector2 y, Vector2 z);
}

public struct Vector3
{
+    public static Vector3 FusedMultiplyAdd(Vector3 x, Vector3 y, Vector3 z);
}

public struct Vector4
{
+    public static Vector4 FusedMultiplyAdd(Vector4 x, Vector4 y, Vector4 z);
}
```

Under the hood, `Vector2` could use `Vector64<float>` or `MathF.FusedMultiplyAdd`, whichever is applicable and faster. `Vector3`, I imagine, would likely widen to `Vector128<float>` with the last element set to 0, since that element is discarded on return, while `Vector4` would be used as-is as `Vector4`.

Note: for the first version it would be fine to just expose FMA as component-wise `MathF.FusedMultiplyAdd` without fancy SIMD support. The idea here is to enable simple FMA for `System.Numerics`.

### API Usage

```csharp
var x = Vector3.UnitX;
var y = Vector3.UnitY;
var z = Vector3.UnitZ;

var fma = Vector3.FusedMultiplyAdd(x, y, z);
```

### Alternative Designs

One alternative would be to provide a platform-agnostic SIMD FMA (which would currently use S.R.I.X86.Fma and S.R.I.Arm plus a software fallback under the hood), at which point hand-rolling FMA for `Vector2/3/4` wouldn't be too bad.

Another alternative is to hand-roll FMA yourself for each component, but that gets uglier the more components there are, and adding SIMD FMA support for perf makes it even worse, especially since there's no platform-agnostic SIMD FMA AFAIK.

### Risks

None AFAIK

Author:	BreyerW
Assignees:	-
Labels:	`api-suggestion`, `area-System.Numerics`
Milestone:	-

tannergooding commented 1 year ago

This API is very difficult to provide since not all hardware has FMA support and it needs to behave the same whether that support exists or not. Therefore, it would resolve to a very slow implementation on older hardware which may be unexpected.

It would likely be better to expose MultiplyAddEstimate which is then free to do (a * b) + c -or- fma(a, b, c) depending on what the hardware supports. Such a name follows the existing convention we've established.

If the proposal is updated to follow that, we should consider exposing a similar API to float/double and the corresponding INumberBase interface.

ghost commented 1 year ago

This issue has been marked needs-author-action and may be missing some important information.

BreyerW commented 1 year ago

@tannergooding done, let me know if I need to tweak the proposal further.

BTW, is there an API that checks for FMA support specifically (not SIMD), or a good-enough approximate check via SIMD? Some folks may want to know that MultiplyAddEstimate (MAE) is going to differ for very large inputs on unsupported hardware.
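For reference, the existing hardware-intrinsics classes already expose `IsSupported` flags, so a check along these lines is possible today. This is a hedged sketch: `FmaSupportCheck` and `HasHardwareFma` are hypothetical names, and treating these two flags as a proxy for what a future MultiplyAddEstimate would do is an assumption, not documented behaviour.

```csharp
using System;
using System.Runtime.Intrinsics.Arm;
using System.Runtime.Intrinsics.X86;

public static class FmaSupportCheck
{
    // Hypothetical helper: true when the runtime exposes hardware FMA
    // intrinsics on x86/x64 (FMA3) or Arm (AdvSimd).
    public static bool HasHardwareFma => Fma.IsSupported || AdvSimd.IsSupported;

    public static void Main()
    {
        Console.WriteLine($"x86 FMA: {Fma.IsSupported}, Arm AdvSimd: {AdvSimd.IsSupported}");
        Console.WriteLine($"Hardware FMA likely available: {HasHardwareFma}");
    }
}
```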

Also, maybe we should add FusedMultiplyAdd alongside the Estimate variant anyway, since I'm pretty sure there are cases where correctness trumps any perf concerns (I'm referring to the difference in rounding behaviour). The software fallback would just be component-wise MathF.FusedMultiplyAdd, which already has the proper semantics but is slow when hardware support is lacking, no?
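To make the rounding difference concrete, here is a small sketch comparing a component-wise `MathF.FusedMultiplyAdd` fallback with the plain `(x * y) + z`. The `FusedVsUnfused` class and its `FusedMultiplyAdd` helper are hypothetical; only `MathF.FusedMultiplyAdd` and the `Vector3` operators are existing APIs. The inputs are chosen so that the exact product needs one more bit than a float can hold, which shows the two can differ even for values close to 1.

```csharp
using System;
using System.Numerics;

public static class FusedVsUnfused
{
    // Hypothetical component-wise fallback built on the existing scalar API.
    public static Vector3 FusedMultiplyAdd(Vector3 x, Vector3 y, Vector3 z) =>
        new Vector3(
            MathF.FusedMultiplyAdd(x.X, y.X, z.X),
            MathF.FusedMultiplyAdd(x.Y, y.Y, z.Y),
            MathF.FusedMultiplyAdd(x.Z, y.Z, z.Z));

    public static void Main()
    {
        float a = 1f + 1f / 4096f;     // 1 + 2^-12, exactly representable
        float c = -(1f + 1f / 2048f);  // -(1 + 2^-11), exactly representable

        var x = new Vector3(a);
        var z = new Vector3(c);

        // Exact product: 1 + 2^-11 + 2^-24. A fused operation keeps the 2^-24 term
        // until the final rounding; rounding the product to float first drops it.
        Vector3 fused = FusedMultiplyAdd(x, x, z);
        Vector3 unfused = (x * x) + z;

        Console.WriteLine($"fused:   {fused.X:E}");
        Console.WriteLine($"unfused: {unfused.X:E}");
        Console.WriteLine($"results differ: {fused != unfused}");
    }
}
```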

And just food for thought: what's the newest hardware that does NOT support FMA? I'm no hardware expert, but maybe the last such hardware is old enough that it's no longer a real concern?
