BreyerW opened 1 year ago
Tagging subscribers to this area: @dotnet/area-system-numerics. See info in area-owners.md if you want to be subscribed.
| Author: | BreyerW |
|---|---|
| Assignees: | - |
| Labels: | `api-suggestion`, `area-System.Numerics` |
| Milestone: | - |
This API is very difficult to provide since not all hardware has FMA support and it needs to behave the same whether that support exists or not. Therefore, it would resolve to a very slow implementation on older hardware, which may be unexpected.

It would likely be better to expose `MultiplyAddEstimate`, which is then free to do `(a * b) + c` -or- `fma(a, b, c)` depending on what the hardware supports. Such a name follows the existing convention we've established.

If the proposal is updated to follow that, we should consider exposing a similar API for `float`/`double` and the corresponding `INumberBase` interface.
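For illustration, the scalar shape of such an estimate API could look like this (a minimal sketch; the helper name `MultiplyAddEstimate` and the x86-only check are assumptions here):

```csharp
using System;
using System.Runtime.Intrinsics.X86;

static class MultiplyAddEstimateSketch
{
    // Hypothetical helper: free to fuse or not, depending on the hardware.
    // An ARM check (e.g. AdvSimd.IsSupported) could be added the same way.
    public static float MultiplyAddEstimate(float a, float b, float c)
        => Fma.IsSupported
            ? MathF.FusedMultiplyAdd(a, b, c) // single rounding via hardware FMA
            : (a * b) + c;                    // two roundings, but fast everywhere
}
```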
This issue has been marked `needs-author-action` and may be missing some important information.
@tannergooding done, let me know if I need to tweak the proposal further.
BTW, is there an API that checks for FMA support specifically (not just SIMD in general), or a good-enough approximate check via SIMD? Because some folks may want to know that `MultiplyAddEstimate` is going to differ for very large inputs on unsupported hardware.
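For context, the closest I'm aware of today is checking the ISA classes under `System.Runtime.Intrinsics` directly (a rough sketch; not sure this covers every platform):

```csharp
using System.Runtime.Intrinsics.Arm;
using System.Runtime.Intrinsics.X86;

// x86/x64 exposes FMA as its own ISA class.
bool hasX86Fma = Fma.IsSupported;

// On ARM64 the fused multiply-add instructions ship with AdvSimd (NEON),
// so AdvSimd support should imply FMA-capable hardware.
bool hasArmFma = AdvSimd.IsSupported;

bool hasHardwareFma = hasX86Fma || hasArmFma;
```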
Also, maybe we should add `FusedMultiplyAdd` along with the `Estimate` variant anyway, since I'm pretty sure there would be cases where correctness would trump any perf concerns (I'm referring to the rounding behaviour difference). The software fallback would be just component-wise `MathF.FusedMultiplyAdd`, which already has the proper semantics but slow execution in the face of lacking hardware support, no?
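For example, something along these lines for the non-`Estimate` variant (just a sketch; the `Vector3` overload here is hypothetical):

```csharp
using System;
using System.Numerics;

static class FusedMultiplyAddFallbackSketch
{
    // Correctly rounded (one rounding per component), but potentially slow on
    // hardware without FMA, since MathF.FusedMultiplyAdd then falls back to software.
    public static Vector3 FusedMultiplyAdd(Vector3 a, Vector3 b, Vector3 c)
        => new Vector3(
            MathF.FusedMultiplyAdd(a.X, b.X, c.X),
            MathF.FusedMultiplyAdd(a.Y, b.Y, c.Y),
            MathF.FusedMultiplyAdd(a.Z, b.Z, c.Z));
}
```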
And just food for thought: what's the newest hardware that does NOT support FMA? I'm not a hardware expert, but maybe the last such hardware is old enough that it's no longer a real concern?
### Background and motivation
Currently FMA is exposed for primitives (`double` & `float`) and full-blown SIMD vectors, but AFAIK nothing for the convenience primitives like `Vector2`/`Vector3`/`Vector4`, which sit in between them. FMA isn't only about perf (on hardware that has built-in FMA/SIMD FMA, of course) but also about avoiding intermediate rounding.

### API Proposal
Under the hood, `Vector2` could use `Vector64<float>` or `MathF.FusedMultiplyAdd`, whichever is applicable and faster. `Vector3` I imagine would likely widen to `Vector128<float>` and set the last element to 0, since it will be discarded when returning, while `Vector4` would be used as-is as `Vector128<float>`.
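Roughly what I mean for the `Vector3` hardware path (a sketch only, assuming x86 FMA; a real implementation would also cover ARM and use `Vector64<float>` for `Vector2`):

```csharp
using System.Numerics;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class Vector3FmaSketch
{
    public static Vector3 MultiplyAddEstimate(Vector3 a, Vector3 b, Vector3 c)
    {
        if (Fma.IsSupported)
        {
            // Widen to Vector128<float>; AsVector128 zeroes the unused fourth lane.
            Vector128<float> fused = Fma.MultiplyAdd(a.AsVector128(), b.AsVector128(), c.AsVector128());
            return fused.AsVector3(); // the fourth lane is discarded on the way back
        }

        // Software fallback: plain multiply + add (two roundings per component).
        return a * b + c;
    }
}
```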
The software fallback would be simple `(a * b) + c` component-wise for perf reasons, hence the `Estimate` suffix, since the software fallback would differ in rounding behaviour for very large components.

### API Usage
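For example (hypothetical shape of the proposed API, just to show intent):

```csharp
using System.Numerics;

Vector3 position = new(1f, 2f, 3f);
Vector3 velocity = new(0.5f, -0.25f, 2f);
float deltaTime = 1f / 60f;

// position += velocity * deltaTime, fused into a single rounding where hardware allows.
position = Vector3.MultiplyAddEstimate(velocity, new Vector3(deltaTime), position);
```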
### Alternative Designs
An alternative would be to write a platform-agnostic SIMD FMA (which currently would use S.R.I.x86.FMA and S.R.I.ARM plus a software fallback under the hood), at which point handrolling FMA for `Vector2`/`Vector3`/`Vector4` wouldn't be too bad (see the sketch below).

Another alternative is to handroll FMA on your own for each component, but that becomes ugly the more components there are, and adding SIMD FMA support for perf makes it even worse, especially since there's no platform-agnostic SIMD FMA AFAIK.
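A rough sketch of the first alternative, i.e. a platform-agnostic `Vector128<float>` FMA helper (using the intrinsics I know of; treat the exact ARM argument order as something to double-check):

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.Arm;
using System.Runtime.Intrinsics.X86;

static class PlatformAgnosticFma
{
    public static Vector128<float> FusedMultiplyAdd(Vector128<float> a, Vector128<float> b, Vector128<float> c)
    {
        if (Fma.IsSupported)
            return Fma.MultiplyAdd(a, b, c);          // x86: (a * b) + c

        if (AdvSimd.IsSupported)
            return AdvSimd.FusedMultiplyAdd(c, a, b); // ARM: addend comes first

        // Software fallback that keeps the single-rounding semantics per component.
        return Vector128.Create(
            MathF.FusedMultiplyAdd(a.GetElement(0), b.GetElement(0), c.GetElement(0)),
            MathF.FusedMultiplyAdd(a.GetElement(1), b.GetElement(1), c.GetElement(1)),
            MathF.FusedMultiplyAdd(a.GetElement(2), b.GetElement(2), c.GetElement(2)),
            MathF.FusedMultiplyAdd(a.GetElement(3), b.GetElement(3), c.GetElement(3)));
    }
}
```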
### Risks
`Estimate` behaviour in the face of differing hardware support for FMA could be surprising, but that's mostly a documentation exercise, and the `Estimate` suffix already points out that it's not exactly `FusedMultiplyAdd`.