dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

[API Proposal]: Support for hardware intrinsics to control treatment of subnormal numbers/denormals #88525

Closed BasTossings closed 1 year ago

BasTossings commented 1 year ago

Background and motivation

The following macros supplied by Intel can be used to set the FTZ and DAZ flags of the MXCSR register on x86/x64:

_MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON)
_MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON)

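(See this document for more details: Set the FTZ and DAZ Flags, https://www.intel.com/content/www/us/en/docs/cpp-compiler/developer-guide-reference/2021-8/set-the-ftz-and-daz-flags.html)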

ARM has the FZ bit in the FPSCR register (FPCR on AArch64), which has a similar function.

These flags are thread-local from what I can gather (at least on x86/x64).

As far as I can tell, these hardware intrinsics are not yet exposed by the .NET runtime.

Setting these flags may greatly improve floating-point performance in some specific cases where performance is preferable to accuracy: Subnormal number (https://en.wikipedia.org/wiki/Subnormal_number)

It would be nice if we could have access to these flags from within .NET.

API Usage

System.Runtime.Intrinsics.X86.Sse.SetThreadFTZ ( true );
System.Runtime.Intrinsics.X86.Sse.SetThreadDAZ ( true );

System.Runtime.Intrinsics.Arm.Foo.SetThreadFZ ( true );

Alternative Designs

No response

Risks

No response

ghost commented 1 year ago

Tagging subscribers to this area: @dotnet/area-system-runtime-intrinsics See info in area-owners.md if you want to be subscribed.

Issue Details
Author: BasTossings
Assignees: -
Labels: `api-suggestion`, `area-System.Runtime.Intrinsics`, `untriaged`
Milestone: -

tannergooding commented 1 year ago

Such APIs are incredibly dangerous if used incorrectly and risk causing issues for the JIT, GC, or other code in the application.

I can't see us adding this functionality as defined because of this. The complexity around callbacks, exceptions, and other considerations would require us to track and insert "transitions" to ensure that we're in a well-defined state for other methods. Such transitions make these APIs quite a bit more expensive to call (and they already were not "cheap"), so they often won't pay for themselves once the bit needs to be toggled everywhere.

There are potentially alternatives we could offer, such as exposing some static helper APIs that efficiently normalize floats/doubles so that subnormal values become 0.

Most of the cost for these comes from the IEEE 754 exception handling, which means that hardware would fault on use. .NET doesn't support that exception handling (disabling it on startup) and so we do not throw for such cases which significantly reduces the cost. This penalty is also significantly reduced on newer hardware (namely Sandy Bridge, circa 2011, and later) and is effectively non-existent for add, subtract, multiply, divide, and convert.

The main place they do end up being beneficial is when you have a lot of code that frequently does generate denormal values, such as doing low-lighting computations for 3D graphics. In that scenario you can, particularly on older hardware, see measurable performance wins. However, there are also many alternatives to handling that same scenario in a manner that avoids the perf penalty.

For the cases I called out above (add, subtract, multiply, divide, and convert), you'll see the same performance on essentially any hardware that's been released in the last 12 years. For other cases, like Sqrt, you'll see about a 15% perf penalty, but only for the cases that are actually handling a denormal value. Not only will this often be a subset of the overall algorithm (and so it will typically be a much smaller regression), but there are many ways it can be mitigated, such as by integrating manual flushing of the subnormal values into the algorithm (either at the key points where they can be produced or before they get used with an instruction where there's a penalty).

ghost commented 1 year ago

This issue has been marked needs-author-action and may be missing some important information.

BasTossings commented 1 year ago

Thank you for your in-depth response. Right now I handle these cases by using double.IsSubnormal() to conditionally set values to 0, which adds branching and is not ideal from a performance standpoint. My specific use case is an IIR convolution filter where the response tends to decay slowly toward 0. I could omit the check altogether, but I expect the code to be used on a wide range of hardware (old and new), and predictable performance is a big plus. Hence my question.
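
For readers following along, the workaround described in this comment can be sketched roughly as below. This is only an illustration, not the poster's actual code: the `OnePoleFilter` class, the `Process` signature, and the one-pole recurrence are assumptions chosen to show a decaying IIR state being checked with `double.IsSubnormal`.

```csharp
using System;

// Hypothetical example type; not taken from the issue.
public static class OnePoleFilter
{
    // Computes y[n] = x[n] + a * y[n-1]. With |a| < 1 and silent input,
    // the state decays toward zero and eventually enters the subnormal range.
    public static void Process(ReadOnlySpan<double> input, Span<double> output, double a, ref double state)
    {
        for (int i = 0; i < input.Length; i++)
        {
            state = input[i] + a * state;

            // Branchy flush: replace a subnormal state with zero so that
            // later iterations never operate on subnormal values.
            if (double.IsSubnormal(state))
            {
                state = 0.0;
            }

            output[i] = state;
        }
    }
}
```

The branch is taken only once per decay tail, so it should be highly predictable in this kind of loop.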

I understand it has some potentially risky side effects that the user of such an API needs to be thoroughly aware of. But could that not be mitigated with clear documentation? I would say that with all hardware intrinsics it is the responsibility of the caller to use them correctly.

tannergooding commented 1 year ago

> I would say that with all hardware intrinsics it is the responsibility of the caller to use them correctly.

It's a lot different from most intrinsics, which are simply doing basic operations on SIMD values (not impacting any global state, etc.) or which are doing explicit memory accesses and where the risk of "doing the wrong thing" is fairly minimal/restricted.

> But could that not be mitigated with clear documentation

This is a scenario where invalid use can trivially corrupt the whole program in unexpected ways, and where toggling the state on/off would require correctly handling the cases where the JIT/GC needs to run on that thread, callbacks, continuations, etc.

Not correctly transitioning back could easily cause WPF or any of the float or System.Math APIs to start doing the wrong thing in their own code, for example.

> Which adds branching and is not ideal from a performance standpoint

Branching itself, particularly when it's predictable, is often not a problem. There also exists a range of hardware intrinsic APIs that would allow someone to handle these in a branch-free manner. This could be done in SIMD via something like:

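static float FlushSubnormalToZero(float x) // wrapper name is illustrative; needs System.Runtime.Intrinsics
{
    Vector128<int> tmp = Vector128.CreateScalarUnsafe(x).AsInt32();
    Vector128<int> abs = tmp & Vector128.CreateScalarUnsafe(int.MaxValue); // clear the sign bit
    // All-ones when the exponent field is non-zero (normal, infinity, or NaN); zero otherwise
    Vector128<int> msk = Vector128.GreaterThan(abs, Vector128.CreateScalarUnsafe(0x007FFFFF));
    return (tmp & msk).AsSingle().ToScalar();
}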

Which generates on x64:

vpand xmm1, xmm0, [CNS1]
vpcmpgtd xmm1, xmm1, [CNS2]
vpand xmm0, xmm0, xmm1

So it handles the subnormal values (flushing them to zero) at a base 3-cycle penalty, while preserving normals, infinities, and NaNs. In a loop, the two constants will be hoisted; otherwise they'll frequently be in the cache somewhere.

There are probably other, more clever ways to handle this as well, particularly on even newer hardware (say Avx512F capable, possibly using vfpclass and vfixupimm), but it's a pretty good balance.
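
As a rough illustration of how the same mask-and-compare idea might be applied to a whole buffer rather than a single scalar, here is a sketch that processes four floats per iteration. The `SubnormalFlush` class and `FlushInPlace` method are hypothetical names, and the code assumes .NET 7 or later (for the `Vector128` operators and `Vector128.GreaterThan`).

```csharp
using System;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

// Hypothetical helper; not an existing .NET API.
public static class SubnormalFlush
{
    public static void FlushInPlace(Span<float> values)
    {
        Vector128<int> absMask = Vector128.Create(int.MaxValue);    // clears the sign bit
        Vector128<int> maxSubnormal = Vector128.Create(0x007FFFFF); // largest subnormal bit pattern

        // Reinterpret the buffer as blocks of four floats.
        Span<Vector128<float>> blocks = MemoryMarshal.Cast<float, Vector128<float>>(values);
        for (int i = 0; i < blocks.Length; i++)
        {
            Vector128<int> bits = blocks[i].AsInt32();
            // Lanes holding normals, infinities, or NaNs get an all-ones mask;
            // zeros and subnormals get an all-zero mask.
            Vector128<int> keep = Vector128.GreaterThan(bits & absMask, maxSubnormal);
            blocks[i] = (bits & keep).AsSingle();
        }

        // Scalar tail for the remaining zero to three elements.
        for (int i = blocks.Length * Vector128<float>.Count; i < values.Length; i++)
        {
            int bits = BitConverter.SingleToInt32Bits(values[i]);
            int keep = (bits & int.MaxValue) > 0x007FFFFF ? -1 : 0;
            values[i] = BitConverter.Int32BitsToSingle(bits & keep);
        }
    }
}
```

Like the scalar snippet above, this maps both subnormals and negative zero to positive zero and leaves every other value untouched.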

tannergooding commented 1 year ago

Going to close this as not-actionable in its current state.

As stated above, I would be open to a proposal which requested we expose something more like (and equivalent APIs for other IFloatingPointIeee754<T> like types and Vectors of those types):

public partial struct Double
{
    public static double FlushSubnormalToZero(double value);
}

The name could probably use some work, but as a general premise this would cover the functionality in whatever way was deemed "most efficient" for the platform. A compiler could utilize embedded rounding control (such as exists on AVX512), it could flush using the branch free logic I gave above, it could opt to selectively do floating-point control word transitions, etc.
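
To make the shape of such a helper concrete, a purely software version for `double` might look something like the sketch below. It is an assumption-laden illustration, not how the runtime would implement it: the `SubnormalHelpers` class is a hypothetical stand-in for the proposed `Double.FlushSubnormalToZero`, and it uses only scalar bit manipulation.

```csharp
using System;

// Hypothetical stand-in for the proposed API; not part of .NET.
public static class SubnormalHelpers
{
    public static double FlushSubnormalToZero(double value)
    {
        long bits = BitConverter.DoubleToInt64Bits(value);
        long abs = bits & long.MaxValue; // clear the sign bit

        // The subtraction is negative exactly when the exponent field is
        // non-zero (normal, infinity, or NaN). The arithmetic shift turns
        // that sign into an all-ones mask; zeros and subnormals give 0.
        long keep = (0x000F_FFFF_FFFF_FFFFL - abs) >> 63;

        return BitConverter.Int64BitsToDouble(bits & keep);
    }
}
```

Usage would simply be `x = SubnormalHelpers.FlushSubnormalToZero(x);` at the points in an algorithm where subnormals can be produced.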