dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
15.29k stars 4.74k forks source link

Intel HW Intrinsics classes should use inheritance #25953

Closed CarolEidt closed 4 years ago

CarolEidt commented 6 years ago

A good portion of X86 ISAs all follow a hierarchy of support. If SSE2 is supported, then implicitly SSE is supported. Similarly, if AVX2 is supported, then AVX is supported as well.

We should design our HW Intrinsics classes in a similar fashion. This allows for simplicity in coding calling functions from 2 (or more) instruction sets:

if (SSE3.IsSupported)
{
    var v1 = SSE3.SetAllVector128(0.5f);   // comes from SSE
    var v2 = SSE3.HorizontalAdd(v1, v1);   // comes from SSE3
}

This means we should have our classes inherit from each other to model this inheritance in the ISAs:

public class Sse
{ ...
}
public class Sse2 : Sse
{ ...
}
public class Sse3 : Sse2
{ ...
}

The IsSupported hierarchy checks are doc'd in https://software.intel.com/en-us/articles/intel-sdm, for example:

12.4.4 Programming SSE3 with SSE/SSE2 Extensions SIMD instructions in SSE3 extensions are intended to complement the use of SSE/SSE2 in programming SIMD applications. Application software that intends to use SSE3 instructions should also check for the availability of SSE/SSE2 instructions.

The hierarchy is also checked in the .NET runtime: https://github.com/dotnet/coreclr/blob/6bf04a47badd74646e21e70f4e9267c71b7bfd08/src/vm/codeman.cpp#L1307.

There is a bit of an open question: What should AVX inherit from? Should it inherit from SSE/SSE2/SSE4.1/SSE4.2? The Intel docs don't seem to indicate any SSE is necessary to support AVX, but the .NET runtime will only support AVX if SSE2 is present.

NOTE Taking this proposal would mean we will need to drop static class from all our classes, since static class cannot form an inheritance hierarchy.

For the original problem of SetZeroVector128, we have the following options:

  1. Explode the method out, but since there is no argument, we would have to append the type names to each method: SetZeroVector128Float SetZeroVector128Double, SetZeroVector128Int32, etc.
  2. Leave the API as it is, but add support Sse2.SetZeroVector128<float>(), which would match the behavior of Sse.SetZeroVector128().
  3. Since it is a helper, and not a real intrinsic, move the API to another class. Other helper methods could follow suit: like SetAllVector128, SetVector128, etc.
    1. Potentially Vector128<T>.Zero;, new Vector128<T>();, or default.
  4. Exposing a generic SetZeroVector128 in Sse class, and generate different code for underlying hardware support
    1. SSE: emitxorps for all base type that provides correct semantics but maybe with a bit data-bypass penalty on integer base types.
    2. SSE2: emit xorpd, xorps, or pxor for different base types.

Original proposal:

This was discussed during the API design (https://github.com/dotnet/corefx/pull/23489#issuecomment-326433485) but has come under discussion again in the context of https://github.com/dotnet/coreclr/pull/17691.

A related issue is, if the APIs remain separate, how to deal with an API, e.g. SetZeroVector128 that provides support for float vectors in SSE, but supports the full range of generic types in SSE2.

CarolEidt commented 6 years ago

cc @eerhardt @fiigii @mikedn @4creators @tannergooding

tannergooding commented 6 years ago

Copying my response from the mentioned thread...

The original reasons for splitting this, as I recall, were:

There were several discussions around how to make these APIs the "most usable"/"friendly". In the end, it was originally decided that some things (like software fallbacks) could be provided separately. I believe a shared wrapper API could be provided in the same way (or even as part of the same software fallback library).

I do think that merging the APIs may make it "easier to use", but so would merging "AVX" and "AVX2". However, we can't merge "AVX" and "AVX2" because there are still plenty of CPUs and operating systems that only support one of them. Additionally, users can trivially "merge" these APIs themselves with a using static Sse; using static Sse2;

fiigii commented 6 years ago

I prefer to merge SSE and SSE2 into one class.

From the hardware product perspective, all the 64-bit x86 processors support SSE2 as the SIMD ISA bottom-line. I do not think any existing MSIL platform will consistently support the hardware released before 2001 (right?)

From the user experience perspective, merging SSE and SSE2 makes a lot of 128-bit APIs generic and more convenient to use. using static Sse; using static Sse2; just solve parts of the problem.

One of my concern of merging SSE and SSE2 is that finding a good name of the class may be hard, Sse, Sse2, or SseSse2?

fiigii commented 6 years ago

SetZeroVector128 that provides support for float vectors in SSE, but supports the full range of generic types in SSE2.

Actually, I do not think this is a problem. The codegen of helper intrinsic does not need to be constrained to ISAs. Sse2.SetZeroVector128<float> can generate any code to provide a "zero vector". Similarly, Sse2.SetAllVector128 also can be generic to support the full range of generic types in SSE2.

Merging SSE and SSE2 can avoid this kind of "duplicated" APIs, such as Sse2.SetZeroVector128<T> and Sse.SetZeroVector128.

tannergooding commented 6 years ago

It may be worth noting that other languages/runtimes that are also bringing up HWIntrinsic support are also keeping the ISAs separated.

For example:

4creators commented 6 years ago

other languages/runtimes that are also bringing up HWIntrinsic support are also keeping the ISAs separated

It was main argument behind design of HW APIs. We have done split based on hardware support detection to give fine grained control over implementation.

all the 64-bit x86 processors support SSE2 as the SIMD ISA bottom-line

I do not think it is true as after some additional data checks I have verified that latest Celeron N, Pentium N and Atom C series processors like:

https://ark.intel.com/products/95592/Intel-Pentium-Processor-N4200-2M-Cache-up-to-2_5-GHz or https://ark.intel.com/products/97935/Intel-Atom-Processor-C3308-4M-Cache-up-to-2_10-GHz

do not support any type of SIMD instructions except for AES NI.

fiigii commented 6 years ago

I do not think it is true as after some additional data checks I have verified that latest Celeron N, Pentium N and Atom C series processors like: do not support any type of SIMD instructions except for AES NI.

No, I am pretty sure that Goldmont architecture has SSE -SSE4.2 instructions.

4creators commented 6 years ago

No, I am pretty sure that Goldmont architecture has SSE -SSE4.2 instructions.

Can you point us into any docs confirming SIMD support in Atom.

Another problem is with Celeron N and Pentium N which are also known for lack of SIMD support.

fiigii commented 6 years ago

Please look for the docs of micro-architecture instead of processor product. For example, the newest ATOM/Celeron/Pentium should base on the architecture code-named Goldmont or Goldmont Plus.

gfoidl commented 6 years ago

One of my concern of merging SSE and SSE2 is that finding a good name of the class may be hard, Sse, Sse2, or SseSse2?

SseX?

For usage it is way more convenient to have these merged to one class.

ghost commented 6 years ago

SseX?

No thanks 😀

To keep the both worlds happy, can we have class SseX in addition to those for separate ISAs?

fiigii commented 6 years ago

Made a PR https://github.com/dotnet/corefx/pull/29447 to show the effect of merging SSE and SSE2 classes, which may help the discussion here.

eerhardt commented 6 years ago

@CarolEidt @tannergooding @terrajobst - I've updated the original comment to follow the API proposal process, and to capture the proposal to use inheritance in the X86 classes.

All - Please take a look at the proposal and provide feedback.

tannergooding commented 6 years ago

I like the proposal and think it works well. It allows easy access to the various methods from a given class while still keeping the IsSupported checks logical/etc. It also becomes easy to discover the relationship via source control (i.e. Is using Sse okay if Sse3.IsSupported returns true).

Props to @terrajobst, who I believe was the original proposer.

tannergooding commented 6 years ago

For the SetZeroVector128 issue. I am in favor of 1 (rename and explode the APIs) or 3 (move them to a shared class).

3 may be slightly better as it is also applicable to ARM (cc. @sdmaclea) and may simplify some things (such as StaticCast implementation, where the plan is to just fold it away in the importer).

fiigii commented 6 years ago

For the SetZeroVector128 issue, I prefer 2, but there is 4th solution.

Note: we (and C++ compilers) already use the similar approach for optimizing some other helpers (e.g., Avx.SetAllVector256).

mikedn commented 6 years ago

For SetZero there's also a variant of the 3rd option - just rely on new Vector128<T>() and default to do the right thing. Requiring a method call to produce a 0 value is counterintuitive in provides no advantages.

eerhardt commented 6 years ago

Thanks @fiigii and @mikedn, I've added those options to the list.

tannergooding commented 6 years ago

Closing this as resolved, since we are now using inheritance to expose the intrinsics.

Any other issues tracked here should be moved to their own new issue.