Closed: fiigii closed this issue 4 years ago
cc: @russellhadley @mellinoe @CarolEidt @terrajobst
Overall I love this proposal. I do have a few questions/comments:
Each vector type exposes an IsSupported method to check if the current hardware supports
I think this can be a property, as it is in Vector<T>.
Does this take the type of T into account? For example, will IsSupported
return true for Vector128<float>
but false for Vector128<CustomStruct>
(or is it expected to throw in this case)?
What about formats that may be supported on some processors, but not others? As an example, let's say there is instruction set X which only supports Vector128<float>
and later comes instruction set Y which supports Vector128<double>
. If the CPU currently only supports X would it return true for Vector128<float>
and false for Vector128<double>
with Vector128<double>
only returning true when instruction set Y is supported?
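One way to make the question concrete (a hypothetical sketch, assuming the per-instruction-set classes from the proposal) is to tie the support check to the instruction set rather than to the vector type, so each element type maps to the ISA that introduced it:

```csharp
// Hedged sketch: if instruction set X (SSE here) only covers Vector128<float>
// and a later set Y (SSE2 here) adds Vector128<double>, checking per-ISA
// classes avoids asking what IsSupported means for an arbitrary T.
if (SSE.IsSupported)
{
    // Vector128<float> operations are available
}
if (SSE2.IsSupported)
{
    // Vector128<double> operations are available
}
```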
In addition, this namespace would contain conversion functions between the existing SIMD type (Vector
) and new Vector128 and Vector256 types.
My concern here is the target layering for each component. I would hope that System.Runtime.CompilerServices.Intrinsics
are part of the lowest layer, and therefore consumable by all other APIs in CoreFX. While Vector<T>
, on the other hand, is part of one of the higher layers and is therefore not consumable.
Would it be better here to either have the conversion operators on Vector<T>
or to expect the user to perform an explicit load/store (as they will likely be expected to do with other custom types)?
SSE2.cs (the bottom-line of intrinsic support that contains all the intrinsics of SSE and SSE2)
I understand that with SSE and SSE2 being required in RyuJIT this makes sense, but I would almost prefer an explicit SSE class to have a consistent separation. I would essentially expect a 1-1 mapping of class to CPUID flag.
Other.cs (includes LZCNT, POPCNT, BMI1, BMI2, PCLMULQDQ, and AES)
For this specifically, how would you expect the user to check which instruction subsets are supported? AES and POPCNT are separate CPUID flags and not every x86 compatible CPU may always provide both.
Some intrinsics benefit from C# generics and get simpler APIs
I didn't see any examples of scalar floating-point APIs (_mm_rsqrt_ss
). How would these fit in with the Vector based APIs (naming wise, etc)?
Looks good and in line with the suggestions I have made. The only thing that probably does not resonate with me (maybe because we deal with pointers on a regular basis in our codebase) is having to use Load(type*)
instead of also supporting the ability to call the function with a void*
, as the semantics of the operation are very clear. Probably it is me, but with the exception of special operations like a non-temporal store (where you would need to use a Store/Load operation explicitly), not having support for arbitrary pointer types would only add bloat to the algorithm without any actual improvement in readability/understandability.
Therefore, CoreCLR also requires the immediate argument guard from C# compiler.
Going to tag @jaredpar here explicitly. We should get a formal proposal up.
I think that we can do this without language support (@jaredpar, tell me if I'm crazy here) if the compiler can recognize something like System.Runtime.CompilerServices.IsLiteralAttribute
and emits it as modreq isliteral
.
Having a new recognized keyword (const) here is likely more complicated, as it requires formal spec'ing in the language, etc.
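A minimal sketch of that attribute-based approach (all names here are assumptions, since no such attribute exists yet) might look like:

```csharp
// Hypothetical attribute the compiler could recognize and emit as a modreq,
// rejecting any argument that is not a compile-time literal.
[AttributeUsage(AttributeTargets.Parameter)]
public sealed class IsLiteralAttribute : Attribute { }

public static class AVX
{
    // 'mode' would have to be a literal at every call site.
    public static Vector256<float> CompareVector256(
        Vector256<float> left,
        Vector256<float> right,
        [IsLiteral] FloatComparisonMode mode)
        => throw new PlatformNotSupportedException();
}
```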
Thanks for posting this @fiigii. I'm very eager to hear everyone's thoughts on the design.
IMM Operands
One thing that came up in a recent discussion is that some immediate operands have stricter constraints than just "must be constant". The examples given use a FloatComparisonMode
enum, and functions accepting it apply a const
modifier to the parameter. But there is no way to prevent someone from passing a non-enum value, still a constant, to a method accepting that parameter.
AVX.CompareVector256(left, right, (FloatComparisonMode)255);
EDIT: This warning is emitted in a VC++ project if you use the above code.
Now, this may not be a problem for this particular example (I'm not familiar with its exact semantics), but it's something to keep in mind. There were also other, more esoteric examples given, like an immediate operand which must be a power of two, or which satisfies some other obscure relation to the other operands. These constraints will be much more difficult, most likely impossible, to enforce at the C# level. The "const" enforcement feels more reasonable and achievable, and seems to cover most instances of the problem.
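One mitigation that has been used elsewhere for this class of problem (an assumption here, not part of the proposal) is a helper that re-establishes a literal per enum value via a switch, so even a runtime value reaches the intrinsic as a constant:

```csharp
// Hedged sketch: each case passes a literal to the intrinsic, so the
// "must be constant" constraint is satisfied even for a runtime 'mode'.
// The enum member names are illustrative assumptions.
static Vector256<float> Compare(Vector256<float> left, Vector256<float> right, FloatComparisonMode mode)
{
    switch (mode)
    {
        case FloatComparisonMode.EqualOrdered:
            return AVX.CompareVector256(left, right, FloatComparisonMode.EqualOrdered);
        case FloatComparisonMode.LessThanOrdered:
            return AVX.CompareVector256(left, right, FloatComparisonMode.LessThanOrdered);
        // ... one case per valid immediate value
        default:
            throw new ArgumentOutOfRangeException(nameof(mode));
    }
}
```

This also rejects out-of-range values like (FloatComparisonMode)255 at run time, at the cost of a jump table.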
SSE2.cs (the bottom-line of intrinsic support that contains all the intrinsics of SSE and SSE2)
I'll echo what @tannergooding said -- I think it will be simpler to just have a distinct class for each instruction set. I'd like for it to be very obvious how and where new things should be added. If there's a "grab bag" sort of type, then it becomes a bit murkier and we have to make lots of unnecessary judgement calls.
💭 Most of my initial thoughts go to the use of pointers in a few places. Knowing what we know about ref structs and Span<T>, what parts of the proposal can leverage new functionality to avoid unsafe code without compromising performance?
❓ In the following code, would the generic method actually be expanded to each of the forms allowed by the processor, or would it be defined in code as a generic?
// __m128i _mm_add_epi8 (__m128i a, __m128i b)
// __m128i _mm_add_epi16 (__m128i a, __m128i b)
// __m128i _mm_add_epi32 (__m128i a, __m128i b)
// __m128i _mm_add_epi64 (__m128i a, __m128i b)
// __m128 _mm_add_ps (__m128 a, __m128 b)
// __m128d _mm_add_pd (__m128d a, __m128d b)
[Intrinsic]
public static Vector128<T> Add<T>(Vector128<T> left, Vector128<T> right) where T : struct { throw new NotImplementedException(); }
❓ If the processor doesn't support something, do we fall back to simulated behavior or do we throw exceptions? If we choose the former, would it make sense to rename IsSupported to IsHardwareAccelerated?
Knowing what we know about ref structs and Span
, what parts of the proposal can leverage new functionality to avoid unsafe code without compromising performance.
Personally, I am fine with the unsafe code. I don't believe this is meant to be a feature that app designers use and is instead meant to be something framework designers use to squeeze extra performance and also to simplify overhead on the JIT.
People using intrinsics are likely already doing a bunch of unsafe things already and this just makes it more explicit.
If the processor doesn't support something, do we fall back to simulated behavior or do we throw exceptions?
The official design doc (https://github.com/dotnet/designs/blob/master/accepted/platform-intrinsics.md) indicates that it is up in the air whether software fallbacks are allowed.
I am of the opinion that all of these methods should be declared as extern
and should never have software fallbacks. Users would be expected to implement a software fallback themselves or have a PlatformNotSupportedException
thrown by the JIT at runtime.
This will help ensure the consumer is aware of the underlying platforms they are targeting and that they are writing code that is "suited" for the underlying hardware (running vectorized algorithms on hardware without vectorization support can cause performance degradation).
If the processor doesn't support something, do we fall back to simulated behavior or do we throw exceptions?
The official design doc (https://github.com/dotnet/designs/blob/master/accepted/platform-intrinsics.md) indicates that it is up in the air whether software fallbacks are allowed.
These are the raw CPU platform intrinsics e.g. X86.SSE
so PNS is probably fine; and will help get them out quicker.
Assuming the detection is branch eliminated; it should be easy to build a library on top that then does software fallbacks, which can be iterated on (either coreclr/corefx or 3rd party)
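The layered-fallback idea described above could be sketched like this (a hedged example; SoftwareFallback is a hypothetical scalar helper, not part of the proposal):

```csharp
// Under a JIT, IsSupported is a constant, so the branch is eliminated and
// supported hardware pays nothing for the fallback's existence.
public static Vector128<int> AddInt32(Vector128<int> left, Vector128<int> right)
{
    if (SSE2.IsSupported)
        return SSE2.Add(left, right);

    return SoftwareFallback(left, right); // hypothetical element-wise scalar add
}
```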
Personally, I am fine with the unsafe code.
I am not against unsafe code. However, given the choice between safe code and unsafe code that perform the same, I would choose the former.
I am of the opinion that all of these methods should be declared as extern and should never have software fallbacks.
The biggest advantage of this is the runtime can avoid shipping software fallback code that never needs to execute.
The biggest disadvantage of this is test environments for the various possibilities are not easy to come by. Fallbacks provide a functionality safety net in case something gets missed.
The biggest disadvantage of this is test environments for the various possibilities are not easy to come by.
@sharwell, what possibilities are you envisioning?
The way these are currently structured/proposed, the user would code:
public static double Cos(double x)
{
if (x86.FMA3.IsSupported)
{
// Do FMA3
}
else if (x86.SSE2.IsSupported)
{
// Do SSE2
}
else if (Arm.Neon.IsSupported)
{
// Do ARM
}
else
{
// Do software fallback
}
}
Under this, the only way a user is faulted is if they write a bad algorithm or if they forget to provide any kind of software fallback (and an analyzer to detect this should be fairly trivial).
running vectorized algorithms on hardware without vectorization support can cause performance degradation.
I would rephrase @tannergooding's thought as: "running vectorized algorithms on hardware without vectorization support will with utmost certainty cause performance degradation."
For this specifically, how would you expect the user to check which instruction subsets are supported? AES and POPCNT are separate CPUID flags and not every x86 compatible CPU may always provide both.
@tannergooding We defined an individual class for each instruction set (except SSE and SSE2) but put certain small classes into the Other.cs
file. I will update the proposal to clarify.
// Other.cs
namespace System.Runtime.CompilerServices.Intrinsics.X86
{
public static class LZCNT
{
......
}
public static class POPCNT
{
......
}
public static class BMI1
{
.....
}
public static class BMI2
{
......
}
public static class PCLMULQDQ
{
......
}
public static class AES
{
......
}
}
For AOT compilation, however, the compiler generates CPUID checking code that would return different values each time it is called (on different hardware).
I don't think this needs to be true all the time. In some cases, the AOT can drop the check altogether, depending on the target operating system (Win8 and above require SSE and SSE2 support, for example).
In other cases, the AOT can/should drop the check from each method and should instead aggregate them into a single check at the highest entry point.
Ideally, the AOT compiler would run CPUID once during startup and cache the results as globals (honestly, if the AOT didn't do this, I would log a bug). The IsSupported check then becomes essentially a lookup of the cached value (just like a property normally behaves). This behavior is what CRT implementations do to ensure that things like cos(double) remain performant and can still run FMA3 code where supported.
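The caching scheme described above could look roughly like this (an assumed sketch, not anything the proposal specifies):

```csharp
// CPUID runs once, at type initialization; afterwards every IsSupported-style
// query is just a read of a cached static readonly field.
internal static class CpuFeatures
{
    internal static readonly bool HasFma3 = DetectFma3();

    private static bool DetectFma3()
    {
        // Placeholder for the actual CPUID query, performed once at startup.
        return false;
    }
}
```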
For AOT compilation, however, the compiler generates CPUID checking code that would return different values each time it is called (on different hardware).
The implication from a usage perspective would be:
For the JIT we could be quite granular with the checks, as they are no-cost branch-eliminated.
For AOT we'd need to be quite coarse with the checks and perform them at the algorithm or library level to offset the cost of CPUID, which may push it much higher than intended; e.g. you wouldn't use a vectorized IndexOf unless your strings were huge, because CPUID would dominate.
Probably could still cache on AOT at startup, so it would set the property; it wouldn't branch-eliminate, but would be fairly low cost?
I understand that with SSE and SSE2 being required in RyuJIT this makes sense, but I would almost prefer an explicit SSE class to have a consistent separation. I would essentially expect a 1-1 mapping of class to CPUID flag.
I think it will be simpler to just have a distinct class for each instruction set. I'd like for it to be very obvious how and where new things should be added. If there's a "grab bag" sort of type, then it becomes a bit murkier and we have to make lots of unnecessary judgement calls.
@tannergooding @mellinoe The current design intent of class SSE2 is to make more intrinsic functions friendly to users. If we had two classes SSE and SSE2, certain intrinsics would lose the generic signature. For example, SIMD addition only supports float in SSE, and SSE2 complements the other types.
public static class SSE
{
// __m128 _mm_add_ps (__m128 a, __m128 b)
public static Vector128<float> Add(Vector128<float> left, Vector128<float> right);
}
public static class SSE2
{
// __m128i _mm_add_epi8 (__m128i a, __m128i b)
public static Vector128<byte> Add(Vector128<byte> left, Vector128<byte> right);
public static Vector128<sbyte> Add(Vector128<sbyte> left, Vector128<sbyte> right);
// __m128i _mm_add_epi16 (__m128i a, __m128i b)
public static Vector128<short> Add(Vector128<short> left, Vector128<short> right);
public static Vector128<ushort> Add(Vector128<ushort> left, Vector128<ushort> right);
// __m128i _mm_add_epi32 (__m128i a, __m128i b)
public static Vector128<int> Add(Vector128<int> left, Vector128<int> right);
public static Vector128<uint> Add(Vector128<uint> left, Vector128<uint> right);
// __m128i _mm_add_epi64 (__m128i a, __m128i b)
public static Vector128<long> Add(Vector128<long> left, Vector128<long> right);
public static Vector128<ulong> Add(Vector128<ulong> left, Vector128<ulong> right);
// __m128d _mm_add_pd (__m128d a, __m128d b)
public static Vector128<double> Add(Vector128<double> left, Vector128<double> right);
}
Compared to SSE2.Add<T>, the above design looks complex, and users have to remember SSE.Add(float, float) and SSE2.Add(int, int). Additionally, SSE2 is the bottom line of RyuJIT code generation for x86/x86-64, so separating SSE from SSE2 has no advantage in functionality or convenience.
Although the current design (class SSE2 including SSE and SSE2 intrinsics) hurts API consistency, there is a trade-off between design consistency and user experience, which is worth discussing.
Rather than X86, maybe x86x64, as x86 is often used to denote 32-bit only?
Very excited we are finally seeing a proposal for this. My initial thoughts below.
AVX-512 is missing, probably since it is not that widespread yet, but I think it would be good to at least give this some thought and consider how to structure these, because the AVX-512 feature set is very fragmented. In this case I would assume we need to have a class for each set, i.e. (see https://en.wikipedia.org/wiki/AVX-512):
public static class AVX512F {} // Foundation
public static class AVX512CD {} // Conflict Detection
public static class AVX512ER {} // Exponential and Reciprocal
public static class AVX512PF {} // Prefetch Instructions
public static class AVX512BW {} // Byte and Word
public static class AVX512DQ {} // Doubleword and Quadword
public static class AVX512VL {} // Vector Length
public static class AVX512IFMA {} // Integer Fused Multiply Add (Future)
public static class AVX512VBMI {} // Vector Byte Manipulation Instructions (Future)
public static class AVX5124VNNIW {} // Vector Neural Network Instructions Word variable precision (Future)
public static class AVX5124FMAPS {} // Fused Multiply Accumulation Packed Single precision (Future)
and add a struct Vector512<T> type, of course. Note that the latter two, AVX5124VNNIW and AVX5124FMAPS, are hard to read due to the number 4.
Some of these can have a huge impact for deep learning, sorting etc.
Regarding Load I have some concerns as well. Like @redknightlois, I think void* should be considered too, but more importantly also load from/store to ref. Given this, perhaps these should be relocated to the "generic"/platform-agnostic namespace and type, since presumably all platforms should support load/store for a supported vector size. So something like the following (not sure where we could put this, how naming should be done, or if it can be moved to a platform-agnostic type):
[Intrinsic]
public static unsafe Vector256<sbyte> Load(sbyte* mem) { throw new NotImplementedException(); }
[Intrinsic]
public static unsafe Vector256<sbyte> LoadSByte(void* mem) { throw new NotImplementedException(); }
[Intrinsic]
public static unsafe Vector256<sbyte> Load(ref sbyte mem) { throw new NotImplementedException(); }
[Intrinsic]
public static unsafe Vector256<byte> Load(byte* mem) { throw new NotImplementedException(); }
[Intrinsic]
public static unsafe Vector256<byte> LoadByte(void* mem) { throw new NotImplementedException(); }
[Intrinsic]
public static unsafe Vector256<byte> Load(ref byte mem) { throw new NotImplementedException(); }
// Etc.
The most important thing here is whether ref can be supported, as it would be essential for supporting generic algorithms. Naming should be revised, no doubt, but I'm just trying to make a point. If we want to support load from void*, the method name needs to include the return type, or the method needs to be on a type-specific static class.
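If a ref-based Load existed (hypothetical, per the sketch above), it would compose naturally with Span<T> without pinning:

```csharp
// Hedged usage sketch: MemoryMarshal.GetReference (System.Runtime.InteropServices)
// yields a 'ref byte' to the span's first element, so no 'fixed' statement or
// raw pointer is needed; 'Load(ref byte)' is the hypothetical API from above.
static Vector256<byte> FirstVector(ReadOnlySpan<byte> data)
{
    return Load(ref MemoryMarshal.GetReference(data));
}
```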
It's great we are discussing a concrete proposal right now. 😄
The above linked const keyword usage language proposal was created explicitly to provide support for some of the SIMD instructions requiring immediate parameters. I think it will be straightforward to implement, but since it may delay the introduction of intrinsics, there were strong arguments in favor of going with a simple attribute implementation first and later expanding C# syntax and API by including support for const method parameters.
IMO we have to discuss in parallel forward-looking designs which comprise two different areas:
1. System.Numerics API, which can be partially implemented with support of the x86 intrinsics discussed here
2. Intrinsics API, which should comprise other architectures as well, as this will have an impact on the final shape of the intrinsics API
I would propose to move intrinsics to a separate namespace located relatively high in the hierarchy, and each platform-specific part into a separate assembly:
System.Intrinsics - general top-level namespace for all intrinsics
System.Intrinsics.X86 - x86 ISA extensions, in a separate assembly
System.Intrinsics.Arm - ARM ISA extensions, in a separate assembly
System.Intrinsics.Power - Power ISA extensions, in a separate assembly
System.Intrinsics.RiscV - RISC-V ISA extensions, in a separate assembly
The reason for the above division is the large API area for every instruction set, i.e. AVX-512 will be represented by more than 2 000 intrinsics in the MSVC compiler. The same will soon be true for ARM SVE (see below). The size of the assembly due to string content alone won't be small.
Current implementations support limited set of register sizes:
However, ARM recently published:
ARM SVE - Scalable Vector Extensions
see: The Scalable Vector Extension (SVE), for ARMv8-A published on 31 March 2017 with status Non-Confidential Beta.
This specification is quite important as it introduces new register sizes - altogether there are 16 register sizes which are multiples of 128 bits. Details are on page 21 of the specification (table is below).
Maximum vector length: 2048 bits
Required vector lengths: 128, 256, 512, 1024 bits
Permitted vector lengths: 384, 640, 768, 896, 1152, 1280, 1408, 1536, 1664, 1792, 1920
It would be necessary to design an API which is capable of supporting, in the near future, 16 different register sizes and several thousands (or tens of thousands) of opcodes/functions (counting overloads). Predictions that we would not have 2048-bit SIMD instructions for a couple of years seem to have been falsified, to everyone's surprise, by ARM this year. Looking at history (ARM published the public beta of the ARMv8 ISA on 4 September 2013 and the first processor implementing it was available to users globally in October 2014 - the Samsung Galaxy Note 4), I would expect the first silicon with SVE extensions to be available in 2018. I suppose this would most probably be very close in time to the public availability of .NET SIMD intrinsics.
I would like to propose:
Implement basic Vectors supporting all register sizes in System.Private.CoreLib:
namespace System.Numerics
{
[StructLayout(LayoutKind.Explicit)]
public unsafe struct Register128
{
[FieldOffset(0)]
public fixed byte Reg[16];
.....
// accessors for other types
}
// ....
[StructLayout(LayoutKind.Explicit)]
public unsafe struct Register2048
{
[FieldOffset(0)]
public fixed byte Reg[256];
.....
// accessors for other types
}
public struct Vector<T, R> where T : struct where R : struct
{
}
public struct Vector128<T> : Vector<T, Register128>
{
}
// ....
public struct Vector2048<T> : Vector<T, Register2048>
{
}
}
All safe APIs would be exposed via Vector. All vector APIs will use System.Numerics.VectorXxx:
public static Vector128<byte> MultiplyHigh(Vector128<byte> value1, Vector128<byte> value2);
public static Vector128<byte> MultiplyLow(Vector128<byte> value1, Vector128<byte> value2);
Intrinsics APIs will be placed in separate classes according to the functionality detection patterns provided by processors. In the case of the x86 ISA this would be a one-to-one correspondence between CPUID detection and supported functions. This would allow for an easy-to-understand programming pattern where one would use functions from a given group in a way consistent with platform support.
The main reason for that kind of division is the requirement set by silicon manufacturers to use instructions only if they are detected in hardware. This allows, for example, shipping a processor with a support matrix comprising SSE3 but not SSSE3, or comprising PCLMULQDQ and SHA but not AESNI. This direct class-to-hardware-detection correspondence is the only safe way of having IsHardwareSupported detection and being compliant with Intel/AMD instruction usage restrictions. Otherwise the kernel will have to catch the #UD exception for us 😸
Intrinsics usually abstract ISA opcodes one-to-one; however, there are some intrinsics which map to several instructions. I would prefer to abstract opcodes (using nice names) and implement multi-opcode intrinsics as functions on VectorXxx.
@nietras
Given this, perhaps these should be relocated to the "generic"/platform-agnostic namespace and type, since assumably all platforms should support load/store for a supported vector size.
The best place would be System.Numerics.VectorXxx<T>.
all platforms should support load/store for a supported vector size
Is the platform agnostic Load/Store
any different from the existing Unsafe.Read/Write
?
Is the platform agnostic Load/Store any different from the existing Unsafe.Read/Write?
@jkotas I had the same thought, how do those tie in with Unsafe
? I assume these would be unaligned then, and we can only use aligned via LoadAligned/StoreAligned
...
Or could we add Unsafe.ReadAligned/WriteAligned
and have the JIT recognize these for the vector types?
IsSupported should be a property (or a static readonly field) like IntPtr.Size or BitConverter.IsLittleEndian.
Combining SSE and SSE2 into a single class looks like a good trade-off for a simpler Add function.
Like @redknightlois and @nietras I'm also concerned about the Load/Store API. ref support is needed to avoid fixed references. For void* Load/Store, generics could help:
[Intrinsic]
public static extern unsafe Vector256<T> Load<T>(void* mem) where T : struct;
[Intrinsic]
public static extern unsafe Vector256<sbyte> Load(sbyte* mem);
[Intrinsic]
public static extern Vector256<sbyte> Load(ref sbyte mem);
[Intrinsic]
public static extern unsafe Vector256<byte> Load(byte* mem);
[Intrinsic]
public static extern Vector256<byte> Load(ref byte mem);
// Etc.
Looking forward to using PDEP/PEXT!
I would propose to move intrinsics to separate namespace located relatively high in hierarchy and each platform specific code into separate assembly.
Reason for the above division is large API area for every instruction set i.e. AVX-512 will be represented by more than 2 000 intrinsics in MsVC compiler. This same will be true for ARM SVE very soon (see below). Size of the assembly due to string content only won't be small.
@4creators, I am vehemently against moving this feature higher in the hierarchy.
For starters, the runtime itself has to support any and all intrinsics (including the strings to identify them, etc) regardless of where we put them in the hierarchy. If the runtime doesn't support them, then you can't use them.
I also want to be able to consume these intrinsics from all layers of the stack, including System.Private.CoreLib
. I want to be able to write managed implementations of System.Math
, System.MathF
, various System.String
functions, etc. Not only does this increase maintainability of the code (since most of these are FCALLS or hand tuned assembly today) but it also increases cross-platform consistency (where the resulting FCALL or assembly is part of the underlying C runtime).
@pentp
Combining SSE and SSE2 into a single class looks like a good trade-off for a simpler Add function.
I do not think that intrinsics should abstract anything - instead, a simple add can be created on Vector128 through Vector2048. On the other hand, that would be openly against Intel usage recommendations.
I also want to be able to consume these intrinsics from all layers of the stack, including System.Private.CoreLib. I want to be able to write managed implementations of System.Math, System.MathF, various System.String functions, etc.
@tannergooding Agreed that it has to be available from System.Private.CoreLib. However, that doesn't mean it has to be low in the hierarchy. No one will ship a runtime (VM, GC, JIT) which supports all intrinsics for all architectures. The division line goes through the ISA plane - x86, ARM, Power. There is no reason to ship ARM intrinsics in an x86 runtime. Having it in a separate platform assembly in coreclr which could be referenced (circularly) by System.Private.CoreLib could be a solution (I think that is a bit better than ifdefing everything).
The current design intent of class SSE2 is to make more intrinsic functions friendly to users. If we had two classes SSE and SSE2, certain intrinsics would lose the generic signature.
@fiigii, why does separating these out mean we lose the generic signature?
The way I see it, we have two options:
- Vector128<float> Add(Vector128<float> left, Vector128<float> right)
- Vector128<T> Add<T>(Vector128<T> left, Vector128<T> right) (some operations require two generic parameters <T, U>, and this can potentially become even more complex elsewhere)
I see no reason why we can't have SSE and SSE2 and why we can't just have both expose Vector128<T> Add<T>(Vector128<T> left, Vector128<T> right).
That being said, I personally prefer the enforced form that requires additional APIs to be listed. Not only does this help enforce that the user is passing the right things down to the API, but it also decreases the number of checks the JIT must do.
Vector128<float> means T has already been enforced/validated as part of the API contract; Vector128<T> means the JIT must validate that T is of a correct/supported type. This could potentially change from one runtime to the next (depending on the exact set of intrinsics the runtime was built to support), which can make this even more confusing.
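The difference can be seen at a call site (a hedged illustration, assuming decimal is not a supported element type):

```csharp
// With per-type overloads, an unsupported element type cannot even compile:
// SSE2.Add(Vector128<decimal>, Vector128<decimal>)   -> compile-time error,
// because no such overload exists.

// With the generic form, the same mistake compiles and only fails when the
// JIT (or runtime) validates T:
Vector128<decimal> bad = SSE2.Add<decimal>(left, right); // fails at run time
```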
However it doesn't mean that it has to be low in hierarchy. No one will ship runtime (vm, gc, jit) which will support all intrinsics for all architectures. Division line goes through ISA plane - x86, Arm, Power. There is no reason to ship ARM intrinsics on x86 runtime. Having it in separate platform assembly in coreclr which could be referenced (circularly) by System.Private.CoreLib could be a solution.
I could get behind this. The caveats being that:
Is the platform agnostic Load/Store any different from the existing Unsafe.Read/Write?
@jkotas, I think the primary difference is that Load/Store
will compile down to a SIMD instruction and will likely go directly into a register for most cases.
Having it in separate platform assembly in coreclr which could be referenced (circularly) by System.Private.CoreLib could be a solution
Circular references are non-starter. The existing solution for this problem is to have a subset required by CoreLib in CoreLib as internal, and the full blown (duplicate) implementation in separate assembly. Though, it is questionable whether this duplication in the sake of layering is really worth it.
Another thought about naming. The runtime/codegen has many intrinsics today all over the place, for example methods on System.Threading.Interlocked
or System.Runtime.CompilerServices.RuntimeHelpers
are implemented as intrinsics.
Should the namespace name be more specific to capture what actually goes into it, say System.Runtime.HardwareIntrinsics
?
Provided we would like to have direct access to numeric types encoded in RegisterXxx structures - similar to the current System.Numerics.Register implementation, which is IMO a good design - one would need to create (or rather generate) a total of 10 064 fields with the following pattern:
namespace System.Numerics
{
[StructLayout(LayoutKind.Explicit)]
public unsafe struct Register128
{
public fixed byte Reg[16];
// System.Byte Fields
[FieldOffset(0)]
public byte byte_0;
[FieldOffset(1)]
public byte byte_1;
[FieldOffset(2)]
public byte byte_2;
// System.SByte Fields
// etc.
Specifically due to this problem there exists a solution proposal based on extended generics syntax: const blittable parameter as a generic type parameter (https://github.com/dotnet/csharplang/issues/749)
namespace System.Numerics
{
public unsafe struct Register<T, const int N>
{
public fixed T Reg[N];
}
public struct Vector128<T> : Vector<T, Register<T, 16>> {}
Later by specialising generics one can easily create required struct tree.
Load/Store will compile down to a SIMD instruction and will likely go directly into a register for most cases.
Unsafe.Load/Store
compiles into a SIMD instruction for the right sized structs today.
Circular references are non-starter. The existing solution for this problem is to have a subset required by CoreLib in CoreLib as internal, and the full blown (duplicate) implementation in separate assembly. Though, it is questionable whether this duplication in the sake of layering is really worth it.
@jkotas @tannergooding This settles the problem, since a duplicate implementation for an API comprising roughly 10k functions ...
Unsafe.Load/Store compiles into a SIMD instruction for the right sized structs today.
This may be the case implicitly, but it is not explicit in the API (which is the case for Vector128<float> SSE.Load(float* address)). It is also implicit whether this is an aligned read/write or an unaligned one.
One of my favorite features of this proposal is that the APIs are very explicit. If I say LoadAligned, I know that I am going to get the MOVAPS instruction (with no "ifs", "ands", or "buts" about it). If I say LoadUnaligned, I know I am going to get the MOVUPS instruction.
Should the namespace name be more specific to capture what actually goes into it, say System.Runtime.HardwareIntrinsics
A simple calculation of the assembly size difference for functions defined as
public static void System.Runtime.CompilerServices.Intrinsics.AVX2::ShiftLeft
public static void System.Intrinsics.AVX2::ShiftLeft
for 5 000 functions is 250 KB.
duplicate implementation for API comprising roughly 10k functions ...
The stuff duplicated in CoreLib would be just say the 50 functions that are actually needed in CoreLib.
for 5 000 functions is 250 KB.
How did you come up with this number? The namespace name is stored in the managed binary just once. The difference between ShortNameSpace and VeryLoooooooooooooooooongNameSpace should always be ~20 bytes, independent of how many functions are contained in the namespace.
The stuff duplicated in CoreLib would be just say the 50 functions that are actually needed in CoreLib.
This would solve the problem of shipping all architectures together 😄
As to all the statements around things like exposing ref or void* (@pentp, @nietras, @redknightlois), and also as to whether or not a software fallback should be provided:
- ref might be worth exposing. Unsafe.ToPointer solves part of this but requires users to take a separate dependency. It also means that corlib has more trouble dealing with ref.
- void* is probably not worth exposing. Just cast to the appropriate type: (float*)((void*)(p)). Supporting void* means we either need unique method names or we have to use <T> and have the JIT perform validation.
It may already be obvious by my existing statements, but I believe these APIs should be explicit but also simple:
- Take Vector128<float> instead of Vector128<T>, as this enforces compile-time checks and removes JIT overhead
- Expose Load and Store as part of this and not rely on things elsewhere (System.Runtime.CompilerServices.Unsafe)
- Declare the methods extern, with no software fallbacks; CoreFX itself would/should never use them
- Have a higher-level library (in CoreFXExtensions or a third-party repo) that provides software fallbacks for each instruction
How did you come up with this number?
@jkotas from the CIL spec, which states that CIL has no implementation of namespaces and recognises methods by their full name. However, I understand I should check the PE file spec - my bad.
Rather than X86, maybe x86x64, as x86 is often used to denote 32-bit only?
@benaadams, in the same vein, x86-64 is sometimes used to denote the 64-bit-only version of the x86 instruction set, so this would be confusing as well (https://en.wikipedia.org/wiki/X86-64)
I think that x86 makes the most sense and is used most frequently to refer to the entire platform.
At least for Wikipedia:
It seems it won't be a simple API and it will require multiple design decisions - is it possible to start working on the details in CoreFXLabs or a separate branch in coreclr/corefx?
A separate repo would support an issue tracking system, which IMO would be needed to get it done fast and efficiently.
It seems it won't be a simple API and it would require multiple design decisions - is it possible to start working on details of it in CoreFXLabs or separate branch in coreclr/corefx?
I'm going to second this. I think it would be worthwhile to get the basic API shape (as proposed) up in CoreFXLabs and to "use" it in a real scenario.
I would propose we take Vector2, Vector3, and Vector4 and reimplement them to call the APIs, as per https://github.com/Microsoft/DirectXMath, and potentially do the same for Cos, Sin, and Tan in Math/MathF.
Although we won't get any perf numbers from this and we won't be able to run the code, it will let us view the use case in "real-world" scenarios, to get a better feel for what makes the most sense and what the strengths/deficiencies of the proposal are (and any suggested modifications to it).
Although we won't get any perf numbers
To get perf numbers, it should be fine to add some support for this in the JIT (without exposing it in the stable shipping profile) and experiment with the API shape in corefxlab.
Unsafe.ToPointer solves part of this
@tannergooding leaving a GC hole or requiring pinning is specifically what we want to avoid ;) ref is essential for generic Span<T>-based algorithms without the need for pinning. Unsafe.Read/Write should work too. I want both apples ;)
We should have APIs like Load and Store as part of this and not rely on things elsewhere (System.Runtime.CompilerServices.Unsafe).
Agreed, and I am not saying that. But Unsafe.Read/Write<Vector128<T>> should still work. That is a must in my view. Otherwise, generic code that can handle different vector registers, basic types, etc. becomes very difficult.
💭 ❓ Would these new vector types be candidates for being ref struct instead of just struct?
void* is probably not worth exposing. Just cast to the appropriate type (float*)((void*)(p)).
@tannergooding you can't do that in generic code. I think it would be good to consider algorithms that are generic too; lots of things could be done here in a generic way, exposing many numerical operations on, say, images without the need for a hand-tailored loop for each operation. There are many, many cases where generic code could be used with this.
I don't see any issue with an API with static methods for void*, e.g.
public class Vector128<T>
{
    public static unsafe Vector128<T> Load(void* p);
}
The JIT of course has to handle this, but shouldn't that be rather straightforward? My assumption here is that if Vector128<T>.IsSupported, then you must be able to Load and Store, so these do not have to be in platform-specific places.
If they do, then yes, we need something like Vector128<int> SSE2.LoadInt(void* p) and in some cases even AVX512VL.LoadInt256(void* p), maybe... ugly naming aside. Otherwise, out could be a fallback, although it makes code cumbersome (less so with C# 7).
void* p = ...;
AVX512VL.LoadAligned(p, out Vector256<int> v);
It is not that much more cumbersome when viewed like this, and hopefully it has no perf issues.
Don't think void* is needed? Just a ref version. Can convert void* to ref with Unsafe.AsRef, e.g.
void* input;
ref Vector<short> v = ref Unsafe.AsRef<Vector<short>>(input);
Don't think void* is needed? Just a ref version.
Yes, I could live with that; in fact, I would go as far as to say: why have any pointer versions at all? These should be based solely on ref. A pointer can easily be converted to a ref, and this way all scenarios are supported (pointers, spans, refs, Unsafe, etc.). And without any perf issues, I imagine.
namespace System.Runtime.CompilerServices.Intrinsics.X86
{
    public static class AVX
    {
        ......
        // __m256i _mm256_loadu_si256 (__m256i const * mem_addr)
        [Intrinsic]
        public static unsafe Vector256<sbyte> Load(ref sbyte mem) { throw new NotImplementedException(); }
        // __m256i _mm256_loadu_si256 (__m256i const * mem_addr)
        [Intrinsic]
        public static unsafe Vector256<byte> Load(ref byte mem) { throw new NotImplementedException(); }
        ......
    }
}
Usage with pointer would be a little more cumbersome, but not a big deal for me.
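For instance, pointer-based callers could route through Unsafe.AsRef to reach the ref overloads (hedged sketch; it assumes the Load(ref byte) shape discussed above):

```csharp
using System.Runtime.CompilerServices; // for Unsafe

public static unsafe Vector256<byte> LoadFromPointer(byte* p)
{
    // Unsafe.AsRef<T>(void*) reinterprets the pointer as a managed reference,
    // so one ref-based Load can serve pointer, Span<T>, and ref callers alike.
    return AVX.Load(ref Unsafe.AsRef<byte>(p));
}
```

The cost is one extra call at the use site; the upside is a single overload set.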
This proposal adds intrinsics that allow programmers to use managed code (C#) to leverage Intel® SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, AVX, AVX2, FMA, LZCNT, POPCNT, BMI1/2, PCLMULQDQ, and AES instructions.
Rationale and Proposed API
Vector Types
Currently, .NET provides System.Numerics.Vector<T> and related intrinsic functions as a cross-platform SIMD interface that automatically matches proper hardware support at JIT-compile time (e.g. Vector<T> is 128 bits wide on SSE2 machines and 256 bits wide on AVX2 machines). However, there is no way to simultaneously use differently sized Vector<T>, which limits the flexibility of SIMD intrinsics. For example, on AVX2 machines, XMM registers are not accessible from Vector<T>, but certain instructions have to work on XMM registers (i.e. SSE4.2). Consequently, this proposal introduces Vector128<T> and Vector256<T> in a new namespace System.Runtime.Intrinsics.
This namespace is platform agnostic, and other hardware could provide intrinsics that operate over these types. For instance, Vector128<T> could be implemented as an abstraction of XMM registers on SSE-capable processors or as an abstraction of Q registers on NEON-capable processors. Meanwhile, other types may be added in the future to support newer SIMD architectures (e.g. adding 512-bit vector and mask vector types for AVX-512).

Intrinsic Functions
The current design of System.Numerics.Vector abstracts away the specifics of processor details. While this approach works well in many cases, developers may not be able to take full advantage of the underlying hardware. Intrinsic functions allow developers to access the full capability of the processors on which their programs run.

One of the design goals of the intrinsics APIs is to provide one-to-one correspondence to Intel C/C++ intrinsics. That way, programmers already familiar with C/C++ intrinsics can easily leverage their existing skills. Another advantage of this approach is that we leverage the existing body of documentation and sample code written for C/C++ intrinsics.
Intrinsic functions that manipulate Vector128/256<T> will be placed in a platform-specific namespace System.Runtime.Intrinsics.X86. Intrinsic APIs will be separated into several static classes based on the instruction sets they belong to.

Some of the intrinsics benefit from C# generics and get simpler APIs:
Each instruction set class contains an IsSupported property which indicates whether the underlying hardware supports that instruction set. Programmers use these properties to guard platform-specific code paths so that their code can run on any hardware. For JIT compilation, the results of capability checking are JIT-time constants, so the dead code path for the current platform will be eliminated by the JIT compiler (conditional constant propagation). For AOT compilation, the compiler/runtime executes the CPUID checking to identify the corresponding instruction sets. Additionally, the intrinsics do not provide a software fallback, and calling the intrinsics on machines that have no corresponding instruction sets will cause a PlatformNotSupportedException at runtime. Consequently, we always recommend that developers provide a software fallback to keep the program portable. A common pattern of platform-specific code path plus software fallback looks like the example below.

The scope of this API proposal is not limited to SIMD (vector) intrinsics; it also includes scalar intrinsics that operate over scalar types (e.g. int, short, long, or float, etc.) from the instruction sets mentioned above. As an example, the following code segment shows Crc32 intrinsic functions from the Sse42 class.

Intended Audience
The intrinsics APIs bring the power and flexibility of accessing hardware instructions directly from C# programs. However, this power and flexibility means that developers have to be cognizant of how these APIs are used. In addition to ensuring that their program logic is correct, developers must also ensure that the use of underlying intrinsic APIs are valid in the context of their operations.
For example, developers who use certain hardware intrinsics should be aware of their data alignment requirements. Both aligned and unaligned memory load and store intrinsics are provided, and if aligned loads and stores are desired, developers must ensure that the data are aligned appropriately. The following code snippet shows the different flavors of load and store intrinsics proposed:
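The snippet itself did not survive here; the following is a hedged reconstruction of the pattern being described (method names follow the proposal's Load/Store naming, and the alignment contract and IsSupported guard are the key points):

```csharp
public static unsafe void Copy4(float* source, float* dest)
{
    if (Sse.IsSupported)
    {
        // Aligned flavor: the developer must guarantee 16-byte alignment
        // of both pointers, or the access faults at runtime.
        // The unaligned flavor (LoadUnaligned/StoreUnaligned) accepts any address.
        Sse.StoreAligned(dest, Sse.LoadAligned(source));
    }
    else
    {
        // Software fallback keeps the code portable to non-SSE hardware.
        for (int i = 0; i < 4; i++) dest[i] = source[i];
    }
}
```

Because Sse.IsSupported is a JIT-time constant, only one of the two branches survives compilation on any given machine.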
IMM Operands

Most of the intrinsics can be directly ported to C# from C/C++, but certain instructions that require immediate parameters (i.e. imm8) as operands deserve additional consideration, such as pshufd, vcmpps, etc. C/C++ compilers treat these intrinsics specially and throw compile-time errors when non-constant values are passed into immediate parameters. Therefore, CoreCLR also requires an immediate-argument guard from the C# compiler. We suggest adding a new "compiler feature" to Roslyn that places a const constraint on function parameters. Roslyn could then ensure that these functions are invoked with "literal" values for the const formal parameters.

Semantics and Usage
The semantics are straightforward if users are already familiar with Intel C/C++ intrinsics. Existing SIMD programs and algorithms that are implemented in C/C++ can be directly ported to C#. Moreover, compared to System.Numerics.Vector<T>, these intrinsics leverage the whole power of Intel SIMD instructions and do not depend on other modules (e.g. Unsafe) in high-performance environments.

For example, SoA (structure of arrays) is a more efficient pattern than AoS (array of structures) in SIMD programming. However, it requires dense shuffle sequences to convert the data source (usually stored in AoS format), which are not provided by Vector<T>. Using Vector256<T> with AVX shuffle instructions (including shuffle, insert, extract, etc.) can lead to higher throughput.

Furthermore, conditional code is enabled in vectorized programs. Conditional paths are ubiquitous in scalar programs (if-else), but they require specific SIMD instructions in vectorized programs, such as compare, blend, or andnot, etc.

As previously stated, traditional scalar algorithms can be accelerated as well. For example, CRC32 is natively supported on SSE4.2 CPUs.
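To illustrate the compare/blend idea just mentioned, a per-element if-else can be written branch-free. This is a hedged sketch in the proposal's naming style; CompareGreaterThan and BlendVariable are assumed method names, not confirmed signatures:

```csharp
// Scalar version: for each lane i, result[i] = (a[i] > b[i]) ? a[i] : b[i]
public static Vector128<float> Max(Vector128<float> a, Vector128<float> b)
{
    // The compare produces an all-ones or all-zeros mask per lane...
    Vector128<float> mask = Sse.CompareGreaterThan(a, b);
    // ...and the variable blend selects each lane from 'a' where the mask
    // is set, otherwise from 'b'.
    return Sse41.BlendVariable(b, a, mask);
}
```

The whole conditional collapses into two instructions with no branch misprediction cost.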
Implementation Roadmap
Implementing all the intrinsics in the JIT is a large-scale, long-term project, so the current plan is to initially implement a subset of them with unit tests, code quality tests, and benchmarks.
The first step in the implementation would involve infrastructure-related items: wiring up the basic components, including but not limited to the internal data representations of Vector128<T> and Vector256<T>, intrinsics recognition, hardware support checking, and external support from Roslyn/CoreFX. Next steps would involve implementing subsets of the intrinsics in classes representing the different instruction sets.

Complete API Design
Add Intel hardware intrinsic APIs to CoreFX dotnet/corefx#23489 Add Intel hardware intrinsic API implementation to mscorlib dotnet/corefx#13576
Update

08/17/2017
- Renamed System.Runtime.CompilerServices.Intrinsics to System.Runtime.Intrinsics, and System.Runtime.CompilerServices.Intrinsics.X86 to System.Runtime.Intrinsics.X86.
- Class naming: Avx instead of AVX.
- Parameter naming: address instead of mem.
- Changed IsSupport to properties.
- Added Span<T> overloads to the most common memory-access intrinsics (Load, Store, Broadcast), but left other alignment-aware or performance-sensitive intrinsics with the original pointer versions.
- Sse2 class design: separated small classes (e.g., Aes, Lzcnt, etc.) into individual source files (e.g., Aes.cs, Lzcnt.cs, etc.).
- Renamed CompareVector* to Compare and dropped the Compare prefix from FloatComparisonMode.

08/22/2017
- Replaced Span<T> overloads with ref T overloads.

09/01/2017
12/21/2018