API Proposal: Add Intel hardware intrinsic functions and namespace

fiigii commented 7 years ago

This proposal adds intrinsics that allow programmers to use managed code (C#) to leverage Intel® SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, AVX, AVX2, FMA, LZCNT, POPCNT, BMI1/2, PCLMULQDQ, and AES instructions.

Rationale and Proposed API

Vector Types

Currently, .NET provides System.Numerics.Vector<T> and related intrinsic functions as a cross-platform SIMD interface that automatically matches proper hardware support at JIT-compile time (e.g. Vector<T> is size of 128-bit on SSE2 machines or 256-bit on AVX2 machines). However, there is no way to simultaneously use different size Vector<T>, which limits the flexibility of SIMD intrinsics. For example, on AVX2 machines, XMM registers are not accessible from Vector<T>, but certain instructions have to work on XMM registers (i.e. SSE4.2). Consequently, this proposal introduces Vector128<T> and Vector256<T> in a new namespace System.Runtime.Intrinsics

namespace System.Runtime.Intrinsics
{
    // 128 bit types
    [StructLayout(LayoutKind.Sequential, Size = 16)]
    public struct Vector128<T> where T : struct {}

    // 256 bit types
    [StructLayout(LayoutKind.Sequential, Size = 32)]
    public struct Vector256<T> where T : struct {}
}

This namespace is platform agnostic, and other hardware could provide intrinsics that operate over them. For instance, Vector128<T> could be implemented as an abstraction of XMM registers on SSE capable processor or as an abstraction of Q registers on NEON capable processors. Meanwhile, other types may be added in the future to support newer SIMD architectures (i.e. adding 512-bit vector and mask vector types for AVX-512).

Intrinsic Functions

The current design of System.Numerics.Vector abstracts away the specifics of processor details. While this approach works well in many cases, developers may not be able to take full advantage of the underlying hardware. Intrinsic functions allow developers to access full capability of processors on which their programs run.

One of the design goals of intrinsics APIs is to provide one-to-one correspondence to Intel C/C++ intrinsics. That way, programmers already familiar with C/C++ intrinsics can easily leverage their existing skills. Another advantage of this approach is that we leverage the existing body of documentation and sample code written for C/C++ instrinsics.

Intrinsic functions that manipulate Vector128/256<T> will be placed in a platform-specific namespace System.Runtime.Intrinsics.X86. Intrinsic APIs will be separated to several static classes based-on the instruction sets they belong to.

// Avx.cs
namespace System.Runtime.Intrinsics.X86
{
    public static class Avx
    {
        public static bool IsSupported {get;}

        // __m256 _mm256_add_ps (__m256 a, __m256 b)
        [Intrinsic]
        public static Vector256<float> Add(Vector256<float> left, Vector256<float> right) { throw new NotImplementedException(); }
        // __m256d _mm256_add_pd (__m256d a, __m256d b)
        [Intrinsic]
        public static Vector256<double> Add(Vector256<double> left, Vector256<double> right) { throw new NotImplementedException(); }

        // __m256 _mm256_addsub_ps (__m256 a, __m256 b)
        [Intrinsic]
        public static Vector256<float> AddSubtract(Vector256<float> left, Vector256<float> right) { throw new NotImplementedException(); }
        // __m256d _mm256_addsub_pd (__m256d a, __m256d b)
        [Intrinsic]
        public static Vector256<double> AddSubtract(Vector256<double> left, Vector256<double> right) { throw new NotImplementedException(); }

        ......
    }
}

Some of intrinsics benefit from C# generic and get simpler APIs:

// Sse2.cs
namespace System.Runtime.Intrinsics.X86
{
    public static class Sse
    {
        public static bool IsSupported {get;}

        // __m128 _mm_castpd_ps (__m128d a)
        // __m128i _mm_castpd_si128 (__m128d a)
        // __m128d _mm_castps_pd (__m128 a)
        // __m128i _mm_castps_si128 (__m128 a)
        // __m128d _mm_castsi128_pd (__m128i a)
        // __m128 _mm_castsi128_ps (__m128i a)
        [Intrinsic]
        public static Vector128<U> StaticCast<T, U>(Vector128<T> value) where T : struct where U : struct { throw new NotImplementedException(); }

        ......
    }
}

Each instruction set class contains an IsSupported property which stands for whether the underlying hardware supports the instruction set. Programmers use these properties to ensure that their code can run on any hardware via platform-specific code path. For JIT compilation, the results of capability checking are JIT time constants, so dead code path for the current platform will be eliminated by JIT compiler (conditional constant propagation). For AOT compilation, compiler/runtime executes the CPUID checking to identify corresponding instruction sets. Additionally, the intrinsics do not provide software fallback and calling the intrinsics on machines that has no corresponding instruction sets will cause PlatformNotSupportedException at runtime. Consequently, we always recommend developers to provide software fallback to remain the program portable. Common pattern of platform-specific code path and software fallback looks like below.

if (Avx2.IsSupported)
{
    // The AVX/AVX2 optimizing implementation for Haswell or above CPUs  
}
else if (Sse41.IsSupported)
{
    // The SSE optimizing implementation for older CPUs  
}
......
else
{
    // Scalar or software-fallback implementation
}

The scope of this API proposal is not limited to SIMD (vector) intrinsics, but also includes scalar intrinsics that operate over scalar types (e.g. int, short, long, or float, etc.) from the instruction sets mentioned above. As an example, the following code segment shows Crc32 intrinsic functions from Sse42 class.

// Sse42.cs
namespace System.Runtime.Intrinsics.X86
{
    public static class Sse42
    {
        public static bool IsSupported {get;}

        // unsigned int _mm_crc32_u8 (unsigned int crc, unsigned char v)
        [Intrinsic]
        public static uint Crc32(uint crc, byte data) { throw new NotImplementedException(); }
        // unsigned int _mm_crc32_u16 (unsigned int crc, unsigned short v)
        [Intrinsic]
        public static uint Crc32(uint crc, ushort data) { throw new NotImplementedException(); }
        // unsigned int _mm_crc32_u32 (unsigned int crc, unsigned int v)
        [Intrinsic]
        public static uint Crc32(uint crc, uint data) { throw new NotImplementedException(); }
        // unsigned __int64 _mm_crc32_u64 (unsigned __int64 crc, unsigned __int64 v)
        [Intrinsic]
        public static ulong Crc32(ulong crc, ulong data) { throw new NotImplementedException(); }

        ......
    }
}

Intended Audience

The intrinsics APIs bring the power and flexibility of accessing hardware instructions directly from C# programs. However, this power and flexibility means that developers have to be cognizant of how these APIs are used. In addition to ensuring that their program logic is correct, developers must also ensure that the use of underlying intrinsic APIs are valid in the context of their operations.

For example, developers who use certain hardware intrinsics should be aware of their data alignment requirements. Both aligned and unaligned memory load and store intrinsics are provided, and if aligned loads and stores are desired, developers must ensure that the data are aligned appropriately. The following code snippet shows the different flavors of load and store intrinsics proposed:

// Avx.cs
namespace System.Runtime.Intrinsics.X86
{
    public static class Avx
    {
        ......

        // __m256i _mm256_loadu_si256 (__m256i const * mem_addr)
        [Intrinsic]
        public static unsafe Vector256<sbyte> Load(sbyte* address) { throw new NotImplementedException(); }
        // __m256i _mm256_loadu_si256 (__m256i const * mem_addr)
        [Intrinsic]
        public static unsafe Vector256<byte> Load(byte* address) { throw new NotImplementedException(); }
        ......
        [Intrinsic]
        public static Vector256<T> Load<T>(ref T vector) where T : struct { throw new NotImplementedException(); }

        // __m256i _mm256_load_si256 (__m256i const * mem_addr)
        [Intrinsic]
        public static unsafe Vector256<sbyte> LoadAligned(sbyte* address) { throw new NotImplementedException(); }
        // __m256i _mm256_load_si256 (__m256i const * mem_addr)
        [Intrinsic]
        public static unsafe Vector256<byte> LoadAligned(byte* address) { throw new NotImplementedException(); }
        ......

        // __m256i _mm256_lddqu_si256 (__m256i const * mem_addr)
        [Intrinsic]
        public static unsafe Vector256<sbyte> LoadDqu(sbyte* address) { throw new NotImplementedException(); }
        // __m256i _mm256_lddqu_si256 (__m256i const * mem_addr)
        [Intrinsic]
        public static unsafe Vector256<byte> LoadDqu(byte* address) { throw new NotImplementedException(); }
        ......

        // void _mm256_storeu_si256 (__m256i * mem_addr, __m256i a)
        [Intrinsic]
        public static unsafe void Store(sbyte* address, Vector256<sbyte> source) { throw new NotImplementedException(); }
        // void _mm256_storeu_si256 (__m256i * mem_addr, __m256i a)
        [Intrinsic]
        public static unsafe void Store(byte* address, Vector256<byte> source) { throw new NotImplementedException(); }
        ......
        public static void Store<T>(ref T vector, Vector256<T> source) where T : struct { throw new NotImplementedException(); }

        // void _mm256_store_si256 (__m256i * mem_addr, __m256i a)
        [Intrinsic]
        public static unsafe void StoreAligned(sbyte* address, Vector256<sbyte> source) { throw new NotImplementedException(); }
        // void _mm256_store_si256 (__m256i * mem_addr, __m256i a)
        [Intrinsic]
        public static unsafe void StoreAligned(byte* address, Vector256<byte> source) { throw new NotImplementedException(); }
        ......

    // void _mm256_stream_si256 (__m256i * mem_addr, __m256i a)
        [Intrinsic]
        public static unsafe void StoreAlignedNonTemporal(sbyte* address, Vector256<sbyte> source) { throw new NotImplementedException(); }
        // void _mm256_stream_si256 (__m256i * mem_addr, __m256i a)
        [Intrinsic]
        public static unsafe void StoreAlignedNonTemporal(byte* address, Vector256<byte> source) { throw new NotImplementedException(); }

        ......
    }
}

IMM Operands

Most of the intrinsics can be directly ported to C# from C/C++, but certain instructions that require immediate parameters (i.e. imm8) as operands deserve additional consideration, such as pshufd, vcmpps, etc. C/C++ compilers specially treat these intrinsics which throw compile-time errors when non-constant values are passed into immediate parameters. Therefore, CoreCLR also requires the immediate argument guard from C# compiler. We suggest an addition of a new "compiler feature" into Roslyn which places const constraint on function parameters. Roslyn could then ensure that these functions are invoked with "literal" values on the const formal parameters.

// Avx.cs
namespace System.Runtime.Intrinsics.X86
{
    public static class Avx
    {
        ......

        // __m256 _mm256_blend_ps (__m256 a, __m256 b, const int imm8)
        [Intrinsic]
        public static Vector256<float> Blend(Vector256<float> left, Vector256<float> right, const byte control) { throw new NotImplementedException(); }
        // __m256d _mm256_blend_pd (__m256d a, __m256d b, const int imm8)
        [Intrinsic]
        public static Vector256<double> Blend(Vector256<double> left, Vector256<double> right, const byte control) { throw new NotImplementedException(); }

        // __m128 _mm_cmp_ps (__m128 a, __m128 b, const int imm8)
        [Intrinsic]
        public static Vector128<float> Compare(Vector128<float> left, Vector128<float> right, const FloatComparisonMode mode) { throw new NotImplementedException(); }

        // __m128d _mm_cmp_pd (__m128d a, __m128d b, const int imm8)
        [Intrinsic]
        public static Vector128<double> Compare(Vector128<double> left, Vector128<double> right, const FloatComparisonMode mode) { throw new NotImplementedException(); }

        ......
    }
}

// Enums.cs
namespace System.Runtime.Intrinsics.X86
{
    public enum FloatComparisonMode : byte
    {
        EqualOrderedNonSignaling,
        LessThanOrderedSignaling,
        LessThanOrEqualOrderedSignaling,
        UnorderedNonSignaling,
        NotEqualUnorderedNonSignaling,
        NotLessThanUnorderedSignaling,
        NotLessThanOrEqualUnorderedSignaling,
        OrderedNonSignaling,
        ......
    }

    ......
}

Semantics and Usage

The semantic is straightforward if users are already familiar with Intel C/C++ intrinsics. Existing SIMD programs and algorithms that are implemented in C/C++ can be directly ported to C#. Moreover, compared to System.Numerics.Vector<T>, these intrinsics leverage the whole power of Intel SIMD instructions and do not depend on other modules (e.g. Unsafe) in high-performance environments.

For example, SoA (structure of array) is a more efficient pattern than AoS (array of structure) in SIMD programming. However, it requires dense shuffle sequences to convert data source (usually stored in AoS format), which is not provided by Vector<T>. Using Vector256<T> with AVX shuffle instructions (including shuffle, insert, extract, etc.) can lead to higher throughput.

public struct Vector256Packet
{
    public Vector256<float> xs {get; private set;}
    public Vector256<float> ys {get; private set;}
    public Vector256<float> zs {get; private set;}

    // Convert AoS vectors to SoA packet
    public unsafe Vector256Packet(float* vectors)
    {
        var m03 = Avx.ExtendToVector256<float>(Sse2.Load(&vectors[0])); // load lower halves
        var m14 = Avx.ExtendToVector256<float>(Sse2.Load(&vectors[4]));
        var m25 = Avx.ExtendToVector256<float>(Sse2.Load(&vectors[8]));
        m03 = Avx.Insert(m03, &vectors[12], 1);  // load higher halves
        m14 = Avx.Insert(m14, &vectors[16], 1);
        m25 = Avx.Insert(m25, &vectors[20], 1);

        var xy = Avx.Shuffle(m14, m25, 2 << 6 | 1 << 4 | 3 << 2 | 2);
        var yz = Avx.Shuffle(m03, m14, 1 << 6 | 0 << 4 | 2 << 2 | 1);
        var _xs = Avx.Shuffle(m03, xy, 2 << 6 | 0 << 4 | 3 << 2 | 0);
        var _ys = Avx.Shuffle(yz, xy,  3 << 6 | 1 << 4 | 2 << 2 | 0);
        var _zs = Avx.Shuffle(yz, m25, 3 << 6 | 0 << 4 | 3 << 2 | 1);

        xs = _xs;
        ys = _ys;
        zs = _zs; 
    }
    ......
}

public static class Main
{
    static unsafe int Main(string[] args)
    {
        var data = new float[Length];
        fixed (float* dataPtr = data)
        {
            if (Avx2.IsSupported)
            {
                var vector = new Vector256Packet(dataPtr);
                ......
                // Using AVX/AVX2 intrinsics to compute eight 3D vectors.
            }
            else if (Sse41.IsSupported)
            {
                var vector = new Vector128Packet(dataPtr);
                ......
                // Using SSE intrinsics to compute four 3D vectors.
            }
            else
            {
                // scalar algorithm
            }
        }
    }
}

Furthermore, conditional code is enabled in vectorized programs. Conditional path is ubiquitous in scalar programs (if-else), but it requires specific SIMD instructions in vectorized programs, such as compare, blend, or andnot, etc.

public static class ColorPacketHelper
{
    public static IntRGBPacket ConvertToIntRGB(this Vector256Packet colors)
    {
        var one = Avx.Set1<float>(1.0f);
        var max = Avx.Set1<float>(255.0f);

        var rsMask = Avx.Compare(colors.xs, one, FloatComparisonMode.GreaterThanOrderedNonSignaling);
        var gsMask = Avx.Compare(colors.ys, one, FloatComparisonMode.GreaterThanOrderedNonSignaling);
        var bsMask = Avx.Compare(colors.zs, one, FloatComparisonMode.GreaterThanOrderedNonSignaling);

        var rs = Avx.BlendVariable(colors.xs, one, rsMask);
        var gs = Avx.BlendVariable(colors.ys, one, gsMask);
        var bs = Avx.BlendVariable(colors.zs, one, bsMask);

        var rsInt = Avx.ConvertToVector256Int(Avx.Multiply(rs, max));
        var gsInt = Avx.ConvertToVector256Int(Avx.Multiply(gs, max));
        var bsInt = Avx.ConvertToVector256Int(Avx.Multiply(bs, max));

        return new IntRGBPacket(rsInt, gsInt, bsInt);
    }
}

public struct IntRGBPacket
{
    public Vector256<int> Rs {get; private set;}
    public Vector256<int> Gs {get; private set;}
    public Vector256<int> Bs {get; private set;}

    public IntRGBPacket(Vector256<int> _rs, Vector256<int> _gs, Vector256<int>_bs)
    {
        Rs = _rs;
        Gs = _gs;
        Bs = _bs;
    }
}

As previously stated, traditional scalar algorithms can be accelerated as well. For example, CRC32 is natively supported on SSE4.2 CPUs.

public static class Verification
{
    public static bool VerifyCrc32(ulong acc, ulong data, ulong res)
    {
        if (Sse42.IsSupported)
        {
            return Sse42.Crc32(acc, data) == res;
        }
        else
        {
            return SoftwareCrc32(acc, data) == res;
            // The software implementation of Crc32 provided by developers or other libraries
        }
    }
}

Implementation Roadmap

Implementing all the intrinsics in JIT is a large-scale and long-term project, so the current plan is to initially implement a subset of them with unit tests, code quality test, and benchmarks.

The first step in the implementation would involve infrastructure related items. This step would involve wiring the basic components, including but not limited to internal data representations of Vector128<T> and Vector256<T>, intrinsics recognition, hardware support checking, and external support from Roslyn/CoreFX. Next steps would involve implementing subsets of intrinsics in classes representing different instruction sets.

Complete API Design

Add Intel hardware intrinsic APIs to CoreFX dotnet/corefx#23489 Add Intel hardware intrinsic API implementation to mscorlib dotnet/corefx#13576

Update

08/17/2017

Change namespace System.Runtime.CompilerServices.Intrinsics to System.Runtime.Intrinsics and System.Runtime.CompilerServices.Intrinsics.X86 to System.Runtime.Intrinsics.X86.
Change ISA class name to match CoreFX naming convention, e.g., using Avx instead of AVX.
Change certain pointer parameter names, e.g., using address instead of mem.
Define IsSupport as properties.
Add Span<T> overloads to the most common memory-access intrinsics (Load, Store, Broadcast), but leave other alignment-aware or performance-sensitive intrinsics with original pointer version.
Clarify that these intrinsics will not provide software fallback.
Clarify Sse2 class design and separate small calsses (e.g., Aes, Lzcnt, etc.) into individual source files (e.g., Aes.cs, Lzcnt.cs, etc.).
Change method name CompareVector* to Compare and get rid of Compare prefix from FloatComparisonMode.

08/22/2017

Replace Span<T> overloads by ref T overloads.

09/01/2017

Minor changes from API code review.

12/21/2018

All the proposed APIs are enabled in .NET Core runtime.

fiigii commented 7 years ago

cc: @russellhadley @mellinoe @CarolEidt @terrajobst

tannergooding commented 7 years ago

Overall I love this proposal. I do have a few questions/comments:

Each vector type exposes an IsSupported method to check if the current hardware supports

I think this can be a property, as it is in Vector<T>.

Does this take the type of T into account? For example, will IsSupported return true for Vector128<float> but false for Vector128<CustomStruct> (or is it expected to throw in this case)?

What about formats that may be supported on some processors, but not others? As an example, lets say there is instruction set X which only supports Vector128<float> and later comes instruction set Y which supports Vector128<double>. If the CPU currently only supports X would it return true for Vector128<float> and false for Vector128<double> with Vector128<double> only returning true when instruction set Y is supported?

In addition, this namespace would contain conversion functions between the existing SIMD type (Vector) and new Vector128 and Vector256 types.

My concern here is the target layering for each component. I would hope that System.Runtime.CompilerServices.Intrinsics are part of the lowest layer, and therefore consumable by all other APIs in CoreFX. While Vector<T>, on the other hand, is part of one of the higher layers and is therefore not consumable.

Would it be better here to either have the conversion operators on Vector<T> or to expect the user to perform an explicit load/store (as they will likely be expected to do with other custom types)?

SSE2.cs (the bottom-line of intrinsic support that contains all the intrinsics of SSE and SSE2)

I understand that with SSE and SSE2 being required in RyuJIT this makes sense, but I would almost prefer an explicit SSE class to have a consistent separation. I would essentially expect a 1-1 mapping of class to CPUID flag.

Other.cs (includes LZCNT, POPCNT, BMI1, BMI2, PCLMULQDQ, and AES)

For this specifically, how would you expect the user to check which instruction subsets are supported? AES and POPCNT are separate CPUID flags and not every x86 compatible CPU may always provide both.

Some of intrinsics benefit from C# generic and get simpler APIs

I didn't see any examples of scalar floating-point APIs (_mm_rsqrt_ss). How would these fit in with the Vector based APIs (naming wise, etc)?

redknightlois commented 7 years ago

Looks good and in line with the suggestions I have made. The only thing that probably do not resonate with me (maybe because we deal with pointers on a regular basis on our codebase) is having to use Load(type*) instead of supporting also the ability to call the function with a void* as the semantics of the operation are very clear. Probably it is me, but with the exception of special operations like a non-temporal store (where you would need to use a Store/Load operation explicitely) not having support for arbitrary pointer types would only add bloat to the algorithm without any actual improvement in readability/understandability.

tannergooding commented 7 years ago

Therefore, CoreCLR also requires the immediate argument guard from C# compiler.

Going to tag @jaredpar here explicitly. We should get a formal proposal up.

I think that we can do this without language support (@jaredpar, tell me if I'm crazy here) if the compiler can recognize something like System.Runtime.CompilerServices.IsLiteralAttribute and emits it as modreq isliteral.

Having a new recognized keyword (const) here is likely more complicated as it requires formal spec'ing in the language etc.

mellinoe commented 7 years ago

Thanks for posting this @fiigii. I'm very eager to hear everyone's thoughts on the design.

IMM Operands

One thing that came up in a recent discussion is that some immediate operands have stricter constraints than just "must be constant". The examples given use a FloatComparisonMode enum, and functions accepting it apply a const modifier to the parameter. But there is no way to prevent someone from passing a non-enum value, still a constant, to a method accepting that parameter.

`AVX.CompareVector256(left, right, (FloatComparisonMode)255);

EDIT: This warning is emitted in a VC++ project if you use the above code.

Now, this may not be a problem for this particular example (I'm not familiar with its exact semantics), but it's something to keep in mind. There were also other, more esoteric examples given, like an immediate operand which must be a power of two, or which satisfies some other obscure relation to the other operands. These constraints will be much more difficult, most likely impossible, to enforce at the C# level. The "const" enforcement feels more reasonable and achievable, and seems to cover most instances of the problem.

SSE2.cs (the bottom-line of intrinsic support that contains all the intrinsics of SSE and SSE2)

I'll echo what @tannergooding said -- I think it will be simpler to just have a distinct class for each instruction set. I'd like for it to be very obvious how and where new things should be added. If there's a "grab bag" sort of type, then it becomes a bit murkier and we have to make lots of unnecessary judgement calls.

sharwell commented 7 years ago

💭 Most of my initial thoughts go to the use of pointers in a few places. Knowing what we know about ref structs and Span<T>, what parts of the proposal can leverage new functionality to avoid unsafe code without compromising performance.

❓ In the following code, would the generic method actually be expanded to each of the forms allowed by the processor, or would it be defined in coed as a generic?

// __m128i _mm_add_epi8 (__m128i a,  __m128i b)
// __m128i _mm_add_epi16 (__m128i a,  __m128i b)
// __m128i _mm_add_epi32 (__m128i a,  __m128i b)
// __m128i _mm_add_epi64 (__m128i a,  __m128i b)
// __m128 _mm_add_ps (__m128 a,  __m128 b)
// __m128d _mm_add_pd (__m128d a,  __m128d b)
[Intrinsic]
public static Vector128<T> Add<T>(Vector128<T> left,  Vector128<T> right) where T : struct { throw new NotImplementedException(); }

sharwell commented 7 years ago

❓ If the processor doesn't support something, do we fall back to simulated behavior or do we throw exceptions? If we choose the former, would it make sense to rename IsSupported to IsHardwareAccelerated?

tannergooding commented 7 years ago

Knowing what we know about ref structs and Span, what parts of the proposal can leverage new functionality to avoid unsafe code without compromising performance.

Personally, I am fine with the unsafe code. I don't believe this is meant to be a feature that app designers use and is instead meant to be something framework designers use to squeeze extra performance and also to simplify overhead on the JIT.

People using intrinsics are likely already doing a bunch of unsafe things already and this just makes it more explicit.

If the processor doesn't support something, do we fall back to simulated behavior or do we throw exceptions?

The official design doc (https://github.com/dotnet/designs/blob/master/accepted/platform-intrinsics.md) indicates that it is up in the air whether software fallbacks are allowed.

I am of the opinion that all of these methods should be declared as extern and should never have software fallbacks. Users would be expected to implement a software fallback themselves or have a PlatformNotSupportedException thrown by the JIT at runtime.

This will help ensures the consumer is being aware of the underlying platforms they are targeting and that they are writing code that is "suited" for the underlying hardware (running vectorized algorithms on hardware without vectorization support can cause performance degradation).

benaadams commented 7 years ago

If the processor doesn't support something, do we fall back to simulated behavior or do we throw exceptions?

The official design doc (https://github.com/dotnet/designs/blob/master/accepted/platform-intrinsics.md) indicates that it is up in the air whether software fallbacks are allowed.

These are the raw CPU platform intrinsics e.g. X86.SSE so PNS is probably fine; and will help get them out quicker.

Assuming the detection is branch eliminated; it should be easy to build a library on top that then does software fallbacks, which can be iterated on (either coreclr/corefx or 3rd party)

sharwell commented 7 years ago

Personally, I am fine with the unsafe code.

I am not against unsafe code. However, given the choice between safe code and unsafe code that perform the same, I would choose the former.

I am of the opinion that all of these methods should be declared as extern and should never have software fallbacks.

The biggest advantage of this is the runtime can avoid shipping software fallback code that never needs to execute.

The biggest disadvantage of this is test environments for the various possibilities are not easy to come by. Fallbacks provide a functionality safety net in case something gets missed.

tannergooding commented 7 years ago

The biggest disadvantage of this is test environments for the various possibilities are not easy to come by.

@sharwell, what possibilities are you envisioning?

The way these are currently structured, proposed, the user would code:

public static double Cos(double x)
{
    if (x86.FMA3.IsSupported)
    {
        // Do FMA3
    }
    else if (x86.SSE2.IsSupported)
    {
        // Do SSE2
    }
    else if (Arm.Neon.IsSupported)
    {
        // Do ARM
    }
    else
    {
        // Do software fallback
    }
}

Under this, the only way a user is faulted is if they write a bad algorithm or if they forget to provide any kind of software fallback (and an analyzer to detect this should be fairly trivial).

redknightlois commented 7 years ago

running vectorized algorithms on hardware without vectorization support can cause performance degradation.

I would rephrase @tannergooding thought into: "running vectorized algorithms on hardware without vectorization support will with utmost certainty cause performance degradation."

fiigii commented 7 years ago

For this specifically, how would you expect the user to check which instruction subsets are supported? AES and POPCNT are separate CPUID flags and not every x86 compatible CPU may always provide both.

@tannergooding We defined an individual class for each instruction set (except SSE and SSE2) but put certain small classes into the Other.cs file. I will update the proposal to clarify.

// Other.cs
namespace System.Runtime.CompilerServices.Intrinsics.X86
{
    public static class LZCNT
    {
     ......
    }

    public static class POPCNT
    {
    ......
    }

    public static class BMI1
    {
     .....
    }

    public static class BMI2
    {
     ......
    }

    public static class PCLMULQDQ
    {
     ......
    }

    public static class AES 
    {
    ......
    }
}

tannergooding commented 7 years ago

AOT compilation, however, the compiler generates CPUID checking code that would return different values each time it is called (on different hardware).

I don't think this needs to be true all the time. In some cases, the AOT can drop the check altogether, depending on the target operating system (Win8 and above require SSE and SSE2 support, for example).

In other cases, the AOT can/should drop the check from each method and should instead aggregate them into a single check at the highest entry point.

Ideally, the AOT would run CPUID once during startup and cache the results as globals (honestly, if the AOT didn't do this, I would log a bug). The IsSupported check then becomes essentially a lookup of the cached value (just like a property normally behaves). This behavior is what the CRT implementations do to ensure that things like cos(double) remain performant and that they can still run FMA3 code where supported.

benaadams commented 7 years ago

For AOT compilation, however, the compiler generates CPUID checking code that would return different values each time it is called (on different hardware).

The implication would be from a usage perspective:

For Jit we could be quite granular on the checks as they are no-cost branch eliminated.

For AOT we'd need to be quite course on the checks and perform it at algorithm or library level, to offset the cost of CPUID; which may push it much higher than intended e.g. you wouldn't use a vectorized IndexOf; unless your strings were huge because CPUID would dominate.

Probably could still cache on AOT in startup, so it would set the property; it wouldn't branch eliminate, but would be fairly low cost?

fiigii commented 7 years ago

I understand that with SSE and SSE2 being required in RyuJIT this makes sense, but I would almost prefer an explicit SSE class to have a consistent separation. I would essentially expect a 1-1 mapping of class to CPUID flag.

I think it will be simpler to just have a distinct class for each instruction set. I'd like for it to be very obvious how and where new things should be added. If there's a "grab bag" sort of type, then it becomes a bit murkier and we have to make lots of unnecessary judgement calls.

@tannergooding @mellinoe The current design intent of class SSE2 is to make more intrinsic functions friendly to users. If we had two classes SSE and SSE2, certain intrinsics would loose the generic signature. For example, SIMD addition only supports float in SSE, and SSE2 complements other types.

public static class SSE
{
    // __m128 _mm_add_ps (__m128 a,  __m128 b)
    public static Vector128<float> Add(Vector128<float> left,  Vector128<float> right);
}

public static class SSE2
{
    // __m128i _mm_add_epi8 (__m128i a,  __m128i b)
    public static Vector128<byte> Add(Vector128<byte> left,  Vector128<byte> right);
    public static Vector128<sbyte> Add(Vector128<sbyte> left,  Vector128<sbyte> right);

    // __m128i _mm_add_epi16 (__m128i a,  __m128i b)
    public static Vector128<short> Add(Vector128<short> left,  Vector128<short> right);
    public static Vector128<ushort> Add(Vector128<ushort> left,  Vector128<ushort> right);

    // __m128i _mm_add_epi32 (__m128i a,  __m128i b)
    public static Vector128<int> Add(Vector128<int> left,  Vector128<int> right);
    public static Vector128<uint> Add(Vector128<uint> left,  Vector128<uint> right);

    // __m128i _mm_add_epi64 (__m128i a,  __m128i b)
    public static Vector128<long> Add(Vector128<long> left,  Vector128<long> right);
    public static Vector128<ulong> Add(Vector128<uint> left,  Vector128<ulong> right);

    // __m128d _mm_add_pd (__m128d a, __m128d b)
    public static Vector128<double> Add(Vector128<double> left,  Vector128<double> right);
}

Comparing to SSE2.Add<T>, the above design looks complex, and users have to remember SSE.Add(float, float) and SSE2.Add(int, int). Additionally, SSE2 is the bottom-line of RyuJIT code generation for x86/x86-64, seperating SSE from SSE2 has no advatage on functionality or convenience.

Although the current design (class SSE2 including SSE and SSE2 intrinsics) hurts API consistency, there is a trade-off between design consistency and user experience, which is worth discussing.

benaadams commented 7 years ago

Rather than X86 maybe x86x64 as x86 is often used to donate 32-bit only?

nietras commented 7 years ago

Very excited we are finally seeing a proposal for this. My initial thoughts below.

AVX-512 is missing, probably since it is not that widespread yet, but I think it would be good to at least give this some thought and how to structure these because AVX-512 feature set is very fragmented. In this case I would assume we need to have a class for each set i.e. (see https://en.wikipedia.org/wiki/AVX-512):

public static class AVX512F {} // Foundation 
public static class AVX512CD {} // Conflict Detection
public static class AVX512ER {} // Exponential and Reciprocal
public static class AVX512PF {} // Prefetch Instructions
public static class AVX512BW {} // Byte and Word
public static class AVX512DQ {} // Doubleword and Quadword
public static class AVX512VL {} // Vector Length
public static class AVX512IFMA {} // Integer Fused Multiply Add (Future)
public static class AVX512VBMI {} // Vector Byte Manipulation Instructions (Future)
public static class AVX5124VNNIW {} // Vector Neural Network Instructions Word variable precision (Future)
public static class AVX5124FMAPS {} // Fused Multiply Accumulation Packed Single precision (Future)

and add a struct Vector512<T> type, of course. Note that the latter two AVX5124VNNIW and AVX5124FMAPS are hard to read due to number 4.

Some of these can have a huge impact for deep learning, sorting etc.

Regarding Load I have some concerns as well. As @redknightlois I think void* should be considered too, but more importantly also load from/store to ref. Given this, perhaps these should be relocated to the "generic"/platform-agnostic namespace and type, since assumably all platforms should support load/store for a supported vector size. So something like (not sure where we could put this, and how naming should be done, if it can be moved to platform agnostic type.

[Intrinsic]
public static unsafe Vector256<sbyte> Load(sbyte* mem) { throw new NotImplementedException(); }
[Intrinsic]
public static unsafe Vector256<sbyte> LoadSByte(void* mem) { throw new NotImplementedException(); }
[Intrinsic]
public static unsafe Vector256<sbyte> Load(ref sbyte mem) { throw new NotImplementedException(); }
[Intrinsic]
public static unsafe Vector256<byte> Load(byte* mem) { throw new NotImplementedException(); }
[Intrinsic]
public static unsafe Vector256<sbyte> LoadByte(void* mem) { throw new NotImplementedException(); }
[Intrinsic]
public static unsafe Vector256<byte> Load(ref byte mem) { throw new NotImplementedException(); }
// Etc.

The most important thing here is if ref can be supported as it would be essential for supporting generic algorithms. Naming should be revised no doubt, but just trying to make a point. If we want to support load from void* method name needs to include return type or method needs to be on type specific static class.

4creators commented 7 years ago

It's great we are discussing a concrete proposal right now. 😄

The above linked const keyword usage language proposal was created explicitly to provide support for some of SIMD instructions requiring immediate parameters. I think it will be straightforward to implement but since it may delay introduction of intrinsics there were strong arguments in favor of going with simple attribute implementation first and later expand C# syntax and API by including support for const method parameters.
IMO we have to discuss in parallel forward looking designs which comprise two different areas:
- System.Numerics API which can be partially implemented with support of discussed here x86 intrinsics
- Intrinsics API which should comprise other architectures as well as this will have an impact on final shape of the intrinsics API

Intrinsics

Namespace and assembly

I would propose to move intrinsics to separate namespace located relatively high in hierarchy and each platform specific code into separate assembly.

System.Intrinsics general top level namespace for all intrinsics System.Intrinsics.X86 x86 ISA extensions and separate assembly System.Intrinsics.Arm ARM ISA extensions and separate assembly System.Intrinsics.Power Power ISA extensions and separate assembly System.Intrinsics.RiscVRiscV ISA extensions and separate assembly

Reason for the above division is large API area for every instruction set i.e. AVX-512 will be represented by more than 2 000 intrinsics in MsVC compiler. This same will be true for ARM SVE very soon (see below). Size of the assembly due to string content only won't be small.

Register sizes (currently XMM, YMM, ZMM - 128, 256, 512 bits in x86)

Current implementations support limited set of register sizes:

128, 256, 512 bits in x86
128 in ARM Neon and IBM Power 8 and Power 9 ISA

However, ARM recently published:

ARM SVE - Scalable Vector Extensions

see: The Scalable Vector Extension (SVE), for ARMv8-A published on 31 March 2017 with status Non-Confidential Beta.

This specification is quite important as it introduces new register sizes - altogether there are 16 register sizes which are multiples of 128 bits. Details are on page 21 of the specification (table is below).

armv8_sve_beta

Maximum vector length: 2048 bits
Required vector lengths: 128, 256, 512, 1024 bits
Permitted vector lengths: 384, 640, 768, 896, 1152, 1280, 1408, 1536, 1664, 1792, 1920

It would be necessary to design API which is capable to support in near future 16 different register sizes and several thousands (or tens of thousands) of opcodes/functions (counting with overloads). Predictions of not having 2048 bit SIMD instructions in couple of years seems to have been falsified to anyone's surprise by ARM this year. Looking at history (ARM published public beta of ARMv8 ISA on 04 September 2013 and first processor implementing it was available to users globally in October 2014 - Samsung Galaxy Note 4) I would expect that first silicon with SVE extensions will be available in 2018. I suppose this would be most probably very close in time to public availability of DotNet SIMD intrinsics.

I would like to propose:

Vectors

Implement basic Vectors supporting all register sizes in System.CoreLib.Private

namespace System.Numerics
{
    [StructLayour(LayoutKind.Explicit)]
    public unsafe struct Register128
    {
        [FieldOffset(0)]
        public fixed byte [16];
        .....
        // accessors for other types    
    }

    // ....

    [StructLayour(LayoutKind.Explicit)]
    public unsafe struct Register2048
    {
        [FieldOffset(0)]
        public fixed byte [256];
        .....
        // accessors for other types    
    }

    public struct Vector<T, R> where T, R: struct
    {
    }

    public struct Vector128<T>  :  Vector<T, Register128>
    {
    }

    // ....

    public struct Vector2048<T>  :  Vector<T, Register2048>
    {
    }
}

System.Numerics

All safe APIs would be exposed via Vector and VectorXXX structures and implemented with support of intrinsics.

System.Intrinsics

All vector APIs will use System.Numerics.VectorXXX.

public static Vector128<byte> MultiplyHigh<Vector128<byte>>(Vector128<byte> value1, Vector128<byte> value2);
public static Vector128<byte> MultiplyLow<Vector128<byte>>(Vector128<byte> value1, Vector128<byte> value2);

Intrinsics APIs will be placed in separate classes according to functionality detection patterns provided by processors. In case of x86 ISA this would be one to one correspondence between CPUID detection and supported functions. This would allow for easy to understand programming pattern where one would use functions from given group in way consistent with platform support.

Main reason for that kind of division is a requirement set by silicon manufacturers to use instructions only if they are detected in hardware. This allows for example to ship processor with support matrix comprising SSE3 but not SSSE3, or comprising PCLMULQDQ and SHA and not AESNI. This direct class - hardware support detection correspondence is the only safe way of having IsHardwareSupported detection and be compliant with Intel/AMD instruction usage restrictions. Otherwise kernel will have to catch for us #UD exception 😸

Mapping APIs to C/C++ intrinsics or to ISA opcodes

Intrinsics abstract usually in 1 to 1 way ISA opcodes however there are some intrinsics which map to several instructions. I would prefer to abstract opcodes (using nice names) and implement multi opcode intrinsics as functions on VectorXxx.

4creators commented 7 years ago

@nietras

Given this, perhaps these should be relocated to the "generic"/platform-agnostic namespace and type, since assumably all platforms should support load/store for a supported vector size.

The best place would be System.Numerics.VetorXxx<T>

jkotas commented 7 years ago

all platforms should support load/store for a supported vector size

Is the platform agnostic Load/Store any different from the existing Unsafe.Read/Write?

nietras commented 7 years ago

Is the platform agnostic Load/Store any different from the existing Unsafe.Read/Write?

@jkotas I had the same thought, how do those tie in with Unsafe? I assume these would be unaligned then, and we can only use aligned via LoadAligned/StoreAligned...

Or could we add Unsafe.ReadAligned/WriteAligned and have the JIT recognize these for the vector types?

pentp commented 7 years ago

IsSupported should be a property (or a static readonly field) like IntPtr.Size or BitConverter.IsLittleEndian.

Combining SSE and SSE2 into a single class looks like a good trade-off for a simpler Add function.

Like @redknightlois and @nietras I'm also concerned about the Load/Store API. ref support is needed to avoid fixed references. For void* Load/Store generics could help:

[Intrinsic]
public static extern unsafe Vector256<T> Load<T>(void* mem) where T : struct;
[Intrinsic]
public static extern unsafe Vector256<sbyte> Load(sbyte* mem);
[Intrinsic]
public static extern Vector256<sbyte> Load(ref sbyte mem);
[Intrinsic]
public static extern unsafe Vector256<byte> Load(byte* mem);
[Intrinsic]
public static extern Vector256<byte> Load(ref byte mem);
// Etc.

Looking forward to using PDEP/PEXT!

tannergooding commented 7 years ago

I would propose to move intrinsics to separate namespace located relatively high in hierarchy and each platform specific code into separate assembly.

Reason for the above division is large API area for every instruction set i.e. AVX-512 will be represented by more than 2 000 intrinsics in MsVC compiler. This same will be true for ARM SVE very soon (see below). Size of the assembly due to string content only won't be small.

@4creators, I am vehemently against moving this feature higher in the hierarchy.

For starters, the runtime itself has to support any and all intrinsics (including the strings to identify them, etc) regardless of where we put them in the hierarchy. If the runtime doesn't support them, then you can't use them.

I also want to be able to consume these intrinsics from all layers of the stack, including System.Private.CoreLib. I want to be able to write managed implementations of System.Math, System.MathF, various System.String functions, etc. Not only does this increase maintainability of the code (since most of these are FCALLS or hand tuned assembly today) but it also increases cross-platform consistency (where the resulting FCALL or assembly is part of the underlying C runtime).

4creators commented 7 years ago

@pentp

Combining SSE and SSE2 into a single class looks like a good trade-off for a simpler Add function.

I do not think that intrinsics should abstract anything - instead simple add can be created on Vector128 - Vector2048. On the other hand it would be openly against Intel usage recommendations.

4creators commented 7 years ago

I also want to be able to consume these intrinsics from all layers of the stack, including System.Private.CoreLib. I want to be able to write managed implementations of System.Math, System.MathF, various System.String functions, etc.

@tannergooding Agree that it has to be available from System.Private.CoreLib

However it doesn't mean that it has to be low in hierarchy. No one will ship runtime (vm, gc, jit) which will support all intrinsics for all architectures. Division line goes through ISA plane - x86, Arm, Power. There is no reason to ship ARM intrinsics on x86 runtime. Having it in separate platform assembly in coreclr which could be referenced (circularly) by System.Private.CoreLib could be a solution (I think that a bit better than ifdefing everything)

tannergooding commented 7 years ago

The current design intent of class SSE2 is to make more intrinsic functions friendly to users. If we had two classes SSE and SSE2, certain intrinsics would loose the generic signature.

@fiigii, why does separating these out mean we lose the generic signature?

The way I see it, we have two options:

Explicitly lists the types out Vector128<float> Add(Vector128<float> left, Vector128<float> right)
- This enforces type safety but increases the number of exposed APIs
Use generics Vector128<T> Add<T>(Vector128<T> left, Vector128<T> right)
- This decreases the number of exposed APIs but loses the enforced compiler type safety
- Some functions will require multiple generics (casts, for example, require <T, U>, and this can potentially become even more complex elsewhere)

I see no reason why we can't have SSE and SSE2 and why we can't just have both expose Vector128<T> Add<T>(Vector128<T> left, Vector128<T> right).

That being said, I personally prefer the enforced form that requires additional APIs to be listed. Not only does this help enforce that the user is passing the right things down to the API, but it also decreases the number of checks the JIT must do.

Vector128<float> means T has already been enforced/validated as part of the API contract, Vector128<T> means the JIT must validate T is of a correct/supported type. This could potentially change from one runtime to the next (depending on the exact set of intrinsics the runtime was built to support) which can make this even more confusing.

tannergooding commented 7 years ago

However it doesn't mean that it has to be low in hierarchy. No one will ship runtime (vm, gc, jit) which will support all intrinsics for all architectures. Division line goes through ISA plane - x86, Arm, Power. There is no reason to ship ARM intrinsics on x86 runtime. Having it in separate platform assembly in coreclr which could referenced (circularly) by System.Private.CoreLib could be a solution.

I could get behind this. The caveats being that:

The reference assembly has to list all the APIs regardless
The JIT probably needs special support so that it doesn't throw an exception when it tries to compile my function (which has paths for x86 and ARM) on one of the architectures and it doesn't find the APIs for the other architecture.

tannergooding commented 7 years ago

Is the platform agnostic Load/Store any different from the existing Unsafe.Read/Write?

@jkotas, I think the primary difference is that Load/Store will compile down to a SIMD instruction and will likely go directly into a register for most cases.

jkotas commented 7 years ago

Having it in separate platform assembly in coreclr which could be referenced (circularly) by System.Private.CoreLib could be a solution

Circular references are non-starter. The existing solution for this problem is to have a subset required by CoreLib in CoreLib as internal, and the full blown (duplicate) implementation in separate assembly. Though, it is questionable whether this duplication in the sake of layering is really worth it.

Another thought about naming. The runtime/codegen has many intrinsics today all over the place, for example methods on System.Threading.Interlocked or System.Runtime.CompilerServices.RuntimeHelpers are implemented as intrinsics.

Should the namespace name be more specific to capture what actually goes into it, say System.Runtime.HardwareIntrinsics?

4creators commented 7 years ago

Code bloat due to Register128 ... Register2048 design

Providing we would like to have direct access to numeric types encoded in RegisterXxx structures - similar to current System.Numerics.Register implementation which is IMO a good design - one would need to create (rather generate) total of 10 064 fields with the following pattern:

namespace System.Numerics
{
    [StructLayout(LayoutKind.Explicit)]
    public unsafe struct Register128
    {
        public fixed byte Reg[16];
        // System.Byte Fields
        [FieldOffset(0)]
        public byte byte_0;
        [FieldOffset(1)]
        public byte byte_1;
        [FieldOffset(2)]
        public byte byte_2;
        // System.SByte Fields
        // etc.

Specifically due to this problem there exists solution proposal based on extended generics syntax: Const blittable parameter as a generic type parameter (https://github.com/dotnet/csharplang/issues/749)

namespace System.Numerics
{
    public unsafe struct Register<T, const int N>
    {
        public fixed T Reg[N];
    }

    public struct Vector128<T> : Vector<T, Register<T, 16>> {}

Later by specialising generics one can easily create required struct tree.

jkotas commented 7 years ago

Load/Store will compile down to a SIMD instruction and will likely go directly into a register for most cases.

Unsafe.Load/Store compiles into a SIMD instruction for the right sized structs today.

4creators commented 7 years ago

Circular references are non-starter. The existing solution for this problem is to have a subset required by CoreLib in CoreLib as internal, and the full blown (duplicate) implementation in separate assembly. Though, it is questionable whether this duplication in the sake of layering is really worth it.

@jkotas @tannergooding This settles this problem since duplicate implementation for API comprising roughly 10k functions ...

tannergooding commented 7 years ago

Unsafe.Load/Store compiles into a SIMD instruction for the right sized structs today.

This may be the case implicitly, but it is not explicit in the API (which is the case for Vector128<float> SSE.Load(float* address)). It is also implicit on whether this is an aligned read/write or if it is unaligned.

One of my favorite features of this proposal is that the APIs are very explicit. If I say LoadAligned, I know that I am going to get the MOVAPS instruction (with no "ifs" "ands", or "buts" about it). If I say LoadUnaligned, I know I am going to get the MOVUPS instruction.

4creators commented 7 years ago

Should the namespace name be more specific to capture what actually goes into it, say System.Runtime.HardwareIntrinsics

Simple calculation for assembly size difference for functions defined as

public static void System.Runtime.CompilerServices.Intrinsics.AVX2::ShiftLeft
public static void System.Intrinsics.AVX2::ShiftLeft

for 5 000 functions is 250 KB.

jkotas commented 7 years ago

duplicate implementation for API comprising roughly 10k functions ...

The stuff duplicated in CoreLib would be just say the 50 functions that are actually needed in CoreLib.

jkotas commented 7 years ago

for 5 000 functions is 250 KB.

How did you come up with this number? The namespace name is stored in the managed binary just once. The difference between ShortNameSpace and VeryLoooooooooooooooooongNameSpace should be always ~20 bytes, independent on how many functions are contained in the namespace.

4creators commented 7 years ago

The stuff duplicated in CoreLib would be just say the 50 functions that are actually needed in CoreLib.

This would solve the problem of shipping all architectures together 😄

tannergooding commented 7 years ago

As to all the statements around things like exposing ref or void* (@pentp, @nietras, @redknightlois) and also as to whether or not a software fallback should be provided.

ref might be worth exposing

Unsafe.ToPointer solves part of this but requires users to take a separate dependency. It also means that corlib has more trouble dealing with ref

void* is probably not worth exposing. Just cast to the appropriate type (float*)((void*)(p)).

We can't override based on return type so void* means we either need unique method names or we have to use <T> and have the JIT perform validation

It may already be obvious by my existing statements, but I believe these APIs should be explicit but also simple:

We should use Vector128<float> instead of Vector128<T> as this enforces compile time checks and removes JIT overhead
We should have APIs like Load and Store as part of this and not rely on things elsewhere (System.Runtime.CompilerServices.Unsafe).
- The other APIs, where applicable, should be updated to call the intrinsic functions instead
We should enforce all functions to be extern
- If software fallbacks were provided, CoreFX itself would/should never use them
- Due to perf and other reasons, consumers should never rely on or use the software fallbacks anyways
- This enforces the JIT/AOT to understand the method or have it fail
- We can always expose a wrapper API at a higher level (read as CoreFXExtensions or third party repo) that provides software fallbacks for each instruction

4creators commented 7 years ago

How did you come up with this number?

@jkotas from CIL spec which states that CIL does not have implementation of namespaces and recognises methods by their full name, however, I understand I should check PE file specs - my bad.

tannergooding commented 7 years ago

Rather than X86 maybe x86x64 as x86 is often used to donate 32-bit only?

@benaadams, In the same veign x86-64 is sometimes used to denote the 64-bit only version of the x86 instruction set, so this would be confusing as well (https://en.wikipedia.org/wiki/X86-64)

I think that x86 makes the most sense and is used most frequently to refer to the entire platform.

At least for Wikipedia:

x86 refers to the 16, 32, and 64-bit implementations (https://en.wikipedia.org/wiki/X86)
IA-32 or i386 refers to the 32-bit implementation (https://en.wikipedia.org/wiki/IA-32)
- It is sometimes referred to as x86
x86-64, x64, x86_64, AMD64, and Intel64 are used to refer to the 64-bit implementation (https://en.wikipedia.org/wiki/X86-64)

4creators commented 7 years ago

It seems it won't be simple API and it would require multiple design decisions - is it possible to start working on details of it in CoreFXLabs or separate branch in coreclr/corefx?

Separate repo would support issue tracking system which IMO would be needed to get it done fast and efficiently.

tannergooding commented 7 years ago

It seems it won't be a simple API and it would require multiple design decisions - is it possible to start working on details of it in CoreFXLabs or separate branch in coreclr/corefx?

I'm going to second this. I think it would be worthwhile to get the basic API shape (as proposed) up in CoreFXLabs and to "use" it in a real-scenario.

I would propose we take Vector2, Vector3, and Vector4 and reimplement them to call the APIs as per https://github.com/Microsoft/DirectXMath and potentially do the same for Cos, Sin, and Tan in Math/MathF.

Although we won't get any perf numbers from this and we won't be able to run the code, it will let us view the use case in "real-world" scenarios to get a better feel for what makes the most sense and what the strengths/deficiencies of the proposal (and any suggested modifications to the proposal).

jkotas commented 7 years ago

Although we won't get any perf numbers

To get perf numbers, it should be fine to add some support for this in the JIT (without exposing it in the stable shipping profile) and experiment with the API shape in corefxlab.

nietras commented 7 years ago

Unsafe.ToPointer solves part of this

@tannergooding leaving a GC hole or requiring pinning, which is specifically what we want to avoid ;) ref is essential for generic Span<T> based algorithms, without the need for pinning. Unsafe.Read/Write should work too. I want both apples ;)

We should have APIs like Load and Store as part of this and not rely on things elsewhere (System.Runtime.CompilerServices.Unsafe).

Agreed, and I am not saying that. But Unsafe.Read/Write<Vector128<T>> should still work. That is a must in my view. Otherwise, generic code becomes very difficult, which can handle different vector registers, basic types etc.

sharwell commented 7 years ago

💭 ❓ Would these new vector types be candidates for being ref struct instead of just struct?

nietras commented 7 years ago

void is probably not worth exposing. Just cast to the appropriate type (float)((void*)(p)).

@tannergooding you can't do that in generic code. I think it would be good to consider algorithms that are generic too, lots of things could be done here in a generic way exposing many numerical operations on say images without the need for a hand tailored loop for each operation. There are many many cases where generic code could be made with this.

nietras commented 7 years ago

I don't see any issue with an API with static methods for void* e.g.

public class Vector128<T>
{
    public static Vector128<T> Load(void* p);
}

The JIT of course has to handle this, but shouldn't that be rather straightforward. My assumption here is that if Vector128<T>.IsSupported then you must be able to Load and Store so these do not have to be in platform specific places.

If they do, then yes we need something like Vector<128> SSE2.LoadInt(void* p) and in some cases even AVX512VL.LoadInt256(void* p) maybe... ugly naming aside. Otherwise, out could be a fallback although it makes code cumbersome, less so with C# 7.

void* p = ...;
AVX512VL.LoadAligned(p, out Vector256<int> v);

It is not that much more cumbersome when viewed from this. And hopefully has no perf issues.

benaadams commented 7 years ago

Don't think void is needed? Just a ref version. Can convert void to ref with Unsafe.AsRef

e.g.

void* input;
ref Unsafe.AsRef<Vector<short>>(input);

nietras commented 7 years ago

Don't think void* is needed? Just a ref version.

Yes I could live with that, in fact I would go as far as say why have any pointer versions at all. These should solely be based on ref. A pointer can easily be converted to a ref and this way all scenarios are supported (pointers, span, refs, Unsafe etc.). And without any perf issues I imagine.

namespace System.Runtime.CompilerServices.Intrinsics.X86
{
    public static class AVX
    {
        ......        
        // __m256i _mm256_loadu_si256 (__m256i const * mem_addr)
        [Intrinsic]
        public static unsafe Vector256<sbyte> Load(ref sbyte mem) { throw new NotImplementedException(); }
        // __m256i _mm256_loadu_si256 (__m256i const * mem_addr)
        [Intrinsic]
        public static unsafe Vector256<byte> Load(ref byte mem) { throw new NotImplementedException();         
        ......
}

Usage with pointer would be a little more cumbersome, but not a big deal for me.

dotnet / runtime