Consider refactor out SIMD support from System.Numerics into System.Unsafe.Simd

redknightlois commented 8 years ago

TL;DR: The design guidelines for System.Numerics conflict with general SIMD needs.

As stated at https://github.com/dotnet/corefx/issues/10931, until that happens, support for SIMD will be constrained by design decisions made when XNA was still current (about 8 years ago). While Numerics is a good way to support the most common math-oriented vector operations, the SIMD world is far bigger.

That SIMD support is tied to System.Numerics is an artificial design decision that has hampered the ability to support true SIMD with many of the most important instructions, such as advanced bit manipulation, bit packing, shuffles, permutations, etc.

My proposal is to leave System.Numerics alone, separate the raw SIMD support into a different namespace, and follow a simpler approach for it. Then System.Numerics can use that support like anyone else.

There are many issues already opened and discussed that exist because SIMD support lives with System.Numerics instead of as standalone primitives that can be used by any library (as is done in C/C++ through compiler intrinsics).

The ones I track are:

The upside of supporting SIMD as primitives is that if the plumbing is there for the JIT to do that, and we need a new instruction, we can just go and add it in an up-for-grabs task.

cc @mellinoe @CarolEidt @terrajobst

terrajobst commented 8 years ago

@redknightlois

Can you provide some code snippets of the code you'd like to write that our System.Numerics.Vectors doesn't support today?

The key goals of our SIMD support were:

Make it easy for people from the gaming/graphics space to use pre-canned types.
Have a general purpose, hardware-independent SIMD representation that allows advanced developers to exploit the hardware they run on.

Before we introduce a new assembly with a different shape I'd like to understand if we can grow the current implementation to become more flexible/powerful.

redknightlois commented 8 years ago

@terrajobst That's the problem: I build databases, not games :)

Examples of things I personally can't implement today (they have performance issues without SIMD):

Better bitmap performance with Roaring bitmaps (http://arxiv.org/pdf/1402.6407.pdf). Without popcnt, performance is paltry.
Adapting Tree Structures for Processing with SIMD Instructions (https://openproceedings.org/2014/conf/edbt/ZeuchFH14.pdf)
Efficient Implementation of Sorting on Multi-Core SIMD CPU Architecture (http://www.opencirrus.intel-research.net/publications/sorting_vldb08.pdf)
Space-Efficient, High-Performance Rank & Select Structures on Uncompressed Bit Sequences (http://www.cs.cmu.edu/~dga/papers/zhou-sea2013.pdf)
FAST: fast architecture sensitive tree search on modern CPUs and GPUs (http://www.cs.toronto.edu/~ryanjohn/teaching/csc2531-f11/slides/Andy-FAST.pdf or http://dl.acm.org/citation.cfm?doid=1807167.1807206)

The main issue is that "Make it easy for people from the gaming/graphics space to use pre-canned types", while a laudable design goal, conflicts with "Have a general purpose, hardware-independent SIMD representation that allows advanced developers to exploit the hardware they run on." You just cannot achieve both, because most of the hardware-independent SIMD instructions are not designed for gaming/graphics.

And I didn't include prefetching at the different levels of the memory hierarchy in the links (that is needed too), nor non-temporal writes, etc. Those two would need to be addressed eventually too. Nor do I have enough experience to say why it doesn't work very well for hardcore math code either, but you can ask @cdrnet; I am sure he has his own understanding of the shortcomings of the library for general math work too.

Those goals are entirely different beasts and need to be addressed individually. However, judging from what happened in the C/C++ world of games, if you provide the second, each advanced developer will have the tools to create advanced gaming/graphics libraries and types anyway. I won't say trust me, because it doesn't add anything when said on the internet, but I have been programming GPUs since implementing bump-mapping required the use of register combiners on the Geforce 2 with the TL Engine back in 2001 (I feel a bit old already :D).

terrajobst commented 8 years ago

Before we go into the details, please understand that I'm not trying to push back on anything you said. My goal is simply to separate "there is a feature missing that we should add" from "the API shape and concepts we provide are inappropriate".

> The main issue is that "Make it easy for people from the gaming/graphics space to use pre-canned types", while a laudable design goal, conflicts with "Have a general purpose, hardware-independent SIMD representation that allows advanced developers to exploit the hardware they run on."

How so? The first is provided by the fixed-size vector types and the matrix types. The second is provided by the generic Vector<T> type. This API has no bias towards gaming, although we know that we haven't exposed all the SIMD operations yet, e.g. swizzling.

> Those goals are entirely different beasts and require to be addressed individually.

Agreed. That's why both scenarios use independent APIs. So I don't think there is an inherent conflict. But I'm sure our SIMD support is incomplete and needs to be extended :smile:

I saw that you linked a bunch of stuff -- thanks for that -- but I'd be curious to see your take on actual realizations of those. What kind of API shape do you think we should offer to address those? I'm not trying to get you to build the feature here; I'm merely trying to get a handle on your requirements and how you think about the problem space. So sketches are more than fine :smile:

redknightlois commented 8 years ago

Without going into much detail, not having:

Shuffle & Permutations: kills like 60% of those in the first line of code.
Popcnt: kills the rest (SW popcnt performance is as baaaad as you can think) :)

But there are other algorithms that you just don't even think about because they involve:

Packed minimum/maximum for different integer operand types
Packed Compare Masking
Combined mask-shift instructions
Data shuffle and unpacking
Cache Control instructions like prefetchT0, ... , prefetchNTA
Non-temporal moves.

Also, coercing data types in an easy way. For example, there are instructions that handle ints whose output is then fed into an instruction that interprets it as bytes. Today, dealing with algorithms that do that is just plain awful.

Call me old school, but for me the best blueprint for this support is compiler intrinsics as done in VC++. You have functions that work against a memory pointer and/or some very unsafe Register data type. Then you can implement System.Numerics or whatever using those method calls, which will translate almost 1-to-1 into SIMD instructions when available.

redknightlois commented 8 years ago

I know you want examples... so here are a few which are relevant today (as in, people are actually trying to overcome this issue in the CoreCLR source code).

using System.Unsafe; // proposed (hypothetical) namespace

public unsafe void MemoryCopy(void* dest, void* src, int length)
{
    Memory.Prefetch(src, 4096); // Will prefetch the next 4096 bytes, filling entire cache lines.
    {
        Jit.NonTemporal(dest); // The JIT will assume non-temporal stores on that pointer until this block ends.

        // Do the actual copy from src to dest...
    }
}

Or things like this:

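using System.Unsafe; // proposed (hypothetical) namespace

// length is taken to be the number of ints (assumed from the loop bound below).
public unsafe int Popcnt(void* src, int length)
{
    int* ptr = (int*)src;

    Memory.Prefetch(ptr, 4096); // Will prefetch the next 4096 bytes, filling entire cache lines.
    int a = 0; int b = 0; int c = 0; int d = 0;
    for (int i = 0; i < length / 4; i++)
    {
        // Four independent accumulators avoid a dependency chain between iterations.
        a += Bits.PopCount(ptr[0]);
        b += Bits.PopCount(ptr[1]);
        c += Bits.PopCount(ptr[2]);
        d += Bits.PopCount(ptr[3]);

        ptr += 4;
    }
    int result = a + b + c + d;

    // Add the remainder (the last length % 4 ints).

    return result;
}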

redknightlois commented 8 years ago

How would Vector3<int> get implemented (let's say .Add())?

using System.Unsafe; // proposed (hypothetical) namespace

public Vector3<T> Add(Vector3<T> op1, Vector3<T> op2) where T : struct
{
    if (typeof(T) == typeof(int))
    {
        Vector3<int> result;
        Arithmetics.Add3i((Register)op1, (Register)op2, (Register)result); // This is an intrinsic
        return (Vector3<T>)(object)result;
    }
    // ... the other types.
}

Register is a very unsafe data type that maps straight to memory (in the stack or the backing storage of choice --- aka a pointer ---)

benaadams commented 8 years ago

Currently you can cast the Vector types between each other quite easily without actual conversion via Vector.AsVectorXXX(v).

I'm very keen to have greater access to the hardware intrinsics, with software fallbacks. I don't have a feel for what the right api would be but the C/C++ ones are horribly named, so hopefully the functionality could be captured better.

I think the Vectors currently do a good job for what they cover - though are obviously not feature complete.
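The reinterpreting casts mentioned above (Vector.AsVectorXXX) can be sketched as follows. This is only an illustration using the existing System.Numerics API; the class and helper names (ReinterpretDemo, AsBytes) are made up for the example, and the byte order you observe depends on the machine's endianness.

```csharp
using System;
using System.Numerics;

static class ReinterpretDemo
{
    // Reinterprets the bits of a Vector<int> as bytes.
    // No per-element conversion takes place; only the lane type changes.
    public static Vector<byte> AsBytes(Vector<int> v) => Vector.AsVectorByte(v);

    static void Main()
    {
        var ints = new Vector<int>(0x01020304);   // broadcast one value to every lane
        Vector<byte> bytes = AsBytes(ints);

        // The same register now exposes sizeof(int) times as many lanes.
        Console.WriteLine($"int lanes:  {Vector<int>.Count}");
        Console.WriteLine($"byte lanes: {Vector<byte>.Count}");

        // Which byte comes first depends on endianness, so it is only printed here.
        Console.WriteLine($"first byte: 0x{bytes[0]:X2}");
    }
}
```

An algorithm that alternates between int and byte views has to go through a call like this at every switch, which is the friction discussed in this thread.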

Ziflin commented 8 years ago

> Can you provide some code snippets of the code you'd like to write that our System.Numerics.Vectors doesn't support today?

I just added several things that our Game Engine's SIMD library would need here (there are probably a few more that I missed): https://github.com/dotnet/corefx/issues/10931#issuecomment-242270800

Unfortunately the Numerics.Vector library assumes everyone uses the same coordinate system, which is (possibly more unfortunately) not the case. The major engines and tools all seem to have their own, just as we have had for the last 15 years. Numerics.Vector should not have gone farther than providing coordinate system agnostic SIMD optimized functions. This would have allowed others to write their own extension methods and use it as they required.

The other option presented is to wrap the Numerics.Vector3/4 types with our own types. I'm fine with this if there are no performance penalties involved, but there are still several important SIMD functions that we need for many of our types to work.

All we (and likely any other engine developers) really want/need is a Vector4 type with enough SIMD operations exposed to convert our existing math libraries.

RussKeldorph commented 8 years ago

@dotnet/jit-contrib

redknightlois commented 8 years ago

> Currently you can cast the Vector types between each other quite easily without actual conversion via Vector.AsVectorXXX(v)

That is why it is so ugly to work with... there are several algorithms that switch back and forth between instructions to do so.

> I'm very keen to have greater access to the hardware intrinsics, with software fallbacks. I don't have a feel for what the right api would be but the C/C++ ones are horribly named, so hopefully the functionality could be captured better

I think they are horribly named too, but the approach is sound. That's why, when I sketched how it would work, I used more sensible names for them :)

mburbea commented 8 years ago

@benaadams , those vector.AsVectorXXX(V) methods sometimes use multiple registers and don't always generate good quality code.

I'd rather just expose more of the intrinsics.

Tornhoof commented 8 years ago

As I ported RoaringBitmaps to .NET, I can confirm that an intrinsic popcnt operation would be very useful, as it is used for pretty much every logical operation on bitsets. @WojciechMula has a great comparison of different algorithms and methods to count bits ("SSE Popcount"), and popcnt is several times faster than any competitor, especially for smaller array sizes.
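For context, this is roughly what the software fallback looks like when no popcnt instruction is reachable from managed code. It is a minimal sketch of the well-known SWAR bit-counting trick, not the code from any of the ports mentioned here; PopCountSketch, SoftwarePopCount and CountBits are names invented for this example.

```csharp
using System;

static class PopCountSketch
{
    // Classic SWAR population count: sums bits in 2-, 4- and 8-bit groups,
    // then adds the four byte sums together with a multiply.
    public static int SoftwarePopCount(uint v)
    {
        unchecked
        {
            v = v - ((v >> 1) & 0x55555555u);
            v = (v & 0x33333333u) + ((v >> 2) & 0x33333333u);
            v = (v + (v >> 4)) & 0x0F0F0F0Fu;
            return (int)((v * 0x01010101u) >> 24);
        }
    }

    // Cardinality of a bitset stored as 32-bit words.
    public static long CountBits(uint[] words)
    {
        long total = 0;
        foreach (uint w in words)
        {
            total += SoftwarePopCount(w);
        }
        return total;
    }

    static void Main()
    {
        Console.WriteLine(CountBits(new uint[] { 0x1u, 0x3u, 0x7u }));
    }
}
```

Each word costs about a dozen dependent ALU operations here, whereas a hardware popcnt is a single instruction, which is the gap the comments above are pointing at.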

jackmott commented 8 years ago

I would like to be able to use every intrinsic noted here: https://software.intel.com/sites/landingpage/IntrinsicsGuide/ and in ARM NEON as well.

As a practical example relevant to game development and graphics in general, see this SIMD noise library: https://github.com/Auburns/FastNoiseSIMD

You cannot do that currently in C#; various instructions are missing, such as:

gather
floor
blendv? (pretty sure)

Other instructions missing that would be useful for many things:

shuffle
hadd / hsub
extract

We could speed up some .NET core library functions with access to popcnt

mellinoe commented 8 years ago

> @benaadams , those vector.AsVectorXXX(V) methods sometimes use multiple registers and don't always generate good quality code. I'd rather just expose more of the intrinsics.

Not trying to be obtuse here, but how would refactoring the support out help this? If we think that the implementation is suboptimal, then we can improve it. It's a separate discussion from how the public interface is exposed, unless you are saying that the interface is limited in such a way that it's impossible to improve the performance.

> How Vector3 gets implemented (lets say .Add())?
> Arithmetics.Add3i( (Register) op1, (Register) op2, (Register) result ); // This is an intrinsic
> Register is a very unsafe data type that maps straight to memory (in the stack or the backing storage of choice --- aka a pointer ---)

I understand this is a high-level description and that it's not meant to be the final concrete proposal, but it seems a bit too abstract to me. At a fundamental level, this is not very dissimilar to how Vector<T> works right now, aside from naming. It's not clear to me how the casts work, at which point the JIT/VM intrinsics kick in, how the fallback logic is handled, etc. Once the proposal becomes more concrete, we might find ourselves stuck with some of the same challenges we have with Vector<T>. I do like the proposed usage and interface; I'm just concerned that it's taking too high-level a view.

redknightlois commented 8 years ago

@jackmott Still, that is game-development oriented, where many of the operations still make sense (even if tangentially) within the approach taken by System.Numerics. The real problem, and where we are completely stuck, is where the instruction set and its use are so different that it doesn't even make sense (prefetches, popcnt, etc.).

@mellinoe I fail to see how much closer to ground level that can be. That maps straight into the actual operation (paddd) over a memory location. Register is just a pointer in a very unsafe way, so you could cast void*, byte*, int*, long*, float* and double* into Register and vice versa without much trouble.

It's also very easy to create an example where Vector<T> simply doesn't work even for its intended use. Just consider how you would write code that handles a wide range of operations and can deal with the difference between paddd and the 2 types of vpaddd (the 256-bit and the 512-bit versions). The high-level representation simply doesn't cut it, unless you are completely guaranteed that you will have the whole set of instructions supported by Vector<T> for the sizes you have available. For a very narrow instruction set that is doable, but then you have to resort to JIT magic to do the rest, which gets unwieldy very fast.

Case in point: just try to figure out how to implement the whole AVX and AVX2 instruction set (https://software.intel.com/sites/landingpage/IntrinsicsGuide/#techs=AVX,AVX2) as a design exercise. The sheer amount of work would be daunting. However, implementing access to those intrinsics in a design rooted in the ideas I laid out can be done even by external contributors, once the support for doing so is provided at the JIT level.

For a more concrete example, the work that had to be done to support ror and rol is a kind of baseline level of effort you could actually measure. If you needed ror and rol, you could very well have provided an intrinsic like Bits.RotateLeft(ulong x, int shift) and been done with it. Instead there was a need to support 4 different ways to express the very same thing using shift and bitwise operations. You can check it for yourself (https://github.com/dotnet/coreclr/pull/1830) or just ask @erozenfeld, who implemented the morphers in question. Having been among the very few who actually exploited that optimization, while I applaud the effort, it is easy to see that trying to do the same for every single operation in, say, just AVX resembles the journeys of Don Quijote (known for fighting windmills). Chances are stacked for the journey to end badly.
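To make the ror/rol point concrete, here is one of the shift-and-or idioms that a JIT has to pattern-match in order to emit a single rotate instruction. It is a sketch under the assumption that callers write rotations by hand; RotateSketch and its methods are illustrative names, not the Bits.RotateLeft intrinsic proposed above.

```csharp
using System;

static class RotateSketch
{
    // Rotate left written with shifts and a bitwise OR. C# masks the shift
    // count of a 64-bit operand to its low 6 bits, so shift == 0 is safe.
    public static ulong RotateLeft(ulong x, int shift)
    {
        return (x << shift) | (x >> (64 - shift));
    }

    // The mirrored idiom for rotate right.
    public static ulong RotateRight(ulong x, int shift)
    {
        return (x >> shift) | (x << (64 - shift));
    }

    static void Main()
    {
        ulong top = 0x8000000000000000UL;
        Console.WriteLine(RotateLeft(top, 1));          // the top bit wraps around to bit 0
        Console.WriteLine(RotateRight(1UL, 1) == top);  // and back again
    }
}
```

Semantically equivalent variants (different operand order, explicit masking, subtraction on the other side) all need their own recognition rule, which is the scaling problem described above; an intrinsic would sidestep the recognition step entirely for new code.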

About the example: you also selected the one that is intended to be used in System.Numerics. For that reason, it is no coincidence that it looks exactly the same, mainly because there exists a 1-to-1 mapping between the actual hardware operation and the high-level operation provided by the Vector<T> abstraction. In that case what I am doing is just providing an extra indirection toward a lower-level representation, which also allows dealing with the distinction between paddd and vpaddd, which Vector<T> would probably never support, and IMHO for a very good reason.

EDIT: Just to make it clear, with the examples I wasn't actually trying to design how it would look; I was just trying to lay out the base ideas from which we could start constructing a library that would make sense.

redknightlois commented 8 years ago

While looking into an issue, I found two more issues that are also related to this one:

dotnet/runtime#6225 RyuJIT: Allow developers to provide Branch Prediction Information
dotnet/runtime#5869 RyuJIT: Provide access to the CPU prefetch instruction

jackmott commented 8 years ago

@redknightlois actually I don't have any problem with the approach in System.Numerics (as far as I know!) I just would like the extra instructions to be available. You can implement a popcnt fallback if the hardware instruction isn't there. Or do isHardwareAccelerated on a per instruction basis and let the user of the API deal with it.

The SIMD in C# is pretty nice to work with, I would love to just have full access to the CPU.

redknightlois commented 8 years ago

@jackmott Don't get me wrong, I don't have any problem either. It is pretty well designed for its intended use, which is as a foundation to exploit SIMD math in GameDev environments.

jackmott commented 8 years ago

@redknightlois Thinking more about the popcnt thing, I guess you are saying you also want a lib with general access to other intrinsics that don't fall into the Vectorized paradigm. I agree that would be nice. We could speed up the current implementation of F# core library Array.Filter if we could call popcnt.

Ziflin commented 8 years ago

Actually I'm only able to implement about 50% of our game engine's existing math library with what's there now. Without some of the other features I've mentioned most of the real-world performance improvements are impossible. I haven't seen much of a response to that, so I'm not clear what the future plans are still.

redknightlois commented 8 years ago

@jackmott let's keep it as: "You want general access to all the intrinsics available in your hardware", or at least a general mechanism that would allow contributors to add access to those hardware instructions via an intrinsic approach. Also, whether they are vectorized or not is of no relevance (even though most are).

@Ziflin I didn't want to imply it is complete; I meant that adding the functionality it is currently missing can easily fall in line with the design tenets of System.Numerics.

erozenfeld commented 8 years ago

@redknightlois In the ror and rol case there was a lot of existing code that was using rotation patterns, and we wanted it to benefit from this work. Just adding an intrinsic wouldn't help without rewriting all that code. I agree that in some cases intrinsics are more appropriate.

redknightlois commented 8 years ago

@erozenfeld I know, but the point I wanted to make still holds. Adding the ability to discover those usage patterns is awfully hard (bordering on unfeasible) to scale to the instruction set available on any modern processor out there.

nietras commented 8 years ago

@redknightlois I think having real unsafe "intrinsics" is a great idea. Especially, given the issues we/I have with the current form of Vector<T>. Our issues are among others:

One size, maximum vector length only. This is far from optimal with increasing SIMD register lengths, e.g. AVX-512, ARMv8 SVE, and many heterogeneous SoCs with register lengths of up to 2048 bits. Not having access to smaller vector lengths (e.g. 128 bit, 256 bit etc. when applicable) means we cannot write optimal code for small kernels, windows or similar in image processing and machine learning. Although Vector<T> has a design that is ideally suited for Agner Fog's ForwardCom, that simply isn't there yet, and most ISAs only support fixed-size vector lengths, but almost always with support for lengths smaller than the largest one.

Lack of good up/down conversions (as of the last time I checked). These are essential for numerical processing, machine learning, neural nets, etc.

The design of intrinsics could be inspired by existing APIs e.g.

An Abstraction Layer for SIMD Extensions
Agner Fog's vector class
Skia SIMD
@joeldevahl simd c++ intrinsics

Many others could be given. Anyone working with high performance code will meet some kind of abstraction at some point. All of these, as far as I can tell, have fixed size register abstractions. I believe we need a design based on this in some way e.g. Reg64<T>, Reg128<T>, Reg256<T> etc. which are completely unsafe, have zero or very low overhead if no direct hardware instruction support, can be queried for whether current processor supports the length, and so forth.

I think the way forward would be to create a requirement specification or wishlist if you will, that could be used by MS and the community to evaluate options. Perhaps from this, one could make one or more design proposals.

I have seen the following issues related to SIMD:

Consider refactor out SIMD support from System.Numerics into System.Unsafe.Simd: https://github.com/dotnet/coreclr/issues/6906
Consider providing SIMD JIT intrinsics for Matrix and Quaternion operations: https://github.com/dotnet/coreclr/issues/4356
Please provide intrinsics for SIMD bit shift operations: https://github.com/dotnet/coreclr/issues/3226
Support for SSE4 intrinsics by RyuJIT: https://github.com/dotnet/corefx/issues/2209
Design initialization API for Vector that supports padding: https://github.com/dotnet/corefx/issues/5360
Vector constructor not recognized for ubyte, byte, short or ushort: https://github.com/dotnet/coreclr/issues/5116
Add support for extracting a bit mask from a Vector: https://github.com/dotnet/corefx/issues/1010
Matrix4x4 Changes - Remove/Move Matrix4x4.CreateWorld, CreateBillboard, etc.: https://github.com/dotnet/corefx/issues/10931

Some of these indicate that the Vector<T> and other Numerics types have had a rather narrow scope and not prioritized types such as byte, sbyte, short, ushort. In a world where everything is being infused with some kind of AI and tricks, such as using 8-bit SIMD in neural nets, are becoming more important, I think having intrinsics in .NET could add to .NET Core being attractive for cross platform machine learning. If, that is, this is something that is prioritized.

@CarolEidt @jkotas @mellinoe @jamesqo

jkotas commented 8 years ago

> I think the way forward would be to create a requirement specification or wishlist if you will, that could be used by MS and the community to evaluate options. Perhaps from this, one could make one or more design proposals.

Having requirement/design options proposals written down sounds great to me.

BTW: We got mailing lists created at http://lists.dot.net/mailman/listinfo/dotnet-runtime-dev some time ago. They are meant to be used for deeper design discussions that are interesting to all dotnet runtimes implementations. The mailing lists have been pretty silent so far ... but I think design discussion about SIMD would be a good candidate for them. IIRC, both Mono and Unity IL2CPP have their own flavor of SIMD APIs...

> Reg64<T>, Reg128<T>, Reg256<T> etc. which are completely unsafe

What would these do and what makes them unsafe?

nietras commented 8 years ago

> What would these do and what makes them unsafe?

I think the requirement for these to have "zero or low overhead" means that they can be used primarily with pointers and refs, and if constructed from say managed arrays this should be done in an "unsafe" way without bounds checking if possible. Perhaps the only reason for calling them "unsafe" will be that they should be compatible with unsafe pointers i.e. we should be able to load/store from "unsafe" memory. This may just be via Unsafe.Read/Write so the type Reg128 itself does not need to be "unsafe" as such, but we need to be able to work with these over pointers and refs; native or managed.

The type itself should support the usual primitive operations (add, subtract, multiply etc.), but then have special functions for supporting popcnt, shuffling, up/down conversion, FMA (e.g. for 8-bit to 32-bit int, not just floats), etc. These functions should be as low level as possible and try to map directly to a given ISA's instruction. If the function is not supported, the "generic" implementation should be as fast as custom user code, so one can avoid having fallbacks.

I don't have all the answers here, and Vector<T> probably fulfils a lot of the above requirements, but the issue is the lack of the special functions and not having access to all the possible vector register lengths.

I could ask the question: why do most users of Vector<T> check IsHardwareAccelerated and then switch to another implementation? Because the overhead of Vector<T> is too big when it is not hardware accelerated. It would be good to avoid this.

Note that we do, however, want to be able to check (with the JIT eliminating these checks) whether, say, Reg256 is hardware accelerated, and use only Reg128 and Reg64 when it is not.
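The check-then-switch pattern described above typically looks something like the following with today's System.Numerics API. This is a minimal sketch; SumSketch and Sum are illustrative names, and the scalar loop doubles as both the fallback and the tail handler.

```csharp
using System;
using System.Numerics;

static class SumSketch
{
    // Sums an int array, using Vector<int> only when it is hardware accelerated.
    public static int Sum(int[] data)
    {
        int i = 0;
        int total = 0;

        if (Vector.IsHardwareAccelerated && data.Length >= Vector<int>.Count)
        {
            int width = Vector<int>.Count;        // lane count is fixed by the hardware
            Vector<int> acc = Vector<int>.Zero;

            for (; i <= data.Length - width; i += width)
            {
                acc += new Vector<int>(data, i);  // load 'width' elements starting at i
            }

            for (int lane = 0; lane < width; lane++)
            {
                total += acc[lane];               // horizontal add of the accumulator
            }
        }

        // Scalar path: the whole array when not accelerated, otherwise the tail.
        for (; i < data.Length; i++)
        {
            total += data[i];
        }

        return total;
    }

    static void Main()
    {
        int[] data = new int[100];
        for (int k = 0; k < data.Length; k++) data[k] = k + 1;
        Console.WriteLine(Sum(data));
    }
}
```

Note that the code cannot pick a narrower register for small inputs; Vector<int>.Count is whatever the runtime chose, which is the limitation raised in this comment.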

redknightlois commented 8 years ago

@jkotas I don't know if @nietras and I agree on what 'unsafe' means in this context, because we haven't discussed it before. But the definition I go with for "unsafe" is the "Man, you are on your own if you screw up" kind of 'unsafe'. So we are usually speaking of fixed memory and/or straight unmanaged memory kinds of operations.

@nietras, knowing if certain operations are supported is still necessary, because some algorithms only have an edge if certain operations are hardware accelerated. However, given the target architectures, certain instruction sets like SSE2 are guaranteed because AFAIK the CoreCLR core depends on them (I might be wrong though). There are other operations, like 'hints' to the CPU (prefetches, non-temporal moves, etc.), that effectively become nops or are changed when not supported.

nietras commented 8 years ago

"unsafe" is: "Man, you are on your own if you screw up"

@redknightlois or perhaps I just didn't explain myself very well. I agree. No hand-holding. But that does not mean it should necessarily be limited to native memory or pointers. refs and Unsafe manipulation of these, together with C# 7, make managed memory a target as well.

I don't know, perhaps the index operator is a better example for a register that contains a number of elements. No bounds checking should be done. That is, not like https://github.com/dotnet/corefx/blob/master/src/System.Numerics.Vectors/src/System/Numerics/Vector.cs#L1051

Hints overlap with the Unsafe.Assume we have previously discussed, e.g. for aligned access: https://github.com/dotnet/coreclr/issues/2725.

All this is also why I don't like the name SIMD or Vector (a terrible name in general since it can be confused with geometry etc.) since a lot of these Intrinsics are unrelated to this. It is for low level close to metal access to CPU instructions, not just SIMD.

tannergooding commented 7 years ago

I think the current issues are as follows:

The current APIs are very geared towards gaming/multimedia based frameworks and a lot of the intrinsics that are available simply won't mesh well.
There are also a large number of APIs currently exposed on the System.Numerics.Vector types that simply don't make sense for general-purpose use (DotProduct, Length, Distance, etc).
In the total set of intrinsics, a single IsHardwareAccelerated property isn't sufficient. Most modern machines have multiple SIMD instruction sets and only the latest hardware supports them all. For example, it may be important for a user to do something different if the hardware supports SSE2 vs if it supports AVX or FMA.
The APIs are exposed in a "higher" layer of CoreFX, so lower level layers that might benefit from these intrinsics cannot readily take advantage of them. For example, SSE4.2 provides instructions that are directly beneficial to string and text processing, but those cannot readily be used by any other layer in the library.
The intrinsics supported by x86 and ARM differ with x86 progressing at a much quicker pace (ARM only supports 128-bit SIMD, while some Intel support 512-bit SIMD).

I believe the appropriate fix here (honestly) is to expose the raw intrinsics for each architecture in a lower level layer of the framework (mscorlib in CoreCLR and System.Runtime.Extensions in CoreFX). Each intrinsic would continue to be emulated on hardware that doesn't support it (so things always work) and we would expose, at a much more fine-grained level, whether each instruction (or possibly just each instruction set) is hardware accelerated.

In my opinion, users should be able to write code similar to the following:

if (Intrinsics.Architecture == ARM)
{
    // Use NEON
}
else  // Assume x86
{
    if (Intrinsics.HasHardwareSupport(FMA))
    {
        // Use FMA
    }
    else // Assume SSE2
    {
        // Use SSE2
    }
}

This probably seems fairly backwards compared to how one would think managed code should be written, but it is actually critical to ensure that you can properly optimize your application for the underlying hardware while simultaneously taking the pressure of doing such optimizations off the JIT. There are some things that the JIT or even an AOT compiler will never be able to optimize properly (even the C++ compilers have issues, which is why they expose the intrinsics).

When the JIT hits a method using intrinsics, it can skip entire regions based on the Intrinsics.Architecture and Intrinsics.HasHardwareSupport clauses. The remaining instructions just come down to operating in the same manner as the System.Numerics.Vector instructions do today (if the hardware supports it, emit the raw intrinsic; otherwise, leave the software call).
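For comparison, the hardware intrinsics API that later shipped (System.Runtime.Intrinsics, in .NET Core 3.0 for x86 and .NET 5 for Arm) exposes a per-instruction-set IsSupported property, which gives much the same dispatch shape. The sketch below assumes a runtime where those types exist; DispatchSketch and ChoosePath are illustrative names, and the branches only report which path would be taken.

```csharp
using System;
using System.Runtime.Intrinsics.Arm;
using System.Runtime.Intrinsics.X86;

static class DispatchSketch
{
    // Picks a code path per instruction set. Each IsSupported check is a
    // per-process constant, so the untaken branches can be dropped when the
    // method is compiled for the current machine.
    public static string ChoosePath()
    {
        if (AdvSimd.IsSupported)
        {
            return "NEON";    // Arm Advanced SIMD path would go here
        }
        if (Fma.IsSupported)
        {
            return "FMA";     // fused multiply-add path would go here
        }
        if (Sse2.IsSupported)
        {
            return "SSE2";    // baseline x86 SIMD path would go here
        }
        return "scalar";      // portable fallback
    }

    static void Main()
    {
        Console.WriteLine(ChoosePath());
    }
}
```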

Having this pattern also opens up the possibility of implementing APIs that are currently FCALLs into the CRT in actual managed code (and ensuring we maintain perf). It also allows us to improve on some of these implementations to take advantage of intrinsics that might not be available otherwise (SIMD accelerated memcpy, string/text processing, etc).

Additionally, it allows users to ensure their code suits their needs.

As an example, on the x64 architecture you have a couple of options to compute the reciprocal square root: sqrtps followed by divps, or just rsqrtps. The former computes a much more accurate result, but is significantly slower, while the latter computes a less accurate result (max error of 1.5 * 2^-12) but is significantly faster. Due to this difference, the Intel Optimization Manual (11.12) recommends that you use the rsqrtps instruction on architectures where sqrtps and divps have high latency and low throughput and where you don't need the increased precision (pretty much anything prior to Skylake). Additionally, they recommend that if you don't need full precision and near-full precision is good enough, a single Newton-Raphson iteration can continue to provide higher throughput in a number of scenarios.
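The Newton-Raphson refinement mentioned above can be sketched in scalar C#. Since rsqrtps itself is not reachable here, CoarseRsqrt is a stand-in that deliberately throws away the low mantissa bits of the exact result; it is an assumption made for illustration and does not model the instruction's real error behaviour. The refinement step y' = y * (1.5 - 0.5 * x * y * y) is the standard iteration for 1/sqrt(x).

```csharp
using System;

static class RsqrtSketch
{
    // Stand-in for a fast, low-precision estimate of 1/sqrt(x):
    // compute the exact value, then clear the low 12 mantissa bits.
    public static float CoarseRsqrt(float x)
    {
        float exact = 1.0f / MathF.Sqrt(x);
        int bits = BitConverter.SingleToInt32Bits(exact);
        bits &= unchecked((int)0xFFFFF000);
        return BitConverter.Int32BitsToSingle(bits);
    }

    // One Newton-Raphson step for f(y) = 1/y^2 - x.
    // It roughly squares the relative error of the estimate y.
    public static float Refine(float x, float y)
    {
        return y * (1.5f - 0.5f * x * y * y);
    }

    static void Main()
    {
        float x = 2.0f;
        float exact = 1.0f / MathF.Sqrt(x);
        float coarse = CoarseRsqrt(x);
        float refined = Refine(x, coarse);

        Console.WriteLine($"coarse error:  {MathF.Abs(coarse - exact)}");
        Console.WriteLine($"refined error: {MathF.Abs(refined - exact)}");
    }
}
```

Whether the coarse value, the refined value, or the full sqrt-then-divide result is appropriate depends on the caller's precision needs, which is why the choice is argued here to belong to the user rather than the library.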

Being able to detect whether or not the "fast" implementation is appropriate for the user will be impossible and providing both a regular implementation as well as a Fast implementation will just lead to API bloat.

Finally, writing just this portion of the code in C/C++ might not be possible or maintainable. It might also come with increased overhead due to the interop/marshalling calls (as well as calling convention differences, etc). Writing the entire library/app in C/C++ might also not be easily maintainable (especially for cross-plat and cross-architecture scenarios).

svick commented 7 years ago

@tannergooding

Being able to detect whether or not the "fast" implementation is appropriate for the user will be impossible and providing both a regular implementation as well as a Fast implementation will just lead to API bloat.

So, how do you detect which implementation is appropriate with your proposal? You could probably pick some set of intrinsics that are supported only in Skylake and newer, and test for that using Intrinsics.HasHardwareSupport, but that doesn't sound like a great approach to me.

tannergooding commented 7 years ago

@svick, specifically for the case of instructions where the "fast" implementation provides different results from the "regular" implementation, it comes down to the user calling the appropriate intrinsic themselves.

I know, with regards to my own code, whether 1.0f / sqrtss is required or whether rsqrtss will be good enough. So, if I have the ability to indicate which to use, I can optimize my code my way (without having to take dependencies on mixed language solutions or dealing with p/invoke and the overhead it comes with).

Another example is the System.Numerics.Vector4.Transform code. Currently, it looks like this:

public static Vector4 Transform(Vector4 vector, Matrix4x4 matrix)
{
    return new Vector4(
        vector.X * matrix.M11 + vector.Y * matrix.M21 + vector.Z * matrix.M31 + vector.W * matrix.M41,
        vector.X * matrix.M12 + vector.Y * matrix.M22 + vector.Z * matrix.M32 + vector.W * matrix.M42,
        vector.X * matrix.M13 + vector.Y * matrix.M23 + vector.Z * matrix.M33 + vector.W * matrix.M43,
        vector.X * matrix.M14 + vector.Y * matrix.M24 + vector.Z * matrix.M34 + vector.W * matrix.M44);
}

The code is currently a strict 'software' implementation (which will hopefully get optimized into somewhat appropriate SIMD instructions).

However, a better approach is here: https://github.com/Microsoft/DirectXMath/blob/master/Inc/DirectXMathVector.inl#L14084, which uses explicit SSE intrinsics if available, and an even better implementation is here: https://github.com/Microsoft/DirectXMath/blob/master/Extensions/DirectXMathFMA4.h#L264, which uses explicit FMA intrinsics.

My proposal is that users be able to follow a similar coding convention:

public static Vector4 Transform(Vector4 vector, Matrix4x4 matrix)
{
    if (Intrinsics.Architecture == ARM)
    {
        // Do the NEON implementation
    }
    else // Assume x86
    {
        Debug.Assert(Intrinsics.Architecture == x86);

        if (Intrinsics.HasHardwareSupport(FMA))
        {
            // Do the FMA implementation
        }
        else if (Intrinsics.HasHardwareSupport(SSE))
        {
            // Do the SSE Implementation
        }
        else
        {
            // Software fallback
            return new Vector4(
                vector.X * matrix.M11 + vector.Y * matrix.M21 + vector.Z * matrix.M31 + vector.W * matrix.M41,
                vector.X * matrix.M12 + vector.Y * matrix.M22 + vector.Z * matrix.M32 + vector.W * matrix.M42,
                vector.X * matrix.M13 + vector.Y * matrix.M23 + vector.Z * matrix.M33 + vector.W * matrix.M43,
                vector.X * matrix.M14 + vector.Y * matrix.M24 + vector.Z * matrix.M34 + vector.W * matrix.M44);
        }
    }
}

The JIT would then be able to just emit the correct instructions and things would "work". The caveat being that users need to manually code the software fallback or something like a PlatformNotSupportedException would be thrown by the runtime.
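For comparison, here is a minimal runnable sketch of the same dispatch shape. It assumes the System.Runtime.Intrinsics.X86 API that later shipped in .NET Core 3.0 (Sse.IsSupported and Fma.IsSupported stand in for the proposed Intrinsics.HasHardwareSupport checks), omits the ARM branch, and uses a hypothetical TransformSketch class; it is not the framework's actual implementation.

```csharp
using System.Numerics;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

public static class TransformSketch
{
    public static Vector4 Transform(Vector4 v, Matrix4x4 m)
    {
        if (!Sse.IsSupported)
        {
            // Software fallback, identical to the current Vector4.Transform body.
            return new Vector4(
                v.X * m.M11 + v.Y * m.M21 + v.Z * m.M31 + v.W * m.M41,
                v.X * m.M12 + v.Y * m.M22 + v.Z * m.M32 + v.W * m.M42,
                v.X * m.M13 + v.Y * m.M23 + v.Z * m.M33 + v.W * m.M43,
                v.X * m.M14 + v.Y * m.M24 + v.Z * m.M34 + v.W * m.M44);
        }

        // Each matrix row becomes one 128-bit vector.
        Vector128<float> row1 = Vector128.Create(m.M11, m.M12, m.M13, m.M14);
        Vector128<float> row2 = Vector128.Create(m.M21, m.M22, m.M23, m.M24);
        Vector128<float> row3 = Vector128.Create(m.M31, m.M32, m.M33, m.M34);
        Vector128<float> row4 = Vector128.Create(m.M41, m.M42, m.M43, m.M44);

        // Each vector component is broadcast to all four lanes.
        Vector128<float> x = Vector128.Create(v.X);
        Vector128<float> y = Vector128.Create(v.Y);
        Vector128<float> z = Vector128.Create(v.Z);
        Vector128<float> w = Vector128.Create(v.W);

        Vector128<float> acc = Sse.Multiply(x, row1);

        if (Fma.IsSupported)
        {
            // FMA path: fused multiply-add, acc = (a * b) + acc.
            acc = Fma.MultiplyAdd(y, row2, acc);
            acc = Fma.MultiplyAdd(z, row3, acc);
            acc = Fma.MultiplyAdd(w, row4, acc);
        }
        else
        {
            // SSE path: separate multiply and add.
            acc = Sse.Add(acc, Sse.Multiply(y, row2));
            acc = Sse.Add(acc, Sse.Multiply(z, row3));
            acc = Sse.Add(acc, Sse.Multiply(w, row4));
        }

        return new Vector4(acc.GetElement(0), acc.GetElement(1), acc.GetElement(2), acc.GetElement(3));
    }
}
```

Because the IsSupported checks are constant for a given machine, the JIT can drop the branches that do not apply, which is the behaviour described above. Note that the FMA path rounds once per multiply-add, so its results can differ in the last bit from the other two paths.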

tannergooding commented 7 years ago

The majority of users wouldn't need to use this, but framework authors would want to.

Additionally, it would let us port a number of methods which are currently FCALLs into managed code without losing perf. It would also let us take advantage of things before the backing CRT library does (imagine portions of System.String being able to consume the SSE4 string/text processing instructions on hardware where it is available, without needing to worry about FCALLs, P/Invoke, etc...)

redknightlois commented 7 years ago

@svick It doesn't have to be great. I would be more than happy with "just works" ... it's definitely not code for everyone, but those that need it (me or even MS people doing platform work) don't have it either.

tannergooding commented 7 years ago

Another thing to keep in mind is that intrinsics are guaranteed to be compiled down optimally, while code patterns may not get optimized depending on a lot of factors (method too big/complex, code is not patterned 'just right', etc).

There also might not be reasonable managed wrappers for a lot of the intrinsics that people want and they may have different behaviors on different platform architectures. The min/max SIMD intrinsics are one example. ARM does it one way and x86 does it another.

I am the only one who knows whether this architecture difference is acceptable to my code, so I am the only one who can definitively say whether they can/should be used. Additionally, the overhead cost of FCALLs and p/invoke makes using these instructions [through those mechanisms] not worthwhile. So it turns into "I can't use this from C#": the JIT can't emit the code, so I am stuck with slightly less performant code.
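To illustrate the kind of difference involved, here is a scalar sketch of how the two min instructions treat NaN, as I understand them: x86 MINSS/MINPS evaluates (a < b) ? a : b per lane, so any NaN yields the second operand, while ARM FMIN returns NaN if either operand is NaN. The MinSemanticsSketch class and its method names are hypothetical, and only the NaN behaviour is modelled (signed-zero handling is ignored).

```csharp
using System;

public static class MinSemanticsSketch
{
    // x86-style rule: (a < b) ? a : b. Any comparison with NaN is false,
    // so the second operand is returned whenever either input is NaN.
    public static float MinX86Style(float a, float b) => a < b ? a : b;

    // ARM FMIN-style rule: a NaN in either operand produces NaN.
    public static float MinArmStyle(float a, float b)
    {
        if (float.IsNaN(a) || float.IsNaN(b))
        {
            return float.NaN;
        }

        return a < b ? a : b;
    }
}
```

A hardware-agnostic Min wrapper would have to pick one of these behaviours and pay for emulating it on the other architecture, which is why the choice is better left to the caller.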

redknightlois commented 7 years ago

In support of @tannergooding's example: roaring bitmaps just cannot be coded efficiently in C#, i.e. you can implement only less performant alternatives on CoreCLR. There are entire classes of algorithms where the best solution achievable on CoreCLR (because of the SIMD constraint) is very far from the state of the art.

Tornhoof commented 7 years ago

Still, even without popcnt, roaring bitmaps in C# are mostly faster than in Java, but I agree that having to use explicit intrinsics is fine for many use cases.

nietras commented 7 years ago

I'm still hoping that we can find some way to capture the "method info" of a given intrinsic method and then ask whether it is actually hardware accelerated. That would allow fine-grained querying of which functionality is hardware accelerated. In fact, this could then be extended to ask any heterogeneous compute device (e.g. GPU, Hexagon etc.) whether a given method/operation is available and whether it is accelerated. I second @tannergooding and his great examples.

tannergooding commented 7 years ago

@nietras, I'm thinking the best way to expose that is to either have a centralized class for hardware intrinsics (as I proposed above) or to possibly extend MethodInfo to have an IsIntrinsic property (where true means the call compiles down to an inlined machine instruction).

There is definitely existing code in the framework that is "sometimes intrinsic" (although intrinsic doesn't always mean 'fast', as is the case for fsin vs emulation with SSE) and it would be useful to know if the code would hit the optimization path.

nietras commented 7 years ago

@tannergooding could the JIT elide a MethodInfo.IsIntrinsic check? I am also wondering how we would capture the MethodInfos in the face of possible refs or even pointers.

Levels of hardware acceleration could be an option. Other than that, your proposal for specific "intrinsic sets", e.g. x86.SSE, or special cases like FMA etc., would be straightforward. Users would then have to know which intrinsics are part of each of these sets, though. But most users would presumably have this knowledge.

Tornhoof commented 7 years ago

What does a MethodInfo.IsIntrinsic property actually solve? I think that property has a different goal than a class X86.SSE (following @nietras's example here). The class allows us to build possibly faster algorithms; the property mainly tells us that something is 'not' fast. For checks of the 'Is SSE available yes/no' kind, I imagine the Intrinsic class needs that anyway, possibly including the ability for the JIT to optimize the check away, as @nietras suggests.

If you mean automatic recognition of e.g. the Rol/Ror pattern, with the property telling me that it's optimized, that would only work if the pattern is in a dedicated method and not just a pattern somewhere in my code. I can't think of many use cases where I'm actually interested in knowing at runtime that my code is not optimized, except telling my users to buy a newer/better CPU.
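For readers unfamiliar with it, the Rol/Ror pattern is the shift-and-or idiom sketched below. Whether a given JIT version actually recognizes it and emits a single rotate instruction is an assumption here, and the RotateSketch class is hypothetical.

```csharp
public static class RotateSketch
{
    // Rotate left: bits shifted out at the top re-enter at the bottom.
    // C# masks the shift count of a 32-bit operand to its low 5 bits,
    // so n == 0 is safe: (32 - 0) is treated as a shift by 0.
    public static uint RotateLeft(uint x, int n) => (x << n) | (x >> (32 - n));

    // Rotate right: the mirror image of RotateLeft.
    public static uint RotateRight(uint x, int n) => (x >> n) | (x << (32 - n));
}
```

Wrapping the idiom in a dedicated method like this is what would give a property such as MethodInfo.IsIntrinsic something to attach to; the same expression inlined in the middle of other code has no MethodInfo of its own.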

RossNordby commented 7 years ago

I'll also second (third?) a design like the one proposed by @tannergooding. Vector<T> is really nice as-is for a lot of things, but it has some gaps that are very hard to fill at its current level of abstraction. As a library developer, I'd definitely be willing to jump through some extra hoops to improve performance. Doubly so if it makes it quicker and easier for the runtime to adopt newer intrinsics and those which don't have an easy universal hardware-agnostic design. (Shuffles, shifts, gather/scatter, AVX512...)

I'd also fully support an API which has 'unsafe'-ish parts that trust the user to do things properly:

If the point is to expose the hardware, then it's the user's responsibility to, for example, use only immediate values for a bit shift on hardware that can only accelerate immediate values. A performance cliff or exception seem like acceptable fallbacks in the event that the user misuses such an intrinsic.
As a general observation, in every situation where I have resorted to SIMD-level optimization so far, language-level memory safety has been essentially irrelevant or counterproductive. Easy use of arbitrary memory blobs and avoiding bounds checks seems like the right level of safety for this kind of API.

saucecontrol commented 7 years ago

I'd also like to vote in favor of the option @tannergooding proposed. I'm working on imaging software, and while Vector<T> has allowed me to accelerate quite a few things, the absence of basic operations like SHUF, RSQRT, HADD, etc really limits what I can do, and these don't necessarily fit with the way the Vector APIs are laid out.

Even in cases where an API proposal fits the model, we've been held up by the fear that developers will misuse them, as seen in #16835

I also agree with @RossNordby that these are advanced features, and it's reasonable to assume that developers using them have some clue what they're doing. As long as there's an IL implementation to fall back on, the worst that will happen is that the code will be slow.

VladimirAkopyan commented 7 years ago

@RossNordby is absolutely right - by its very nature SIMD is used in places where arbitrary memory blobs run rampant, and it belongs in an unsafe namespace. For example, I have worked with machine vision cameras, and when working with those, performance is often paramount - some cameras can produce 12 megapixels at >330 fps. And they come with an API for C#. They don't produce data nicely arranged into float or int - they produce just a stream of bits, 10-12 bits per pixel, tightly packed together. And these cameras are more common than you might think - all kinds of industrial equipment uses them, from almond sorting machines to an autonomous bus I've seen recently.

tannergooding commented 7 years ago

https://github.com/dotnet/designs/issues/13

fiigii commented 7 years ago

Intel hardware intrinsic API proposal has been opened at dotnet/corefx#22940

mellinoe commented 7 years ago

@RossNordby @Ziflin @redknightlois @nietras We would very much like to hear everyone's feedback on the proposal and design linked right above -- it was influenced in part by the feedback we have received here and in other similar issues.

Ziflin commented 7 years ago

Thanks @tannergooding , @fiigii, @mellinoe! The proposal is looking really good! Thanks for all the work on it!

dotnet / runtime

Consider refactor out SIMD support from System.Numerics into System.Unsafe.Simd #6556