Vector Shuffling Operations

MikePopoloski commented 9 years ago

As mentioned in dotnet/runtime#14281, a lot of useful SIMD code relies on shuffling data around within registers. There are a whole bunch of instructions available to do this, so we'd like to expose a coherent subset of these via Vector.

Rationale and Usage

Simple stream processing, such as the operations already supported on Vector, are entirely parallel and thus don't benefit from shuffling operations. However, lots of common uses of SIMD do require it, such as doing geometric vector operations like matrix multiplies and vector cross products. These are key scenarios for certain users, such as game developers.

If rearranging elements is required for an algorithm and fast hardware support is not provided, the cost of loading and storing to and from SIMD registers might exceed any benefits gained from using SIMD in the first place.

When looking at existing libraries that wrap SIMD functionality at a low level, it's apparent that there are a few main use cases of shuffle:

Permuting the elements of two different vectors into a result
"Swizzling" the elements within a single vector
"Splatting" or "broadcasting" one element to all others within a single vector

Rather than expose one single shuffle method directly, I think it makes sense to go with this slightly higher level functionality such that the API doesn't become too tied to a particular architecture's instruction set.

Proposed API

New methods on the Vector struct:

// pick elements from a and b to place into result
internal static unsafe Vector<T> Permute2 (Vector<T> a, Vector<T> b, int s0, int s1);
internal static unsafe Vector<T> Permute4 (Vector<T> a, Vector<T> b, int s0, int s1, int s2, int s3);

// rearrange elements within a
internal static unsafe Vector<T> Swizzle2 (Vector<T> a, int s0, int s1);
internal static unsafe Vector<T> Swizzle4 (Vector<T> a, int s0, int s1, int s2, int s3);

// rearrange bytes in a
internal static unsafe Vector<T> ShuffleBytes (Vector<T> a, Vector<byte> mask);

// broadcast specified element to all other elements in a
internal static unsafe Vector<T> Splat (Vector<T> a, int element);

New methods on Vector class that are simple wrappers around the Vector methods.

public static Vector<double> Permute (Vector<double> a, Vector<double> b, int s0, int s1);
public static Vector<long> Permute (Vector<long> a, Vector<long> b, int s0, int s1);
public static Vector<ulong> Permute (Vector<ulong> a, Vector<ulong> b, int s0, int s1);
public static Vector<float> Permute (Vector<float> a, Vector<float> b, int s0, int s1, int s2, int s3);
public static Vector<int> Permute (Vector<int> a, Vector<int> b, int s0, int s1, int s2, int s3);
public static Vector<uint> Permute (Vector<uint> a, Vector<uint> b, int s0, int s1, int s2, int s3);

public static Vector<double> Swizzle (Vector<double> a, int s0, int s1);
public static Vector<long> Swizzle (Vector<long> a, int s0, int s1);
public static Vector<ulong> Swizzle (Vector<ulong> a, int s0, int s1);
public static Vector<float> Swizzle (Vector<float> a, int s0, int s1, int s2, int s3);
public static Vector<int> Swizzle (Vector<int> a, int s0, int s1, int s2, int s3);
public static Vector<uint> Swizzle (Vector<uint> a, int s0, int s1, int s2, int s3);

public static Vector<T> ShuffleBytes (Vector<T> a, Vector<byte> mask);

public static Vector<T> Splat (Vector<T> a, int element);

Details and Questions

There are a lot of tradeoffs when picking method signatures for shuffle. For example, instead of taking many selector parameters, we could instead have one combined integer mask. However, pretty much every single caller will then implement their own helper method to do the bitwise combination of the mask, so all callsites get more complicated. Additionally, we want the JIT to discover and optimize the cases where the parameters are JIT-time constants.
I chose for the selector parameters to be ints instead of, say, bytes. I'm not sure if there's a compelling reason to choose one over the other. In either case, we'd have to define what happens when the parameters are out of range. Since the intrinsics are going to bitwise combine them, in the interests of speed it'd be good if they could say that are not range checked.
We could also have a Permute8 and Swizzle8 and extend the methods to shorts and ushorts. The parameter lists get kind of unwieldy...
ShuffleBytes is an interesting case. pshufb is a very useful instruction to expose, but it's an SSSE3 instruction. It may be worth being more explicit about which features are supported by the hardware via some set of caps flags instead of the single boolean IsHardwareAccelerated we have now.
Shuffle intrinsically breaks the abstraction of Vector of not needing to know the underlying register size. Honestly, I think that's an unworkable abstraction in the long term; it's too difficult to write usage code that can work with arbitrarily sized vectors, so it might be worth exploring having specifically sized Vector128, Vector256, etc. This is clearly a discussion to be had in a separate issue though.

mellinoe commented 9 years ago

+@caroleidt

Shuffle intrinsically breaks the abstraction of Vector of not needing to know the underlying register size. Honestly, I think that's an unworkable abstraction in the long term; it's too difficult to write usage code that can work with arbitrarily sized vectors, so it might be worth exploring having specifically sized Vector128, Vector256, etc. This is clearly a discussion to be had in a separate issue though.

When we have discussed Shuffle operations internally, this was the general roadblock we hit, and never came up with a great solution. Our initial designs had this notion of a "BitVector" type, which could act as a sort of variable-sized mask for operating with different-sized data types. It was part of our very early designs, so I'm struggling to remember how you interacted with it exactly, for example, how you constructed it, how you modified it, etc.

MikePopoloski commented 9 years ago

How attached are you guys to this particular implementation? I feel like dynamically sized vectors are an interesting idea for specialized workloads but not tractable for more general cases. It's easy to imagine running into further issues with trying to expose things like horizontal adds.

It may be worth looking at how different C++ libraries provide abstractions over SIMD; the venerable Agner Fog has a great lib that I think would serve well as an API for .NET: http://www.agner.org/optimize/vectorclass.pdf

I don't mind doing this work if you think it's a good idea; I feel pretty strongly that we get the API right now so that we're not locked out when trying to add other functionality in the future.

CarolEidt commented 9 years ago

@MikePopoloski - There are really two usage models for SIMD vectors - fixed-size types work well for modeling systems with a fixed number of dimensions (e.g. points in space, or colors expressed as RGB). For that we have types like Vector3, etc. Vector is useful in solving large parallel problems, such as in mathematics, finance or physical sciences, where you really want to be able to leverage all the available parallelism, without having to have conditional code for different targets. It would probably make sense for us to introduce the shuffle operators for the fixed types first, and then consider what abstractions could and should be supported for Vector.

tfenise commented 9 years ago

As we have Vector for simple stream processing on doubles, we could have Vector or something alike for complex processing on 2D vectors. However, this means Vector must occupy two xmm registers on x86 without AVX. Maybe a bad idea.

programmatom commented 8 years ago

Anyone mind if I post a real-world example to provide some feedback? This particular example is complex multiply. It's kind of interesting because it has fairly simple requirements, but also a bit peculiar. In essence, it requires some pairwise (i.e. even/odd) permutation. Please see the attached code for a working proof-of-concept (using Vector, but with shuffles done scalar - the code is tested and correct, but slow). For "vAeven", the even elements are copied to the odd positions, and for "vAodd", the odd elements are copied to the even positions. For "vBswap", it's a simple permutation ("swizzle" in the Intel lingo). The algorithm is agnostic to vector length, as long as it is a multiple of 2. In all cases, it would be using the proposed Swizzle2().

Now, it turns out there is a "horizontal add" on AVX (and SSE3) which seems ready-made for complex arithmetic. If that is used, then the two permutations used to generate "vAeven" and "vAodd" can be eliminated - so if you're after the best possible performance, then you'd need to consider exposing the horizontal add as well.

Another possibility is exposing complex multiply as an intrinsic - since the optimal forms on different architectures vary. (It's common enough, in signal processing anyway.) The precedent is that "dot" product is already exposed that way, as well as some matrix operations, and "lerp".

Above, there's some discussion of whether to continue trying to abstract away register length. As far as from the perspective of this testcase, one possibility might be to add Vector8 - which would turn into two Vector4's in SSE2, but be a single register on AVX.

Maybe I've just muddied the water, and complex-multiply should be split off into it's own 'issue'?

complexmultiply.txt

tfenise commented 8 years ago

I think instead of having:

public static Vector<double> Swizzle (Vector<double> a, int s0, int s1);

we could have:

public static Vector<Vector2Double> Swizzle(Vector<Vector2Double> a, int s0, int s1);

The only "difference" between Vector<double> and Vector<Vector2Double> is that a Vector<Vector2Double> is guaranteed to consist of an even number of doubles. This become important if we want to provide Vector<Vector4Double>, which must occupy two 128-bit registers without AVX on x64.

On a machine with SSE2, Swizzle(a, 1, 0) could behave as follow:

(x0, x1) -> (x1, x0)

but on a machine with AVX, it could behave like this:

(x0, x1, x2, x3) -> (x1, x0, x3, x2)

Thus it does not break the abstraction of Vector of not needing to know the underlying register size.

CarolEidt commented 8 years ago

The problem with the proposed APIs (both the one proposed by @MikePopoloski, as well as those proposed by @tfenise) is that the hardware (at least SSE and AVX and their derivatives, but presumably other implementations as well) do not support shuffles or permutations with variable selectors. While one could imagine that the JIT could simply do constant propagation to deal with that, in the event that it could not derive constant values the code generation would be complex, and not give the developer the performance they might expect. That is why it would be more desirable to come up with useful names for the types of permutation to be done. I think it is reasonable to set as a design parameter that Vector<T>.Count will always be an even number. This would make it reasonable to provide methods such as Swap, Even and Odd that address the Complex case (though presumably one might want an Even that compresses the vector - i.e. two Vector<T>s as input would give one Vector<T> with just the even elements, as well as an Even that does what the example @programmatom provided would want - which duplicates the even elements to produce a Vector<T> from a single input.

terrajobst commented 8 years ago

That would seem reasonable to me.

Others?

tfenise commented 8 years ago

What about permutations among four or more elements?

CarolEidt commented 8 years ago

Shuffles and permutations of more than a couple of elements are problematic with Vector<T>, and I'd love to hear of additional use cases. For shuffles of fixed-size vectors, I think a "field name" approach would be good, e.g.

// Shuffle fields of fixed vectors.
internal static Vector3 ZXY(Vector3 value);  // Rotate
...
internal static Vector4 YXWZ(Vector4 value); // Even/Odd shuffle
...

I find it a bit more difficult to come up with good names for the permutes. It could get really cumbersome:

internal static Vector4 XaXbYaYb(Vector4 a, Vector4 b); // Interleave

tfenise commented 8 years ago

There are over 200 shuffles of Vector4. It might be impractical to have over 200 shuffle functions, let alone permutations.

For the complex case, I also suggest the following API or a equivalent one: public static Vector<T> Broadcast<T>(T odd, T even);

There can also be shift or rotation API for Vector<T>, e.g. public static Vector<T> ShiftLeftOnce<T>(Vector<T>); public static Vector<T> ShiftLeftTwice<T>(Vector<T>);//Since Vector<T>.Count is an even number

Maybe I'm asking for too much.

DerpMcDerp commented 8 years ago

For swizzling, the field name (i.e. XX, XY, ..., XXXX, ..., WWWW) approach results in an insane amount of properties. I think an indexing operator should be used instead:

struct Vector2 {
    public float this[int index] { get; } // might as well add this for symmetry
    public Vector2 this[int x, int y] { get; }
    public Vector3 this[int x, int y, int z] { get; }
    public Vector4 this[int x, int y, int z, int w] { get; }
}

The jitter should look to see if the numbers passed in are all constants so it 1. uses the register selection feature of some SIMD processors and 2. skips over the index out of range checks.

For the basic case, it should do fancy tricks like:

// Vector3
public Vector4 this[int x, int y, int z, int w] {
// 1. use | instead of || to avoid branching and to do the comparison in parallel
// 2. cast to uint to avoid the index < 0 test
    if (unchecked((uint)(x|y|z|w)) > 2) {
        throw new ArgumentOutOfRangeException();
    }

// treat the struct as if it were a float array
    fixed (float *as_array = &this.X) {
        return new Vector4(
            as_array[x],
            as_array[y],
            as_array[z],
            as_array[w]
        );
    }
}

redknightlois commented 8 years ago

With some help of Roslyn (analyzers or the compiler itself) we can even do this:

public enum Operator { X, Y, Z, W };

struct Vector2
{
    public float this[int index] { get; }
    public Vector2 this[Operator x, Operator y] { get; }
    public Vector3 this[Operator x, Operator y, Operator z] { get; }
    public Vector4 this[Operator x, Operator y, Operator z, Operator w] { get; }
}

And it would be used like this:

Vector2 t = new Vector2() { X = 1, Y = 2 };
Vector2 res = t[Operator.X, Operator.X, Operator.Y, Operator.Y];

CarolEidt commented 8 years ago

I believe that the following would be just as easy for the JIT to translate into a shuffle: Vector4 res = new Vector4(t.X, t.X, t.Y, t.Y); The enum arguments need not be constants, so in your proposal the JIT would only be able to translate it into a shuffle by determining that they are, indeed, constant, which would be at about the same difficulty of determining that all the arguments to a constructor are fields of the same input Vector - and it doesn't require a new API. Indeed, that is probably what we should do.

redknightlois commented 8 years ago

@CarolEidt I do like that!!!

On a side note, I found this issue because what I needed was to know if PSHUFB was available to optimize IPAddress.NetworkToHost() operation which is quite bad. We have a better one built with inline code, but PSHUFB should be the way to go (it is even faster than BSWAP). How would that work with byte vectors? Would it be even possible?

CarolEidt commented 8 years ago

@redknightlois - the byte vectors gets us back into Vector<T> territory, where we don't have constructors that take per-field values. It is challenging to come up with a general solution for that case, though I was hoping that we could find a reasonably limited subset of cases that would be useful, such as Swap, Even, Odd, and/or Dup.

redknightlois commented 8 years ago

@CarolEidt Couldn't the morpher being able to figure out things like:

Vector<byte> src, dest;
dest[0] = src[1];
dest[1] = src[0];
dest[2] = src[2];
dest[3] = src[3];

It will probably be hard and costly to match, but it would also ensure that we can get the improvement also for byte* and byte[].

EDIT: In the case of Vector the morpher could also look into things like this:

var dest = new Vector<byte>(src[1], src[0], src[2], src[3] );

CarolEidt commented 8 years ago

@redknightlois We don't have such a constructor for Vector<T>, so it would have to be your former example - and then would the JIT look for all possible lengths, up to Vector<T>.Count?

redknightlois commented 8 years ago

@CarolEidt probably I am not getting the question, but let me elaborate how I think it could work.

Vector (when T is instantiated, we know the JIT knows the .Count)
Now here we have 2 options
- We only apply shuffle, gather, scatter when ALL of the values are in the code (complete morph).
- We apply also when there are pass-through, which would require probably a very complex morpher.

I am OK with requiring that you have to code explicitely for each T the appropriate operation. The rationale is why? Because lets say that I now need a XXYY for T instead of for only byte. I can go and build a custom operation with type guards and code every single version of it with the if(type)->then op pattern.

If I need a very complex operation I could just go and build it with the low-level optimized codepath and then compose them. For example, lets say that I want to build a movelh primitive, the code would be something like this:

[Vectorize]
public static Vector4<T> MoveLowerToHigher ( Vector4<T> a, Vector4<T> b )
{
      Vector4<T> dest;
      dest[0] = a[0];
      dest[1] = a[1];
      dest[2] = b[0];
      dest[3] = b[1];
      return dest;
}

You would notice that I put a [Vectorize] there. The reason is I dont care if I have to pay an extra cost to use vectorization morphers (which could be disabled-by-default). The JIT would just go and find the proper sequence of instructions (based on the operations available at the target architecture) that could do that for an extra cost.

Also for the JIT to pick it up there probably would be restrictions in what to put in the parameters. If there is no match, then the code is it is just plain boring serial CPU code. The upside is that if the morphers are able to handle the general cases for the most interesting operations: Load, Store, Shuffle, Rotations, Swizzles, Conditionals, etc you can build the primitive yourself and expect it to be inlined directly in your calling code.

Probably it sounds easier than it is, but I see it as an alternative to having to find "the proper" API to reflect things that are just not friendly to deal with in a high level language. There are a few that would require special APIs but I believe they could be minimal if the coverage can be done and restricted through special purpose morphers.

redknightlois commented 8 years ago

@CarolEidt A potential shuffle selection would also be the four presented in this nvidia presentation: http://on-demand.gputechconf.com/gtc/2013/presentations/S3174-Kepler-Shuffle-Tips-Tricks.pdf They are though for GPU, but they are general enough to provide a sensible API for Shuffle.

public Vector<T> ShuffleDown ( Vector<T> v, int n );
public Vector<T> ShuffleUp ( Vector<T> v, int n );
public Vector<T> ShuffleXor( Vector<T> v )
public Vector<T> Shuffle( Vector<T> v, Vector<int> index);

terrajobst commented 8 years ago

@mellinoe & @CarolEidt we should chat. It seems we need a more holistic plan for exposing more intrinsics from our library.

danmoseley commented 7 years ago

@terrajobst it would be great to nudge this one along.

tannergooding commented 4 years ago

Going to close this for now. This API needs some more thought about how shuffling would work with regards to variable length vectors (when exposed as a general-purpose API).

If someone feels strongly about this, please feel free to open a new issue. Our API review process is detailed here with a "good example" being listed under step 1. Ideally the new issue would provide many of the sections given, but would at a minimum provide the base rationale and proposed API surface sections.

For this API in particular, the exposure of the System.Runtime.Intrinsic types and the methods for converting to/from Vector<T> should unblock users who immediately need this functionality.

dotnet / runtime