Closed MikePopoloski closed 4 years ago
+@caroleidt
- Shuffle intrinsically breaks the abstraction of Vector of not needing to know the underlying register size. Honestly, I think that's an unworkable abstraction in the long term; it's too difficult to write usage code that can work with arbitrarily sized vectors, so it might be worth exploring having specifically sized Vector128, Vector256, etc. This is clearly a discussion to be had in a separate issue though.
When we have discussed Shuffle operations internally, this was the general roadblock we hit, and never came up with a great solution. Our initial designs had this notion of a "BitVector" type, which could act as a sort of variable-sized mask for operating with different-sized data types. It was part of our very early designs, so I'm struggling to remember how you interacted with it exactly, for example, how you constructed it, how you modified it, etc.
How attached are you guys to this particular implementation? I feel like dynamically sized vectors are an interesting idea for specialized workloads but not tractable for more general cases. It's easy to imagine running into further issues with trying to expose things like horizontal adds.
It may be worth looking at how different C++ libraries provide abstractions over SIMD; the venerable Agner Fog has a great lib that I think would serve well as an API for .NET: http://www.agner.org/optimize/vectorclass.pdf
I don't mind doing this work if you think it's a good idea; I feel pretty strongly that we get the API right now so that we're not locked out when trying to add other functionality in the future.
@MikePopoloski - There are really two usage models for SIMD vectors - fixed-size types work well for modeling systems with a fixed number of dimensions (e.g. points in space, or colors expressed as RGB). For that we have types like Vector3, etc. Vector
As we have Vector
Anyone mind if I post a real-world example to provide some feedback? This particular example is complex multiply. It's kind of interesting because it has fairly simple requirements, but also a bit peculiar. In essence, it requires some pairwise (i.e. even/odd) permutation. Please see the attached code for a working proof-of-concept (using Vector
Now, it turns out there is a "horizontal add" on AVX (and SSE3) which seems ready-made for complex arithmetic. If that is used, then the two permutations used to generate "vAeven" and "vAodd" can be eliminated - so if you're after the best possible performance, then you'd need to consider exposing the horizontal add as well.
Another possibility is exposing complex multiply as an intrinsic - since the optimal forms on different architectures vary. (It's common enough, in signal processing anyway.) The precedent is that "dot" product is already exposed that way, as well as some matrix operations, and "lerp".
Above, there's some discussion of whether to continue trying to abstract away register length. As far as from the perspective of this testcase, one possibility might be to add Vector8 - which would turn into two Vector4's in SSE2, but be a single register on AVX.
Maybe I've just muddied the water, and complex-multiply should be split off into it's own 'issue'?
I think instead of having:
public static Vector<double> Swizzle (Vector<double> a, int s0, int s1);
we could have:
public static Vector<Vector2Double> Swizzle(Vector<Vector2Double> a, int s0, int s1);
The only "difference" between Vector<double>
and Vector<Vector2Double>
is that a Vector<Vector2Double>
is guaranteed to consist of an even number of doubles. This become important if we want to provide Vector<Vector4Double>
, which must occupy two 128-bit registers without AVX on x64.
On a machine with SSE2, Swizzle(a, 1, 0) could behave as follow:
(x0, x1) -> (x1, x0)
but on a machine with AVX, it could behave like this:
(x0, x1, x2, x3) -> (x1, x0, x3, x2)
Thus it does not break the abstraction of Vector of not needing to know the underlying register size.
The problem with the proposed APIs (both the one proposed by @MikePopoloski, as well as those proposed by @tfenise) is that the hardware (at least SSE and AVX and their derivatives, but presumably other implementations as well) do not support shuffles or permutations with variable selectors. While one could imagine that the JIT could simply do constant propagation to deal with that, in the event that it could not derive constant values the code generation would be complex, and not give the developer the performance they might expect. That is why it would be more desirable to come up with useful names for the types of permutation to be done.
I think it is reasonable to set as a design parameter that Vector<T>.Count
will always be an even number. This would make it reasonable to provide methods such as Swap
, Even
and Odd
that address the Complex case (though presumably one might want an Even
that compresses the vector - i.e. two Vector<T>
s as input would give one Vector<T>
with just the even elements, as well as an Even
that does what the example @programmatom provided would want - which duplicates the even elements to produce a Vector<T>
from a single input.
That would seem reasonable to me.
Others?
What about permutations among four or more elements?
Shuffles and permutations of more than a couple of elements are problematic with Vector<T>
, and I'd love to hear of additional use cases. For shuffles of fixed-size vectors, I think a "field name" approach would be good, e.g.
// Shuffle fields of fixed vectors.
internal static Vector3 ZXY(Vector3 value); // Rotate
...
internal static Vector4 YXWZ(Vector4 value); // Even/Odd shuffle
...
I find it a bit more difficult to come up with good names for the permutes. It could get really cumbersome:
internal static Vector4 XaXbYaYb(Vector4 a, Vector4 b); // Interleave
There are over 200 shuffles of Vector4. It might be impractical to have over 200 shuffle functions, let alone permutations.
For the complex case, I also suggest the following API or a equivalent one:
public static Vector<T> Broadcast<T>(T odd, T even);
There can also be shift or rotation API for Vector<T>
, e.g.
public static Vector<T> ShiftLeftOnce<T>(Vector<T>);
public static Vector<T> ShiftLeftTwice<T>(Vector<T>);//Since Vector<T>.Count is an even number
Maybe I'm asking for too much.
For swizzling, the field name (i.e. XX
, XY
, ..., XXXX
, ..., WWWW
) approach results in an insane amount of properties. I think an indexing operator should be used instead:
struct Vector2 {
public float this[int index] { get; } // might as well add this for symmetry
public Vector2 this[int x, int y] { get; }
public Vector3 this[int x, int y, int z] { get; }
public Vector4 this[int x, int y, int z, int w] { get; }
}
The jitter should look to see if the numbers passed in are all constants so it 1. uses the register selection feature of some SIMD processors and 2. skips over the index out of range checks.
For the basic case, it should do fancy tricks like:
// Vector3
public Vector4 this[int x, int y, int z, int w] {
// 1. use | instead of || to avoid branching and to do the comparison in parallel
// 2. cast to uint to avoid the index < 0 test
if (unchecked((uint)(x|y|z|w)) > 2) {
throw new ArgumentOutOfRangeException();
}
// treat the struct as if it were a float array
fixed (float *as_array = &this.X) {
return new Vector4(
as_array[x],
as_array[y],
as_array[z],
as_array[w]
);
}
}
With some help of Roslyn (analyzers or the compiler itself) we can even do this:
public enum Operator { X, Y, Z, W };
struct Vector2
{
public float this[int index] { get; }
public Vector2 this[Operator x, Operator y] { get; }
public Vector3 this[Operator x, Operator y, Operator z] { get; }
public Vector4 this[Operator x, Operator y, Operator z, Operator w] { get; }
}
And it would be used like this:
Vector2 t = new Vector2() { X = 1, Y = 2 };
Vector2 res = t[Operator.X, Operator.X, Operator.Y, Operator.Y];
I believe that the following would be just as easy for the JIT to translate into a shuffle:
Vector4 res = new Vector4(t.X, t.X, t.Y, t.Y);
The enum arguments need not be constants, so in your proposal the JIT would only be able to translate it into a shuffle by determining that they are, indeed, constant, which would be at about the same difficulty of determining that all the arguments to a constructor are fields of the same input Vector - and it doesn't require a new API. Indeed, that is probably what we should do.
@CarolEidt I do like that!!!
On a side note, I found this issue because what I needed was to know if PSHUFB was available to optimize IPAddress.NetworkToHost() operation which is quite bad. We have a better one built with inline code, but PSHUFB should be the way to go (it is even faster than BSWAP). How would that work with byte vectors? Would it be even possible?
@redknightlois - the byte vectors gets us back into Vector<T>
territory, where we don't have constructors that take per-field values. It is challenging to come up with a general solution for that case, though I was hoping that we could find a reasonably limited subset of cases that would be useful, such as Swap, Even, Odd, and/or Dup.
@CarolEidt Couldn't the morpher being able to figure out things like:
Vector<byte> src, dest;
dest[0] = src[1];
dest[1] = src[0];
dest[2] = src[2];
dest[3] = src[3];
It will probably be hard and costly to match, but it would also ensure that we can get the improvement also for byte*
and byte[]
.
EDIT: In the case of Vector
var dest = new Vector<byte>(src[1], src[0], src[2], src[3] );
@redknightlois We don't have such a constructor for Vector<T>
, so it would have to be your former example - and then would the JIT look for all possible lengths, up to Vector<T>.Count
?
@CarolEidt probably I am not getting the question, but let me elaborate how I think it could work.
I am OK with requiring that you have to code explicitely for each T the appropriate operation. The rationale is why? Because lets say that I now need a XXYY for T instead of for only byte. I can go and build a custom operation with type guards and code every single version of it with the if(type)->then op
pattern.
If I need a very complex operation I could just go and build it with the low-level optimized codepath and then compose them. For example, lets say that I want to build a movelh primitive, the code would be something like this:
[Vectorize]
public static Vector4<T> MoveLowerToHigher ( Vector4<T> a, Vector4<T> b )
{
Vector4<T> dest;
dest[0] = a[0];
dest[1] = a[1];
dest[2] = b[0];
dest[3] = b[1];
return dest;
}
You would notice that I put a [Vectorize]
there. The reason is I dont care if I have to pay an extra cost to use vectorization morphers (which could be disabled-by-default). The JIT would just go and find the proper sequence of instructions (based on the operations available at the target architecture) that could do that for an extra cost.
Also for the JIT to pick it up there probably would be restrictions in what to put in the parameters. If there is no match, then the code is it is just plain boring serial CPU code. The upside is that if the morphers are able to handle the general cases for the most interesting operations: Load, Store, Shuffle, Rotations, Swizzles, Conditionals, etc you can build the primitive yourself and expect it to be inlined directly in your calling code.
Probably it sounds easier than it is, but I see it as an alternative to having to find "the proper" API to reflect things that are just not friendly to deal with in a high level language. There are a few that would require special APIs but I believe they could be minimal if the coverage can be done and restricted through special purpose morphers.
@CarolEidt A potential shuffle selection would also be the four presented in this nvidia presentation: http://on-demand.gputechconf.com/gtc/2013/presentations/S3174-Kepler-Shuffle-Tips-Tricks.pdf They are though for GPU, but they are general enough to provide a sensible API for Shuffle.
public Vector<T> ShuffleDown ( Vector<T> v, int n );
public Vector<T> ShuffleUp ( Vector<T> v, int n );
public Vector<T> ShuffleXor( Vector<T> v )
public Vector<T> Shuffle( Vector<T> v, Vector<int> index);
@mellinoe & @CarolEidt we should chat. It seems we need a more holistic plan for exposing more intrinsics from our library.
@terrajobst it would be great to nudge this one along.
Going to close this for now. This API needs some more thought about how shuffling would work with regards to variable length vectors (when exposed as a general-purpose API).
If someone feels strongly about this, please feel free to open a new issue. Our API review process is detailed here with a "good example" being listed under step 1. Ideally the new issue would provide many of the sections given, but would at a minimum provide the base rationale and proposed API surface sections.
For this API in particular, the exposure of the System.Runtime.Intrinsic types and the methods for converting to/from Vector<T>
should unblock users who immediately need this functionality.
As mentioned in dotnet/runtime#14281, a lot of useful SIMD code relies on shuffling data around within registers. There are a whole bunch of instructions available to do this, so we'd like to expose a coherent subset of these via Vector.
Rationale and Usage
Simple stream processing, such as the operations already supported on Vector, are entirely parallel and thus don't benefit from shuffling operations. However, lots of common uses of SIMD do require it, such as doing geometric vector operations like matrix multiplies and vector cross products. These are key scenarios for certain users, such as game developers.
If rearranging elements is required for an algorithm and fast hardware support is not provided, the cost of loading and storing to and from SIMD registers might exceed any benefits gained from using SIMD in the first place.
When looking at existing libraries that wrap SIMD functionality at a low level, it's apparent that there are a few main use cases of shuffle:
Rather than expose one single shuffle method directly, I think it makes sense to go with this slightly higher level functionality such that the API doesn't become too tied to a particular architecture's instruction set.
Proposed API
New methods on the Vector struct:
New methods on Vector class that are simple wrappers around the Vector methods.
Details and Questions