dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
15.17k stars 4.72k forks source link

Add Sse.CreateShuffleMask (expose _MM_SHUFFLE) API #34714

Open EgorBo opened 4 years ago

EgorBo commented 4 years ago

Rationale

Shuffle is a quite popular API in the SIMD world, and in some of the Shuffle APIs masks are passed as a constant argument of byte type. Let's say we have a vector and we want to reverse it:

Vector128<int> vector = Vector128.Create(10, 20, 30, 40);

Current approach we use in the BCL for Matrix4x4 and e.g. in ML.NET is to pre-calculate that mask-byte by hands and leave as a magic constant (can you say what 0x55 does?) e.g.:

Vector128<int> reversed = Sse2.Shuffle(vector, 27);

However, in the C/C++ world there is a special macro for that in xmmintrin.h - _MM_SHUFFLE so a C++ equivalent of the operation above is:

__m128i reversed = _mm_shuffle_epi32 (vector, _MM_SHUFFLE(0,1,2,3)); // 27
// 0,1,2,3 are basically indices (in a reversed order)

So, the idea is to expose _MM_SHUFFLE to C# (and, perhaps, _MM_SHUFFLE2).

Proposed API

    public abstract class Sse
    {
        internal Sse() { }

        public static bool IsSupported { get => IsSupported; }

+       [MethodImpl(MethodImplOptions.AggressiveInlining)]
+       public static byte CreateShuffleMask(byte p3, byte p2, byte p1, byte p0)
+       {
+           return (byte)((p3 << 6) | (p2 << 4) | (p1 << 2) | p0);
+       }

Alternative names: ShuffleParameter, MakeShuffleMask.

Notes

For better codegen it's expected that this API is used only with constant arguments and the whole function is always inlineable. However, inlining doesn't happen with jit minOpts (tier0) so the API can slightly regress such tier0 code (an additional jump-table is emitted). A possible solution for this problem is to convert it into an intrinsic but I don't think it makes sense since if it's in a hot path it will be recompiled to tier1 anyway and inlined/folded.

Also, I suggest we keep the same order indices are passed to _MM_SHUFFLE in C/C++ to make life easier for those who port existing SIMD algorithms to C#.

/cc @tannergooding

ghost commented 4 years ago

Tagging @tannergooding as an area owner

tannergooding commented 4 years ago

We should likely have another that takes 2 arguments for use with SHUFPD (_MM_SHUFFLE2 in C++). It may be worth a convenience overload for PSHUB which just calls Vector128.Create(byte, byte, ..., byte) (discoverability).

I'm not sure if CreateShuffleMask is the "best" name. The mask is also used for permutes and other instructions. I don't have a proposed alternative but CreateShuffleMask is my favorite of the proposed/alternatives.

However, inlining doesn't happen with jit minOpts (tier0) so the API can slightly regress such tier0 code (an additional jump-table is emitted). A possible solution for this problem is to convert it into an intrinsic but I don't think it makes sense since if it's in a hot path it will be recompiled to tier1 anyway and inlined/folded.

It's trivial to do this in importation and I'd say we just do so. It will improve codegen and reduce the overall size of the IR trees.

Also, I suggest we keep the same order indices are passed to _MM_SHUFFLE in C/C++ to make life easier for those who port existing SIMD algorithms to C#.

I don't think I agree here. We've explicitly swapped the ordering elsewhere (such as in the Vector128.Create APIs) to match the "normal" .NET ordering (starting from index 0 and going up). It's likely just one of the things users will need to account for when transitioning to .NET and should be easily caught.

gfoidl commented 4 years ago

I don't think I agree here.

I agree with @tannergooding. Within .NET it should be consistent and it's a easy mental tranistion. Having some APIs starting at 0, others ending at 0 feels awkward.

The mask is also used for permutes and other instructions

The same would apply to _MM_SHUFFLE. I think it's fine to have a "link" in the name to the C++ macro.

tannergooding commented 4 years ago

The same would apply to _MM_SHUFFLE. I think it's fine to have a "link" in the name to the C++ macro.

Right, the difference here though is that _MM_SHUFFLE was designed when SSE was the only thing around. Permute wasn't introduced until much later and just happens to take the same mask/control input.

If someone has a good name suggestion, we might be interested in taking that instead. Otherwise, it's probably not a big deal and people can figure it out (or we can also propose CreatePermuteMask for readability/discoverability).

saucecontrol commented 4 years ago

I'd just go with CreateByteMask or CreateMaskByte. We'll have larger masks if we ever implement AVX-512. And things like blendps also take a byte mask, so making explicit names for each mask purpose could get out of hand.

And agree on the order. Actual vector order makes way more sense for the masks. I find when porting code, it's super easy to translate _MM_SHUFFLE to a binary literal, but when I'm writing new code, I have to pause to think about reversing the order when I write the mask.

EgorBo commented 4 years ago

starting from index 0 and going up

Ok, makes sense, should I update the api proposal text or comments are basically part of it?

I'd just go with CreateByteMask or CreateMaskByte.

Yeah these look better 👍

tannergooding commented 4 years ago

I'd just go with CreateByteMask or CreateMaskByte.

I'm not sure CreateByteMask accurately describes what is being done here nor that it will retain its meaning alongside future ISAs.

We'll have larger masks if we ever implement AVX-512.

As for AVX-512, it has an entirely new masking system with new registers that won't be able to be tracked by the existing types.

And things like blendps also take a byte mask, so making explicit names for each mask purpose could get out of hand.

blendps also takes a different imm8 format, so it needs a custom method anyways.

For blendps it is 1 bit per entry and specifies left (0) vs right (1) For shufps it is 2 bits per entry and specifies which element (0 through 3) For dpps it uses 4 bits to specify which elements are multiplied and 4 bits to specify which are added.

There are then other instructions (aeskeygenassist, cmpps, extractps, etc) which take an imm8 that is used for other scenarios. And there are other instructions (mpsadbw, etc) which take an imm8 as a control word but differ from the above three's behavior.

It's also worth noting that we use control vs mask in the API names, so it might be worth taking that into account as part of the proposal.

saucecontrol commented 4 years ago

I'm not sure CreateByteMask accurately describes what is being done here nor that it will retain its meaning alongside future ISAs.

My point was that it's kind of strange to have a helper to create shuffle/permute masks if there's no helper for others mask types.

Going back to the rationale for needing this helper in the first place, the masks are often demystified simply by writing them as binary literals. For example, 0b_01_01_01_01 is much easier to follow than 0x55. Likewise, 0b_00_01_10_11 is easier to follow than 27 or 0x1B. This is true of dpps masks as well, where you might write them as a binary literal in groupings of 4 bits, e.g. 0b_1111_0001. The only real advantage a helper would have over that would be to swap the bit/crumb/nibble order to match the actual vector order.

So I was suggesting that a general-purpose CreateByteMask that accepts either 4 crumbs or 8 bits and reverses the order. Slightly more readable than the binary literal without being overly specific.

tannergooding commented 4 years ago

I think at that point it isn't much effort to just manually reverse the binary literal.

The purpose of these is so you can construct an appropriate mask in a "readable" fashion using regular "decimal" values. They also help in porting native code that is already using the "equivalent" helpers.

BruceForstall commented 4 years ago

@tannergooding Should this be added to 5.0 milestone, or Future? (or closed)

tannergooding commented 4 years ago

This can probably be marked Future. I think it is likely worth adding as it greatly improves readability for these intrinsics and an "equivalent" is provided by native. We are currently adding comments effectively describing the above in our own S.P.Corelib code anyways and that would be the recommendation we tell people otherwise.

This probably requires some more discussion around how to expose the functionality and whether having just the Shuffle version provided by native is sufficient or if we should also provide versions for the other masks (or maybe overloads that take each mask part, etc).

JimBobSquarePants commented 3 years ago

Also, I suggest we keep the same order indices are passed to _MM_SHUFFLE in C/C++ to make life easier for those who port existing SIMD algorithms to C#.

I would tend to agree with @EgorBo here. As someone who is learning all this stuff for the first time (2 weeks in) I've found that it's important to be able to reference existing examples in other languages (especially since the MS docs simply name the intrinsics as defined in the Intel references). However, and it's a big however, _MM_SHUFFLE is entirely unintuitive (for me anyway) when used in other cases such as blending and I would love to see specific methods to help us generate the correct byte mask/control for each scenario.

I've already implemented _MM_SHUFFLE in my own codebase and also a helper for generating spans for the overloads that accept Vector128<byte> and Vector256<byte> masks.

[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static byte MmShuffle(byte p3, byte p2, byte p1, byte p0)
    => (byte)((p3 << 6) | (p2 << 4) | (p1 << 2) | p0);

[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static void MmShuffleSpan(ref Span<byte> span, byte control)
{
    InverseMmShuffle(
            control,
            out int p3,
            out int p2,
            out int p1,
            out int p0);

    ref byte spanBase = ref MemoryMarshal.GetReference(span);

    for (int i = 0; i < span.Length; i += 4)
    {
        Unsafe.Add(ref spanBase, i) = (byte)(p0 + i);
        Unsafe.Add(ref spanBase, i + 1) = (byte)(p1 + i);
        Unsafe.Add(ref spanBase, i + 2) = (byte)(p2 + i);
        Unsafe.Add(ref spanBase, i + 3) = (byte)(p3 + i);
    }
}

[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static void InverseMmShuffle(
    byte control,
    out int p3,
    out int p2,
    out int p1,
    out int p0)
{
    p3 = control >> 6 & 0x3;
    p2 = control >> 4 & 0x3;
    p1 = control >> 2 & 0x3;
    p0 = control >> 0 & 0x3;
}

Calling the span generating method is a little clunky but I'm posting it as an example of the kind of methods developers will want and find extremely useful.

Span<byte> bytes = stackalloc byte[Vector256<byte>.Count];
Shuffle.MmShuffleSpan(ref bytes, control);
Vector256<byte> vmask = Unsafe.As<byte, Vector256<byte>>(ref MemoryMarshal.GetReference(bytes));

I'm afraid I can't help with naming here as I've not been exposed yet to the full range of use cases for control/mask generation.