Open EgorBo opened 4 years ago
Tagging @tannergooding as an area owner
We should likely have another that takes 2 arguments for use with SHUFPD
(_MM_SHUFFLE2
in C++). It may be worth a convenience overload for PSHUB
which just calls Vector128.Create(byte, byte, ..., byte)
(discoverability).
I'm not sure if CreateShuffleMask
is the "best" name. The mask is also used for permutes and other instructions. I don't have a proposed alternative but CreateShuffleMask
is my favorite of the proposed/alternatives.
However, inlining doesn't happen with jit minOpts (tier0) so the API can slightly regress such tier0 code (an additional jump-table is emitted). A possible solution for this problem is to convert it into an intrinsic but I don't think it makes sense since if it's in a hot path it will be recompiled to tier1 anyway and inlined/folded.
It's trivial to do this in importation and I'd say we just do so. It will improve codegen and reduce the overall size of the IR trees.
Also, I suggest we keep the same order indices are passed to _MM_SHUFFLE in C/C++ to make life easier for those who port existing SIMD algorithms to C#.
I don't think I agree here. We've explicitly swapped the ordering elsewhere (such as in the Vector128.Create
APIs) to match the "normal" .NET ordering (starting from index 0 and going up).
It's likely just one of the things users will need to account for when transitioning to .NET and should be easily caught.
I don't think I agree here.
I agree with @tannergooding. Within .NET it should be consistent and it's a easy mental tranistion. Having some APIs starting at 0, others ending at 0 feels awkward.
The mask is also used for permutes and other instructions
The same would apply to _MM_SHUFFLE
. I think it's fine to have a "link" in the name to the C++ macro.
The same would apply to _MM_SHUFFLE. I think it's fine to have a "link" in the name to the C++ macro.
Right, the difference here though is that _MM_SHUFFLE
was designed when SSE
was the only thing around.
Permute
wasn't introduced until much later and just happens to take the same mask/control input.
If someone has a good name suggestion, we might be interested in taking that instead. Otherwise, it's probably not a big deal and people can figure it out (or we can also propose CreatePermuteMask
for readability/discoverability).
I'd just go with CreateByteMask
or CreateMaskByte
. We'll have larger masks if we ever implement AVX-512. And things like blendps
also take a byte mask, so making explicit names for each mask purpose could get out of hand.
And agree on the order. Actual vector order makes way more sense for the masks. I find when porting code, it's super easy to translate _MM_SHUFFLE to a binary literal, but when I'm writing new code, I have to pause to think about reversing the order when I write the mask.
starting from index 0 and going up
Ok, makes sense, should I update the api proposal text or comments are basically part of it?
I'd just go with CreateByteMask or CreateMaskByte.
Yeah these look better 👍
I'd just go with CreateByteMask or CreateMaskByte.
I'm not sure CreateByteMask
accurately describes what is being done here nor that it will retain its meaning alongside future ISAs.
We'll have larger masks if we ever implement AVX-512.
As for AVX-512, it has an entirely new masking system with new registers that won't be able to be tracked by the existing types.
And things like blendps also take a byte mask, so making explicit names for each mask purpose could get out of hand.
blendps
also takes a different imm8
format, so it needs a custom method anyways.
For blendps
it is 1 bit per entry and specifies left
(0) vs right
(1)
For shufps
it is 2 bits per entry and specifies which element (0 through 3)
For dpps
it uses 4 bits to specify which elements are multiplied and 4 bits to specify which are added.
There are then other instructions (aeskeygenassist
, cmpps
, extractps
, etc) which take an imm8 that is used for other scenarios.
And there are other instructions (mpsadbw
, etc) which take an imm8 as a control word but differ from the above three's behavior.
It's also worth noting that we use control
vs mask
in the API names, so it might be worth taking that into account as part of the proposal.
I'm not sure CreateByteMask accurately describes what is being done here nor that it will retain its meaning alongside future ISAs.
My point was that it's kind of strange to have a helper to create shuffle/permute masks if there's no helper for others mask types.
Going back to the rationale for needing this helper in the first place, the masks are often demystified simply by writing them as binary literals. For example, 0b_01_01_01_01
is much easier to follow than 0x55
. Likewise, 0b_00_01_10_11
is easier to follow than 27
or 0x1B
. This is true of dpps
masks as well, where you might write them as a binary literal in groupings of 4 bits, e.g. 0b_1111_0001
. The only real advantage a helper would have over that would be to swap the bit/crumb/nibble order to match the actual vector order.
So I was suggesting that a general-purpose CreateByteMask
that accepts either 4 crumbs or 8 bits and reverses the order. Slightly more readable than the binary literal without being overly specific.
I think at that point it isn't much effort to just manually reverse the binary literal.
The purpose of these is so you can construct an appropriate mask in a "readable" fashion using regular "decimal" values. They also help in porting native code that is already using the "equivalent" helpers.
@tannergooding Should this be added to 5.0 milestone, or Future? (or closed)
This can probably be marked Future
. I think it is likely worth adding as it greatly improves readability for these intrinsics and an "equivalent" is provided by native.
We are currently adding comments effectively describing the above in our own S.P.Corelib code anyways and that would be the recommendation we tell people otherwise.
This probably requires some more discussion around how to expose the functionality and whether having just the Shuffle
version provided by native is sufficient or if we should also provide versions for the other masks (or maybe overloads that take each mask part, etc).
Also, I suggest we keep the same order indices are passed to _MM_SHUFFLE in C/C++ to make life easier for those who port existing SIMD algorithms to C#.
I would tend to agree with @EgorBo here. As someone who is learning all this stuff for the first time (2 weeks in) I've found that it's important to be able to reference existing examples in other languages (especially since the MS docs simply name the intrinsics as defined in the Intel references). However, and it's a big however, _MM_SHUFFLE
is entirely unintuitive (for me anyway) when used in other cases such as blending and I would love to see specific methods to help us generate the correct byte mask/control for each scenario.
I've already implemented _MM_SHUFFLE
in my own codebase and also a helper for generating spans for the overloads that accept Vector128<byte>
and Vector256<byte>
masks.
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static byte MmShuffle(byte p3, byte p2, byte p1, byte p0)
=> (byte)((p3 << 6) | (p2 << 4) | (p1 << 2) | p0);
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static void MmShuffleSpan(ref Span<byte> span, byte control)
{
InverseMmShuffle(
control,
out int p3,
out int p2,
out int p1,
out int p0);
ref byte spanBase = ref MemoryMarshal.GetReference(span);
for (int i = 0; i < span.Length; i += 4)
{
Unsafe.Add(ref spanBase, i) = (byte)(p0 + i);
Unsafe.Add(ref spanBase, i + 1) = (byte)(p1 + i);
Unsafe.Add(ref spanBase, i + 2) = (byte)(p2 + i);
Unsafe.Add(ref spanBase, i + 3) = (byte)(p3 + i);
}
}
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static void InverseMmShuffle(
byte control,
out int p3,
out int p2,
out int p1,
out int p0)
{
p3 = control >> 6 & 0x3;
p2 = control >> 4 & 0x3;
p1 = control >> 2 & 0x3;
p0 = control >> 0 & 0x3;
}
Calling the span generating method is a little clunky but I'm posting it as an example of the kind of methods developers will want and find extremely useful.
Span<byte> bytes = stackalloc byte[Vector256<byte>.Count];
Shuffle.MmShuffleSpan(ref bytes, control);
Vector256<byte> vmask = Unsafe.As<byte, Vector256<byte>>(ref MemoryMarshal.GetReference(bytes));
I'm afraid I can't help with naming here as I've not been exposed yet to the full range of use cases for control/mask generation.
Rationale
Shuffle
is a quite popular API in the SIMD world, and in some of the Shuffle APIs masks are passed as a constant argument ofbyte
type. Let's say we have a vector and we want to reverse it:Current approach we use in the BCL for Matrix4x4 and e.g. in ML.NET is to pre-calculate that mask-byte by hands and leave as a magic constant (can you say what
0x55
does?) e.g.:However, in the C/C++ world there is a special macro for that in
xmmintrin.h
-_MM_SHUFFLE
so a C++ equivalent of the operation above is:So, the idea is to expose
_MM_SHUFFLE
to C# (and, perhaps,_MM_SHUFFLE2
).Proposed API
Alternative names:
ShuffleParameter
,MakeShuffleMask
.Notes
For better codegen it's expected that this API is used only with constant arguments and the whole function is always inlineable. However, inlining doesn't happen with jit minOpts (tier0) so the API can slightly regress such tier0 code (an additional jump-table is emitted). A possible solution for this problem is to convert it into an intrinsic but I don't think it makes sense since if it's in a hot path it will be recompiled to tier1 anyway and inlined/folded.
Also, I suggest we keep the same order indices are passed to
_MM_SHUFFLE
in C/C++ to make life easier for those who port existing SIMD algorithms to C#./cc @tannergooding