[API Proposal]: GFNI Intrinsics

MineCake147E commented 10 months ago

Background and motivation

GFNI is supported by Intel on Ice Lake and newer architectures, and by AMD on Zen 4. These instructions are known to be useful for cryptography and bit manipulation. For example, an efficient bit reversal can be implemented with them.

API Proposal

namespace System.Runtime.Intrinsics.X86;

public abstract class Gfni : Sse41
{
    public static bool IsSupported { get; }

    public static Vector128<byte> GaloisFieldAffineTransformInverse(Vector128<byte> x, Vector128<byte> a, [ConstantExpected] byte b);
    public static Vector128<byte> GaloisFieldAffineTransform(Vector128<byte> x, Vector128<byte> a, [ConstantExpected] byte b);
    public static Vector128<byte> GaloisFieldMultiply(Vector128<byte> left, Vector128<byte> right);

    public abstract class X64 : Sse41.X64
    {
        public static bool IsSupported { get; }
    }

    public abstract class V256
    {
        public static new bool IsSupported { get; }

        public static Vector256<byte> GaloisFieldAffineTransformInverse(Vector256<byte> x, Vector256<byte> a, [ConstantExpected] byte b);
        public static Vector256<byte> GaloisFieldAffineTransform(Vector256<byte> x, Vector256<byte> a, [ConstantExpected] byte b);
        public static Vector256<byte> GaloisFieldMultiply(Vector256<byte> left, Vector256<byte> right);
    }

    public abstract class V512
    {
        public static new bool IsSupported { get; }

        public static Vector512<byte> GaloisFieldAffineTransformInverse(Vector512<byte> x, Vector512<byte> a, [ConstantExpected] byte b);
        public static Vector512<byte> GaloisFieldAffineTransform(Vector512<byte> x, Vector512<byte> a, [ConstantExpected] byte b);
        public static Vector512<byte> GaloisFieldMultiply(Vector512<byte> left, Vector512<byte> right);
    }
}
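
For readers unfamiliar with the underlying instruction, here is a minimal scalar sketch of what GaloisFieldMultiply would compute for each pair of bytes, assuming the GF2P8MULB definition (multiplication in GF(2^8) reduced by x^8 + x^4 + x^3 + x + 1). The class and method names are hypothetical and are not part of the proposal.

```csharp
using System;

// Hypothetical scalar model; not part of the proposed API surface.
public static class GfniMultiplyReference
{
    // Multiplies two bytes as polynomials over GF(2), reducing modulo
    // x^8 + x^4 + x^3 + x + 1 (0x11B), the polynomial GF2P8MULB is assumed to use.
    public static byte Multiply(byte left, byte right)
    {
        int a = left;
        int product = 0;
        for (int i = 0; i < 8; i++)
        {
            // Add (XOR) the current multiple of 'left' if bit i of 'right' is set.
            if (((right >> i) & 1) != 0)
            {
                product ^= a;
            }

            // Multiply 'a' by x and reduce when it overflows 8 bits.
            a <<= 1;
            if ((a & 0x100) != 0)
            {
                a ^= 0x11B;
            }
        }
        return (byte)product;
    }
}
```

The vector intrinsic would apply this to each byte position independently; the well-known FIPS-197 example {57} * {83} = {c1} is a convenient sanity check.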

API Usage

// https://wunkolo.github.io/post/2020/11/gf2p8affineqb-bit-reversal/
public static Vector128<byte> ReverseBits128(Vector128<byte> value)
{
    var xmm0 = Gfni.GaloisFieldAffineTransform(value, Vector128.Create(0b10000000_01000000_00100000_00010000_00001000_00000100_00000010_00000001ul).AsByte(), 0);
    return Ssse3.Shuffle(xmm0, Vector128.Create(15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, byte.MinValue));
}

Alternative Designs

No response

Risks

No response

PaulusParssinen commented 10 months ago

Here are more unexpected uses for the Galois Field Affine Transformation instruction, collected by animetosho 👍

saucecontrol commented 9 months ago

Should these be named Gfni128, Gfni256, and Gfni512 to be consistent with Pclmulqdq256 and Pclmulqdq512? The ISA support flags work the same way with GFNI.

Same thing with Avx512F.VL overloads mirroring AvxGfni/Gfni256. They probably don't need to be there, as they were skipped with VPCLMULQDQ.

terrajobst commented 8 months ago

Video

Looks good as proposed

namespace System.Runtime.Intrinsics.X86;

public abstract class Avx512Gfni : Avx512F
{
    public static bool IsSupported { get; }

    public static Vector512<byte> GaloisFieldAffineTransformInverse(Vector512<byte> x, Vector512<byte> a, [ConstantExpected] byte b);
    public static Vector512<byte> GaloisFieldAffineTransform(Vector512<byte> x, Vector512<byte> a, [ConstantExpected] byte b);
    public static Vector512<byte> GaloisFieldMultiply(Vector512<byte> left, Vector512<byte> right);

    public abstract class VL : Avx512F.VL
    {
        public static new bool IsSupported { get; }

        public static Vector256<byte> GaloisFieldAffineTransformInverse(Vector256<byte> x, Vector256<byte> a, [ConstantExpected] byte b);
        public static Vector128<byte> GaloisFieldAffineTransformInverse(Vector128<byte> x, Vector128<byte> a, [ConstantExpected] byte b);
        public static Vector256<byte> GaloisFieldAffineTransform(Vector256<byte> x, Vector256<byte> a, [ConstantExpected] byte b);
        public static Vector128<byte> GaloisFieldAffineTransform(Vector128<byte> x, Vector128<byte> a, [ConstantExpected] byte b);
        public static Vector256<byte> GaloisFieldMultiply(Vector256<byte> left, Vector256<byte> right);
        public static Vector128<byte> GaloisFieldMultiply(Vector128<byte> left, Vector128<byte> right);
    }
}

public abstract class AvxGfni : Avx
{
    public static bool IsSupported { get; }

    public static Vector256<byte> GaloisFieldAffineTransformInverse(Vector256<byte> x, Vector256<byte> a, [ConstantExpected] byte b);
    public static Vector256<byte> GaloisFieldAffineTransform(Vector256<byte> x, Vector256<byte> a, [ConstantExpected] byte b);
    public static Vector256<byte> GaloisFieldMultiply(Vector256<byte> left, Vector256<byte> right);
}

public abstract class Gfni : Sse41
{
    public static bool IsSupported { get; }

    public static Vector128<byte> GaloisFieldAffineTransformInverse(Vector128<byte> x, Vector128<byte> a, [ConstantExpected] byte b);
    public static Vector128<byte> GaloisFieldAffineTransform(Vector128<byte> x, Vector128<byte> a, [ConstantExpected] byte b);
    public static Vector128<byte> GaloisFieldMultiply(Vector128<byte> left, Vector128<byte> right);
}

saucecontrol commented 3 weeks ago

For consistency with the AVX10 surface (and #86952), this should probably be revised to

namespace System.Runtime.Intrinsics.X86;

public abstract class Gfni : Sse41
{
    public static bool IsSupported { get; }

    public static Vector128<byte> GaloisFieldAffineTransformInverse(Vector128<byte> x, Vector128<byte> a, [ConstantExpected] byte b);
    public static Vector128<byte> GaloisFieldAffineTransform(Vector128<byte> x, Vector128<byte> a, [ConstantExpected] byte b);
    public static Vector128<byte> GaloisFieldMultiply(Vector128<byte> left, Vector128<byte> right);

    public abstract class X64 : Sse41.X64
    {
        public static bool IsSupported { get; }
    }

    public abstract class V256
    {
        public static new bool IsSupported { get; }

        public static Vector256<byte> GaloisFieldAffineTransformInverse(Vector256<byte> x, Vector256<byte> a, [ConstantExpected] byte b);
        public static Vector256<byte> GaloisFieldAffineTransform(Vector256<byte> x, Vector256<byte> a, [ConstantExpected] byte b);
        public static Vector256<byte> GaloisFieldMultiply(Vector256<byte> left, Vector256<byte> right);
    }

    public abstract class V512
    {
        public static new bool IsSupported { get; }

        public static Vector512<byte> GaloisFieldAffineTransformInverse(Vector512<byte> x, Vector512<byte> a, [ConstantExpected] byte b);
        public static Vector512<byte> GaloisFieldAffineTransform(Vector512<byte> x, Vector512<byte> a, [ConstantExpected] byte b);
        public static Vector512<byte> GaloisFieldMultiply(Vector512<byte> left, Vector512<byte> right);
    }
}

Also, the affine transform ops treat the second operand as an 8x8-bit matrix, and the C intrinsics are named to indicate that one operand is a vector of 64-bit values (e.g. _mm_gf2p8affine_epi64_epi8). It might make more sense to define those as Vector128<ulong>, etc., for consistency. With the names x and a matching the C definitions (although the matrix operand is a capital A there), it can be difficult to remember which is which, but having one int64 operand and one int8 operand makes it clearer. And should there be signed overloads?
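
To illustrate why a 64-bit element type is a natural fit, here is a small sketch that packs eight 8-bit rows into a single ulong matrix. It assumes the row order from the instruction's documented pseudocode, where byte 7 - i of the qword produces output bit i. The GfniMatrix class and Pack method are hypothetical helpers, not proposed API.

```csharp
using System;

// Hypothetical helper; shown only to illustrate the 8x8 bit-matrix layout.
public static class GfniMatrix
{
    // rows[i] is the mask whose parity against the input byte yields output bit i.
    // Assumption: the instruction reads the row for output bit i from byte (7 - i).
    public static ulong Pack(ReadOnlySpan<byte> rows)
    {
        if (rows.Length != 8)
        {
            throw new ArgumentException("Expected exactly 8 rows.", nameof(rows));
        }

        ulong packed = 0;
        for (int i = 0; i < 8; i++)
        {
            packed |= (ulong)rows[i] << (8 * (7 - i));
        }
        return packed;
    }
}
```

Under that assumption, packing the rows 0x80, 0x40, ..., 0x01 gives the bit-reversal constant used in the sample at the top of the issue, and packing 0x01, 0x02, ..., 0x80 gives the identity matrix.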

bartonjs commented 3 weeks ago

Video

[ConstantExpected] byte b should be [ConstantExpected] byte control
Otherwise, looks good as proposed

namespace System.Runtime.Intrinsics.X86;

public abstract class Gfni : Sse41
{
    public static bool IsSupported { get; }

    public static Vector128<byte> GaloisFieldAffineTransformInverse(Vector128<byte> x, Vector128<byte> a, [ConstantExpected] byte control);
    public static Vector128<byte> GaloisFieldAffineTransform(Vector128<byte> x, Vector128<byte> a, [ConstantExpected] byte control);
    public static Vector128<byte> GaloisFieldMultiply(Vector128<byte> left, Vector128<byte> right);

    public abstract class X64 : Sse41.X64
    {
        public static bool IsSupported { get; }
    }

    public abstract class V256
    {
        public static new bool IsSupported { get; }

        public static Vector256<byte> GaloisFieldAffineTransformInverse(Vector256<byte> x, Vector256<byte> a, [ConstantExpected] byte control);
        public static Vector256<byte> GaloisFieldAffineTransform(Vector256<byte> x, Vector256<byte> a, [ConstantExpected] byte control);
        public static Vector256<byte> GaloisFieldMultiply(Vector256<byte> left, Vector256<byte> right);
    }

    public abstract class V512
    {
        public static new bool IsSupported { get; }

        public static Vector512<byte> GaloisFieldAffineTransformInverse(Vector512<byte> x, Vector512<byte> a, [ConstantExpected] byte control);
        public static Vector512<byte> GaloisFieldAffineTransform(Vector512<byte> x, Vector512<byte> a, [ConstantExpected] byte control);
        public static Vector512<byte> GaloisFieldMultiply(Vector512<byte> left, Vector512<byte> right);
    }
}

saucecontrol commented 2 weeks ago

I'll implement this one

saucecontrol commented 2 weeks ago

I just got a chance to watch the API review video. It sounds like there was some confusion around the immediate operand for the affine instructions. The documentation defines the affine transform as producing each output byte from the formula A * x + b, where

A is an 8x8 bit matrix vector
x is a byte vector
b is defined as a constant vector, as if the immediate byte were broadcast to all positions

This doesn't fit the pattern of what we typically call a 'control' byte, which might select a lane for processing or give a permute order. Since in this case it's an actual operand in the mathematical definition, it would be clearer if the name matched the documentation. It should be noted that this was discussed during the API review for the original shape, when it was decided to keep the name b.
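
As a rough scalar sketch of the formula above, assuming the documented pseudocode for the affine instruction (output bit i is the parity of matrix byte 7 - i ANDed with x, XORed with bit i of the immediate), a single output byte could be modeled like this. The class and method names are hypothetical.

```csharp
using System;
using System.Numerics;

// Hypothetical scalar model of A * x + b for one byte; not proposed API.
public static class GfniAffineReference
{
    public static byte AffineByte(ulong a, byte x, byte b)
    {
        int result = 0;
        for (int i = 0; i < 8; i++)
        {
            // Row of the 8x8 bit matrix that produces output bit i.
            byte row = (byte)((a >> (8 * (7 - i))) & 0xFF);

            // Dot product over GF(2): parity of the bitwise AND.
            int parity = BitOperations.PopCount((uint)(row & x)) & 1;

            // Adding b over GF(2) is an XOR with the matching immediate bit.
            result |= (parity ^ ((b >> i) & 1)) << i;
        }
        return (byte)result;
    }
}
```

Here b takes part in the arithmetic for every output bit, which is the sense in which it is an operand rather than a control value.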

I also didn't hear any mention of the 8x8 matrix operand's type in the discussion. Typical use, as in the sample given at the top of this issue, would have the same matrix for each 64-bit lane. Example repeated here:

// https://wunkolo.github.io/post/2020/11/gf2p8affineqb-bit-reversal/
public static Vector128<byte> ReverseBits128(Vector128<byte> value)
{
    var xmm0 = Gfni.GaloisFieldAffineTransform(value, Vector128.Create(0b10000000_01000000_00100000_00010000_00001000_00000100_00000010_00000001ul).AsByte(), 0);
    return Ssse3.Shuffle(xmm0, Vector128.Create(15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, byte.MinValue));
}

Note that the sample creates the matrix vector by broadcasting a ulong and then calls AsByte(), so the cast ends up being noise. Likewise, the EVEX instruction encoding supports a 64-bit memory broadcast for the matrix operand. Because it matches both the documentation and the typical use of the instruction more closely, I think it makes more sense to define that operand as VectorXXX<ulong> rather than VectorXXX<byte>.
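
To make the per-lane behaviour concrete, here is a scalar sketch of the sample in which the matrix operand is a sequence of ulong values, one per 64-bit lane. It assumes the documented affine pseudocode and uses hypothetical names throughout; it models the semantics only and does not call the intrinsics.

```csharp
using System;
using System.Numerics;

// Hypothetical scalar model; names are illustrative only.
public static class GfniLaneModel
{
    // One output byte of A * x + b, assuming matrix byte (7 - i) yields output bit i.
    private static byte AffineByte(ulong a, byte x, byte b)
    {
        int result = 0;
        for (int i = 0; i < 8; i++)
        {
            int row = (int)((a >> (8 * (7 - i))) & 0xFF);
            int parity = BitOperations.PopCount((uint)(row & x)) & 1;
            result |= (parity ^ ((b >> i) & 1)) << i;
        }
        return (byte)result;
    }

    // Matrix a[k] is applied to the eight bytes of 64-bit lane k.
    public static byte[] AffineTransform(ReadOnlySpan<byte> x, ReadOnlySpan<ulong> a, byte b)
    {
        var result = new byte[x.Length];
        for (int i = 0; i < x.Length; i++)
        {
            result[i] = AffineByte(a[i / 8], x[i], b);
        }
        return result;
    }

    // Scalar equivalent of the ReverseBits128 sample for a 16-byte input.
    public static byte[] ReverseBits128(ReadOnlySpan<byte> value)
    {
        const ulong reverse = 0x8040201008040201UL;
        ulong[] matrix = { reverse, reverse }; // same matrix broadcast to both lanes
        byte[] result = AffineTransform(value, matrix, 0);
        Array.Reverse(result); // byte-order reversal, the role of the shuffle in the sample
        return result;
    }
}
```

With the matrix expressed as ulong values, the broadcast reads directly and no reinterpreting cast is needed.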

Proposed shape would be:

namespace System.Runtime.Intrinsics.X86;

public abstract class Gfni : Sse2
{
    public static bool IsSupported { get; }

    public static Vector128<byte> GaloisFieldAffineTransformInverse(Vector128<byte> x, Vector128<ulong> a, [ConstantExpected] byte b);
    public static Vector128<byte> GaloisFieldAffineTransform(Vector128<byte> x, Vector128<ulong> a, [ConstantExpected] byte b);
    public static Vector128<byte> GaloisFieldMultiply(Vector128<byte> left, Vector128<byte> right);

    public abstract class X64 : Sse2.X64
    {
        public static bool IsSupported { get; }
    }

    public abstract class V256
    {
        public static new bool IsSupported { get; }

        public static Vector256<byte> GaloisFieldAffineTransformInverse(Vector256<byte> x, Vector256<ulong> a, [ConstantExpected] byte b);
        public static Vector256<byte> GaloisFieldAffineTransform(Vector256<byte> x, Vector256<ulong> a, [ConstantExpected] byte b);
        public static Vector256<byte> GaloisFieldMultiply(Vector256<byte> left, Vector256<byte> right);
    }

    public abstract class V512
    {
        public static new bool IsSupported { get; }

        public static Vector512<byte> GaloisFieldAffineTransformInverse(Vector512<byte> x, Vector512<ulong> a, [ConstantExpected] byte b);
        public static Vector512<byte> GaloisFieldAffineTransform(Vector512<byte> x, Vector512<ulong> a, [ConstantExpected] byte b);
        public static Vector512<byte> GaloisFieldMultiply(Vector512<byte> left, Vector512<byte> right);
    }
}
