dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

[API Proposal]: AVX-512 IFMA Intrinsics #96476

Open MineCake147E opened 8 months ago

MineCake147E commented 8 months ago

Background and motivation

AVX-512 IFMA is supported by Intel starting with the Cannon Lake architecture, and by AMD starting with Zen 4. These instructions are useful for cryptography and large-number arithmetic, and they offer a faster (if more restricted) alternative to the `VPMULLQ` instruction, which runs about 5x slower on Intel CPUs than on AMD Zen 4, whereas `VPMADD52LUQ` completes in only 4 clock cycles.
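Per Intel's documentation, each lane of `VPMADD52LUQ`/`VPMADD52HUQ` multiplies the low 52 bits of two unsigned operands and adds the low (respectively high) 52 bits of the 104-bit product to a 64-bit accumulator. A scalar Python model of one lane (the helper names are illustrative, not part of the proposal):

```python
MASK52 = (1 << 52) - 1
MASK64 = (1 << 64) - 1

def madd52lo(c, a, b):
    """Model of one VPMADD52LUQ lane: 64-bit add of c and the low
    52 bits of (a[51:0] * b[51:0])."""
    prod = (a & MASK52) * (b & MASK52)   # full 104-bit product
    return (c + (prod & MASK52)) & MASK64

def madd52hi(c, a, b):
    """Model of one VPMADD52HUQ lane: 64-bit add of c and the high
    52 bits of the same 104-bit product."""
    prod = (a & MASK52) * (b & MASK52)
    return (c + (prod >> 52)) & MASK64

# With zeroed accumulators, the two halves recover the exact 104-bit product:
a, b = (1 << 52) - 1, 0x3_1415_9265_3589
assert (madd52hi(0, a, b) << 52) | madd52lo(0, a, b) == (a & MASK52) * (b & MASK52)
```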

API Proposal

```csharp
namespace System.Runtime.Intrinsics.X86
{
    public abstract class Avx512Ifma : Avx512F
    {
        public static bool IsSupported { get; }
        public static Vector512<ulong> MultiplyAdd52Low(Vector512<ulong> a, Vector512<ulong> b, Vector512<ulong> c);
        public static Vector512<ulong> MultiplyAdd52High(Vector512<ulong> a, Vector512<ulong> b, Vector512<ulong> c);
        public abstract class VL : Avx512F.VL
        {
            public static new bool IsSupported { get; }
            public static Vector256<ulong> MultiplyAdd52Low(Vector256<ulong> a, Vector256<ulong> b, Vector256<ulong> c);
            public static Vector256<ulong> MultiplyAdd52High(Vector256<ulong> a, Vector256<ulong> b, Vector256<ulong> c);
            public static Vector128<ulong> MultiplyAdd52Low(Vector128<ulong> a, Vector128<ulong> b, Vector128<ulong> c);
            public static Vector128<ulong> MultiplyAdd52High(Vector128<ulong> a, Vector128<ulong> b, Vector128<ulong> c);
        }
    }
}
```

API Usage

```csharp
zmm0 = Avx512Ifma.MultiplyAdd52Low(zmm0, zmm2, zmm3);
zmm1 = Avx512Ifma.MultiplyAdd52High(zmm1, zmm2, zmm3);
```

An example of a vectorized Montgomery reduction implementation using the equivalent C++ intrinsics:

https://github.com/intel/hexl/blob/2d196fdd71f24511bd7e0e23dc07d37c888f53e7/hexl/util/avx512-util.hpp#L384-L411
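For the large-number use case, the 52-bit split these instructions perform can be sketched in scalar form. The following is a hedged Python model (no intrinsics; names are illustrative) of schoolbook multiplication on 52-bit limbs, where the two accumulation steps correspond to what `VPMADD52LUQ` and `VPMADD52HUQ` would do per lane:

```python
MASK52 = (1 << 52) - 1

def to_limbs(x, n):
    """Split x into n 52-bit limbs, least significant first."""
    return [(x >> (52 * i)) & MASK52 for i in range(n)]

def from_limbs(limbs):
    return sum(v << (52 * i) for i, v in enumerate(limbs))

def bigmul(x, y, n):
    """Schoolbook multiply on 52-bit limbs. Each 104-bit partial product
    is split into low/high 52-bit halves, the same split that
    VPMADD52LUQ / VPMADD52HUQ perform per lane."""
    xs, ys = to_limbs(x, n), to_limbs(y, n)
    acc = [0] * (2 * n)
    for i in range(n):
        for j in range(n):
            prod = xs[i] * ys[j]             # up to 104 bits
            acc[i + j] += prod & MASK52      # madd52lo step
            acc[i + j + 1] += prod >> 52     # madd52hi step
    # propagate the carries that piled up in the 64-bit accumulators
    carry = 0
    for k in range(2 * n):
        acc[k] += carry
        carry = acc[k] >> 52
        acc[k] &= MASK52
    return from_limbs(acc)

x, y = 2**100 - 3, 2**90 + 7
assert bigmul(x, y, 2) == x * y
```

The point of the 52-bit limb width is that many partial products can be accumulated into a 64-bit lane before any carry propagation is needed.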

Alternative Designs

Risks

None

ghost commented 8 months ago

Tagging subscribers to this area: @dotnet/area-system-runtime-intrinsics See info in area-owners.md if you want to be subscribed.

Issue Details

### Background and motivation

`AVX-512 IFMA` is supported by Intel starting with the Cannon Lake architecture, and by AMD starting with Zen 4. These instructions are useful for cryptography and large-number arithmetic, and they offer a faster (if more restricted) alternative to the `VPMULLQ` instruction, which runs about 5x slower on Intel CPUs than on AMD Zen 4, whereas `VPMADD52LUQ` completes in only 4 clock cycles.

### API Proposal

```csharp
namespace System.Runtime.Intrinsics.X86
{
    public abstract class Avx512Ifma : Avx512F
    {
        public static bool IsSupported { get; }
        public static Vector512<ulong> FusedMultiplyUInt52LowAddUInt64(Vector512<ulong> a, Vector512<ulong> b, Vector512<ulong> c);
        public static Vector512<ulong> FusedMultiplyUInt52HighAddUInt64(Vector512<ulong> a, Vector512<ulong> b, Vector512<ulong> c);
        public abstract class VL : Avx512F.VL
        {
            public static new bool IsSupported { get; }
            public static Vector256<ulong> FusedMultiplyUInt52LowAddUInt64(Vector256<ulong> a, Vector256<ulong> b, Vector256<ulong> c);
            public static Vector256<ulong> FusedMultiplyUInt52HighAddUInt64(Vector256<ulong> a, Vector256<ulong> b, Vector256<ulong> c);
            public static Vector128<ulong> FusedMultiplyUInt52LowAddUInt64(Vector128<ulong> a, Vector128<ulong> b, Vector128<ulong> c);
            public static Vector128<ulong> FusedMultiplyUInt52HighAddUInt64(Vector128<ulong> a, Vector128<ulong> b, Vector128<ulong> c);
        }
    }
}
```

### API Usage

```csharp
zmm0 = Avx512Ifma.FusedMultiplyUInt52LowAddUInt64(zmm0, zmm2, zmm3);
zmm1 = Avx512Ifma.FusedMultiplyUInt52HighAddUInt64(zmm1, zmm2, zmm3);
```

An example of a vectorized Montgomery reduction implementation using the equivalent C++ intrinsics:

https://github.com/intel/hexl/blob/2d196fdd71f24511bd7e0e23dc07d37c888f53e7/hexl/util/avx512-util.hpp#L384-L411

### Alternative Designs

- Alternative Names
  - `VPMADD52LUQ` (`FusedMultiplyUInt52LowAddUInt64`)
    - `FusedMultiplyAddLowUInt52`
  - `VPMADD52HUQ` (`FusedMultiplyUInt52HighAddUInt64`)
    - `FusedMultiplyAddHighUInt52`

### Risks

None
Author: MineCake147E
Assignees: -
Labels: `api-suggestion`, `area-System.Runtime.Intrinsics`
Milestone: -

BruceForstall commented 8 months ago

@dotnet/avx512-contrib

tannergooding commented 8 months ago

The names here could be "better". I'd expect simply `MultiplyAdd52Low` and `MultiplyAdd52High` (or similar) would be sufficient, and they more closely match the "name" portion of the underlying C/C++ intrinsics `_mm512_madd52lo_epu64` and `_mm512_madd52hi_epu64` (the name portions being `madd52lo` and `madd52hi`).

These are not "fused" operations: performing the multiply and the add as separate scalar steps produces the same result as the combined instruction (unlike floating-point FMA, where fusing avoids an intermediate rounding).
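This can be illustrated with a scalar model (a hedged sketch; helper names are illustrative): the combined form and the two-step form are bit-identical for every input, because nothing in integer arithmetic rounds an intermediate result beyond the architecturally defined 52-bit split.

```python
import random

MASK52 = (1 << 52) - 1
MASK64 = (1 << 64) - 1

def madd52lo_combined(c, a, b):
    """One expression, mirroring the single instruction."""
    return (c + (((a & MASK52) * (b & MASK52)) & MASK52)) & MASK64

def madd52lo_separate(c, a, b):
    """The same computation as two discrete steps."""
    prod_lo = ((a & MASK52) * (b & MASK52)) & MASK52  # multiply, keep low 52
    return (c + prod_lo) & MASK64                     # then a 64-bit add

random.seed(42)
for _ in range(10_000):
    c, a, b = (random.getrandbits(64) for _ in range(3))
    assert madd52lo_combined(c, a, b) == madd52lo_separate(c, a, b)
```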

MineCake147E commented 7 months ago

I updated the proposal accordingly.

terrajobst commented 6 months ago

Video

```csharp
namespace System.Runtime.Intrinsics.X86
{
    public abstract class Avx512Ifma : Avx512F
    {
        public static bool IsSupported { get; }

        public static Vector512<ulong> MultiplyAdd52Low(Vector512<ulong> addend, Vector512<ulong> left, Vector512<ulong> right);
        public static Vector512<ulong> MultiplyAdd52High(Vector512<ulong> addend, Vector512<ulong> left, Vector512<ulong> right);

        public abstract class VL : Avx512F.VL
        {
            public static new bool IsSupported { get; }
            public static Vector256<ulong> MultiplyAdd52Low(Vector256<ulong> addend, Vector256<ulong> left, Vector256<ulong> right);
            public static Vector256<ulong> MultiplyAdd52High(Vector256<ulong> addend, Vector256<ulong> left, Vector256<ulong> right);
            public static Vector128<ulong> MultiplyAdd52Low(Vector128<ulong> addend, Vector128<ulong> left, Vector128<ulong> right);
            public static Vector128<ulong> MultiplyAdd52High(Vector128<ulong> addend, Vector128<ulong> left, Vector128<ulong> right);
        }
    }
}
```