dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

[API Proposal]: AVX-512 IFMA Intrinsics #96476

Open MineCake147E opened 8 months ago

MineCake147E commented 8 months ago

Background and motivation

AVX-512 IFMA is supported by Intel starting with the Cannon Lake architecture, and by AMD starting with Zen 4. These instructions are useful for cryptography and large-number arithmetic, and they offer a faster (if more restricted) alternative to the `VPMULLQ` instruction, which runs about 5x slower on Intel CPUs than on AMD Zen 4, whereas `VPMADD52LUQ` completes in only 4 clock cycles.
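Per Intel's documentation, each lane of `VPMADD52LUQ`/`VPMADD52HUQ` multiplies the low 52 bits of two unsigned operands and adds the low (respectively high) 52 bits of the 104-bit product to a 64-bit accumulator. A scalar Python model of one lane (the helper names are illustrative, not part of the proposal):

```python
MASK52 = (1 << 52) - 1
MASK64 = (1 << 64) - 1

def madd52lo(c, a, b):
    """Model of one VPMADD52LUQ lane: 64-bit add of c and the low
    52 bits of (a[51:0] * b[51:0])."""
    prod = (a & MASK52) * (b & MASK52)   # full 104-bit product
    return (c + (prod & MASK52)) & MASK64

def madd52hi(c, a, b):
    """Model of one VPMADD52HUQ lane: 64-bit add of c and the high
    52 bits of the same 104-bit product."""
    prod = (a & MASK52) * (b & MASK52)
    return (c + (prod >> 52)) & MASK64

# With zeroed accumulators, the two halves recover the exact 104-bit product:
a, b = (1 << 52) - 1, 0x3_1415_9265_3589
assert (madd52hi(0, a, b) << 52) | madd52lo(0, a, b) == (a & MASK52) * (b & MASK52)
```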

API Proposal

```csharp
namespace System.Runtime.Intrinsics.X86
{
    public abstract class Avx512Ifma : Avx512F
    {
        public static bool IsSupported { get; }
        public static Vector512<ulong> MultiplyAdd52Low(Vector512<ulong> a, Vector512<ulong> b, Vector512<ulong> c);
        public static Vector512<ulong> MultiplyAdd52High(Vector512<ulong> a, Vector512<ulong> b, Vector512<ulong> c);
        public abstract class VL : Avx512F.VL
        {
            public static new bool IsSupported { get; }
            public static Vector256<ulong> MultiplyAdd52Low(Vector256<ulong> a, Vector256<ulong> b, Vector256<ulong> c);
            public static Vector256<ulong> MultiplyAdd52High(Vector256<ulong> a, Vector256<ulong> b, Vector256<ulong> c);
            public static Vector128<ulong> MultiplyAdd52Low(Vector128<ulong> a, Vector128<ulong> b, Vector128<ulong> c);
            public static Vector128<ulong> MultiplyAdd52High(Vector128<ulong> a, Vector128<ulong> b, Vector128<ulong> c);
        }
    }
}
```

API Usage

```csharp
zmm0 = Avx512Ifma.MultiplyAdd52Low(zmm0, zmm2, zmm3);
zmm1 = Avx512Ifma.MultiplyAdd52High(zmm1, zmm2, zmm3);
```

An example of a vectorized Montgomery reduction implementation using the equivalent C++ intrinsics:

https://github.com/intel/hexl/blob/2d196fdd71f24511bd7e0e23dc07d37c888f53e7/hexl/util/avx512-util.hpp#L384-L411
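For the large-number use case, the 52-bit split these instructions perform can be sketched in scalar form. The following is a hedged Python model (no intrinsics; names are illustrative) of schoolbook multiplication on 52-bit limbs, where the two accumulation steps correspond to what `VPMADD52LUQ` and `VPMADD52HUQ` would do per lane:

```python
MASK52 = (1 << 52) - 1

def to_limbs(x, n):
    """Split x into n 52-bit limbs, least significant first."""
    return [(x >> (52 * i)) & MASK52 for i in range(n)]

def from_limbs(limbs):
    return sum(v << (52 * i) for i, v in enumerate(limbs))

def bigmul(x, y, n):
    """Schoolbook multiply on 52-bit limbs. Each 104-bit partial product
    is split into low/high 52-bit halves, the same split that
    VPMADD52LUQ / VPMADD52HUQ perform per lane."""
    xs, ys = to_limbs(x, n), to_limbs(y, n)
    acc = [0] * (2 * n)
    for i in range(n):
        for j in range(n):
            prod = xs[i] * ys[j]             # up to 104 bits
            acc[i + j] += prod & MASK52      # madd52lo step
            acc[i + j + 1] += prod >> 52     # madd52hi step
    # propagate the carries that piled up in the 64-bit accumulators
    carry = 0
    for k in range(2 * n):
        acc[k] += carry
        carry = acc[k] >> 52
        acc[k] &= MASK52
    return from_limbs(acc)

x, y = 2**100 - 3, 2**90 + 7
assert bigmul(x, y, 2) == x * y
```

The point of the 52-bit limb width is that many partial products can be accumulated into a 64-bit lane before any carry propagation is needed.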

Alternative Designs

Risks

None

ghost commented 8 months ago

Tagging subscribers to this area: @dotnet/area-system-runtime-intrinsics See info in area-owners.md if you want to be subscribed.

Issue Details

### Background and motivation

`AVX-512 IFMA` is supported by Intel starting with the Cannon Lake architecture, and by AMD starting with Zen 4. These instructions are useful for cryptography and large-number arithmetic, and they offer a faster (if more restricted) alternative to the `VPMULLQ` instruction, which runs about 5x slower on Intel CPUs than on AMD Zen 4, whereas `VPMADD52LUQ` completes in only 4 clock cycles.

### API Proposal

```csharp
namespace System.Runtime.Intrinsics.X86
{
    public abstract class Avx512Ifma : Avx512F
    {
        public static bool IsSupported { get; }
        public static Vector512<ulong> FusedMultiplyUInt52LowAddUInt64(Vector512<ulong> a, Vector512<ulong> b, Vector512<ulong> c);
        public static Vector512<ulong> FusedMultiplyUInt52HighAddUInt64(Vector512<ulong> a, Vector512<ulong> b, Vector512<ulong> c);
        public abstract class VL : Avx512F.VL
        {
            public static new bool IsSupported { get; }
            public static Vector256<ulong> FusedMultiplyUInt52LowAddUInt64(Vector256<ulong> a, Vector256<ulong> b, Vector256<ulong> c);
            public static Vector256<ulong> FusedMultiplyUInt52HighAddUInt64(Vector256<ulong> a, Vector256<ulong> b, Vector256<ulong> c);
            public static Vector128<ulong> FusedMultiplyUInt52LowAddUInt64(Vector128<ulong> a, Vector128<ulong> b, Vector128<ulong> c);
            public static Vector128<ulong> FusedMultiplyUInt52HighAddUInt64(Vector128<ulong> a, Vector128<ulong> b, Vector128<ulong> c);
        }
    }
}
```

### API Usage

```csharp
zmm0 = Avx512Ifma.FusedMultiplyUInt52LowAddUInt64(zmm0, zmm2, zmm3);
zmm1 = Avx512Ifma.FusedMultiplyUInt52HighAddUInt64(zmm1, zmm2, zmm3);
```

An example of a vectorized Montgomery reduction implementation using the equivalent C++ intrinsics:

https://github.com/intel/hexl/blob/2d196fdd71f24511bd7e0e23dc07d37c888f53e7/hexl/util/avx512-util.hpp#L384-L411

### Alternative Designs

- Alternative Names
  - `VPMADD52LUQ` (`FusedMultiplyUInt52LowAddUInt64`)
    - `FusedMultiplyAddLowUInt52`
  - `VPMADD52HUQ` (`FusedMultiplyUInt52HighAddUInt64`)
    - `FusedMultiplyAddHighUInt52`

### Risks

None
Author: MineCake147E
Assignees: -
Labels: `api-suggestion`, `area-System.Runtime.Intrinsics`
Milestone: -

BruceForstall commented 8 months ago

@dotnet/avx512-contrib

tannergooding commented 8 months ago

The names here could be "better". I'd expect simply `MultiplyAdd52Low` and `MultiplyAdd52High` (or similar) would be sufficient, and they more closely match the "name" portion of the underlying C/C++ intrinsics `_mm512_madd52lo_epu64` and `_mm512_madd52hi_epu64` (the name portions being `madd52lo` and `madd52hi`).

These are not "fused" operations: performing the multiply and the add as separate scalar steps produces the same result as the combined instruction (unlike floating-point FMA, where fusing avoids an intermediate rounding).
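This can be illustrated with a scalar model (a hedged sketch; helper names are illustrative): the combined form and the two-step form are bit-identical for every input, because nothing in integer arithmetic rounds an intermediate result beyond the architecturally defined 52-bit split.

```python
import random

MASK52 = (1 << 52) - 1
MASK64 = (1 << 64) - 1

def madd52lo_combined(c, a, b):
    """One expression, mirroring the single instruction."""
    return (c + (((a & MASK52) * (b & MASK52)) & MASK52)) & MASK64

def madd52lo_separate(c, a, b):
    """The same computation as two discrete steps."""
    prod_lo = ((a & MASK52) * (b & MASK52)) & MASK52  # multiply, keep low 52
    return (c + prod_lo) & MASK64                     # then a 64-bit add

random.seed(42)
for _ in range(10_000):
    c, a, b = (random.getrandbits(64) for _ in range(3))
    assert madd52lo_combined(c, a, b) == madd52lo_separate(c, a, b)
```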

MineCake147E commented 7 months ago

I updated the proposal accordingly.

terrajobst commented 6 months ago

Video

```csharp
namespace System.Runtime.Intrinsics.X86
{
    public abstract class Avx512Ifma : Avx512F
    {
        public static bool IsSupported { get; }

        public static Vector512<ulong> MultiplyAdd52Low(Vector512<ulong> addend, Vector512<ulong> left, Vector512<ulong> right);
        public static Vector512<ulong> MultiplyAdd52High(Vector512<ulong> addend, Vector512<ulong> left, Vector512<ulong> right);

        public abstract class VL : Avx512F.VL
        {
            public static new bool IsSupported { get; }
            public static Vector256<ulong> MultiplyAdd52Low(Vector256<ulong> addend, Vector256<ulong> left, Vector256<ulong> right);
            public static Vector256<ulong> MultiplyAdd52High(Vector256<ulong> addend, Vector256<ulong> left, Vector256<ulong> right);
            public static Vector128<ulong> MultiplyAdd52Low(Vector128<ulong> addend, Vector128<ulong> left, Vector128<ulong> right);
            public static Vector128<ulong> MultiplyAdd52High(Vector128<ulong> addend, Vector128<ulong> left, Vector128<ulong> right);
        }
    }
}
```