dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

[API Proposal]: AVX-IFMA Intrinsics #98833

Open DeepakRajendrakumaran opened 4 months ago

DeepakRajendrakumaran commented 4 months ago

Background and motivation

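The upcoming Intel® Sierra Forest, Grand Ridge, Arrow Lake, and Lunar Lake processors will introduce the AVX-IFMA instruction set extension, which provides VEX-encoded versions of the following instructions:

- VPMADD52HUQ: Packed Multiply of Unsigned 52-Bit Integers and Add the High 52-Bit Products to Qword Accumulators
- VPMADD52LUQ: Packed Multiply of Unsigned 52-Bit Integers and Add the Low 52-Bit Products to Qword Accumulators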

Reference: https://www.intel.com/content/www/us/en/content-details/812218/intel-architecture-instruction-set-extensions-programming-reference.html

This proposal aims to expose AVX-IFMA instructions via intrinsics.

Note: A public proposal already exists for AVX-512 IFMA (https://github.com/dotnet/runtime/issues/96476).
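
To make the per-lane arithmetic concrete, here is a minimal scalar sketch of what the two instructions compute for a single 64-bit lane, based on Intel's published description of VPMADD52LUQ and VPMADD52HUQ. The class name `Ifma52ScalarModel` is hypothetical and not part of this proposal; the operand order (addend, left, right) is assumed to mirror the signatures proposed below.

```csharp
using System;

// Illustrative scalar model of one 64-bit lane. The class name is hypothetical
// and not part of the proposed API.
public static class Ifma52ScalarModel
{
    // Only the low 52 bits of each multiplicand take part in the multiply.
    private const ulong Mask52 = (1UL << 52) - 1;

    // Models VPMADD52LUQ: addend + bits 51:0 of the 104-bit product.
    public static ulong MultiplyAdd52Low(ulong addend, ulong left, ulong right)
    {
        Math.BigMul(left & Mask52, right & Mask52, out ulong low);
        return unchecked(addend + (low & Mask52));
    }

    // Models VPMADD52HUQ: addend + bits 103:52 of the 104-bit product.
    public static ulong MultiplyAdd52High(ulong addend, ulong left, ulong right)
    {
        ulong high = Math.BigMul(left & Mask52, right & Mask52, out ulong low);
        // Bits 103:64 live in 'high', bits 63:52 in the top 12 bits of 'low'.
        return unchecked(addend + ((high << 12) | (low >> 52)));
    }
}
```

Under this model, `MultiplyAdd52High(0, 1UL << 51, 2)` yields 1, because the product 2^52 has only bit 52 set, while `MultiplyAdd52Low` with the same inputs returns the addend unchanged.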

API Proposal

namespace System.Runtime.Intrinsics.X86
{
    public abstract class AvxIfma : Avx2
    {
        internal AvxIfma() { }

        public static new bool IsSupported { get; }

        public new abstract class X64 : Avx2.X64
        {
            internal X64() { }

            public static new bool IsSupported { get; }

        }

        /// <summary>
        /// __m128i _mm_madd52lo_avx_epu64 (__m128i __X, __m128i __Y, __m128i __Z)
        /// vpmadd52luq xmm, xmm, xmm
        /// </summary>
        public static Vector128<ulong> MultiplyAdd52Low(Vector128<ulong> addend, Vector128<ulong> left, Vector128<ulong> right);

        /// <summary>
        /// __m128i _mm_madd52hi_avx_epu64 (__m128i __X, __m128i __Y, __m128i __Z)
        /// vpmadd52huq xmm, xmm, xmm
        /// </summary>
        public static Vector128<ulong> MultiplyAdd52High(Vector128<ulong> addend, Vector128<ulong> left, Vector128<ulong> right);

        /// <summary>
        /// __m256i _mm256_madd52lo_avx_epu64 (__m256i __X, __m256i __Y, __m256i __Z)
        /// vpmadd52luq ymm, ymm, ymm
        /// </summary>
        public static Vector256<ulong> MultiplyAdd52Low(Vector256<ulong> addend, Vector256<ulong> left, Vector256<ulong> right);

        /// <summary>
        /// __m256i _mm256_madd52hi_avx_epu64 (__m256i __X, __m256i __Y, __m256i __Z)
        /// vpmadd52huq ymm, ymm, ymm
        /// </summary>
        public static Vector256<ulong> MultiplyAdd52High(Vector256<ulong> addend, Vector256<ulong> left, Vector256<ulong> right);

    }
}

API Usage

Vector128<ulong> foo(Vector128<ulong> arg0, Vector128<ulong> arg1, Vector128<ulong> arg2)
{
    return AvxIfma.MultiplyAdd52Low(arg0, arg1, arg2);
}

Alternative Designs

No response

Risks

No response

anthonycanino commented 4 months ago

@dotnet/avx512-contrib

ghost commented 4 months ago

Tagging subscribers to this area: @dotnet/area-system-runtime-intrinsics See info in area-owners.md if you want to be subscribed.

Issue Details
Author: DeepakRajendrakumaran
Assignees: -
Labels: `api-suggestion`, `area-System.Runtime.Intrinsics`, `untriaged`, `arch-avx512`
Milestone: -

huoyaoyuan commented 4 months ago

What's the relationship with the AVX512.VL version? The instruction names mentioned are the same.

I assume they are indicated by different feature flags, as AVX512.VL requires 512-bit and additional register support.

tannergooding commented 4 months ago

Yes, different feature flags and encodings. Avx512Ifma.VL is 128-bit and 256-bit EVEX encoded.

AvxIfma is the 128-bit and 256-bit VEX-encoded form. A given CPU might support one, both, or neither.
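
Since the two feature flags are independent, calling code would presumably need to handle all four combinations. A minimal sketch of that selection follows, using plain booleans as stand-ins for the two `IsSupported` checks; the `IfmaPath` and `IfmaPathSelector` names are hypothetical, and preferring the EVEX path when both are available is an arbitrary choice made only for illustration.

```csharp
using System;

// Hypothetical names for illustration; not part of the proposed API.
public enum IfmaPath { Scalar, Vex, Evex }

public static class IfmaPathSelector
{
    // The booleans stand in for the IsSupported properties of the
    // VEX-encoded (AvxIfma) and EVEX-encoded (Avx512Ifma.VL) classes.
    public static IfmaPath Select(bool avxIfmaSupported, bool avx512IfmaVlSupported)
    {
        if (avx512IfmaVlSupported)
        {
            return IfmaPath.Evex;   // assumed preference when both are present
        }
        if (avxIfmaSupported)
        {
            return IfmaPath.Vex;
        }
        return IfmaPath.Scalar;     // neither flag set: software fallback
    }
}
```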

terrajobst commented 4 months ago

Video

namespace System.Runtime.Intrinsics.X86
{
    public abstract class AvxIfma : Avx2
    {
        public static new bool IsSupported { get; }

        public static Vector128<ulong> MultiplyAdd52Low(Vector128<ulong> addend, Vector128<ulong> left, Vector128<ulong> right);
        public static Vector128<ulong> MultiplyAdd52High(Vector128<ulong> addend, Vector128<ulong> left, Vector128<ulong> right);
        public static Vector256<ulong> MultiplyAdd52Low(Vector256<ulong> addend, Vector256<ulong> left, Vector256<ulong> right);
        public static Vector256<ulong> MultiplyAdd52High(Vector256<ulong> addend, Vector256<ulong> left, Vector256<ulong> right);

        public new abstract class X64 : Avx2.X64
        {
            public static new bool IsSupported { get; }
        }
    }
}