dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

[API Proposal]: Add support for AVX-512 VNNI hardware instructions #86849

Open MadProbe opened 1 year ago

MadProbe commented 1 year ago

Background and motivation

There is already support for the AVX-VNNI hardware instruction set covering 128-/256-bit vectors, and it would be good to have the same support for 512-bit vectors (512-bit versions of these instructions are available, see https://en.wikipedia.org/wiki/AVX-512?useskin=vector#VNNI). This feature is also proposed as a preview API, to be consistent with the existing AvxVnni API.

API Proposal

namespace System.Runtime.Intrinsics.X86
{
    [Intrinsic]
    public abstract class Avx512Vnni : Avx512F
    {
        public static new bool IsSupported { get => IsSupported; }

        [Intrinsic]
        public new abstract class X64 : Avx512F.X64
        {
            public static new bool IsSupported { get => IsSupported; }
        }

        [Intrinsic]
        public new abstract class VL : Avx512F.VL
        {
            public static new bool IsSupported { get => IsSupported; }

            /// <summary>
            /// __m128i _mm_dpbusd_epi32 (__m128i src, __m128i a, __m128i b)
            /// VPDPBUSD xmm, xmm, xmm/m128
            /// </summary>
            public static Vector128<int> MultiplyWideningAndAdd(Vector128<int> addend, Vector128<byte> left, Vector128<sbyte> right) => MultiplyWideningAndAdd(addend, left, right);

            /// <summary>
            /// __m128i _mm_dpwssd_epi32 (__m128i src, __m128i a, __m128i b)
            /// VPDPWSSD xmm, xmm, xmm/m128
            /// </summary>
            public static Vector128<int> MultiplyWideningAndAdd(Vector128<int> addend, Vector128<short> left, Vector128<short> right) => MultiplyWideningAndAdd(addend, left, right);

            /// <summary>
            /// __m256i _mm256_dpbusd_epi32 (__m256i src, __m256i a, __m256i b)
            /// VPDPBUSD ymm, ymm, ymm/m256
            /// </summary>
            public static Vector256<int> MultiplyWideningAndAdd(Vector256<int> addend, Vector256<byte> left, Vector256<sbyte> right) => MultiplyWideningAndAdd(addend, left, right);

            /// <summary>
            /// __m256i _mm256_dpwssd_epi32 (__m256i src, __m256i a, __m256i b)
            /// VPDPWSSD ymm, ymm, ymm/m256
            /// </summary>
            public static Vector256<int> MultiplyWideningAndAdd(Vector256<int> addend, Vector256<short> left, Vector256<short> right) => MultiplyWideningAndAdd(addend, left, right);

            /// <summary>
            /// __m128i _mm_dpbusds_epi32 (__m128i src, __m128i a, __m128i b)
            /// VPDPBUSDS xmm, xmm, xmm/m128
            /// </summary>
            public static Vector128<int> MultiplyWideningAndAddSaturate(Vector128<int> addend, Vector128<byte> left, Vector128<sbyte> right) => MultiplyWideningAndAddSaturate(addend, left, right);

            /// <summary>
            /// __m128i _mm_dpwssds_epi32 (__m128i src, __m128i a, __m128i b)
            /// VPDPWSSDS xmm, xmm, xmm/m128
            /// </summary>
            public static Vector128<int> MultiplyWideningAndAddSaturate(Vector128<int> addend, Vector128<short> left, Vector128<short> right) => MultiplyWideningAndAddSaturate(addend, left, right);

            /// <summary>
            /// __m256i _mm256_dpbusds_epi32 (__m256i src, __m256i a, __m256i b)
            /// VPDPBUSDS ymm, ymm, ymm/m256
            /// </summary>
            public static Vector256<int> MultiplyWideningAndAddSaturate(Vector256<int> addend, Vector256<byte> left, Vector256<sbyte> right) => MultiplyWideningAndAddSaturate(addend, left, right);

            /// <summary>
            /// __m256i _mm256_dpwssds_epi32 (__m256i src, __m256i a, __m256i b)
            /// VPDPWSSDS ymm, ymm, ymm/m256
            /// </summary>
            public static Vector256<int> MultiplyWideningAndAddSaturate(Vector256<int> addend, Vector256<short> left, Vector256<short> right) => MultiplyWideningAndAddSaturate(addend, left, right);

        }

        /// <summary>
        /// __m512i _mm512_dpbusd_epi32 (__m512i src, __m512i a, __m512i b)
        /// VPDPBUSD zmm, zmm, zmm/m512
        /// </summary>
        public static Vector512<int> MultiplyWideningAndAdd(Vector512<int> addend, Vector512<byte> left, Vector512<sbyte> right) => MultiplyWideningAndAdd(addend, left, right);

        /// <summary>
        /// __m512i _mm512_dpwssd_epi32 (__m512i src, __m512i a, __m512i b)
        /// VPDPWSSD zmm, zmm, zmm/m512
        /// </summary>
        public static Vector512<int> MultiplyWideningAndAdd(Vector512<int> addend, Vector512<short> left, Vector512<short> right) => MultiplyWideningAndAdd(addend, left, right);

        /// <summary>
        /// __m512i _mm512_dpbusds_epi32 (__m512i src, __m512i a, __m512i b)
        /// VPDPBUSDS zmm, zmm, zmm/m512
        /// </summary>
        public static Vector512<int> MultiplyWideningAndAddSaturate(Vector512<int> addend, Vector512<byte> left, Vector512<sbyte> right) => MultiplyWideningAndAddSaturate(addend, left, right);

        /// <summary>
        /// __m512i _mm512_dpwssds_epi32 (__m512i src, __m512i a, __m512i b)
        /// VPDPWSSDS zmm, zmm, zmm/m512
        /// </summary>
        public static Vector512<int> MultiplyWideningAndAddSaturate(Vector512<int> addend, Vector512<short> left, Vector512<short> right) => MultiplyWideningAndAddSaturate(addend, left, right);
    }
}
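
For clarity, the per-lane behavior of the non-saturating unsigned-byte-by-signed-byte form (VPDPBUSD, i.e. the first MultiplyWideningAndAdd overload above) can be modeled in scalar C# roughly as follows. This is only an illustrative sketch of the instruction's documented semantics, not part of the proposed surface; the helper name and structure are made up for this example.

using System.Runtime.Intrinsics;

// Scalar model of VPDPBUSD: each 32-bit lane adds four widened byte products to the addend lane.
static Vector128<int> MultiplyWideningAndAddModel(Vector128<int> addend, Vector128<byte> left, Vector128<sbyte> right)
{
    Vector128<int> result = addend;
    for (int i = 0; i < Vector128<int>.Count; i++)
    {
        int sum = addend.GetElement(i);
        for (int j = 0; j < 4; j++)
        {
            // left is zero-extended (unsigned), right is sign-extended (signed); products are summed in 32 bits.
            sum += left.GetElement(i * 4 + j) * right.GetElement(i * 4 + j);
        }
        // The *Saturate variants clamp the final per-lane sum to the Int32 range instead of wrapping.
        result = result.WithElement(i, sum);
    }
    return result;
}

The word variant (VPDPWSSD) is analogous, but accumulates two widened 16-bit products per 32-bit lane.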

API Usage

The motivation for this proposal is largely the same as that of AvxVnni. These instructions are used the same way the AvxVnni ones are and can be applied in any algorithm where they fit, offering good performance improvements over the multi-instruction sequences that produce the same output.

// Example taken from my Adler32 rolling hash implementation,
// where 256-bit vectors are currently used but can easily be widened to 512-bit vectors as shown below.
// Non-relevant code has been cut for brevity.
while (IsAddressLessThan(ref dataRef, ref end)) {
    Vector512<byte> bytes = Vector512.LoadUnsafe(ref dataRef);
    vadlerBA += vadlerA;

    if (Avx512Vnni.IsSupported) {
        vadlerBmult = Avx512Vnni.MultiplyWideningAndAdd(vadlerBmult.AsInt32(), bytes, mults_vector).AsUInt32();
    } else {
        vadlerBmult += Avx512BW.MultiplyAddAdjacent(Avx512BW.MultiplyAddAdjacent(bytes, mults_vector), Vector512<short>.One).AsUInt32();
    }

    vadlerA += Avx512BW.SumAbsoluteDifferences(bytes, zero).AsUInt64();
    dataRef = ref Add(ref dataRef, Vector512<byte>.Count);
}

Alternative Designs

Maybe it would be better to add Vector512 versions of these functions to the existing AvxVnni static class, but I am not sure that would be a good idea, as these instructions use the EVEX encoding and may not be available on some Intel processors with a hybrid core architecture (Alder Lake and its successors).

Risks

N/A

ghost commented 1 year ago

Tagging subscribers to this area: @dotnet/area-system-runtime-intrinsics See info in area-owners.md if you want to be subscribed.

teo-tsirpanis commented 1 year ago

Maybe it would be better to add Vector512 versions of functions into existing AvxVnni static class

I don't think this will pass; AVX-VNNI and AVX512-VNNI are distinct instruction sets; in fact, the 512-bit instructions came before the 128- and 256-bit ones.

tannergooding commented 1 year ago

Yes, this would need to be its own class Avx512Vnni.

Provided our CI hardware supports it (and I believe it does), it would not need to be in preview. I don't think AvxVnni needs to be in preview anymore either, correspondingly.

tannergooding commented 1 year ago

@MadProbe, couple fixes are needed....

  1. Avx512Vnni.X64 should inherit from Avx512F.X64
  2. We need to define Avx512Vnni.VL and correspondingly the 128-bit and 256-bit versions of the functions

The latter will be identical to AvxVnni, but there will be a difference in what hardware reports support, when it's available, etc.

ghost commented 1 year ago

This issue has been marked needs-author-action and may be missing some important information.

MadProbe commented 1 year ago

@MadProbe, couple fixes are needed....

1. `Avx512Vnni.X64` should inherit from `Avx512F.X64`

2. We need to define `Avx512Vnni.VL` and correspondingly the 128-bit and 256-bit versions of the functions

The latter will be identical to AvxVnni, but there will be a difference in what hardware reports support, when it's available, etc.

All done, sorry for late edit

bartonjs commented 1 year ago

Video

Looks good as proposed

namespace System.Runtime.Intrinsics.X86
{
    [Intrinsic]
    public abstract class Avx512Vnni : Avx512F
    {
        public static new bool IsSupported { get => IsSupported; }

        [Intrinsic]
        public new abstract class X64 : Avx512F.X64
        {
            public static new bool IsSupported { get => IsSupported; }
        }

        [Intrinsic]
        public new abstract class VL : Avx512F.VL
        {
            public static new bool IsSupported { get => IsSupported; }

            /// <summary>
            /// __m128i _mm_dpbusd_epi32 (__m128i src, __m128i a, __m128i b)
            /// VPDPBUSD xmm, xmm, xmm/m128
            /// </summary>
            public static Vector128<int> MultiplyWideningAndAdd(Vector128<int> addend, Vector128<byte> left, Vector128<sbyte> right) => MultiplyWideningAndAdd(addend, left, right);

            /// <summary>
            /// __m128i _mm_dpwssd_epi32 (__m128i src, __m128i a, __m128i b)
            /// VPDPWSSD xmm, xmm, xmm/m128
            /// </summary>
            public static Vector128<int> MultiplyWideningAndAdd(Vector128<int> addend, Vector128<short> left, Vector128<short> right) => MultiplyWideningAndAdd(addend, left, right);

            /// <summary>
            /// __m256i _mm256_dpbusd_epi32 (__m256i src, __m256i a, __m256i b)
            /// VPDPBUSD ymm, ymm, ymm/m256
            /// </summary>
            public static Vector256<int> MultiplyWideningAndAdd(Vector256<int> addend, Vector256<byte> left, Vector256<sbyte> right) => MultiplyWideningAndAdd(addend, left, right);

            /// <summary>
            /// __m256i _mm256_dpwssd_epi32 (__m256i src, __m256i a, __m256i b)
            /// VPDPWSSD ymm, ymm, ymm/m256
            /// </summary>
            public static Vector256<int> MultiplyWideningAndAdd(Vector256<int> addend, Vector256<short> left, Vector256<short> right) => MultiplyWideningAndAdd(addend, left, right);

            /// <summary>
            /// __m128i _mm_dpbusds_epi32 (__m128i src, __m128i a, __m128i b)
            /// VPDPBUSDS xmm, xmm, xmm/m128
            /// </summary>
            public static Vector128<int> MultiplyWideningAndAddSaturate(Vector128<int> addend, Vector128<byte> left, Vector128<sbyte> right) => MultiplyWideningAndAddSaturate(addend, left, right);

            /// <summary>
            /// __m128i _mm_dpwssds_epi32 (__m128i src, __m128i a, __m128i b)
            /// VPDPWSSDS xmm, xmm, xmm/m128
            /// </summary>
            public static Vector128<int> MultiplyWideningAndAddSaturate(Vector128<int> addend, Vector128<short> left, Vector128<short> right) => MultiplyWideningAndAddSaturate(addend, left, right);

            /// <summary>
            /// __m256i _mm256_dpbusds_epi32 (__m256i src, __m256i a, __m256i b)
            /// VPDPBUSDS ymm, ymm, ymm/m256
            /// </summary>
            public static Vector256<int> MultiplyWideningAndAddSaturate(Vector256<int> addend, Vector256<byte> left, Vector256<sbyte> right) => MultiplyWideningAndAddSaturate(addend, left, right);

            /// <summary>
            /// __m256i _mm256_dpwssds_epi32 (__m256i src, __m256i a, __m256i b)
            /// VPDPWSSDS ymm, ymm, ymm/m256
            /// </summary>
            public static Vector256<int> MultiplyWideningAndAddSaturate(Vector256<int> addend, Vector256<short> left, Vector256<short> right) => MultiplyWideningAndAddSaturate(addend, left, right);

        }

        /// <summary>
        /// __m512i _mm512_dpbusd_epi32 (__m512i src, __m512i a, __m512i b)
        /// VPDPBUSD zmm, zmm, zmm/m512
        /// </summary>
        public static Vector512<int> MultiplyWideningAndAdd(Vector512<int> addend, Vector512<byte> left, Vector512<sbyte> right) => MultiplyWideningAndAdd(addend, left, right);

        /// <summary>
        /// __m512i _mm512_dpwssd_epi32 (__m512i src, __m512i a, __m512i b)
        /// VPDPWSSD zmm, zmm, zmm/m512
        /// </summary>
        public static Vector512<int> MultiplyWideningAndAdd(Vector512<int> addend, Vector512<short> left, Vector512<short> right) => MultiplyWideningAndAdd(addend, left, right);

        /// <summary>
        /// __m512i _mm512_dpbusds_epi32 (__m512i src, __m512i a, __m512i b)
        /// VPDPBUSDS zmm, zmm, zmm/m512
        /// </summary>
        public static Vector512<int> MultiplyWideningAndAddSaturate(Vector512<int> addend, Vector512<byte> left, Vector512<sbyte> right) => MultiplyWideningAndAddSaturate(addend, left, right);

        /// <summary>
        /// __m512i _mm512_dpwssds_epi32 (__m512i src, __m512i a, __m512i b)
        /// VPDPWSSDS zmm, zmm, zmm/m512
        /// </summary>
        public static Vector512<int> MultiplyWideningAndAddSaturate(Vector512<int> addend, Vector512<short> left, Vector512<short> right) => MultiplyWideningAndAddSaturate(addend, left, right);
    }
}
MadProbe commented 11 months ago

I am interested in why this proposal is tagged with needs-further-triage. If there are concerns, please state them so I know what's wrong and we can come to a conclusion. And if a proposal is marked with needs-further-triage, shouldn't it also be demoted to api-suggestion?

tannergooding commented 11 months ago

It was marked that way because a non-area-owner set the milestone, and the milestone being "correct" needed confirmation from the area owners.

saucecontrol commented 1 week ago

To follow the newer pattern we've established with AVX10, this should probably be changed to

namespace System.Runtime.Intrinsics.X86;

// existing class
public abstract class AvxVnni : Avx2
{
    // new nested class
    [Intrinsic]
    public new abstract class V512
    {
        public static new bool IsSupported { get => IsSupported; }

        /// <summary>
        /// __m512i _mm512_dpbusd_epi32 (__m512i src, __m512i a, __m512i b)
        /// VPDPBUSD zmm, zmm, zmm/m512
        /// </summary>
        public static Vector512<int> MultiplyWideningAndAdd(Vector512<int> addend, Vector512<byte> left, Vector512<sbyte> right) => MultiplyWideningAndAdd(addend, left, right);

        /// <summary>
        /// __m512i _mm512_dpwssd_epi32 (__m512i src, __m512i a, __m512i b)
        /// VPDPWSSD zmm, zmm, zmm/m512
        /// </summary>
        public static Vector512<int> MultiplyWideningAndAdd(Vector512<int> addend, Vector512<short> left, Vector512<short> right) => MultiplyWideningAndAdd(addend, left, right);

        /// <summary>
        /// __m512i _mm512_dpbusds_epi32 (__m512i src, __m512i a, __m512i b)
        /// VPDPBUSDS zmm, zmm, zmm/m512
        /// </summary>
        public static Vector512<int> MultiplyWideningAndAddSaturate(Vector512<int> addend, Vector512<byte> left, Vector512<sbyte> right) => MultiplyWideningAndAddSaturate(addend, left, right);

        /// <summary>
        /// __m512i _mm512_dpwssds_epi32 (__m512i src, __m512i a, __m512i b)
        /// VPDPWSSDS zmm, zmm, zmm/m512
        /// </summary>
        public static Vector512<int> MultiplyWideningAndAddSaturate(Vector512<int> addend, Vector512<short> left, Vector512<short> right) => MultiplyWideningAndAddSaturate(addend, left, right);
    }
}

which eliminates the duplication between AvxVnni and Avx512Vnni.VL and allows us to report support for AvxVnni when any of AVX-VNNI, AVX512-VNNI+AVX512-VL, or AVX10V1 are present.
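
As a rough consumer-side sketch (hypothetical code; DotAccumulate512 and SoftwareFallback are made-up names, and AvxVnni.V512 is the nested class sketched above), a single check would then cover AVX512-VNNI hardware as well as future AVX10 implementations:

using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static Vector512<int> DotAccumulate512(Vector512<int> acc, Vector512<byte> left, Vector512<sbyte> right)
{
    // One IsSupported check, regardless of which CPUID bit actually provides the instructions.
    if (AvxVnni.V512.IsSupported)
    {
        return AvxVnni.V512.MultiplyWideningAndAdd(acc, left, right);
    }

    return SoftwareFallback(acc, left, right); // hypothetical scalar fallback, not shown
}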

MineCake147E commented 1 week ago

To follow the newer pattern we've established with AVX10, this should probably be changed to

which eliminates the duplication between AvxVnni and Avx512Vnni.VL and allows us to report support for AvxVnni when any of AVX-VNNI, AVX512-VNNI+AVX512-VL, or AVX10V1 are present.

I don't think so. AVX512-VNNI predates AVX-VNNI. There are many CPUs that don't support AVX-VNNI but do support AVX512-VNNI. I wouldn't expect AvxVnni.IsSupported to be true on Ice Lake CPUs. It also feels odd to me (even if not to everyone) to see AvxVnni.V512.IsSupported returning true on Ice Lake CPUs, because Ice Lake CPUs don't support AVX-VNNI per se. The same applies to AVX-IFMA: Sapphire Rapids CPUs don't support AVX-IFMA, but they do support AVX512-IFMA.

Even though there's no difference aside from the instruction encoding, the name confuses me a lot. I think Avx512Vnni should at least be separate from AvxVnni, if renaming AvxVnni is inappropriate, which is most likely the case. It should be fine to let Avx512Vnni.VL extend AvxVnni, except with IsSupported returning false on CPUs without AVX512-VNNI.

tannergooding commented 1 week ago

AVX512-VNNI predates AVX-VNNI. There are many CPUs that don't support AVX-VNNI but do support AVX512-VNNI.

This isn't relevant to, nor impacted by, the proposed API surface change. AVX512 is a legacy and effectively deprecated API set that we would likely not have exposed as-is if we had known in advance about the transition to AVX10.1 and the new scheme that would exist moving forward.

Doing this removes duplication without negatively impacting any consumer of the API, minus the minor nuance that the class name differs slightly from the CPUID bit name, and it keeps things consistent with the intended "converged ISA" schema that Intel has defined for the future under AVX10.1.

saucecontrol commented 1 week ago

Even though there's no difference aside instruction encoding, the name confuses me a lot.

I think if this were going through review today, it would probably be named just Vnni so that it wouldn't imply a direct connection to the AVX_VNNI cpuid bit, but that ship has sailed.

It should be fine with letting Avx512Vnni.VL extend AvxVnni except IsSupported returning false in CPUs without AVX512-VNNI.

This is also confusing, because you either:

1. Inherit the methods from the base class, which VS will suggest you change to call the base class directly anyway, and which also breaks the assumption that has always held in the S.R.I hierarchy that each ISA implies support for those below it.
2. Duplicate the methods in the VL class, hiding the ones from the base class. This ends up forcing users to duplicate their code, because they have to make two IsSupported checks to protect two otherwise identical intrinsic method calls, as sketched below.
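
To illustrate option 2, a consumer would end up writing something like the following (a hypothetical sketch; Avx512Vnni.VL is the class proposed in this issue, and DotAccumulate256 and SoftwareFallback are made-up names):

using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static Vector256<int> DotAccumulate256(Vector256<int> acc, Vector256<byte> left, Vector256<sbyte> right)
{
    // Two IsSupported checks guarding otherwise identical calls.
    if (Avx512Vnni.VL.IsSupported)
    {
        return Avx512Vnni.VL.MultiplyWideningAndAdd(acc, left, right);
    }
    else if (AvxVnni.IsSupported)
    {
        // Same operation and operands; only the guarding ISA class differs.
        return AvxVnni.MultiplyWideningAndAdd(acc, left, right);
    }

    return SoftwareFallback(acc, left, right); // hypothetical scalar fallback, not shown
}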

MineCake147E commented 1 week ago

AVX512 is a legacy and effectively deprecated api set that we would likely have not exposed as is f we knew in advance about the transition to AVX10.1 and the new scheme that would exist moving forward.

The problem is: AMD. AMD's Zen5 microarchitecture doesn't support AVX10.1. Zen5 also lacks support for FP16, which Intel added in Sapphire Rapids and later included in AVX10.1, while supporting VP2INTERSECT, which Intel added in Tiger Lake but later ditched in Rocket Lake, and which isn't even reintroduced in Granite Rapids. If we didn't expose Avx512*, we would run into a lot of issues supporting not only Rocket Lake and earlier Intel CPUs, but also Zen4, Zen5, and even later AMD CPUs, which lack (or at least are alleged to lack) support for some subsets included in AVX10.1, like FP16. I don't think AMD will bring support for AVX10, especially given the fact that Zen5 supports VP2INTERSECT.

tannergooding commented 1 week ago

NOTE: Some of the below discussion is largely based on "reasonable speculation" supported by typically seen timeframes. It shouldn't be taken as fact, a promise of the future, etc

I think if this were going through review today, it would probably be named just Vnni so that it wouldn't imply a direct connection to the AVX_VNNI cpuid bit, but that ship has sailed.

I'd say doubtful to this. The Avx prefix remains important to disambiguate from other potential architectures (an issue we have with Arm.Aes vs X86.Aes for example) and is part of the generalized schema for non-core extensions under AVX10

The problem is: AMD.

AMD doesn't cause any problem here. My statement was that AVX512 is legacy/deprecated and would not be exposed "as is", not that it wouldn't be exposed. Had we known in advance, we likely would've exposed this functionality following a split that more closely models how things functionally exist today while still fitting the generalized schema for the future. -- Legacy/deprecated here is also a little bit of a misnomer and there is deeper context. The encoding and general feature support introduced by AVX512 remains, as does the potential for 512-bit support to exist in hardware. Realistically all that has changed is that rather than us having 512 required with 128/256-bit optional we have 128/256-bit required with 512-bit optional. How the ISAs are planned to be exposed in the future, the requirement that things stay converged, and other considerations also come in that make the AVX512 "legacy/deprecated" in terms of what's being discussed here (as now it is rather more AVX10+V512 support)

That is, rather than defining Avx512F + Avx512BW + Avx512CD + Avx512DQ + Avx512Vbmi + ... and then defining nested VL classes for each we would have baked in some of the assumptions that the internal runtime implementation requires and that future implementations have been guaranteed to implement under the converged ISA. There would have been some nuance that theoretically some hardware (such as Intel Knights Landing) might exist or be exposed in the future which doesn't fit this model, but the overall tradeoff would've been worth it.

Thus, we likely would've had a schema where:

This would have been a divergence from what was formally spec'd in CPUID, but it would in turn hide some of the messy nuance that exists and make it much easier for developers to write code that works in production on both vendors' hardware. With the setup we have today, we instead have this awkward duplication between many of the Avx512Isa.VL classes and the Avx10v1 class; this requires developers to effectively duplicate their code if they want to support both, leading to a pit of failure and a general UX concern.

AMD's Zen5 microarchitecture doesn't support AVX10.1. Zen5 also lacks support for FP16 which Intel added in Sapphire Rapids and later included in AVX10.1, while supporting VP2INTERSECT which Intel added in Tiger Lake but later ditched in Rocket Lake, and isn't even reintroduced in Granite Rapids.

This is effectively related to timing, as designing a CPU and integrating all the functionality takes years (for many reasons). You can see this in the gap between when a specification is announced and when it actually shows up in hardware: there is, on average, a 2-3 year delay between when an ISA specification is revealed and when we first see hardware start to implement it. Sometimes it's a bit less and other times longer, depending on exactly what is required, how similar it is to past support, whether additional or differing silicon is required, etc.

In the case of Zen5, it was first announced in some official capacity back in 2018, had confirmation on the fabrication process back in 2022, and it shipped in 2024. FP16 had been announced in mid 2021 (shipping with Sapphire Lake in 2023) and it was likely too late to include it. Correspondingly, VP2INTERSECT had been announced several years prior and shipped in Tiger Lake (2020), so if Zen5 followed the general speculation given here, this would line up cleanly with why certain ISAs appeared but not others. -- Notably early Alder Lake (2021) chips also included VP2INTERSECT support (prior to AVX512 as a whole being fused off). Rocket Lake (2021) didn't have it, but that particular microarchitecture was in many senses parallel to the normal flow having been a successor to Comet lake and remaining on the 14nm process, rather than the newer 10nm (Tiger Lake) or 7nm (Alder Lake) process.

This will likely be rectified in Zen6 considering the same overall timing (which as per the top note is purely speculation).

I don't think AMD will bring support for AVX10, especially given the fact that Zen5 supports VP2INTERSECT.

Support for VP2INTERSECT has nothing to do with AVX10, it is allowed to be independent.

The only thing currently missing is FP16 at which point AVX10 support is a minor change to what CPUID reports. Unlike adding a new ISA, changing what CPUID reports doesn't require a significant change to silicon, it can typically be enabled via microcode and done very late in the design process (we've seen this repeatedly in the past, both in terms of there often being BIOS control switches to enable/disable key bits, but also in terms of microcode patches changing what CPUID reports to workaround bugs or other issues).

Particularly with the new x86 Ecosystem Advisory Group that was formed by Intel, AMD, and other key companies/persons, I would speculate we see an increase in collaboration, early sharing of ISA specifications, and convergence between what hardware supports (as detailed in the announcement post). This would include X86S, APX, AVX10.1, AVX10.2, and future ISAs.

-- As an additional note, one of the likely reasons that VP2INTERSECT was "pulled" for Sapphire Rapids is because the implementation introduced in Tiger Lake clocked at 25-40 cycles and taking around 46 micro-ops. Software-based alternatives that had better performance were found and the support no longer existed in Sapphire Rapids. This may in part be why it was excluded from AVX10. AMD Zen5 then ended up shipping an implementation that takes 1 cycle, making it perform better than the software-based alternatives and if some of the hardware analysis is to be believed without significant cost to the silicon, as much of it is shared with the VPCONFLICT support. It would not be unreasonable for us to see the ISA brought back as required in some future AVX10 version or for us to see it brought back simply as VP2INTERSECT in future Intel CPUs and remain an optional extension.

MineCake147E commented 1 week ago

My statement was that AVX512 is legacy/deprecated and would not be exposed "as is", not that it wouldn't be exposed. Had we known in advance, we likely would've exposed this functionality following a split that more closely models how things functionally exist today while still fitting the generalized schema for the future. -- Legacy/deprecated here is also a little bit of a misnomer and there is deeper context. Realistically all that has changed is that rather than us having 512 required with 128/256-bit optional we have 128/256-bit required with 512-bit optional. How the ISAs are planned to be exposed in the future, the requirement that things stay converged, and other considerations also come in that make the AVX512 "legacy/deprecated" in terms of what's being discussed here (as now it is rather more AVX10+V512 support)

I see, that makes sense. I apologize for my misunderstanding.

FP16 had been announced in mid 2021 (shipping with Sapphire Lake in 2023)

I think you meant to write Sapphire Rapids, right?

tannergooding commented 1 week ago

I think you meant to write Sapphire Rapids, right?

Yes, too many similar microarchitecture names and I messed this particular one up 😄