MadProbe opened 1 year ago
Tagging subscribers to this area: @dotnet/area-system-runtime-intrinsics. See info in area-owners.md if you want to be subscribed.
Author: | MadProbe |
---|---|
Assignees: | - |
Labels: | `api-suggestion`, `area-System.Runtime.Intrinsics`, `untriaged` |
Milestone: | - |
> Maybe it would be better to add Vector512 versions of functions into existing AvxVnni static class
I don't think this will pass; AVX-VNNI and AVX512-VNNI are distinct instruction sets; in fact the 512-bit instructions came before the 128- and 256-bit ones.
Yes, this would need to be its own class `Avx512Vnni`.
Provided our CI hardware supports it (and I believe it does), it would not need to be in preview. I don't think `AvxVnni` needs to be in preview anymore either, correspondingly.
@MadProbe, a couple of fixes are needed:

1. `Avx512Vnni.X64` should inherit from `Avx512F.X64`
2. We need to define `Avx512Vnni.VL` and correspondingly the 128-bit and 256-bit versions of the functions

The latter will be identical to `AvxVnni`, but there will be a difference in what hardware reports support and when it's available, etc.
This issue has been marked `needs-author-action` and may be missing some important information.
> @MadProbe, a couple of fixes are needed:
>
> 1. `Avx512Vnni.X64` should inherit from `Avx512F.X64`
> 2. We need to define `Avx512Vnni.VL` and correspondingly the 128-bit and 256-bit versions of the functions
>
> The latter will be identical to `AvxVnni`, but there will be a difference in what hardware reports support and when it's available, etc.
All done, sorry for the late edit
Looks good as proposed
```csharp
namespace System.Runtime.Intrinsics.X86
{
    [Intrinsic]
    public abstract class Avx512Vnni : Avx512F
    {
        public static new bool IsSupported { get => IsSupported; }

        [Intrinsic]
        public new abstract class X64 : Avx512F.X64
        {
            public static new bool IsSupported { get => IsSupported; }
        }

        [Intrinsic]
        public new abstract class VL : Avx512F.VL
        {
            public static new bool IsSupported { get => IsSupported; }

            /// <summary>
            /// __m128i _mm_dpbusd_epi32 (__m128i src, __m128i a, __m128i b)
            /// VPDPBUSD xmm, xmm, xmm/m128
            /// </summary>
            public static Vector128<int> MultiplyWideningAndAdd(Vector128<int> addend, Vector128<byte> left, Vector128<sbyte> right) => MultiplyWideningAndAdd(addend, left, right);

            /// <summary>
            /// __m128i _mm_dpwssd_epi32 (__m128i src, __m128i a, __m128i b)
            /// VPDPWSSD xmm, xmm, xmm/m128
            /// </summary>
            public static Vector128<int> MultiplyWideningAndAdd(Vector128<int> addend, Vector128<short> left, Vector128<short> right) => MultiplyWideningAndAdd(addend, left, right);

            /// <summary>
            /// __m256i _mm256_dpbusd_epi32 (__m256i src, __m256i a, __m256i b)
            /// VPDPBUSD ymm, ymm, ymm/m256
            /// </summary>
            public static Vector256<int> MultiplyWideningAndAdd(Vector256<int> addend, Vector256<byte> left, Vector256<sbyte> right) => MultiplyWideningAndAdd(addend, left, right);

            /// <summary>
            /// __m256i _mm256_dpwssd_epi32 (__m256i src, __m256i a, __m256i b)
            /// VPDPWSSD ymm, ymm, ymm/m256
            /// </summary>
            public static Vector256<int> MultiplyWideningAndAdd(Vector256<int> addend, Vector256<short> left, Vector256<short> right) => MultiplyWideningAndAdd(addend, left, right);

            /// <summary>
            /// __m128i _mm_dpbusds_epi32 (__m128i src, __m128i a, __m128i b)
            /// VPDPBUSDS xmm, xmm, xmm/m128
            /// </summary>
            public static Vector128<int> MultiplyWideningAndAddSaturate(Vector128<int> addend, Vector128<byte> left, Vector128<sbyte> right) => MultiplyWideningAndAddSaturate(addend, left, right);

            /// <summary>
            /// __m128i _mm_dpwssds_epi32 (__m128i src, __m128i a, __m128i b)
            /// VPDPWSSDS xmm, xmm, xmm/m128
            /// </summary>
            public static Vector128<int> MultiplyWideningAndAddSaturate(Vector128<int> addend, Vector128<short> left, Vector128<short> right) => MultiplyWideningAndAddSaturate(addend, left, right);

            /// <summary>
            /// __m256i _mm256_dpbusds_epi32 (__m256i src, __m256i a, __m256i b)
            /// VPDPBUSDS ymm, ymm, ymm/m256
            /// </summary>
            public static Vector256<int> MultiplyWideningAndAddSaturate(Vector256<int> addend, Vector256<byte> left, Vector256<sbyte> right) => MultiplyWideningAndAddSaturate(addend, left, right);

            /// <summary>
            /// __m256i _mm256_dpwssds_epi32 (__m256i src, __m256i a, __m256i b)
            /// VPDPWSSDS ymm, ymm, ymm/m256
            /// </summary>
            public static Vector256<int> MultiplyWideningAndAddSaturate(Vector256<int> addend, Vector256<short> left, Vector256<short> right) => MultiplyWideningAndAddSaturate(addend, left, right);
        }

        /// <summary>
        /// __m512i _mm512_dpbusd_epi32 (__m512i src, __m512i a, __m512i b)
        /// VPDPBUSD zmm, zmm, zmm/m512
        /// </summary>
        public static Vector512<int> MultiplyWideningAndAdd(Vector512<int> addend, Vector512<byte> left, Vector512<sbyte> right) => MultiplyWideningAndAdd(addend, left, right);

        /// <summary>
        /// __m512i _mm512_dpwssd_epi32 (__m512i src, __m512i a, __m512i b)
        /// VPDPWSSD zmm, zmm, zmm/m512
        /// </summary>
        public static Vector512<int> MultiplyWideningAndAdd(Vector512<int> addend, Vector512<short> left, Vector512<short> right) => MultiplyWideningAndAdd(addend, left, right);

        /// <summary>
        /// __m512i _mm512_dpbusds_epi32 (__m512i src, __m512i a, __m512i b)
        /// VPDPBUSDS zmm, zmm, zmm/m512
        /// </summary>
        public static Vector512<int> MultiplyWideningAndAddSaturate(Vector512<int> addend, Vector512<byte> left, Vector512<sbyte> right) => MultiplyWideningAndAddSaturate(addend, left, right);

        /// <summary>
        /// __m512i _mm512_dpwssds_epi32 (__m512i src, __m512i a, __m512i b)
        /// VPDPWSSDS zmm, zmm, zmm/m512
        /// </summary>
        public static Vector512<int> MultiplyWideningAndAddSaturate(Vector512<int> addend, Vector512<short> left, Vector512<short> right) => MultiplyWideningAndAddSaturate(addend, left, right);
    }
}
```
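For reference, a minimal sketch of how a caller would use this surface once it ships (the wrapper method and names here are illustrative, not part of the proposal):

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class Example
{
    // Illustrative only: one VPDPBUSD step of a byte dot product.
    // For each 32-bit lane, the instruction multiplies four unsigned bytes of
    // `left` by the corresponding signed bytes of `right`, sums the four
    // widened products, and adds the result to the matching lane of `addend`.
    static Vector512<int> DotProductStep(Vector512<int> addend, Vector512<byte> left, Vector512<sbyte> right)
    {
        if (Avx512Vnni.IsSupported)
        {
            return Avx512Vnni.MultiplyWideningAndAdd(addend, left, right);
        }

        // A real caller would fall back to Avx512BW sequences or scalar code here.
        throw new PlatformNotSupportedException();
    }
}
```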
I am interested in why this proposal is tagged with `needs-further-triage`? If there are any concerns, please voice them so I can know what's wrong and we can come to a conclusion. And if a proposal is marked with `needs-further-triage`, shouldn't it also be demoted to `api-suggestion` then?
It was marked that way because a non-area-owner set the milestone, and the milestone being "correct" needed confirmation from the area owners.
To follow the newer pattern we've established with AVX10, this should probably be changed to
```csharp
namespace System.Runtime.Intrinsics.X86;

// existing class
public abstract class AvxVnni : Avx2
{
    // new nested class
    [Intrinsic]
    public abstract class V512
    {
        public static bool IsSupported { get => IsSupported; }

        /// <summary>
        /// __m512i _mm512_dpbusd_epi32 (__m512i src, __m512i a, __m512i b)
        /// VPDPBUSD zmm, zmm, zmm/m512
        /// </summary>
        public static Vector512<int> MultiplyWideningAndAdd(Vector512<int> addend, Vector512<byte> left, Vector512<sbyte> right) => MultiplyWideningAndAdd(addend, left, right);

        /// <summary>
        /// __m512i _mm512_dpwssd_epi32 (__m512i src, __m512i a, __m512i b)
        /// VPDPWSSD zmm, zmm, zmm/m512
        /// </summary>
        public static Vector512<int> MultiplyWideningAndAdd(Vector512<int> addend, Vector512<short> left, Vector512<short> right) => MultiplyWideningAndAdd(addend, left, right);

        /// <summary>
        /// __m512i _mm512_dpbusds_epi32 (__m512i src, __m512i a, __m512i b)
        /// VPDPBUSDS zmm, zmm, zmm/m512
        /// </summary>
        public static Vector512<int> MultiplyWideningAndAddSaturate(Vector512<int> addend, Vector512<byte> left, Vector512<sbyte> right) => MultiplyWideningAndAddSaturate(addend, left, right);

        /// <summary>
        /// __m512i _mm512_dpwssds_epi32 (__m512i src, __m512i a, __m512i b)
        /// VPDPWSSDS zmm, zmm, zmm/m512
        /// </summary>
        public static Vector512<int> MultiplyWideningAndAddSaturate(Vector512<int> addend, Vector512<short> left, Vector512<short> right) => MultiplyWideningAndAddSaturate(addend, left, right);
    }
}
```
which eliminates the duplication between `AvxVnni` and `Avx512Vnni.VL` and allows us to report support for `AvxVnni` when any of `AVX-VNNI`, `AVX512-VNNI+AVX512-VL`, or `AVX10V1` are present.
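As a sketch of what that unification buys consumers (assuming `AvxVnni.V512` ships as shown above; the surrounding method is illustrative):

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class UnifiedCheckExample
{
    static Vector256<int> Accumulate(Vector256<int> acc, Vector256<byte> left, Vector256<sbyte> right)
    {
        // One check lights up whether the hardware reports AVX-VNNI,
        // AVX512-VNNI+AVX512-VL, or AVX10.1 -- callers don't need to care which.
        if (AvxVnni.IsSupported)
        {
            return AvxVnni.MultiplyWideningAndAdd(acc, left, right);
        }

        throw new PlatformNotSupportedException();
    }
}
```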
> To follow the newer pattern we've established with AVX10, this should probably be changed to […] which eliminates the duplication between `AvxVnni` and `Avx512Vnni.VL` and allows us to report support for `AvxVnni` when any of `AVX-VNNI`, `AVX512-VNNI+AVX512-VL`, or `AVX10V1` are present.
I don't think so. AVX512-VNNI predates AVX-VNNI. There are many CPUs that don't support AVX-VNNI but do support AVX512-VNNI.
I wouldn't expect `AvxVnni.IsSupported` to be `true` on Ice Lake CPUs. It also feels odd, to me if not to everyone, to see `AvxVnni.V512.IsSupported` returning `true` on Ice Lake CPUs, because Ice Lake CPUs don't support AVX-VNNI per se.

Same applies for AVX-IFMA: Sapphire Rapids CPUs don't support AVX-IFMA, but they do support AVX512-IFMA. Even though there's no difference aside from instruction encoding, the name confuses me a lot.

I think `Avx512Vnni` should at least be separate from `AvxVnni`, if renaming `AvxVnni` is inappropriate (which is most likely the case). It should be fine to let `Avx512Vnni.VL` extend `AvxVnni`, except with `IsSupported` returning `false` on CPUs without AVX512-VNNI.
> AVX512-VNNI predates AVX-VNNI. There are many CPUs that don't support AVX-VNNI but do support AVX512-VNNI.
This isn't relevant nor impacted by the proposed API surface change. AVX512 is a legacy and effectively deprecated API set that we would likely not have exposed as is if we knew in advance about the transition to AVX10.1 and the new scheme that would exist moving forward.

Doing this removes duplication without negatively impacting any consumer of the API, minus a minor nuance that the class name differs slightly from the CPUID bit name, and keeps it consistent with the intended "converged ISA" schema that's been defined for the future by Intel under AVX10.1.
> Even though there's no difference aside from instruction encoding, the name confuses me a lot.
I think if this were going through review today, it would probably be named just `Vnni` so that it wouldn't imply a direct connection to the AVX_VNNI cpuid bit, but that ship has sailed.
> It should be fine to let `Avx512Vnni.VL` extend `AvxVnni`, except with `IsSupported` returning `false` on CPUs without AVX512-VNNI.
This is also confusing because you either:

1. Inherit the methods from the base class, which VS will suggest you change to call the base class directly anyway, and which also breaks the assumption that has always held in the S.R.I hierarchy that each ISA implies support for those below it.
2. Duplicate the methods in the VL class, hiding the ones from the base class. This ends up forcing users to duplicate their code, because they have to make two `IsSupported` checks to protect two otherwise identical intrinsic method calls (see the sketch below).
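To make option 2 concrete, a sketch of the duplicated guard code users would end up writing (method names follow the approved API above; the wrapper itself is illustrative):

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class DuplicatedCheckExample
{
    static Vector128<int> Accumulate(Vector128<int> acc, Vector128<byte> left, Vector128<sbyte> right)
    {
        // Two IsSupported checks protecting two otherwise identical calls.
        if (Avx512Vnni.VL.IsSupported)
        {
            return Avx512Vnni.VL.MultiplyWideningAndAdd(acc, left, right);
        }
        if (AvxVnni.IsSupported)
        {
            return AvxVnni.MultiplyWideningAndAdd(acc, left, right);
        }

        throw new PlatformNotSupportedException();
    }
}
```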
> AVX512 is a legacy and effectively deprecated API set that we would likely not have exposed as is if we knew in advance about the transition to AVX10.1 and the new scheme that would exist moving forward.
The problem is: AMD.
AMD's Zen5 microarchitecture doesn't support AVX10.1.
Zen5 also lacks support for FP16 which Intel added in Sapphire Rapids and later included in AVX10.1, while supporting VP2INTERSECT which Intel added in Tiger Lake but later ditched in Rocket Lake, and isn't even reintroduced in Granite Rapids.
If we didn't expose `Avx512*`, we would run into a lot of issues supporting not only Rocket Lake and earlier Intel CPUs, but also Zen4, Zen5 and even later AMD CPUs, which lack (or at least are alleged to lack) support for some subsets included in AVX10.1 like FP16.
I don't think AMD will bring support for AVX10, especially given the fact that Zen5 supports VP2INTERSECT.
NOTE: Some of the below discussion is largely based on "reasonable speculation" supported by typically seen timeframes. It shouldn't be taken as fact, a promise of the future, etc.
> I think if this were going through review today, it would probably be named just `Vnni` so that it wouldn't imply a direct connection to the AVX_VNNI cpuid bit, but that ship has sailed.
I'd say doubtful to this. The `Avx` prefix remains important to disambiguate from other potential architectures (an issue we have with `Arm.Aes` vs `X86.Aes`, for example) and is part of the generalized schema for non-core extensions under AVX10.
> The problem is: AMD.
AMD doesn't cause any problem here. My statement was that AVX512 is legacy/deprecated and would not be exposed "as is", not that it wouldn't be exposed. Had we known in advance, we likely would've exposed this functionality following a split that more closely models how things functionally exist today while still fitting the generalized schema for the future.

-- Legacy/deprecated here is also a little bit of a misnomer and there is deeper context. The encoding and general feature support introduced by AVX512 remains, as does the potential for 512-bit support to exist in hardware. Realistically all that has changed is that rather than us having `512-bit required` with `128/256-bit optional`, we have `128/256-bit required` with `512-bit optional`. How the ISAs are planned to be exposed in the future, the requirement that things stay converged, and other considerations also come in that make AVX512 "legacy/deprecated" in terms of what's being discussed here (as now it is rather more `AVX10 + V512 support`).
That is, rather than defining `Avx512F` + `Avx512BW` + `Avx512CD` + `Avx512DQ` + `Avx512Vbmi` + ... and then defining nested `VL` classes for each, we would have baked in some of the assumptions that the internal runtime implementation requires and that future implementations have been guaranteed to implement under the converged ISA. There would have been some nuance that theoretically some hardware (such as Intel `Knights Landing`) might exist or be exposed in the future which doesn't fit this model, but the overall tradeoff would've been worth it.
Thus, we likely would've had a schema where:

* The baseline would be 128/256-bit support with 512-bit support optional, inverting the legacy shape (`AVX512F` with `AVX512VL`): rather than `class Avx512Isa { class VL { } }` we can have `class AvxIsa { class V512 { } }`
* The core ISAs (`F+BW+CD+DQ+VL`) should be provided together
  * This matches the `x86-64-v4` definition and what exists for all current hardware except `Knights Landing` and `Knights Mill` (which are different/special for many other reasons)
* The extension ISAs (`IFMA`, `VBMI`, `VBMI2`, `VPOPCNTDQ`, `VNNI`, `BITALG`, `VP2INTERSECT`, `GFNI`, `VPCLMULQDQ`, and `VAES`) were still independent
* `Avx10v1` and `Avx10v2` for everything else

This would have been a divergence from what was formally spec'd in CPUID, but it would in turn hide some of the messy nuance that exists and make it much easier for developers to write code that works in production for both hardware. With the setup we have today, we instead have this awkward duplication between many of the `Avx512Isa.VL` classes and the `Avx10v1` class; this requires developers to effectively duplicate their code if they want to support both, leading to a pit of failure and general UX concern.
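A rough sketch of that hypothetical shape (never shipped; the class name and layout here are purely illustrative):

```csharp
// Hypothetical schema, not the shipped API: 128/256-bit support is the baseline
// class, with 512-bit support exposed as an optional nested extension.
public abstract class Avx512 : Avx2       // core ISAs (F+BW+CD+DQ+VL) together
{
    public static new bool IsSupported { get; }

    // Optional 512-bit support, mirroring how AVX10 exposes V512.
    public abstract class V512
    {
        public static bool IsSupported { get; }
    }
}
```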
> AMD's Zen5 microarchitecture doesn't support AVX10.1. Zen5 also lacks support for FP16 which Intel added in Sapphire Rapids and later included in AVX10.1, while supporting VP2INTERSECT which Intel added in Tiger Lake but later ditched in Rocket Lake, and isn't even reintroduced in Granite Rapids.
This is effectively related to timing, as designing a CPU and integrating all the functionality takes years (for many reasons). You can see this in the gap between when a specification is announced and when it actually shows up in hardware: there is, on average, a 2-3 year delay between when an ISA specification is revealed and when we first see hardware start to implement it. Sometimes it's a bit less and other times longer, depending on exactly what is required, how similar it is to past support, if additional or differing silicon is required, etc.
In the case of Zen5, it was first announced in some official capacity back in 2018, had confirmation on the fabrication process back in 2022, and it shipped in 2024. FP16 had been announced in mid 2021 (shipping with `Sapphire Lake` in 2023) and it was likely too late to include it. Correspondingly, `VP2INTERSECT` had been announced several years prior and shipped in `Tiger Lake` (2020), so if Zen5 followed the general speculation given here, this would line up cleanly with why certain ISAs appeared but not others. -- Notably early Alder Lake (2021) chips also included `VP2INTERSECT` support (prior to AVX512 as a whole being fused off). Rocket Lake (2021) didn't have it, but that particular microarchitecture was in many senses parallel to the normal flow, having been a successor to Comet Lake and remaining on the 14nm process, rather than the newer 10nm (Tiger Lake) or 7nm (Alder Lake) process.
This will likely be rectified in Zen6 considering the same overall timing (which as per the top note is purely speculation).
> I don't think AMD will bring support for AVX10, especially given the fact that Zen5 supports VP2INTERSECT.
Support for `VP2INTERSECT` has nothing to do with `AVX10`; it is allowed to be independent.

The only thing currently missing is `FP16`, at which point `AVX10` support is a minor change to what CPUID reports. Unlike adding a new ISA, changing what CPUID reports doesn't require a significant change to silicon; it can typically be enabled via microcode and done very late in the design process (we've seen this repeatedly in the past, both in terms of there often being BIOS control switches to enable/disable key bits, but also in terms of microcode patches changing what CPUID reports to work around bugs or other issues).
Particularly with the new `x86 Ecosystem Advisory Group` that was formed by `Intel`, `AMD`, and other key companies/persons, I would speculate we see an increase in collaboration, early sharing of ISA specifications, and convergence between what hardware supports (as detailed in the announcement post). This would include `X86S`, `APX`, `AVX10.1`, `AVX10.2`, and future ISAs.
-- As an additional note, one of the likely reasons that `VP2INTERSECT` was "pulled" for Sapphire Rapids is because the implementation introduced in Tiger Lake clocked at 25-40 cycles and took around 46 micro-ops. Software-based alternatives that had better performance were found, and the support no longer existed in Sapphire Rapids. This may in part be why it was excluded from AVX10. AMD Zen5 then ended up shipping an implementation that takes 1 cycle, making it perform better than the software-based alternatives and, if some of the hardware analysis is to be believed, without significant cost to the silicon, as much of it is shared with the `VPCONFLICT` support. It would not be unreasonable for us to see the ISA brought back as required in some future AVX10 version, or for us to see it brought back simply as `VP2INTERSECT` in future Intel CPUs and remain an optional extension.
> My statement was that AVX512 is legacy/deprecated and would not be exposed "as is", not that it wouldn't be exposed. Had we known in advance, we likely would've exposed this functionality following a split that more closely models how things functionally exist today while still fitting the generalized schema for the future. -- Legacy/deprecated here is also a little bit of a misnomer and there is deeper context. Realistically all that has changed is that rather than us having `512-bit required` with `128/256-bit optional`, we have `128/256-bit required` with `512-bit optional`. How the ISAs are planned to be exposed in the future, the requirement that things stay converged, and other considerations also come in that make AVX512 "legacy/deprecated" in terms of what's being discussed here (as now it is rather more `AVX10 + V512 support`).
I see, that makes sense. I apologize for my misunderstanding.
> FP16 had been announced in mid 2021 (shipping with `Sapphire Lake` in 2023)
I think you meant to write Sapphire Rapids, right?
> I think you meant to write Sapphire Rapids, right?
Yes, too many similar microarchitecture names and I messed this particular one up 😄
Background and motivation
There already is support for the AVX-VNNI hardware instruction set for 128-/256-bit vectors, and it would be good to have the same support for 512-bit vectors (512-bit versions of these instructions are available, see https://en.wikipedia.org/wiki/AVX-512?useskin=vector#VNNI). Also, this feature is marked as in preview to be consistent with the existing AvxVnni API.
API Proposal
API Usage
The motivation for this proposal is largely the same as that of AvxVnni. These instructions are used the same way the AvxVnni ones are and may be used in virtually any algorithm, as long as you know where to apply them, offering good performance improvements over multi-instruction sequences that produce the same output.
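For instance, a hedged sketch of a byte dot-product kernel built on the proposed 512-bit surface (the kernel itself is illustrative and assumes length-matched, 64-element-aligned inputs):

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class Usage
{
    // Illustrative only: dot product of unsigned bytes against signed bytes,
    // processing 64 elements per iteration with a single VPDPBUSD.
    static int DotProduct(ReadOnlySpan<byte> a, ReadOnlySpan<sbyte> b)
    {
        if (!Avx512Vnni.IsSupported || a.Length != b.Length || a.Length % 64 != 0)
        {
            throw new NotSupportedException();
        }

        Vector512<int> acc = Vector512<int>.Zero;
        for (int i = 0; i < a.Length; i += 64)
        {
            Vector512<byte> va = Vector512.Create<byte>(a.Slice(i, 64));
            Vector512<sbyte> vb = Vector512.Create<sbyte>(b.Slice(i, 64));
            acc = Avx512Vnni.MultiplyWideningAndAdd(acc, va, vb);
        }

        return Vector512.Sum(acc);
    }
}
```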
Alternative Designs
Maybe it would be better to add Vector512 versions of functions into existing AvxVnni static class, but I am not sure that would be a good idea, as these instructions use EVEX encoding and may not be available on some Intel processors with hybrid core architecture (Alder Lake and its successors).
Risks
N/A