Open MineCake147E opened 10 months ago
Tagging subscribers to this area: @dotnet/area-system-runtime-intrinsics See info in area-owners.md if you want to be subscribed.
| Author: | MineCake147E |
|---|---|
| Assignees: | - |
| Labels: | `api-suggestion`, `area-System.Runtime.Intrinsics` |
| Milestone: | - |
@dotnet/avx512-contrib
The names here could be "better". I'd expect simply `MultiplyAdd52Low` and `MultiplyAdd52High` or similar would be sufficient and would more closely match the "name" portion of the underlying C/C++ intrinsics `_mm512_madd52lo_epu64` and `_mm512_madd52hi_epu64` (the name portions are `madd52lo` and `madd52hi`).
These are not "fused" operations: performing the multiply and the add as separate scalar integer steps produces the same result as doing them combined (unlike floating-point FMA, where fusing avoids an intermediate rounding).
I updated the proposal accordingly; the parameters are now named `addend`, `left`, and `right`.
```csharp
namespace System.Runtime.Intrinsics.X86
{
    public abstract class Avx512Ifma : Avx512F
    {
        public static bool IsSupported { get; }

        public static Vector512<ulong> MultiplyAdd52Low(Vector512<ulong> addend, Vector512<ulong> left, Vector512<ulong> right);
        public static Vector512<ulong> MultiplyAdd52High(Vector512<ulong> addend, Vector512<ulong> left, Vector512<ulong> right);

        public abstract class VL : Avx512F.VL
        {
            public static new bool IsSupported { get; }

            public static Vector256<ulong> MultiplyAdd52Low(Vector256<ulong> addend, Vector256<ulong> left, Vector256<ulong> right);
            public static Vector256<ulong> MultiplyAdd52High(Vector256<ulong> addend, Vector256<ulong> left, Vector256<ulong> right);
            public static Vector128<ulong> MultiplyAdd52Low(Vector128<ulong> addend, Vector128<ulong> left, Vector128<ulong> right);
            public static Vector128<ulong> MultiplyAdd52High(Vector128<ulong> addend, Vector128<ulong> left, Vector128<ulong> right);
        }
    }
}
```
Similar to https://github.com/dotnet/runtime/issues/86849, this should probably be changed to:
```csharp
namespace System.Runtime.Intrinsics.X86;

// approved in https://github.com/dotnet/runtime/issues/98833
public abstract class AvxIfma : Avx2
{
    // new nested class
    [Intrinsic]
    public new abstract class V512
    {
        public static new bool IsSupported { get => IsSupported; }

        public static Vector512<ulong> MultiplyAdd52Low(Vector512<ulong> addend, Vector512<ulong> left, Vector512<ulong> right);
        public static Vector512<ulong> MultiplyAdd52High(Vector512<ulong> addend, Vector512<ulong> left, Vector512<ulong> right);
    }
}
```
Since the parent is not yet implemented, we also have the option of changing that name to just `Ifma`, since it wouldn't directly correlate to the `AVX_IFMA` cpuid bit any longer.
Background and motivation
AVX-512 IFMA is supported by Intel in the Cannon Lake and newer architectures, and by AMD in Zen 4. These instructions are known to be useful for cryptography and large-number processing, and as a faster (if more limited) alternative to the `VPMULLQ` instruction, which finishes 5x slower on Intel CPUs compared to AMD Zen 4, whereas `VPMADD52LUQ` finishes in only 4 clock cycles.

API Proposal
API Usage
An example of vectorized Montgomery reduction implementations using the equivalent C++ intrinsics:
https://github.com/intel/hexl/blob/2d196fdd71f24511bd7e0e23dc07d37c888f53e7/hexl/util/avx512-util.hpp#L384-L411
Alternative Designs
Risks
None