dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

AVX-512 support in System.Runtime.Intrinsics.X86 #35773

Closed twest820 closed 1 year ago

twest820 commented 4 years ago

I presume supporting AVX-512 intrinsics is in plan somewhere, but couldn't find an existing issue tracking their addition. There seem to be two parts to this.

  1. Support for EVEX encoding and use of zmm registers. I'm not entirely clear on compiler versus jit distinctions but perhaps this would allow jit to update existing 128 and 256 bit wide code using the Sse, Avx, or other System.Runtime.Intrinsics.X86 classes to EVEX.
  2. Addition of Avx512 classes with the new instructions at 128, 256, and 512 bit widths.

There is some interface complexity with the (as of this writing) 17 AVX-512 subsets since Knights Landing/Mill, Skylake, Cannon Lake, Cascade Lake, Cooper Lake, and Ice/Tiger Lake all support different variations. To me, it seems most natural to deprioritize support for the Knights (they're no longer in production, so presumably nearly all code targeting them has already been written) and implement something in the direction of

class Avx512FCD : Avx2 // minimum common set across all Intel CPUs with AVX-512
class Avx512VLDQBW : Avx512FCD // common set for enabled Skylake μarch cores and Sunny Cove

plus non-inheriting classes for BITALG, IFMA52, VBMI, VBMI2, VNNI, BF16, and VP2INTERSECT (the remaining four subsets—4FMAPS, 4NNIW, ER, and PF—are specific to Knights). This is similar to the existing model for Bmi1, Bmi2, and Lzcnt and aligns to current hardware in a way which composes with existing inheritance and IsSupported properties. It also helps with incremental roll out.

Finding naming for code readability that's still clear as to which instructions are available where seems somewhat tricky. Personally, I'd be content with idioms like

using Avx512 = System.Runtime.Intrinsics.X86.Avx512VLDQBW; // loose terminology

but hopefully others will have better ideas.

ghost commented 4 years ago

Tagging subscribers to this area: @tannergooding Notify danmosemsft if you want to be subscribed.

Symbai commented 4 years ago

#8264 & #31420, but it looks like a tracking issue is still missing.

tannergooding commented 4 years ago

There isn't an explicit tracking issue right now.

AVX-512 represents a significant investment as it nearly triples the current surface area (from ~1500 APIs to ~4500 APIs). It additionally adds a new encoding, additional registers that would require support (extending to 512 bits, supporting 16 more registers, and adding 8 mask registers), a new SIMD type (TYP_SIMD64 and Vector512<T>), and more. While this support could be added piece by piece, I'm not sure it meets the bar for trying to drive through API review any time soon (.NET 5) and so I won't have time to create the relevant API proposals, etc. I do imagine that will change as hardware support becomes more prevalent and the scenarios where it can be used beneficially increase.

If someone does want to create a rough proposal of what AVX-512F would look like (since that is the base for the rest of the AVX-512 support), then I'd be happy to provide feedback and continue the discussion until it does bubble up.

CC. @CarolEidt, @echesakovMSFT, @BruceForstall as they may have additional or different thoughts/opinions

twest820 commented 4 years ago

Totally agree. Visual C++'s main AVX-512 roll out seems to have spanned the entire Visual Studio 2017 lifecycle and is still receiving attention in recent VS 2019 updates. It seems to me an initial question here could be what an AVX-512 roadmap might look like across multiple .NET annual releases. In the meantime, there is the workaround of calling intrinsics from C++, C++ from C++/CLI, and C++/CLI from C#. But I wouldn't have opened this issue if that layering was a great developer experience compared to intrinsics from C#. :-)

+3000 APIs is maybe ultimately on the low side. My current scrape of the Intel Intrinsics Guide lists 4255 AVX-512 intrinsics and 540 instructions. Only 380 of the intrinsics are not in the F+CD+VL+DQ+BW group supported from Skylake-SP and X and Ice Lake supports 4124 of the 4255 (give or take errors in the Guide I haven't caught or just on my part). Depending how exactly AVX-512F is defined I count it as totaling either 1435 or 2654 intrinsics. So it might make more sense to try to start with the initial 1500 intrinsics prioritized for Visual C++ 2017. Or even some subset thereof. I don't have that list, though.

Within this context, @tannergooding, if you can give me some more definition of what you're looking for in an AVX-512F sketch I can probably put something together.

I touched on this in #226, but the ability to jit existing 128 and 256 bit System.Runtime.Intrinsics.X86 APIs to EVEX for access to zmm registers 16-31 would be a valuable minimum increment even if not headlined by the addition of an Avx512 class. Definitely for most of the kernels in the various numerical codes I've written and perhaps also for the CLR's internal use of SIMD. (I can suggest some other pragmatically minded clickstops if there's interest.)

tannergooding commented 4 years ago

At the most basic level, there would need to be a Vector512<T> type that mirrors the Vector64/128/256 types and a new Avx512F class to contain the methods.

The methods proposed would likely have signatures like the following at a minimum (essentially mirroring SSE/AVX, but extending to V512):

/// <summary>
/// __m512d _mm512_add_pd (__m512d a, __m512d b);
///   VADDPD zmm, zmm, zmm/m512
/// </summary>
public static Vector512<double> Add(Vector512<double> left, Vector512<double> right)

On top of that minimum, there would need to be a proposal for a new x86 specific Mask8 register and overloads using the mask would need to be provided in order to fully support the EVEX encoding:

/// <summary>
/// __m512d _mm512_mask_add_pd (__m512d s, __mmask8 k, __m512d a, __m512d b);
///   VADDPD zmm, zmm, zmm/m512
/// </summary>
public static Vector512<double> Add(Vector512<double> value, Mask8 mask, Vector512<double> left, Vector512<double> right); // This overload merges values not written to by the mask

/// <summary>
/// __m512d _mm512_maskz_add_pd (__mmask8 k, __m512d a, __m512d b);
///   VADDPD zmm, zmm, zmm/m512
/// </summary>
public static Vector512<double> Add(Mask8 mask, Vector512<double> left, Vector512<double> right); // This overload zeros values not written to by the mask

EVEX additionally has support for broadcast versions which take right as a T* and broadcast the value to all elements of the V512, but I'm not sure those are explicitly needed and they warrant further discussion. I imagine the JIT could recognize a Vector512.Create(value) call and optimize it to generate the ideal code (noting C++ does similar).

EVEX additionally has support for rounding versions which take an immediate that specifies the rounding behavior done for the given operation. This would likewise require some additional thought and consideration.

Then there are 128-bit and 256-bit versions for most of these, but they fall under AVX512VL which would require its own thought into how to expose. My initial thought is that we would likely try to follow the normal hierarchy, but where something inherits from multiple classes we would need additional consideration in how they get exposed. This would require at least a breakdown of what ISAs exist and what their dependencies are.

john-h-k commented 4 years ago

EVEX additionally has support for rounding versions which take an immediate that specifies the rounding behavior done for the given operation. This would likewise require some additional thought and consideration.

Does this mean rounding immediates on operations like add/sub rather than explicit rounding instructions?

tannergooding commented 4 years ago

No, the rounding instructions convert floats to integrals, while the rounding mode impacts the returned result for x + y (for example).

IEEE 754 floating-point arithmetic is performed taking the inputs as given, computing the "infinitely precise result" and then rounding to the nearest representable result. When the "infinitely precise result" is equally close to two representable values, you need a tie breaker to determine which to choose. The default tie breaker is "ToEven", but you can (not in .NET, but in other languages or in hardware) set the rounding mode to do something like AwayFromZero, ToZero, ToPositiveInfinity, or ToNegativeInfinity instead. EVEX supports doing this on a per operation basis without having to use explicit instructions to modify and restore the floating-point control state.
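For illustration, the default "ToEven" tie-breaking is observable from any language; a small Python sketch (an editor's illustration only, unrelated to the EVEX encoding itself):

```python
from decimal import Decimal, ROUND_HALF_UP, ROUND_FLOOR

# Round-to-nearest with ties-to-even ("banker's rounding") is the IEEE 754
# default: when a value is exactly halfway between two representable results,
# the neighbor with an even last digit wins.
assert round(0.5) == 0   # halfway -> even neighbor 0
assert round(1.5) == 2   # halfway -> even neighbor 2
assert round(2.5) == 2   # halfway -> even neighbor 2, not 3

# Other rounding directions (the kind of behavior EVEX can select per
# operation) can be emulated with the decimal module:
assert Decimal("2.5").quantize(Decimal("1"), rounding=ROUND_HALF_UP) == 3
assert Decimal("2.5").quantize(Decimal("1"), rounding=ROUND_FLOOR) == 2
```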

john-h-k commented 4 years ago

Ah, brilliant, I think that is what I meant but I didn't word it great 😄

scalablecory commented 4 years ago

I would appreciate the "compare into mask" instructions in AVX-512BW to speed up parsing and IndexOf.

saucecontrol commented 4 years ago

there would need to be a proposal for a new x86 specific Mask8 register and overloads using the mask would need to be provided in order to fully support the EVEX encoding:

For the new mask and maskz instruction variants, couldn't the JIT recognize VectorNNN<T>.Zero in the value arg and use the zero source encoding? That would mean only doubling the API surface area instead of tripling it 😄
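The semantic difference between the two mask variants can be modeled in plain Python (an illustrative sketch of the behavior only; the function names are made up, not proposed API):

```python
def mask_add(value, mask, left, right):
    """Merge-masking: lanes with a 0 mask bit keep the element from `value`."""
    return [l + r if m else v for v, m, l, r in zip(value, mask, left, right)]

def maskz_add(mask, left, right):
    """Zero-masking: lanes with a 0 mask bit are zeroed."""
    return [l + r if m else 0 for m, l, r in zip(mask, left, right)]

value = [9.0, 9.0, 9.0, 9.0]
mask  = [1, 0, 1, 0]
left  = [1.0, 2.0, 3.0, 4.0]
right = [10.0, 20.0, 30.0, 40.0]

assert mask_add(value, mask, left, right) == [11.0, 9.0, 33.0, 9.0]
assert maskz_add(mask, left, right) == [11.0, 0, 33.0, 0]
```

The suggestion above corresponds to detecting an all-zero `value` argument and emitting the zero-masking (`{z}`) form of the instruction instead of the merging form.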

tannergooding commented 4 years ago

Yes, there are likely some tricks we can do to help limit the number of exposed APIs and/or the number of APIs we need to review.

twest820 commented 4 years ago

This would require at least a breakdown of what ISAs exist and what their dependencies are.

Dependencies exist only on F and VL and seem unlikely to be concerns on Intel hardware. Probably not AMD either if they implement AVX-512. It seems github doesn't support tables in comments so I made a small repo with details.

At the most basic level, there would need to be a Vector512 type that mirrors the Vector64/128/256 types and a new Avx512F class to contain the methods.

Actually, if I had to pick just one width for initial EVEX and new intrinsic support it'd be 128.

there would need to be a proposal for a new x86 specific Mask8

Also 16, 32, and 64 bit masks. And rounding, comparison, minmax, and mantissa norm and sign enums. The BF16 subset planned for Tiger Lake would require #936 but that's unimportant at this point. I'll see about getting something sketched, hopefully in the next week or so.

tannergooding commented 4 years ago

Dependencies exist only on F and VL and seem unlikely to be concerns on Intel hardware

It is a bit more in depth than this...

ANDPD for example depends on AVX512DQ for the 512-bit variant. The 128 and 256-bit variant depend on both AVX512DQ and AVX512VL. Since this has two dependencies, it can't be modeled using the existing inheritance hierarchy.

Now, given how VL works, it might be feasible to expose it as the following:

public abstract class AVX512F : ??
{
    public abstract class VL
    {
    }
}

public abstract class AVX512DQ : AVX512F
{
    public abstract class VL : AVX512F.VL
    {
    }
}

This would key off the existing model we have used for 64-bit extensions (e.g. Sse41.X64) and still maintains the rule that AVX512F.VL.IsSupported means AVX512F.IsSupported, etc. There are also other considerations that need to be taken into account, such as what AVX512F depends on (iirc, it is more than just AVX2 and also includes FMA, which needs to be appropriately exposed).

Actually, if I had to pick just one width for initial EVEX and new intrinsic support it'd be 128.

I think this is a non-starter. The 128-bit EVEX support is not baseline, it (and the 256-bit support) is part of the AVX512VL extension and so the AVX512F class would need to be vetted first.

twest820 commented 4 years ago

It is a bit more in depth than this...

Hi Tanner, yes, it is. That's why the classes suggested as a starting point when this issue was opened don't attempt to model every CPUID flag individually. It's also why I posted the tabulation in the repo linked above.

While there are lots of possible factorings, it seems to me they're all going to be less than ideal in some way because class inheritance is an inexact match to the CPUID flags. My thoughts have gone in the same direction as you're exploring but I landed in a little bit different place. I'm not sure how abstract classes would work with the current static method model for intrinsics but one option might be

public class Avx512F : Avx2 // inheriting from Avx2 captures more surface than Fma?
{
    public static bool IsSupported // checks OSXSAVE and F CPUID

    // eventually builds out to 1435 F intrinsics

    public class VL // eventually has all of the 1208 VL subset intrinsics which depend on F
    {
        public static bool IsSupported // checks OSXSAVE and VL CPUID but not F
    }
}

public class Avx512DQ : Avx512F // Intrinsics Guide says no DQ instructions have CPUID dependencies on F but arch manual says F must be checked before checking for DQ
{
    public static bool IsSupported // checks OSXSAVE, F and DQ CPUIDs

    // eventually has 223 DQ intrinsics which do not depend on VL

    public class VL // has the 176 DQ intrinsics which do depend on VL
    {
        public static bool IsSupported // checks OSXSAVE, F, VL, and DQ
    }
}

Presumably CD and BW would look much like DQ. My thinking for BITALG, IFMA52, VBMI, VBMI2, VNNI, BF16, and VP2INTERSECT when opening this issue was similar.

It seems to me the advantage to this approach is it's more robust to Intel or AMD maybe deciding do something different with CPUID flags in the future. It might also be more friendly to intellisense performance requirements during coding. The disadvantage is the CPUID structure would constantly be restated in code. This doesn't seem helpful to readability and forces developers to think a lot about which intrinsics are in which subsets while coding. That seems more distracting than necessary and probably occasionally frustrating. So I'm unsure this is the best available tradeoff.

These ideas can be expressed without nested classes, which I think might be a little more friendly to coding. I'll leave those variants for a later reply, though.

There are also other considerations that need to be taken into account, such as what AVX512F depends on

Technically, F doesn't depend on anything but silicon. Just like Sse2 doesn't depend on Sse. The reason the class hierarchy below Avx2 and Fma works is because Intel and AMD have always shipped expanding instruction sets. In this sense, the Fma-Avx2 fork probably wasn't great for continuing the derivation chain. But all we can do now is to make our best attempt at not creating similar problems in the AVX-512 surface.

This particular bit of learning with Avx2 and Fma is one of the reasons why I'm a little hesitant about individually modeling CPUID flags explicitly in a class hierarchy.

I think this is a non-starter. The 128-bit EVEX support is not baseline, it (and the 256-bit support) is part of the AVX512VL extension and so the AVX512F class would need to be vetted first.

I'm sorry, but I'm not understanding why such an implementation constraint would need to be imposed. Yes, Intel made an F subset and named it foundation and, yes, VL depends on F. But Intel's decisions about CPUID flag details don't need to control the order in which Microsoft ships intrinsics to customers.

If you're saying Microsoft's internal .NET review process is such that architects and similar would want to see an Avx512 class hierarchy, including Avx512F, laid out before approving work on a VL implementation that seems fair. However if they'd insist you (or another developer) code F before VL I think that's more than a bit strange. And maybe also somewhat disconnected from early Avx512 adoption, where adjusting existing 128 and 256 bit kernels to use masks or take advantage of certain additional instructions might be common.

tannergooding commented 4 years ago

It seems to me the advantage to this approach is it's more robust to Intel or AMD maybe deciding do something different with CPUID flags in the future ... Technically, F doesn't depend on anything but silicon. Just like Sse2 doesn't depend on Sse.

The architecture manuals both indicate that you cannot just check for x. Before using Sse2 you must first check that:

  1. CPUID is supported by querying bit 21 of the EFLAGS register
  2. Checking that CPUID.01H:EDX.SSE[bit 25] is 1
  3. Checking that CPUID.01H:EDX.SSE2[bit 26] is 1

The same goes for all of the hierarchies we've modeled in the exposed API surface, for example SSE4.2 (the manual's check list was attached as an image here).

We also followed up with Intel/AMD for places where the expectation and the manual had discrepancies. For example, SSE4.1 strictly indicates checking SSSE3, SSE3, SSE2, and SSE. While SSE4.2 only strictly indicates SSE4.1, SSSE3, SSE2, and SSE (missing SSE3). These are just an oversight in the specification and the intent is that SSE4.2 requires checking SSE4.1, which requires SSSE3, which requires SSE3, which requires SSE2, which requires SSE, which requires CPUID.

Many applications don't do this full checking (even if just once on startup) and instead only check the relevant CPUID bit and assume the others are correct. It's the same as AVX and AVX512F both requiring you to check the OSXSAVE bit and that the OS has stated it supports the relevant YMM or ZMM registers before checking the relevant feature bit. Likewise, extensions (AVX2, FMA, AVX512DQ) are all spec'd as requiring you to check the baseline flag as well (AVX or AVX512F).

But, due to the strict specification of the architecture manuals, there is a strict hierarchy of checks that exists and which can't change without breaking existing applications. So there will never be a case, for example, where an x86 CPU ships SSE4.1 support without SSE3 support.
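That strict hierarchy can be modeled as a dependency walk. A hedged sketch (the dependency table below paraphrases the checks described in this thread; it is not an authoritative reproduction of the manuals, and the real checks also involve OSXSAVE and OS state):

```python
# Each ISA maps to the feature(s) that must also be verified before using it,
# following the documented check sequence discussed above.
DEPENDS_ON = {
    "sse":      [],
    "sse2":     ["sse"],
    "sse3":     ["sse2"],
    "ssse3":    ["sse3"],
    "sse4.1":   ["ssse3"],
    "sse4.2":   ["sse4.1"],
    "avx":      ["sse4.2"],    # plus OSXSAVE / YMM state checks in reality
    "avx2":     ["avx"],
    "avx512f":  ["avx2"],      # plus OSXSAVE / ZMM state checks in reality
    "avx512dq": ["avx512f"],
}

def required_checks(isa):
    """Return every CPUID feature flag that must be verified before using `isa`."""
    seen, stack = [], [isa]
    while stack:
        cur = stack.pop()
        if cur not in seen:
            seen.append(cur)
            stack.extend(DEPENDS_ON[cur])
    return seen

# SSE4.2 transitively requires the whole SSE chain:
assert set(required_checks("sse4.2")) == {"sse4.2", "sse4.1", "ssse3", "sse3", "sse2", "sse"}
```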

This particular bit of learning with Avx2 and Fma is one of the reasons why I'm a little hesitant about individually modeling CPUID flags explicitly in a class hierarchy.

Yes, this is an edge case where the hierarchy can't be cleanly modeled. However, the hierarchy in general helps indicate what APIs are actually available and therefore removes duplication and confusion for the user. If the worst case scenario here is that we have 32 APIs which can't be modeled due to not having multiple inheritance, then I think we did alright 😄

I'm sorry, but I'm not understanding why such an implementation constraint would need to be imposed. Yes, Intel made an F subset and named it foundation and, yes, VL depends on F. But Intel's decisions about CPUID flag details don't need to control the order in which Microsoft ships intrinsics to customers.

The required work can actually be largely broken down into 3 stages:

  1. Adding EVEX support
  2. Adding TYP_SIMD64 support
  3. Adding TYP_MASK8/16/32/64 support

The first is the biggest blocker today and must be done before either stage 2 or 3. It would require updating the register allocator to be aware of the 16 new registers so they can be used, saved, and restored and the emitter to be able to successfully encode them when encountered. This would, in theory, require no public API surface changes and assuming it is always beneficial like using the VEX encoding is, it could automatically light up for the existing 128-bit/256-bit APIs when AVX512VL is supported.

The second is a smaller chunk of work, it's a natural extension on top of what we already have and the bulk of the work would actually be in API review just ensuring we are exposing the full surface and doing it correctly. Actually implementing support for these APIs shouldn't be too difficult as we have a table driven approach already, so it should just be adding new table entries and mapping them to the existing instructions just taking/returning TYP_SIMD64. Support for TYP_SIMD64 in the JIT should just be expanding the checks we do in a few places and again ensuring that the upper 256-bits are properly used/saved/restored.

The third is the biggest work item. It would require doing everything in both 1 and 2 for a new set of types. That is, the register allocator needs to be aware of these new registers so they can be used, saved, and restored. Likewise, the emitter needs to be able to successfully encode them. Support for the new types would also have to be integrated with the other various stages as well. We then also need to have an API review for the entire surface area which is, at a minimum, effectively everything we've already exposed for 128/256-bits but with an additional mask parameter. It explodes more if we include extensions to the Vector512 versions. Actually implementing them will likely be largely table driven but will require various parts of the table driven infrastructure to be updated and new support added in lowering and other stages to account for optimizations that can or should be done.

The second or third could technically be done in either order and yes there may be a larger use case for having 3 first, as it is an extension to existing algorithms and avoids needing to manually mask and blend. However, doing 3 first impacts confidence that the API surface we are exposing/shipping is correct and that we don't hit any gotchas that would prevent us from properly implementing F after VL.

There, of course, may be other gotchas or surprises not listed above that would be encountered when actually implementing these. It would also be much harder to test, since the amount of AVX-512 hardware available today is limited compared to the amount with AVX2 support, which needs to be taken into consideration.

twest820 commented 4 years ago

If the worst case scenario here is we have 32 APIs which can't be modeled due to not having multiple inheritance; then I think we did alright

I think so too. 😄 It's also why I proposed some things aligned with Knights and Skylake. While we don't know if, how, or when AMD might implement AVX-512, Intel is done with those two microarchitectures and we know Sunny Cove doesn't backtrack from Skylake instructions. So looking at how .NET might support the 96% of Ice Lake intrinsics which have been consistently available since Skylake is hopefully a pretty safe target.

Some of Intel's blog posts from years ago indicate CD will always be present with F, which is where the class Avx512FCD above comes from. Confirming this might be a good follow up question for them as it allows some simplification of C# inheritance hierarchies, reducing risk of orphaning CD like FMA. It's a helpful simplification if a similar assumption can be made for BW, DQ, and VL.

However, doing 3 first impacts confidence that the API surface we are exposing/shipping is correct and that we don't hit any gotchas that would prevent us from properly implementing F after VL

Thanks for explaining! I'm not sure I entirely follow the table structure, but am I correct in getting the impression it makes the cost of adding intrinsics fairly low? If so, the distinction I was trying to make (please consider unlocking some of 3 before finishing everything in 2) might not be a large ask.

My test situation's even worse until either desktop Ice Lakes or expanded Ice Lake laptop availability so I totally get the challenges there. I also appreciate EVEX support is a substantial effort.

But, due to the strict specification of the architecture manuals, there is a strict hierarchy of checks that exists and which can't change without breaking existing applications.

Oh excellent, I appreciate the catch (we have an unmanaged class I should correct as it's not honoring the SSE hierarchy). I've fixed up the code comments in my previous comment.

Curiously, the Intrinsics Guide typically does not indicate dependencies on AVX-512F even though sections 15.2.1, 15.3, and 15.4 of the Intel 64 and IA-32 Architectures Software Development Manual all indicate software must check F before checking other subset flags. I'll ask about this on the Intrinsics Guide bug thread over in Intel's ISA forum. I think there's also a typo in figure 15-5 of the arch manual as it should indicate table 15-2 rather than 2-2.

tannergooding commented 4 years ago

Confirming this might be a good follow up question for them as it allows some simplification of C# inheritance hierarchies, reducing risk of orphaning CD like FMA

Even if Intel would be unlikely to ever ship F without CD, the documented checks treat them as distinct ISAs and an implementation is allowed to provide F without CD (otherwise they would not be distinct ISAs), so we wouldn't provide them as part of the same class (especially considering how new the instructions are, relatively speaking).

I'm not sure I entirely follow the table structure but am I correct in getting the impression it makes the cost of adding intrinsics fairly low

It varies from intrinsic to intrinsic, but in general the intrinsics are table driven and so if it doesn't expose any new "concepts" then it is just adding a new entry to https://github.com/dotnet/runtime/blob/master/src/coreclr/src/jit/hwintrinsiclistxarch.h with the appropriate flags. The various paths know to lookup this information in the table to determine how it should be handled.

When it does introduce a new concept or if it requires specialized handling, then it requires a table entry and the relevant logic to be added to the various locations in the JIT (generally importation, lowering, register allocation, and codegen). In the ideal scenario, the new concept/handling is more generally applicable and so it is a one time cost for the first intrinsic that uses it and subsequent usages are then able to go down the simple table driven route.

The tests are largely table driven as well and are generated from the templates and metadata in https://github.com/dotnet/runtime/blob/master/src/coreclr/tests/src/JIT/HardwareIntrinsics/X86/Shared/GenerateTests.csx. This ensures the various relevant code paths are covered without having to explicitly codify the logic every time.

For 1, it is ideally just an encoding difference like the legacy vs VEX encoding was in which case there aren't really any new tests or APIs to expose. For 2, it is just extending the APIs to support 512-bit versions and so it, for the vast majority, is just reusing the existing concepts and will just be adding table entries. For 3, it is introducing a number of new concepts and so it will require quite a bit of revision to the intrinsic infrastructure to account for the mask operands and the various optimizations that can happen with them.
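The table-driven idea, where most intrinsics are pure data and only new concepts escape to custom handling, can be sketched generically (this illustrates the pattern only; it is not the actual format of `hwintrinsiclistxarch.h`, and the entries are invented):

```python
# Each entry: intrinsic name -> (instruction mnemonic, needs_special_handling).
# Simple intrinsics are fully described by their table entry; ones introducing
# a new concept are routed to dedicated logic in the compiler stages.
INTRINSIC_TABLE = {
    "Add":      ("vaddpd",  False),
    "Multiply": ("vmulpd",  False),
    "Shuffle":  ("vshufpd", True),   # e.g. takes an immediate, needs extra checks
}

def import_intrinsic(name):
    mnemonic, special = INTRINSIC_TABLE[name]
    if special:
        return f"{mnemonic}: routed through specialized import/lowering"
    return f"{mnemonic}: handled by the generic table-driven path"

assert "generic" in import_intrinsic("Add")
assert "specialized" in import_intrinsic("Shuffle")
```

The attraction of the pattern, as described above, is that adding a 512-bit variant of an already-supported instruction is usually just one more table row.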

twest820 commented 4 years ago

Minor status bump: Intel's never been particularly active on their instruction set extensions forum but they've recently stopped responding entirely. So no update from Intel on the questions about the arch manual and intrinsics guide that were raised here a month ago.

the documented checks is that they are distinct ISAs and an implementation is allowed to provide F without CD

Interesting. The arch manual states software must also check for F when checking for CD (and strongly recommends checking F before CD). You have more privileged access to what Intel really meant, and more context on how to resolve conflicts between the arch manual and the intrinsics guide, than most of us. Thanks for sharing.

tannergooding commented 4 years ago

I think my statement might have been misinterpreted.

I was indicating that the following should be possible (where + indicates supported and - indicates unsupported): +F, +CD and also +F, -CD.

The following should never be possible: -F, +CD.

AFAIK, there has never been a CPU that has shipped as +F, -CD, but given the spec it should be possible for some CPU to ship with such support.

hanblee commented 3 years ago

The required work can actually be largely broken down into 3 stages:

  1. Adding EVEX support
  2. Adding TYP_SIMD64 support
  3. Adding TYP_MASK8/16/32/64 support

@tannergooding Have you considered a staged approach to 1 above, by first adding the EVEX encoding without ZMM or mask support? This would allow use of AVX-512* instructions that operate on XMM and YMM without introducing Vector512<T> or Mask8 types and their underlying support in the JIT. For example, the following would then become possible:

/// <summary>
/// __m256i _mm256_popcnt_epi32 (__m256i a)
///   VPOPCNTD ymm, ymm
/// </summary>
public static Vector256<uint> PopCount(Vector256<uint> value)

tannergooding commented 3 years ago

AVX512-F is the "baseline" instruction set and doesn't expose any 128-bit or 256-bit variants; it exposes the 512-bit and mask variants. The 128-bit and 256-bit variants are part of the separate AVX512-VL instruction set (which depends on AVX512-F).

In order to support the encoding correctly, we need to be aware of the full 512-bit state and appropriately save/restore the upper bits across call boundaries among other things.

ArnimSchinz commented 3 years ago

Vector<T> support would also be very nice.

saucecontrol commented 3 years ago

Vector<T> support would also be very nice.

Variable size for Vector<T> already results in unpredictable performance between AVX2 and non-AVX2 hardware due to the cost of cross-lane operations and the larger minimum vector size being useful in fewer places. Auto-extending Vector<T> to 64 bytes on AVX-512 hardware would aggravate the situation.

However, the common API defined for cross-platform vector helpers (#49397) plus Static Abstracts in Interfaces would allow the best of both worlds: shared logic where vector size doesn't matter, plus ISA- or size-specific logic where it does.

ArnimSchinz commented 3 years ago

I like how the usage of Vector<T> makes the code forward compatible and hardware independent. Predictable performance is nice, but just having the best possible performance on every underlying hardware is more important.

tannergooding commented 3 years ago

Predictable performance is nice, but just having the best possible performance on every underlying hardware is more important.

"best possible performance" isn't always the same as using the largest available vector size. It is often the case that larger vectors come with increased costs for small inputs or for handling various checks to see which path needs to be taken.

Support for 512-bits in Vector needs to be considered, profiled, and potentially left as an opt-in AppContext switch to ensure that various apps can choose and use what is right for them.

ArnimSchinz commented 3 years ago

I think "small inputs" are not very common for vectorized operation use cases. If my input is smaller than Vector<T>.Count, I choose another path. But I do not use Vector<T> at all if I do not expect most of the input to be multiple times bigger than the largest available vector size. Many Azure virtual machines already support 512-bit vectors and I think it would be a great addition. I think there is no scenario where I would want my Vector<T> vector to be smaller than the largest available vector size.

tannergooding commented 3 years ago

Vectors get used internally in the framework in many locations, including things like IndexOf for strings/spans.

You can certainly check on Vector<T>.Count, but automatically using 64-byte vectors will change the perf semantics of existing code. That is, code that was vectorized for 16-64 bytes or 32-64 bytes will no longer be vectorized once Vector<T> is implicitly 512-bits.

While it's true that you get the most benefit from vectors with large inputs, there are also many cases where they are beneficial for smaller inputs and where they can accelerate common inputs. Take, for example, names, where you will commonly have 10-32 characters (20-64 bytes). AVX-512 will not commonly help here, but 128-bit vectors can (as you can drastically reduce the number of comparisons needed).

There are many more examples where this comes up in real world code and so how 512-bit support is exposed needs to be considered, including how users either opt-in or opt-out of getting implicit vectors of that size.

twest820 commented 3 years ago

I think "small inputs" are not very common for vectorized operation use cases.

It really depends on your scenarios, but in general I tend to disagree with this statement. Intel processors, in particular, tend to impose transition latencies and downclocking on wider vectors. There are a lot of detailed cases here depending on the exact hardware, which vector widths you're transitioning between, and so on. But, broadly speaking, quite a few of the compute kernels I've written don't gain enough at 256 bits to offset downclocking on processors where that occurs; they therefore run faster in their 128-bit versions. This is particularly likely for loops that run hundreds of thousands to millions of iterations, completing within the few milliseconds a vector width transition takes while the CPU asks its power supply to raise voltage. And, since upclocking after downclocking is similarly sticky, even if your loop runs faster it may not be a net win once narrower code follows.

One such case I often see is a light compute kernel maxing out DRAM bandwidth at 128 bits. Going wider (or multithreaded) just hits memory access harder and frequently profiles a few percent slower.

If you have a set of compute-dense kernels with enough data to pound the full vector width for, say, a couple of seconds without getting hung up on bottlenecks like AVX lane swaps or cache misses, then yes, there's a decent chance maximizing vector width is beneficial. However, it's not uncommon to see narrower workloads run faster due to better concurrent ALU port utilization even when downclocking isn't an issue. I also have workloads whose data dependencies are such that 128-bit versions accelerate nicely but 256-bit versions of the same loop run more slowly, due to dependencies between different parts of the longer vectors, even without downclocking. Most of those kernels don't express any better when opened up to 512 bits.

These are some of the reasons why my exchanges with Tanner earlier on this issue focused on AVX512VL and not so much on going 512 bits wide. From a development standpoint, typically what I do is enable hot scalar loops for 128-bit dispatch and profile that. If it's not fast enough, then I look at 256 bits and so on. Often what I see is that going to VEX-encoded 128-bit captures most of the benefit, frequently because ymm availability avoids xmm register spilling, and there isn't enough gain from going wider to justify supporting and testing the additional code paths. One thing which prompted me to open this issue, in fact, was reviewing VEX128 disassemblies and recognizing that zmm access via AVX512VL would avoid all the ymm spilling that was bogging those particular loops down.
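The tiered enablement described above tends to take the following shape (a hypothetical dispatch skeleton; the Process* method names are illustrative stubs, not real APIs):

```csharp
using System;
using System.Runtime.Intrinsics.X86;

class DispatchDemo
{
    // Hypothetical dispatch skeleton: start with a 128-bit path, widen only
    // when profiling shows a win on the target hardware.
    static void Process(Span<float> data)
    {
        if (Avx2.IsSupported)
            Process256(data);    // VEX-encoded 256-bit path
        else if (Sse41.IsSupported)
            Process128(data);    // 128-bit path
        else
            ProcessScalar(data); // portable fallback
    }

    // Illustrative stubs; a real kernel would do its work per width here.
    static void Process256(Span<float> data) { for (int i = 0; i < data.Length; i++) data[i] += 1f; }
    static void Process128(Span<float> data) { for (int i = 0; i < data.Length; i++) data[i] += 1f; }
    static void ProcessScalar(Span<float> data) { for (int i = 0; i < data.Length; i++) data[i] += 1f; }

    static void Main()
    {
        var data = new float[8];
        Process(data);
        Console.WriteLine(data[0]); // 1
    }
}
```

Because each branch is guarded by an IsSupported check that the JIT evaluates as a constant, the untaken paths are dropped entirely from the compiled code.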

TL;DR, what Tanner said. :slightly_smiling_face:

ArnimSchinz commented 3 years ago

Thank you for these explanations. The opt-in and opt-out options sound good to me (like setting the minimum/maximum Vector<T>.Count to any of the hardware-supported lengths). In my case I have many numeric operations, like finding the minimum double value among potentially millions of doubles inside a span/array, and I guess large vectors make sense there, but like you said, I would need to profile that. I cannot predict every potential input or usage, but I could test some different scenarios.
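For reference, the minimum-of-a-span scenario mentioned above is commonly written like this (a minimal sketch with an illustrative method name; a production version would also decide how to treat NaN):

```csharp
using System;
using System.Numerics;

class MinDemo
{
    // Sketch: minimum of a double span, vectorized with Vector<double>
    // plus a scalar tail for the remainder.
    static double Min(ReadOnlySpan<double> values)
    {
        if (values.IsEmpty)
            throw new ArgumentException("empty input", nameof(values));

        double min = double.PositiveInfinity;
        int i = 0;
        if (Vector.IsHardwareAccelerated && values.Length >= Vector<double>.Count)
        {
            var vmin = new Vector<double>(double.PositiveInfinity);
            for (; i <= values.Length - Vector<double>.Count; i += Vector<double>.Count)
                vmin = Vector.Min(vmin, new Vector<double>(values.Slice(i)));
            for (int j = 0; j < Vector<double>.Count; j++)
                min = Math.Min(min, vmin[j]); // reduce lanes to a scalar
        }
        for (; i < values.Length; i++)
            min = Math.Min(min, values[i]); // scalar tail
        return min;
    }

    static void Main()
    {
        double[] data = { 3.5, -1.25, 7.0, 0.5, -1.5, 2.25, 9.0, 4.0, -0.75 };
        Console.WriteLine(Min(data)); // -1.5
    }
}
```

The same code runs at whatever width Vector<T> happens to be, which is exactly why the implicit width matters for profiling.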

HighPerfDotNet commented 2 years ago

Is this likely to make .NET 7?

The reason I ask is because recently leaked info pretty much confirmed that Zen4 (at least its server variant) will support AVX-512 and Intel's next chip should have lower frequency penalties for AVX-512 usage, which hopefully will drive adoption.

tannergooding commented 2 years ago

Is this likely to make .NET 7?

This is not currently planned for .NET 7. The amount of work required to support AVX-512 is absolutely massive and nearly triples the current API surface area (from ~1500 to ~4500 APIs). It is a very large work item and is restricted to highly specialized scenarios and hardware, making an already "niche" feature (in that the majority of .NET customers don't use it directly; they just indirectly benefit from it) even more "niche" (because an even smaller customer base will use and/or benefit from it, simply by not having hardware that can utilize it). There are also several open questions about things like reliably testing in CI, extending Vector<T> to support or benefit from 512-bit vectors, and more.

This is something we will likely do eventually but there are also currently several higher priority work items that take precedence. The main blocking things here are:

  1. Extending the ABI to understand/support the saving/restoration of the upper 256-bits of vector registers (this is required to support the EVEX encoding)
  2. Extending the ABI to understand/support the new "masking" registers
  3. Extending the register allocator to support the additional 16 SIMD registers available
  4. Extending the register allocator to support the "masking" registers
  5. Extending the emitter to support the EVEX encoding

Once all of that is done, the remaining work is a lot simpler and is largely driven by adding new lines to hwintrinsiclistxarch.h, which defines a table of Intrinsic IDs and the associated metadata (such as method name, vector size, parameter count, and instruction to emit).

HighPerfDotNet commented 2 years ago

Thank you for the explanation. Let's hope next-gen CPUs will drive adoption and build critical mass soon.

eladmarg commented 2 years ago

@tannergooding thanks for the detailed answer.

I do believe there will be a benefit in the long run, after AVX-512 becomes mainstream as the technology continues improving.

@lemire gained a 40% performance improvement for JSON parsing thanks to AVX-512.

So currently there are other priorities; hopefully this will catch up in .NET 8.

lemire commented 2 years ago

Like ARM's SVE and SVE2, AVX-512 is not merely 'same as before but with wider registers'. It requires extensive work at the software level because it is a very different paradigm. On the plus side, recent Intel processors (Ice Lake and Tiger Lake) have good AVX-512 support, without downclocking and with highly useful instructions. And the good results are there: we parse JSON at record-breaking speeds. AVX-512 allows you to do base64 encoding/decoding at the speed of a memory copy. Crypto, machine learning, compression...

@tannergooding is of course correct that it is not likely that most programmers will directly benefit from AVX-512 in the short term, but I would argue that many more programmers would benefit indirectly if AVX-512 was used in core libraries. E.g., we are currently working on how to use AVX-512 for processing unicode.

On the downside, AMD is unlikely to widely support AVX-512 in the near future, and Intel is still shipping laptop processors without AVX-512 support...

HighPerfDotNet commented 2 years ago

CC: @tannergooding

AMD confirmed that consumer level Zen 4 (due out in fall) will support AVX-512, source:

https://videocardz.com/newz/amd-confirms-ryzen-7000-is-up-to-16-cores-and-170w-tdp-rdna2-integrated-gpu-a-standard-ai-acceleration-based-on-avx512

So that means even consumer-level chips will support it, meaning it will also be in the server Genoa chip due Q4 this year. This also means Intel will have to enable AVX-512 in their consumer chips too.

Perhaps implementing a C-style asm keyword in C# could be an alternative to supporting specific intrinsics...

tannergooding commented 2 years ago

AMD confirmed that consumer level Zen 4 (due out in fall) will support AVX-512, source:

There will need to be a more definitive source, preferably directly from the AMD website or developer docs.

This also means Intel will have to enable AVX-512 in their consumer chips too.

It does not mean or imply that. Different hardware manufacturers may have competing design goals or ideologies about where it makes sense to expose different ISAs.

Historically they have not always aligned or agreed and it is incorrect to speculate here.

Perhaps implementing C style ASM keyword in C# could be alternative to supporting specific intrinsics...

The amount of work required to support such a feature is greater than simply adding the direct hardware intrinsic support for AVX-512 instructions.

It requires all the same JIT changes around handling EVEX, the additional 16 registers, having some TYP_SIMD64, and the op-mask registers. Plus it would also require language support, a full fledged assembly lexer/parser, and more.


More generally, AVX-512 support will likely happen eventually. But even partial support is a non-trivial amount of work, particularly in the register allocator, debugger, and in the context save/restore logic.

tannergooding commented 2 years ago

The work required here can effectively be broken down into a few categories:

The first step is to update the VM to query CPUID and track the available ISAs. Then the basis of any additional work is adding support for EVEX encoded instructions but limiting it only to AVX-512VL with no masking support and no support for XMM16-XMM31. This would allow access to new 128-bit and 256-bit instructions but not access to any of the more complex functionality. It would be akin to exposing some new AVX3 ISA in complexity.

Then there are three more complex work items that could be done in any order.

  1. Extend the register support to the additional 16 registers AVX-512 makes available. These are XMM16-XMM31 and would require work in the thread, callee, and caller save/restore contexts, work in the debugger, and some minimal work in the register allocator to indicate they are available but only on 64-bit and only when AVX-512 is supported.
  2. Extend the register support to the upper 256 bits of the registers. This involves exposing and integrating a TYP_SIMD64 throughout the JIT, as well as work in the thread, callee, and caller save/restore contexts, and work in the debugger.
  3. Extend the register support to the KMASK registers. This would require more significant work in the register allocator and potentially the JIT to support the "entirely new" concept and registers.

There is then library work required to expose and support Vector512<T> and the various AVX-512 ISAs. This work could be done incrementally alongside the other work.
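Assuming the eventual Vector512<T> surface mirrors the existing Vector128<T>/Vector256<T> types (a hedged sketch; this type had not shipped at the time of this thread), the library work would enable usage like:

```csharp
using System;
using System.Runtime.Intrinsics;

class Vector512Sketch
{
    static void Main()
    {
        // Hedged sketch: assumes Vector512<T> mirrors the existing
        // Vector128<T>/Vector256<T> surface (Create, Add, GetElement).
        if (Vector512.IsHardwareAccelerated)
        {
            Vector512<int> a = Vector512.Create(1);   // broadcast 1 to all 16 lanes
            Vector512<int> b = Vector512.Create(2);
            Vector512<int> sum = Vector512.Add(a, b);
            Console.WriteLine(sum.GetElement(0));     // 3
        }
        else
        {
            Console.WriteLine("512-bit vectors are not accelerated on this machine");
        }
    }
}
```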


Conceivably the VM and basic EVEX encoding work are "any time". It would require review from the JIT team but is not complex enough that it would be impractical to consider incrementally. The same goes for any library work exposed around this, with an annotation that API review would likely want to see more concrete numbers on the number of APIs exposed, where they are distributed, etc.

The latter three work items would touch larger amounts of JIT code however and could only be worked on if the JIT team knows they have the time and resources to review and ensure everything is working as expected. For some of these more complex work items it may even be desirable to have a small design doc laying out how its expected to work, particularly for KMASK registers.

tannergooding commented 2 years ago

Also noting that the JIT team has the final say on when feature work can go in, even for the items I called out as seemingly anytime.

HighPerfDotNet commented 2 years ago

There will need to be a more definitive source, preferably directly from the AMD website or developer docs.

Well, we'll know in a few months, but right now it seems that it's 99.9% happening, even in cheap mass-produced consumer chips.

Historically they have not always aligned or agreed and it is incorrect to speculate here.

It seems inevitable now, since Intel can't afford AMD getting a massive speed-up using an Intel-developed instruction set; there are also lots of Intel servers out there with AVX-512.

tannergooding commented 2 years ago

It is pretty definitive since it comes from an interview given by AMD's Director of Technical Marketing, Robert Hallock; you can view it here:

They explicitly state "not AMD proprietary, that's all I can say" and then "I can't hear you" in response to "can we say anything like AVX-512, or anything like that?". That does not sound like confirmation to me; rather, it is explicitly not disclosing additional details at this time, and we will get additional details in the future.

For example, this could simply be AVX-VNNI and/or AMX, which are both explicitly non-AVX-512 ISAs that support AI/ML scenarios.

We will ultimately have to wait and see what official documentation is provided in the future.

It seems inevitable now since Intel can't afford AMD getting massive speed up using Intel developed instruction set, there are also lots of Intel servers out there with AVX-512.

It continues to be incorrect to speculate here or make presumptions about what the hardware vendors can or cannot do based on what other hardware vendors are doing.

As I mentioned AVX-512 will likely come in due time but there needs to be sufficient justification for the non-trivial amount of work. Another hardware vendor getting support might give more weight to that justification but there needs to be definitive response and documentation from said vendors covering exactly what ISAs will be supported (AVX-512 is a large foundational ISA and roughly 15 other sub-ISAs), where that support will exist, etc.

HighPerfDotNet commented 2 years ago

Official from AMD today: support for AVX-512 in Zen 4.

(slide: "AMD FAD 2022 — Zen 4 improvements on 5 nm")

Source: https://www.servethehome.com/amd-technology-roadmap-from-amd-financial-analyst-day-2022/

tannergooding commented 2 years ago

Said slides should be available from an official source about 4 hours after the event ends: https://www.amd.com/en/press-releases/2022-06-02-amd-to-host-financial-analyst-day-june-9-2022

I'll take a closer look when that happens, but it doesn't look like it goes any more in depth into what ISAs are covered vs not.

tannergooding commented 2 years ago

The raw slides aren't available, but it's covered by the recorded and publicly available webcast: https://ir.amd.com/news-events/financial-analyst-day

Skip to 45:35 for the relevant portion and slide.

Edit: Raw slides are under the Technology Leadership link.

lemire commented 2 years ago

AMD is vague. He did refer to HPC, which suggests it might be more than BF16 and VNNI.

tannergooding commented 2 years ago

Yes, we'll need to continue waiting for more details. AVX-512, per specification, requires at least AVX-512F, which includes the EVEX encoding, the additional 16 registers, 512-bit register support, and the kmask register support.

The "ideal" scenario is that this also includes AVX-512VL, so that the EVEX encoding, the additional 16 registers, and the masking support are available to all 128-bit and 256-bit instructions. This would allow better optimizations for existing code paths, use of the new instructions including full-width permute and vpternlog, and wouldn't light up only for large inputs and HPC scenarios.
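To illustrate why vpternlog is attractive (a hedged sketch assuming the Avx512F.TernaryLogic API shape that .NET eventually shipped; it had not shipped at the time of this thread): control byte 0xCA encodes the bitwise select (mask & b) | (~mask & c), so an and/andnot/or sequence collapses into a single instruction.

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

class VpternDemo
{
    static void Main()
    {
        if (!Avx512F.IsSupported)
        {
            Console.WriteLine("AVX-512F is not supported on this machine");
            return;
        }

        // Hedged sketch: assumes Avx512F.TernaryLogic(a, b, c, control).
        // Control 0xCA selects bits: (mask & b) | (~mask & c), i.e. one
        // vpternlogd instead of and + andnot + or.
        Vector512<uint> mask = Vector512.Create(0xFFFF0000u);
        Vector512<uint> b = Vector512.Create(0xAAAAAAAAu);
        Vector512<uint> c = Vector512.Create(0x55555555u);
        Vector512<uint> blended = Avx512F.TernaryLogic(mask, b, c, 0xCA);
        Console.WriteLine($"0x{blended.GetElement(0):X8}"); // 0xAAAA5555
    }
}
```

The upper 16 bits of each lane come from b and the lower 16 from c, matching the mask.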

However, having official confirmation that the ISA (even if only AVX-512F) is now going to be cross-vendor does help justify work done towards supporting this.

HighPerfDotNet commented 2 years ago

I expect to see a CPU-Z screenshot of Zen 4 with details of the supported ISAs soon, but given what has been said so far it does FEEL to me that the support will be pretty extensive (the Genoa version needs it, and the chiplets are the same as consumer anyway), so that should include the new AVX-512 VNNI.

Symbai commented 1 year ago

(screenshot: CPU-Z feature flags for Zen 4)

It only says AVX-512F, but according to Wikipedia it also supports VL. The MSVC compiler has supported AVX-512 since 2020. I really hope to see this in .NET 8.

filipnavara commented 1 year ago

Zen 4's AVX-512 flavor is all of Ice Lake plus AVX512-BF16.

So Zen 4 has: AVX512-F, AVX512-CD, AVX512-VL, AVX512-BW, AVX512-DQ, AVX512-IFMA, AVX512-VBMI, AVX512-VNNI, AVX512-BF16, AVX512-VPOPCNTDQ, AVX512-VBMI2, AVX512-VPCLMULQDQ, AVX512-BITALG, AVX512-GFNI, AVX512-VAES.

The ones it is missing are: the Xeon Phi ones (AVX512-PF, AVX512-ER, AVX512-4FMAPS, AVX512-4VNNIW), AVX512-VP2INTERSECT (from Tiger Lake), and AVX512-FP16 (from Sapphire Rapids and AVX-512-enabled Alder Lake).

Source: https://www.mersenneforum.org/showthread.php?p=614191

lemire commented 1 year ago

The gist of it is that @HighPerfDotNet was right. AMD Zen 4 has full AVX-512 support (full in the sense that it is competitive with the best Intel offerings).

I submit to you that this makes supporting AVX-512 much more compelling.

tannergooding commented 1 year ago

We're already working on adding AVX-512 support in .NET 8, a few foundational PRs have already been merged ;)