dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.

https://docs.microsoft.com/dotnet/core/

MIT License

14.98k stars 4.66k forks source link

Arm64: Add SVE/SVE2 support in .NET 9 #93095

Closed kunalspathak closed 1 month ago

kunalspathak commented 11 months ago

Until Armv8, NEON architecture enabled users to write vectorized code using SIMD instructions. The vector length for NEON instructions remains fixed at 128 bits. To take into account High Performance Computing applications, newer versions of Arm like v9 has developed Scalable Vector Extension (SVE). SVE is a new programming model that contains SIMD instructions operating on flexible vector length, ranging from 128 bits to 2048 bits. SVE2 extends SVE to enable domains like Computer vision, multimedia, etc. More details about SVE can be found on Arm's website. In .NET 9, we want to start the foundational work in .NET libraries and RyuJIT to add support for SVE2 feature.

[ ] Design: The instructions present in SVE2 needs to be exposed through .NET libraries to our customers. However, we want to make sure that the customers don't have to rewrite their code to consume SVE2 functionality and the experience of using SVE2 features should be seamless. Currently, .NET has Vector<T> that represents flexible vector length depending on the underlying hardware and the idea is to expose SVE2 "flexible vector length" functionality using Vector<T>. We need to think about the common instructions and validate if they can be represented by APIs that just takes Vector<T> as parameter. Additionally, SVE2 introduces "predicate register" to mark vector lanes active/inactive. We do not want to expose this concept as well to our consumer through .NET libraries because of reason mentioned above. Hence, for every API, need to come up with a pseudo code of how the API should be implemented internally in JIT such that the "predicate register" concept is created and consumed inside JIT implicitly. There is a good discussion that happened in the past about the API proposal in https://github.com/dotnet/runtime/issues/88140.
[ ] Implement APIs in System.Runtime.Intrinsic.Arm.SVE2: Once design is finalized, need to add all the APIs in a new SVE2 class under System.Runtime.Intrinsic.Arm. They need to be plumbed through the JIT (by transforming tree nodes to represent "predicate register" concept, if needed). all the way to generating code. https://github.com/dotnet/runtime/issues/99957
[ ] If possible, automatically generate the test template for various APIs
[ ] Update Antigen to pick up the new SVE APIs
[ ] SVE hardware availability in CI
[ ] Backend support: Regardless of API design, we need to add instruction encoding of all the SVE2 instructions that we are planning to support. Here is a rough list of things that needs to happen to add the new instructions. Covered in https://github.com/dotnet/runtime/issues/94549.
- [ ] Add new entries in hwinstrinsiclistarm64.h for AdvSimd.SVE2 APIs
- [x] Depending on the API call the right emitIns_*() code
- [x] emitIns_*() methods add the new instructions support
- [x] emitfmtsarm64.h - needs to add new instruction formatting https://github.com/dotnet/runtime/pull/94285
- [x] instrsarm64.h - needs to add new instruction to instruction formatting mapping https://github.com/dotnet/runtime/pull/94285
- [x] Add the encoding for new instructions in emitOutputInstr()
- [x] Add new Zx registers and predicate registers https://github.com/dotnet/runtime/issues/99658
- [ ] Add JitStressRegs mode to always allocate high Z/P register unless the candidate specifically says to use low registers.
- [ ] Make sure TP regression is minimum
- [x] Testing if the encoding matches the display ins using windbg/msvc/cordistools
- [x] Add entries for the new instructions in genArm64EmitterUnitTests()
- [ ] Fix the formatEncode* data
- [ ] Refactor and consolidate the common code/asserts, convert some methods lookup like insSveIsLslN / insSveGetLslOrModN into table drive loop up.
- [ ] Possibly frame layout code update such that the SVE_simd and SVE_mask are towards the end of the frame in order for offset in stack to take into consider the variable VL.
[x] Automation for encoding: If we see above list, it is very time consuming to add each instruction's encoding in the code base. There are 800+ instructions and considering 30 mins for each instruction, it will take 400+ human hours to add and validate the encoding. Hence, there is an experiment going on to generate the encoding data and the C++ code around it automatically. The understanding is that the C++ code generated will not be accurate and manual inspection will still be required. Edit: A working prototype of the tool that generates 2 C++ files for encoding is in https://github.com/dotnet/runtime/pull/94285.
- [x] Produce a json/xml file that contains all the encoding data of all the instructions. A good resource from which this can be extracted is here.
- [x] Recreate instrsarm64.h that contains the existing as well as the new formats along with encoding binary representation and hexadecimal representation.
- [x] If there are new instruction formats, add their entries in emitfmtsarm64.h file.
- [x] Based on the mmmm, dddd, etc. in the binary representation of encoding, have to tool write a logic to produce the instruction bytes. In other words, this will generate code that can be pasted in emitOutputInstr() function.
- [x] Depending on the number of encodings in each group of instructions, they can be tied to appropriate INST* like INST9 or INST8, etc. and regenerated in sorted order. Note that if we need to regenerate existing files like instrsarm64.h and emitfmtsarm64.h , existing instruction's encoding also needs to be generated by the tool.
[ ] MOVPRFX instruction support: GCStress, sanity check. Also, when emitting every single instruction, need to make sure that if previous instruction was movprfx, we verify it follows all the rules with regards to destination registers, size, etc. https://github.com/dotnet/runtime/issues/105514
[ ] Before consuming the new APIs in libraries, make sure that Mono has support for it or else SVE.IsSupported should return false for it.
[ ] On Windows, make sure that new SVE registers as well as predicate registers are available during debugging as well as Vector<T> shows the right number of elements.
[ ] ABI related changes for SVE/predicate registers in VM
[ ] Support breakpoint patching for SVE instructions
- ICorDebug needs to be updated to know which SVE instructions have relative read/write/jumps - walker.cpp is specifically the place that needs to be updated
- Similar work has been done for AVX512 - https://github.com/dotnet/runtime/pull/89705

ghost commented 11 months ago

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch See info in area-owners.md if you want to be subscribed.

Issue Details

## Overview As we did in the past for [.NET 5](https://github.com/dotnet/runtime/issues/35853), [.NET 7](https://github.com/dotnet/runtime/issues/64820) and [.NET 8](https://github.com/dotnet/runtime/issues/77010), we would like to continue improving Arm64 in .NET 9 as well. Here are the top-level themes we plan to address. Some of the issues are from the past releases that we did not get time to work upon. while others are about adding instructions of newer arm versions or exposing the Arm functionality to the .NET API level. ### SVE2 support Until Armv8, NEON architecture enabled users to write vectorized code using SIMD instructions. The vector length for NEON instructions remains fixed at 128 bits. To take into account High Performance Computing applications, newer versions of Arm like v9 has developed Scalable Vector Extension (SVE). SVE is a new programming model that contains SIMD instructions operating on flexible vector length, ranging from 128 bits to 2048 bits. SVE2 extends SVE to enable domains like Computer vision, multimedia, etc. More details about SVE can be found on [Arm's website](https://developer.arm.com/documentation/102340/0100/Introducing-SVE2). In .NET 9, we want to start the foundational work in .NET libraries and RyuJIT to add support for SVE2 feature. - [ ] **Design**: The instructions present in SVE2 needs to be exposed through .NET libraries to our customers. However, we want to make sure that the customers don't have to rewrite their code to consume SVE2 functionality and the experience of using SVE2 features should be seamless. Currently, .NET has [`Vector`](https://learn.microsoft.com/en-us/dotnet/api/system.numerics.vector-1?view=net-7.0) that represents flexible vector length depending on the underlying hardware and the idea is to expose SVE2 "flexible vector length" functionality using `Vector`. We need to think about the common instructions and validate if they can be represented by APIs that just takes `Vector` as parameter. Additionally, SVE2 introduces "predicate register" to mark vector lanes active/inactive. We do not want to expose this concept as well to our consumer through .NET libraries because of reason mentioned above. Hence, for every API, need to come up with a pseudo code of how the API should be implemented internally in JIT such that the "predicate register" concept is created and consumed inside JIT implicitly. There is a good discussion that happened in the past about the API proposal in https://github.com/dotnet/runtime/issues/88140. - [ ] **Implement APIs in `System.Runtime.Intrinsic.Arm.SVE2`**: Once design is finalized, need to add all the APIs in a new `SVE2` class under `System.Runtime.Intrinsic.Arm`. They need to be plumbed through the JIT (by transforming tree nodes to represent "predicate register" concept, if needed). all the way to generating code. - [ ] **Backend support**: Regardless of API design, we need to add instruction encoding of all the SVE2 instructions that we are planning to support. Here is a rough list of things that needs to happen to add the new instructions: - [ ] Add new entries in hwinstrinsiclistarm64.h for AdvSimd.SVE2 APIs - [ ] Depending on the API call the right emitIns_*() code - [ ] `emitIns_*()` methods add the new instructions support - [ ] `emitfmtsarm64.h` - needs to add new instruction formatting - [ ] `instrsarm64.h` - needs to add new instruction to instruction formatting mapping - [ ] Add the encoding for new instructions in `emitOutputInstr()` - [ ] Add new `Zx` registers and predicate registers - [ ] Make sure TP regression is minimum - [ ] Testing if the encoding matches the display ins using windbg/msvc/cordistools - [ ] Add entries for the new instructions in `genArm64EmitterUnitTests()` - [ ] Fix the `formatEncode*` data - [ ] **Automation for encoding**: If we see above list, it is very time consuming to add each instruction's encoding in the code base. There are 800+ instructions and considering 30 mins for each instruction, it will take 400+ human hours to add and validate the encoding. Hence, there is an experiment going on to generate the encoding data and the C++ code around it automatically. The understanding is that the C++ code generated will not be accurate and manual inspection will still be required. - [ ] Produce a json/xml file that contains all the encoding data of all the instructions. A good resource from which this can be extracted is [here](https://docsmirror.github.io/A64/2023-06/). - [ ] Recreate `instrsarm64.h` that contains the existing as well as the new formats along with encoding binary representation and hexadecimal representation. - [ ] If there are new instruction formats, add their entries in `emitfmtsarm64.h` file. - [ ] Based on the `mmmm`, `dddd`, etc. in the binary representation of encoding, have to tool write a logic to produce the instruction bytes. In other words, this will generate code that can be pasted in `emitOutputInstr()` function. - [ ] Depending on the number of encodings in each group of instructions, they can be tied to appropriate `INST*` like `INST9` or `INST8`, etc. and regenerated in sorted order. Note that if we need to regenerate existing files like `instrsarm64.h` and `emitfmtsarm64.h` , existing instruction's encoding also needs to be generated by the tool. ### New instructions - [ ] https://github.com/dotnet/runtime/issues/84510 - [ ] Explore new instructions added in Armv8.3 ~ Armv9 and see if we can use them in JIT - [ ] Start using Post increment addressing mode in JIT wherever applicable. ### Performance improvements - [ ] https://github.com/dotnet/runtime/issues/68028 - [ ] https://github.com/dotnet/runtime/issues/10444 - [ ] https://github.com/dotnet/runtime/issues/84328 - [ ] Consume LoadVector/StoreVector in .NET libraries ### Stretch goals - [ ] Consume SVE2 APIs in .NET libraries - [ ] https://github.com/dotnet/runtime/issues/77916 - [ ] Experiment how much TP impact we will see if enabled pointer authentication for coreclr. Depending on that, decide if it should be enabled for JIT.

Author:	kunalspathak
Assignees:	-
Labels:	`area-CodeGen-coreclr`
Milestone:	-

kunalspathak commented 11 months ago

cc: @a74nh @TamarChristinaArm @SwapnilGaikwad

jkotas commented 11 months ago

Currently, .NET has Vector that represents flexible vector length depending on the underlying hardware and the idea is to expose SVE2 "flexible vector length" functionality using Vector.

We should think about how this is going to work with the new SVE streaming mode. Do we expect to support the SVE streaming mode in .NET eventually? If yes, how it is going to affect the design?

tannergooding commented 11 months ago

Do we expect to support the SVE streaming mode in .NET eventually

I'd presume it would be desirable to support. Giving users the power of the per hardware lightup is a significant advantage to many backing frameworks that power .NET applications.

If yes, how it is going to affect the design?

The streaming instructions are more abstract and I don't think the design for C++ has been finalized yet either. I at least don't see any of the new instructions under https://developer.arm.com/architectures/instruction-sets/intrinsics

I expect they will be a completely separate consideration for how they're supported as compared to the more standard SIMD processing instructions and we'll need to cross that bridge when we come to it.

This is particularly given how it allows the effective SVE length to be changed and requires explicit start/stop instructions. And so I imagine the runtime itself will require some complex changes to work with the concept and ensure it doesn't impact the ABI, field usage, inlining boundaries, etc.

jkotas commented 11 months ago

The streaming instructions are more abstract and I don't think the design for C++ has been finalized yet either. I at least don't see any of the new instructions under

My understanding that the streaming mode uses the same instructions as regular SVE mode (subset of them), except that the instructions operate on larger vector sizes. If we were to use the current Vector<T> in both streaming and non-streaming mode, the size of Vector<T> would need to change between these two modes somehow. I do not see how we would pull this off.

This observation made me think that reusing Vector<T> for SVE registers may be a poor choice. Would it make more sense to have a new special type that can be only use on stack for locals and arguments and that does not have a constant size at runtime?

tannergooding commented 11 months ago

Would it make more sense to have a new special type that can be only use on stack for locals and arguments and that does not have a constant size at runtime?

That is effectively how C describes its SVE support. They are similar to "incomplete" types in many aspects, but with a few less restrictions. They can't be used with atomics, arrays, sizeof, pointer arithmetic, fields, captures, etc. They can still be used with ref T, T*, templates, parameters/return types, etc.

At least for non-streaming, a lot of that is only restrictive because C/C++ has existing requirements for things like sizeof(T) to be constant and is namely AOT. There's not actually anything "preventing" most of it from working, it would just require the relevant cnt* instruction to be executed and included as part of the size computation dynamically (and can be constant in a JIT environment).

If we did define a new type, I think we'd functionally be defining a ref struct ScalableVector + writing an analyzer to help enforce it doesn't get used for some other edge cases. I don't think that a platform specific vector type would warrant a proper language feature or language level enforcement for C#/F#/etc.

My understanding that the streaming mode uses the same instructions as regular SVE mode (subset of them)

This is my understanding as well.

For reference, the Procedure Call Standard` is here: https://github.com/ARM-software/abi-aa/releases/download/2023Q1/aapcs64.pdf

While the Vector Function ABI is here: https://github.com/ARM-software/abi-aa/releases/download/2023Q1/vfabia64.pdf

Non-streaming SVE has clear and concrete benefits and can easily integrate with our existing types and concepts. It doesn't really require any significant changes to make light up (no more than AVX-512 did, anyways). The biggest open question for it is how exactly to support it for AOT. But we also have a couple solutions to support both the VLA (vector length agnostic) and VLS (vector length specific) concepts. We can also section that work if required and only support VLS for AOT in the first iteration (much as is already done for Vector<T> today). -- Any work to improve Vector<T> for AOT or R2R scenarios then benefits all platforms and can likely be shared.

Streaming SVE still has some concrete benefits, however, it works significantly differently from anything we've exposed so far. There are simply too many open ended questions around it, how the user will start/stop the functionality, how it will impact the managed/native ABI boundary, inlining, callee/caller save state, etc. Trying to solve all of these would likely push non-streaming SVE support out of .NET 9 and potentially put us in a place where non-streaming SVE is functionally worse off from a general usability perspective. So I think it's ultimately better to, for now, treat it like it doesn't exist and say its UB if a user enables it (much as we define for other similar user togglable CPU state such as: floating-point exceptions, changing the default rounding mode, flushing denormals, enabling strict alignment validation, etc).

Then, when we're ready to take a deeper look at the feature, we can determine whether using Vector<T> + an analyzer is feasible for it or whether we truly need a new type. We can also, at the same time, answer the other questions about how it works at the ABI level for managed code, including GC pause points. -- I imagine, based on the general blog posts and other areas around the topic, we likely don't want to look at it until we have the ability to generate multiple versions of a function (one streaming aware, one streaming unaware), to efficiently handle the various dynamic callee/caller save state support around the ZA storage area, to do relevant state tracking for streaming enablement, etc

jkotas commented 11 months ago

Non-streaming SVE has clear and concrete benefits and can easily integrate with our existing types and concepts. It doesn't really require any significant changes to make light up (no more than AVX-512 did, anyways). The biggest open question for it is how exactly to support it for AOT.

Good point about AOT support for non-streaming SVE. How are we going to do that with Vector<T>? I do not see any good options.

tannergooding commented 11 months ago

How are we going to do that with Vector? I do not see any good options.

It ultimately depends on the code the user writes and how much we want to invest.

For recommended coding patterns, where the user treats Vector<T> and Vector64/128/256/512<T> like a ref struct, then it works much as it does in C/C++ and we can trivially generate VLA (vector length agnostic) code. The ABI already exists to support passing any number of Vector<T> arguments, declaring Vector<T> locals, and other scenarios; so we just need to support it as it is supported in native.

If the user deviates from this and starts using Vector<T> as the field of a struct, declares arrays of it, or other scenarios that depend on the size, then it gets a little more involved. However, we can still generate VLA code by making some computations dynamic. For example, sizeof(Vector<int>) functionally becomes cntd * 4. Since the packing of Vector<T> is well-defined, regardless of size, we can then use this to make offset calculations or other sizeof calculations work just as well and also for cheap.

We also have the option of generation VLS (vector length specific) code. This functionally means we can say a given method only support Vector<T> if it is 128-bits. This is basically how Vector<T> works on x64 when you say it should be 32-bytes, after all. If we ever get to the point of being able to generate multiple versions of a function, we could support generating both 128-bit and 256-bit versions in the future, or however we need based on the sizes we or the user wants to support. -- In the worst case, this simply fails to launch the app if SVE support exists and the size is larger than appropriate, but we also have the option of explicitly using predication if the actual size is larger. In practice, most SVE hardware today remains 128-bit and only a couple exceptions are larger (a 512-bit super computer and a 256-bit implementation for Graviton3). We can likewise guard paths based on some if (Sve.IsSupported and sizeof(Vector<T>) == ...) check if that's desirable. It really just comes down to how much effort we want to spend supporting the usage of Vector<T> in already non-standard scenarios (noting that support can be incremental over time as well).

jkotas commented 11 months ago

It ultimately depends on the code the user writes and how much we want to invest.

The design that we choose has a large influence over the cost required to support it.

If we choose a design that mirrors the C/C++ design, the cost of supporting SVE for AOT and for streaming mode is going to be very low.

If we choose the current proposed design that uses existing Vector<T> type, the cost of properly supporting SVE for AOT and for streaming mode seems to be very large, close to prohibitive. We should have a discussion about this and explicitly decide whether we like this long-term outcome.

For recommended coding patterns, where the user treats Vector and Vector64/128/256/512 like a ref struct, then it works much as it does in C/C++ and we can trivially generate VLA (vector length agnostic) code.

It is not unusual to see the Vector types used as regular static and instance fields. The design needs to support all situations where Vector* types can be used today. It is ok if it is slower, but it needs to work. Otherwise, it would be a major breaking change.

If the user deviates from this and starts using Vector as the field of a struct, declares arrays of it, or other scenarios that depend on the size, then it gets a little more involved. However, we can still generate VLA code by making some computations dynamic.

In the limit, this means creating a type loader that computes field layout at runtime and teaching codegen to use dynamic sizes and offsets produced by the type loader. We have done that in .NET Native for UWP, it was very complicated and produced poor results. I do not think we would ever want to do that again for form-factors without a JIT.

In practice, most SVE hardware today remains 128-bit and only a couple exceptions are larger (a 512-bit super computer and a 256-bit implementation for Graviton3).

Yes, SVE is a nascent technology. It is reasonable to expect that there will implementations that take advantage of the full range of allowed lengths, with both streaming and non-streaming modes. We should not take shortcuts based on what is available today.

tannergooding commented 11 months ago

The TL;DR is I think the evidence of reusing Vector<T> and it's benefits to the general ecosystem significantly outweigh any drawbacks. While I think a new type has significantly more drawbacks and will hinder the adoption and usability of SVE.

The design that we choose has a large influence over the cost required to support it.

I agree. It also dictates the ease at which it can be integrated into existing SIMD algorithms, used in shared coding patterns, light up implicitly around non-SVE based code, and the cost that is moved out of the runtime and onto other tools such as analyzers or the language, etc.

If we choose a design that mirrors the C/C++ design, the cost of supporting SVE for AOT and for streaming mode is going to be very low.

Possibly, but in turn I believe the cost of supporting the feature in general and the cost of integrating it into the rest of the ecosystem is going to be significantly higher.

A net new type, with the level of restrictions being discussed, requires the VM and JIT to block and fail for any number of the problematic patterns. It likely also requires a complex analyzer or language level feature (it is the latter in C/C++) to direct users towards "the right thing", since the wrong thing will cause true failures at runtime. That then extends to considerations beyond C# and impacts other languages where intrinsics can be used (such as F#) where they then need to do the same/similar work as well.

A net new type means significant increase to the existing API surface area and less integration with existing features or algorithms. Given the restrictions it has, it would be much like ref struct and wouldn't be usable with interfaces or generics. It would require a net new code path to be added rather than allowing the "core" of an implementation to be shared and only specializing where absolutely required.

A net new type means that it is harder to implicitly light up existing code paths. It means that a computer with 512-bit SVE might not be able to successfully use Vector512<T>, even though there is logically nothing preventing that in VLS mode (which is defined for C/C++) or for a JIT environment. It would likewise limit the ability for SVE to be used alongside AdvSimd and to allow it to be used to opportunistically improve codegen in those scenarios.

All of this raises the complexity bar significantly higher, introduces more risk towards being able to successfully ship the feature, and reduces the chance of adoption. Particularly if it was a language feature, it would likewise take time and effort away from other more important features which would be more broadly used and applicable.

On the other hand, reusing Vector<T> comes with almost none of these problems. It keeps alive an existing type, doesn't require complex language or analyzer support, can implicitly light up on existing vectorized code, and can easily be integrated to existing algorithms/patterns. The main downside is that for AOT where the user wants to support agnostic sizes, the VM will need some additional work to support the fact that types are dynamically sized. This same problem then extends to the JIT under streaming mode.

Yes, SVE is a nascent technology. It is reasonable to expect that there will implementations that take advantage of the full range of allowed lengths, with both streaming and non-streaming modes. We should not take shortcuts based on what is available today.

I agree that supporting the full range of lengths is desirable. I disagree that only supporting a subset today is taking a shortcut.

We have up to 2 sizes (128 and 256) that "need" support today because they have real world hardware that .NET will likely run on. We then have 1 additional size (512), with real hardware today, that could theoretically be supported. But that is only if we expect to run on a super computer. We then have 2 more sizes (1024 and 2048) that could theoretically be supported in the next 3 years, but which is incredibly unlikely to be supported in the .NET 9 timeframe.

Finally, we then have 11 types (other multiples of 128) which are technically supported by SVE, but which are disallowed by SVE2. Such support doesn't ever need to be provided since it was optional and I'm unaware of any hardware that actually shipped such support. The change in SVE2 has effectively deprecated it as well, which decreases value further.

It is then, in my mind, completely reasonable to prioritize the existing support in the first iteration of the implementation and to limit any apps produced for it to those sizes. After all, .NET 9 only needs to consider hardware that will reasonably be targeted in the next 30 months (12 months till ship + 18 months of support), including Android/iOS. Limiting the feature to just support 128-bit and 256-bit (for the one piece of hardware that currently supports that) should then be completely fine and allows us to spread the more involved work out over future releases when and where it becomes relevant.

If we choose the current proposed design that uses existing Vector type, the cost of properly supporting SVE for AOT and for streaming mode seems to be very large, close to prohibitive

I don't see this as the case given the reasons above.

The base AOT work required is effectively the same regardless. The main difference is with a net new type we can simply choose to have the VM throw a type load exception if an SVE based vector is used "incorrectly", rather than emit the slower codegen required to support Vector<T>. However, many of the restrictions are purely synthetic and logically already must exist to support other facets of the feature. -- In the face of streaming SVE mode, this support is further expanded upon because it allows the size of an SVE based vector to be changed by some external factor (A can switch the mode and then call B which uses SVE) or even for only part of a method.

Even supporting things like locals functionally needs a lot of the same support around offset calculations and the like. For example, taking the pointer to an SVE based vector is still allowed, as is dereferencing it, simply not doing pointer arithmetic on it (that is for svint32_t* x you can do *x, but not x[0] or x[1]). So therefore the following is valid:

#include <arm_sve.h>

extern void M(int* px, svint32_t* py, svint32_t* pz, int* pw);

void N()
{
    int x;
    svint32_t y, z;
    int w;

    M(&x, &y, &z, &w);
}

This code, as can be seen on https://godbolt.org/z/6qdz71G8q, then requires you to effectively dynamically allocate the stack space in the prologue, to create space for y, z, and release it in the epilogue. It then requires some amount of dynamic offset computations (see addvl) to get the relevant addresses of such locals.

The only real "additional" complexity around supporting struct S { svint32_t x; int32_t y; } is that it has sequential layout and so requires a dynamic offset calculation to access y. That being said, the actual code to support a dynamic offset calculation isn't complex (and while streaming mode isn't supported, it is trivial for the JIT and only relegated to AOT). -- The worst case is many different structs like this, where you need to chain multiple dynamic size calculations together. There are ways to minimize this impact, such as putting these locals "last". We can also operate under the, reasonable, assumption that such scenarios will be rare. They're inefficient; we already recommend people don't do them for fixed length vectors and to use alternative patterns instead.

There is likewise nothing requiring we do all the work at once. AOT is more limited than a JIT environment and we correspondingly already have features that don't work end to end or which may be more limited in the former. Vector<T> is already this way today for x64. We can (and likely should) iterate on the design over a couple releases to improve general AOT support where possible. -- It wouldn't be unreasonable for it to work like it does on x64 in the first iteration, which is to require it to be vector length specific (VLS) and to expand that to VLA (vector length agnostic) in the next release, when we have the time to finish that support. This is also completely within the C/C++ confines of the feature, where they similarly define an ABI for both VLS and VLA versions of a function.

Finally, by having this support be around Vector<T>, we not only support the mainline Arm64 scenarios, but we also improve support for x64 simultaneously and then have a type and general support that can likely be extended to other platforms with variable length support.

jkotas commented 11 months ago

A net new type, with the level of restrictions being discussed, requires the VM and JIT to block and fail for any number of the problematic patterns.

Byref-like types provide the restrictions that we would need here. There may be a few additional ones around taking size depending on the exact design. It should very straightforward enforce at runtime. It should not require analyzers or language support, exceptions thrown at runtime should be enough.

A net new type means that it is harder to implicitly light up existing code paths. It means that a computer with 512-bit SVE might not be able to successfully use Vector512<T>

I see implicit light-up of existing Vector paths as independent problem from SVE-specific Intrinsics. The light-up of existing architecture-neutral Vector paths can work the same way as AVX512 light-up, I do not see a problem with that. We have been designing the AVX512 light up for existing Vector<T> codepaths as configurable, so Vector<T> may be still configured to be only 256 or even 128 bit when AVX512 is available.

My concern is specifically about the type used with architecture specific SVE instructions. I do not think that it is appropriate to have this type to be configurable. This type should always match the SVE bitness of the underlying platform (and current streaming mode). It should not be user configurable.

A net new type means significant increase to the existing API surface area and less integration with existing features or algorithms.

I do not see why the new type for SVE specific instructions significantly increases the existing API surface. We are talking about adding a ton of new SVE specific Intrinsics. A new supporting type for them sounds like a drop in a bucket.

I agree that a new type means that you need to convert from/to the new type if you mix and match platform specific and platform neutral methods. It is a problem today as well if you mix and match Vector<T> and Vector128/256/512.

Limiting the feature to just support 128-bit and 256-bit (for the one piece of hardware that currently supports that) should then be completely fine and allows us to spread the more involved work out over future releases when and where it becomes relevant.

This is only true if we have good understanding of what we are going to do in the future releases. We are discussing many options here. We should have a firm plan that we agree on as viable.

The only real "additional" complexity around supporting struct S { svint32_t x; int32_t y; } is that it has sequential layout and so requires a dynamic offset calculation to access y. That being said, the actual code to support a dynamic offset calculation isn't complex

This is not the interesting complex case. This can be all handled by codegen as you have said and it does not require runtime type loader. The interesting complex cases are classes, statics and generics like class S<T> { svint32_t a; object b; [ThreadStatic] static svint32_t a; }. It is where type runtime loader is required.

Finally, by having this support be around Vector<T>, we not only support the mainline Arm64 scenarios, but we also improve support for x64 simultaneously and then have a type and general support that can likely be extended to other platforms with variable length support.

We have an option to make the new type to behave same as Vector<T> on x64 and improve them together. It would likely lead to Vector<T> to be deprecated over time in favor of the new type.

tannergooding commented 11 months ago

I think it would be good if we had a meeting to discuss this more in depth. I want to make sure we are at least on the same page with regards to eachothers concerns and the impact they would have on the ecosystem.

From my perspective, we have something today that works and makes .NET one of the best places to write SIMD code, particularly when it comes to supporting multiple platforms. Regular SVE is a new feature that meshes very well with the existing conventions and is the mainline scenario for Arm64 moving forward. There exists similar functionality designed for other platforms that also make the approach viable.

Streaming SVE is a more niche feature that is for even more specialized contexts. It is namely designed for complex matrix operations (and why it is introduced as part of SME), this is much like Intel AMX which also supports matrix operations, but does so in its own unique mechanism.

Streaming SVE deviates from any of the existing approaches and so doesn't mesh cleanly. This is namely because it is a switch that allows user-mode to dynamically change the size of a vector. It works similarly to other dynamic CPU configuration that .NET has historically opted to not support. Some examples include IEEE 754 exceptions, changing the floating-point rounding mode, setting strict alignment, etc.

Because of the scenario's its designed around and because of how it operates, it should be designed and considered separately. We may even determine it's not desirable to support at all, or should only be supported in a limited fashion. But we should not be restricting or hindering the more mainline customer scenario, nor should we be making it harder for them to utilize and integrate the new mainline functionality into their existing code and algorithms because this feature exists.

Byref-like types provide the restrictions that we would need here. There may be a few additional ones around taking size depending on the exact design. It should very straightforward enforce at runtime. It should not require analyzers or language support, exceptions thrown at runtime should be enough.

I don't think that's viable. If we're looking at matching what C/C++ requires, then there are a lot of restrictions put in place. That includes restricting things like structs with SveVector<T> as an instance field.

If we aren't looking at matching, then whether or not these objects can be declared on the heap has little impact on the JIT support. It may have some minimal impact on the GC support, but that's much less impacted with regions. Preventing the use of use in generics and their ability to implement interfaces is effectively going against everything we're trying to give users and ourselves around SIMD vectorization and will significantly raise the maintenance burden and complexity of supporting these types in the BCL, so much so that we may simply opt to not provide such paths.

I see implicit light-up of existing Vector paths as independent problem from SVE-specific Intrinsics. The light-up of existing architecture-neutral Vector paths can work the same way as AVX512 light-up, I do not see a problem with that. We have been designing the AVX512 light up for existing Vector codepaths as configurable, so Vector may be still configured to be only 256 or even 128 bit when AVX512 is available.

I don't see them as independent. This also isn't just about Vector<T>, this includes being able to use SVE instructions with Vector128<T>/AdvSimd code paths or being able to use Vector256<T> when the underlying SVE implementation is 256-bits in length.

AVX-512 light up works because it allows implicit use of new instructions with existing V128/V256 code paths and with existing patterns that developers are using with such code. Users can then additionally opt into providing a V512 code path if that is beneficial for their scenario and they can take explicit advantage of AVX-512 intrinsics, seamlessly, with the relevant IsSupported checks.

This then further interplays with new prototypes, like ISimdVector<TSelf, T>, which allow us to have 1 algorithm that supports any of the supported vector sizes. This allows us to write a simple dispatcher that routes to a single vectorized implementation. You can then, for size specific operations, handle those explicitly via a helper. This maximizes code sharing, reduces maintainability burden, reduces code complexity, and leaves nothing on the table as compared to using the native sizes directly.

By creating a new type and particularly by restricting it to be a ref struct, you remove most of the benefits from these patterns and general directions we're trying to move. You require users to explicitly write a new code path to use SVE and make it significantly more complex to interchange between SVE and the fixed sized types, which then further complicates the ability to share code and optimistically light up using SVE based functionality.

My concern is specifically about the type used with architecture specific SVE instructions. I do not think that it is appropriate to have this type to be configurable. This type should always match the SVE bitness of the underlying platform (and current streaming mode). It should not be user configurable.

This is an explicit feature of SVE and which operating systems expose and allow to be configured on a per-app basis. For example, on Linux: https://www.kernel.org/doc/html/v5.8/arm64/sve.html#prctl-extensions

It is completely appropriate and designed to allow this, so that a given application can default to the system configured size and can otherwise opt for a different size. .NET can and should take advantage of this, preferencing this to be set once at startup and then leaving it as UB if changed after (which is how C/C++ works as well).

A new supporting type for them sounds like a drop in a bucket.

It is one primary new type, which effectively involves taking Vector<T>, duplicating it, renaming it to SveVector<T>, making it a ref struct, and then trying to fill all the holes that are left by it being a ref struct. This includes filling the need for 2, 3, and 4 element tuples and rationalizing that it can't work with the friendly language syntax or features that we rely on in other places.

This is only true if we have good understanding of what we are going to do in the future releases. We are discussing many options here. We should have a firm plan that we agree on as viable.

I don't understand this sentiment. We know what hardware exists today and is likely to exist in the next 3 years. We have existing plans and guidance documentation on how vectorized code works and how we are handling the needs of allowing both platform specific and cross platform code to exist and to allow seamless transition between them.

This is not the interesting complex case. This can be all handled by codegen as you have said and it does not require runtime type loader. The interesting complex cases are classes, statics and generics like class S { svint32_t a; object b; [ThreadStatic] static svint32_t a; }. It is where type runtime loader is required.

Making the new type a ref struct also does not strictly solve the issue. It only solves the issue today and would become broken if/when the language tries to finish the push to allow ref structs to be used in generics or to implement interfaces, which has repeatedly been raised to them as an important scenario.

There's also no reason we can't block those for AOT or JIT scenarios. We are allowed to make breaking changes across major versions. Using vectorization is already decently niche compared to most things and the user doing something like this is effectively an anti-pattern and goes against how SIMD/vectorization is meant to be done. However, it likewise shouldn't really be overly complex to handle such cases, particularly if you aren't needing to consider streaming mode.

We have an option to make the new type to behave same as Vector on x64 and improve them together. It would likely lead to Vector to be deprecated over time in favor of the new type.

That seems like an overall worse world, for all the reasons listed above. It would, in my belief, strongly go against making SVE viable in .NET. It would likewise hinder any other platforms that have support for a length agnostic vector type.

neon-sunset commented 11 months ago

Please link https://github.com/dotnet/runtime/issues/76047 to this issue too.

kunalspathak commented 11 months ago

Please link #76047 to this issue too.

Done.

tannergooding commented 11 months ago

@kunalspathak, @jkotas, and I got together to sync up on the Vector<T> vs ref struct ScalableVector<T> discussion above and we came to an agreed understanding around the concerns for supporting both regular SVE and streaming SVE, as well as other ISAs such as SME.

The conclusion we came to is that we recognize supporting all these modes are important and we recognize the potential conflict between what might make one easier to use vs making them all easier to use. We then believe we can sufficiently write an analyzer to help push users towards writing code that works well in AOT scenarios and by extension Streaming SVE scenarios for the JIT. It is then desirable to continue on the path of utilizing Vector<T> at this point in time and that such an analyzer is inline with existing planned analyzers such as https://github.com/dotnet/runtime/issues/82488. It also allows SVE instructions to mesh with the existing code patterns and planned features like ISimdVector<TSelf, T>.

However, we realize that there are still some unknowns in this area and more investigatory work and discussion needs to be done to ensure that this approach is sound. We plan on doing that investigatory work and revisiting this in a few months time (likely around March-April) at which point we should have a better understanding of how problematic providing that around Vector<T> will be. In the worst case, we will need to pivot and provide a ref struct ScalableVector<T>. Updating to use this type should be trivial as it is essentially a copy/paste and find/replace operation, with the necessary restrictions being provided by it being a ref struct rather than needing to catch them with an analyzer.

a74nh commented 11 months ago

API proposals:

FEAT_SVE:

FEAT_SVE2:

Other SVE Features:

[ ] https://github.com/dotnet/runtime/issues/94026 (8.2, mandatory if SVE is available)
[ ] https://github.com/dotnet/runtime/issues/94027 (8.2, optional. From 8.6 is mandatory if SVE is available)
[ ] https://github.com/dotnet/runtime/issues/94028 (8.2, optional. From 8.6 is mandatory if SVE is available)
[ ] https://github.com/dotnet/runtime/issues/94024 (8.2, optional. Not in any available hardware)
[ ] https://github.com/dotnet/runtime/issues/94025 (8.2, optional. Not in any available hardware)
[ ] https://github.com/dotnet/runtime/issues/94423 (9.0, optional)
[ ] https://github.com/dotnet/runtime/issues/94424 (9.0, optional)
[ ] https://github.com/dotnet/runtime/issues/94425 (9.0, optional)
[ ] https://github.com/dotnet/runtime/issues/94426 (9.0, optional)

total ACLE covered: 6040 total ACLE when expanded: 6212 total method signatures: 3288 total method T signatures: 1045 total methods signatures optional: 984 total methods signatures rejected: 96 total methods names: 627

tannergooding commented 11 months ago

Once the full list of proposals for SVE is up, I can schedule them for API review.

I can do the same for SVE2 and the other extensions when those issues are up.

a74nh commented 11 months ago

Opened issues for all the SVE APIs. I expect all the additional features to be lower priority.

tannergooding commented 11 months ago

I've gone through the FEAT_SVE and left some comments on places where names don't quite line up with AdvSimd or where we might benefit from examples or a slightly different name to provide additional clarity as to what the API is doing.

I haven't gone through FEAT_SVE2 yet but many of the same comments are likely to apply.

For the "other" features, they need to be in their own class (each) since they are unique feature bits at the hardware level.

BruceForstall commented 11 months ago

For the "other" features, they need to be in their own class (each) since they are unique feature bits at the hardware level.

There are a LOT of feature bits (https://developer.arm.com/downloads/-/exploration-tools/feature-names-for-a-profile). Which ones are interesting enough (or do we require) to expose to users separately, and which ones can be grouped and assumed to be always available in any hardware if the rest of the group is available?

tannergooding commented 11 months ago

Many of those represent kernel only functionality or features that we wouldn't expose hardware intrinsics for. Xarch similarly has hundreds of CPUID bits and we primarily only expose the ones oriented around hardware acceleration. -- We likely wouldn't expose intrinsics related to concepts like FEAT_ABLE: Address Breakpoint Linking Extension, for example.

We so far have not done any "implicit grouping", we maintain the 1-to-1 with the hardware feature bits we expose to ensure that other features, checks, and general functionality work as expected. I don't expect the number of ISAs to be overly problematic in practice given the types of instructions we typically want to expose

a74nh commented 10 months ago

Many of those represent kernel only functionality or features that we wouldn't expose hardware intrinsics for.

Linux has a HWCAP entry for every feature exposed by the kernel. I find this is the easiest way to figure out which features are useful. Of course, doing the mapping itself can still take a while.

These might be useful:

N2 features with HWCAPs

- FEAT_SHA1 HWCAP_SHA1 - FEAT_SHA256 HWCAP_SHA2 - FEAT_AES HWCAP_AES - FEAT_PMULL HWCAP_PMULL - FEAT_FP HWCAP_FP - FEAT_AdvSIMD HWCAP_ASIMD - FEAT_CRC32 HWCAP_CRC32 - FEAT_SB HWCAP_SB - FEAT_SSBS2 HWCAP_SSBS - FEAT_DGH HWCAP2_DGH - FEAT_LSE HWCAP_ATOMICS - FEAT_RDM HWCAP_ASIMDRDM - FEAT_DPB HWCAP_DCPOP - FEAT_FP16 "HWCAP_FPHP HWCAP_ASIMDHP" - FEAT_SVE HWCAP_SVE - FEAT_SHA512 HWCAP_SHA512 - FEAT_SHA3 HWCAP_SHA3 - FEAT_SM3 HWCAP_SM3 - FEAT_SM4 HWCAP_SM4 - FEAT_DotProd HWCAP_ASIMDDP - FEAT_FHM HWCAP_ASIMDFHM - FEAT_DPB2 HWCAP2_DCPODP - FEAT_BF16 "HWCAP2_BF16 HWCAP2_SVEBF16" - FEAT_I8MM HWCAP2_SVEI8MM - FEAT_PAuth "HWCAP_PACA HWCAP_PACG" - FEAT_JSCVT HWCAP_JSCVT - FEAT_LRCPC HWCAP_LRCPC - FEAT_FCMA HWCAP_FCMA - FEAT_DIT HWCAP_DIT - FEAT_FlagM HWCAP_FLAGM - FEAT_LSE2 HWCAP_USCAT - FEAT_LRCPC2 HWCAP_ILRCPC - FEAT_FlagM2 HWCAP2_FLAGM2 - FEAT_FRINTTS HWCAP2_FRINT - FEAT_BTI HWCAP2_BTI - FEAT_RNG HWCAP2_RNG - FEAT_MTE HWCAP2_MTE - FEAT_SVE2 HWCAP2_SVE2 - FEAT_SVE_AES HWCAP2_SVEAES - FEAT_SVE_PMULL128 HWCAP2_SVEPMULL - FEAT_SVE_SHA3 HWCAP2_SVESHA3 - FEAT_SVE_SM4 HWCAP2_SVESM4 - FEAT_SVE_BitPerm HWCAP2_SVEBITPERM

And:

N2 features with HWCAPs that are not in N1

- FEAT_SB HWCAP_SB - FEAT_DGH HWCAP2_DGH - FEAT_SVE HWCAP_SVE - FEAT_SHA512 HWCAP_SHA512 - FEAT_SHA3 HWCAP_SHA3 - FEAT_SM3 HWCAP_SM3 - FEAT_SM4 HWCAP_SM4 - FEAT_FHM HWCAP_ASIMDFHM - FEAT_DPB2 HWCAP2_DCPODP - FEAT_BF16 "HWCAP2_BF16 HWCAP2_SVEBF16" - FEAT_I8MM HWCAP2_SVEI8MM - FEAT_PAuth "HWCAP_PACA HWCAP_PACG" - FEAT_JSCVT HWCAP_JSCVT - FEAT_FCMA HWCAP_FCMA - FEAT_DIT HWCAP_DIT - FEAT_FlagM HWCAP_FLAGM - FEAT_LSE2 HWCAP_USCAT - FEAT_LRCPC2 HWCAP_ILRCPC - FEAT_FlagM2 HWCAP2_FLAGM2 - FEAT_FRINTTS HWCAP2_FRINT - FEAT_BTI HWCAP2_BTI - FEAT_RNG HWCAP2_RNG - FEAT_MTE HWCAP2_MTE - FEAT_SVE2 HWCAP2_SVE2 - FEAT_SVE_AES HWCAP2_SVEAES - FEAT_SVE_PMULL128 HWCAP2_SVEPMULL - FEAT_SVE_SHA3 HWCAP2_SVESHA3 - FEAT_SVE_SM4 HWCAP2_SVESM4 - FEAT_SVE_BitPerm HWCAP2_SVEBITPERM

tannergooding commented 10 months ago

Probably worth noting that a number of those aren't necessarily needed to be publicly exposed as platform specific intrinsics. Several of them are things that can be trivially exposed via public APIs. For example, FEAT_BF16 is better to just expose a BFloat16 type and to support via operators.

Rather, intrinsics are preferred for scenarios where they allow significant performance advantages and/or have platform unique behavior that make it problematic to expose in a cross platform manner.

a74nh commented 10 months ago

All the SVE1 APIs have been reviewed by @tannergooding and then updated, then been marked as ready-for-review (except for the mask category which still needs a few fixes). SVE2 and the other extensions have been updated in the same way, but haven't been reviewed.

I've added 4 more APIs I forget to raise issues for. These are all in 9.0, but they are very short. Added them to the list above.

Thought: This issue currently only has comments for SVE. Would it make sense to move .NET 9 Arm64 Performance work to a different issue and reduce the scope of this issue to just be the SVE work?

kunalspathak commented 10 months ago

Thought: This issue currently only has comments for SVE. Would it make sense to move .NET 9 Arm64 Performance work to a different issue and reduce the scope of this issue to just be the SVE work?

I was thinking the same and given that we need more granular work items for SVE, I will probably move the "overall" portion in a different issue.

kunalspathak commented 10 months ago

I was thinking the same and given that we need more granular work items for SVE, I will probably move the "overall" portion in a different issue.

94464

kunalspathak commented 7 months ago

I chatted with @a74nh today about how to expose Streaming APIs and SVE APIs that are streaming compatible and here are some raw notes that came out about exposing streaming behavior. @a74nh - please add if I missed anything.

Streaming instructions (and hence the .NET SME APIs) are executed when streaming mode is ON. In this mode, some of the SVE instructions do not work and hence "incompatible". We talked about how we should surface that information to the user level. We could force C# compiler to give compilation errors to user if they try to use non-streaming APIs in a method that is marked as "streaming" (option 1 below) or we can hide this abstraction and let JIT handle turning streaming on/off and save/restore streaming state (option 2 below) or we can have something in between (option 3).

Option 1. C# support

Add support in c# to give error explicitly to the user and possibly a new syntax of streaming similar to async:

Advantages:

JIT will automatically saving/restore state appropriately
hard for user to get it wrong and if they get wrong, we fail at compile time.

Disadvantages:

Hard to convience C# team to add this platform and hardware specific feature + there is significant work to implement it on roslyn side too. Assumption is with that, it will cover all the scenarios.
If JIT is automatically doing save/restore for each method, that could add perf penalty, because from the documentation: "so mid-operation mode switches should be avoided for performance reasons".

C++ compilers might end up in this option (except without a keyword).

Option 2. JIT support

Expose [Streaming] and [Streaming-Compatible] attribute and let JIT does all the work of saving, restoring streaming state. This essentially means that before every API call that is marked as [Streaming] or .NET SME intrinsic, JIT will do the following:

prev_streaming_status = get_streaming_status(); // JIT inserted
turn_streaming_on();                            // JIT inserted

call streaming_method();

set_streaming_status(prev_streaming_status);    // JIT inserted

However, this need to happen even on non-streaming methods i.e. all the other regular, neon methods.

prev_streaming_status = get_streaming_status(); // JIT inserted
turn_streaming_off();                           // JIT inserted

call non_streaming_method();

set_streaming_status(prev_streaming_status);    // JIT inserted

Advantages:

No dependency on roslyn team.
No burden on user to remember to do this explicitely in their code because JIT will handle it.
No runtime failures

Disadvantages:

No user control when the streaming can be turned on and off.
For all regular non-streaming, we will need to have save/restore streaming mode code (as seen above), a big problematic from performance stand of view.
Just to reiterate, all streaming and non-streaming methods will have code to save/restore and turn on/off streaming. "sve-compatible" marked APIs will be untouched.

Option 3. Libraries support

In addition to [Streaming], [Local_Streaming] and [Streaming-Compatible] method attributes, expose TurnOnStreaming, TurnOffStreaming, IsStreamingOn properties in SME class.

[Streaming] : The methods marked with this attribute will not save/restore or turn streaming ON. They are just a way to inform the caller that they should do the same before calling these methods. In "Debug" mode, JIT can add checks to make sure that streaming mode is ON in the prolog.
[Local_Streaming] : Methods marked with this attribute, JIT will save the previous streaming mode state, turn on streaming, and in the end restore the previous streaming state. This is similar to the first example in "JIT support" section above. This attribute will give some relaxation to the user of not having to worry about adding streaming save/restore before calling these methods. The JIT will do it for them. By definition, all statements in these methods will be streaming. If there are non-streaming statements, user will have to explicitely save/restore the streaming state in between which can impact performance. Hence, it will be advisable to only use it if there are "streaming statements" that are getting executed within the method.
[Streaming-Compatible] : Methods marked with this attribute will inform the callers that if they want, they can turn on streaming before calling these methods. However, inside a method, if there are non-streaming APIs or statements present, user should turn streaming on/off to make sure they do not get runtime failure.

Advantages:

Simple and less work needed.
No hidden perf overhead will be added by JIT because user will control when to turn streaming on and off.

Disadvantages:

It is at the discretion of user to make that they turn streaming on and off appropriately, else can result in runtime failure. We can mitigate it to some extent by adding streaming mode check when compiling in "Debug" as well as add a mode in JitStress that will generate such checks.

Testing:

In "Debug" mode, add if "streaming_mode == ON" in prolog of methods marked with [Streaming]
In "Debug" mode, save the streaming mode state before calling [Local_Streaming] methods and after the call, add a check if the streaming mode state is restored.

Usage

[Local_Streaming]
MyFunc1()
{
  A(); // streaming
  B(); // streaming
}

[Streaming]
MyFunc2()
{
  D(); // streaming

  previous_state = SME.IsStreamingOn();
  SME.TurnOnStreaming();
  E(); // non-streaming
  SME.SetStreamingMode(previous_state);

  F(); // streaming
}

 // Since "Local_Streaming", no need to save/restore
MyFunc1();

// Since "Streaming", save and restore
previous_state = SME.IsStreamingOn();
SME.TurnOnStreaming();

MyFunc2()

SME.SetStreamingMode(previous_state);

References

TODO

Come up with list of SVE/NEON compatible instructions/API

jkotas commented 7 months ago

The streaming mode changes the size of Vector<T>. If we are going to reuse the same instruction API definitions for both non-streaming and streaming mode and keep using Vector<T> as the type for their arguments, it puts other severe restrictions on what streaming-compatible methods can do. For example, streaming-compatible methods cannot access Vector<T> static field since they will have the non-streaming size. It is not something that the JIT can handle transparently. IIRC, an attribute to mark the streaming compatible methods together with a Roslyn analyzer to enforce all rules for streaming-compatible methods is along the lines of a solution that we have discussed during our earlier chat about this (https://github.com/dotnet/runtime/issues/93095#issuecomment-1769501335)

TurnOnStreaming

I think that the best shape for the API to turn the streaming on/off would be a method that takes a delegate that should be executed in streaming mode. The analyzer can enforce that the target method is streaming compatible among other things.

tannergooding commented 7 months ago

For example, streaming-compatible methods cannot access Vector static field since they will have the non-streaming size

It's worth explicitly stating that the same general premise applies whether we use Vector<T> or some new type SveVector<T> (even if a ref struct). Due to the size change, there are many different things that potentially become broken when you consider fields, locals, parameters, references, or indirections. All of these can subtly break between a method that assumes no streaming and a method which enables streaming support.

-- I just want to reiterate this, since a separate type will only solve a small subset of the overall considerations, namely what happens when you encounter the underlying vector type on the GC heap. So it isn't necessarily viable for us to use a separate type or duplicate the API surface between Streaming and Non-streaming modes.

IIRC, an attribute to mark the streaming compatible methods together with a Roslyn analyzer to enforce all rules for streaming-compatible methods is along the lines of a solution that we have discussed during our earlier chat about this

I expect we need a little bit of a balance here given the two related, but somewhat differing considerations. We really have both user-defined methods being streaming compatible and intrinsics being streaming compatible.

For intrinsics themselves, most SVE instructions are streaming compatible and when FEAT_SME_FA64 is available, it changes to all instructions. My expectation is that we basically want to have Sve and something like Sve.NestedClass. The nested class would contain the instructions which may be streaming incompatible and which require a separate check to use. This then allows an analyzer to help prove correctness.

For user-defined methods, I agree that we need some way to attribute them as compatible. This one is a little more difficult as what is compatible or incompatible depends on a few factors, including what the Effective ZA tile width is compared to the Effective SVE vector length. A very large portion of methods won't use SVE at all and would be compatible as well.

Given the limitations and the potential for unsafety, I almost want to say a viable way to handle this would be to introduce a new unmanaged calling convention. This is ultimately along the lines of having a method that takes a delegate, but my thought is that it might integrate more cleanly with existing tooling and put users into the general mindset that normal managed operations may not be available/appropriate. -- That is, we'd have [UnmanagedCallersOnly([typeof(CallConvArm64Sme)])]. This gives users a clear boundary, allows us to easily block the use of certain types/parameters, and gives us the ability to more easily insert any pre/post call cleanup logic to ensure what is effectively a changed ABI remains correct. It will also avoid complexities around inlining or future delegate optimizations that might otherwise be possible, etc.

jkotas commented 7 months ago

a separate type will only solve a small subset of the overall considerations, namely what happens when you encounter the underlying vector type on the GC heap

Nit: byref-like types cannot show up on the GC heap. byref-like type would address this particular concern by construction.

I agree that byref-like type would not address all streaming safety issues by construction.

A very large portion of methods won't use SVE at all and would be compatible as well.

We do not know what this set. I expect that BCL is going be streaming incompatible by default. It would be prohibitively expensive to audit the whole BCL and annotate it as streaming compatible/incompatible. We may choose to audit and annotate small part of BCL as streaming compatible and this set can grow over time.

introduce a new unmanaged calling convention

I do not see the benefit of unmanaged calling convention compared to an attribute. An analyzer to enforce the streaming compatible rules is the main gatekeeper. An attribute is a more natural way to mark methods that the analyzer should enforce the rules for.

tannergooding commented 7 months ago

Nit: byref-like types cannot show up on the GC heap. byref-like type would address this particular concern by construction.

Right, sorry. That's what I meant.

That is, I meant ref struct SveVector<T> fixes the concern around static fields and such fields existing on the GC heap. It doesn't fix the remaining cases such as a user passing ref SveVector<T> or SveVector<T>* or ref struct S { SveVector<T> x; SveVector<T> y; } or parameters or locals leaking across an enablement boundary or ....

We do not know what this set. I expect that BCL is going be streaming incompatible by default. It would be prohibitively expensive to audit the whole BCL and annotate it as streaming compatible/incompatible. We may choose to audit and annotate small part of BCL as streaming compatible and this set can grow over time.

👍. My only real concern is about the eventual scope creep of "sve compatible" and that it may explode to a much larger set of annotations. In practice the only code that should be "incompatible" is if they're using some of the SVE instructions that are incompatible without FEAT_SME_FA64. Especially in the BCL, users shouldn't be having Vector as a static field or in an array, span, etc; because that's significantly less efficient and less portable than operating over a Span<T> directly. Most of the things that would cause you to be SME incompatible are anti-patterns for code that is meant to be used in high perf scenarios, and so its things we already strongly recommend against doing.

I do not see the benefit of unmanaged calling convention compared to an attribute. An analyzer to enforce the streaming compatible rules is the main gatekeeper. An attribute is a more natural way to mark methods that the analyzer should enforce the rules for.

Using just an attribute works, but it doesn't provide a clear boundary for users and will come with additional implications that the VM and JIT need to handle. For example, they may have to special case what can happen with inlining or optimizations across the boundary. They may also require additional tracking to know that we're in a method context that has SVE enabled. If a method is SVE compatible and can be used from both SVE and non-SVE code, then we may need to "specialize" for each to ensure good/efficient codegen. I'm concerned that this additional handling may not be "pay for play" in the JIT.

However, for UnmanagedCallersOnly we already have a lot of this special tracking integrated and it provides a very clear boundary under which power users are already familiar that special limitations may exists. Since enabling streaming mode is in effect dynamically changing the ABI/calling convention, this also makes sense from a logical perspective since that's how users do the same for calling into native.

It's really much the same thing, just trying to play off the existing support and functionality the JIT has.

jkotas commented 7 months ago

However, for UnmanagedCallersOnly we already have a lot of this special tracking integrated and it provides a very clear boundary under which power users are already familiar that special limitations may exists

I am not sure what kind of special tracking for UnmanagedCallersOnly you have in mind. The main parts of UnmanagedCallersOnly are:

UnmanagedCallersOnly require all types in the method signatures to be blittable. It would be unnecessary limitation for streaming methods. It is perfectly fine for streaming compatible methods to take object references as argument.
UnmanagedCallersOnly switches GC mode. It would be unnecessary overhead for streaming compatible methods.

The JIT is not able to inline through unmanaged calli + UnmanagedCallersOnly combo today. I do not think we have it written down anywhere that such optimization is prohibited.

For example, they may have to special case what can happen with inlining or optimizations across the boundary. They may also require additional tracking to know that we're in a method context that has SVE enabled. If a method is SVE compatible and can be used from both SVE and non-SVE code, then we may need to "specialize" for each to ensure good/efficient codegen. I'm concerned that this additional handling may not be "pay for play" in the JIT.

There is a large spectrum of how the streaming support can look: from simple explicit approach that enables it and requires users to do more work, all the way to streaming-aware auto-vectorization on the other end.

I think it would be reasonable to start with simple explicit approach. Something like:

Code is streaming-incompatible by default. I believe that this matches the approach taken for C/C++ for streaming. Is that correct?
Methods can be annotated as streaming compatible using a new attribute. Streaming compatible methods have to avoid patterns incompatible with streaming (no access of Vector fields, etc.) and have to explicitly switch to non-streaming mode around any calls to streaming incompatible code. The rules are enforced by Roslyn analyzer. There is no enforcement of the rules by the runtime or JIT.
Switches between streaming and non-streaming modes are inlining boundaries. The JIT does not need to worry about encountering both streaming compatible and streaming incompatible code in the same method.

kunalspathak commented 7 months ago

I think it would be reasonable to start with simple explicit approach

That's what I had in mind in option 3 above. The only reason I introduced Local_Streaming is that will give a choice to user to not having to turn on/off/save/restore streaming state, but JIT will do it for them.

jkotas commented 7 months ago

The only reason I introduced Local_Streaming is that will give a choice to user to not having to turn on/off/save/restore streaming state

If we were to do automatically inserted mode switches, I think we would want to do them in both directions: turn on streaming around streaming required method calls and turn off streaming around streaming incompatible method calls. It gets more complicated with optimizations - for example, if there are two calls in a row or in a loop, should the JIT be smart enough to turn the streaming on/off just once for the two calls or around the whole loop? I would wait to see the evidence that the automatic mode switches are really needed. Starting with the fully explicit scheme should not prevent us from introducing automatically inserted mode switches later.

a74nh commented 6 months ago

For intrinsics themselves, most SVE instructions are streaming compatible and when FEAT_SME_FA64 is available, it changes to all instructions. My expectation is that we basically want to have Sve and something like Sve.NestedClass. The nested class would contain the instructions which may be streaming incompatible and which require a separate check to use. This then allows an analyzer to help prove correctness.

I don't think this ideal: 1) It's confusing for a user. They now have to be aware of the implementation of SME to understand why a particular SVE instruction is in a different subclass 2) Most, but not all NEON instructions are disabled when FEAT_SME_FA64 is enabled. We need to consider what to do here. The remaining ones are mostly scalar instructions, and I don't think we gain much by keeping them. We certainly don't want to adding subclasses. 3) The instruction disabling is gated on the feature FEAT_SME_FA64. There is nothing in architecture that prevents another extension from enabling or disabling a different set of instructions. Whereas subclassing would be hard to change later.

My suggestion would be to use a method attribute which contains the name of the feature that disables the method. I hope that's something that's not too tricky to retrospectively add to an existing API, as for most use cases it's not changing the API. That would enable us to continue the SVE design, and then retrospectively add the attributes to SVE and AdvSimd.

tannergooding commented 6 months ago

Most, but not all NEON instructions are disabled when FEAT_SME_FA64 is enabled

Do you have this inversed? Based on the architecture manual it looks like NEON is disabled in streaming mode (much like many SVE instructions) unless FEAT_SME_FA64 is enabled.

The manual states that FEAT_SME_FA64 means the full A64 ISA is available in streaming mode and that would logically include NEON.

They now have to be aware of the implementation of SME to understand why a particular SVE instruction is in a different subclass

Users must be aware when an instruction might be supported or not. Any temporary state changes, like SME, would have to be accounted for as part of context switching to ensure other threads don't suddenly fail.

Using attributes + an analyzer to determine if a given intrinsic API is supported is just a different approach to the same problem, but is inconsistent with how we've handled it so far. It also forces reliance on an analyzer for correctness, rather than a more implicit understanding that if Isa.IsSupported returns true, you can use the APIs and they will otherwise throw.

Based on what the manual covers, we have the following separate checks:

AdvSIMD
- Documented via CheckFPAdvSIMDEnabled64()
- This checks CheckFPEnabled() and requires !HAVESME() || (PSTATE.SM != 1) || IsFullA64Enabled())
SVE
- Documented via CheckSVEEnabled()
SVE - Non Streaming
- Documented via CheckNonStreamingSVEEnabled()
- This checks CheckSVEEnabled() and requires !HaveSME() || (PSTATE.SM != 1) || IsFullA64Enabled())
SME
- Documented via CheckSMEEnabled()

Thus, it logically tracks that having 4 classes here would also work, each one exactly corresponding to a documented operation check around the support:

AdvSimd
Sve
SveNonStreaming : Sve
- This could be nested Sve.NonStreaming : AdvSimd
Sme

This ensures we have a very clean and clear separation of support that can be constant in most contexts. The only thing that really changes is in an SME enabled context and based on whether FEAT_SME_FA64 is enabled.

AdvSimd.IsSupported
- returns true by default
- returns false in an SME enabled context, unless FEAT_SME_FA64 is enabled.
Sve
- returns true if FEAT_SVE exists
SveNonStreaming (or Sve.NonStreaming)
- returns the same value as Sve by default
- returns false in an SME enabled context, unless FEAT_SME_FA64 is enabled
Sme
- returns true in an SME enabled context

Going into an SME enabled context requires FEAT_SME and would be dependent on the new attribute approach that Jan laid out above. This attribute is largely only relevant to the JIT and only needs minimal support to ensure that users don't call methods that aren't marked as streaming compatible. The intrinsic APIs would be considered compatible by default and rely on the IsSupported check to determine if it was safe to call or not.

a74nh commented 6 months ago

Most, but not all NEON instructions are disabled when FEAT_SME_FA64 is enabled

Do you have this inversed? Based on the architecture manual it looks like NEON is disabled in streaming mode (much like many SVE instructions) unless FEAT_SME_FA64 is enabled.

The manual states that FEAT_SME_FA64 means the full A64 ISA is available in streaming mode and that would logically include NEON.

Yes, you're right I had this inverted.

AdvSimd.IsSupported

returns true by default

returns false in an SME enabled context, unless FEAT_SME_FA64 is enabled.

Sve

In a world without FEAT_SME_FA64, some AdvSimd instructions are still available:

Quoting DDI0616B_a_SME_Supplement.pdf:

For the avoidance of doubt, A64 scalar floating-point instructions which match following encoding patterns remain legal when the PE is in Streaming SVE mode: A64 Encoding Pattern Instructions or Instruction Class x001 111x xxxx xxxx xxxx xxxx xxxx xxxx Scalar floating-point operations xx10 110x xxxx xxxx xxxx xxxx xxxx xxxx Load/store pair of FP registers xx01 1100 xxxx xxxx xxxx xxxx xxxx xxxx Load FP register (PC-relative literal) xx11 1100 xx0x xxxx xxxx xxxx xxxx xxxx Load/store FP register (unscaled imm) xx11 1100 xx1x xxxx xxxx xxxx xxxx xx10 Load/store FP register (register offset) xx11 1101 xxxx xxxx xxxx xxxx xxxx xxxx Load/store FP register (scaled imm) With the exception of the following floating-point operation which is illegal when the PE is in Streaming SVE mode: A64 Encoding Pattern Instructions or Instruction Class 0001 1110 0111 1110 0000 00xx xxxx xxxx FJCVTZS

You plan on not making these available in C# when in streaming mode?

kunalspathak commented 6 months ago

I will setup a meeting with @tannergooding @a74nh @TamarChristinaArm to discuss this.

EwoutH commented 5 months ago

Thanks for working on this. Very excited about future SVE2 support!

[ ] SVE hardware availability in CI

First step might be having GitHub hosted arm64 runners and images at all:

While useful for many developers, it might help if a request came from an internal Microsoft team.

As for SVE and SVE2, there a CPU with Neoverse V1 (SVE) or Neoverse N2, V2, N3 or V3 cores is needed. Since Ampere doesn't have processors released with these cores, the Microsoft Azure Cobalt 100 CPU looks like the best options, which uses N2 cores.

It seems all pieces of the puzzle are already in-house at Microsoft!

JulieLeeMSFT commented 1 month ago

Closing as completed for .NET 9 effort. We will create a new user story on Arm64 performance for .NET 10 and move some of the remaining items there.