dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/

[API Proposal]: Example usages of a VectorSVE API #88140

Open a74nh opened 1 year ago

a74nh commented 1 year ago

Background and motivation

Adding a vector API for Arm SVE/SVE2 would be useful. SVE is a mandatory feature from Armv9 onwards and is an alternative to NEON. Code written in SVE is vector length agnostic: it automatically scales to the vector length of the machine it runs on, so only a single implementation per routine is required. Predication in SVE allows loop heads and tails to be skipped, making code shorter, simpler and easier to write.

This issue provides examples of how such an API might be used.

API Proposal

None provided.

API Usage


  /*
    Sum all the values in an int array.
  */
  public static unsafe int sum_sve(ref int* srcBytes, int length)
  {
    VectorSVE<int> total = Sve.Create((int)0);
    int* src = srcBytes;
    VectorSVEPred pred;

    /*
      WhileLessThan comes in two variants:
        VectorSVEPred WhileLessThan(int val, int limit)
        VectorSVEComparison WhileLessThan(VectorSVEPred out predicate, int val, int limit)

      A VectorSVEComparison can be tested using the SVE condition codes (none, any, last, nlast etc).
      `if (cmp.nlast) ....`
      `if (Sve.WhileLessThan(out pred, i, length).first) ....`

      `if (cmp)` is the same as doing `if (cmp.any)`

      Ideally the following will not be allowed:
        var f = Sve.WhileLessThan(out pred, i, length).first
    */

    /*
      Always using a function call for the vector length instead of assigning to a variable will allow
      easier optimisation to INCW (which is faster than incrementing by a variable).
    */

    for (int i = 0; Sve.WhileLessThan(out pred, i, length); i += Sve.VectorLength<int>())
    {
      VectorSVE<int> vec = Sve.LoadUnsafe(pred, ref *src, i);

      /*
        This is the standard SVE `add` instruction, which uses a merge predicate.
        For each active lane in the predicate, add the two vectors. For all other lanes, use the value from the first vector.
       */
      total = Sve.MergeAdd(pred, total, vec);
    }

    // No separate tail loop required.
    return Sve.AddAcross(total).ToScalar();
  }

  /*
    Sum all the values in an int array, without predication.
    For performance reasons, it may be better to use an unpredicated loop, followed by a tail.
    Ideally, the user would write the predicated version and the JIT would optimise to this if required.
  */
  public static unsafe int sum_sve_unpredicated_loop(ref int* srcBytes, int length)
  {
    VectorSVE<int> total = Sve.Create((int)0);
    int* src = srcBytes;

    int i = 0;
    for (i = 0; i + Sve.VectorLength<int>() <= length; i += Sve.VectorLength<int>())
    {
      VectorSVE<int> vec = Sve.LoadUnsafe(ref *src, i);
      total = Sve.Add(total, vec);
    }

    // Predicated tail.
    VectorSVEPred pred = Sve.WhileLessThan(i, length);
    VectorSVE<int> vec = Sve.LoadUnsafe(pred, ref *src, i);
    total = Sve.MergeAdd(pred, total, vec);

    return Sve.AddAcross(total).ToScalar();
  }

  /*
    Count all the non zero elements in an int array.
  */
  public static unsafe int CountNonZero_sve(ref int* srcBytes, int length)
  {
    VectorSVE<int> total = Sve.Create((int)0);
    int* src = srcBytes;
    VectorSVEPred pred;

    for (int i = 0; Sve.WhileLessThan(out pred, i, length); i += Sve.VectorLength<int>())
    {
      VectorSVE<int> vec = Sve.LoadUnsafe(pred, ref *src, i);
      VectorSVEPred cmp_res = Sve.CompareGreaterThan(pred, vec, 0);

      // Add 1 in each lane where the comparison matched, leaving all other lanes untouched.
      total = Sve.MergeAdd(cmp_res, total, Sve.Create(1));
    }

    // No separate tail loop required.
    return Sve.AddAcross(total).ToScalar();
  }

  /*
    Count all the non zero elements in an int array, without predication.
  */
  public static unsafe int CountNonZero_sve_unpredicated_loop(ref int* srcBytes, int length)
  {
    int* src = srcBytes;

    // Comparisons require predicates. Therefore for a truly unpredicated version, use Neon.
    Vector128<int> zero = Vector128<int>.Zero;
    Vector128<int> one = Vector128.Create(1);
    Vector128<int> total = Vector128<int>.Zero;

    int vector_length = 16/sizeof(int);
    int i = 0;
    for (i = 0; i+vector_length <= length; i+=vector_length)
    {
      Vector128<int> vec = AdvSimd.LoadVector128(src);
      Vector128<int> gt = AdvSimd.CompareGreaterThan(vec, zero);
      Vector128<int> bits = AdvSimd.And(gt, one);

      total = AdvSimd.Add(bits, total);
      src += vector_length;
    }

    // Reduce the Neon per-lane counts to a scalar.
    int ret = AdvSimd.Arm64.AddAcross(total).ToScalar();

    // Predicated tail.
    VectorSVEPred pred = Sve.WhileLessThan(i, length);
    VectorSVE<int> vec_tail = Sve.LoadUnsafe(pred, ref *src);
    VectorSVEPred cmp_res = Sve.CompareGreaterThan(pred, vec_tail, 0);
    ret += Sve.PredCountTrue(cmp_res); // INCP

    return ret;
  }

  /*
    Count all the elements in a null terminated array of unknown size.
  */
  public static unsafe int CountLength_sve(ref int* srcBytes)
  {
    int* src = srcBytes;
    VectorSVEPred pred = Sve.CreatePred(true);
    int ret = 0;

    while (true)
    {
      VectorSVE<int> vec = Sve.LoadUnsafeUntilFault(pred, ref *src); // LDFF1

      /*
        Reading the fault predicate via RDFFRS will also set the condition flags:
          VectorSVEComparison GetFaultPredicate(VectorSVEPred out fault, VectorSVEPred pred)
       */
      VectorSVEPred fault_pred;

      if (Sve.GetFaultPredicate(out fault_pred, pred).last)
      {
        // Last element is set in fault_pred, therefore the load did not fault.

        /*
          Like WhileLessThan, comparisons come in two variants:
            VectorSVEPred CompareEquals(VectorSVEPred pred, VectorSVE a, VectorSVE b)
            VectorSVEComparison CompareEquals(VectorSVEPred out cmp_result, VectorSVEPred pred, VectorSVE a, VectorSVE b)
         */

        // Look for any zeros across the entire vector.
        VectorSVEPred cmp_zero;
        if (Sve.CompareEquals(out cmp_zero, pred, vec, 0).none)
        {
          // No zeroes found. Advance and continue loop.
          ret += Sve.VectorLength<int>();
          src += Sve.VectorLength<int>();
        }
        else
        {
          // Zero found. Count up to it and return.
          VectorSVEPred matches = Sve.PredFillUpToFirstMatch(pred, cmp_zero); // BRKB
          ret += Sve.PredCountTrue(matches); // INCP
          return ret;
        }
      }
      else
      {
        // Load caused a fault.

        // Look for any zeros across the vector up until the fault.
        VectorSVEPred cmp_zero;
        if (Sve.CompareEquals(out cmp_zero, fault_pred, vec, 0).none)
        {
          // No zeroes found. Advance past the elements loaded before the fault,
          // clear the faulting predicate and continue loop.
          ret += Sve.PredCountTrue(fault_pred); // INCP
          src += Sve.PredCountTrue(fault_pred);
          Sve.ClearFaultPredicate(); // SETFFR
        }
        else
        {
          // Zero found. Count up to it and return.
          VectorSVEPred matches = Sve.PredFillUpToFirstMatch(pred, cmp_zero); // BRKB
          ret += Sve.PredCountTrue(matches); // INCP
          return ret;
        }
      }
    }
  }

Alternative Designs

No response

Risks

No response

References

SVE Programming Examples: https://developer.arm.com/documentation/dai0548/latest/
A64 -- SVE Instructions (alphabetic order): https://developer.arm.com/documentation/ddi0602/2023-03/SVE-Instructions?lang=en

ghost commented 1 year ago

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch See info in area-owners.md if you want to be subscribed.

a74nh commented 1 year ago

@kunalspathak @tannergooding @BruceForstall

a74nh commented 1 year ago

@TamarChristinaArm

ghost commented 1 year ago

Tagging subscribers to this area: @dotnet/area-system-runtime-intrinsics See info in area-owners.md if you want to be subscribed.

kunalspathak commented 1 year ago

cc: @JulieLeeMSFT

tannergooding commented 1 year ago

Thanks for opening the issue! This is definitely a space we want to improve; I just want to lay out some baseline information and expectations first.


It's worth noting that this is not actionable as an API proposal in its current setup. Any API proposal needs to follow the API proposal template including defining the full public surface area to be added.

It's also worth noting any work towards SVE/SVE2 is going to require:

  1. Hardware we can reliably use and test in CI
  2. Significant validation and design work around the exposed vector type, API surface, and usable surface area
  3. Additional work to ensure things work correctly for AOT scenarios

For 1, there are only some mobile chips (Samsung/Qualcomm) and the AWS Graviton3 that currently support SVE or Armv9 at all. The latter is the only one that would be viable for CI and it would likely require some non-trivial effort to make happen. It may still be some time and require higher market saturation of SVE/SVE2 before work/progress can be made (the same was true for AVX-512). Having more easily accessible hardware that supports the ISAs (both for CI and local development) can accelerate this considerably.

For 2, SVE is a very different programming model from the existing fixed sized vectors exposed by System.Runtime.Intrinsics. It is closer to, but not entirely aligned with, System.Numerics.Vector<T> instead. We had a similar situation with the masking support for AVX-512 and we opted to not expose it for .NET 8. Instead we opted to do pattern recognition and implicit lightup for the existing patterns users already have to use downlevel.

It is entirely possible that SVE will be in a similar boat and in order for the BCL and general user-base to best take advantage of it, we will need to consider a balance between the general programming model most user-code needs for other platforms and what exists in hardware directly.

That is, the simplest way to get SVE support available, without requiring all users to go rewrite their code and add completely separate code paths just for the subset of hardware with SVE support, would be to have SVE light up over Vector<T>. Many of the same predicate/masking patterns will already be getting recognized for AVX-512, and simply involve tracking when a given node returns a mask (such as comparisons) or takes a mask (such as ConditionalSelect).
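
To make that concrete, the kind of code that already exists downlevel and could light up implicitly looks something like this (a rough sketch using only the existing System.Numerics Vector<T> APIs; how the JIT would size Vector<T> or use SVE underneath is exactly the open design question, not something these APIs promise today):

public static int Sum(ReadOnlySpan<int> values)
{
    Vector<int> total = Vector<int>.Zero;
    int i = 0;

    // Main loop: one Vector<int> worth of elements per iteration. On SVE-capable
    // hardware the JIT could, in principle, size Vector<int> to the machine's
    // vector length and emit SVE loads/adds here without the user changing anything.
    for (; i <= values.Length - Vector<int>.Count; i += Vector<int>.Count)
    {
        total += new Vector<int>(values.Slice(i));
    }

    // Scalar tail: what users typically write today without masking support.
    int result = Vector.Sum(total);
    for (; i < values.Length; i++)
    {
        result += values[i];
    }

    return result;
}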

For 3, native SVE has many limitations around where the SVE vector types can be declared and how they are allowed to be used. Getting the same restrictions in managed would be very complex and not necessarily "pay for play". For the JIT, this is fine since things can be determined at runtime and match exactly. For AOT, this represents a problem since the sizes and other information isn't known until runtime which can complicate a number of scenarios.

TamarChristinaArm commented 1 year ago

@tannergooding Thanks for the initial feedback, just some quick responses:

For 1, there are only some mobile chips (Samsung/Qualcomm) and the AWS Graviton3 that currently support SVE or Armv9 at all. The latter is the only one that would be viable for CI and it would likely require some non-trivial effort to make happen.

There are actually some more that are available. If Linux would be an acceptable target I can send you an email with other options than the ones you have stated here.

It is entirely possible that SVE will be in a similar boat and in order for the BCL and general user-base to best take advantage of it, we will need to consider a balance between the general programming model most user-code needs for other platforms and what exists in hardware directly

Agreed 100%, though the purpose of this ticket is to start the discussions on the SVE intrinsics that would form the basis of Vector<T>. Our expectation is that Vector<T> will provide "good enough" support, but people who want to use the full capabilities of SVE would require direct usage of the intrinsics. This issue was more to highlight some of the difficulties in achieving this and to start the discussions around it. In some of the examples, for instance, we require explicit access to the CC flags, which affects which branch gets emitted.

Before doing a full SVE design and proposal, it's better for us to get some input from you folks on how you'd like these kinds of situations to be handled. From our last discussion we highlighted that SVE would need some additional JIT design work for workloads that Vector won't ever cover, such as SME. Therefore I think it's best to focus this issue on how we can even provide core Scalable vector support for current and future Arm architectures rather than how to support them in the generic wrappers. I believe one is the pre-requisite to the other?

For 3, native SVE has many limitations around where the SVE vector types can be declared and how they are allowed to be used. Getting the same restrictions in managed would be very complex and not necessarily "pay for play". For the JIT, this is fine since things can be determined at runtime and match exactly. For AOT, this represents a problem since the sizes and other information isn't known until runtime which can complicate a number of scenarios.

Agreed! One of the questions we raised last time was what constraints AOT operates under. If AOT is equivalent to a native compiler's -mcpu=native then this won't be an issue, as you'd know the vector length you're targeting.

If AOT is supposed to be platform agnostic, then the solution could be the one we discussed 2 years ago, where instead of assuming VLA, we assume a minimum vector length that the user selects. I.e. if they want full compatibility we select VL128 and simply predicate all SVE operations to that length.

This fixes the AOT issues since we'd never use more than VL sized vectors.

But really, to move forward with SVE support we need your input. SVE is much more than just a vector length play, and without support for it in .NET the ability to improve will be quite limited. SVE does have, and will continue to gain, capabilities that Advanced SIMD will never have.

tannergooding commented 1 year ago

There are actually some more that are available. If Linux would be an acceptable target I can send you an email with other options than the ones you have stated here.

Linux is great! We just really need to ensure there are options we can use for both development and validation. This namely means they can't be exclusively mobile devices (i.e. phones). Having a list of viable devices would be good; all the sources I've found for SVE or Armv9 have been phones, Android devices, the latest version of Graviton, or hardware not yet released for general use.

I believe one is the pre-requisite to the other?

It depends on the general approach taken. We could very likely expose some SVE/SVE2 support via Vector<T> or even Vector64/128/256/512<T> without exposing the raw intrinsic APIs.

This model is actually how AVX-512 was designed to be exposed for .NET 8. That is, we planned to ship the general purpose Vector512<T> and then implicit light-up for existing patterns over Vector128/256<T>. The support for VectorMask<T> and the platform specific APIs were an additional stretch goal. We ended up having enough time to do the platform specific APIs and to start work on VectorMask<T>, the latter of which we ultimately determined was not going to be a good fit and was cut for the time being.

We went with this approach because it allowed easier access to the acceleration for a broader audience, worked well when considering all platforms (not just x86/x64), allowed existing code to improve without explicit user action, and allowed developers to iteratively extend their existing algorithms without having to add massive amounts of new platform specific code.


If AOT is equivalent to a native compiler's -mcpu=native than this won't be an issue as you'd know the vector length you're targetting.

The default for most AOT would be platform agnostic. For example, x64 AOT still targets the 2004 baseline and has opportunistic light-up (that is allows for a cached runtime based check to access SSE3-SSE4.2). Arm64 AOT still targets the v8.0 baseline and likewise has opportunistic light-up for some scenarios.

predicate all SVE operations to this.

Predicated instructions have an increased cost over non-predicated ones, correct?


In particular, we have at least 3 main platforms to consider support around: Arm64, x64, and WASM.

There are also other platforms we support (Arm32, x86), platforms that are being brought online by the community (LoongArch, RISC-V), etc.

Most of these platforms have some amount of SIMD support and giving access to the platform specific APIs gives the greatest amount of power in your algorithms. Exposing this support long term is generally desirable.

However, an algorithm supporting every one of these platforms is also very expensive and most of the time the algorithms are fairly similar to each other regardless of the platform being targeted and there's often only minor differences between them. This is why the cross platform APIs were introduced in .NET 7, as it allowed us to take what was previously a minimum of 3 (Arm64, Fallback, x64) but sometimes more (different vector sizes, per ISA lightup, other platforms, etc) paths and merge them down to just 2 paths (Vector, Fallback) while still maintaining the perf or only losing perf within an acceptable margin.

To that end, the general direction we'd like to maintain is that developers can continue writing (for the most part) more general cross platform code for the bulk of their algorithms. They would then utilize the platform specific intrinsics for opportunistic light-up within these cross platform algorithms where it makes a significant impact. Extreme power users could continue writing individual algorithms that are fine-tuned for each platform/architecture, but it would not necessarily be the "core" scenario we expect most users to be targeting (namely due to the complexity around doing this).

I believe that ensuring SVE/SVE2 can fit into this model will help increase its adoption and usage, particularly among developers who may not have access to the full breadth of "latest hardware" on which to test their code for every platform/architecture. It will also help ensure that users don't necessarily need to go and rewrite their code to benefit, they can simply roll forward to the latest version of .NET and implicitly see performance wins. The BCL and other high performance libraries (including domain specific scenarios, such as ML.NET, Image Processing, etc), would update their own algorithms to make targeted usage of the platform specific APIs where they help achieve significantly more performance than the general purpose code.


I do think we should be able to make SVE/SVE2 integrate with the same general model and I believe we should be able to do this around Vector<T> rather than introducing a new Arm64 exclusive vector type.

Given the samples in the OP, there's this callout:

Ideally, the user would write the predicated version and the Jit would optimise to this if required.

I actually think it's the inverse: given the cross platform considerations, we want users to write code unpredicated with a tail loop. This is what they already have to do for NEON, it's what they have to do for WASM, it's what they have to do for SSE/AVX. This is also something we considered around AVX-512 masking support and one reason why we opted to not expose VectorMask<T> for right now.

Instead, we could express the code such as:

public static unsafe int sum_sve_unpredicated_loop(int* srcBytes, int length)
{
    Vector<int> total = Vector<int>.Zero;
    (int iterations, int remainder) = int.DivRem(length, Vector<int>.Count);

    for (int i = 0; i < iterations; i++)
    {
        total += Vector.Load(srcBytes);
        srcBytes += Vector<int>.Count;
    }

    // Predicated tail.
    Vector<int> pred = Vector.CreateTrailingMask(remainder);
    total += Vector.MaskedLoad(pred, srcBytes, Vector<int>.Zero);

    return Vector.Sum(total);
}

We would still of course expose Sve.Add, Sve.AddAcross, Sve.LoadVector, Sve.WhileLessThan, etc. However, users wouldn't have to use them; they would just be an additional option if they wanted finer-grained control over what codegen they'd get.

There are notably a few other ways this could be done as well, but the general point is that this results in very natural and easy to write code that works on any of the platforms, including on hardware that only supports NEON, without users having to write additional and separate versions of their algorithms to achieve it.

There is a note that CreateTrailingMask, GreaterThan, etc all return a Vector<T> and not a VectorMask<T> (or VectorSVEPred). This is something that we're doing with AVX-512 due to the consideration that VectorMask<T> could not be easily accelerated on downlevel platforms and that it would effectively need to be a Vector<T> in those cases anyways. Such a restriction meant users would have to explicitly rewrite their algorithms to use masking, and that was going to hinder adoption overall.

The JIT handles this difference from hardware in lowering by doing some tracking to know if the value is actually a mask or vector and whether the consumer needs a mask or vector. This then translates very cleanly to existing patterns such as ConditionalSelect(mask, x + y, mergeVector), which can be trivially recognized and emitted as a MergeAdd instead. The same support translates to all other intrinsic APIs that support masking, so we didn't need to expose an additional 3800 API overloads to support masking (this was the minimum actual count of additional APIs we'd have had to expose to exactly match what hardware supports).
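
As a rough sketch of that pattern (using the existing cross-platform Vector128 APIs from System.Runtime.Intrinsics; whether a particular JIT version folds it into a single masked instruction is an implementation detail, not a guarantee):

static Vector128<int> AddWherePositive(Vector128<int> total, Vector128<int> vec)
{
    // Downlevel, the "mask" is just a vector of all-ones / all-zeros lanes.
    Vector128<int> mask = Vector128.GreaterThan(vec, Vector128<int>.Zero);

    // ConditionalSelect(mask, x + y, mergeVector): on mask-capable hardware
    // (AVX-512, SVE) the JIT can recognise this shape and emit a merged/predicated
    // add; elsewhere it lowers to a compare, an add and a bitwise select.
    return Vector128.ConditionalSelect(mask, total + vec, total);
}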

Note that this decision also doesn't prevent us from adding such support in the future. It just greatly simplified what was required today and made it much faster and more feasible to get the general support out while also keeping the implementation cost low.

We could also add support for the fully predicated loops and optimizing them to be rewritten as unpredicated with a post predicated handler. But that is more complicated and I believe would be better handled as a future consideration after we get the core support out.

I'd like to see us add SVE support, and so I'd hope that going along the avenues we've found to work, and which help with adoption and implicit light-up, would also apply here.

a74nh commented 1 year ago

It's worth noting that this is not actionable as an API proposal in its current setup.

Agreed, this seemed the closest category. Given that this won't turn into a full API proposal, I could just drop the category and move to a normal issue?

Predicated instructions have an increased cost over non-predicated ones, correct?

For now, yes.

Arm64 AOT still targets the v8.0 baseline and likewise has opportunistic light-up for some scenarios.

Having AOT Arm64 target 8.0 feels sensible. For ReadyToRun, it would then tier appropriately to what is available. For the "no JIT, only AOT" option, maybe a command line option at compile time - essentially the same as the gcc/llvm -march flags.


There is a note that CreateTrailingMask, GreaterThan, etc all return a Vector<T> and not a VectorMask<T> (or VectorSVEPred). This is something that we're doing with AVX-512 due to the consideration that VectorMask<T> could not be easily accelerated on downlevel platforms and that it would effectively need to be a Vector<T> in those cases anyways. Such a restriction meant users would have to explicitly rewrite their algorithms to use masking, and that was going to hinder adoption overall.

Without explicit VectorMask types you're losing type checking. Could VectorMask just be a wrapper over Vector?

On downlevel platforms, the tail is being turned into a scalar loop? Therefore on simple loops, the mask is effectively optimised away, right?

I actually think it's the inverse: given the cross platform considerations, we want users to write code unpredicated with a tail loop. This is what they already have to do for NEON, it's what they have to do for WASM, it's what they have to do for SSE/AVX. This is also something we considered around AVX-512 masking support and one reason why we opted to not expose VectorMask<T> for right now.

That makes sense when coming at it from the generic side.

From an implementation side, I'm thinking that predication makes things easier. The user writes a single predicated loop. There is enough information in the loop for the JIT to 1) create a predicated loop with no tail or 2) create an unpredicated main loop and predicated tail or 3) create an unpredicated main loop and scalar tail.

I understand users are not used to writing with predication yet. But, in the example, the user still has to use predication for the tail. Is that why VectorMask<T> was dropped for now - to make it easier for users?

Another option maybe would be to never have Vector<T> supporting masks/predication. Tails would always be scalar. All platforms can be supported easily. Then later a Vector<P> is added which looks similar, but does everything using predication - and the JIT turns it into the correct thing on the platform. That would skip ever having unpredicated loops with predicated tail, but maybe that's ok.

tannergooding commented 1 year ago

For the "no JIT, only AOT" option, maybe a command line option at compile time - essentially the same as the gcc/llvm -march flags.

We have this already, and allow targeting specific versions (armv8.1, armv8.2, etc) or specific ISAs that are currently supported.

Could VectorMask just be a wrapper over Vector?

Not cheaply/easily and not without requiring users go and explicitly opt into the use of masking.

The general point is that on downlevel hardware (NEON, SSE-AVX2, etc) there is no masking, you only have Vector<T>. If you want to do masking, you have to use explicit APIs like ConditionalSelect or MaskMove and so this is what developers are already doing.

It is fairly trivial for the JIT to recognize and internally do the correct type tracking. It is also fairly trivial for the JIT to insert the relevant implicit conversions if an API expects a mask and was given a vector; or if it expects a vector and was given a mask. One of the reasons this is cheap and easy to do is because it is completely non-observable to whoever is writing the algorithm and so the JIT can do this in a very "pay for play" manner.

If we inverse things, such that we expose VectorMask publicly then we start having to worry about observable differences in various edge cases. We start having to expose overloads that explicitly take the mask, which means most new APIs now have 2x the overloads. We also have to do significantly more work to ensure that the right conversions and optimizations are being done such that it remains efficient, etc.

On downlevel platforms, the tail is being turned into a scalar loop? Therefore on simple loops, the mask is effectively optimised away, right?

Users do different things for different scenarios. Some users simply defer back to the scalar path. Some will "backtrack" and do 1 more iteration when the operation is idempotent. Some use explicit masked operations to ensure the operation remains idempotent.
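
As a rough sketch of the "backtrack" variant for an idempotent operation (here a horizontal max, chosen purely for illustration and written with the existing cross-platform Vector128 APIs), the last full vector is simply re-done so it ends exactly at the end of the data:

static int Max(ReadOnlySpan<int> values)
{
    // Assumes values.Length >= Vector128<int>.Count, for brevity.
    ref int src = ref MemoryMarshal.GetReference(values);
    Vector128<int> max = Vector128.Create(int.MinValue);
    int i = 0;

    for (; i <= values.Length - Vector128<int>.Count; i += Vector128<int>.Count)
    {
        max = Vector128.Max(max, Vector128.LoadUnsafe(ref src, (nuint)i));
    }

    if (i != values.Length)
    {
        // Backtrack: redo one full vector aligned to the end of the data. Some
        // elements are processed twice, which is safe because Max is idempotent.
        max = Vector128.Max(max, Vector128.LoadUnsafe(ref src, (nuint)(values.Length - Vector128<int>.Count)));
    }

    // Reduce the lanes to a scalar.
    int result = int.MinValue;
    for (int lane = 0; lane < Vector128<int>.Count; lane++)
    {
        result = Math.Max(result, max.GetElement(lane));
    }
    return result;
}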

From an implementation side, I'm thinking that predication makes things easier. The user writes a single predicated loop. There is enough information in the loop for the jit to 1) create a predicated loop with no tail or 2) create unpredicated main loop and predicated tail or 3) create unpredicated main loop and scalar tail.

I'd disagree on this. The general "how to write vectorized code" steps are:

  1. Write your algorithm as you would normally (this is scalar, doing 1 operation per iteration)
  2. Unroll your loop to do n operations per iteration (where n is typically sizeof(Vector) / sizeof(T), that is VectorXXX<T>.Count)
  3. Squash your n operations into 1 SIMD operation per iteration (e.g. n scalar Add become 1x SIMD Add)

Predication/masking is effectively a more advanced scenario that can be used to handling more complex scenarios such as branching or handling the tail (remaining) elements that don't fill the vector. It's not something that's actually needed super often and generally represents a minority of the overall SIMD code you write.
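
To make those steps concrete, here's a sketch of step 1 and step 3 for a simple element-wise add using Vector<T> (the unrolled intermediate step 2 is omitted; arrays are assumed to be equal length):

// Step 1: scalar, one operation per iteration.
static void AddScalar(int[] a, int[] b, int[] dst)
{
    for (int i = 0; i < a.Length; i++)
    {
        dst[i] = a[i] + b[i];
    }
}

// Step 3: the same loop doing Vector<int>.Count operations per iteration,
// with the leftover elements handled by a scalar tail.
static void AddVectorized(int[] a, int[] b, int[] dst)
{
    int i = 0;
    for (; i <= a.Length - Vector<int>.Count; i += Vector<int>.Count)
    {
        (new Vector<int>(a, i) + new Vector<int>(b, i)).CopyTo(dst, i);
    }
    for (; i < a.Length; i++)
    {
        dst[i] = a[i] + b[i];
    }
}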

I understand users are not used to writing with predication yet. But, in the example, the user still has to use predication for the tail. Is that why VectorMask was dropped for now - to make it easier for users?

Right. We opted to drop VectorMask<T> (from the AVX-512 support) for now because it was significantly less work for the JIT and also greatly simplified the end user experience. It made it so that users don't have to change how they are thinking about or writing their code at all, they simply continue doing what they've already been doing and it happens to take advantage of the underlying hardware if the functionality is available.

It also means that we don't have to do more complex handling for downlevel hardware (which is currently the far more common scenario) and can instead restrict it only to the newer hardware. It means that we won't typically encounter unnecessary predication which means we don't have to worry about optimizing the unnecessary bits away. It then also simplifies the general handling of loops as we don't have to consider predication as part of loop cloning, loop hoisting, or other optimizations in general.

Another option maybe would be to never have Vector supporting masks/predication. Tails would always be scalar. All platforms can be supported easily.

Users are already sometimes writing predicated tails today, so I don't see the need to block this. I think it would be better to just expose the couple additional APIs that would help with writing predicated tails instead. Exposing a nearly identical type that supports predication would just make adding predication support harder and would increase the amount of code the typical user needs to maintain.

Without explicit VectorMask types you're losing type checking.

In my opinion, this is largely a non-issue. Downlevel users already need to do their masking/predication using VectorXXX<T> and so they're already used to Vector128.GreaterThan returning a Vector128<T> and not a VectorMask128<T>.

The JIT is then fully capable of knowing which methods return a "mask like value", which we already minimally track on downlevel hardware to allow some special optimizations.

The JIT is also fully capable of correctly tracking the type internally as TYP_MASK (rather than TYP_SIMD16) when it is not just "mask like" but an actual mask (for AVX-512 or SVE capable hardware). This makes the fact that it is actually a mask completely invisible to the end user.

The only requirement for the JIT here is that if it has a TYP_MASK and the API expects a TYP_SIMD16 (or vice-versa) it needs to insert an implicit conversion. On AVX-512, there are dedicated instructions for this. I don't believe there are for SVE, but there are still some options available that can make this very cheap/feasible.

A managed analyzer can then correctly surface to the end user when they are inefficiently handling masking (e.g. passing in a mask to something that doesn't expect one, or vice versa). This is very similar to the [ConstantExpected] analyzer and would likely not trigger for typical code users are writing.

TamarChristinaArm commented 1 year ago

There are actually some more that are available. If Linux would be an acceptable target I can send you an email with other options than the ones you have stated here.

Linux is great! We just really need to ensure there are options we can use for both development and validation. This namely means they can't be exclusively mobile devices (i.e. phones). Having a list of viable devices would be good; all the sources I've found for SVE or Armv9 have been phones, Android devices, the latest version of Graviton, or hardware not yet released for general use.

Great, I've reached out to some people internally and will get back to you.

We went with this approach because it allowed easier access to the acceleration for a broader audience, worked well when considering all platforms (not just x86/x64), allowed existing code to improve without explicit user action, and allowed developers to iteratively extend their existing algorithms without having to add massive amounts of new platform specific code.

I don't disagree, but I don't see why you wouldn't want to give people who want the full benefit of it the ability to avoid generic code. For that reason we don't even see SVE and NEON as orthogonal things, and we have code that dips between the two, sometimes on an instruction basis, depending on what you need (https://arm-software.github.io/acle/main/acle.html#arm_neon_sve_bridgeh), so I think you're designing yourself into a box by not exposing SVE directly as well.

If AOT is equivalent to a native compiler's -mcpu=native then this won't be an issue, as you'd know the vector length you're targeting.

The default for most AOT would be platform agnostic. For example, x64 AOT still targets the 2004 baseline and has opportunistic light-up (that is allows for a cached runtime based check to access SSE3-SSE4.2). Arm64 AOT still targets the v8.0 baseline and likewise has opportunistic light-up for some scenarios.

predicate all SVE operations to this.

Predicated instructions have an increased cost over non-predicated ones, correct?

Yes indeed, but this cost is amortized over the ability to vectorize code that you can't with Advanced SIMD. Too much emphasis is placed on vector length with SVE, and people ignore the increased vector footprint SVE brings. Also, as @a74nh said, the costs aren't as static as the optimization guides make them look.

To that end, the general direction we'd like to maintain is that developers can continue writing (for the most part) more general cross platform code for the bulk of their algorithms. They would then utilize the platform specific intrinsics for opportunistic light-up within these cross platform algorithms where it makes a significant impact. Extreme power users could continue writing individual algorithms that are fine-tuned for each platform/architecture, but it would not necessarily be the "core" scenario we expect most users to be targeting (namely due to the complexity around doing this).

I believe that ensuring SVE/SVE2 can fit into this model will help increase its adoption and usage, particularly among developers who may not have access to the full breadth of "latest hardware" on which to test their code for every platform/architecture. It will also help ensure that users don't necessarily need to go and rewrite their code to benefit, they can simply roll forward to the latest version of .NET and implicitly see performance wins. The BCL and other high performance libraries (including domain specific scenarios, such as ML.NET, Image Processing, etc), would update their own algorithms to make targeted usage of the platform specific APIs where they help achieve significantly more performance than the general purpose code.

I do think we should be able to make SVE/SVE2 integrate with the same general model and I believe we should be able to do this around Vector<T> rather than introducing a new Arm64 exclusive vector type.

I don't think we're disagreeing here. All we want to do with this issue, though, is highlight that we need to have a way to use SVE directly for those that do want to. The generic wrappers will always have some overhead, and the way people write code using them won't be a pattern that any core will be optimizing for, as it's not the way Arm promotes SVE nor the way we've been porting high profile code such as codecs.

Given the samples in the OP, there's this callout:

Ideally, the user would write the predicated version and the Jit would optimise to this if required.

I actually think it's the inverse: given the cross platform considerations, we want users to write code unpredicated with a tail loop. This is what they already have to do for NEON, it's what they have to do for WASM, it's what they have to do for SSE/AVX. This is also something we considered around AVX-512 masking support and one reason why we opted to not expose VectorMask<T> for right now.

Well for one, AVX-512 masking isn't as first class as SVE's; just compare the number of operations on masks between the two ISAs. Secondly, SVE simply does not have unpredicated equivalents of all instructions; while SVE2 adds some more, it's not a 1-1 thing. So even for your "unpredicated" loop you'll end up needing to predicate the instructions you use.

So going from predicated to unpredicated makes much more sense.

Instead, we could express the code such as:

public static unsafe int sum_sve_unpredicated_loop(int* srcBytes, int length)
{
    Vector<int> total = Vector<int>.Zero;
    (int iterations, int remainder) = int.DivRem(length, Vector<int>.Count);

    for (int i = 0; i < iterations; i++)
    {
        total += Vector.Load(srcBytes);
        srcBytes += Vector<int>.Count;
    }

    // Predicated tail.
    Vector<int> pred = Vector.CreateTrailingMask(remainder);
    total += Vector.MaskedLoad(pred, srcBytes, Vector<int>.Zero);

    return Vector.Sum(total);
}

We would still of course expose Sve.Add, Sve.AddAcross, Sve.LoadVector, Sve.WhileLessThan, etc. However, users wouldn't have to use them; they would just be an additional option if they wanted finer-grained control over what codegen they'd get.

There are notably a few other ways this could be done as well, but the general point is that this results in very natural and easy to write code that works on any of the platforms, including on hardware that only supports NEON, without users having to write additional and separate versions of their algorithms to achieve it.

I don't really see how this would extend to non-masked ISAs. There's nothing you can do for the predicated tail on Advanced SIMD here; you're going to have to generate scalar. At the very most you can generate a vector epilogue with a scalar tail. But you could have done all that from a single loop anyway.

I think we've discussed this before, I still don't see why you can't generate the loop above from a more generic VLA friendly representation. It's just peeling the last vector iteration. I would argue that's easier to write for people as well, and easier to maintain since you don't need to have two instances of your loop body to maintain.

There is a note that CreateTrailingMask, GreaterThan, etc all return a Vector<T> and not a VectorMask<T> (or VectorSVEPred). This is something that we're doing with AVX-512 due to the consideration that VectorMask<T> could not be easily accelerated on downlevel platforms and that it would effectively need to be a Vector<T> in those cases anyways. Such a restriction meant users would have to explicitly rewrite their algorithms to use masking, and that was going to hinder adoption overall.

I'm having trouble grokking this part :) What would T be here? The same type as the input, i.e. elements? A mask bit? Can you give an example with _mm512_cmp_epi32_mask? What would, say,

if (a[i] > 0 && b[i] < 0)

look like when vectorized in this representation?

The JIT handles this difference from hardware in lowering by doing some tracking to know if the value is actually a mask or vector and whether the consumer needs a mask or vector. This then translated very cleanly to existing patterns such as ConditionalSelect(mask, x + y, mergeVector) which can be trivially recognized and emitted as a MergeAdd instead. The same support translates to all other intrinsic APIs that support masking, so we didn't need to expose an additional 3800 API overloads to support masking (this was the minimum actual count of additional API's we'd of had to expose to exactly match what hardware supports).

Right, so a halfway step between autovec and intrinsics. That's fine; in C code we treat predication as just normal masking anyway. Typically we don't carry additional IL for it, just normal boolean operations on vector booleans.

We could also add support for the fully predicated loops and optimizing them to be rewritten as unpredicated with a post predicated handler. But that is more complicated and I believe would be better handled as a future consideration after we get the core support out.

Sure, but my worry here is that if you get people to rewrite their loops once using the unpredicated-main-body-plus-predicated-tail approach, would they really rewrite them later again? I can't speak to whether it's more work or not obviously, but especially in the case of SVE1 you'll have to forcibly predicate many instructions anyway to generate the "unpredicated" loop.

I'd like to see us add SVE support and so I'd hope going along the avenues we've found to work and help with adoption and implicit light up would also apply here.

:)

TamarChristinaArm commented 1 year ago

If we inverse things, such that we expose VectorMask publicly then we start having to worry about observable differences in various edge cases. We start having to expose overloads that explicitly take the mask, which means most new APIs now have 2x the overloads. We also have to do significantly more work to ensure that the right conversions and optimizations are being done such that it remains efficient, etc.

Sure, but to again use a simple example, how would something like a non-ifconvertible conditional look? i.e.

if (a[i] > 0) {
  b[i] = c[i] * n;
  a[i] += b[i] - c[i];
  ...
}

without exposing vector mask, how does one write this?

From an implementation side, I'm thinking that predication makes things easier. The user writes a single predicated loop. There is enough information in the loop for the jit to 1) create a predicated loop with no tail or 2) create unpredicated main loop and predicated tail or 3) create unpredicated main loop and scalar tail.

I'd disagree on this. The general "how to write vectorized code" steps are:

  1. Write your algorithm as you would normally (this is scalar, doing 1 operation per iteration)
  2. Unroll your loop to do n operations per iteration (where n is typically sizeof(Vector) / sizeof(T), that is VectorXXX<T>.Count)
  3. Squash your n operations into 1 SIMD operation per iteration (e.g. n scalar Add become 1x SIMD Add)

Predication/masking is effectively a more advanced scenario that can be used to handling more complex scenarios such as branching or handling the tail (remaining) elements that don't fill the vector. It's not something that's actually needed super often and generally represents a minority of the overall SIMD code you write.

I'd disagree with this :) Yes this is true for VLS but not VLA. The entire point of VLA is that you don't have to think of scalar at all, and conversely you don't need to think of vector length. For VLA the expectation is that

  1. write your algorithm as scalar doing 1 operation per iteration.
  2. write your algorithm as vector, doing n (unknown, vector-length dependent) operations per iteration.

VLA is supposed to map more closely to an intuitive loop design where you're supposed to be able to map directly from scalar to vector without worrying about buffer overruns, overreads etc.
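
In terms of the hypothetical API sketched at the top of this issue (none of the Sve.* names exist today), that scalar-to-VLA mapping looks roughly like:

// 1. Scalar: one element per iteration.
public static unsafe int SumScalar(int* src, int length)
{
    int total = 0;
    for (int i = 0; i < length; i++)
    {
        total += src[i];
    }
    return total;
}

// 2. VLA vector: an unknown (vector-length dependent) number of elements per
// iteration. No separate tail and no risk of overreads: lanes past `length`
// are simply inactive in the predicate produced by WhileLessThan.
public static unsafe int SumVla(int* src, int length)
{
    VectorSVE<int> total = Sve.Create(0);
    VectorSVEPred pred;
    for (int i = 0; Sve.WhileLessThan(out pred, i, length); i += Sve.VectorLength<int>())
    {
        total = Sve.MergeAdd(pred, total, Sve.LoadUnsafe(pred, ref *src, i));
    }
    return Sve.AddAcross(total).ToScalar();
}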

Without explicit VectorMask types you're losing type checking.

In my opinion, this is largely a non-issue. Downlevel users already need to do their masking/predication using VectorXXX<T> and so they're already used to Vector128.GreaterThan returning a Vector128<T> and not a VectorMask128<T>.

This has been hurting my head tbh :) I've been having trouble understanding what T would mean in this context. You know your users better than I do, but if I'm someone who knows the architecture, this would confuse me. I keep thinking this returns elements I can use directly. Could this not be a type alias?

The only requirement for the JIT here is that if it has a TYP_MASK and the API expects a TYP_SIMD16 (or vice-versa) it needs to insert an implicit conversion. On AVX-512, there are dedicated instructions for this. I don't believe there are for SVE, but there are still some options available that can make this very cheap/feasible.

Sorry, I don't quite follow this one. So if a user does

(a[i] > 0) + b[i]

what happens here? Yes, it's nonsensical, but the type returned by the comparison indicates it should do something sane.

How about cases where it's ambiguous

(a[i] > 0) & b[i]

is this a predicate combination or a bitwise AND of values?

tannergooding commented 1 year ago

I think there's maybe some misunderstanding/miscommunication going on, so I'd like to try and reiterate my side a bit.

There are effectively 2 target audiences we want to capture here, noting that in both of these cases the users will likely need to be considering multiple platforms (Arm64, x64, WASM, etc):

  1. Extreme power users who are wanting to squeeze every ounce of performance out of their code
  2. A much broader audience of perf minded users who want to provide good performance improvements to their important/hot code

To achieve 1, we have a need to expose "platform specific APIs". This includes Arm.AdvSimd today and will include Arm.Sve/Arm.Sve2 in the future. It also includes Wasm.PackedSimd, x86.Sse/x86.Avx/x86.Avx512F, and may include other platforms in the future (RiscV, LoongArch, etc).

To achieve 2, we have a need to expose "cross platform APIs". This is primarily Vector<T>, Vector128<T>, and the various APIs these types expose that are trivially available or implementable across all platforms. We also provide a number of APIs on System.Span<T> which are internally accelerated to provide optimized implementations of commonly vectorizable functionality.


For the first audience, these developers are likely more familiar with SIMD and the specific underlying platform they're targeting. They are likely willing to write code for each target platform and even code per ISA to get every bit of perf out. Since doing this can require a lot of platform specific knowledge/expertise, can require significant testing and hardware to show the wins, and since it may not provide significant real world wins in all areas/domains, it is not something that a typical developer is likely going to be doing.

For the second audience, these developers may only be familiar with 1 target platform (i.e. only familiar with Arm64 or only familiar with x64). They may only be familiar with a subset of ISAs on a given platform (i.e. they know all about NEON, but not about SVE). They may even just be learning SIMD for the first time. These developers would like to see perf improvements across a broad range of platforms without the need to put in significant work or maintain multiple similar, but subtly different complex implementations and so are willing to lose some percentage of the total potential perf.


We want and need to target both of these audiences. However, it is very important to keep in mind that the second audience is a significantly larger target and we will see the most overall benefit across the ecosystem by ensuring new ISAs can be successfully used from here. If something like SVE is only available via platform-specific usage, then there are many libraries that could have been benefiting from SVE that never will.

To that end, I want to ensure that we expose SVE in a way that can most easily benefit both of these audiences. I'd also like to ensure that what we expose for the first audience doesn't significantly increase the implementation complexity of the BCL or JIT and that it remains "pay for play". To that end, for public static APIs exposed under the various System.Runtime.Intrinsics namespaces, we currently have:

This general need to be pay for play and to not significantly increase complexity sometimes necessitates thinking about alternative ways to expose functionality. We also needed to factor in that newer ISAs are typically only available in a minority of hardware, that it can take significant time for market saturation (such that you have a significantly higher chance of encountering the ISA), and that having a significantly different programming model for newer ISAs or functionality can hinder adoption and integration with general algorithms, particularly for audience 2.

For example, we've been working on exposing Avx512 support in .NET 8, and if we mirrored the way C/C++ expose the surface area, we would have been adding somewhere between 1900 and 3800 new APIs to support masking/predication. This was in fact how our initial design was laid out: we defined VectorMask<T> types and planned on exposing APIs like public static Vector128<float> Add(Vector128<float> mergeValues, Vector128Mask<float> mergeMask, Vector128<float> left, Vector128<float> right);.

However, it was quickly found that exposing Vector128Mask<T> represented a large number of problems, including but not limited to:

  1. It would require both audiences to learn an entirely new programming model for SIMD
  2. The programming model in question was only applicable to a subset of hardware
  3. It wasn't clear that the programming model could easily be made efficient on downlevel hardware without native support
  4. The general support was showing to be very non pay for play due to the additional type recognition and end to end integration the JIT was going to require
  5. We were going to be doubling or tripling our currently exposed API surface area for x86
  6. Existing code could not benefit and all existing SIMD algorithms would need to be adjusted or rewritten to take advantage

Given this, we took a step back and considered if we could achieve this in an alternative fashion. We considered how users, of both audiences, have to achieve the equivalent of masking support today on downlevel ISAs, what was feasible for the JIT to recognize/integrate, the costs/impact of the different approaches for both audiences, etc. Ultimately, we came to the conclusion that we could get all the same functionality a different way and could ensure it was explicitly documented.

One of the primary considerations was that public static Vector128<float> Add(Vector128<float> mergeValues, Vector128Mask<float> mergeMask, Vector128<float> left, Vector128<float> right); was simply equivalent to:

Vector128<float> result = Add(left, right);
return ConditionalSelect(mergeMask, result, mergeValues);

Likewise, that all downlevel platforms without explicit masking/predication support currently have their "comparison" and other similar intrinsics return a Vector<T> result where each element is either AllBitsSet (true) or Zero (false). This then trivially translates into the existing API surface and how developers currently handle masking and mask like functionality.

So the general thought process was that if we simply preserve this model, then we can allow existing code patterns to trivially light up on hardware that supports masking/predication. We likewise can avoid exposing several thousand new API overloads that explicitly take masking parameters in favor of pattern recognition over ConditionalSelect(mergeMask, result, mergeValues).
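As a minimal sketch (using only the existing Vector128 APIs), the merged add above can be written today as:

using System.Runtime.Intrinsics;

// Sketch only: a merged add expressed with the existing cross-platform surface.
// mergeMask would typically come from a comparison such as Vector128.GreaterThan,
// so each element is either AllBitsSet or Zero.
static Vector128<float> MergeAdd(Vector128<float> mergeValues, Vector128<float> mergeMask,
                                 Vector128<float> left, Vector128<float> right)
{
    Vector128<float> result = Vector128.Add(left, right);
    return Vector128.ConditionalSelect(mergeMask, result, mergeValues);
}

The intent is that the JIT recognizes this ConditionalSelect-over-a-mask pattern and folds it into a single masked instruction on capable hardware, while downlevel it remains a compare + blend.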

The JIT would still have a TYP_MASK internally, but it would no longer have to do expensive tracking/recognition of managed types. Instead, we would take the existing ReturnsPerElementMask flags that were already being used to optimize some downlevel paths and use it to determine whether the IR node returns TYP_SIMD or TYP_MASK. The JIT would have a special helper that converts results from TYP_MASK up to TYP_SIMD and this would be the default. APIs that expect a TYP_MASK would be able to remove this helper call and consume the mask directly.

This resulted in something that was incredibly pay for play, had immediate benefits showing up to existing workloads without needing to touch the managed SIMD algorithm, and which still generated the correct and expected code. We simply need to continue expanding the pattern recognition to the full set of mask patterns.

My expectation is that a similar model would work very well for SVE/SVE2 and would allow its implicit usage in most existing workloads. It would likewise allow significantly easier integration of an SVE specific path into the same existing workloads where the developer is in the first audience and to do so without limiting the "full benefit".

tannergooding commented 1 year ago

so I think you're designing yourself into a box by not exposing SVE directly as well.

The intent was not to "not expose SVE". It was simply a question of whether SVE could be exposed slightly differently to better work for both audiences while allowing it to be more pay for play. This very much aligns with the expectation that developers will want and need to dip into both, sometimes even in the same algorithm.

Yes indeed, but this cost is amortized over the ability to vectorize code that you can't with Advanced SIMD.

Sorry, rather I meant that for many instructions SVE has both predicated and unpredicated versions. The use of the unpredicated version is preferred over the use of the predicated version when the predicate would be PTRUE P0, ALL (all elements are active). This is of course not possible for instructions which only have a predicated version.

AVX-512 operates in a similar manner but they have a special register (K0) which is used to indicate "no masking". Such instructions typically execute at least 1-cycle faster. Some hardware also have optimizations where they recognize "AllBitsSet" or "Zero" constants for registers and use that to improve execution.

Too much emphasis is placed on Vector Length with SVE and people ignore the increased vector footprint SVE brings.

Right. The same is true of AVX-512 where many people put a large emphasis on the 512-bit width and miss the incredibly expanded range of instruction support (including for 128-bit and 256-bit vectors) or the additional functionality support (including masking).

All we want to do with this issue though is highlight that we need to have a way to use SVE directly for those that do want to. The generic wrappers will always have some overhead, and the way people write code using them won't be a pattern that any core will be optimizing for, as it's not the way Arm promotes SVE nor the way we've been porting high profile code such as codecs.

👍. What I'm trying to emphasize is that as part of the design we really need to consider how the functionality is available for the second audience. If we only consider one side or the other, then we are likely to end up with something that is suboptimal.

There will be quite a lot of code and codebases that will want to take advantage of SVE without necessarily writing algorithms that explicitly use SVE. This includes many cases in the BCL when you consider, for example, something like IndexOf, Equals, or Sum where they are core and general purpose algorithms that need to be accelerated everywhere. None of these represent cases that can only be accelerated with SVE and none of them represent cases where SVE will provide a significant performance gain over NEON (outside the case where the vector length is larger). Many of them represent idempotent algorithms with little to no branching or other special considerations.

For such cases, we are even looking at ways to further improve the handling and collapse the algorithms further, such as via an ISimdVector interface that will allow you to write 1 algorithm that can still be used with explicitly sized vectors so you get the "best of both worlds". There is also often identical codegen between the "platform specific" and "cross platform" APIs in these cases as most of the operations being done are trivial and 1-to-1.

There will still be plenty of cases where developers will want or need SVE for Arm64, particularly in more domain specific areas like Machine Learning, Image Processing, etc. These are particularly prevalent for functionality that is not cross platform or which is only on a couple of the total target platforms (the matrix ISA extensions, half-precision floating-point, etc). However, developers will also have a general desire to ensure those same workloads are also accelerated on other platforms (x64) or on downlevel hardware where feasible, even if the degree of performance isn't on the same level and so making it easier to integrate between the two is often a plus.

Well for one, AVX-512 masking isn't as first class as SVE's. just compare the number of operations on masks between both ISAs for instance.

There's always going to be pros and cons for each platform. There is functionality that Arm64 has that x64 does not, but also functionality that x64 has which Arm64 does not. There are places where Arm64 makes some operations significantly more trivial to do and there are places where x64 does the same over Arm64. It's a constant battle for both sides, but developers are going to want/need to support both. .NET therefore has a desire to allow developers to achieve the best on both as well, while also considering how developers can succeed while needing to support the ever-growing sets of ISAs, platforms, and targets.

Secondly SVE simply does not have unpredicated equivalences of all instructions, while SVE2 adds some more it's not a 1-1 thing. So even for your "unpredicated" loop you'll end up needing to predicate the instructions for use.

Right, but the consideration is that the predicate will often be PTRUE P#, ALL. Consider for example that FADD has both a predicated and unpredicated version. For the core loop of many algorithms you'll simply want to use the unpredicated version. However, FABS appears to only have a predicated version, so the typical value passed in for the same loop would be a hoisted PTRUE P#, ALL.

For these APIs, to be "strictly compatible" with what hardware defines, we'd define and expose the following (matching native):

VectorSve<float> Add(VectorSve<float> op1, VectorSve<float> op2);
VectorSve<float> AddMerge(VectorSvePredicate pg, VectorSve<float> op1, VectorSve<float> op2);
VectorSve<float> AddMergeUnsafe(VectorSvePredicate pg, VectorSve<float> op1, VectorSve<float> op2);
VectorSve<float> AddMergeZero(VectorSvePredicate pg, VectorSve<float> op1, VectorSve<float> op2);

VectorSve<float> AbsMerge(VectorSve<float> inactive, VectorSvePredicate pg, VectorSve<float> op1);
VectorSve<float> AbsMergeUnsafe(VectorSvePredicate pg, VectorSve<float> op1);
VectorSve<float> AbsMergeZero(VectorSvePredicate pg, VectorSve<float> op1);

Here *Merge takes the inactive value from inactive if provided, or from op1 otherwise; *MergeUnsafe takes the inactive value from whatever already exists in the destination register; and *MergeZero takes the inactive value to be zero.

This pattern repeats for most instructions, meaning we'd have 3-4x new APIs per instruction, giving us the same API explosion we were going to have for Avx512. It also represents a lot of new concepts for the user to consider and makes some APIs inconsistent.

We could collapse this quite a bit by instead exposing:

VectorSve<float> Add(VectorSve<float> op1, VectorSve<float> op2);
VectorSve<float> AddMerge(VectorSve<float> inactive, VectorSvePredicate pg, VectorSve<float> op1, VectorSve<float> op2);

VectorSve<float> Abs(VectorSve<float> op1);
VectorSve<float> AbsMerge(VectorSve<float> inactive, VectorSvePredicate pg, VectorSve<float> op1);

The JIT could then trivially recognize the following:

AddMerge(op1, pg, op1, op2) == AddMerge(pg, op1, op2)
AddMerge(VectorSve<float>.Zero, pg, op1, op2) == AddMergeZero(pg, op1, op2)

AbsMerge(op1) == AbsMerge(VectorSvePredicate.All, op1)

Much as AddMergeZero (svadd_f32_z) is handled in native, inactive being something other than op1 or op2 means the instruction becomes RMW and we need an additional MOVPRFX emitted. This is trivially handled and allows any value to be used, thus not only simplifying what is exposed to users but also expanding what is trivially supported as compared to native.

We could then consider if this can be collapsed a bit more. We already have a variable width vector (Vector<T>), so consider exposing it as:

Vector<float> Add(Vector<float> op1, Vector<float> op2);
Vector<float> AddMerge(Vector<float> inactive, VectorSvePredicate pg, Vector<float> op1, Vector<float> op2);

Vector<float> Abs(Vector<float> op1);
Vector<float> AbsMerge(Vector<float> inactive, VectorSvePredicate pg, Vector<float> op1);

If we then take it a step further and do the same thing as what we've done with Avx512 then we simply have:

Vector<float> Add(Vector<float> op1, Vector<float> op2);
Vector<float> Abs(Vector<float> op1);
Vector<float> Select(VectorSvePredicate pg, Vector<float> op1, Vector<float> op2);

Developers would then access the predicated functionality as:

Select(pg, Add(op1, op2), inactive) == AddMerge(inactive, pg, op1, op2)

This means that we only have 1 new API to expose per SVE instruction, again while expanding the total support available, and centralizing the general pattern recognition required (and which we'll already need/want to support).

The only remaining step here is to remove VectorSvePredicate, which is exactly the consideration we had around Avx512 and its VectorMask type. This is what led us to determining we could simplify the logic down to ConditionalSelect(mask, Add(op1, op2), merge) and to specially recognize masks of AllBitsSet, Zero, and when it's a vector produced by something like CompareGreaterThan/etc. It meant that we didn't need to introduce new concepts and were still able to achieve the full set of functionality available. -- This is what I think is most interesting to explore for SVE: whether we can also remove this "last difference" here as well. My expectation is that it would be possible. That is, we should be able to completely remove the concept of VectorPredicate at the managed layer and only expose it in terms of vectors. The JIT can internally track the type correctly, just as it's doing for TYP_MASK for x64, and we ultimately get identical codegen in a much more pay for play and user friendly manner (and in one that trivially translates to existing concepts).
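As a minimal sketch of what that end state looks like in user code, using only the existing Vector<T> surface and no predicate type anywhere:

using System.Numerics;

// Sketch only: an "AddMerge" expressed purely in terms of Vector<T>.
// pg is just a vector whose lanes are AllBitsSet or Zero, e.g. the result of Vector.GreaterThan.
static Vector<float> AddMerge(Vector<float> inactive, Vector<float> pg,
                              Vector<float> op1, Vector<float> op2)
{
    return Vector.ConditionalSelect(pg, op1 + op2, inactive);
}

The JIT would then be free to keep pg in a predicate register and emit the predicated FADD (with MOVPRFX where the destination is not one of the operands), exactly as described above.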

tannergooding commented 1 year ago

I think we've discussed this before; I still don't see why you can't generate the loop above from a more generic VLA-friendly representation. It's just peeling the last vector iteration. I would argue that's easier for people to write as well, and easier to maintain since you don't need two instances of your loop body.

The JIT is, in general, time constrained. It cannot trivially do all the same optimizations that a native compiler could do. However, there are other optimizations that it can do more trivially particularly since it knows the exact machine it's targeting.

Loop peeling isn't necessarily complex, but it produces quite a bit more IR and requires a lot of additional handling and optimizations to ensure everything works correctly. It's not something we have a lot of support for today, and adding it would likely not be very "pay for play", especially if it's required to get good performance/handling for code that users are writing to be "perf critical".

Additionally, most developers are going to require the scalar loop to exist anyways so they have something to execute when SVE isn't supported. So it's mostly just a question of "do I fallthrough to the scalar path" -or- "do I do efficiently handle the tail elements".

For "efficient tail handling", you most typically have at least a full vector of elements and so developers will just backtrack by n so they read exactly 1 full vector to the end of the input. For idempotent operations you do 1 more iteration of the loop. For non-idempotent operations, you mask off the already processed elements. Both of these translate very cleanly to NEON and do not require SVE. They simply get additional performance on SVE capable hardware.

There are then some other more complex cases that aren't so trivially handled and which won't translate as cleanly, particularly if you cannot backtrack (e.g. your entire input is less than a full vector, or you don't know whether the access will fault). These can really only be handled with SVE on Arm64, since NEON doesn't have a concept of a "masked load".

What would T be here? the same type as the input? i.e. elements? a maskbit? Can you give an example with _mm512_cmp_epi32_mask? What would say ... look like when vectorized in this representation?

T is the same as the input. So Vector128<float> CompareGreaterThan(Vector128<float> left, Vector128<float> right). It simply compares left to right and produces a vector where the corresponding element is AllBitsSet if left > right and Zero otherwise. -- This is exactly how AdvSimd.CompareGreaterThan works for Arm64 as well.

On AVX-512 capable hardware, this will end up actually being a mask register, such as K1, instead. If the consumer of the value takes a mask, it will do so directly. If the consumer of the value takes a vector, we emit the vpmovm2* instruction which converts the mask to a vector (i.e. that is it goes from a mask of bits where 1 is active and 0 is inactive to a vector where the element is AllBitsSet or Zero). Inversely if an API expected a mask and the user had a vector instead, we emit the vpmov*2m instruction which does the inverse operation. Most cases do not require any conversion and analyzers are planned to help alert users to potential places where they could do something better.

For if (a[i] > 0 && b[i] < 0), could you clarify what a and b are in this case? The JIT simply does whatever is most efficient for most cases. So assuming a[i] and b[i] are simply the scalar representation, the user writes:

Vector128<int> mask = Vector128.GreaterThan(va, Vector128<int>.Zero) & Vector128.LessThan(vb, Vector128<int>.Zero);

This then generates (assuming xmm1 == va, xmm2 == vb, and neither is last use):

; Pre-AVX512 hardware
vxorps   xmm0, xmm0, xmm0
vpcmpgtd xmm3, xmm1, xmm0
vpcmpltd xmm4, xmm2, xmm0
vpand    xmm3, xmm3, xmm4

; AVX512 hardware
vxorps   xmm0, xmm0, xmm0
vpcmpgtd   k1, xmm1, xmm0
vpcmpltd   k2, xmm2, xmm0
kandd      k1, k1, k2

Right, so a halfway step between autovec and intrinsics. That's fine; in C code we treat predication as just normal masking anyway. Typically we don't carry additional IL for it, just normal boolean operations on vector booleans.

This isn't really auto-vectorization as the user is still explicitly writing vectorized code. We're simply recognizing the common SIMD patterns and emitting better codegen for them on hardware where that can be done.

without exposing vector mask, how does one write this?

The same way you'd have to write it for NEON today (assuming again that the snippet provided is the scalar algorithm):

// va/vb/vc = current vector for a/b/c, would have been loaded using `Vector128.Load`

Vector128<int> mask = Vector128.GreaterThan(va, Vector128<int>.Zero);
vb = Vector128.ConditionalSelect(mask, vc * n, vb);
va += Vector128.ConditionalSelect(mask, vb - vc, Vector128<int>.Zero);

This simply uses SVE predicated instructions -or- AVX-512 masked instructions on capable hardware; otherwise it emits VBSL on NEON and PBLENDV on SSE/AVX hardware.

VLA is supposed to map more closely to an intuitive loop design where you're supposed to be able to map directly from scalar to vector without worrying about buffer overruns, overreads etc.

Right, but it's not universal and therefore not a de facto scenario for users. Users have to consider these scenarios for Arm64 NEON and WASM PackedSimd, and they generally want to also consider them for x64, since maskmovdqu can be expensive and didn't have "well-defined behavior" until AVX.

VLA works nicely for many cases, but without dedicated predication/masking support it ends up limited (the same problems users encountered for years with Vector<T>). For users writing platform-specific SVE loops directly, the model you propose will be ideal. The much larger audience of developers writing mostly platform-agnostic code will instead continue needing to think in the same terms as they do for AdvSimd/PackedSimd/Sse and so will have to deal with all these concepts.

Sorry don't quite follow this one, So if a user does ... what happens here?

It depends on the user of the value and what the underlying instructions allow. On AVX-512 capable hardware, we can either do a vpcmpgt; vpmov*2m; kadd (user needs a mask) or we can do a vpcmpgt; vpmovm2*; vpadd (user needs a vector). Given that SVE doesn't have a predicate addition instruction, it would in the worst case need cmpgt; sel; add; cmpeq. It would (unless a better option exists) use sel to convert from "predicate to vector" and cmpeq for "vector to predicate".
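For reference, a minimal sketch of that expression written against the existing Vector128 surface (illustrative only):

using System.Runtime.Intrinsics;

// Sketch only: (a[i] > 0) + b[i] at the vector level.
// GreaterThan produces AllBitsSet (-1 as int) or Zero per lane, so the add simply
// subtracts one from b's lane wherever a's lane is positive.
static Vector128<int> Example(Vector128<int> va, Vector128<int> vb)
{
    Vector128<int> gt = Vector128.GreaterThan(va, Vector128<int>.Zero);
    return gt + vb;
}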

How about cases where it's ambiguous ... is this a predicate combination or bitwise or of values?

Same scenario. It depends on how the value is being used. You'd either need to convert the mask (a[i] > 0) to a vector (using vpmovm2* or sel) or convert the vector (b[i]) to a mask (vpmov*2m or cmpeq).

The typical cases, particularly in the real world, will be simply combining masks together and then using them as masks, so no additional "stuff" is needed. You get the exact same codegen as you would if masking was exposed as a "proper" type.

tannergooding commented 1 year ago

-- Should be done. Just wanted to reiterate the general point I'm coming from to try and clear up any confusion and then try to separately address your individual questions.

a74nh commented 9 months ago

I've been thinking (and discussing with others) how we could handle the type of a predicate variable. These are the options we came up with.

Let's assume we are using VectorT for SVE vectors.

VectorT<T>

Var example: VectorT<short> predicate; API example: VectorT<int> Add(VectorT<int> predicate, VectorT<int> op1, VectorT<int> op2);

Here a predicate register looks identical in type to a standard Vector variable. When looking at the API, users need to look at the name of the variable to know it is a predicate. Bad naming of variables can quickly make this confusing.

If a predicate register is used as a normal vectorT then this will error at runtime when the JIT compiles code.

It may be possible to catch some of these errors at C# compilation. Within the scope of a method, Roslyn can track the uses of a vector. If it is used both as a predicate and a normal vector then Roslyn will error. However, Roslyn cannot track this for variables passed into a function.

We also want to prevent using standard arithmetic operations (e.g. +) on predicates. Again, the scope will be lost across function boundaries.

VectorT<bool>

Var example: VectorT<bool> predicate; API example: VectorT<int> Add(VectorT<bool> predicate, VectorT<int> op1, VectorT<int> op2);

All predicates now have the same type. It is very obvious which variables are predicates.

It may be confusing to a user that vector<bool> does not have 8 times the number of entries of a vector<byte>. Instead, vector<bool> is only used to imply a predicate. This could have issues if in the future we want to support real vectors of bools.

It is possible for the user to use a predicate incorrectly: create a predicate of ints, then use it in a function expecting a predicate of shorts. This behaviour is undefined in the Arm architecture; however, the instruction will still run. It would be hard to track such errors correctly within Roslyn or the JIT.

Note that the C++ SVE intrinsics effectively use this approach.

VectorT<Mask>

Var example: VectorT<Mask> predicate; API example: VectorT<int> Add(VectorT<Mask> predicate, VectorT<int> op1, VectorT<int> op2);

This is the same as VectorT<bool>, but uses a new Mask type instead of bool. This stops the user confusing the predicate with a real vector of bools, but it otherwise has the same issues as VectorT<bool>.

VectorTMask<T>

Var example: VectorTMask<short> predicate; API example: VectorT<int> Add(VectorTMask<int> predicate, VectorT<int> op1, VectorT<int> op2);

This is explicit typing and prevents any of the errors mentioned earlier.

However, it requires adding a new VectorTMask type, which needs support in coreclr.

VectorT<Tpred>

Var example: VectorT<p_short> predicate; API example: VectorT<int> Add(VectorT<p_int> predicate, VectorT<int> op1, VectorT<int> op2);

Here we add an extra set of predicate element types (p_int, p_short, etc.) that mirror the standard integer types.

VectorT<Mask<T>>

Var example: VectorT<Mask<short>> predicate; API example: VectorT<int> Add(vectorT<Mask<int>> predicate, VectorT<int> op1, VectorT<int> op2);

Here a Mask type wraps the integer type. I'm not sure if this is valid C# syntax.

tannergooding commented 9 months ago

Based on the existing design we've found to work for AVX-512, which supports a similar predication/masking concept, the intended design here is to use Vector<T>. We are open to changing if it is determined that this is not sufficient for SVE, but that is not expected to be the case.

API example: Vector<int> Add(Vector<int> predicate, Vector<int> op1, Vector<int> op2);

Such an API would not be exposed. Instead we would only expose Vector<int> Add(Vector<int> op1, Vector<int> op2);, that would be usable with the existing Vector<int> ConditionalSelect(Vector<int> mask, Vector<int> left, Vector<int> right) such as the following:

Vector.ConditionalSelect(pg, Sve2.AbsoluteValue(op), inactive)
Vector.ConditionalSelect(pg, Sve2.AbsoluteValue(op), Vector<int>.Undefined) // Vector<int>.Undefined would be new
Vector.ConditionalSelect(pg, Sve2.AbsoluteValue(op), Vector<int>.Zero)

This allows all forms of predication to be supported via ConditionalSelect, which is exactly how such operations would be required to be emitted for downlevel code (such as AdvSimd/Neon). This then also allows such existing downlevel code to implicitly use predication support when SVE is available.
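For example, a predicated absolute value written once against the existing Vector<T> surface (a sketch; Abs, GreaterThan and ConditionalSelect all exist today):

using System.Numerics;

// Sketch only: take abs(value) in the lanes selected by pg, and inactive elsewhere.
// pg is an ordinary Vector<int> of AllBitsSet/Zero lanes, e.g. from Vector.GreaterThan.
static Vector<int> AbsWhere(Vector<int> pg, Vector<int> value, Vector<int> inactive)
{
    return Vector.ConditionalSelect(pg, Vector.Abs(value), inactive);
}

On SVE-capable hardware this could lower to a single predicated ABS; on NEON it becomes an ABS followed by a BSL, which is exactly the downlevel shape described above.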

If a predicate register is used as a normal vectorT then this will error at runtime when the JIT compiles code.

This is a non-issue, as the JIT can emit an implicit conversion. For the most part it simply doesn't come up, given standard coding patterns around masks and when/where you would want to use them.

We also want to prevent using standard arithmetic operations (e.g. +) on predicates. Again, the scope will be lost across function boundaries.

This should also be a non-issue. If the user wants to do arithmetic on the output of a Vector.LessThan, they should be able to do so and the JIT will emit the most efficient code pattern (it does this for AVX-512 already). There are valid algorithms where such operations are done.

It may not be the "best" pattern for an SVE specific code path, but the user can always opt to write an SVE specific path that does something more optimal.

a74nh commented 9 months ago

Such an API would not be exposed. Instead we would only expose Vector<int> Add(Vector<int> op1, Vector<int> op2);, that would be usable with the existing Vector<int> ConditionalSelect(Vector<int> mask, Vector<int> left, Vector<int> right) such as the following:

Ok, I'm mostly on board with this approach. I've been working my way through all the predicated instructions trying to see if they will work.

If we then take it a step further and do the same thing as what we've done with Avx512 then we simply have:

Vector<float> Add(Vector<float> op1, Vector<float> op2);
Vector<float> Abs(Vector<float> op1);
Vector<float> Select(VectorSvePredicate pg, Vector<float> op1, Vector<float> op2);

Looking also at AND, all the variants are:

AND <Zd>.D, <Zn>.D, <Zm>.D  // Vector && Vector. Unpredicated (.D because the type doesn't matter)
AND <Zdn>.<T>, <Zdn>.<T>, #<const>. // Vector && Constant. Unpredicated.
AND <Zdn>.<T>, <Pg>/M, <Zdn>.<T>, <Zm>.<T> // Vector && Vector. With merge predicate
AND <Pd>.B, <Pg>/Z, <Pn>.B, <Pm>.B  // Predicate && Predicate. With zero predicate

This can all be covered in C# via:

   public static unsafe Vector<T> And(Vector<T> left, int right);
   public static unsafe Vector<T> And(Vector<T> left, Vector<T> right);

Note there is a variant that works only on predicates, allowing for:

      Vector<int> mask_c = Sve.And(mask_a, mask_b);

But there is no variant of ADD that works only on predicates. What will the following do?

      Vector<int> mask_c = Sve.Add(mask_a, mask_b);

Assuming this is allowed somehow, does there need to be something in the vector API to indicate to users that ANDing predicates is a standard practice, but ADDing predicates does something not normally expected in SVE?

What happens if the user tries:

  Vector<int> mask = Sve.Add(vec_a, mask_a);
  Vector<int> mask = Sve.Abs(mask_a);

Will this error? Or will vec_a be turned into a mask? Or will mask_a be turned into a vector?


There are also some instructions that work only on predicates. For example:

NAND <Pd>.B, <Pg>/Z, <Pn>.B, <Pm>.B
NOR <Pd>.B, <Pg>/Z, <Pn>.B, <Pm>.B

generates into:

   public static unsafe Vector<T> Nand(Vector<T> left, Vector<T> right);
   public static unsafe Vector<T> Nor(Vector<T> left, Vector<T> right);

It's fairly easy to support these functions for standard vectors too, but it is worth noting.

tannergooding commented 9 months ago

Will this error? Or will vec_a be turned into a mask? Or will mask_a be turned into a vector?

mask_a will be turned into a vector. This is because the baseline Vector<T> Add(Vector<T>, Vector<T>) operation logically must do a full vector's width of addition to compute a deterministic result. Cases like Vector<T> And(Vector<T>, Vector<T>) are allowed to preserve the values as masks since they are bitwise and the logical result is the same, whether consumed as a mask or as a vector.

That being said, this is one of the more unique cases where preserving the general support is desirable and where it can't be trivially done via pattern recognition. Thus it's a case where we'd expose an API like Vector<T> AddMask(Vector<T>, Vector<T>). For this one, it does the inverse: if given a mask_a, it uses the operand as is; if instead given a vector_a, it converts it to a mask. This allows us to preserve the intended behavior via explicit opt-in without exploding the API surface area.

There are also some instructions that work only on predicates.

We can either keep the names simple or name them NandMask and NorMask to make it clear that they only exist as direct instructions for the masking case. If given two vectors, we can then opt to emit the 2-instruction sequence rather than converting vector to mask or vice versa.

a74nh commented 9 months ago

We can either keep the name simple or name them NandMask and NorMask

Explicitly naming the methods that work on masks would be great. It ensures the meaning is clear to the user.

However, it will increase the number of methods. eg:

  public static unsafe Vector<sbyte> And(Vector<sbyte> left, Vector<sbyte> right);
  public static unsafe Vector<short> And(Vector<short> left, Vector<short> right);
  public static unsafe Vector<int> And(Vector<int> left, Vector<int> right);
  public static unsafe Vector<long> And(Vector<long> left, Vector<long> right);
  public static unsafe Vector<byte> And(Vector<byte> left, Vector<byte> right);
  public static unsafe Vector<ushort> And(Vector<ushort> left, Vector<ushort> right);
  public static unsafe Vector<uint> And(Vector<uint> left, Vector<uint> right);
  public static unsafe Vector<ulong> And(Vector<ulong> left, Vector<ulong> right);

  public static unsafe Vector<sbyte> AndMask(Vector<sbyte> left, Vector<sbyte> right);
  public static unsafe Vector<short> AndMask(Vector<short> left, Vector<short> right);
  public static unsafe Vector<int> AndMask(Vector<int> left, Vector<int> right);
  public static unsafe Vector<long> AndMask(Vector<long> left, Vector<long> right);
  public static unsafe Vector<byte> AndMask(Vector<byte> left, Vector<byte> right);
  public static unsafe Vector<ushort> AndMask(Vector<ushort> left, Vector<ushort> right);
  public static unsafe Vector<uint> AndMask(Vector<uint> left, Vector<uint> right);
  public static unsafe Vector<ulong> AndMask(Vector<ulong> left, Vector<ulong> right);

Curiously, this is a case where C# has more methods than C, due to C having a single svbool_t type for all the masks. So AndMask is simply:

svbool_t svand[_b]_z (svbool_t pg, svbool_t op1, svbool_t op2)

tannergooding commented 9 months ago

We don't really need And + AndMask because it logically works the same. That is, in a theoretical world where we had VectorMask:

Sve.AndMask(vector1.AsMask(), vector2.AsMask()).AsVector() == Sve.And(vector1, vector2);
Sve.AndMask(mask1, mask2) == Sve.And(mask1.AsVector(), mask2.AsVector()).AsMask();

Thus, in the world where we only expose Vector<T> we can simply expose Sve.And(Vector<T>, Vector<T>) and have the JIT pick the right instruction based on the input types (are they actually vectors or are they masks) and the consumer of the output (does it expect a vector or a mask). -- It basically boils down to, if both inputs are the same type, use that and then convert to the target type. If one input is a vector and one is a mask, decide based on the target type required by the consumer.

We only need to expose additional overloads for cases like Add and AddMask where the semantics differ based on whether it's operating over a mask or a vector.
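To illustrate that rule with the existing Vector<T> operations (a sketch; the instruction selection in the comments is the intent described above, not a guarantee):

using System.Numerics;

static void Example(Vector<int> a, Vector<int> b)
{
    Vector<int> m1 = Vector.GreaterThan(a, Vector<int>.Zero);
    Vector<int> m2 = Vector.LessThan(b, Vector<int>.Zero);

    // Both inputs are masks, so the JIT could keep them in predicate registers
    // and emit the predicate form of AND.
    Vector<int> bothMasks = Vector.BitwiseAnd(m1, m2);

    // One input is ordinary data, so the mask would be converted and a vector AND used.
    Vector<int> maskAndData = Vector.BitwiseAnd(m1, b);
}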

a74nh commented 9 months ago

After parsing through FEAT_SVE, I've split the methods into groups, trying to keep within 200 methods per group.

12 apiraw_FEAT_SVE__signextend.cs
16 apiraw_FEAT_SVE__address.cs
28 apiraw_FEAT_SVE__mask.cs
32 apiraw_FEAT_SVE__while.cs
35 apiraw_FEAT_SVE__duplicate.cs
38 apiraw_FEAT_SVE__reverse.cs
40 apiraw_FEAT_SVE__break.cs
40 apiraw_FEAT_SVE__prefetch.cs
60 apiraw_FEAT_SVE__storen.cs
63 apiraw_FEAT_SVE__fp.cs
64 apiraw_FEAT_SVE__fused.cs
64 apiraw_FEAT_SVE__store.cs
70 apiraw_FEAT_SVE__loadn.cs
70 apiraw_FEAT_SVE__shift.cs
72 apiraw_FEAT_SVE__minmax.cs
85 apiraw_FEAT_SVE__firstfault.cs
86 apiraw_FEAT_SVE__select.cs
90 apiraw_FEAT_SVE__multiply.cs
100 apiraw_FEAT_SVE__storescatter.cs
108 apiraw_FEAT_SVE__zip.cs
128 apiraw_FEAT_SVE__bitwise.cs
156 apiraw_FEAT_SVE__load.cs
158 apiraw_FEAT_SVE__firstfaultgather.cs
158 apiraw_FEAT_SVE__loadgather.cs
158 apiraw_FEAT_SVE__math.cs
176 apiraw_FEAT_SVE__count.cs
200 apiraw_FEAT_SVE__compare.cs

That's 2307 C# methods across 27 groups.

Here's a truncated apiraw_FEAT_SVE__math.cs

namespace System.Runtime.Intrinsics.Arm

public abstract class Sve : AdvSimd /// Feature: FEAT_SVE  Category: math
{
    /// Abs : Absolute value

    /// svfloat32_t svabs[_f32]_m(svfloat32_t inactive, svbool_t pg, svfloat32_t op) : "FABS Ztied.S, Pg/M, Zop.S" or "MOVPRFX Zresult, Zinactive; FABS Zresult.S, Pg/M, Zop.S"
    /// svfloat32_t svabs[_f32]_x(svbool_t pg, svfloat32_t op) : "FABS Ztied.S, Pg/M, Ztied.S" or "MOVPRFX Zresult, Zop; FABS Zresult.S, Pg/M, Zop.S"
    /// svfloat32_t svabs[_f32]_z(svbool_t pg, svfloat32_t op) : "MOVPRFX Zresult.S, Pg/Z, Zop.S; FABS Zresult.S, Pg/M, Zop.S"
  public static unsafe Vector<float> Abs(Vector<float> value);

    /// svfloat64_t svabs[_f64]_m(svfloat64_t inactive, svbool_t pg, svfloat64_t op) : "FABS Ztied.D, Pg/M, Zop.D" or "MOVPRFX Zresult, Zinactive; FABS Zresult.D, Pg/M, Zop.D"
    /// svfloat64_t svabs[_f64]_x(svbool_t pg, svfloat64_t op) : "FABS Ztied.D, Pg/M, Ztied.D" or "MOVPRFX Zresult, Zop; FABS Zresult.D, Pg/M, Zop.D"
    /// svfloat64_t svabs[_f64]_z(svbool_t pg, svfloat64_t op) : "MOVPRFX Zresult.D, Pg/Z, Zop.D; FABS Zresult.D, Pg/M, Zop.D"
  public static unsafe Vector<double> Abs(Vector<double> value);

    /// svint8_t svabs[_s8]_m(svint8_t inactive, svbool_t pg, svint8_t op) : "ABS Ztied.B, Pg/M, Zop.B" or "MOVPRFX Zresult, Zinactive; ABS Zresult.B, Pg/M, Zop.B"
    /// svint8_t svabs[_s8]_x(svbool_t pg, svint8_t op) : "ABS Ztied.B, Pg/M, Ztied.B" or "MOVPRFX Zresult, Zop; ABS Zresult.B, Pg/M, Zop.B"
    /// svint8_t svabs[_s8]_z(svbool_t pg, svint8_t op) : "MOVPRFX Zresult.B, Pg/Z, Zop.B; ABS Zresult.B, Pg/M, Zop.B"
  public static unsafe Vector<sbyte> Abs(Vector<sbyte> value);

    /// svint16_t svabs[_s16]_m(svint16_t inactive, svbool_t pg, svint16_t op) : "ABS Ztied.H, Pg/M, Zop.H" or "MOVPRFX Zresult, Zinactive; ABS Zresult.H, Pg/M, Zop.H"
    /// svint16_t svabs[_s16]_x(svbool_t pg, svint16_t op) : "ABS Ztied.H, Pg/M, Ztied.H" or "MOVPRFX Zresult, Zop; ABS Zresult.H, Pg/M, Zop.H"
    /// svint16_t svabs[_s16]_z(svbool_t pg, svint16_t op) : "MOVPRFX Zresult.H, Pg/Z, Zop.H; ABS Zresult.H, Pg/M, Zop.H"
  public static unsafe Vector<short> Abs(Vector<short> value);

    /// svint32_t svabs[_s32]_m(svint32_t inactive, svbool_t pg, svint32_t op) : "ABS Ztied.S, Pg/M, Zop.S" or "MOVPRFX Zresult, Zinactive; ABS Zresult.S, Pg/M, Zop.S"
    /// svint32_t svabs[_s32]_x(svbool_t pg, svint32_t op) : "ABS Ztied.S, Pg/M, Ztied.S" or "MOVPRFX Zresult, Zop; ABS Zresult.S, Pg/M, Zop.S"
    /// svint32_t svabs[_s32]_z(svbool_t pg, svint32_t op) : "MOVPRFX Zresult.S, Pg/Z, Zop.S; ABS Zresult.S, Pg/M, Zop.S"
  public static unsafe Vector<int> Abs(Vector<int> value);

    /// svint64_t svabs[_s64]_m(svint64_t inactive, svbool_t pg, svint64_t op) : "ABS Ztied.D, Pg/M, Zop.D" or "MOVPRFX Zresult, Zinactive; ABS Zresult.D, Pg/M, Zop.D"
    /// svint64_t svabs[_s64]_x(svbool_t pg, svint64_t op) : "ABS Ztied.D, Pg/M, Ztied.D" or "MOVPRFX Zresult, Zop; ABS Zresult.D, Pg/M, Zop.D"
    /// svint64_t svabs[_s64]_z(svbool_t pg, svint64_t op) : "MOVPRFX Zresult.D, Pg/Z, Zop.D; ABS Zresult.D, Pg/M, Zop.D"
  public static unsafe Vector<long> Abs(Vector<long> value);

    /// AbsoluteDifference : Absolute difference

    /// svfloat32_t svabd[_f32]_m(svbool_t pg, svfloat32_t op1, svfloat32_t op2) : "FABD Ztied1.S, Pg/M, Ztied1.S, Zop2.S" or "MOVPRFX Zresult, Zop1; FABD Zresult.S, Pg/M, Zresult.S, Zop2.S"
    /// svfloat32_t svabd[_f32]_x(svbool_t pg, svfloat32_t op1, svfloat32_t op2) : "FABD Ztied1.S, Pg/M, Ztied1.S, Zop2.S" or "FABD Ztied2.S, Pg/M, Ztied2.S, Zop1.S" or "MOVPRFX Zresult, Zop1; FABD Zresult.S, Pg/M, Zresult.S, Zop2.S"
    /// svfloat32_t svabd[_f32]_z(svbool_t pg, svfloat32_t op1, svfloat32_t op2) : "MOVPRFX Zresult.S, Pg/Z, Zop1.S; FABD Zresult.S, Pg/M, Zresult.S, Zop2.S" or "MOVPRFX Zresult.S, Pg/Z, Zop2.S; FABD Zresult.S, Pg/M, Zresult.S, Zop1.S"
  public static unsafe Vector<float> AbsoluteDifference(Vector<float> left, Vector<float> right);

    /// svfloat64_t svabd[_f64]_m(svbool_t pg, svfloat64_t op1, svfloat64_t op2) : "FABD Ztied1.D, Pg/M, Ztied1.D, Zop2.D" or "MOVPRFX Zresult, Zop1; FABD Zresult.D, Pg/M, Zresult.D, Zop2.D"
    /// svfloat64_t svabd[_f64]_x(svbool_t pg, svfloat64_t op1, svfloat64_t op2) : "FABD Ztied1.D, Pg/M, Ztied1.D, Zop2.D" or "FABD Ztied2.D, Pg/M, Ztied2.D, Zop1.D" or "MOVPRFX Zresult, Zop1; FABD Zresult.D, Pg/M, Zresult.D, Zop2.D"
    /// svfloat64_t svabd[_f64]_z(svbool_t pg, svfloat64_t op1, svfloat64_t op2) : "MOVPRFX Zresult.D, Pg/Z, Zop1.D; FABD Zresult.D, Pg/M, Zresult.D, Zop2.D" or "MOVPRFX Zresult.D, Pg/Z, Zop2.D; FABD Zresult.D, Pg/M, Zresult.D, Zop1.D"
  public static unsafe Vector<double> AbsoluteDifference(Vector<double> left, Vector<double> right);

    /// svint8_t svabd[_s8]_m(svbool_t pg, svint8_t op1, svint8_t op2) : "SABD Ztied1.B, Pg/M, Ztied1.B, Zop2.B" or "MOVPRFX Zresult, Zop1; SABD Zresult.B, Pg/M, Zresult.B, Zop2.B"
    /// svint8_t svabd[_s8]_x(svbool_t pg, svint8_t op1, svint8_t op2) : "SABD Ztied1.B, Pg/M, Ztied1.B, Zop2.B" or "SABD Ztied2.B, Pg/M, Ztied2.B, Zop1.B" or "MOVPRFX Zresult, Zop1; SABD Zresult.B, Pg/M, Zresult.B, Zop2.B"
    /// svint8_t svabd[_s8]_z(svbool_t pg, svint8_t op1, svint8_t op2) : "MOVPRFX Zresult.B, Pg/Z, Zop1.B; SABD Zresult.B, Pg/M, Zresult.B, Zop2.B" or "MOVPRFX Zresult.B, Pg/Z, Zop2.B; SABD Zresult.B, Pg/M, Zresult.B, Zop1.B"
  public static unsafe Vector<sbyte> AbsoluteDifference(Vector<sbyte> left, Vector<sbyte> right);

    /// svint16_t svabd[_s16]_m(svbool_t pg, svint16_t op1, svint16_t op2) : "SABD Ztied1.H, Pg/M, Ztied1.H, Zop2.H" or "MOVPRFX Zresult, Zop1; SABD Zresult.H, Pg/M, Zresult.H, Zop2.H"
    /// svint16_t svabd[_s16]_x(svbool_t pg, svint16_t op1, svint16_t op2) : "SABD Ztied1.H, Pg/M, Ztied1.H, Zop2.H" or "SABD Ztied2.H, Pg/M, Ztied2.H, Zop1.H" or "MOVPRFX Zresult, Zop1; SABD Zresult.H, Pg/M, Zresult.H, Zop2.H"
    /// svint16_t svabd[_s16]_z(svbool_t pg, svint16_t op1, svint16_t op2) : "MOVPRFX Zresult.H, Pg/Z, Zop1.H; SABD Zresult.H, Pg/M, Zresult.H, Zop2.H" or "MOVPRFX Zresult.H, Pg/Z, Zop2.H; SABD Zresult.H, Pg/M, Zresult.H, Zop1.H"
  public static unsafe Vector<short> AbsoluteDifference(Vector<short> left, Vector<short> right);

    /// svint32_t svabd[_s32]_m(svbool_t pg, svint32_t op1, svint32_t op2) : "SABD Ztied1.S, Pg/M, Ztied1.S, Zop2.S" or "MOVPRFX Zresult, Zop1; SABD Zresult.S, Pg/M, Zresult.S, Zop2.S"
    /// svint32_t svabd[_s32]_x(svbool_t pg, svint32_t op1, svint32_t op2) : "SABD Ztied1.S, Pg/M, Ztied1.S, Zop2.S" or "SABD Ztied2.S, Pg/M, Ztied2.S, Zop1.S" or "MOVPRFX Zresult, Zop1; SABD Zresult.S, Pg/M, Zresult.S, Zop2.S"
    /// svint32_t svabd[_s32]_z(svbool_t pg, svint32_t op1, svint32_t op2) : "MOVPRFX Zresult.S, Pg/Z, Zop1.S; SABD Zresult.S, Pg/M, Zresult.S, Zop2.S" or "MOVPRFX Zresult.S, Pg/Z, Zop2.S; SABD Zresult.S, Pg/M, Zresult.S, Zop1.S"
  public static unsafe Vector<int> AbsoluteDifference(Vector<int> left, Vector<int> right);

    /// svint64_t svabd[_s64]_m(svbool_t pg, svint64_t op1, svint64_t op2) : "SABD Ztied1.D, Pg/M, Ztied1.D, Zop2.D" or "MOVPRFX Zresult, Zop1; SABD Zresult.D, Pg/M, Zresult.D, Zop2.D"
    /// svint64_t svabd[_s64]_x(svbool_t pg, svint64_t op1, svint64_t op2) : "SABD Ztied1.D, Pg/M, Ztied1.D, Zop2.D" or "SABD Ztied2.D, Pg/M, Ztied2.D, Zop1.D" or "MOVPRFX Zresult, Zop1; SABD Zresult.D, Pg/M, Zresult.D, Zop2.D"
    /// svint64_t svabd[_s64]_z(svbool_t pg, svint64_t op1, svint64_t op2) : "MOVPRFX Zresult.D, Pg/Z, Zop1.D; SABD Zresult.D, Pg/M, Zresult.D, Zop2.D" or "MOVPRFX Zresult.D, Pg/Z, Zop2.D; SABD Zresult.D, Pg/M, Zresult.D, Zop1.D"
  public static unsafe Vector<long> AbsoluteDifference(Vector<long> left, Vector<long> right);

    /// svuint8_t svabd[_u8]_m(svbool_t pg, svuint8_t op1, svuint8_t op2) : "UABD Ztied1.B, Pg/M, Ztied1.B, Zop2.B" or "MOVPRFX Zresult, Zop1; UABD Zresult.B, Pg/M, Zresult.B, Zop2.B"
    /// svuint8_t svabd[_u8]_x(svbool_t pg, svuint8_t op1, svuint8_t op2) : "UABD Ztied1.B, Pg/M, Ztied1.B, Zop2.B" or "UABD Ztied2.B, Pg/M, Ztied2.B, Zop1.B" or "MOVPRFX Zresult, Zop1; UABD Zresult.B, Pg/M, Zresult.B, Zop2.B"
    /// svuint8_t svabd[_u8]_z(svbool_t pg, svuint8_t op1, svuint8_t op2) : "MOVPRFX Zresult.B, Pg/Z, Zop1.B; UABD Zresult.B, Pg/M, Zresult.B, Zop2.B" or "MOVPRFX Zresult.B, Pg/Z, Zop2.B; UABD Zresult.B, Pg/M, Zresult.B, Zop1.B"
  public static unsafe Vector<byte> AbsoluteDifference(Vector<byte> left, Vector<byte> right);

    /// svuint16_t svabd[_u16]_m(svbool_t pg, svuint16_t op1, svuint16_t op2) : "UABD Ztied1.H, Pg/M, Ztied1.H, Zop2.H" or "MOVPRFX Zresult, Zop1; UABD Zresult.H, Pg/M, Zresult.H, Zop2.H"
    /// svuint16_t svabd[_u16]_x(svbool_t pg, svuint16_t op1, svuint16_t op2) : "UABD Ztied1.H, Pg/M, Ztied1.H, Zop2.H" or "UABD Ztied2.H, Pg/M, Ztied2.H, Zop1.H" or "MOVPRFX Zresult, Zop1; UABD Zresult.H, Pg/M, Zresult.H, Zop2.H"
    /// svuint16_t svabd[_u16]_z(svbool_t pg, svuint16_t op1, svuint16_t op2) : "MOVPRFX Zresult.H, Pg/Z, Zop1.H; UABD Zresult.H, Pg/M, Zresult.H, Zop2.H" or "MOVPRFX Zresult.H, Pg/Z, Zop2.H; UABD Zresult.H, Pg/M, Zresult.H, Zop1.H"
  public static unsafe Vector<ushort> AbsoluteDifference(Vector<ushort> left, Vector<ushort> right);

    /// svuint32_t svabd[_u32]_m(svbool_t pg, svuint32_t op1, svuint32_t op2) : "UABD Ztied1.S, Pg/M, Ztied1.S, Zop2.S" or "MOVPRFX Zresult, Zop1; UABD Zresult.S, Pg/M, Zresult.S, Zop2.S"
    /// svuint32_t svabd[_u32]_x(svbool_t pg, svuint32_t op1, svuint32_t op2) : "UABD Ztied1.S, Pg/M, Ztied1.S, Zop2.S" or "UABD Ztied2.S, Pg/M, Ztied2.S, Zop1.S" or "MOVPRFX Zresult, Zop1; UABD Zresult.S, Pg/M, Zresult.S, Zop2.S"
    /// svuint32_t svabd[_u32]_z(svbool_t pg, svuint32_t op1, svuint32_t op2) : "MOVPRFX Zresult.S, Pg/Z, Zop1.S; UABD Zresult.S, Pg/M, Zresult.S, Zop2.S" or "MOVPRFX Zresult.S, Pg/Z, Zop2.S; UABD Zresult.S, Pg/M, Zresult.S, Zop1.S"
  public static unsafe Vector<uint> AbsoluteDifference(Vector<uint> left, Vector<uint> right);

    /// svuint64_t svabd[_u64]_m(svbool_t pg, svuint64_t op1, svuint64_t op2) : "UABD Ztied1.D, Pg/M, Ztied1.D, Zop2.D" or "MOVPRFX Zresult, Zop1; UABD Zresult.D, Pg/M, Zresult.D, Zop2.D"
    /// svuint64_t svabd[_u64]_x(svbool_t pg, svuint64_t op1, svuint64_t op2) : "UABD Ztied1.D, Pg/M, Ztied1.D, Zop2.D" or "UABD Ztied2.D, Pg/M, Ztied2.D, Zop1.D" or "MOVPRFX Zresult, Zop1; UABD Zresult.D, Pg/M, Zresult.D, Zop2.D"
    /// svuint64_t svabd[_u64]_z(svbool_t pg, svuint64_t op1, svuint64_t op2) : "MOVPRFX Zresult.D, Pg/Z, Zop1.D; UABD Zresult.D, Pg/M, Zresult.D, Zop2.D" or "MOVPRFX Zresult.D, Pg/Z, Zop2.D; UABD Zresult.D, Pg/M, Zresult.D, Zop1.D"
  public static unsafe Vector<ulong> AbsoluteDifference(Vector<ulong> left, Vector<ulong> right);

    /// svfloat32_t svabd[_n_f32]_m(svbool_t pg, svfloat32_t op1, float32_t op2) : "FABD Ztied1.S, Pg/M, Ztied1.S, Zop2.S" or "MOVPRFX Zresult, Zop1; FABD Zresult.S, Pg/M, Zresult.S, Zop2.S"
    /// svfloat32_t svabd[_n_f32]_x(svbool_t pg, svfloat32_t op1, float32_t op2) : "FABD Ztied1.S, Pg/M, Ztied1.S, Zop2.S" or "FABD Ztied2.S, Pg/M, Ztied2.S, Zop1.S" or "MOVPRFX Zresult, Zop1; FABD Zresult.S, Pg/M, Zresult.S, Zop2.S"
    /// svfloat32_t svabd[_n_f32]_z(svbool_t pg, svfloat32_t op1, float32_t op2) : "MOVPRFX Zresult.S, Pg/Z, Zop1.S; FABD Zresult.S, Pg/M, Zresult.S, Zop2.S" or "MOVPRFX Zresult.S, Pg/Z, Zop2.S; FABD Zresult.S, Pg/M, Zresult.S, Zop1.S"
  public static unsafe Vector<float> AbsoluteDifference(Vector<float> left, float right);

    /// svfloat64_t svabd[_n_f64]_m(svbool_t pg, svfloat64_t op1, float64_t op2) : "FABD Ztied1.D, Pg/M, Ztied1.D, Zop2.D" or "MOVPRFX Zresult, Zop1; FABD Zresult.D, Pg/M, Zresult.D, Zop2.D"
    /// svfloat64_t svabd[_n_f64]_x(svbool_t pg, svfloat64_t op1, float64_t op2) : "FABD Ztied1.D, Pg/M, Ztied1.D, Zop2.D" or "FABD Ztied2.D, Pg/M, Ztied2.D, Zop1.D" or "MOVPRFX Zresult, Zop1; FABD Zresult.D, Pg/M, Zresult.D, Zop2.D"
    /// svfloat64_t svabd[_n_f64]_z(svbool_t pg, svfloat64_t op1, float64_t op2) : "MOVPRFX Zresult.D, Pg/Z, Zop1.D; FABD Zresult.D, Pg/M, Zresult.D, Zop2.D" or "MOVPRFX Zresult.D, Pg/Z, Zop2.D; FABD Zresult.D, Pg/M, Zresult.D, Zop1.D"
  public static unsafe Vector<double> AbsoluteDifference(Vector<double> left, double right);

    /// svint8_t svabd[_n_s8]_m(svbool_t pg, svint8_t op1, int8_t op2) : "SABD Ztied1.B, Pg/M, Ztied1.B, Zop2.B" or "MOVPRFX Zresult, Zop1; SABD Zresult.B, Pg/M, Zresult.B, Zop2.B"
    /// svint8_t svabd[_n_s8]_x(svbool_t pg, svint8_t op1, int8_t op2) : "SABD Ztied1.B, Pg/M, Ztied1.B, Zop2.B" or "SABD Ztied2.B, Pg/M, Ztied2.B, Zop1.B" or "MOVPRFX Zresult, Zop1; SABD Zresult.B, Pg/M, Zresult.B, Zop2.B"
    /// svint8_t svabd[_n_s8]_z(svbool_t pg, svint8_t op1, int8_t op2) : "MOVPRFX Zresult.B, Pg/Z, Zop1.B; SABD Zresult.B, Pg/M, Zresult.B, Zop2.B" or "MOVPRFX Zresult.B, Pg/Z, Zop2.B; SABD Zresult.B, Pg/M, Zresult.B, Zop1.B"
  public static unsafe Vector<sbyte> AbsoluteDifference(Vector<sbyte> left, sbyte right);

    /// svint16_t svabd[_n_s16]_m(svbool_t pg, svint16_t op1, int16_t op2) : "SABD Ztied1.H, Pg/M, Ztied1.H, Zop2.H" or "MOVPRFX Zresult, Zop1; SABD Zresult.H, Pg/M, Zresult.H, Zop2.H"
    /// svint16_t svabd[_n_s16]_x(svbool_t pg, svint16_t op1, int16_t op2) : "SABD Ztied1.H, Pg/M, Ztied1.H, Zop2.H" or "SABD Ztied2.H, Pg/M, Ztied2.H, Zop1.H" or "MOVPRFX Zresult, Zop1; SABD Zresult.H, Pg/M, Zresult.H, Zop2.H"
    /// svint16_t svabd[_n_s16]_z(svbool_t pg, svint16_t op1, int16_t op2) : "MOVPRFX Zresult.H, Pg/Z, Zop1.H; SABD Zresult.H, Pg/M, Zresult.H, Zop2.H" or "MOVPRFX Zresult.H, Pg/Z, Zop2.H; SABD Zresult.H, Pg/M, Zresult.H, Zop1.H"
  public static unsafe Vector<short> AbsoluteDifference(Vector<short> left, short right);

    /// svint32_t svabd[_n_s32]_m(svbool_t pg, svint32_t op1, int32_t op2) : "SABD Ztied1.S, Pg/M, Ztied1.S, Zop2.S" or "MOVPRFX Zresult, Zop1; SABD Zresult.S, Pg/M, Zresult.S, Zop2.S"
    /// svint32_t svabd[_n_s32]_x(svbool_t pg, svint32_t op1, int32_t op2) : "SABD Ztied1.S, Pg/M, Ztied1.S, Zop2.S" or "SABD Ztied2.S, Pg/M, Ztied2.S, Zop1.S" or "MOVPRFX Zresult, Zop1; SABD Zresult.S, Pg/M, Zresult.S, Zop2.S"
    /// svint32_t svabd[_n_s32]_z(svbool_t pg, svint32_t op1, int32_t op2) : "MOVPRFX Zresult.S, Pg/Z, Zop1.S; SABD Zresult.S, Pg/M, Zresult.S, Zop2.S" or "MOVPRFX Zresult.S, Pg/Z, Zop2.S; SABD Zresult.S, Pg/M, Zresult.S, Zop1.S"
  public static unsafe Vector<int> AbsoluteDifference(Vector<int> left, int right);

    /// svint64_t svabd[_n_s64]_m(svbool_t pg, svint64_t op1, int64_t op2) : "SABD Ztied1.D, Pg/M, Ztied1.D, Zop2.D" or "MOVPRFX Zresult, Zop1; SABD Zresult.D, Pg/M, Zresult.D, Zop2.D"
    /// svint64_t svabd[_n_s64]_x(svbool_t pg, svint64_t op1, int64_t op2) : "SABD Ztied1.D, Pg/M, Ztied1.D, Zop2.D" or "SABD Ztied2.D, Pg/M, Ztied2.D, Zop1.D" or "MOVPRFX Zresult, Zop1; SABD Zresult.D, Pg/M, Zresult.D, Zop2.D"
    /// svint64_t svabd[_n_s64]_z(svbool_t pg, svint64_t op1, int64_t op2) : "MOVPRFX Zresult.D, Pg/Z, Zop1.D; SABD Zresult.D, Pg/M, Zresult.D, Zop2.D" or "MOVPRFX Zresult.D, Pg/Z, Zop2.D; SABD Zresult.D, Pg/M, Zresult.D, Zop1.D"
  public static unsafe Vector<long> AbsoluteDifference(Vector<long> left, long right);

    /// svuint8_t svabd[_n_u8]_m(svbool_t pg, svuint8_t op1, uint8_t op2) : "UABD Ztied1.B, Pg/M, Ztied1.B, Zop2.B" or "MOVPRFX Zresult, Zop1; UABD Zresult.B, Pg/M, Zresult.B, Zop2.B"
    /// svuint8_t svabd[_n_u8]_x(svbool_t pg, svuint8_t op1, uint8_t op2) : "UABD Ztied1.B, Pg/M, Ztied1.B, Zop2.B" or "UABD Ztied2.B, Pg/M, Ztied2.B, Zop1.B" or "MOVPRFX Zresult, Zop1; UABD Zresult.B, Pg/M, Zresult.B, Zop2.B"
    /// svuint8_t svabd[_n_u8]_z(svbool_t pg, svuint8_t op1, uint8_t op2) : "MOVPRFX Zresult.B, Pg/Z, Zop1.B; UABD Zresult.B, Pg/M, Zresult.B, Zop2.B" or "MOVPRFX Zresult.B, Pg/Z, Zop2.B; UABD Zresult.B, Pg/M, Zresult.B, Zop1.B"
  public static unsafe Vector<byte> AbsoluteDifference(Vector<byte> left, byte right);

    /// svuint16_t svabd[_n_u16]_m(svbool_t pg, svuint16_t op1, uint16_t op2) : "UABD Ztied1.H, Pg/M, Ztied1.H, Zop2.H" or "MOVPRFX Zresult, Zop1; UABD Zresult.H, Pg/M, Zresult.H, Zop2.H"
    /// svuint16_t svabd[_n_u16]_x(svbool_t pg, svuint16_t op1, uint16_t op2) : "UABD Ztied1.H, Pg/M, Ztied1.H, Zop2.H" or "UABD Ztied2.H, Pg/M, Ztied2.H, Zop1.H" or "MOVPRFX Zresult, Zop1; UABD Zresult.H, Pg/M, Zresult.H, Zop2.H"
    /// svuint16_t svabd[_n_u16]_z(svbool_t pg, svuint16_t op1, uint16_t op2) : "MOVPRFX Zresult.H, Pg/Z, Zop1.H; UABD Zresult.H, Pg/M, Zresult.H, Zop2.H" or "MOVPRFX Zresult.H, Pg/Z, Zop2.H; UABD Zresult.H, Pg/M, Zresult.H, Zop1.H"
  public static unsafe Vector<ushort> AbsoluteDifference(Vector<ushort> left, ushort right);

    /// svuint32_t svabd[_n_u32]_m(svbool_t pg, svuint32_t op1, uint32_t op2) : "UABD Ztied1.S, Pg/M, Ztied1.S, Zop2.S" or "MOVPRFX Zresult, Zop1; UABD Zresult.S, Pg/M, Zresult.S, Zop2.S"
    /// svuint32_t svabd[_n_u32]_x(svbool_t pg, svuint32_t op1, uint32_t op2) : "UABD Ztied1.S, Pg/M, Ztied1.S, Zop2.S" or "UABD Ztied2.S, Pg/M, Ztied2.S, Zop1.S" or "MOVPRFX Zresult, Zop1; UABD Zresult.S, Pg/M, Zresult.S, Zop2.S"
    /// svuint32_t svabd[_n_u32]_z(svbool_t pg, svuint32_t op1, uint32_t op2) : "MOVPRFX Zresult.S, Pg/Z, Zop1.S; UABD Zresult.S, Pg/M, Zresult.S, Zop2.S" or "MOVPRFX Zresult.S, Pg/Z, Zop2.S; UABD Zresult.S, Pg/M, Zresult.S, Zop1.S"
  public static unsafe Vector<uint> AbsoluteDifference(Vector<uint> left, uint right);

    /// svuint64_t svabd[_n_u64]_m(svbool_t pg, svuint64_t op1, uint64_t op2) : "UABD Ztied1.D, Pg/M, Ztied1.D, Zop2.D" or "MOVPRFX Zresult, Zop1; UABD Zresult.D, Pg/M, Zresult.D, Zop2.D"
    /// svuint64_t svabd[_n_u64]_x(svbool_t pg, svuint64_t op1, uint64_t op2) : "UABD Ztied1.D, Pg/M, Ztied1.D, Zop2.D" or "UABD Ztied2.D, Pg/M, Ztied2.D, Zop1.D" or "MOVPRFX Zresult, Zop1; UABD Zresult.D, Pg/M, Zresult.D, Zop2.D"
    /// svuint64_t svabd[_n_u64]_z(svbool_t pg, svuint64_t op1, uint64_t op2) : "MOVPRFX Zresult.D, Pg/Z, Zop1.D; UABD Zresult.D, Pg/M, Zresult.D, Zop2.D" or "MOVPRFX Zresult.D, Pg/Z, Zop2.D; UABD Zresult.D, Pg/M, Zresult.D, Zop1.D"
  public static unsafe Vector<ulong> AbsoluteDifference(Vector<ulong> left, ulong right);

....SNIP.....

    /// SubtractSaturate : Saturating subtract

    /// svint8_t svqsub[_s8](svint8_t op1, svint8_t op2) : "SQSUB Zresult.B, Zop1.B, Zop2.B"
  public static unsafe Vector<sbyte> SubtractSaturate(Vector<sbyte> left, Vector<sbyte> right);

    /// svint16_t svqsub[_s16](svint16_t op1, svint16_t op2) : "SQSUB Zresult.H, Zop1.H, Zop2.H"
  public static unsafe Vector<short> SubtractSaturate(Vector<short> left, Vector<short> right);

    /// svint32_t svqsub[_s32](svint32_t op1, svint32_t op2) : "SQSUB Zresult.S, Zop1.S, Zop2.S"
  public static unsafe Vector<int> SubtractSaturate(Vector<int> left, Vector<int> right);

    /// svint64_t svqsub[_s64](svint64_t op1, svint64_t op2) : "SQSUB Zresult.D, Zop1.D, Zop2.D"
  public static unsafe Vector<long> SubtractSaturate(Vector<long> left, Vector<long> right);

    /// svuint8_t svqsub[_u8](svuint8_t op1, svuint8_t op2) : "UQSUB Zresult.B, Zop1.B, Zop2.B"
  public static unsafe Vector<byte> SubtractSaturate(Vector<byte> left, Vector<byte> right);

    /// svuint16_t svqsub[_u16](svuint16_t op1, svuint16_t op2) : "UQSUB Zresult.H, Zop1.H, Zop2.H"
  public static unsafe Vector<ushort> SubtractSaturate(Vector<ushort> left, Vector<ushort> right);

    /// svuint32_t svqsub[_u32](svuint32_t op1, svuint32_t op2) : "UQSUB Zresult.S, Zop1.S, Zop2.S"
  public static unsafe Vector<uint> SubtractSaturate(Vector<uint> left, Vector<uint> right);

    /// svuint64_t svqsub[_u64](svuint64_t op1, svuint64_t op2) : "UQSUB Zresult.D, Zop1.D, Zop2.D"
  public static unsafe Vector<ulong> SubtractSaturate(Vector<ulong> left, Vector<ulong> right);

    /// svint8_t svqsub[_n_s8](svint8_t op1, int8_t op2) : "SQSUB Ztied1.B, Ztied1.B, #op2" or "SQADD Ztied1.B, Ztied1.B, #-op2" or "SQSUB Zresult.B, Zop1.B, Zop2.B"
  public static unsafe Vector<sbyte> SubtractSaturate(Vector<sbyte> left, sbyte right);

    /// svint16_t svqsub[_n_s16](svint16_t op1, int16_t op2) : "SQSUB Ztied1.H, Ztied1.H, #op2" or "SQADD Ztied1.H, Ztied1.H, #-op2" or "SQSUB Zresult.H, Zop1.H, Zop2.H"
  public static unsafe Vector<short> SubtractSaturate(Vector<short> left, short right);

    /// svint32_t svqsub[_n_s32](svint32_t op1, int32_t op2) : "SQSUB Ztied1.S, Ztied1.S, #op2" or "SQADD Ztied1.S, Ztied1.S, #-op2" or "SQSUB Zresult.S, Zop1.S, Zop2.S"
  public static unsafe Vector<int> SubtractSaturate(Vector<int> left, int right);

    /// svint64_t svqsub[_n_s64](svint64_t op1, int64_t op2) : "SQSUB Ztied1.D, Ztied1.D, #op2" or "SQADD Ztied1.D, Ztied1.D, #-op2" or "SQSUB Zresult.D, Zop1.D, Zop2.D"
  public static unsafe Vector<long> SubtractSaturate(Vector<long> left, long right);

    /// svuint8_t svqsub[_n_u8](svuint8_t op1, uint8_t op2) : "UQSUB Ztied1.B, Ztied1.B, #op2" or "UQSUB Zresult.B, Zop1.B, Zop2.B"
  public static unsafe Vector<byte> SubtractSaturate(Vector<byte> left, byte right);

    /// svuint16_t svqsub[_n_u16](svuint16_t op1, uint16_t op2) : "UQSUB Ztied1.H, Ztied1.H, #op2" or "UQSUB Zresult.H, Zop1.H, Zop2.H"
  public static unsafe Vector<ushort> SubtractSaturate(Vector<ushort> left, ushort right);

    /// svuint32_t svqsub[_n_u32](svuint32_t op1, uint32_t op2) : "UQSUB Ztied1.S, Ztied1.S, #op2" or "UQSUB Zresult.S, Zop1.S, Zop2.S"
  public static unsafe Vector<uint> SubtractSaturate(Vector<uint> left, uint right);

    /// svuint64_t svqsub[_n_u64](svuint64_t op1, uint64_t op2) : "UQSUB Ztied1.D, Ztied1.D, #op2" or "UQSUB Zresult.D, Zop1.D, Zop2.D"
  public static unsafe Vector<ulong> SubtractSaturate(Vector<ulong> left, ulong right);

  // total ACLE covered:      390
  // total method signatures: 158
  // total method names:      11
}
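
For context, a rough sketch of how a couple of the SubtractSaturate overloads above might look at a call site (this assumes the Sve class shape proposed in this issue; it is illustrative only and won't compile against anything that exists today):

  // Element-wise saturating subtract of two full vectors (UQSUB for byte).
  public static Vector<byte> DimByVector(Vector<byte> values, Vector<byte> amounts)
    => Sve.SubtractSaturate(values, amounts);

  // Scalar overload: subtract the same amount from every lane, clamping at
  // zero rather than wrapping. Corresponds to the svqsub[_n_u8] form above.
  public static Vector<byte> DimByScalar(Vector<byte> values)
    => Sve.SubtractSaturate(values, (byte)16);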

Adding FEAT_SVE2 and the other SVE extensions gets us to 4061 methods in 50 groups, although I still need to do some further parsing for those other extensions.

There's a lot of parsing involved to fix up naming and types, and predicates have been stripped out where necessary (e.g. pg in the methods above).

If there's nothing obvious that needs fixing in the above, then I can start posting a few of these as API requests next week.

tannergooding commented 9 months ago

At a glance those look correct.

Note that, for the purposes of API review, having bigger issues is fine and a lot of the information can be compressed down.

Consider for example how we did AVX512F here: https://github.com/dotnet/runtime/issues/73604. We have all the necessary APIs, but we don't provide information unnecessary to API review such as what instruction or native API it maps to.

Given how large SVE and SVE2 are, I wouldn't consider it unreasonable to compress the data further. For example, you could imagine it represented as:

// where T is byte, sbyte, short, ushort, int, uint, long, and ulong
public static Vector<T> SubtractSaturate(Vector<T> left, Vector<T> right);

We aren't going to reject an API for a particular T here and so we really care mostly about the general signature, name, and exactly what T will be supported.

For public static Vector<T> SubtractSaturate(Vector<T> left, T right);, it looks like this is effectively a special encoding of SubtractSaturate(left, Broadcast(right)); when right is a constant in range (typically 0-255 or a multiple of 256), is that correct?

This is going to be very hard to represent to users "correctly" as it requires right to be a constant. We can support contiguous ranges using [ConstantExpected] but we don't have a way to represent something like "multiple of 256 between [256, 65280]". So I wonder if it would be better to simply expose those as SubtractSaturate(left, Broadcast(right)); and optimize it down when the input is valid. Thoughts?
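
To make that concrete, here's a sketch of the two shapes being weighed up (Broadcast stands in for whatever duplicate-scalar-to-vector helper the API ends up with, and the attribute usage is illustrative only):

  // Option A: keep the scalar overload and constrain it at the call site.
  // [ConstantExpected] can only describe a contiguous range, so it cannot
  // express "0-255, or a multiple of 256 up to 65280".
  public static Vector<uint> SubtractSaturate(Vector<uint> left, [ConstantExpected] uint right);

  // Option B: expose only the vector form; when the broadcast value is a
  // constant in range, the JIT can fold it down to the immediate encoding
  // (UQSUB Ztied.S, Ztied.S, #imm), otherwise it emits the register form.
  Vector<uint> result = Sve.SubtractSaturate(left, Sve.Broadcast(256u));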

a74nh commented 9 months ago

Given how large SVE and SVE2 are, I wouldn't consider it unreasonable to compress the data further. For example, you could imagine it represented as:

// where T is byte, sbyte, short, ushort, int, uint, long, and ulong
public static Vector<T> SubtractSaturate(Vector<T> left, Vector<T> right);

It's actually more complicated to reduce it down to T versions: I'm starting with the C ACLE, which is already split out per type.

For quite a few methods there are two element types involved (usually one is twice the size of the other), e.g.:

  public static unsafe Vector<uint> AbsoluteDifferenceAddWideningLower(Vector<uint> op1, Vector<ushort> op2, ushort op3);

And then there are others that take scalars, and it's not immediately obvious whether the scalar should be a T or is always a fixed type. E.g. for a load(), the index should always be long, but in a math operation a long is probably matched to T.
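
As a concrete illustration of those two cases (the load signature here is made up for illustration, while the arithmetic one is taken from the list above):

  // Scalar that is always a fixed type: an element index into memory stays
  // long no matter what element type is being loaded.
  public static unsafe Vector<int> LoadVector(int* address, long vectorIndex);

  // Scalar that follows T: in an arithmetic op the scalar matches the
  // element type, so Vector<long> pairs with a long.
  public static unsafe Vector<long> SubtractSaturate(Vector<long> left, long right);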

However, given everything below, this might become obvious.

We aren't going to reject an API for a particular T here and so we really care mostly about the general signature, name, and exactly what T will be supported.

Ok, that makes sense.

For public static Vector<T> SubtractSaturate(Vector<T> left, T right);, it looks like this is effectively a special encoding of SubtractSaturate(left, Broadcast(right)); when right is a constant in range (typically 0-255 or a multiple of 256), is that correct?

This is going to be very hard to represent to users "correctly" as it requires right to be a constant. We can support contiguous ranges using [ConstantExpected] but we don't have a way to represent something like "multiple of 256 between [256, 65280]". So I wonder if it would be better to simply expose those as SubtractSaturate(left, Broadcast(right)); and optimize it down when the input is valid. Thoughts?

The way this is done in the C ACLE is:

When op2 is in range[1], this intrinsic can use: SQSUB Ztied1.S, Ztied1.S, #op2 // [2]

When -op2 is in range[1], this intrinsic can use: SQADD Ztied1.S, Ztied1.S, #-op2 // [2]

The general implementation is: SQSUB Zresult.S, Zop1.S, Zop2.S

[1] This is true if the value is in the range [0, 255] or is a multiple of 256 in the range [0, 65280]

[2] If instead result is in a different register from the inputs, the compiler can add a preceding MOVPRFX (unpredicated) instruction

.... which is effectively the same thing, just with different choices about when/how the optimal version is selected.
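
So the immediate form is only usable when the constant passes a check along these lines (a sketch of the quoted range rule for the 32-bit unsigned case; the helper name is hypothetical):

  static bool IsValidUqsubImmediate(uint value)
  {
    // [1] above: in [0, 255], or a multiple of 256 in [0, 65280].
    return value <= 255 || (value % 256 == 0 && value <= 65280);
  }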

Given that the C# API is higher level than a direct mapping down to the architecture, it makes sense to implicitly optimise where possible and expose just the vector-only version.

A quick grep suggests that might reduce the API by another 1686 methods.