dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

Support for intrinsics with multiple execution engines with divergent capabilities #74587

Open lambdageek opened 2 years ago

lambdageek commented 2 years ago

See https://github.com/dotnet/runtime/issues/73454 (issue 2) for the motivation.

Currently, BCL intrinsics support is defined using static classes that provide one intrinsic function per operation (for example System.Runtime.Intrinsics.Arm.AdvSimd.AbsoluteDifferenceWideningUpperAndAdd(Vector128&lt;Int32&gt;, Vector128&lt;Int16&gt;, Vector128&lt;Int16&gt;)), together with an IsSupported property that can be used to decide whether the whole class of intrinsic operations is supported or not.

A natural approach is to write two classes that implement the same interface: a specialized version that takes advantage of intrinsics, and a slow version that uses baseline scalar math.

  interface IMyAlgorithms {
    short MyMathAlgorithm(ReadOnlySpan<short> input);
  }

  class MyFastAlgorithms : IMyAlgorithms {
    public short MyMathAlgorithm(ReadOnlySpan<short> input) {
      /* uses Arm.AdvSimd.AbsoluteDifferenceWideningUpperAndAdd(Vector128<Int32>, Vector128<Int16>, Vector128<Int16>) */
    }
  }

  class MySlowAlgorithms : IMyAlgorithms {
    public short MyMathAlgorithm(ReadOnlySpan<short> input) {
      /* uses normal scalar math operations */
    }
  }

  class Program {
    public static void Main() {
      IMyAlgorithms algo = Arm.AdvSimd.IsSupported ? new MyFastAlgorithms() : new MySlowAlgorithms();
      /* read inputs, etc. */
      algo.MyMathAlgorithm(inputs);
    }
  }

The decision of whether to use the fast or the slow version of the algorithms is made upfront based on processor capabilities available at runtime.


The problem is that in Mono (and CoreCLR) the code that makes the decision may run under a different execution engine than the code that actually implements the algorithm.

This can lead to problems if the "fallback" execution engine (a JIT or interpreter) does not support the intrinsics while the preferred execution engine (e.g. LLVM AOT) does support them.

If we cannot AOT compile MyFastAlgorithms.MyMathAlgorithm for some reason, but we can AOT compile Program.Main, we will get AdvSimd.IsSupported == true (because LLVM AOT supports the AdvSimd intrinsics), but at runtime calling MyFastAlgorithms.MyMathAlgorithm will throw PlatformNotSupportedException (because the interpreter and Mono JIT do not support the intrinsics).


There are two issues here:

  1. If the preferred execution engine says IsSupported, the fallback execution engine cannot say !IsSupported. We must implement some kind of support for the intrinsics in the fallback execution engine.
  2. There should be a mechanism to mark MyFastAlgorithms.MyMathAlgorithm so that the AOT compiler must successfully AOT it, or issue a diagnostic if that is not possible. The goal is to help algorithm authors write code that the AOT compiler can support so that the fallbacks from (1) are never used.

Support in the fallback execution engine.

We have a couple of options:

  1. Implement fallback versions of the intrinsics in a separate C# class and teach the JIT and the interpreter to substitute calls to AdvSimd.AbsoluteDifferenceWideningUpperAndAdd with FallbackAdvSimd.AbsoluteDifferenceWideningUpperAndAdd. This is still a ton of work, but the C# versions are easier to implement than JIT & interp support in C. These won't be fast, but they will at least provide a baseline (that could also be useful for testing).
  2. Faithfully implement support for all the intrinsics in the mono interpreter and the mono JIT. This is possible but represents a huge amount of work. There are hundreds of intrinsics.
  3. Modify the AOT compiler to always emit method bodies for the intrinsic functions and "support" the intrinsics in the JIT and interpreter by emitting calls to the AOTed versions. This will be slow, but it's probably the least duplicated work because we can take advantage of the work already done for the LLVM backend.


Support marking certain methods as "must AOT".

We could add an attribute that informs the AOT compiler that a certain method must be AOTed, if certain conditions are true:

class MyFastAlgorithms : IMyAlgorithms {
    [MustAOTWhen(typeof(System.Runtime.Intrinsics.Arm.AdvSimd), nameof(System.Runtime.Intrinsics.Arm.AdvSimd.IsSupported))]
    public short MyMathAlgorithm(ReadOnlySpan<short> input) {
        /* uses Arm.AdvSimd.AbsoluteDifferenceWideningUpperAndAdd(Vector128<Int32>, Vector128<Int16>, Vector128<Int16>) */
    }
}

Any method marked with MustAOTWhen will be AOTed if the AOT compiler is in a mode such that the given property is statically known to be true. If the method uses an unsupported mechanism, the AOT compiler can issue a diagnostic.
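The attribute itself could be quite small. The following is a sketch of one possible shape; no such attribute exists today, and the name and constructor signature are illustrative only:

```csharp
using System;

// Hypothetical attribute sketch; not part of the BCL.
// The AOT compiler would read this metadata and require successful
// AOT compilation of the method whenever the named static boolean
// property is statically known to be true in the current AOT mode.
[AttributeUsage(AttributeTargets.Method, Inherited = false)]
public sealed class MustAOTWhenAttribute : Attribute
{
    public Type DeclaringType { get; }
    public string PropertyName { get; }

    public MustAOTWhenAttribute(Type declaringType, string propertyName)
    {
        DeclaringType = declaringType;   // e.g. typeof(AdvSimd)
        PropertyName = propertyName;     // e.g. nameof(AdvSimd.IsSupported)
    }
}
```

Since the attribute is purely declarative, it costs nothing at runtime; only the AOT compiler inspects it.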

lambdageek commented 2 years ago

/cc @davidwrighton @SamMonoRT @fanyang-mono @vargaz

lambdageek commented 2 years ago

This isn't an exhaustive list of options. Primarily I wanted to capture the problem. We may be able to come up with alternate approaches to resolving it.

lambdageek commented 2 years ago

FYI @tannergooding, appreciate any insights you might have

tannergooding commented 2 years ago

> Implement fallback versions of the intrinsics in a separate C# class and teach the JIT and the interpreter to substitute calls to AdvSimd.AbsoluteDifferenceWideningUpperAndAdd to FallbackAdvSimd.AbsoluteDifferenceWideningUpperAndAdd. This is still a ton of work, but the C# versions are easier to implement than JIT&interp support in C. These won't be fast but they will at least provide a baseline (that could also be useful for testing).

This is, IMO, the least desirable and most expensive option. I wouldn't agree that it's easier to implement than JIT/interp support, either.

Adding support for new intrinsics to RyuJIT is "trivial" the vast majority of the time. It is almost exclusively table driven and effectively just requires that you support the VEX encoding.

We are about to add support for AVX-512 in .NET 8 as well and it is expected that once we add EVEX encoding support to the emitter, that each .NET API with a unique name is effectively just a new line in hwintrinsiclistxarch.h and potentially new encoding metadata in instrsxarch.h for instructions which are entirely new (and not just exposing a 512-bit variant).

Mono could, and likely should, have a similar metadata driven approach in which case it would be equally simple to add new intrinsic support there.

In the case of something like the interpreter, such a metadata table could have a mapping to the corresponding C intrinsic (e.g. Sse.Add maps to _mm_add_ps). This would similarly allow a fast and entirely table driven approach that does the "right" thing.
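Mono's actual tables would live in C, but the idea can be sketched in C#: intrinsic support becomes a row of data per API rather than hand-written code per API. All names and the table shape below are illustrative, not Mono's real data structures:

```csharp
using System.Collections.Generic;

// Illustrative sketch only; not Mono's actual metadata tables.
// Each managed intrinsic maps to the name of the corresponding C
// compiler intrinsic, so interpreter support is data, not code.
static class IntrinsicTable
{
    public static readonly Dictionary<string, string> ManagedToNative = new()
    {
        ["System.Runtime.Intrinsics.X86.Sse.Add"]      = "_mm_add_ps",
        ["System.Runtime.Intrinsics.X86.Sse.Multiply"] = "_mm_mul_ps",
        // real tables would also key on element type, vector width, etc.
    };
}
```

Adding a new intrinsic then means adding a row, which is the property that makes the RyuJIT approach cheap to extend.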

> Faithfully implement support for all the intrinsics in the mono interpreter and the mono JIT. This is possible but represents a huge amount of work. There are hundreds of intrinsics.

As indicated above, I don't think this is as much work as thought, provided that Mono goes about it in a manner similar to how RyuJIT did. It's still work, and quite a bit, but it's not like doing this in C#, where we'd have to determine the actual behavior of each instruction and exactly emulate it plus any relevant quirks.

> Modify the AOT compiler to always emit method bodies for the intrinsic functions and "support" the intrinsics in the JIT and interpreter by emitting calls to the AOTed versions. This will be slow, but it's probably the least duplicated work because we can take advantage of the work already done for the LLVM backend.

I think this would be viable as well and is effectively required for "indirect" invocation anyways (e.g. if a user calls Sse.Add via reflection or a delegate, etc). It's slow, but correct.


> Any method marked with MustAOTWhen

Conceivably the AOT compiler could just say that any method which uses a hardware intrinsic must be AOT'd.

It could also say that any method which uses hardware intrinsics not within a corresponding if (Isa.IsSupported) check must be AOT'd if it wanted to be slightly less restrictive.
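For illustration, the guarded shape that this less restrictive rule would look for is an intrinsic use lexically inside its own IsSupported check, with a software fallback on the other path. The sketch below uses Sse.Add rather than the AdvSimd example for brevity; the dispatch shape, not the particular operation, is the point:

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class Guarded
{
    // The intrinsic call is dominated by the IsSupported check, so a
    // compiler can prove the fast path is dead when the check is false.
    public static Vector128<float> Add(Vector128<float> x, Vector128<float> y)
    {
        if (Sse.IsSupported)
        {
            return Sse.Add(x, y);        // hardware path
        }
        return SoftwareAdd(x, y);        // scalar fallback
    }

    private static Vector128<float> SoftwareAdd(Vector128<float> x, Vector128<float> y)
    {
        var result = Vector128<float>.Zero;
        for (int i = 0; i < Vector128<float>.Count; i++)
            result = result.WithElement(i, x.GetElement(i) + y.GetElement(i));
        return result;
    }
}
```

A method written this way is safe to leave un-AOTed, because the fallback engine will simply take the scalar branch.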

tannergooding commented 2 years ago

I believe CG2 (crossgen2) handles this differently for corelib than for other libraries.

corelib is treated specially: we assume that we always do the right thing and that the two paths are equivalent.

For other libraries, it assumes the two paths may differ in behavior, and so I believe it forces jitting where IsSupported checks were present that couldn't be emitted as dynamic checks.

We can of course rely on the JIT being present for that scenario, so it's fine for us. NAOT doesn't have any issues because it exactly targets a given baseline.

lambdageek commented 2 years ago

> Conceivably the AOT compiler could just say that any method which uses a hardware intrinsic must be AOT'd.

That will slow down AOT compilation time. We would have to look at every method in every assembly.

jkotas commented 2 years ago

> That will slow down AOT compilation time. We would have to look at every method in every assembly.

Yes, how much? A simple scan of all methods in the app, without generating any code, should be pretty fast.
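As a rough illustration of how cheap such a pass can be, one could walk assembly metadata without generating code and flag references into the hardware-intrinsics namespaces. The sketch below uses the real System.Reflection.Metadata API; a real "must AOT" scan would additionally decode each method body's IL to attribute the calls to specific methods, which this simplified pass does not do:

```csharp
using System;
using System.IO;
using System.Reflection.Metadata;
using System.Reflection.PortableExecutable;

// Sketch: count member references into the hardware-intrinsics namespaces
// by reading metadata tables only, without decoding any method bodies.
static class IntrinsicRefScanner
{
    public static int CountIntrinsicRefs(string assemblyPath)
    {
        using var stream = File.OpenRead(assemblyPath);
        using var peReader = new PEReader(stream);
        MetadataReader md = peReader.GetMetadataReader();

        int count = 0;
        foreach (MemberReferenceHandle handle in md.MemberReferences)
        {
            MemberReference mr = md.GetMemberReference(handle);
            if (mr.Parent.Kind != HandleKind.TypeReference)
                continue;
            var tr = md.GetTypeReference((TypeReferenceHandle)mr.Parent);
            string ns = md.GetString(tr.Namespace);
            if (ns.StartsWith("System.Runtime.Intrinsics", StringComparison.Ordinal))
                count++;
        }
        return count;
    }
}
```

Because this touches only the MemberRef table, the cost scales with metadata size rather than code size.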

lambdageek commented 2 years ago

> Mono could, and likely should, have a similar metadata driven approach in which case it would be equally simple to add new intrinsic support there.

In theory Mono is using some kind of table-driven approach for this, too, but evidently we still end up with a lot of manual plumbing. It would be good to investigate where the gap is. @SamMonoRT


> > Modify the AOT compiler to always emit method bodies for the intrinsic functions and "support" the intrinsics in the JIT and interpreter by emitting calls to the AOTed versions. This will be slow, but it's probably the least duplicated work because we can take advantage of the work already done for the LLVM backend.
>
> I think this would be viable as well and is effectively required for "indirect" invocation anyways (e.g. if a user calls Sse.Add via reflection or a delegate, etc). It's slow, but correct.

I think this is probably the right place to start for Mono in .NET 8. That will at least get us back to "correct", and we can measure AOT size regressions and decide whether it makes sense to add faster support to the fallback execution engines.


> A simple scan of all methods in the app, without generating any code, should be pretty fast.

This is also probably a good place to start - we could make a table of "must AOT" methods and measure the impact on compilation time and see if there are false positives.

lambdageek commented 2 years ago

/cc @vargaz @BrzVlad

vargaz commented 2 years ago

Implementing the intrinsics in the mono JIT is doable; the bigger problem is that the generated code will not be of very good quality, since the SIMD code in the BCL assumes a good optimizing compiler.