[mono] Investigate rewriting Scalar<T> methods so the linker can replace them with stubs

ivanpovazan commented 2 years ago

Scalar is defined with struct constraint, but additionally in many places (in different methods) there are runtime checks to check for a specific type support. If the type is not supported the method just throws. AOTing such pattern with Mono results in unnecessary large and slow methods.

Rewriting methods of Scalar in a way that linker can replace them with stubs can potentially reduce the size of the AOTed code by Mono.

This is related to #56385

cc @vargaz

tannergooding commented 2 years ago

The pattern that Scalar<T> follows is a fairly standard pattern throughout the .NET ecosystem, particularly in high performance code.

I think it would likely be better, long term, for Mono to adjust itself to better handle the pattern.

ivanpovazan commented 2 years ago

hi @tannergooding, the idea was to create a tracking issue for investigating whether it is possible to rewrite the Scalar<T> so its code could get trimmed down (especially since it is, as far as I understand, used only internally), which would help mono AOT to produce smaller binaries. Maybe I should have labeled it differently since it is an investigation rather than a problem (please let me know if you want me to change the labels).

On the other hand, yes, you are right, there is a more concrete solution tracked here: https://github.com/dotnet/runtime/issues/71431 while this one is a faster approach to see the potential gains.

tannergooding commented 2 years ago

Maybe I should have labeled it differently since it is an investigation rather than a problem (please let me know if you want me to change the labels).

I think the confusing part is simply that it says Rewrite rather than something like Investigate rewriting.

Scalar<T> is largely a helper type meant for platforms without SIMD acceleration and as a fallback for reflection, delegates, or other forms of "indirect invocation" (even on platforms with SIMD acceleration).

We could make it smaller for platforms with SIMD acceleration by making them recursive on one path (much like the Hardware Intrinsics do), that way the Scalar<T> path is "dead code" except for platforms like Arm32.

But otherwise, I don't expect there is a simplification, especially not one that is expressible in C#. I don't think representing the fallback path in the JIT is a good idea as it defeats the purpose of having a fallback and requires "more bringup" work for new or community supported platforms/architectures.

huoyaoyuan commented 2 years ago

Where is the opportunity to rewrite such methods with generic math? Such pattern is often used for high performance code. The current generic math type system lacks a way to represent the check around constraint. Further improvement should take this in mind.

tannergooding commented 2 years ago

Generic Math isn't possible for this scenario right now because there is no way to go from something "less constrained" to something "more constrained"

That is, given where T : struct, you can't call something where T : struct, INumber<T> because the language and runtime have no way to make this a valid check, etc.

Supporting this scenario in the future has been raised and will likely be taken into consideration

lambdageek commented 2 years ago

I think it would likely be better, long term, for Mono to adjust itself to better handle the pattern.

The problem with this pattern is that it causes us to generate a ton of fallback code for cases that can't happen. For any concrete instantiation with a supported type the code is fine: the typeswitch compiles down to a single branch and the overall function is usually small enough to get inlined into the caller, etc.

But we also have to generate a universal generic fallback. This thing by definition doesn't know anything about the generic type so it has to generate a huge type switch and use the slower gsharedvt (universal generics / boxing / dictionary passing) approach.

There are principled things we can do (especially if aided by hints to the AOT compiler about which specializations are profitable to generate), but getting there will take some engineering effort. In a closed universe like an iOS application, rewriting with the IL trimmer may give us an indication of the potential space savings if nothing else.

tannergooding commented 2 years ago

Sure, but my point is that this is a very common pattern in perf oriented code.

Within corelib its used by vector code, its used by generic math, its also used in various other hot paths to ensure specialization occurs. Outside the ecosystem its used by all sorts of libraries and applications.

So this is something that will ultimately have to be handled or it will continue presenting as a negative experience for Mono. Not just here, but throughout the entire ecosystem.

This thing by definition doesn't know anything about the generic type so it has to generate a huge type switch and use the slower gsharedvt (universal generics / boxing / dictionary passing) approach.

It seems like, at the very least you could recognize the if (typeof(T) == typeof(...)) pattern. For sealed types, like value types, this is definitively "live" or "dead". It's only for unsealed reference types where that can be truly unknown, but that scenario is also increasingly uncommon since RyuJIT only specializes on value types.

It would then seem reasonable to assume that with such a pattern, the specializations to consider are exactly the typeof(...) checks; most of which could be trimmed in a real application since they are unused.

In the case of things like Scalar<T> one of the paths (the fallback path) always throws and so the compiler should also be able to recognize this and keep that path as the "shared" path.

jkotas commented 2 years ago

It would then seem reasonable to assume that with such a pattern, the specializations to consider are exactly the typeof(...) checks; most of which could be trimmed in a real application since they are unused.

It is typically impossible to tell whether these specializations are unused thanks to reflection.

tannergooding commented 2 years ago

We have attributes and other annotations designed to assist with that. I don't think its unreasonable to require or expect those to be used here as well.

These types of patterns aren't going to go away. In the worst case the AOT codegen for this pattern could be n concrete (12 in this case) implementations and a central shared implementation that does the check and dispatch to the concrete implementation. Given the pattern, it would in general be no larger than a single shared implementation that contains all the paths, particularly given the generic on generic calls.

There are always exceptions, yes, but those tend to be rarer especially for this specific pattern.

jkotas commented 2 years ago

The existing trimming annotations do not track exact generic instantiations. They only allow you to say that MyType<T> is required, for unknown set of Ts.

These types of patterns aren't going to go away.

The problems that these patterns are trying to solve are not going away. I have doubts that the current popular patterns are the right way to solve these problems. These patterns are bad for startup, AOT or trimming. They are good only if your only performance metric is steady state RPS.

marek-safar commented 2 years ago

I get the feeling this is the old problem of how to tell AOT compiler that the method will never be called dynamically (e.g. reflection) again.

tannergooding commented 2 years ago

The existing trimming annotations do not track exact generic instantiations.

That sounds like a missing hole we should address.

I have doubts that the current popular patterns are the right way to solve these problems.

I'm not aware of any alternatives that provide equivalent, especially where we have back-compat requirements that prevent versioning. For example Scalar<T> could be simpler with generic math, but since changing a generic constraint is breaking and we don't have a way "bridge the gap" from less constrained to more constrained today, we're stuck with the "worse" implementation for right now.

If we have alternatives, then I'd be happy to move to them.

These patterns are bad for startup, AOT or trimming. They are good only if your only performance metric is steady state RPS.

I think that is dependent on whether reflection is used or not. At least personally, I've not seen any issues with generics and reflection free code and I've made quite heavy usage of it + NAOT and friends in various personal projects.

vitek-karas commented 2 years ago

Just wanted to add what trimming can do (and should work already): If there's a if (some == typeof(MyType)) this does not create a "hard reference" to MyType, meaning if this is the only mention of MyType linker will remove that type (and rewrite the check to be if (some == null)). Currently we don't have the ability to remove the branch in the if in this case, but I could imagine doing that one day if the size win would be interesting.

Figuring out that we don't need the fallback paths is much harder, but potentially doable, IF we assume the app will not produce any trim warnings. Given that currently most our verticals produce warnings, it would make such optimization rather risky.

vitek-karas commented 2 years ago

The existing trimming annotations do not track exact generic instantiations.

That sounds like a missing hole we should address.

It's not necessary for correctness, but it would come handy for size reduction in this case. So far we've been focusing mostly on correctness - "if it doesn't work, it really doesn't matter if it's small or not".

It basically boils down to usage of MakeGenericType - if the app doesn't do this, or only does it in a way where static analysis can figure out its inputs, then this is doable.

jkotas commented 2 years ago

Even harder problem is trimming of interface implementations that escape via casting. Once we see that an interface is used, we keep it on all types that implement that interface and expand it on all instantiations of that type if it is generic. It has been a classic problem with trimming of IEnumerable and other frequently implemented interfaces. It is likely going to become a problem with the new numeric interfaces once people start using them.

If we have alternatives, then I'd be happy to move to them.

I do not think we have any good alternatives today. One can avoid these problems by avoiding generics and interface abstractions, but that's not really viable alternative in many cases.

Creating better ways to solve these problems would require creating new fundamental language/runtime features.

ivanpovazan commented 2 years ago

Result

With a toy implementation of custom linker substitutions for replacing Scalar<T> method bodies with stubs, the reduction of approximately 55kb has been recorded for HelloWorld iOS application.

Assumption

The assumption for this attempt was that when targeting arm64 devices, Scalar<T> is unreachable, as the Vector*<T> methods using Scalar<T> as the fallback, would be become intrinsified on this architecture.

Problems

Problems with this assumption, apart from the indirect calls to Scalar<T> (eg: reflection), is that Vector256<T> intrinsics are not supported on arm64, which implies that Scalar<T> cannot be simply stripped out. There are some other cases for certain operations where hardware acceleration is also not supported even for Vector128<T>, which was pointed out by @tannergooding.

Conclusion

The size reduction of 55kb for a xamarin build (see comparison here), would mean a 0.89% improvement on the app size. (I am currently trying to perform the same measurements for .net7 MAUI).

From these 55kb, around 53kb is taken up by GSHAREDVT variants (generic sharing for value types) of Scalar<T> methods. The question is, whether these numbers would justify the effort for implementing more complex logic in the linker to do this kind of nontrivial trimming.

ivanpovazan commented 2 years ago

Moving to 8.0.0 milestone. This will be revisited for .NET8 planning

kotlarmilos commented 1 year ago

As we are progressing with Mono intrinsics, it is good to review this issue and try to conclude if the Scalar fallback code can be trimmed when SIMD is enabled. Initial idea was to add a substitution to the ILLinker for a fallback code. However, as @fanyang-mono mentioned, with intrinsics implemented in Vector classes, Scalar class will not be referenced, and thus the substitution might not be needed.

Another topic is reflection.

I get the feeling this is the old problem of how to tell AOT compiler that the method will never be called dynamically (e.g. reflection) again.

Concerns regarding Vector256<T> intrinsics not supported on arm64.

Problems with this assumption, apart from the indirect calls to Scalar<T> (eg: reflection), is that Vector256<T> intrinsics are not supported on arm64, which implies that Scalar<T> cannot be simply stripped out. There are some other cases for certain operations where hardware acceleration is also not supported even for Vector128<T>, which was pointed out by @tannergooding.

Let's continue the discussion and see if there is a case where size savings can be achieved.

cc @fanyang-mono @jandupej @ivanpovazan @LeVladIonescu @vargaz @lambdageek @SamMonoRT

tannergooding commented 1 year ago

Some of this was refactored. Generally speaking, we now have the following for the software fallbacks:

Vector512 is 2x Vector256 ops
Vector256 is 2x Vector128 ops
Vector128 is 2x Vector64 ops
Vector64 is a for loop over Scalar<T>

In all cases, we have a restriction that T be one of 12 specific types. Any other type will statically throw a NotSupportedException (the one exception is Vector###<T>.IsSupported which is constant true/false based on T):

byte, sbyte
short, ushort
int, uint
long, ulong
nint, nuint
float, double

For RyuJIT:

Arm32 - No acceleration exists today
Arm64 - Vector64 and Vector128 are accelerated as part of the "baseline ISA" requirements.*
x64 -Vector128 is accelerated as part of the "baseline ISA" requirements; Vector256 and Vector512 may be accelerated if hardware supports it; There is no acceleration for Vector64
x86 - Similar to x64, but no acceleration on Unix

Given the refactorings and the intrinsic recognition, I would expect we are getting the Scalar<T> "bloat" on Arm32, x86 Unix, and for the Vector64<T>/Vector128<T> fallback paths on x86/x64. We could improve this for x86 (non unix)/x64 by adding an if (Vector###.IsHardwareAccelerated) { RecuriveCall(args); } path.

Someone else can provide the support details for Mono/WASM, which are being brought online but which aren't as feature complete. Once the work to add the intrinsic support to Mono achieves "parity" with RyuJIT, then I'd expect this to be a non-issue for the same scenarios as RyuJIT.

That would not handle the case where no hardware acceleration exists for a given platform. I still believe the best case scenario for that is to ensure the AOT tooling understands the typeof(T) == typeof(...) pattern for value types and to allow specialization for such scenarios. This is a fairly typical pattern in perf oriented code and understanding it will be more broadly applicable.

kotlarmilos commented 1 year ago

@tannergooding Thank you for such thorough analysis. As you proposed, we are already working on optimizing generics to allow specialization and exclude instances that are not used.

@fanyang-mono According to the analysis, on arm64 it is possible to trim Scalar software callback if acceleration exists, while on x64 it is possible to Vector64 if acceleration exists. Does this make sense?

fanyang-mono commented 1 year ago

@kotlarmilos My understanding is that when SIMD support is fully there, there is nothing needed to be trimmed. When SIMD support is incomplete, which is Mono's current situation, we need them to produce the correct code. The same logic applies to both arm64 and x64.

Another thing that @tannergooding mentioned which we should look into is to make Mono AOT produce typeof(...) when sees typeof(T). Could you open an issue for that? And add it to your Mono AOT size reduction plan for .NET8.

vargaz commented 1 year ago

How is delegates/reflection going to be handled without retaining the fallback code ?

tannergooding commented 1 year ago

How is delegates/reflection going to be handled without retaining the fallback code ?

As per the above, these types were rewritten to:

Vector512 is 2x Vector256 ops
Vector256 is 2x Vector128 ops
Vector128 is 2x Vector64 ops
Vector64 is a for loop over Scalar

This means that on Arm64 with the full JIT support, Vector64 is the only type that has a fallback path that will execute Scalar<T>. The others will end up getting "intrinsified" by virtue of the smaller size being intrinsic.

For x64, this impacts both Vector64 and Vector128 since V64 isn't accelerated at all and the fallback path for V128 is 2x V64 ops. We could workaround this for V128 by adding in a if (V128.IsHardwareAccelerated) { RecursiveCall(...); } path. It would be up to the JIT to treat IsHardwareAccelerated as a constant (RyuJIT already does).

For V64 on x64, however, I don' think there is really a "good option" other than the compiler understanding typeof(T) == typeof(...) checks.

That being said, the common case will be V64 is completely unused on x64 and that there is no reflection or indirect invocation to these methods either. So in the common case the relevant code should be able to get trimmed out.

kotlarmilos commented 1 year ago

@kotlarmilos My understanding is that when SIMD support is fully there, there is nothing needed to be trimmed. When SIMD support is incomplete, which is Mono's current situation, we need them to produce the correct code. The same logic applies to both arm64 and x64.

Make sense. I see it from the implementation perspective as

if (IsHardwareAccelerated) {
   SIMDImplementation();
} else {
   softwareCallback();
}

Is it trimmed by a substitution in ILLinker?

Another thing that @tannergooding mentioned which we should look into is to make Mono AOT produce typeof(...) when sees typeof(T). Could you open an issue for that? And add it to your Mono AOT size reduction plan for .NET8.

I will open an issue and add it to https://github.com/dotnet/runtime/issues/80938.

ivanpovazan commented 1 year ago

@radekdoulik should we move this to .NET9 ?

dotnet / runtime