dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

Allow specialization for typeof(T) pattern in Mono AOT compiler #80980

Open kotlarmilos opened 1 year ago

kotlarmilos commented 1 year ago

Description

In the Mono AOT compiler, the typeof(T) pattern results in unnecessarily large and slow methods, especially in the GSHAREDVT variant (the fallback method for any value type). Making the AOT compiler understand the typeof(T) == typeof(...) pattern for value types and allowing specialization for such scenarios could bring improvements.

It has been discussed in https://github.com/dotnet/runtime/issues/71431 and https://github.com/dotnet/runtime/issues/71430.

Tasks

We still have open questions related to the integration and what is required to introduce it as an experimental feature in .NET 8, so comments and feedback are welcome.

am11 commented 1 year ago

[ ] Investigate whether ILCompiler is able to detect the pattern

It does. For instance, this classlib project:

using System;
using System.Runtime.InteropServices;

#nullable disable
class C
{
    public static bool WithLocal<T>()
    {
        Type k = typeof(T);
        return k == typeof(sbyte) || k == typeof(byte);
    }

    public static bool WithoutLocal<T>()
    {
        return typeof(T) == typeof(sbyte) || typeof(T) == typeof(byte);
    }

    [UnmanagedCallersOnly(EntryPoint = nameof(BogusUsageToKeepFuncsInBinary))]
    public static void BogusUsageToKeepFuncsInBinary() =>
        Console.WriteLine(WithoutLocal<C>() && WithLocal<C>());
}

when built and inspected with:

# current runtime: linux-musl-arm64
$ dotnet8 publish -c Release -o dist --ucr -p:PublishAot=true

$ objdump -x dist/lib7.so | grep -E 'F.*With(out)?Local'
0000000000309730 l     F __managedcode  0000000000000060              .hidden lib7_C__WithLocal<System___Canon>
0000000000309790 l     F __managedcode  0000000000000014              .hidden lib7_C__WithoutLocal<System___Canon>

$ gdb dist/lib7.so -batch \
    -ex "disassemble lib7_C__WithoutLocal<System___Canon>" \
    -ex "disassemble lib7_C__WithLocal<System___Canon>"

gives:

Dump of assembler code for function lib7_C__WithoutLocal<System___Canon>:
   0x0000000000309790 <+0>: stp x29, x30, [sp, #-16]!
   0x0000000000309794 <+4>: mov x29, sp
   0x0000000000309798 <+8>: mov w0, wzr
   0x000000000030979c <+12>:    ldp x29, x30, [sp], #16
   0x00000000003097a0 <+16>:    ret
End of assembler dump.

Dump of assembler code for function lib7_C__WithLocal<System___Canon>:
   0x0000000000309730 <+0>: stp x29, x30, [sp, #-32]!
   0x0000000000309734 <+4>: str x19, [sp, #24]
   0x0000000000309738 <+8>: mov x29, sp
   0x000000000030973c <+12>:    str x0, [x29, #16]
   0x0000000000309740 <+16>:    ldr x0, [x0]
   0x0000000000309744 <+20>:    bl  0x2b30f0 <S_P_CoreLib_Internal_Runtime_CompilerHelpers_LdTokenHelpers__GetRuntimeType>
   0x0000000000309748 <+24>:    mov x19, x0
   0x000000000030974c <+28>:    nop
   0x0000000000309750 <+32>:    adr x0, 0x367f08
   0x0000000000309754 <+36>:    bl  0x2b30f0 <S_P_CoreLib_Internal_Runtime_CompilerHelpers_LdTokenHelpers__GetRuntimeType>
   0x0000000000309758 <+40>:    cmp x0, x19
   0x000000000030975c <+44>:    b.eq    0x309780 <lib7_C__WithLocal<System___Canon>+80>  // b.none
   0x0000000000309760 <+48>:    nop
   0x0000000000309764 <+52>:    adr x0, 0x366250
   0x0000000000309768 <+56>:    bl  0x2b30f0 <S_P_CoreLib_Internal_Runtime_CompilerHelpers_LdTokenHelpers__GetRuntimeType>
   0x000000000030976c <+60>:    cmp x0, x19
   0x0000000000309770 <+64>:    cset    x0, eq  // eq = none
   0x0000000000309774 <+68>:    ldr x19, [sp, #24]
   0x0000000000309778 <+72>:    ldp x29, x30, [sp], #32
   0x000000000030977c <+76>:    ret
   0x0000000000309780 <+80>:    mov w0, #0x1                    // #1
   0x0000000000309784 <+84>:    ldr x19, [sp, #24]
   0x0000000000309788 <+88>:    ldp x29, x30, [sp], #32
   0x000000000030978c <+92>:    ret
End of assembler dump.

WithLocal currently has inefficient codegen. Note that Roslyn generates WithLocal-like code for switch expressions: sharplab, so codegen of LessThan3, LessThan4 and LessThan5 from the sharplab sample is (unexpectedly) bad with NativeAOT. Mono can improve both (disjoint and inlined) forms from the get-go.

vargaz commented 1 year ago

The mono AOT compiler does understand some of these patterns, i.e. by the code in intrinsics.c. The problem is generic sharing, which generates code where the type T is not exactly known, so a method like foo<int> is implemented by a shared method foo<T_INT> where T_INT is constrained to 'int' and enums whose base type is int. In that case, an expression like typeof(T)==typeof(byte) can be optimized away, but an expression like typeof(T)==typeof(int) cannot.
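To illustrate the sharing problem described above, here is a minimal sketch. The names Color, SharingDemo, IsInt and IsByte are made up for this example and do not come from the Mono sources. It assumes, as stated above, that int and an int-based enum would be served by the same shared method body.

```csharp
using System;

// Hypothetical enum whose underlying type is int. Under the constrained
// sharing described above, it would use the same shared body as int.
enum Color : int { Red, Green }

static class SharingDemo
{
    // Cannot be folded to a constant in a body shared by int and
    // int-based enums: the answer differs between T = int and T = Color.
    public static bool IsInt<T>() => typeof(T) == typeof(int);

    // Can be folded to false in that shared body: neither int nor an
    // int-based enum is ever byte.
    public static bool IsByte<T>() => typeof(T) == typeof(byte);
}
```

IsInt<int>() returns true while IsInt<Color>() returns false, so a single shared body has to keep the comparison. IsByte<T>() returns false for both, which is why that comparison can be removed.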

marek-safar commented 1 year ago

Investigate whether ILCompiler is able to detect the pattern

I don't think ILCompiler does anything here, it's all down to RyuJIT which does it.

kotlarmilos commented 1 year ago

The mono AOT compiler does understand some of these patterns, i.e. by the code in intrinsics.c. The problem is generic sharing, which generates code where the type T is not exactly known, so a method like foo<int> is implemented by a shared method foo<T_INT> where T_INT is constrained to 'int' and enums whose base type is int. In that case, an expression like typeof(T)==typeof(byte) can be optimized away, but an expression like typeof(T)==typeof(int) cannot.

Good point. With https://github.com/dotnet/runtime/issues/80941 we might be able to tell the Mono AOT compiler which referenced generic types in the program are statically reachable, and so allow pattern specialization for those generics.

I don't think ILCompiler does anything here, it's all down to RyuJIT which does it.

The proposed approach uses ILCompiler since we have already worked on its integration with Mono for iOS, which should let us confirm quickly whether it is feasible. Once that is confirmed, I suggest considering other options as well before the integration.

tannergooding commented 1 year ago

The problem is generic sharing, which generates code where the type T is not exactly known, so a method like foo<int> is implemented by a shared method foo<T_INT> where T_INT is constrained to 'int' and enums whose base type is int. In that case, an expression like typeof(T)==typeof(byte) can be optimized away, but an expression like typeof(T)==typeof(int) cannot.

Can't you just not share in this case?

The general premise here is that there is a large amount of generic code that exists which follows the pattern of:

if (typeof(T) == typeof(...))
{
    // Logic for Type 1
}
else if (typeof(T) == typeof(...))
{
    // Logic for Type 2
}
else
{
    // Fallback path    
}

The reason it follows this pattern is that RyuJIT has always specialized value types. The fallback path is sometimes an actual shared path and sometimes a path which purely throws (such as in Vector###<T>). In the case where the fallback is just a throw, the prior (typeof(T) == typeof(...)) checks define the entire domain of n exact types that T can be. So for something like Vector###<T> there is no chance for it to be something like an Enum; it can only be an exact type such as int or uint. In some cases (like Vector###<T>.operator +) the int/uint paths are identical and could be shared, and in other cases (like Vector###<T>.Abs) they should be generated as disjoint methods.
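A minimal sketch of that shape, with a fallback that purely throws, might look like the following. ScalarOps.Abs is a made-up scalar helper for illustration only and is not the actual Vector###<T> implementation.

```csharp
using System;

static class ScalarOps
{
    // Hypothetical helper: the typeof(T) checks enumerate every type
    // that T is allowed to be, and the fallback only throws.
    public static T Abs<T>(T value) where T : struct
    {
        if (typeof(T) == typeof(int))
        {
            // Logic for int: needs a real computation.
            int v = (int)(object)value;
            return (T)(object)Math.Abs(v);
        }
        else if (typeof(T) == typeof(uint))
        {
            // Logic for uint: already non-negative, so this path differs.
            return value;
        }

        // Fallback path: no shared implementation, just a throw.
        throw new NotSupportedException($"Unsupported type {typeof(T)}");
    }
}
```

With per-type specialization, each instantiation keeps only its own branch and the casts through object can be eliminated. In a shared body, all branches and the type checks have to stay.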

The biggest risk for USG is bad codegen (perf-wise) and the biggest risk for specialization is code bloat. There is always going to be a balance, but provided the compiler tries to recognize the common patterns devs target, we should generally end up in the right place. We can always look at providing some attribute in System.Runtime.CompilerServices that allows devs to annotate the types they would like specialized, in case they have more context than the compiler. Such a feature would allow us to annotate the exact 12 types for Vector###<T>, and Mono could then generate the shared path for everything else.
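Purely as an illustration of what such an annotation could look like, here is a sketch. SpecializeForAttribute and MyVector<T> are hypothetical names invented for this example. No such attribute exists in System.Runtime.CompilerServices today, and no compiler acts on this one.

```csharp
using System;

namespace Sketch
{
    // HYPOTHETICAL attribute, defined locally for illustration only.
    // It merely records the type arguments a developer would like an
    // AOT compiler to specialize; nothing consumes it.
    [AttributeUsage(AttributeTargets.Struct | AttributeTargets.Class | AttributeTargets.Method,
                    AllowMultiple = true)]
    sealed class SpecializeForAttribute : Attribute
    {
        public SpecializeForAttribute(Type typeArgument) => TypeArgument = typeArgument;

        public Type TypeArgument { get; }
    }

    // Hypothetical generic type annotated with two of the type arguments
    // it wants specialized; everything else would use the shared path.
    [SpecializeFor(typeof(int))]
    [SpecializeFor(typeof(uint))]
    struct MyVector<T> where T : struct
    {
        public T Value;
    }
}
```

The design choice here is one attribute instance per type argument, which keeps the list readable through reflection. A single attribute taking a params Type[] would be an equally plausible shape.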