dotnet / runtimelab

This repo is for experimentation and exploring new ideas that may or may not make it into the main dotnet/runtime repo.

[NativeAOT-LLVM] Use a universal stub for interface dispatch #2304

Closed: SingleAccretion closed 1 year ago

SingleAccretion commented 1 year ago

Previously, each call site loaded its dispatch stub dynamically from the dispatch cell, and the stubs were specialized on the size of the cache, as on other targets. On WASM, however, an indirect call and the general stub setup are costly enough that it makes more sense to use one function for all cache sizes. Conveniently, the cache already records how many entries it has.
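
To make the idea concrete, here is a minimal Python model of a single resolver that handles any cache size by reading the entry count from the cache itself. It is only a sketch: `DispatchCell`, `DispatchCache`, `resolve_slow` and `resolve_interface_dispatch` are hypothetical names chosen for illustration, and the real runtime helper is native code whose layout and slow path are not shown in this PR.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class DispatchCache:
    # Each entry pairs a concrete type with the method implementing the slot.
    entries: List[Tuple[type, Callable]] = field(default_factory=list)

    @property
    def entry_count(self) -> int:
        # The cache knows its own size, so no size-specialized stub is needed.
        return len(self.entries)


@dataclass
class DispatchCell:
    method_name: str  # stands in for the interface slot (assumption)
    cache: DispatchCache = field(default_factory=DispatchCache)


def resolve_slow(cell: DispatchCell, obj_type: type) -> Callable:
    # Hypothetical slow path: look the method up and remember it in the cache.
    target = getattr(obj_type, cell.method_name)
    cell.cache.entries.append((obj_type, target))
    return target


def resolve_interface_dispatch(cell: DispatchCell, obj: object) -> Callable:
    # One resolver for every cache size: scan entry_count entries, then fall back.
    obj_type = type(obj)
    cache = cell.cache
    for i in range(cache.entry_count):
        cached_type, target = cache.entries[i]
        if cached_type is obj_type:
            return target
    return resolve_slow(cell, obj_type)


class Circle:
    def name(self) -> str:
        return "circle"


class Square:
    def name(self) -> str:
        return "square"


cell = DispatchCell("name")
shapes = [Circle(), Square(), Circle()]
# The call site resolves the target first, then calls it with the receiver.
results = [resolve_interface_dispatch(cell, s)(s) for s in shapes]
print(results, cell.cache.entry_count)
```

The call site always calls the same resolver directly and then performs one indirect call on the returned target, which mirrors the shape of the new codegen in the diff further down.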

Performance results are positive:

Base:
  Bench_InterfaceDispatch_Monomorphic took: 217 ms
  Bench_InterfaceDispatch_Monomorphic took: 234 ms
  Bench_InterfaceDispatch_Monomorphic took: 224 ms
  Bench_InterfaceDispatch_Monomorphic took: 261 ms
  Bench_InterfaceDispatch_Monomorphic took: 211 ms
  Bench_InterfaceDispatch_Monomorphic took: 219 ms
  Bench_InterfaceDispatch_Monomorphic took: 218 ms
  Bench_InterfaceDispatch_Monomorphic took: 231 ms
  Bench_InterfaceDispatch_Monomorphic took: 237 ms
  Bench_InterfaceDispatch_Monomorphic took: 222 ms

Diff:
  Bench_InterfaceDispatch_Monomorphic took: 193 ms
  Bench_InterfaceDispatch_Monomorphic took: 220 ms
  Bench_InterfaceDispatch_Monomorphic took: 193 ms
  Bench_InterfaceDispatch_Monomorphic took: 191 ms
  Bench_InterfaceDispatch_Monomorphic took: 200 ms
  Bench_InterfaceDispatch_Monomorphic took: 192 ms
  Bench_InterfaceDispatch_Monomorphic took: 191 ms
  Bench_InterfaceDispatch_Monomorphic took: 193 ms
  Bench_InterfaceDispatch_Monomorphic took: 193 ms
  Bench_InterfaceDispatch_Monomorphic took: 194 ms

(Monomorphic callsites are the best-case scenario for the old scheme)
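
As a rough way to summarize the two sets of timings above, the short script below computes the median and mean of each run and the relative improvement. The numbers are copied from the listings; the choice of median and mean as summary statistics is an assumption made here, not something the benchmark itself reports.

```python
from statistics import mean, median

# Timings in milliseconds, copied from the Base and Diff listings above.
base_ms = [217, 234, 224, 261, 211, 219, 218, 231, 237, 222]
diff_ms = [193, 220, 193, 191, 200, 192, 191, 193, 193, 194]


def relative_improvement(before: float, after: float) -> float:
    """Percentage by which `after` is lower than `before`."""
    return (before - after) / before * 100.0


base_median, diff_median = median(base_ms), median(diff_ms)
base_mean, diff_mean = mean(base_ms), mean(diff_ms)

print(f"median: {base_median} ms -> {diff_median} ms "
      f"({relative_improvement(base_median, diff_median):.1f}% faster)")
print(f"mean:   {base_mean} ms -> {diff_mean} ms "
      f"({relative_improvement(base_mean, diff_mean):.1f}% faster)")
```

On these ten runs both statistics point to an improvement of roughly 13 to 14 percent.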

Codegen diffs are also good, as expected:

Summary of Code Size diffs:
(Lower is better)

Total bytes of base: 3302676
Total bytes of diff: 3297731
Total bytes of delta: -4945 (-0.15% of base)
Average relative delta: -4.75%
    diff is an improvement
    average relative diff is an improvement

Top methods only present in diff:
         120 (     ∞ of base) : 1172.dasm - RhpResolveInterfaceDispatch
          16 (     ∞ of base) : 1171.dasm - RuntimeResolveInterfaceDispatch

Top method improvements (percentages):
          -7 (-10.14% of base) : 1151.dasm - S_P_CoreLib_System_Reflection_Runtime_General_Helpers__GetRawConstant$F1_Finally
          -7 (-10.14% of base) : 1055.dasm - S_P_CoreLib_System_Reflection_Runtime_General_Helpers__GetRawDefaultValue$F1_Finally
          -7 (-10.14% of base) : 1157.dasm - S_P_CoreLib_System_Reflection_Runtime_BindingFlagSupport_Shared__GetImplicitlyOverriddenBaseClassMember<System___Canon>$F1_Finally
          -7 (-10.14% of base) : 1043.dasm - S_P_CoreLib_Internal_LowLevelLinq_LowLevelEnumerable__ToArray<System___Canon>$F1_Fault
          -7 (-10.14% of base) : 1037.dasm - S_P_TypeLoader_Internal_TypeSystem_CastingHelper__IsConstrainedAsGCPointer$F1_Finally
          -7 (-10.14% of base) : 1019.dasm - S_P_CoreLib_System_Reflection_Runtime_BindingFlagSupport_QueriedMemberList_1<System___Canon>__Create$F1_Fault
          -7 (-10.14% of base) : 1025.dasm - S_P_CoreLib_System_Collections_Generic_LowLevelList_1<System___Canon>__InsertRange$F1_Fault
          -7 (-10.14% of base) : 1146.dasm - S_P_TypeLoader_Internal_Runtime_TypeLoader_TypeLoaderEnvironment__RegisterDynamicGenericTypesAndMethods$F1_Fault
          -7 (-10.14% of base) : 1143.dasm - S_P_TypeLoader_Internal_TypeSystem_CastingHelper__CanCastGenericParameterTo$F1_Finally
          -7 (-10.14% of base) : 1053.dasm - S_P_CoreLib_System_Reflection_Runtime_General_Helpers__GetDefaultValue$F1_Finally
          -7 (-10.14% of base) : 1133.dasm - S_P_CoreLib_System_Buffers_SharedArrayPool_1<Int32>__Trim$F2_Fault
          -7 (-10.14% of base) : 1132.dasm - S_P_CoreLib_System_Buffers_SharedArrayPool_1<Int32>__Trim$F1_Fault
          -7 (-10.14% of base) : 1082.dasm - S_P_CoreLib_System_Reflection_Runtime_TypeInfos_RuntimeTypeInfo__get_ImplementedInterfaces$F1_Fault
          -7 (-10.14% of base) : 1091.dasm - S_P_CoreLib_System_TimeZoneInfo__CompareTimeZoneFile$F1_Finally
          -7 (-10.14% of base) : 1095.dasm - S_P_CoreLib_System_Buffers_SharedArrayPool_1<UInt8>__Trim$F1_Fault
          -7 (-10.14% of base) : 1122.dasm - S_P_CoreLib_System_IO_File__ReadAllBytes$F1_Fault
          -7 (-10.14% of base) : 1147.dasm - S_P_TypeLoader_Internal_Runtime_TypeLoader_TypeLoaderEnvironment__RegisterDynamicGenericTypesAndMethods$F2_Fault
          -7 (-10.14% of base) : 1099.dasm - S_P_CoreLib_System_Buffers_SharedArrayPool_1<Char>__Trim$F2_Fault
          -7 (-10.14% of base) : 1096.dasm - S_P_CoreLib_System_Buffers_SharedArrayPool_1<UInt8>__Trim$F2_Fault
          -7 (-10.14% of base) : 1098.dasm - S_P_CoreLib_System_Buffers_SharedArrayPool_1<Char>__Trim$F1_Fault

Top methods only present in base:
         -16 (-100.00% of base) : 1000.dasm - RhpInitialDynamicInterfaceDispatch
         -48 (-100.00% of base) : 1001.dasm - RhpInterfaceDispatch1
         -92 (-100.00% of base) : 1002.dasm - RhpInterfaceDispatch2
        -410 (-100.00% of base) : 1005.dasm - RhpInterfaceDispatch16
        -136 (-100.00% of base) : 1003.dasm - RhpInterfaceDispatch4
        -778 (-100.00% of base) : 1006.dasm - RhpInterfaceDispatch32
        -176 (-100.00% of base) : 1007.dasm - RhpInterfaceDispatch64
        -226 (-100.00% of base) : 1004.dasm - RhpInterfaceDispatch8

173 total methods with Code Size differences (171 improved, 2 regressed)

Pretty much all of the diffs look like this:

-func[4488] <S_P_CoreLib_System_Buffers_SharedArrayPool_1<Char>__Trim$F2_Fault>:
+func[4482] <S_P_CoreLib_System_Buffers_SharedArrayPool_1<Char>__Trim$F2_Fault>:
  02 40                      | block
  02 40                      |   block
  20 01                      |     local.get 1
@@ -21,14 +21,12 @@ func[4488] <S_P_CoreLib_System_Buffers_SharedArrayPool_1<Char>__Trim$F2_Fault>:
  6a                         |     i32.add
  20 01                      |     local.get 1
  41 98 dc 36                |     i32.const 896536
- 41 00                      |     i32.const 0
- 28 02 98 dc 36             |     i32.load 2 896536
- 11 03 00                   |     call_indirect 0 (type 3)
+ 10 f4 04                   |     call 628 <RhpResolveInterfaceDispatch>
  11 02 00                   |     call_indirect 0 (type 2)
  0b                         |   end
  0f                         |   return
  0b                         | end
  20 00                      | local.get 0
- 10 9c 34                   | call 6684 <S_P_CoreLib_Internal_Runtime_CompilerHelpers_ThrowHelpers__ThrowNullReferenceException>
+ 10 96 34                   | call 6678 <S_P_CoreLib_Internal_Runtime_CompilerHelpers_ThrowHelpers__ThrowNullReferenceException>
  00                         | unreachable
  0b                         | end
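
The -7 bytes per call site reported in the summary can be checked against the encodings in the diff above. The sketch below counts the bytes of the removed and added instructions and decodes the unsigned LEB128 immediates; the hex strings are copied from the diff, while the helper names are made up for this illustration.

```python
# Instruction encodings copied from the "-" and "+" lines of the diff above.
removed = ["41 00", "28 02 98 dc 36", "11 03 00"]  # i32.const, i32.load, call_indirect
added = ["10 f4 04"]                               # call <RhpResolveInterfaceDispatch>


def encoded_size(hex_rows):
    """Total number of bytes across the given hex-dump rows."""
    return sum(len(bytes.fromhex(row)) for row in hex_rows)


def decode_uleb128(data: bytes) -> int:
    """Decode an unsigned LEB128 integer, as used for wasm immediates."""
    value = 0
    for shift, byte in enumerate(data):
        value |= (byte & 0x7F) << (7 * shift)
        if byte & 0x80 == 0:
            break
    return value


delta = encoded_size(added) - encoded_size(removed)
call_target = decode_uleb128(bytes.fromhex("f4 04"))
load_offset = decode_uleb128(bytes.fromhex("98 dc 36"))
print(delta, call_target, load_offset)
```

Ten bytes of load-and-indirect-call setup are replaced by a three-byte direct call, which accounts for the 7-byte reduction, and the decoded immediates agree with the `call 628` and `896536` operands shown in the disassembly.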

SingleAccretion commented 1 year ago

@dotnet/nativeaot-llvm