Previously, the dispatch stubs were fetched dynamically from the cell and were specialized on the size of the cache, as on other targets. On WASM, however, given the cost of an indirect call and of the general stub setup, it makes more sense to use a single function for all cache sizes. Conveniently, the cache already records the number of entries it holds.
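To illustrate the idea, here is a minimal sketch of a single resolver that serves every cache size by reading the entry count from the cache itself. It is not the runtime's code: the struct layout, the fixed capacity of 32, and all names (`DispatchCache`, `resolve_dispatch`, `demo_slow`, the `impl_*` targets) are assumptions made up for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical types: none of these names or layouts come from the runtime. */
typedef int (*CodePtr)(void);

typedef struct {
    const void *type;   /* type the entry was cached for */
    CodePtr     target; /* resolved implementation for that type */
} CacheEntry;

typedef struct {
    uint32_t   entry_count; /* the cache records how many entries it holds */
    CacheEntry entries[32]; /* fixed capacity only to keep the sketch simple */
} DispatchCache;

/* Fallback used on a cache miss (stands in for the full resolution path). */
typedef CodePtr (*SlowResolver)(const void *type);

/* One function for all cache sizes: the loop bound is read from the cache
   instead of being baked into a per-size stub. */
static CodePtr resolve_dispatch(const DispatchCache *cache,
                                const void *type,
                                SlowResolver slow)
{
    for (uint32_t i = 0; i < cache->entry_count; i++) {
        if (cache->entries[i].type == type)
            return cache->entries[i].target;
    }
    return slow(type);
}

/* Demo targets and type identities, purely illustrative. */
static int impl_a(void)    { return 1; }
static int impl_b(void)    { return 2; }
static int impl_slow(void) { return -1; }
static const int type_a = 0, type_b = 0, type_c = 0;

static CodePtr demo_slow(const void *type)
{
    (void)type; /* a real slow path would resolve and update the cache */
    return impl_slow;
}
```

With a two-entry cache holding `type_a` and `type_b`, looking up `&type_b` returns `impl_b` from the cache, while `&type_c` misses and falls through to `demo_slow`. The trade-off in this sketch is a data-dependent loop bound in exchange for not needing one specialized function per cache size.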
Performance results are positive (mean over ten runs: about 227 ms for base vs. 196 ms for diff, roughly a 14% reduction):
Base:
Bench_InterfaceDispatch_Monomorphic took: 217 ms
Bench_InterfaceDispatch_Monomorphic took: 234 ms
Bench_InterfaceDispatch_Monomorphic took: 224 ms
Bench_InterfaceDispatch_Monomorphic took: 261 ms
Bench_InterfaceDispatch_Monomorphic took: 211 ms
Bench_InterfaceDispatch_Monomorphic took: 219 ms
Bench_InterfaceDispatch_Monomorphic took: 218 ms
Bench_InterfaceDispatch_Monomorphic took: 231 ms
Bench_InterfaceDispatch_Monomorphic took: 237 ms
Bench_InterfaceDispatch_Monomorphic took: 222 ms
Diff:
Bench_InterfaceDispatch_Monomorphic took: 193 ms
Bench_InterfaceDispatch_Monomorphic took: 220 ms
Bench_InterfaceDispatch_Monomorphic took: 193 ms
Bench_InterfaceDispatch_Monomorphic took: 191 ms
Bench_InterfaceDispatch_Monomorphic took: 200 ms
Bench_InterfaceDispatch_Monomorphic took: 192 ms
Bench_InterfaceDispatch_Monomorphic took: 191 ms
Bench_InterfaceDispatch_Monomorphic took: 193 ms
Bench_InterfaceDispatch_Monomorphic took: 193 ms
Bench_InterfaceDispatch_Monomorphic took: 194 ms
(Monomorphic callsites are the best-case scenario for the old scheme)
Codegen diffs are also good, as expected:
Summary of Code Size diffs:
(Lower is better)
Total bytes of base: 3302676
Total bytes of diff: 3297731
Total bytes of delta: -4945 (-0.15% of base)
Average relative delta: -4.75%
diff is an improvement
average relative diff is an improvement
Top methods only present in diff:
120 ( ∞ of base) : 1172.dasm - RhpResolveInterfaceDispatch
16 ( ∞ of base) : 1171.dasm - RuntimeResolveInterfaceDispatch
Top method improvements (percentages):
-7 (-10.14% of base) : 1151.dasm - S_P_CoreLib_System_Reflection_Runtime_General_Helpers__GetRawConstant$F1_Finally
-7 (-10.14% of base) : 1055.dasm - S_P_CoreLib_System_Reflection_Runtime_General_Helpers__GetRawDefaultValue$F1_Finally
-7 (-10.14% of base) : 1157.dasm - S_P_CoreLib_System_Reflection_Runtime_BindingFlagSupport_Shared__GetImplicitlyOverriddenBaseClassMember<System___Canon>$F1_Finally
-7 (-10.14% of base) : 1043.dasm - S_P_CoreLib_Internal_LowLevelLinq_LowLevelEnumerable__ToArray<System___Canon>$F1_Fault
-7 (-10.14% of base) : 1037.dasm - S_P_TypeLoader_Internal_TypeSystem_CastingHelper__IsConstrainedAsGCPointer$F1_Finally
-7 (-10.14% of base) : 1019.dasm - S_P_CoreLib_System_Reflection_Runtime_BindingFlagSupport_QueriedMemberList_1<System___Canon>__Create$F1_Fault
-7 (-10.14% of base) : 1025.dasm - S_P_CoreLib_System_Collections_Generic_LowLevelList_1<System___Canon>__InsertRange$F1_Fault
-7 (-10.14% of base) : 1146.dasm - S_P_TypeLoader_Internal_Runtime_TypeLoader_TypeLoaderEnvironment__RegisterDynamicGenericTypesAndMethods$F1_Fault
-7 (-10.14% of base) : 1143.dasm - S_P_TypeLoader_Internal_TypeSystem_CastingHelper__CanCastGenericParameterTo$F1_Finally
-7 (-10.14% of base) : 1053.dasm - S_P_CoreLib_System_Reflection_Runtime_General_Helpers__GetDefaultValue$F1_Finally
-7 (-10.14% of base) : 1133.dasm - S_P_CoreLib_System_Buffers_SharedArrayPool_1<Int32>__Trim$F2_Fault
-7 (-10.14% of base) : 1132.dasm - S_P_CoreLib_System_Buffers_SharedArrayPool_1<Int32>__Trim$F1_Fault
-7 (-10.14% of base) : 1082.dasm - S_P_CoreLib_System_Reflection_Runtime_TypeInfos_RuntimeTypeInfo__get_ImplementedInterfaces$F1_Fault
-7 (-10.14% of base) : 1091.dasm - S_P_CoreLib_System_TimeZoneInfo__CompareTimeZoneFile$F1_Finally
-7 (-10.14% of base) : 1095.dasm - S_P_CoreLib_System_Buffers_SharedArrayPool_1<UInt8>__Trim$F1_Fault
-7 (-10.14% of base) : 1122.dasm - S_P_CoreLib_System_IO_File__ReadAllBytes$F1_Fault
-7 (-10.14% of base) : 1147.dasm - S_P_TypeLoader_Internal_Runtime_TypeLoader_TypeLoaderEnvironment__RegisterDynamicGenericTypesAndMethods$F2_Fault
-7 (-10.14% of base) : 1099.dasm - S_P_CoreLib_System_Buffers_SharedArrayPool_1<Char>__Trim$F2_Fault
-7 (-10.14% of base) : 1096.dasm - S_P_CoreLib_System_Buffers_SharedArrayPool_1<UInt8>__Trim$F2_Fault
-7 (-10.14% of base) : 1098.dasm - S_P_CoreLib_System_Buffers_SharedArrayPool_1<Char>__Trim$F1_Fault
Top methods only present in base:
-16 (-100.00% of base) : 1000.dasm - RhpInitialDynamicInterfaceDispatch
-48 (-100.00% of base) : 1001.dasm - RhpInterfaceDispatch1
-92 (-100.00% of base) : 1002.dasm - RhpInterfaceDispatch2
-410 (-100.00% of base) : 1005.dasm - RhpInterfaceDispatch16
-136 (-100.00% of base) : 1003.dasm - RhpInterfaceDispatch4
-778 (-100.00% of base) : 1006.dasm - RhpInterfaceDispatch32
-176 (-100.00% of base) : 1007.dasm - RhpInterfaceDispatch64
-226 (-100.00% of base) : 1004.dasm - RhpInterfaceDispatch8
173 total methods with Code Size differences (171 improved, 2 regressed)
Pretty much all look like this: