dmitriyse closed this issue 4 years ago
@dmitriyse thanks; I would still love to see an example where allowing this sort of inlining has a significant performance impact. Right now the costs are likely dominated by the delegate invokes themselves, whether or not TryInvoke is inlined.
I am working on a prototype of "C# coroutines" where each virtual/non-virtual call causes performance degradation; unfortunately the work is far from the benchmarking stage. But in a Task-based EventLoop implementation I was forced to manually inline this method (~5 times):
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public void ExecuteEvent()
{
    if (SimpleAction != null)
    {
        SimpleAction();
    }
    else if (Action != null)
    {
        // ReSharper disable once PossibleNullReferenceException
        Action(State);
    }
    else
    {
        Debug.Assert(Task != null, "Task != null");
        // ReSharper disable once PossibleNullReferenceException
        // ReSharper disable once AssignNullToNotNullAttribute
        _scheduler.TryExecuteTaskInternal(Task);
    }
}
This gives me a speedup from 23 Mops/sec to 30 Mops/sec (on one core).
In the "C# coroutines" prototype I am targeting at least 100 Mops/sec, so this kind of inlining is even more critical.
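To make the manual-inlining workaround concrete, here is a hypothetical sketch of what it amounts to: the body of ExecuteEvent is pasted at each call site because the JIT refuses to inline methods containing a delegate Invoke. EventEntry and DrainQueue are invented names, and the Task branch is omitted since the scheduler internals aren't shown in this thread.

```csharp
using System;

// Hypothetical: a struct holding the fields referenced by ExecuteEvent.
public struct EventEntry
{
    public Action SimpleAction;
    public Action<object> Action;
    public object State;
}

public static class EventLoopSketch
{
    public static void DrainQueue(EventEntry[] queue, int count)
    {
        for (int i = 0; i < count; i++)
        {
            ref EventEntry e = ref queue[i];
            // Manually inlined body of ExecuteEvent: calling the method
            // instead would block inlining, because the body contains
            // delegate Invokes. (Task branch omitted.)
            if (e.SimpleAction != null)
            {
                e.SimpleAction();
            }
            else if (e.Action != null)
            {
                e.Action(e.State);
            }
        }
    }
}
```

The duplication is exactly the source-size cost the manual workaround pays in exchange for removing one call per event.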
at least 100 Mops/sec
That sounds a bit far-fetched. What is an "operation" supposed to consist of? With current CPUs it would probably need to consist of fewer than 100 instructions to reach such a speed, and even 100 is extremely optimistic...
Motivations and some benchmarks can be found here: https://github.com/dotnet/corefxlab/issues/2168 https://github.com/dotnet/corefx/issues/28025
@AndyAyersMS @mikedn
I suspect this restriction dates from pre-2.0 times. I remember the .NET Framework 2.0 release notes mentioning that delegate calls had become significantly faster.
Nowadays delegate calls can be like 50% faster than interface calls, yet interface call-sites don't prevent the containing methods from being inlined.
Removing the restriction will let the JIT do a little bit less work. No?
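The two call shapes being contrasted can be sketched as follows. The types and method names here are invented for illustration: a delegate invoke is an indirect call through the delegate's stored method pointer, while an interface call goes through virtual stub dispatch; only the former currently blocks inlining of the containing method.

```csharp
using System;

// Illustrative only: an interface-dispatch call site vs a delegate call site.
public interface IAdder { int Add(int x); }
public sealed class Adder : IAdder { public int Add(int x) => x + 1; }

public static class CallShapes
{
    // Contains a delegate Invoke: before this change, the JIT would refuse
    // to inline this method into its callers.
    public static int ViaDelegate(Func<int, int> f, int n)
    {
        int acc = 0;
        for (int i = 0; i < n; i++) acc = f(acc);
        return acc;
    }

    // Contains an interface call: this has never blocked the containing
    // method from being inlined.
    public static int ViaInterface(IAdder a, int n)
    {
        int acc = 0;
        for (int i = 0; i < n; i++) acc = a.Add(acc);
        return acc;
    }
}
```

Both loops compute the same result; the asymmetry is purely in how the JIT treats the method that contains each call site.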
Could be. I have some work in progress to move expansion of delegate invoke earlier. As part of that I'll also look into what is behind this inlining block.
Looks like we could remove this block, but broadly speaking all that happens is that we increase code size; knock-on benefits seem to be rare. Part of this is that the inlinee still ends up with a call, one that we can't inline, and with a calling sequence that is usually more complex than the call to the inlinee.
Delegate invoke is not all that frequent, so net regression is small, but there are virtually no code size improvements.
There are a few methods like ManagementQuery:FireIdentifierChanged; these now expand to a conditionally executed delegate call, so perhaps some small perf win.

PMI CodeSize Diffs for System.Private.CoreLib.dll, framework assemblies for x64 default jit
Summary of Code Size diffs:
(Lower is better)
Total bytes of diff: 30821 (0.06% of base)
diff is a regression.
Top file regressions (bytes):
13367 : CommandLine.dasm (3.01% of base)
3334 : Microsoft.CodeAnalysis.dasm (0.19% of base)
2552 : System.Private.Xml.dasm (0.07% of base)
2157 : Microsoft.Diagnostics.Tracing.TraceEvent.dasm (0.07% of base)
2144 : System.Management.dasm (0.55% of base)
1275 : Microsoft.CodeAnalysis.VisualBasic.dasm (0.02% of base)
1138 : System.Net.Sockets.dasm (0.58% of base)
634 : System.Threading.Tasks.Dataflow.dasm (0.08% of base)
530 : Newtonsoft.Json.dasm (0.06% of base)
484 : Microsoft.CodeAnalysis.CSharp.dasm (0.01% of base)
465 : System.Data.Common.dasm (0.03% of base)
336 : xunit.execution.dotnet.dasm (0.14% of base)
295 : Microsoft.Extensions.Primitives.dasm (1.12% of base)
233 : System.Runtime.Serialization.Formatters.dasm (0.24% of base)
215 : System.Security.Cryptography.Primitives.dasm (0.62% of base)
200 : System.Private.DataContractSerialization.dasm (0.03% of base)
193 : Microsoft.Extensions.Http.dasm (0.44% of base)
148 : System.Private.CoreLib.dasm (0.00% of base)
148 : xunit.runner.utility.netcoreapp10.dasm (0.08% of base)
116 : xunit.console.dasm (0.13% of base)
37 total files with Code Size differences (0 improved, 37 regressed), 227 unchanged.
Top method regressions (bytes):
2742 (78.25% of base) : CommandLine.dasm - Trial:Collect(IEnumerable`1):Result`2 (7 methods)
2742 (95.54% of base) : CommandLine.dasm - ResultExtensions:Collect(IEnumerable`1):Result`2 (7 methods)
2670 (49.74% of base) : CommandLine.dasm - ResultExtensions:Flatten(Result`2):Result`2 (7 methods)
643 ( 7.18% of base) : CommandLine.dasm - HelpText:AutoBuild(ParserResult`1,Func`2,Func`2,bool,int):HelpText (7 methods)
540 (22.21% of base) : CommandLine.dasm - Trial:Flatten(Result`2):Result`2 (7 methods)
531 (42.28% of base) : CommandLine.dasm - Trial:MergeMessages(IEnumerable`1,Result`2):Result`2 (7 methods)
478 (62.00% of base) : CommandLine.dasm - Tokenizer:PreprocessDashDash(IEnumerable`1,Func`2):Result`2
422 (23.25% of base) : System.Private.Xml.dasm - XmlTextReaderImpl:ParseDocumentContentAsync():Task`1:this
402 (23.50% of base) : Microsoft.CodeAnalysis.dasm - <DescendantTriviaIntoTrivia>d__161:MoveNext():bool:this
369 (14.65% of base) : Microsoft.CodeAnalysis.dasm - <DescendantNodesAndTokensIntoTrivia>d__159:MoveNext():bool:this
340 (14.26% of base) : Microsoft.Diagnostics.Tracing.TraceEvent.dasm - TraceLog:FastSerialization.IFastSerializable.ToStream(Serializer):this
333 (19.69% of base) : System.Private.Xml.dasm - XmlSchemaValidator:ElementValidationError(XmlQualifiedName,ValidationState,ValidationEventHandler,Object,String,int,int,XmlSchemaSet)
320 (96.68% of base) : Microsoft.CodeAnalysis.dasm - CryptographicHashProvider:TryGetAlgorithm(int):HashAlgorithm (2 methods)
295 (40.80% of base) : Microsoft.CodeAnalysis.dasm - AbstractSyntaxNavigator:GetPreviousToken(SyntaxTrivia,SyntaxTriviaList,Func`2,Func`2,byref):SyntaxToken:this
294 (25.61% of base) : CommandLine.dasm - <>c__DisplayClass17_0`1:<DisplayHelp>b__0(IEnumerable`1):this (7 methods)
275 (51.89% of base) : Microsoft.CodeAnalysis.dasm - AbstractSyntaxNavigator:GetLastToken(SyntaxTriviaList,Func`2,Func`2):SyntaxToken:this
273 ( 2.55% of base) : CommandLine.dasm - HelpText:AutoBuild(ParserResult`1,int):HelpText (7 methods)
244 (17.63% of base) : System.Private.Xml.dasm - XmlTextReaderImpl:ParseElementContentAsync():Task`1:this
228 (11.57% of base) : System.Private.Xml.dasm - XmlWellFormedWriter:AdvanceStateAsync(int):Task:this
220 (15.88% of base) : Microsoft.CodeAnalysis.dasm - <DescendantTriviaOnly>d__160:MoveNext():bool:this
Top method improvements (bytes):
-21 (-0.83% of base) : System.Net.Http.WinHttpHandler.dasm - HttpKnownHeaderNames:TryGetHeaderName(__Canon,int,int,Func`3,Func`5,byref):bool
-4 (-0.22% of base) : CommandLine.dasm - <>c__DisplayClass0_2`1:<Build>b__15():__Canon:this
-2 (-0.41% of base) : xunit.performance.execution.dasm - BenchmarkEventSource:BenchmarkIterationStop(String,String,int,long):this
Top method regressions (percentages):
64 (914.29% of base) : System.Private.CoreLib.dasm - _ThreadPoolWaitOrTimerCallback:WaitOrTimerCallback_Context_f(Object)
64 (640.00% of base) : System.Private.CoreLib.dasm - _ThreadPoolWaitOrTimerCallback:WaitOrTimerCallback_Context_t(Object)
31 (620.00% of base) : System.Management.dasm - ManagementOptions:HandleIdentifierChange(Object,IdentifierChangedEventArgs):this
31 (387.50% of base) : System.Management.dasm - EventWatcherOptions:set_BlockSize(int):this
52 (288.89% of base) : System.Runtime.Serialization.Formatters.dasm - SerializationEvents:InvokeOnSerializing(Object,StreamingContext):this
52 (288.89% of base) : System.Runtime.Serialization.Formatters.dasm - SerializationEvents:InvokeOnDeserializing(Object,StreamingContext):this
52 (288.89% of base) : System.Runtime.Serialization.Formatters.dasm - SerializationEvents:InvokeOnDeserialized(Object,StreamingContext):this
31 (281.82% of base) : System.Management.dasm - ManagementScope:HandleIdentifierChange(Object,IdentifierChangedEventArgs):this
43 (215.00% of base) : Microsoft.CSharp.dasm - UnsafeMethods:IUnknownReleaseNotZero(long)
30 (214.29% of base) : System.Management.dasm - ConnectionOptions:set_Impersonation(int):this
30 (214.29% of base) : System.Management.dasm - ConnectionOptions:set_Authentication(int):this
30 (136.36% of base) : System.Management.dasm - ConnectionOptions:set_EnablePrivileges(bool):this
75 (131.58% of base) : System.Net.Sockets.dasm - SocketPal:Disconnect(Socket,SafeSocketHandle,bool):int
116 (116.00% of base) : CommandLine.dasm - MaybeExtensions:Map(Maybe`1,Func`2):Maybe`1 (7 methods)
116 (116.00% of base) : CommandLine.dasm - MaybeExtensions:Select(Maybe`1,Func`2):Maybe`1 (7 methods)
320 (96.68% of base) : Microsoft.CodeAnalysis.dasm - CryptographicHashProvider:TryGetAlgorithm(int):HashAlgorithm (2 methods)
2742 (95.54% of base) : CommandLine.dasm - ResultExtensions:Collect(IEnumerable`1):Result`2 (7 methods)
36 (94.74% of base) : System.Management.dasm - ObjectGetOptions:set_UseAmendedQualifiers(bool):this
90 (90.00% of base) : CommandLine.dasm - MaybeExtensions:Bind(Maybe`1,Func`2):Maybe`1 (7 methods)
26 (89.66% of base) : System.Management.dasm - ManagementNamedValueCollection:Remove(String):this
Top method improvements (percentages):
-21 (-0.83% of base) : System.Net.Http.WinHttpHandler.dasm - HttpKnownHeaderNames:TryGetHeaderName(__Canon,int,int,Func`3,Func`5,byref):bool
-2 (-0.41% of base) : xunit.performance.execution.dasm - BenchmarkEventSource:BenchmarkIterationStop(String,String,int,long):this
-4 (-0.22% of base) : CommandLine.dasm - <>c__DisplayClass0_2`1:<Build>b__15():__Canon:this
448 total methods with Code Size differences (3 improved, 445 regressed), 243342 unchanged.
So will likely hold off on this until we find more compelling motivation.
Prototype changes here: master...AndyAyersMS:AllowInlineeDelegateInvoke
There are many reasons I could think of for doing this; just look at sorting code, which has to duplicate code paths to get optimal code generation for different scenarios. This is a general problem in .NET due to the canonical representation of code for reference types: typically you have to maintain separate code paths for value types vs. reference types. Injecting a value type as a functor that can be fully inlined when needed (e.g. for value types) is a great pattern for this; it does not necessarily increase overall code size, and, more relevantly, it reduces source code size. The overhead for sorting is very much measurable.
So in my view there is compelling motivation :)
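The struct-functor pattern being described can be sketched as follows; the names (SortSketch, InsertionSort, IntAscending) are illustrative, not actual runtime code. When the generic sort is instantiated over a struct comparer, the JIT produces a specialized instantiation and can inline Compare into the loop; a delegate-backed comparison instead pays an indirect call per element.

```csharp
using System.Collections.Generic;

// Illustrative only: a sort helper specialized over a struct comparer.
public static class SortSketch
{
    // For a struct TComparer the JIT creates a specialized instantiation,
    // so c.Compare can be inlined into the inner loop.
    public static void InsertionSort<T, TComparer>(T[] a, TComparer c)
        where TComparer : IComparer<T>
    {
        for (int i = 1; i < a.Length; i++)
        {
            T key = a[i];
            int j = i - 1;
            while (j >= 0 && c.Compare(a[j], key) > 0)
            {
                a[j + 1] = a[j];
                j--;
            }
            a[j + 1] = key;
        }
    }
}

// A zero-field struct functor: the comparison logic is known statically.
public readonly struct IntAscending : IComparer<int>
{
    public int Compare(int x, int y) => x.CompareTo(y);
}
```

Calling SortSketch.InsertionSort(data, default(IntAscending)) then runs the loop with the comparison inlined; this is the single-source-code consolidation the comment argues for, provided struct wrappers around delegates can also be inlined.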
just check sorting code, which has to duplicate code paths to get optimal code generation for different scenarios
What duplicate sorting code would be removable if methods containing delegate invocations were inlined?
What duplicate sorting code would be removable if methods containing delegate invocations were inlined?
@stephentoub well, this is all subject to interpretation, but the current sorting code is not optimal for some code paths, which I have touched upon before. Currently there are three code paths: Comparison<T>; T where T : IComparable<T>; and TComparer where TComparer : IComparer<T>, depending on keys or keys-and-values and some other factors. TComparer isn't implemented as the original API proposal wanted it (that proposal being mine :grin:). None of these three code paths is actually optimal and on par with the old sorting code for some primitives, that is, the C++ version, which did not rely on IComparable<T>.
I have been (occasionally) trying to see whether these four versions could be consolidated into one single source code version that is still optimal for all scenarios, outside the context of the dotnet runtime, since it seemed to me this was not a priority. Using hacks, trickery, and other probably frowned-upon changes, I believe there is a way; having the below call be inlined would "solve" one part of the problem.
readonly struct OpenDelegateObjectComparer : IComparer<object>
{
    readonly Comparison<object> m_compare;

    public OpenDelegateObjectComparer(Comparison<object> compare) =>
        m_compare = compare;

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public int Compare(object x, object y) => m_compare(x, y);
}
The impact here compared to calling the comparison directly is significant, going from 0.87 (Comparison<object>) to 1.03 (OpenDelegateObjectComparer) in a benchmark.
I understand that this is a rather esoteric need, but in general, great support for struct functors that are actually inlined is very much appreciated in low-level code. This works mostly for value types; for reference types, or when there is a delegate, things fall apart and you have to duplicate code, despite the code being quite "generic".
Running the test from #37941, I see some small improvements in the delegate case, so I'll go ahead and put up a PR.
A few odd things about that test, though. Runtimes seem quite volatile: the only overt diff is in the ComparisonComparer case, but timings for the other cases are seemingly impacted too. Perhaps something to do with hard-to-predict branches.
BenchmarkDotNet=v0.12.1, OS=Windows 10.0.18363.900 (1909/November2018Update/19H2)
Intel Core i7-8650U CPU 1.90GHz (Kaby Lake R), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=5.0.100-preview.4.20258.7
[Host] : .NET Core 5.0.0 (CoreCLR 5.0.20.25106, CoreFX 5.0.20.25106), X64 RyuJIT
Job-QUQWJF : .NET Core 5.0 (CoreCLR 42.42.42.42424, CoreFX 42.42.42.42424), X64 RyuJIT
Toolchain=CoreRun
;;;; BEFORE (Float)
| Method | Mean | Error | StdDev | Ratio | RatioSD | Gen 0 | Gen 1 | Gen 2 | Allocated | Code Size |
|------------------- |---------:|----------:|----------:|------:|--------:|------:|------:|------:|----------:|----------:|
| CompareTo | 1.349 ns | 0.0504 ns | 0.0421 ns | 0.43 | 0.02 | - | - | - | - | 83 B |
| Comparer | 3.006 ns | 0.0623 ns | 0.0874 ns | 0.98 | 0.04 | - | - | - | - | 75 B |
| Comparison | 3.061 ns | 0.0748 ns | 0.1228 ns | 1.00 | 0.00 | - | - | - | - | 37 B |
| ComparisonComparer | 3.713 ns | 0.0446 ns | 0.0417 ns | 1.20 | 0.05 | - | - | - | - | 66 B |
;;;; AFTER (Float)
| Method | Mean | Error | StdDev | Ratio | RatioSD | Code Size | Gen 0 | Gen 1 | Gen 2 | Allocated |
|------------------- |---------:|----------:|----------:|------:|--------:|----------:|------:|------:|------:|----------:|
| CompareTo | 1.013 ns | 0.0399 ns | 0.0374 ns | 0.31 | 0.01 | 83 B | - | - | - | - |
| Comparer | 2.873 ns | 0.0609 ns | 0.0508 ns | 0.88 | 0.02 | 75 B | - | - | - | - |
| Comparison | 3.263 ns | 0.0647 ns | 0.0606 ns | 1.00 | 0.00 | 37 B | - | - | - | - |
| ComparisonComparer | 3.416 ns | 0.0709 ns | 0.0663 ns | 1.05 | 0.03 | 34 B | - | - | - | - |
Also, I'd inadvertently run the test initially with tiered compilation disabled, and the prejitted code for the float comparisons seemingly performs better than the jitted code (e.g. Comparer 3ns vs 2.5ns).
;;;; BEFORE (no-tc, float)
| Method | Mean | Error | StdDev | Ratio | RatioSD | Code Size | Gen 0 | Gen 1 | Gen 2 | Allocated |
|------------------- |---------:|----------:|----------:|------:|--------:|----------:|------:|------:|------:|----------:|
| CompareTo | 1.114 ns | 0.0510 ns | 0.0426 ns | 0.36 | 0.01 | 75 B | - | - | - | - |
| Comparer | 2.481 ns | 0.0634 ns | 0.0593 ns | 0.80 | 0.02 | 72 B | - | - | - | - |
| Comparison | 3.078 ns | 0.0150 ns | 0.0125 ns | 1.00 | 0.00 | 37 B | - | - | - | - |
| ComparisonComparer | 4.435 ns | 0.0292 ns | 0.0244 ns | 1.44 | 0.01 | 66 B | - | - | - | - |
Asm diff shows it's VEX vs non-VEX encoding, and here the non-VEX version is seemingly faster.
; System.Single.CompareTo(Single)
- movss xmm0,dword ptr [rcx]
- ucomiss xmm1,xmm0
+ vzeroupper
+ vmovss xmm0,dword ptr [rcx]
+ vucomiss xmm1,xmm0
ja short M01_L02
- ucomiss xmm0,xmm1
+ vucomiss xmm0,xmm1
ja short M01_L04
- ucomiss xmm0,xmm1
+ vucomiss xmm0,xmm1
jp short M01_L00
je short M01_L03
M01_L00:
- ucomiss xmm0,xmm0
+ vucomiss xmm0,xmm0
jp short M01_L01
je short M01_L04
M01_L01:
- ucomiss xmm1,xmm1
+ vucomiss xmm1,xmm1
jp short M01_L03
M01_L02:
mov eax,0FFFFFFFF
@@ -35,7 +36,7 @@ M01_L03:
M01_L04:
mov eax,1
ret
-; Total bytes of code 48
+; Total bytes of code 56
Note in the above we are doing 5 float compares worst-case: 1 compare if the arg is greater, 2 if it's less, 3 if they're equal, and 4 or 5 if there are NaNs; I wonder if we can improve on that somehow.
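For reference, the branch structure in the asm corresponds to comparison semantics along these lines; this is a sketch reconstructed from the observed branches, not the actual corelib source. The key property is that NaN sorts before every other value and two NaNs compare equal.

```csharp
public static class FloatCompareSketch
{
    // Sketch of Single.CompareTo-style semantics.
    public static int CompareFloats(float a, float b)
    {
        if (a < b) return -1;   // 1 compare resolves the "less" case
        if (a > b) return 1;    // 2 compares resolve "greater"
        if (a == b) return 0;   // 3 compares resolve "equal"
        // All ordered comparisons failed, so at least one operand is NaN.
        if (float.IsNaN(a))
            return float.IsNaN(b) ? 0 : -1;  // NaN sorts first; NaN == NaN
        return 1;               // only b is NaN
    }
}
```

The NaN checks are what force the extra ucomiss/jp pairs at the end of the assembly listing above.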
cc @dotnet/jit-contrib
Method with delegate invocation problem:
https://sharplab.io/#v2:C4LghgzgtgPgAgBgARwIwG4CwAoRLUB0ASgK4B2wAllAKYEDCA9lAA6UA2NATgMrcBulAMY0IWbDjgBmFACYk9JAG8cSNUhZdK/MMBpIAsgE8ewLiSHAA2gF0kAfQhIAvEjI0A7oZNmL11Dbi6kiq6tIoACyGABQAlMqhwWoA9MkKJFxcNBTsRkjAABaUTrSFjAAmSMVujMBVZOyU7pUg0YlJjlYINgQAKlxGAJJk/IwA1jRxQeoAvjhzErgyEL6W3qbmljgq2MHhaPIAguLtVgY0ZeWDrOzR55fXLOwA8ixUjGQQBIcA5j9ZEAg2how0aZCaP1iNna+yi/SGI3Gk1i7R2SXUhym7WCmNi0zUCwWQA==
category:cq theme:inlining skill-level:expert cost:small