dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
14.98k stars 4.66k forks source link

Allow to inline methods with delegate invocation (at least methods marked with AggressiveInlining) #10048

Closed dmitriyse closed 4 years ago

dmitriyse commented 6 years ago

Method with delegate invocation problem:

using System;
using System.Runtime.CompilerServices;

public class C {
    private MyStruct[] _s = new MyStruct[1];

    public void M() {
        // Currently this method is not inlined :(
        _s[0].TryInvoke();
    }
}

public struct MyStruct
{
    public Action A;

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public void TryInvoke()
    {
        A();
        A();
    }
}

https://sharplab.io/#v2:C4LghgzgtgPgAgBgARwIwG4CwAoRLUB0ASgK4B2wAllAKYEDCA9lAA6UA2NATgMrcBulAMY0IWbDjgBmFACYk9JAG8cSNUhZdK/MMBpIAsgE8ewLiSHAA2gF0kAfQhIAvEjI0A7oZNmL11Dbi6kiq6tIoACyGABQAlMqhwWoA9MkKJFxcNBTsRkjAABaUTrSFjAAmSMVujMBVZOyU7pUg0YlJjlYINgQAKlxGAJJk/IwA1jRxQeoAvjhzErgyEL6W3qbmljgq2MHhaPIAguLtVgY0ZeWDrOzR55fXLOwA8ixUjGQQBIcA5j9ZEAg2how0aZCaP1iNna+yi/SGI3Gk1i7R2SXUhym7WCmNi0zUCwWQA==

category:cq theme:inlining skill-level:expert cost:small

AndyAyersMS commented 6 years ago

@dmitriyse thanks -- would still love to see an example where allowing this sort of inlining has a signfiicant performance impact. Right now the costs are likely dominated by the delegate invokes, whether or not TryInvoke is inlined.

dmitriyse commented 6 years ago

I am working on prototype of "C# coroutines" where each virtual/non virtual call gives performance degradation, unfortunately work state is far from benchmarks. But in Task based EventLoop implementation I was forced to maually inline this method (~ 5 times):

            [MethodImpl(MethodImplOptions.AggressiveInlining)]
            public void ExecuteEvent()
            {
                if (SimpleAction != null)
                {
                    SimpleAction();
                }
                else if (Action != null)
                {
                    // ReSharper disable once PossibleNullReferenceException
                    Action(State);
                }
                else
                {
                    Debug.Assert(Task != null, "Task != null");

                    // ReSharper disable once PossibleNullReferenceException
                    // ReSharper disable once AssignNullToNotNullAttribute
                    _scheduler.TryExecuteTaskInternal(Task);
                }
            }

This gives me performance speed up from 23 Mops/sec to 30 Mops/sec. (On one Core)

In a "C# coroutines" prototype I am targeting at least 100 Mops/sec. So this type of inlining is even more critical.

mikedn commented 6 years ago

at least 100 Mops/sec

That sounds a bit far fetched. What does an "operation" is supposed to consist of? With current CPUs it probably needs to consist of less than 100 instructions to reach such a speed. Even 100 is extremely optimistic...

dmitriyse commented 6 years ago

Motivations and some benchmarks can be found here: https://github.com/dotnet/corefxlab/issues/2168 https://github.com/dotnet/corefx/issues/28025

omariom commented 4 years ago

@AndyAyersMS @mikedn

I suspect this restriction is from pre-2.0 times. I remember in .NET Framework 2.0 release notes it was mentioned that delegate calls had got significantly faster.

Nowadays delegate calls can be like 50% faster than interface calls, yet interface call-sites don't prevent the containing methods from being inlined.

Removing the restriction will let JIT do a little bit less work. No?

AndyAyersMS commented 4 years ago

Could be. I have some work in progress to move expansion of delegate invoke earlier. As part of that I'll also look into what is behind this inlining block.

AndyAyersMS commented 4 years ago

Looks like we could remove this block, but broadly speaking it looks like all that happens is that we increase code size -- knock-on benefits seem to be rare. Part of this is that the inlinee still ends up with a call, one that we can't inline, and with a calling sequence is usually more complex than the call to the inlinee.

Delegate invoke is not all that frequent, so net regression is small, but there are virtually no code size improvements.

Prototype changes here: master...AndyAyersMS:AllowInlineeDelegateInvoke

nietras commented 4 years ago

There are many reasons I could think of doing this, just check sorting code, which has to duplicate code paths to get optimal code generation for different scenarios. This is a general problem in .NET due to the canonical representation of code for reference types, so typically you might have to have different code paths for value types vs. reference types. The injection of a value type as a functor that can be fully inlined if needed (e.g. for value types) is a great pattern for this (which does not necessarily lead to overall increased code size, but instead much more relevant reduced source code size. The overhead for sorting is very much measurable.

So in my view there is compelling motivation :)

stephentoub commented 4 years ago

just check sorting code, which has to duplicate code paths to get optimal code generation for different scenarios

What duplicate sorting code would be removable if methods containing delegate invocations were inlined?

nietras commented 4 years ago

What duplicate sorting code would be removable if methods containing delegate invocations were inlined?

@stephentoub well this is all subject to interpretation, but the current sorting code is not optimal for a some code paths. Which I have touched upon before. Currently, there are three code paths Comparison<T>, T where T : IComparable<T>, TComparer where TComparer : IComparer<T> depending on keys or keys and values and some other stuff. TComparer isn't implemented as the original API proposal wanted it (that is me :grin:). None of these three code paths are actually optimal and on par with old sorting code for some primitives, that is the C++ version, which did not rely on IComparable<T>.

I have been (occasionally) trying to see if these four versions could be consolidated to one single source code version, that is still optimal for all scenarios. Outside the context of the dotnet runtime, since it seemed to me this was not a priority. Using hacks, trickery and other probably frowned upon changes I believe there is way... having the below call be inlined, would "solve" one part of the problem.

        readonly struct OpenDelegateObjectComparer : IComparer<object>
        {
            readonly Comparison<object> m_compare;

            public OpenDelegateObjectComparer(Comparison<object> compare) =>
                m_compare = compare;

            [MethodImpl(MethodImplOptions.AggressiveInlining)]
            public int Compare(object x, object y) => m_compare(x, y);
        }

The impact here compared to calling the comparison is significant going from 0.87 (Comparison<object>) to 1.03 (OpenDelegateObjectComparer) in a benchmark.

I understand that this is quite esoteric need, but in general having great support for struct functors that are actually inlined is very much appreciated for low level code. This works mostly for value types. For reference types or when there is a delegate things fall apart and you have to duplicate code. Despite code being quite "generic".

AndyAyersMS commented 4 years ago

Running the test from #37941, I see some small improvements in the delegate case, so I'll go ahead and put up a PR.

A few odd things about that test, though. Runtimes seem quite volatile-- the only overt diff is in the Comparison.Comparer case, but timings for the other cases are seemingly impacted too. Perhaps something to do with hard to predict branches.


BenchmarkDotNet=v0.12.1, OS=Windows 10.0.18363.900 (1909/November2018Update/19H2)
Intel Core i7-8650U CPU 1.90GHz (Kaby Lake R), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=5.0.100-preview.4.20258.7
  [Host]     : .NET Core 5.0.0 (CoreCLR 5.0.20.25106, CoreFX 5.0.20.25106), X64 RyuJIT
  Job-QUQWJF : .NET Core 5.0 (CoreCLR 42.42.42.42424, CoreFX 42.42.42.42424), X64 RyuJIT

Toolchain=CoreRun  

;;;; BEFORE (Float)

|             Method |     Mean |     Error |    StdDev | Ratio | RatioSD | Gen 0 | Gen 1 | Gen 2 | Allocated | Code Size |
|------------------- |---------:|----------:|----------:|------:|--------:|------:|------:|------:|----------:|----------:|
|          CompareTo | 1.349 ns | 0.0504 ns | 0.0421 ns |  0.43 |    0.02 |     - |     - |     - |         - |      83 B |
|           Comparer | 3.006 ns | 0.0623 ns | 0.0874 ns |  0.98 |    0.04 |     - |     - |     - |         - |      75 B |
|         Comparison | 3.061 ns | 0.0748 ns | 0.1228 ns |  1.00 |    0.00 |     - |     - |     - |         - |      37 B |
| ComparisonComparer | 3.713 ns | 0.0446 ns | 0.0417 ns |  1.20 |    0.05 |     - |     - |     - |         - |      66 B |

;;;; AFTER (Float)

|             Method |     Mean |     Error |    StdDev | Ratio | RatioSD | Code Size | Gen 0 | Gen 1 | Gen 2 | Allocated |
|------------------- |---------:|----------:|----------:|------:|--------:|----------:|------:|------:|------:|----------:|
|          CompareTo | 1.013 ns | 0.0399 ns | 0.0374 ns |  0.31 |    0.01 |      83 B |     - |     - |     - |         - |
|           Comparer | 2.873 ns | 0.0609 ns | 0.0508 ns |  0.88 |    0.02 |      75 B |     - |     - |     - |         - |
|         Comparison | 3.263 ns | 0.0647 ns | 0.0606 ns |  1.00 |    0.00 |      37 B |     - |     - |     - |         - |
| ComparisonComparer | 3.416 ns | 0.0709 ns | 0.0663 ns |  1.05 |    0.03 |      34 B |     - |     - |     - |         - |

Also I'd inadvertently run the test initially with tired compilation disabled, and the prejitted code for the float comparisons seemingly performs better than the jitted code (eg Comparer 3ns vs 2.5ns).

;;;; BEFORE (no-tc, float)

|             Method |     Mean |     Error |    StdDev | Ratio | RatioSD | Code Size | Gen 0 | Gen 1 | Gen 2 | Allocated |
|------------------- |---------:|----------:|----------:|------:|--------:|----------:|------:|------:|------:|----------:|
|          CompareTo | 1.114 ns | 0.0510 ns | 0.0426 ns |  0.36 |    0.01 |      75 B |     - |     - |     - |         - |
|           Comparer | 2.481 ns | 0.0634 ns | 0.0593 ns |  0.80 |    0.02 |      72 B |     - |     - |     - |         - |
|         Comparison | 3.078 ns | 0.0150 ns | 0.0125 ns |  1.00 |    0.00 |      37 B |     - |     - |     - |         - |
| ComparisonComparer | 4.435 ns | 0.0292 ns | 0.0244 ns |  1.44 |    0.01 |      66 B |     - |     - |     - |         - |

Asm diff shows it's vex vs non-vex and here the non-vex is seemingly faster.

 ; System.Single.CompareTo(Single)
-       movss     xmm0,dword ptr [rcx]
-       ucomiss   xmm1,xmm0
+       vzeroupper
+       vmovss    xmm0,dword ptr [rcx]
+       vucomiss  xmm1,xmm0
        ja        short M01_L02
-       ucomiss   xmm0,xmm1
+       vucomiss  xmm0,xmm1
        ja        short M01_L04
-       ucomiss   xmm0,xmm1
+       vucomiss  xmm0,xmm1
        jp        short M01_L00
        je        short M01_L03
 M01_L00:
-       ucomiss   xmm0,xmm0
+       vucomiss  xmm0,xmm0
        jp        short M01_L01
        je        short M01_L04
 M01_L01:
-       ucomiss   xmm1,xmm1
+       vucomiss  xmm1,xmm1
        jp        short M01_L03
 M01_L02:
        mov       eax,0FFFFFFFF
@@ -35,7 +36,7 @@ M01_L03:
 M01_L04:
        mov       eax,1
        ret
-; Total bytes of code 48
+; Total bytes of code 56

Note in the above we are doing 5 float compares worst-case, with 1 compare if the arg is greater, 2 if it's less, 3 if they're equal, and 4 or 5 if there are nans; I wonder if we can improve on that somehow.

cc @dotnet/jit-contrib