dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
15.09k stars 4.7k forks source link

RyuJIT seems to recreate delegates for closures declared within generic methods #8349

Closed djluck closed 4 years ago

djluck commented 7 years ago

I've identified a situation where RyuJIT generates inefficient code that involve closures defined within generic methods.

For example, given this code:

public class LambdaTest
{
    [Benchmark]
    public string ClosureWithGeneric()
    {
        return Closure_GetStr<string>();
    }

    private string _aString = "a";

        //This is the troublesome method: Note the generic type parameter <T> and the closure, x
    private string Closure_GetStr<T>()
    {
        Func<string> x = () => { return _aString; };
        return x();
    }
}

void Main()
{
       // Benchmark using BenchmarkDotNet in the following environments:
    var config = DefaultConfig.Instance
        .With(new Job("X64-Clr-RyuJit", RunMode.Medium, EnvMode.Clr)
        {
            Env = { Runtime = Runtime.Clr, Platform = Platform.X64, Jit = Jit.RyuJit },

        })
        .With(new Job("X64-Clr-LegacyJit", RunMode.Medium, EnvMode.Clr)
        {
            Env = { Runtime = Runtime.Clr, Platform = Platform.X64, Jit = Jit.LegacyJit },

        });

    BenchmarkRunner
        .Run<LambdaTest>(config);
}

The results of running the above code are:

Method Job Jit Mean Error StdDev
ClosureWithGeneric X64-Clr-LegacyJit LegacyJit 10.71 ns 0.3135 ns 0.4595 ns
ClosureWithGeneric X64-Clr-RyuJit RyuJit 265.38 ns 3.6844 ns 5.1651 ns

RyuJIT is about 20x slower than the legacy JIT. I've confirmed that this issue exists in dotnet core 2.0.

The issue seems to be caused by heavy calls to COMDelegate::DelegateConstruct, which does not occur when using the legacy JIT:

image

Closures in non-generic methods and Lambdas in generic methods do not seem to be affected. It is only the combination of the two.

djluck commented 7 years ago

After a co-worker asked if this applied to any variables outside the scope of the closure or just to class members, I amended the benchmark to:

public class LambdaTest
{
    [Benchmark]
    public string ClosureWithGeneric()
    {
        return Closure_GetStr<string>();
    }

    private string _aString = "a";

    private string Closure_GetStr<T>()
    {
        //y now holds a reference to _aString, x is still a closure
        var y = _aString;
        Func<string> x = () => { return y; };
        return x();
    }
}

Which gave the times:

Method Job Jit Mean Error StdDev
ClosureWithGeneric X64-Clr-LegacyJit LegacyJit 18.15 ns 0.5168 ns 0.7412 ns
ClosureWithGeneric X64-Clr-RyuJit RyuJit 13.75 ns 0.8877 ns 1.2731 ns

So it seems it is specific to closures that directly reference class members.

mikedn commented 7 years ago

Both the legacy JIT and RyuJIT allocate that delegate. In general, none of the existing .NET Framework JITs eliminate delegate allocations.

The problem here seems to be in the delegate constructor, it's much slower when RyuJIT is used.

BruceForstall commented 7 years ago

@dotnet/jit-contrib

djluck commented 7 years ago

An expanded view of the PerfView trace looks like this:

image

If it's helpful, I can upload the original perfview trace.

mikedn commented 7 years ago

Look at the perfview trace of the fast version as well. You'll notice that COMDelegate::DelegateConstruct doesn't appear in that case.

jkotas commented 7 years ago

fgOptimizeDelegateConstructor is where call to COMDelegate::DelegateConstruct is supposed to be replaced with the much more efficient equivalent.

mikedn commented 7 years ago

fgOptimizeDelegateConstructor is where call to COMDelegate::DelegateConstruct is supposed to be replaced with the much more efficient equivalent.

Yeah, and that one expects the function address to follow a certain pattern (e.g. be GT_FTN_ADDR) but in this case the function address ends up in a lclvar. What seems strange is that runtime lookup is involved in this despite that neither class nor the delegate target are generic.

jkotas commented 7 years ago

ldftn instance string LambdaTest::'<Closure_GetStr>b__2_0'<!!0>() needs to load instantiating stub.

The fix maybe similar to dotnet/coreclr#10663 (note that this change was CoreRT ABI only): don't depend on pattern matching and always perform the optimization for verifiable delegate creation sequence by construction.

mikedn commented 7 years ago

ldftn instance string LambdaTest::'b__2_0'<!!0>() needs to load instantiating stub.

Right, the delegate target is actually a generic method. I wonder why the C# compiler does that but that's another problem.

RussKeldorph commented 7 years ago

@sandreenko FYI

sandreenko commented 7 years ago

There is an issue for R2R dotnet/runtime#7839 (optimize delegate constructor in R2R for a virtual fp).