dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
15.05k stars 4.69k forks source link

Inefficient codegen when a lambda references an instance member #40009

Open smoogipoo opened 4 years ago

smoogipoo commented 4 years ago

Description

We've recently encountered a method allocating even though there was no allocate-y code running. We've since tracked this down to the following:

IL shows the capturing lambda's DisplayClass always seems to be "allocated" as soon as the method is entered, but the JIT has the ability to control when the object is initialised. When the lambda captures both a local and this, the JIT is suddenly unable to forego this initialisation even if the method terminates execution early.

Here's a gist that shows the issue: https://gist.github.com/smoogipoo/810c033962b8a95289dbdabd9d84801b And a SharpLab repro.

Is this a valid and/or known issue, or something we should be wary of going forward?

Configuration + Data

BenchmarkDotNet=v0.12.1, OS=manjaro 
AMD Ryzen 9 3950X, 1 CPU, 32 logical and 16 physical cores
.NET Core SDK=3.1.103
  [Host]        : .NET Core 2.2.8 (CoreCLR 4.6.28207.03, CoreFX 4.6.28208.02), X64 RyuJIT
  .NET Core 2.2 : .NET Core 2.2.8 (CoreCLR 4.6.28207.03, CoreFX 4.6.28208.02), X64 RyuJIT
  .NET Core 3.1 : .NET Core 3.1.3 (CoreCLR 4.700.20.11803, CoreFX 4.700.20.12001), X64 RyuJIT
Method Job Runtime Mean Error StdDev Ratio RatioSD Gen 0 Gen 1 Gen 2 Allocated
SafeMethod .NET Core 2.2 .NET Core 2.2 545.8 ns 0.47 ns 0.41 ns 1.00 0.00 - - - -
AllocatingMethod .NET Core 2.2 .NET Core 2.2 7,151.4 ns 135.02 ns 112.75 ns 13.10 0.20 24.4064 - - 32000 B
SafeMethod .NET Core 3.1 .NET Core 3.1 276.8 ns 0.26 ns 0.23 ns 1.00 0.00 - - - -
AllocatingMethod .NET Core 3.1 .NET Core 3.1 9,072.8 ns 179.96 ns 438.05 ns 32.64 1.40 0.9460 - - 32000 B

I've also run this on mono, which seems to always initialise the object (see Gen 0, Benchmark.Net doesn't seem to have allocation tracking on mono).

BenchmarkDotNet=v0.12.1, OS=manjaro
AMD Ryzen 9 3950X, 1 CPU, 32 logical and 16 physical cores
  [Host]     : Mono 6.4.0 (makepkg/fe64a4765e6 Sat), X64
  Job-CKTNGU : Mono 6.4.0 (makepkg/fe64a4765e6 Sat), X64

Runtime=Mono
Method Mean Error StdDev Ratio Gen 0 Gen 1 Gen 2 Allocated
SafeMethod 4.491 us 0.0069 us 0.0054 us 1.00 5.7449 - - -
AllocatingMethod 5.031 us 0.0076 us 0.0064 us 1.12 7.6904 - - -

category:cq theme:dead-code skill-level:expert cost:medium

jaredpar commented 4 years ago

It's unclear what you're asking for here. Do you believe there is a bug in C# compiler, the JIT, etc ...

smoogipoo commented 4 years ago

It seems to be a JIT issue, though I don't think I'm fully qualified to be able to answer that.

IL-wise they generate very similar code:

    .method private hidebysig 
        instance int32 safeMethod () cil managed 
    {
        // Method begins at RVA 0x2050
        // Code size 8 (0x8)
        .maxstack 3
        .locals init (
            [0] class C/'<>c__DisplayClass2_0' 'CS$<>8__locals0'
        )

        // sequence point: hidden
        IL_0000: newobj instance void C/'<>c__DisplayClass2_0'::.ctor()
        IL_0005: stloc.0
        IL_0006: ldc.i4.0
        IL_0007: ret
    } // end of method C::safeMethod

    .method private hidebysig 
        instance int32 allocatingMethod () cil managed 
    {
        // Method begins at RVA 0x2064
        // Code size 15 (0xf)
        .maxstack 3
        .locals init (
            [0] class C/'<>c__DisplayClass3_0' 'CS$<>8__locals0'
        )

        // sequence point: hidden
        IL_0000: newobj instance void C/'<>c__DisplayClass3_0'::.ctor()
        IL_0005: stloc.0
        IL_0006: ldloc.0
        IL_0007: ldarg.0
        IL_0008: stfld class C C/'<>c__DisplayClass3_0'::'<>4__this'
        IL_000d: ldc.i4.0
        IL_000e: ret
    } // end of method C::allocatingMethod

Whereas the JIT ASM is very different:

C.safeMethod()
    L0000: xor eax, eax
    L0002: ret

C.allocatingMethod()
    L0000: push esi
    L0001: mov esi, ecx
    L0003: mov ecx, 0x9c0cadc
    L0008: call 0x05c230cc
    L000d: lea edx, [eax+4]
    L0010: call 0x0f7fa3b0
    L0015: xor eax, eax
    L0017: pop esi
    L0018: ret

With the only difference being that an instance field is referenced in the allocatingMethod path.

jaredpar commented 4 years ago

Moving back to runtime as the issue is about the JIT generation for the pattern emitted by Roslyn, not the Roslyn pattern itself. At least that is my reading at this point. If the discussion comes back to the pattern emitted here we can move back to Roslyn.

stephentoub commented 4 years ago

Isn't the concern here that the display class is new'd up at the beginning of the method? That's entirely due to the C# compiler's IL code gen and is not something the JIT would change. I don't see how this is a runtime or JIT issue.

(The asm is different because in one case the JIT can actually see that the method is a nop and makes it such.)

jaredpar commented 4 years ago

@stephentoub I've been reading it as they want the JIT to special case and optimize this pattern. @smoogipoo is acknowledging the IL difference but in the face of that still wants the JIT ASM to be the same.

AndyAyersMS commented 4 years ago

Seems like incorrect odd IL to me -- for this source

    private int allocatingMethod()
    {
        return 0;

        int local = 2;
        return list.Count(_ => field > local);
    }

CSC produces this IL:

IL_0000  73 08 00 00 06    newobj       0x6000008
IL_0005  0a                stloc.0     
IL_0006  06                ldloc.0     
IL_0007  02                ldarg.0     
IL_0008  7d 04 00 00 04    stfld        0x4000004
IL_000d  16                ldc.i4.0    
IL_000e  2a                ret         

Now, given that IL, the jit should arguably be able to optimize away the newobj, but this is blocked because the jit doesn't propagate non-nullness for local 0 fast enough (it could/should happen during local AP in morph), and so the jit thinks the stfld might cause an exception, and that keeps the newobj live long enough that it can't be optimized away.

stephentoub commented 4 years ago

I've been reading it as they want the JIT to special case and optimize this pattern

Why couldn't Roslyn similarly avoid emitting the dead code?

AndyAyersMS commented 4 years ago

@smoogipoo I take it this is distilled from some more complex example? If so, do you have examples where Roslyn hoists the closure construction up above more interesting constructs than return 0?

Eager construction is likely deemed safe as the only observable effect is extra heap allocation, but as you note this allocation is not expected and seemingly should be deferred until the point of plausible use.

jaredpar commented 4 years ago

Why couldn't Roslyn similarly avoid emitting the dead code?

Roslyn certainly could. Assuming this wasn't done initially because it can complicate code gen a bit and it's a scenario for which the user has received a warning. Generally that gets prioritized fairly low.

Eager construction is likely deemed safe as the only observable effect is extra heap allocation, but as you note this allocation is not expected and seemingly should be deferred until the point of plausible use.

Agree Roslyn would ideally delay the closure allocation until it was actually used. This has come up before and is tracked by issue 20777. There have been a few discussions over the years but historically there has been concern the extra complexity will outweigh the wins. May be worth taking another look at once we get past C# 9.

In terms of resolving this issue. If we want to look at it as a Roslyn request we can transfer back to roslyn and I can dupe it against 20777 + add parts of the scenario. If want to have it track the potential JIT optimization discussed then we can leave it here. Either way is good by me.

smoogipoo commented 4 years ago

@AndyAyersMS The example I posted was hoisted from https://github.com/ppy/osu-framework/blob/c857318bdf8d3bb06b2d806bc957be571bded0ed/osu.Framework/Graphics/Cursor/TooltipContainer.cs#L195-L253.

The code which allocates (L235-252) runs <1% of the time, with majority of exits being done through L214.

The resultant IL: https://gist.github.com/smoogipoo/0ae07bebe7478c6960da22b16983a67b

AndyAyersMS commented 4 years ago

@jaredpar since you already have an issue for this we can keep this one to cover the codegen aspect. Marking as future.

stephentoub commented 4 years ago

@AndyAyersMS, but that won't help with the actual example, no? I don't see how the JIT can eliminate the allocation in the real case, unless it's going to stack alloc or something.

AndyAyersMS commented 4 years ago

The jit is already optimizing away heap allocations (as in safeMethod) when object construction doesn't have side effects and the object is unreferenced after construction.

removing stmt with no side effects

Removing statement STMT00000 (IL 0x000...0x005)
N005 ( 17, 16) [000003] -AC-----R---              *  ASG       ref   
N004 (  1,  1) [000002] D------N----              +--*  LCL_VAR   ref    V03 tmp1         
N003 ( 17, 16) [000001] --C---------              \--*  CALL help ref    HELPER.CORINFO_HELP_NEWSFAST
N002 (  3, 10) [000000] ------------ arg0 in rcx     \--*  CNS_INT(h) long   0xd1ffab1e method
 in BB01 as useless:

resulting code:

; Assembly listing for method BenchmarkMethodDelegateAllocations:safeMethod():int:this
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; optimized code
; rsp based frame
; partially interruptible
; Final local variable assignments
;
;* V00 this         [V00    ] (  0,  0   )     ref  ->  zero-ref    this class-hnd
;* V01 loc0         [V01    ] (  0,  0   )     ref  ->  zero-ref    class-hnd exact
;# V02 OutArgs      [V02    ] (  1,  1   )  lclBlk ( 0) [rsp+0x00]   "OutgoingArgSpace"
;* V03 tmp1         [V03    ] (  0,  0   )     ref  ->  zero-ref    class-hnd exact "NewObj constructor temp"
;
; Lcl frame size = 0

G_M42318_IG01:
                        ;; bbWeight=1    PerfScore 0.00
G_M42318_IG02:
       xor      eax, eax
                        ;; bbWeight=1    PerfScore 0.25
G_M42318_IG03:
       ret      
                        ;; bbWeight=1    PerfScore 1.00

so the jit issue is to try and recognize these cases more broadly, removing completely dead allocations, and perhaps using code motion to sink allocations when objects are conditionally live when created.

I don't know how common these dead or partially dead allocation are, so a first step would be to run an analysis screen and see how often we see these.

Outside of exceptions and writes to statics, there can be other observable effects of allocation and construction. For allocation, the runtime tells the jit if the associated new helper has side effects so that the jit won't allocate away dead finalizable allocations, and similar cases. For construction, the jit currently must inline the entire constructor path before it can prove construction is side effect free.

jakobbotsch commented 1 month ago

We cannot get rid of the object allocation today because we consider it used due to a write to one of its field.

Presumably object stack allocation would be able to kick in and remove the allocation if we allowed stack allocating objects in loops.

smoogipoo commented 2 weeks ago

I was reading the recent .NET 9 performance improvements blog post (https://devblogs.microsoft.com/dotnet/performance-improvements-in-net-9/#object-stack-allocation) and had the above comment in the back of my mind.

If I go back to my benchmark and add MethodImplOptions.NoInlining to the methods, then the results are:

| Method           | Job      | Runtime  | Mean     | Error     | StdDev    | Ratio | RatioSD | Gen0   | Allocated | Alloc Ratio |
|----------------- |--------- |--------- |---------:|----------:|----------:|------:|--------:|-------:|----------:|------------:|
| SafeMethod       | .NET 8.0 | .NET 8.0 | 1.126 us | 0.0205 us | 0.0182 us |  1.00 |    0.00 |      - |         - |          NA |
| AllocatingMethod | .NET 8.0 | .NET 8.0 | 4.745 us | 0.0077 us | 0.0060 us |  4.22 |    0.07 | 5.0964 |   32000 B |          NA |
|                  |          |          |          |           |           |       |         |        |           |             |
| SafeMethod       | .NET 9.0 | .NET 9.0 | 1.115 us | 0.0094 us | 0.0084 us |  1.00 |    0.00 |      - |         - |          NA |
| AllocatingMethod | .NET 9.0 | .NET 9.0 | 1.111 us | 0.0036 us | 0.0030 us |  1.00 |    0.01 |      - |         - |          NA |

... which is already awesome! This covers probably 95+% of cases where this comes up for us, because we're "looping" at a very high level (global game loop).