CodeGenerator shouldn't apply newobj --> call optimization

Oh hi! This is my first issue on dotnet/roslyn, so I apologize if I made any rookie mistakes.

Version Used: csc 1.3.2 (from the Microsoft.Net.Compilers NuGet package)

Steps to Reproduce:

Compile the following C# source code (I used csc InPlaceConstruction.cs, tested with both /optimize+ and /optimize-):

using System;

public struct Point
{
    public Point(int X, int Y)
    {
        this = default(Point);
        this.X = X;
        this.Y = Y;
    }
    public Point(ref Point Other)
    {
    this = default(Point);
        this.X = Other.X;
        this.Y = Other.Y;
    }

    public readonly int X;
    public readonly int Y;

    public override string ToString()
    {
        return "(" + X + ", " + Y + ")";
    }
}

public static class Program
{
    public static void Main(string[] Args)
    {
        Point pt1, pt2;
        pt1 = new Point(1, 2);
        pt2 = new Point(ref pt1);
        pt1 = new Point(ref pt1);
        Console.WriteLine(pt1);
        Console.WriteLine(pt2);
    }
}

Run the resulting program.

Expected Behavior:

I expect this to print

(1, 2)
(1, 2)

I base this interpretation on the section on new-object-expressions in the (tentative) GitHub C# spec. Specifically, it states that:

The run-time processing of an object_creation_expression of the form new T(A), where T is class_type or a struct_type and A is an optional argument_list, consists of the following steps:

If T is a class_type:

...

If T is a struct_type:

An instance of type T is created by allocating a temporary local variable. Since an instance constructor of a struct_type is required to definitely assign a value to each field of the instance being created, no initialization of the temporary variable is necessary.

The instance constructor is invoked according to the rules of function member invocation (Compile-time checking of dynamic overload resolution). A reference to the newly allocated instance is automatically passed to the instance constructor and the instance can be accessed from within that constructor as this.

Actual Behavior:

The resulting program prints

(0, 0)
(1, 2)

I don't think that can be reconciled with what the spec says.

I have investigated the issue a little. The disassembly for Main is

    // Methods
    .method public hidebysig static 
        void Main (
            string[] Args
        ) cil managed 
    {
        // Method begins at RVA 0x20e0
        // Code size 53 (0x35)
        .maxstack 3
        .locals init (
            [0] valuetype Point,
            [1] valuetype Point
        )

        IL_0000: nop                  // Do nothing (No operation)
        IL_0001: ldloca.s 0           // Load address of local variable with index indx, short form
        IL_0003: ldc.i4.1             // Push 1 onto the stack as int32
        IL_0004: ldc.i4.2             // Push 2 onto the stack as int32
        IL_0005: call instance void Point::.ctor(int32, int32) // Call method indicated on the stack with arguments
        IL_000a: ldloca.s 1           // Load address of local variable with index indx, short form
        IL_000c: ldloca.s 0           // Load address of local variable with index indx, short form
        IL_000e: call instance void Point::.ctor(valuetype Point&) // Call method indicated on the stack with arguments
        IL_0013: ldloca.s 0           // Load address of local variable with index indx, short form
        IL_0015: ldloca.s 0           // Load address of local variable with index indx, short form
        IL_0017: call instance void Point::.ctor(valuetype Point&) // Call method indicated on the stack with arguments
        IL_001c: ldloc.0              // Load local variable 0 onto stack
        IL_001d: box Point            // Convert a boxable value to its boxed form
        IL_0022: call void [mscorlib]System.Console::WriteLine(object) // Call method indicated on the stack with arguments
        IL_0027: nop                  // Do nothing (No operation)
        IL_0028: ldloc.1              // Load local variable 1 onto stack
        IL_0029: box Point            // Convert a boxable value to its boxed form
        IL_002e: call void [mscorlib]System.Console::WriteLine(object) // Call method indicated on the stack with arguments
        IL_0033: nop                  // Do nothing (No operation)
        IL_0034: ret                  // Return from method, possibly with a value
    } // end of method Program::Main

The interesting bit here is that all instances are initialized by call instructions instead of newobj/stloc. This is a neat optimization for the first and second calls, but it produces unexpected behavior when the last constructor call is executed.

What seems to happen is that PartialCtorResultCannotEscape in Microsoft.CodeAnalysis.CSharp.CodeGen.CodeGenerator doesn't realize that the this pointer passed to the Point(ref Point Other) constructor aliases/may alias the Other parameter. It then decides that it's okay to use a call instruction to (re-)initialize pt1, rather than going with the more conservative newobj/stloc combo. And that sort of breaks my intuition/my reading of the spec.

I would also like to note that a program compiled by mcs prints the same output here. I suspect that they're using the same heuristic as PartialCtorCannotEscape, but I haven't looked at the relevant source code (yet).

Is this something that you're interested in fixing? I'm asking because I'm writing a C# compiler myself, and I'm not sure which semantics I should implement. (For the moment, I'm sticking with the expected behavior.)

Anyway, thanks for reading through this issue! :)

Weird. IIRC, newobj can be used to allocate a struct and push it on the stack but that's not the usual way that is done and I don't believe that I've ever seen the C# compiler do that. The portion of the spec that you quoted seems to fit with the behavior that the compiler is exhibiting, namely that it loads the address of a local onto the stack and then invokes the constructer directly. The verbiage is a little confusing since it mentions allocation, but that wouldn't be a separate step with newobj.

Assuming the above behavior is deemed inappropriate what would the compiler be expected to emit and how would the compiler detect this kind of situation?

The spec specifically states that (emphasis mine)

An instance of type T is created by allocating a temporary local variable

And clearly that's not happening: the last new expression loads the address of an existing, non-temporary local variable onto the stack. As far as I can tell, this is purely an optimization: the code generator will only replace newobj/stloc with ldloca/call if both of the following are true:

The target variable is not on the heap, i.e., it's a local or a parameter. This requirement tries to ensure that the this pointer doesn't alias with any other reachable pointer (but fails to).
The target variable is either not in a try block, or it is defined immediately inside the innermost try block. This requirement makes sure that a local does not end up in a half-constructed state because an exception was thrown in the constructor.

I would expect the constructor call to be compiled as if one of the above requirements had not been satisfied. That is, I expect the Main method above to produce the same results as

public static void Main(string[] Args)
{
    Point pt1 = default(Point), pt2 = default(Point);
    try
    {
        pt1 = new Point(1, 2);
        pt2 = new Point(ref pt1);
        pt1 = new Point(ref pt1);
    }
    catch (Exception)
    { }
    Console.WriteLine(pt1);
    Console.WriteLine(pt2);
}

Which outputs

(1, 2)
(1, 2)

Wrapping code in a try block shouldn't have this kind of observable difference if no exceptions are thrown, right?

Interestingly, mcs still produces a program that prints

(0, 0)
(1, 2)

I think this may be dangerous if new Point throws an exception, but that's kind of off-topic here.

The disassembly for the Main method I just wrote and compiled with csc is

    .method public hidebysig static 
        void Main (
            string[] Args
        ) cil managed 
    {
        // Method begins at RVA 0x20e0
        // Code size 75 (0x4b)
        .maxstack 2
        .locals init (
            [0] valuetype Point,
            [1] valuetype Point
        )

        IL_0000: nop                  // Do nothing (No operation)
        IL_0001: ldloca.s 0           // Load address of local variable with index indx, short form
        IL_0003: initobj Point        // Initialize the value at address dest
        IL_0009: ldloca.s 1           // Load address of local variable with index indx, short form
        IL_000b: initobj Point        // Initialize the value at address dest
        IL_0011: nop                  // Do nothing (No operation)
        IL_0012: ldc.i4.1             // Push 1 onto the stack as int32
        IL_0013: ldc.i4.2             // Push 2 onto the stack as int32
        IL_0014: newobj instance void Point::.ctor(int32, int32) // Allocate an uninitialized object or value type and call ctor
        IL_0019: stloc.0              // Pop a value from stack into local variable 0
        IL_001a: ldloca.s 0           // Load address of local variable with index indx, short form
        IL_001c: newobj instance void Point::.ctor(valuetype Point&) // Allocate an uninitialized object or value type and call ctor
        IL_0021: stloc.1              // Pop a value from stack into local variable 1
        IL_0022: ldloca.s 0           // Load address of local variable with index indx, short form
        IL_0024: newobj instance void Point::.ctor(valuetype Point&) // Allocate an uninitialized object or value type and call ctor
        IL_0029: stloc.0              // Pop a value from stack into local variable 0
        IL_002a: nop                  // Do nothing (No operation)
        IL_002b: leave.s IL_0032      // Exit a protected region of code, short form
        IL_002d: pop                  // Pop value from the stack
        IL_002e: nop                  // Do nothing (No operation)
        IL_002f: nop                  // Do nothing (No operation)
        IL_0030: leave.s IL_0032      // Exit a protected region of code, short form
        IL_0032: ldloc.0              // Load local variable 0 onto stack
        IL_0033: box Point            // Convert a boxable value to its boxed form
        IL_0038: call void [mscorlib]System.Console::WriteLine(object) // Call method indicated on the stack with arguments
        IL_003d: nop                  // Do nothing (No operation)
        IL_003e: ldloc.1              // Load local variable 1 onto stack
        IL_003f: box Point            // Convert a boxable value to its boxed form
        IL_0044: call void [mscorlib]System.Console::WriteLine(object) // Call method indicated on the stack with arguments
        IL_0049: nop                  // Do nothing (No operation)
        IL_004a: ret                  // Return from method, possibly with a value

        Try IL_0011-IL_002d Catch class [mscorlib]System.Exception IL_002d-IL_0032
    } // end of method Program::Main

So, to answer your question, I propose that the compiler emit the IL below for the final assignment to pt1 in my original Main method.

newobj instance void Point::.ctor(valuetype Point&)
stloc.0

Detecting this sort of situation is hard – which is, I'm assuming, why PartialCtorResultCannotEscape fails to return false in this case. I can see three ways to get around this problem:

Always use newobj/stloc. This will always produce the result I'd expect, but it's not as efficient as a simple ldloca/call.
Iterate over all parameters to the constructor before trying to apply the optimization, and don't apply the optimization if one of them is a ref T, where T is the type that's being constructed.

However, I'm fairly sure this will still break unsafe code. For example, the constructor below will still be invoked inappropriately if the third new-expression in my original program is changed to pt1 = new Point(&pt1);.
```
public unsafe Point(Point* Other)
{
    this = default(Point);
    this.X = Other->X;
    this.Y = Other->Y;
}
```
I suppose the code generator could also try to check if a T* parameter is in the parameter list, but that will still break if someone uses pointer casts to obscure that the pointer actually points to a T.
Analyze the constructor to find out if it loads values from a pointer/reference to T, and only emit a call if that's not the case. This is strictly speaking an inter-procedural analysis, and I'm not entirely sure how good Roslyn is at those right now. So this solution may be the hardest one to implement, but it will likely net you the best results.

Hi again. I have implemented option three in my own compiler, and I thought you might be interested in how I decide when to use ldloca/call instead of newobj/stloc. (If you're not, then that's fine too!)

Disclaimer

Let me start with this little disclaimer: I think my approach will always produce correct results, but haven't proven that this is actually the case. If you spot any flaws in my reasoning, then please don't hesitate to point them out – in fact I'd be very grateful if you did.

Also, I'll pretend like this is a pointer for the remainder of this comment. In C# it's not, but it is in IL, so I don't think that's all that unfair of me.

When is the `newobj` --> `call` transformation safe?

My interpretation of the spirit of the spec (yes, I know that's incredibly vague, but bear with here :) ) is that it wants to make sure that every new-object expression initializes a unique region of storage, i.e., when the constructor is run, there is no pointer other than this that points to *this. In other words: the this pointer cannot alias any other pointers.

Now, that's a fairly strict constraint. And clearly using call to initialize a struct breaks that rule, because you need to have a pointer to the region of storage you'd like to initialize before you can pass it as an argument to the constructor. But what PartialCtorResultCannotEscape does – that is, it checks if TargetIsNotOnHeap – is actually pretty devious. It basically boils down to stating that: "if the constructor cannot access any pointers that point to *this but are not the this pointer itself, then it is as if a unique region of storage is initialized."

Checking TargetIsNotOnHeap is admittedly a flawed way of verifying whether this property holds for a given new-object expression but I do believe that the general idea of performing the newobj --> call transformation if and only if no pointer that does not derive from this may alias the this pointer is a good starting point.

When I say that a pointer derives from another pointer, I mean that either that pointer, or the result of applying field access to a derived pointer. So &a.b.c derives from both &a.b and &a.

To recap: suppose that the following statements are written in the context of a constructor:

// `this.x` (partially) aliases `this`, but is derived from `this`.
this.x = 1; // Does not preclude newobj --> call.

// `y` may point to anything, _including_ some part of `this`.
// It may (partially) alias `this`, and is not derived from `this`.
this.x = *y; // Precludes newobj --> call

How do we check that this property holds?

I decided to invert the problem. Instead of checking if the newobj --> call transformation is safe at the call site, I considered it to be more appropriate to check if the transformation is safe at the constructor definition site.

Specifically, I walk through all statements and expressions in the constructor, and check if they may (partially) alias this without being derived from this. If that is the case, then I mark the constructor as ineligible for the newobj --> call transformation.

But wait, there's more

Function calls and new-object expressions in the constructor are a special case. Unfortunately, they're also pretty common: accessing a property amounts to a function call. So we can't just bail out if we encounter one.

First off, if I don't have a function/constructor's source code, then I do bail out, since it's the only sensible thing to do. I mark the constructor that calls it as ineligible for the newobj --> call transformation.

But if I do have the source code, then I try to find all pointers derived from the this pointer as either the this argument of the call, or as (pointer) argument to the call/object instantiation. If if find any, then I apply the analysis from the previous section to the callee, but substitute appropriate parameter(s) to said function for the this pointer.

Full disclosure: this is a slight simplification. I actually use two sets of pointer expressions: a root set and an alias set. That allows me to check if the callee does not prohibit the newobj --> call expression even when I can't fully map my root set, i.e., the this pointer, to function parameters. But I think the root and alias sets make understanding how my algorithm works much harder and don't add much conceptual value.

Does any of this even work?!

I think so. My results seem to indicate so.

You're welcome to browse my implementation of the analysis that I just tried to describe on my GitHub. (Sorry, no syntax highlighting.)

One nice byproduct of my algorithm is that it allows me to go beyond what C# compilers have done in the past:

var xs = new Vector2[1];
xs[0] = new Vector2(1, 2);

gets compiled to

IL_005b:  ldc.i4.1 
IL_005c:  newarr Vector2
IL_0061:  stloc.3 
IL_0062:  ldloc.3 
IL_0063:  ldc.i4.0 
IL_0064:  ldelema Vector2
IL_0069:  ldc.r8 5.
IL_0072:  ldc.r8 4.
IL_007b:  call instance void valuetype Vector2::'.ctor'(float64, float64)

No newobj. I'm pretty happy about that. :)

I think we should suppress the in-place optimization if the ctor takes anything by reference - ref/out/_arglist/TypedReference etc..

Otherwise the constructor can do indirect assignments that would observably break the create-then-assign semantics.

It does not seem to be worth doing alias tracking. Struct ctors taking refs are not common, perf penalty is not huge. Besides with addition of ref locals the tracking could be fairly hard to do. It woudl be easier to only allow the optimization when all parameters are by-value.

Right, but wouldn't that break unsafe code? Consider the following example.

using System;

public struct Point
{
    public Point(int X, int Y)
    {
        this = default(Point);
        this.X = X;
        this.Y = Y;
    }
    public unsafe Point(int X)
    {
        this = default(Point);
        this.X = X;
        this.Y = Program.ptr->Y;
    }

    public int X { get; private set; }
    public int Y { get; private set; }

    public override string ToString()
    {
        return "(" + X + ", " + Y + ")";
    }
}

public static unsafe class Program
{
    public static Point* ptr;

    public static void Main(string[] Args)
    {
        Point vec;
        vec = new Point(1, 2);
        ptr = &vec;
        vec = new Point(1);
        Console.WriteLine(vec.X);
        Console.WriteLine(vec.Y);
    }
}

It's basically a copy. I expect this to print 1 and then 2, but it'll print 1 and then 0 if the second assignment is optimized to a call.

But the parameter list for Point(int) doesn't include any pointers at all, so there's just no way to tell if the newobj --> call optimization is legal just by looking at the parameter list.

@jonathanvdc there are many safeguards in the language that unsafe code can get around. It disrespects scoping and lifetimes of variables. I.E. - you can get a ptr to a local and use that outside of the scope of that local or even its method. The code might still work depending on how locals/stack are implemented, but it would be very implementation dependent.

So, no - Ability of pointers to expose optimizations and implementation details is not a big concern.

Thanks for your reply!

However, I don't quite see what pertinence variable lifetimes have here – my example program does not try to use any storage locations beyond their point of deallocation, regardless of the implementation.

I do think the notion of undefined behavior from C/C++ might be useful here; I would agree with you that it is not entirely unreasonable to optimize based on undefined behavior. For example, if I had used a pointer whose pointee had already been deallocated, then it wouldn't have been incorrect for the compiler to optimize my code to something that does an entirely different thing from what I intended it to – its behavior was undefined, after all.

On the other hand, there is absolutely nothing that leads me to believe that an externally-provided pointer might alias the this pointer in a constructor, especially since the spec specifically states that the this pointer always points to a temporary that was specifically allocated for the constructor call. So there cannot, by definition, be any pointers that (observably) alias this when the constructor is called.

Additionally, I have no knowledge of any rule in C# that would make using a valid pointer in a constructor undefined behavior. In fact, I would even go as far as to say that I don't believe that my program contains any undefined behavior.

So the optimization you propose would – at least for my trivial example program – break the promise made by the spec that calling a constructor through a new-object expression will result in a unique this pointer, even in the absence of undefined behavior. This is not an aggressive optimization or an implementation detail; it's an incorrect optimization which breaks both the spec and people's intuition of how the C# language works.

If you were to transliterate my example's source code to C++ and compile it using a relatively standards-compliant compiler such as gcc or clang, then that'd yield you an executable with the expected behavior. This situation is not unique to C#, but somehow the incorrect optimization seemingly is. Frankly, I think that's just embarrassing to all of us.

In conclusion: I implore you to either do the newobj --> call optimization properly or not do it at all. There's little point in piling yet another layer of heuristics on top of a fundamentally broken optimization technique; that's just a waste of time. If you don't want to go through the pain of analyzing constructor bodies – and believe me, I can sympathize with that – then please do consider just dropping the optimization altogether. Ask yourself this: does the additional copy that a newobj opcode incurs over a call opcode truly have an observable effect on an average program? The only use-case that I know of where the newobj --> call optimization might truly be useful is high-performance computing. You don't seem to be moving in that direction with the C# programming language – at least not for now. So who exactly stands to gain from having this broken optimization anyway? Surely trading correctness for performance – which is exactly what this optimization does – is a no-go, especially in C#?

@jonathanvdc - variable lifetimes is just an example of a language concept that can be completely disrespected in unsafe code. Such violations expose underlying implementation, can result in behavior sensitive to optimizations and can go against the intuition of how C# language works.
Preventing that, when achieved via unsafe code, is not a goal.

The part of the bug that can be reproduced without resorting to unsafe code/pointers/reflection (to the scope described in #16364) is actionable. The rest is not.

@VSadov Hello again! Thanks for your reply!

I'm not entirely convinced by your argument that it's okay for unsafe code to expose arbitrary implementation details. I'm not aware of any statements in the example program I provided that violate the rules of the C# language – if you've spotted any, please do point them out! Short of that, it seems like you're saying that all unsafe code is undefined behavior. I'm fairly sure that's not what the spec says.

I understand that it's a burden to implement and maintain the newobj --> call transformation in a way that does not break the spec for unsafe code. But it's not impossible: I already have a proof-of-concept implementation in my compiler.

Regardless, I think the real question we should be asking ourselves is: "what are the run-time performance implications of the newobj --> call optimization?"

To that end, I've measured the performance of a benchmark with and without the optimization. I took the fractal benchmark from the ecsc test suite (which is actually based on Kevin Frei's fractal benchmark) and tweaked it to print the total time. I also upped its resolution so it'd take longer to compute a result. My modified version can be found here.

I included my input and output below. I used my own compiler in -O2 mode, which produces IL that is similar to csc /optimize+ /debug-. The -fno-optimize-new-struct turns off the newobj --> call optimization. (I didn't use csc because it doesn't have a switch to turn off only the newobj --> call optimization and nothing else – sorry!)

$ mono --version
Mono JIT compiler version 4.8.0 (Stable 4.8.0.495/e4a3cf3 Wed Feb 22 18:30:58 UTC 2017)
Copyright (C) 2002-2014 Novell, Inc, Xamarin Inc and Contributors. www.mono-project.com
    TLS:           __thread
    SIGSEGV:       altstack
    Notifications: epoll
    Architecture:  amd64
    Disabled:      none
    Misc:          softdebug 
    LLVM:          supported, not enabled.
    GC:            sgen

$ ecsc Fractal.cs -platform clr -O2
$ for i in {1..10}; do mono bin/Fractal.exe; done
Mandelbrot:27337929 Julia:25753505 takes 798.667 ms
Mandelbrot:27337929 Julia:25753505 takes 793.24 ms
Mandelbrot:27337929 Julia:25753505 takes 796.536 ms
Mandelbrot:27337929 Julia:25753505 takes 792.53 ms
Mandelbrot:27337929 Julia:25753505 takes 794.116 ms
Mandelbrot:27337929 Julia:25753505 takes 798.246 ms
Mandelbrot:27337929 Julia:25753505 takes 797.948 ms
Mandelbrot:27337929 Julia:25753505 takes 796.128 ms
Mandelbrot:27337929 Julia:25753505 takes 792.201 ms
Mandelbrot:27337929 Julia:25753505 takes 792.807 ms

$ ecsc Fractal.cs -platform clr -O2 -fno-optimize-new-struct
$ for i in {1..10}; do mono bin/Fractal.exe; done
Mandelbrot:27337929 Julia:25753505 takes 781.789 ms
Mandelbrot:27337929 Julia:25753505 takes 784.175 ms
Mandelbrot:27337929 Julia:25753505 takes 780.436 ms
Mandelbrot:27337929 Julia:25753505 takes 776.888 ms
Mandelbrot:27337929 Julia:25753505 takes 775.225 ms
Mandelbrot:27337929 Julia:25753505 takes 777.721 ms
Mandelbrot:27337929 Julia:25753505 takes 781.725 ms
Mandelbrot:27337929 Julia:25753505 takes 777.09 ms
Mandelbrot:27337929 Julia:25753505 takes 783.697 ms
Mandelbrot:27337929 Julia:25753505 takes 778.738 ms

There's only a slight difference in runtimes, but I find it striking that the optimized version is actually slower. I don't really know why that is. Perhaps the JIT does a better job at optimizing the code without the newobj --> call transformation.

But my point is that this kind of evidence is not exactly in favor of the newobj --> call transformation. In fact, I've never seen any evidence that newobj --> call resulted in a significant performance improvement in any reasonably-sized program. And this relatively struct-heavy benchmark shows that newobj --> call can actually diminish the runtime performance of a compiled program.

So here's my suggestion: just drop this optimization and emit newobj whenever new T(...) appears. Leave this optimization to the JIT. That'll solve all of your corner cases: always emitting newobj will work for safe and unsafe code, it'll honor the spec and respect people's intuition of what new does. I'm not aware of any evidence that this would degrade performance in real-life applications (I wouldn't suggest dropping the optimization otherwise) and you couldn't wish for a lower-maintenance solution. What's not to love?

@VSadov I moved this to 15.6 to evaluate if it is fixed by #20087. I am evaluating that now.

@jonathanvdc The original issue appears to be fixed in 15.6 by #20087. If there are other scenarios that you feel should be fixed, please file them as new issues describing exactly how to repro them.

dotnet / roslyn