dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.

JIT: ldflda + add not properly being optimized in certain cases #12436

Open GrabYourPitchforks opened 5 years ago

GrabYourPitchforks commented 5 years ago

More context at https://github.com/dotnet/coreclr/pull/23783, specifically the comment at https://github.com/dotnet/coreclr/pull/23783#discussion_r272806980.

We have code in Memory<T>.Span that attempts to read an IntPtr field at offset 0 from an object. It does this by getting a ref to the field at offset IntPtr.Size, then subtracting IntPtr.Size and dereferencing.
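For illustration, a minimal C# sketch of that pattern, assuming a hypothetical RawData class standing in for the internal raw-data helper (the real code lives in RuntimeHelpers/JitHelpers):

using System;
using System.Runtime.CompilerServices;

static class MethodTableReader
{
    // Hypothetical stand-in: a ref to Data is a ref to the object's first
    // field, which sits at offset IntPtr.Size within the object.
    private sealed class RawData
    {
        public byte Data;
    }

    public static IntPtr ReadMethodTablePointer(object obj)
    {
        // Take a ref to the field at offset IntPtr.Size...
        ref byte firstField = ref Unsafe.As<RawData>(obj).Data;
        // ...then subtract IntPtr.Size and dereference to read offset 0.
        return Unsafe.ReadUnaligned<IntPtr>(
            ref Unsafe.Subtract(ref firstField, IntPtr.Size));
    }
}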

Today that pattern results in this codegen:

; assume rcx := 'obj' and has already been validated as non-null
lea temp64, [rcx + 08h]  ; ref to obj._firstField
mov temp64, qword ptr [temp64 - 08h]  ; deref field at offset 0 in 'obj'
cmp dword ptr [temp64], 0  ; compare *(int*)temp64 < 0

It should ideally result in this codegen:

; assume rcx := 'obj' and has already been validated as non-null
mov temp64, qword ptr [rcx]  ; deref field at offset 0 in 'obj'
cmp dword ptr [temp64], 0  ; compare *(int*)temp64 < 0

This optimization would also obviate the need for the new intrinsic proposed as part of https://github.com/dotnet/coreclr/pull/23783.

category:cq theme:basic-cq skill-level:expert cost:large

AndyAyersMS commented 5 years ago

Negative values are not commonly seen in address computations. This may not be that hard to fix, though we might need to be wary of reassociation issues (as in dotnet/coreclr#23792).

Will put in 3.0 pending further investigation.

AndyAyersMS commented 5 years ago

Looking at a specific example:

using System;
using System.Buffers;
using System.Runtime.CompilerServices;

class X
{
    static int[] a;

    [MethodImpl(MethodImplOptions.NoInlining)]
    public static int Z(Memory<int> m)
    {
        Span<int> s = m.Span;
        return s[4] + 56;
    }

    public static int Main()
    {
        a = new int[] { 0, 1, 2, 3, 44, 5 };
        return Z(new Memory<int>(a));
    }
}

Codegen for Z is

G_M62456_IG02:
       xor      rdi, rdi
       xor      ebx, ebx
       mov      rcx, gword ptr [rsi]    // object field of mem
       test     rcx, rcx
       je       SHORT G_M62456_IG06
       lea      rdx, bword ptr [rcx+8]
       mov      rdx, qword ptr [rdx-8]
       cmp      dword ptr [rdx], 0     // check for has component size
       jge      SHORT G_M62456_IG03
AndyAyersMS commented 5 years ago

Probably not so easy to fix.

Because of the way ObjectHasComponentSize is factored, the jit breaks up what is logically one large expression tree into three parts. The main reason is that when inlining, the jit will not "forward sub" large or complex argument expression trees into their use(s), but instead stages things via temps to make sure side effects are not improperly reordered. So we end up with:

[000286] ------------    *  STMT      void  (IL 0x000...  ???)
[000278] --CXG-------    |  /--*  ADDR      byref 
[000277] --CXG-------    |  |  \--*  FIELD     ubyte  Data
[000115] ------------    |  |     \--*  LCL_VAR   ref    V06 tmp4         
[000285] -ACXG-------    \--*  ASG       byref 
[000284] D------N----       \--*  LCL_VAR   byref  V12 tmp10        

[000302] ------------    *  STMT      void  (IL 0x000...  ???)
[000268] *-CXG-------    |  /--*  IND       long  
[000283] ------------    |  |  |  /--*  LCL_VAR   byref  V12 tmp10        
[000294] ------------    |  |  \--*  ADD       byref 
[000291] ------------    |  |     |  /--*  CAST      long <- int
[000290] ------------    |  |     |  |  \--*  CNS_INT   int    8
[000293] ------------    |  |     \--*  MUL       long  
[000292] ------------    |  |        \--*  CAST      long <- int
[000289] ------------    |  |           \--*  CNS_INT   int    -1
[000301] -ACXG-------    \--*  ASG       long  
[000300] D------N----       \--*  LCL_VAR   long   V13 tmp11        

[000123] ------------    *  STMT      void  (IL 0x000...  ???)
[000122] --C---------    \--*  JTRUE     void  
[000120] ------------       |  /--*  CNS_INT   int    0
[000121] --C---------       \--*  EQ        int   
[000250] ------------          |  /--*  CNS_INT   int    0
[000251] --CXG-------          \--*  LT        int   
[000249] *-CXG-------             \--*  IND       int   
[000299] ----G-------                \--*  FIELD     long   _value
[000298] ------------                   \--*  ADDR      byref 
[000297] ------------                      \--*  LCL_VAR   long   V13 tmp11        

with these inlines:

    [3 IL=0216 TR=000116 0600363D] [aggressive inline attribute] RuntimeHelpers:ObjectHasComponentSize(ref):bool
      [4 IL=0001 TR=000241 0600363E] [aggressive inline attribute] RuntimeHelpers:GetObjectMethodTablePointer(ref):long
        [5 IL=0001 TR=000254 06003624] [below ALWAYS_INLINE size] JitHelpers:GetRawData(ref):byref
          [6 IL=0001 TR=000271 06005745] [aggressive inline attribute] Unsafe:As(ref):ref
        [7 IL=0006 TR=000258 06005746] [aggressive inline attribute] Unsafe:As(byref):byref
        [8 IL=0012 TR=000263 06005747] [aggressive inline attribute] Unsafe:Add(byref,int):byref
      [9 IL=0006 TR=000245 060013D9] [below ALWAYS_INLINE size] IntPtr:op_Explicit(long):long

At that point we're stuck.

During inlining, we might be able to do limited forward sub of a complex arg tree if the callee is a static method with just one arg, that arg has just one use, and the use is the first instruction in the callee. That would eliminate tmp10 in the above, as Unsafe.As has exactly that pattern. But I'm not sure we want to go this route either -- I don't know how common that pattern is.
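In source terms, that limited forward sub would turn the staged form into a single expression. A rough sketch only, reusing the hypothetical RawData stand-in from the earlier sketch (imagine these inside that class; tmp10/tmp11 mirror the temps in the dump above):

static long Staged(object obj)
{
    // What the importer produces today: the complex arg tree is spilled
    // to a temp before the inlined body uses it.
    ref byte tmp10 = ref Unsafe.As<RawData>(obj).Data;
    long tmp11 = Unsafe.ReadUnaligned<long>(
        ref Unsafe.Subtract(ref tmp10, IntPtr.Size));
    return tmp11;
}

static long ForwardSubbed(object obj)
{
    // After substituting the single-use temp into its first use, the
    // +IntPtr.Size / -IntPtr.Size pair lands in one tree and can fold away.
    return Unsafe.ReadUnaligned<long>(
        ref Unsafe.Subtract(ref Unsafe.As<RawData>(obj).Data, IntPtr.Size));
}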

We might also be able to clean this up somewhere else, but that would be a new thing.

Since the sequence above always ends up with a three-level chain of dependent reads (read object, read object's method table, read method table's flag bits) I'm not convinced one extra add makes a lot of difference in overall perf.

Let me think about it a bit more before I move this out of 3.0. Maybe @dotnet/jit-contrib has ideas?

mikedn commented 5 years ago

G_M62456_IG02:
       xor      rdi, rdi
       xor      ebx, ebx
       mov      rcx, gword ptr [rsi]    // object field of mem
       test     rcx, rcx
       je       SHORT G_M62456_IG06
       lea      rdx, bword ptr [rcx+8]
       mov      rdx, qword ptr [rdx-8]
       cmp      dword ptr [rdx], 0     // check for has component size
       jge      SHORT G_M62456_IG03

The simple forward substitution experiment I have generates:

G_M33495_IG02:
       33FF                 xor      rdi, rdi
       33DB                 xor      ebx, ebx
       488B0E               mov      rcx, gword ptr [rsi]
       4885C9               test     rcx, rcx
       7447                 je       SHORT G_M33495_IG06
       488B11               mov      rdx, qword ptr [rcx]
       833A00               cmp      dword ptr [rdx], 0
       7D09                 jge      SHORT G_M33495_IG03

The main problem with what I have now is that it merges arbitrary trees into bigger trees, which IMO increases the chance of a stack overflow in the JIT. The solution would probably be to be more careful and move only "interesting" nodes across trees rather than merging entire trees. Or get rid of GT_LIST, use GenTreeVisitor consistently, and make it non-recursive. That's more work, but merging entire trees may have an additional advantage beyond enabling expression optimizations: it can reduce the number of lclvars.
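For what it's worth, the explicit-stack idea looks roughly like this; a generic C# sketch only, since the actual GenTreeVisitor is recursive C++:

using System;
using System.Collections.Generic;

sealed class Tree
{
    public Tree Op1, Op2;   // children, as on a binary GenTree node
    public string Oper;
}

static class NonRecursiveWalk
{
    // Post-order walk driven by an explicit stack instead of recursion,
    // so arbitrarily deep trees cannot overflow the native stack.
    public static void VisitPostOrder(Tree root, Action<Tree> visit)
    {
        var stack = new Stack<(Tree node, bool childrenDone)>();
        stack.Push((root, false));
        while (stack.Count > 0)
        {
            var (node, childrenDone) = stack.Pop();
            if (childrenDone)
            {
                visit(node);
                continue;
            }
            stack.Push((node, true));                    // revisit after children
            if (node.Op2 != null) stack.Push((node.Op2, false));
            if (node.Op1 != null) stack.Push((node.Op1, false));
        }
    }
}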

mikedn commented 5 years ago

but instead stages things via temps to make sure side effects are not improperly reordered

I'm not sure what the exact issue is in this case, but I find the importer's handling of calls generally suspect. Here's a simple example showing this:

    [MethodImpl(MethodImplOptions.NoInlining)]
    static int A() => 42;
    [MethodImpl(MethodImplOptions.NoInlining)]
    static int B() => 42;
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static int Test() => A() + B();

The importer generates:

***** BB01, stmt 1
               [000004] ------------              *  STMT      void  (IL 0x000...0x00B)
               [000003] -AC-G-------              \--*  ASG       int   
               [000002] D------N----                 +--*  LCL_VAR   int    V01 tmp1         
               [000000] --C-G-------                 \--*  CALL      int    X.A

***** BB01, stmt 2
               [000008] ------------              *  STMT      void  (IL   ???...  ???)
               [000007] --C-G-------              \--*  RETURN    int   
               [000006] --C-G-------                 \--*  ADD       int   
               [000005] ------------                    +--*  LCL_VAR   int    V01 tmp1         
               [000001] --C-G-------                    \--*  CALL      int    X.B

The two calls ended up in separate trees and a temp variable was introduced. It's not clear why the JIT does that. Normal tree traversal order should preserve the original order without having to create two trees. You'd get the order reversed only if GT_ADD ends up with GTF_REVERSE_OPS set, but that should not happen because its operands have side effects.
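For context, C# requires left-to-right evaluation of operands, so reordering the two calls would be observable. A trivial demonstration:

using System;

static class EvalOrder
{
    static int A() { Console.Write("A"); return 40; }
    static int B() { Console.Write("B"); return 2; }

    // Prints "AB42": the language requires A() to be evaluated before B(),
    // so the JIT may not swap the calls when they have side effects.
    public static void Run() => Console.WriteLine(A() + B());
}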

Such tree splitting might be necessary for nested calls:

    [MethodImpl(MethodImplOptions.NoInlining)]
    static int C(int a, int b) => a + b;
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static int Test() => C(A(), B());

for which the importer generates:

***** BB01, stmt 1
               [000004] ------------              *  STMT      void  (IL 0x000...0x00F)
               [000003] -AC-G-------              \--*  ASG       int   
               [000002] D------N----                 +--*  LCL_VAR   int    V01 tmp1         
               [000000] --C-G-------                 \--*  CALL      int    X.A

***** BB01, stmt 2
               [000010] ------------              *  STMT      void  (IL   ???...  ???)
               [000009] --C-G-------              \--*  RETURN    int   
               [000006] --C-G-------                 \--*  CALL      int    X.C
               [000005] ------------ arg0               +--*  LCL_VAR   int    V01 tmp1         
               [000001] --C-G------- arg1               \--*  CALL      int    X.B

but even this is questionable, at least on platforms with fixed out args, where the call argument ordering is more flexible compared to x86.

briansull commented 5 years ago

It probably would be easy to teach ValueNumbering to do a few arithmetic identities like:
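Presumably identities of the reassociate-and-fold kind that would collapse the +8/-8 pair in the codegen above, along the lines of:

(x + c1) + c2  ==>  x + (c1 + c2)
x + 0          ==>  x

With c1 = 8 and c2 = -8, the address computation folds back to x.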

AndyAyersMS commented 5 years ago

Such tree splitting might be necessary for nested calls

I changed importer spill behavior in dotnet/coreclr#7923, and it was a correctness fix for a nested case.

There is also some benefit to behaving the same for inline candidates (where we always spill) and non-inline candidates (where we almost always spill), as inline candidacy is somewhat arbitrary (for example, you wouldn't expect codegen to change by adding a noinline attribute to a method that was not getting inlined anyway).

AndyAyersMS commented 5 years ago

Fixing this seems out of scope for 3.0, so will move to future.

AndyAyersMS commented 2 years ago

6.0 codegen seems better here (via SharpLab):

and jitdump (7.0):

G_M50085_IG02:              ;; offset=0017H
       33FF                 xor      edi, edi
       33DB                 xor      ebx, ebx
       488B2E               mov      rbp, gword ptr [rsi]
       4885ED               test     rbp, rbp
       0F84B8000000         je       G_M50085_IG08
                                                ;; bbWeight=1    PerfScore 3.75
G_M50085_IG03:              ;; offset=0027H
       488B5500             mov      rdx, qword ptr [rbp]
       F70200000080         test     dword ptr [rdx], 0xFFFFFFFF80000000

Not sure when this improved.