GrabYourPitchforks opened this issue 5 years ago
Negative values are not commonly seen in address computations. This may not be that hard to fix, though we might need to be wary of reassociation issues (as in dotnet/coreclr#23792).
Will put in 3.0 pending further investigation.
Looking at a specific example:
using System;
using System.Buffers;
using System.Runtime.CompilerServices;

class X
{
    static int[] a;

    [MethodImpl(MethodImplOptions.NoInlining)]
    public static int Z(Memory<int> m)
    {
        Span<int> s = m.Span;
        return s[4] + 56;
    }

    public static int Main()
    {
        a = new int[] { 0, 1, 2, 3, 44, 5 };
        return Z(new Memory<int>(a));
    }
}
Codegen for Z is:
G_M62456_IG02:
xor rdi, rdi
xor ebx, ebx
mov rcx, gword ptr [rsi] // object field of mem
test rcx, rcx
je SHORT G_M62456_IG06
lea rdx, bword ptr [rcx+8]
mov rdx, qword ptr [rdx-8]
cmp dword ptr [rdx], 0 // check for has component size
jge SHORT G_M62456_IG03
Probably not so easy to fix.
Because of the way ObjectHasComponentSize is factored, the jit breaks up what is logically one large expression tree into three parts... the main reason being that when inlining, the jit will not "forward sub" large or complex argument expression trees into their use(s), but instead stages things via temps to make sure side effects are not improperly reordered. So we end up with:
[000286] ------------ * STMT void (IL 0x000... ???)
[000278] --CXG------- | /--* ADDR byref
[000277] --CXG------- | | \--* FIELD ubyte Data
[000115] ------------ | | \--* LCL_VAR ref V06 tmp4
[000285] -ACXG------- \--* ASG byref
[000284] D------N---- \--* LCL_VAR byref V12 tmp10
[000302] ------------ * STMT void (IL 0x000... ???)
[000268] *-CXG------- | /--* IND long
[000283] ------------ | | | /--* LCL_VAR byref V12 tmp10
[000294] ------------ | | \--* ADD byref
[000291] ------------ | | | /--* CAST long <- int
[000290] ------------ | | | | \--* CNS_INT int 8
[000293] ------------ | | \--* MUL long
[000292] ------------ | | \--* CAST long <- int
[000289] ------------ | | \--* CNS_INT int -1
[000301] -ACXG------- \--* ASG long
[000300] D------N---- \--* LCL_VAR long V13 tmp11
[000123] ------------ * STMT void (IL 0x000... ???)
[000122] --C--------- \--* JTRUE void
[000120] ------------ | /--* CNS_INT int 0
[000121] --C--------- \--* EQ int
[000250] ------------ | /--* CNS_INT int 0
[000251] --CXG------- \--* LT int
[000249] *-CXG------- \--* IND int
[000299] ----G------- \--* FIELD long _value
[000298] ------------ \--* ADDR byref
[000297] ------------ \--* LCL_VAR long V13 tmp11
with these inlines:
[3 IL=0216 TR=000116 0600363D] [aggressive inline attribute] RuntimeHelpers:ObjectHasComponentSize(ref):bool
[4 IL=0001 TR=000241 0600363E] [aggressive inline attribute] RuntimeHelpers:GetObjectMethodTablePointer(ref):long
[5 IL=0001 TR=000254 06003624] [below ALWAYS_INLINE size] JitHelpers:GetRawData(ref):byref
[6 IL=0001 TR=000271 06005745] [aggressive inline attribute] Unsafe:As(ref):ref
[7 IL=0006 TR=000258 06005746] [aggressive inline attribute] Unsafe:As(byref):byref
[8 IL=0012 TR=000263 06005747] [aggressive inline attribute] Unsafe:Add(byref,int):byref
[9 IL=0006 TR=000245 060013D9] [below ALWAYS_INLINE size] IntPtr:op_Explicit(long):long
At that point we're stuck.

During inlining, we might be able to do limited forward sub of a complex arg tree, if the callee is a static method with just one arg, and that arg has just one use, and that use is the first instruction in the callee. That would eliminate tmp10 in the above, as Unsafe.As has exactly that pattern. But I'm not sure we want to go this route either -- I don't know how common that pattern is.
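For concreteness, here is a minimal C# sketch of the callee shape described above (hypothetical names; only Unsafe.As itself comes from the discussion): a static method with a single argument whose only use is the first thing the body does, so a complex argument tree could be substituted straight into the use without reordering side effects.

using System.Runtime.CompilerServices;

static class ForwardSubSketch
{
    // Single arg, single use, and that use is the first instruction of the
    // body -- the same shape as Unsafe.As. Forward-subbing the argument
    // tree from the call site into `value` cannot reorder side effects.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    static long Passthrough(long value) => value;

    // Call site with a complex argument tree that could be forward-subbed.
    static long Caller(long[] a) => Passthrough(a[0] + a[1] * 3);
}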
We might also be able to clean this up somewhere else, but that would be a new thing.
Since the sequence above always ends up with a three-level chain of dependent reads (read object, read object's method table, read method table's flag bits) I'm not convinced one extra add makes a lot of difference in overall perf.
Let me think about it a bit more before I move this out of 3.0. Maybe @dotnet/jit-contrib has ideas?
G_M62456_IG02:
xor rdi, rdi
xor ebx, ebx
mov rcx, gword ptr [rsi] // object field of mem
test rcx, rcx
je SHORT G_M62456_IG06
lea rdx, bword ptr [rcx+8]
mov rdx, qword ptr [rdx-8]
cmp dword ptr [rdx], 0 // check for has component size
jge SHORT G_M62456_IG03
The simple forward substitution experiment I have generates:
G_M33495_IG02:
33FF xor rdi, rdi
33DB xor ebx, ebx
488B0E mov rcx, gword ptr [rsi]
4885C9 test rcx, rcx
7447 je SHORT G_M33495_IG06
488B11 mov rdx, qword ptr [rcx]
833A00 cmp dword ptr [rdx], 0
7D09 jge SHORT G_M33495_IG03
The main problem with what I have now is that it merges arbitrary trees into bigger trees. IMO that increases the chance of a stack overflow in the JIT. The solution would probably be to be more careful and not merge entire trees but just move "interesting" nodes across the trees. Or get rid of GT_LIST, use GenTreeVisitor consistently, and make it non-recursive. That's more work, but merging entire trees may have an additional advantage beyond enabling additional expression optimizations: it can reduce the number of lclvars.
but instead stages things via temps to make sure side effects are not improperly reordered
I'm not sure what the exact issue is in this case but I find the importer handling of calls generally suspect. Here's a simple example showing this:
[MethodImpl(MethodImplOptions.NoInlining)]
static int A() => 42;
[MethodImpl(MethodImplOptions.NoInlining)]
static int B() => 42;
[MethodImpl(MethodImplOptions.NoInlining)]
public static int Test() => A() + B();
The importer generates:
***** BB01, stmt 1
[000004] ------------ * STMT void (IL 0x000...0x00B)
[000003] -AC-G------- \--* ASG int
[000002] D------N---- +--* LCL_VAR int V01 tmp1
[000000] --C-G------- \--* CALL int X.A
***** BB01, stmt 2
[000008] ------------ * STMT void (IL ???... ???)
[000007] --C-G------- \--* RETURN int
[000006] --C-G------- \--* ADD int
[000005] ------------ +--* LCL_VAR int V01 tmp1
[000001] --C-G------- \--* CALL int X.B
The 2 calls ended up in separate trees and a temp variable was introduced. It's not clear why the JIT does that. Normal tree traversal order should preserve the original order without having to create 2 trees. You'd get the order reversed only if GT_ADD ends up with GTF_REVERSE_OPS being set, but that should not happen because its operands have side effects.
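As a reminder of why that order is fixed: C# requires left-to-right evaluation of operands, so when both operands have side effects the JIT may not swap them. A minimal sketch:

using System;

static class OrderDemo
{
    static int A() { Console.Write("A"); return 1; }
    static int B() { Console.Write("B"); return 2; }

    // Must print "AB": the language guarantees A() is evaluated before B(),
    // so GTF_REVERSE_OPS must not be set on the add.
    static int Test() => A() + B();
}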
Such tree splitting might be necessary for nested calls:
[MethodImpl(MethodImplOptions.NoInlining)]
static int C(int a, int b) => a + b;
[MethodImpl(MethodImplOptions.NoInlining)]
public static int Test() => C(A(), B());
***** BB01, stmt 1
[000004] ------------ * STMT void (IL 0x000...0x00F)
[000003] -AC-G------- \--* ASG int
[000002] D------N---- +--* LCL_VAR int V01 tmp1
[000000] --C-G------- \--* CALL int X.A
***** BB01, stmt 2
[000010] ------------ * STMT void (IL ???... ???)
[000009] --C-G------- \--* RETURN int
[000006] --C-G------- \--* CALL int X.C
[000005] ------------ arg0 +--* LCL_VAR int V01 tmp1
[000001] --C-G------- arg1 \--* CALL int X.B
But even this is questionable, at least on platforms with fixed out args, where call argument ordering is more flexible than on x86.
It probably would be easy to teach ValueNumbering to do a few arithmetic identities like reassociating (x + c1) + c2 into x + (c1 + c2), which is the fold needed to cancel the +8/-8 pair above.
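In C# terms, a hedged sketch of that identity (illustration only, not JIT code):

static class Reassociate
{
    // (x + c1) + c2 == x + (c1 + c2) in unchecked two's complement
    // arithmetic, so the identity holds even when intermediate sums wrap.
    // With c1 = 8 and c2 = -8 the whole offset computation folds away.
    static long Before(long x) => (x + 8) + (-8);
    static long After(long x) => x; // what value numbering could prove
}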
Such tree splitting might be necessary for nested calls
I changed importer spill behavior in dotnet/coreclr#7923, and it was a correctness fix for a nested case.
There is also some benefit to behaving the same for inline candidates (where we always spill) and non-inline candidates (where we almost always spill), as inline candidacy is somewhat arbitrary (for example, you wouldn't expect codegen to change by adding a noinline attribute to a method that was not getting inlined anyways).
Fixing this seems out of scope for 3.0, so will move to future.
6.0 codegen seems better here (via sharplab), and the jitdump (7.0) shows:
G_M50085_IG02: ;; offset=0017H
33FF xor edi, edi
33DB xor ebx, ebx
488B2E mov rbp, gword ptr [rsi]
4885ED test rbp, rbp
0F84B8000000 je G_M50085_IG08
;; bbWeight=1 PerfScore 3.75
G_M50085_IG03: ;; offset=0027H
488B5500 mov rdx, qword ptr [rbp]
F70200000080 test dword ptr [rdx], 0xFFFFFFFF80000000
Not sure when this improved.
More context at https://github.com/dotnet/coreclr/pull/23783, specifically the comment at https://github.com/dotnet/coreclr/pull/23783#discussion_r272806980.
We have code in Memory<T>.Span that attempts to read an IntPtr field at offset 0 from an object. The way this is done is by getting a ref to the field at offset IntPtr.Size, then subtracting IntPtr.Size and dereferencing.

Today it results in this codegen:

lea rdx, bword ptr [rcx+8]
mov rdx, qword ptr [rdx-8]

It should ideally result in this codegen:

mov rdx, qword ptr [rcx]
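For illustration, a hedged C# sketch of that read pattern (hypothetical helper names; assumes a RawData-style class whose first field sits at offset IntPtr.Size, right after the method table pointer):

using System;
using System.Runtime.CompilerServices;

static class RawObjectReader
{
    // Assumption: mirrors the runtime's internal RawData helper; the Data
    // field lands at offset IntPtr.Size within the object.
    sealed class RawData { public byte Data; }

    static IntPtr ReadFieldAtOffsetZero(object o)
    {
        // Ref to the byte at offset IntPtr.Size...
        ref byte data = ref Unsafe.As<RawData>(o).Data;
        // ...then step back by IntPtr.Size and dereference offset 0.
        // Today the JIT emits the +size and -size steps separately
        // instead of folding them into a single [rcx] load.
        return Unsafe.ReadUnaligned<IntPtr>(ref Unsafe.Subtract(ref data, IntPtr.Size));
    }
}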
This optimization would also obviate the need for the new intrinsic proposed as part of https://github.com/dotnet/coreclr/pull/23783.
category:cq theme:basic-cq skill-level:expert cost:large