Open adamsitnik opened 4 years ago
@kunalspathak please look into this.
CC @dotnet/jit-contrib
Noting down some observations, not necessarily related to the regression.
For SliceSlice()
because of following code:
https://github.com/dotnet/runtime/blob/6072e4d3a7a2a1493f514cdf4be75a3d56580e84/src/coreclr/src/jit/importer.cpp#L1917
Another observation is in .NET 3.1, we didn't JIT SequenceEqual(), but I see that we JIT that method in .NET 5. It is contrary to the fact that it was marked with AggresiveOptimization
in the past and was removed in https://github.com/dotnet/runtime/pull/32371 to get it JIT during R2R.
The benchmarks that regressed operates on Slice<string>
and Slice<byte>
. The calls to Slice<string>
don't get inlined as expected, but the ones for Slice<byte>
gets inlined. My following analysis is for benchmarks for Slice<string>
particularly ReadOnlyMemoryStart() but most likely it is the cause for other benchmarks for Slice<string>
. I will share my findings for benchmarks under Slice<byte>
separately, once I do that analysis.
The jit assembly for Slice()
is unchanged, but the regression is inside the benchmark code ReadOnlyMemoryStart()
. The underlying issue is that in .NET 3.1, we do a null check for obj
and if it is null, call HELPER_RUNTIMELOOKUP
. If not, we will just operate on that object. This is how we generated code in .NET 3.1 for such checks:
G_M18749_IG07:
add x0, fp, #32 // [V01 loc0]
mov w2, #5
bl ReadOnlyMemory`1:Slice(int):struct:this
str x0, [fp,#16] // [V02 loc1]
str x1, [fp,#24] // [V02 loc1+0x08]
mov x0, x20
ldr x2, [x21,#16]
cbnz x2, G_M18749_IG08 ; <----------- This condition checks if obj != null
movz x1, #0xd1ffab1e
movk x1, #0xd1ffab1e LSL #16
movk x1, #0xd1ffab1e LSL #32
bl CORINFO_HELP_RUNTIMEHANDLE_CLASS
mov x2, x0
G_M18749_IG08:
add x1, fp, #16 // [V02 loc1]
mov x0, x2
bl Slice`1:Consume(byref)
mov x0, x20
ldr x1, [x21,#24]
cbnz x1, G_M18749_IG09
movz x1, #0xd1ffab1e
movk x1, #0xd1ffab1e LSL #16
movk x1, #0xd1ffab1e LSL #32
bl CORINFO_HELP_RUNTIMEHANDLE_CLASS
mov x1, x0
For happy path (where obj != null
), we would jump to G_M18749_IG08
and use the obj
. Otherwise, we would call CORINFO_HELP_RUNTIMEHANDLE_CLASS
. However, in .NET 5, we have flipped the condition of this check:
G_M6359_IG14:
add x0, fp, #64 // [V01 loc0]
mov w2, #5
bl ReadOnlyMemory`1:Slice(int):ReadOnlyMemory`1:this
str x0, [fp,#48] // [V02 loc1]
str x1, [fp,#56] // [V02 loc1+0x08]
mov x0, x20
ldr x1, [x21,#24]
cbz x1, G_M6359_IG16 ; <----------- This condition now checks if obj == null
;; bbWeight=1 PerfScore 8.50
G_M6359_IG15:
str x1, [fp,#32] // [V14 tmp11]
b G_M6359_IG17
;; bbWeight=0.25 PerfScore 0.50
G_M6359_IG16:
movz x1, #0xd1ffab1e
movk x1, #0xd1ffab1e LSL #16
movk x1, #0xd1ffab1e LSL #32
bl CORINFO_HELP_RUNTIMEHANDLE_CLASS
str x0, [fp,#32] // [V14 tmp11]
;; bbWeight=0.25 PerfScore 0.88
G_M6359_IG17:
add x1, fp, #48 // [V02 loc1]
ldr x0, [fp,#32] // [V14 tmp11]
bl Slice`1:Consume(byref)
mov x0, x20
ldr x1, [x21,#32]
cbz x1, G_M6359_IG19
Now, in happy path, we check for obj == null
condition and if true, would call CORINFO_HELP_RUNTIMEHANDLE_CLASS
, otherwise, we would goto G_M6359_IG15
and do a jump to G_M6359_IG17
. Not only we introduce extra jumps in .NET 5, we also have to spill the values on stack. The local frame size in .NET 3.1 is just 40 vs. 168 for .NET 5.0.
The branches in G_M6359_IG15
and G_M6359_IG16
are weighted same as 0.25
. As per @AndyAyersMS , I tried to bump up the weight of non-call branch higher, but that doesn't change the condition. I will further see what else we can do here.
Edit:
In .NET 3.1, the condition would just have 1 arm and so it gets flipped, however in .NET 5, we have 2 arms on that condition, one coming from "expandable generic dictionaries" work that we did in .NET 5, so most like given that, bumping the non-call branch won't make a difference.
.NET 3.1 tree:
[000163] ------------ * STMT void (IL ???... ???)
[000162] -AC-G------- \--* ASG long
[000161] D------N---- +--* LCL_VAR long V13 tmp9
[000160] --C-G------- \--* QMARK long
[000157] Q----------- if +--* NE int
[000150] ------------ | +--* LCL_VAR long V13 tmp9
[000156] ------------ | \--* CNS_INT long 0
[000159] --C-G------- if \--* COLON long
[000155] --C-G------- else +--* CALL help long HELPER.CORINFO_HELP_RUNTIMEHANDLE_CLASS
[000138] ------------ arg0 | +--* LCL_VAR long V12 tmp8
[000152] ------------ arg1 | \--* CNS_INT(h) long 0xd1ffab1e token
[000158] ------------ then \--* NOP void
.NET 5.0 tree
[000109] -AC-G+------ * ASG long
[000108] D----+-N---- +--* LCL_VAR long V10 tmp7
[000107] --C-G+------ \--* QMARK long
[000097] J----+-N---- if +--* EQ int
[000093] n----+------ | +--* IND long
[000092] -----+------ | | \--* ADD long
[000090] #----+------ | | +--* IND long
[000089] #----+------ | | | \--* IND long
[000088] -----+------ | | | \--* ADD long
[000086] -----+------ | | | +--* LCL_VAR long V09 tmp6
[000087] -----+------ | | | \--* CNS_INT long 48
[000091] -----+------ | | \--* CNS_INT long 24
[000096] -----+------ | \--* CNS_INT long 0
[000106] --C-G+?----- if \--* COLON long
[000095] --C-G+?----- else +--* CALL help long HELPER.CORINFO_HELP_RUNTIMEHANDLE_CLASS
[000085] -----+?----- arg0 in x0 | +--* LCL_VAR long V09 tmp6
[000094] -----+?----- arg1 in x1 | \--* CNS_INT(h) long 0x7ffb0671e8e8 token
[000098] n----+?----- then \--* IND long
[000099] -----+?----- \--* ADD long
[000100] #----+?----- +--* IND long
[000101] #----+?----- | \--* IND long
[000102] -----+?----- | \--* ADD long
[000103] -----+?----- | +--* LCL_VAR long V09 tmp6
[000104] -----+?----- | \--* CNS_INT long 48
[000105] -----+?----- \--* CNS_INT long 24
Here is my analysis for benchmarks in Slice<byte>
. The regression is coming from the inlined Slice<byte>()
and the ctor for ReadOnlyMemory<byte>
present inside Slice()
as seen here.
Below is the assembly code for
Consume(memory.Slice(Size / 2)); // private const int Size = 10;
In .NET 3.1, here is disassembly. IG06
does this check and if within bounds, proceed to creating the ReadOnlyMemory()
object in IG07
as seen here.
G_M45388_IG06:
B94037B3 ldr w19, [fp,#52] // [V01 loc0+0x0c]
7100167F cmp w19, #5
540005A3 blo G_M45388_IG12
G_M45388_IG07:
F94017B4 ldr x20, [fp,#40] // [V01 loc0]
AA1403E0 mov x0, x20
51001675 sub w21, w19, #5
2A1503E1 mov w1, w21
528000A2 mov w2, #5
F9000FA0 str x0, [fp,#24] // [V21 tmp18] ; this._object = _object
B90023A2 str w2, [fp,#32] // [V22 tmp19] ; this._index = _index
B90027A1 str w1, [fp,#36] // [V23 tmp20] ; this._length = _length
910063A0 add x0, fp, #24 // [V02 loc1]
94000000 bl Slice`1:Consume(byref)
7100167F cmp w19, #5
54000483 blo G_M45388_IG13
In .NET 5, here is disassembly. There are more memory access in IG06
(4 vs. 7).
G_M6359_IG05:
B9402FA0 ldr w0, [fp,#44] // [V23 tmp20]
7100141F cmp w0, #5
540006C3 blo G_M6359_IG11
;; bbWeight=0.50 PerfScore 1.75
G_M6359_IG06:
B9402BA0 ldr w0, [fp,#40] // [V22 tmp19]
11001400 add w0, w0, #5 ; <-- In .NET 3.1, we constant prop value of _index and turns this to " = 5"
B9402FA1 ldr w1, [fp,#44] // [V23 tmp20] ; <-- could have skipped because we already load it in IG05, the way it is skipped in .NET 3.1
51001421 sub w1, w1, #5
F94013A2 ldr x2, [fp,#32] // [V21 tmp18]
F9000BA2 str x2, [fp,#16] // [V24 tmp21] ; this._object = _object
B9001BA0 str w0, [fp,#24] // [V25 tmp22] ; this._index = _index
B9001FA1 str w1, [fp,#28] // [V26 tmp23] ; this._length = _length
910043A0 add x0, fp, #16 // [V02 loc1]
94000000 bl Slice`1:Consume(byref)
B9402FA0 ldr w0, [fp,#44] // [V23 tmp20] ; <-- could have been skipped
7100141F cmp w0, #5
54000523 blo G_M6359_IG11
I believe some of them are happening because we fail to constant propagate the value of _index
which is zero. Instead we load the value from stack and add 5
to it. In .NET 3.1, we detected this and converted it to assigment.
.NET 3.1 dump:
***** BB05, stmt 11 (before)
N005 ( 5, 6) [000118] -A------R--- * ASG int
N004 ( 1, 1) [000117] D------N---- +--* LCL_VAR int V07 tmp4 d:2
N003 ( 5, 6) [000073] ------------ \--* ADD int
N001 ( 3, 4) [000070] ------------ +--* LCL_FLD int V01 loc0 u:6[+8] Fseq[_index]
N002 ( 1, 1) [000072] ------------ \--* CNS_INT int 5
VNApplySelectors:
VNForHandle(_index) is $142, fieldType is int
AX2: $142 != $143 ==> select([$283]store($281, $143, $2c1), $142) ==> select($281, $142).
AX1: select([$241]store($281, $142, $40), $142) ==> $40.
VNForMapSelect($380, $142):int returns $40 {IntCns 0}
VNApplySelectors:
VNForHandle(_index) is $142, fieldType is int
VNForMapSelect($380, $142):int returns $40 {IntCns 0}
N001 [000070] LCL_FLD V01 loc0 u:6[+8] Fseq[_index] => $40 {IntCns 0}
N002 [000072] CNS_INT 5 => $42 {IntCns 5}
N003 [000073] ADD => $42 {IntCns 5}
N004 [000117] LCL_VAR V07 tmp4 d:2 => $42 {IntCns 5}
N005 [000118] ASG => $42 {IntCns 5}
***** BB05, stmt 11 (after)
N005 ( 5, 6) [000118] -A------R--- * ASG int $42
N004 ( 1, 1) [000117] D------N---- +--* LCL_VAR int V07 tmp4 d:2 $42
N003 ( 5, 6) [000073] ------------ \--* ADD int $42
N001 ( 3, 4) [000070] ------------ +--* LCL_FLD int V01 loc0 u:6[+8] Fseq[_index] $40
N002 ( 1, 1) [000072] ------------ \--* CNS_INT int 5 $42
.NET 5.0 dump:
***** BB04, STMT00019(before)
N005 ( 3, 4) [000092] -A------R--- * ASG int
N004 ( 1, 1) [000091] D------N---- +--* LCL_VAR int V07 tmp4 d:2
N003 ( 3, 4) [000058] ------------ \--* ADD int
N001 ( 1, 1) [000056] ------------ +--* LCL_VAR int V10 tmp7
N002 ( 1, 2) [000057] ------------ \--* CNS_INT int 5
N001 [000056] LCL_VAR V10 tmp7 => $282 {282}
N002 [000057] CNS_INT 5 => $42 {IntCns 5}
N003 [000058] ADD => $206 {ADD($42, $282)}
N004 [000091] LCL_VAR V07 tmp4 d:2 => $206 {ADD($42, $282)}
N005 [000092] ASG => $206 {ADD($42, $282)}
***** BB04, STMT00019(after)
N005 ( 3, 4) [000092] -A------R--- * ASG int $206
N004 ( 1, 1) [000091] D------N---- +--* LCL_VAR int V07 tmp4 d:2 $206
N003 ( 3, 4) [000058] ------------ \--* ADD int $206
N001 ( 1, 1) [000056] ------------ +--* LCL_VAR int V10 tmp7 $282
N002 ( 1, 2) [000057] ------------ \--* CNS_INT int 5 $42
I am still trying to see why we fail to detect _index
being constant.
@CarolEidt pointed out that in .NET 3.1, this._index
remains LCL_FLD
and gets to SSA form while in .NET 5, we do struct promotion, but because of this condition, we don't convert it to SSA form. Because of this, we don't do constant propagation. We need to support multi-reg defs for SSA and might not be something we want to fix as part of perf regression fix. We should port this to .NET 6 (at least for benchmarks under Slice<byte>()
).
Also, it turns out that the fix for Slice<string>()
needs more thinking given the introduction of runtime lookup in .NET 5. We need to have a broader fix for this which should be done in .NET 6
Here are the benchmarks that regressed (taken from the description but filtered just the ones that regressed)
Class | Method | .NET5 / .NET 3.0 ratio |
---|---|---|
Slice |
ReadOnlyMemoryStart | 2.36 |
Slice |
MemoryStart | 1.97 |
Slice |
ReadOnlyMemoryStartLength | 1.81 |
Slice |
ReadOnlySpanStartLength | 1.77 |
Slice |
ReadOnlySpanStart | 1.75 |
Slice |
MemoryStart | 1.7 |
Slice |
ReadOnlyMemoryStart | 1.68 |
Slice |
MemoryStartSpan | 1.64 |
Slice |
ReadOnlyMemoryStartLength | 1.61 |
Slice |
ReadOnlyMemorySpanStartLength | 1.61 |
Slice |
MemoryStartLengthSpan | 1.6 |
Slice |
MemoryStartLength | 1.55 |
Slice |
ReadOnlyMemoryStartLengthSpan | 1.55 |
Slice |
MemoryStartLength | 1.5 |
Slice |
ReadOnlyMemoryStartSpan | 1.45 |
Slice |
SpanStartLength | 1.44 |
Slice |
SpanStart | 1.43 |
Slice |
ReadOnlyMemoryStartSpan | 1.43 |
Slice |
ReadOnlyMemorySpanStart | 1.37 |
Slice |
MemoryStartSpan | 1.36 |
Slice |
ReadOnlyMemoryStartLengthSpan | 1.31 |
Slice |
MemorySpanStart | 1.31 |
Slice |
MemoryStartLengthSpan | 1.3 |
Slice |
MemorySpanStartLength | 1.21 |
Slice |
ReadOnlyMemorySpanStart | 1.12 |
Slice |
ReadOnlyMemorySpanStartLength | 1.03 |
Slice |
MemorySpanStart | 1.02 |
Actionable item: Need to double check the numbers if we did anything in .NET 6.0 to address https://github.com/dotnet/runtime/issues/41704#issuecomment-687392166.
Here is my analysis after comparing the assembly of .NET 6 vs. .NET 3.1. Here is the diff for ReadOnlyMemory.Slice<byte>
benchmark.
mov x1, #0
str x1, [fp,#48] // [V133 tmp130]
str w1, [fp,#56] // [V134 tmp131]
str w1, [fp,#60] // [V135 tmp132]
b G_M31903_IG05
Update: This PR is out - #52269
I noticed redundant ldr [fp, #48]
and they could have been replaced with mov x1, x19
. It can vary from case to case depending on the register availability, but clearly in below case, CSE could have avoided extra load from memory.
blo G_M31903_IG88
ldr x19, [fp,#48] // [V133 tmp130]
ldr w1, [fp,#56] // [V134 tmp131]
add w20, w1, #5
ldr w1, [fp,#60] // [V135 tmp132]
sub w21, w1, #5
ldr x1, [fp,#48] // [V133 tmp130]
cbz x1, G_M31903_IG13
Update: Related issue: https://github.com/dotnet/runtime/issues/6761
._object
fieldWe load value of ._object
again and again in .NET6
ldr x19, [fp,#48] // [V133 tmp130]
...
...
str x19, [fp,#32] // [V136 tmp133]
...
; occurs 16 times
But in .NET 3.1 we load it in x20
once and do mov x21, x20
to store it in final destination, although instead of mov, we could have just done str x20, [fp, #24]
.
ldr x20, [fp,#40] // [V01 loc0]
mov x21, x20
...
str x21, [fp,#24] // [V149 tmp146]
...
...
mov x21, x20
...
str x21, [fp,#24] // [V149 tmp146]
...
occurs 16 times.
Next, we still have problem of not const proping 5
in .NET 6 as discussed in https://github.com/dotnet/runtime/issues/41704#issuecomment-687392166.
ldr w1, [fp,#56] // [V134 tmp131]
add w20, w1, #5
...
occurs 16 times.
But in .NEt3.1, we const prop and directly move 5
:
mov w0, #5
str x21, [fp,#24] // [V149 tmp146] ; ._object
str w0, [fp,#32] // [V150 tmp147]
sub
operationLastly, in .NET 6, we do not CSE _length - start
and do sub
every time.
ldr w1, [fp,#60] // [V135 tmp132]
sub w21, w1, #5
...
str w21, [fp,#44] // [V138 tmp135]
...
...
sub w21, w1, #5
...
str w21, [fp,#44] // [V138 tmp135]
...
occurs 15 times
But in .NET3.1, we do subtraction once and CSE the result.
sub w22, w19, #5
mov w23, w22
...
str w23, [fp,#36] // [V151 tmp148]
...
...
mov w23, w22
...
str w23, [fp,#36] // [V151 tmp148]
...
15 times
I will try to investigate little more and open separate issues for each of them.
So most of the regression is happening due to CSE not working for add and sub. During value numbering, we name V22
with unique VN making it not possible to CSE Add(V22, 5)
operation.
***** BB05, STMT00027(before)
N005 ( 3, 4) [000138] -A------R--- * ASG int
N004 ( 1, 1) [000137] D------N---- +--* LCL_VAR int V08 tmp5 d:2
N003 ( 3, 4) [000069] ------------ \--* ADD int
N001 ( 1, 1) [000067] ------------ +--* LCL_VAR int V22 tmp19
N002 ( 1, 2) [000068] ------------ \--* CNS_INT int 5
N001 [000067] LCL_VAR V22 tmp19 => $242 {242}
N002 [000068] CNS_INT 5 => $42 {IntCns 5}
N003 [000069] ADD => $206 {ADD($42, $242)}
N004 [000137] LCL_VAR V08 tmp5 d:2 => $206 {ADD($42, $242)}
N005 [000138] ASG => $206 {ADD($42, $242)}
***** BB05, STMT00027(after)
N005 ( 3, 4) [000138] -A------R--- * ASG int $206
N004 ( 1, 1) [000137] D------N---- +--* LCL_VAR int V08 tmp5 d:2 $206
N003 ( 3, 4) [000069] ------------ \--* ADD int $206
N001 ( 1, 1) [000067] ------------ +--* LCL_VAR int V22 tmp19 $242
N002 ( 1, 2) [000068] ------------ \--* CNS_INT int 5 $42
====================================================================================================================================================================
***** BB12, STMT00048(before)
N005 ( 3, 4) [000250] -A------R--- * ASG int
N004 ( 1, 1) [000249] D------N---- +--* LCL_VAR int V16 tmp13 d:2
N003 ( 3, 4) [000181] ------------ \--* ADD int
N001 ( 1, 1) [000179] ------------ +--* LCL_VAR int V22 tmp19
N002 ( 1, 2) [000180] ------------ \--* CNS_INT int 5
N001 [000179] LCL_VAR V22 tmp19 => $24d {24d}
N002 [000180] CNS_INT 5 => $42 {IntCns 5}
N003 [000181] ADD => $20f {ADD($42, $24d)}
N004 [000249] LCL_VAR V16 tmp13 d:2 => $20f {ADD($42, $24d)}
N005 [000250] ASG => $20f {ADD($42, $24d)}
***** BB12, STMT00048(after)
N005 ( 3, 4) [000250] -A------R--- * ASG int $20f
N004 ( 1, 1) [000249] D------N---- +--* LCL_VAR int V16 tmp13 d:2 $20f
N003 ( 3, 4) [000181] ------------ \--* ADD int $20f
N001 ( 1, 1) [000179] ------------ +--* LCL_VAR int V22 tmp19 $24d
N002 ( 1, 2) [000180] ------------ \--* CNS_INT int 5 $42
The reason is what I mentioned in https://github.com/dotnet/runtime/issues/41704#issuecomment-687392166. Since it is related to the struct multireg return, I would assign this to @sandreenko .
Multi-def SSA is a large work item that won't happen in 6.0. Marking it as Future for now.
After running benchmarks for 3.1 vs 5.0 using "Ubuntu arm64 Qualcomm Machines" owned by the JIT Team, I've found few regressions related to slicing.
It looks like these are ARM64 specific regressions, I was not able to reproduce it for ARM (the 32-bit variant).
Repro
@kunalspathak is there any chance you could take a look at the produced assembly code and verify if this is an actual regression in code gen or not?
category:cq theme:ssa skill-level:expert cost:large