Closed EgorBo closed 1 year ago
Running with DOTNET_JitCloneLoops=0
to simplify things a little.
The problem goes away if optJumpThreadPhi
is disabled.
The problem goes away if CSE is diabled.
The actual bug is caused by this CSE:
Considering CSE #02 {$212, $28a} [def=52.182163, use=52.182163, cost= 6, call]
CSE Expression :
N006 ( 6, 6) CSE #02 (def)[000024] ---XG------ * ADD int <l:$215, c:$214>
N004 ( 4, 4) CSE #01 (def)[000092] ---XG------ +--* IND int <l:$210, c:$211>
N003 ( 2, 2) [000133] -------N--- | \--* ADD byref <l:$3c1, c:$3c2>
N001 ( 1, 1) [000020] ----------- | +--* LCL_VAR ref V02 loc1 u:3 <l:$287, c:$103>
N002 ( 1, 1) [000132] ----------- | \--* CNS_INT long 16 Fseq[_size] $382
N005 ( 1, 1) [000023] ----------- \--* CNS_INT int -2 $cc
Moderate CSE Promotion (CSE is live across a call) (156.546490 >= 100.000000)
cseRefCnt=156.546490, aggressiveRefCnt=200.000000, moderateRefCnt=100.000000
defCnt=52.182163, useCnt=52.182163, cost=6, size=6, LiveAcrossCall
def_cost=2, use_cost=1, extra_no_cost=10, extra_yes_cost=100
CSE cost savings check (323.092979 >= 256.546490) passes
Promoting CSE:
lvaGrabTemp returning 10 (V10 rat0) (a long lifetime temp) called for CSE - moderate.
CSE #02 is single-def, so associated CSE temp V10 will be in SSA
New refCnts for V10: refCnt = 2, refCntWtd = 1.04
New refCnts for V10: refCnt = 3, refCntWtd = 1.57
CSE #02 def at [000024] replaced in BB03 with def of V10
optValnumCSE morphed tree:
N010 ( 13, 11) [000025] DA-XG------ * STORE_LCL_VAR int V04 loc3 d:2 <l:$28d, c:$28c>
N009 ( 13, 11) [000227] -A-XG------ \--* COMMA int <l:$215, c:$214>
N007 ( 10, 9) CSE #02 (def)[000225] DA-XG------ +--* STORE_LCL_VAR int V10 cse0 d:1 $VN.Void
N006 ( 6, 6) [000024] ---XG------ | \--* ADD int <l:$215, c:$214>
N004 ( 4, 4) CSE #01 (def)[000092] ---XG------ | +--* IND int <l:$210, c:$211>
N003 ( 2, 2) [000133] -------N--- | | \--* ADD byref <l:$3c1, c:$3c2>
N001 ( 1, 1) [000020] ----------- | | +--* LCL_VAR ref V02 loc1 u:3 <l:$287, c:$103>
N002 ( 1, 1) [000132] ----------- | | \--* CNS_INT long 16 Fseq[_size] $382
N005 ( 1, 1) [000023] ----------- | \--* CNS_INT int -2 $cc
N008 ( 3, 2) [000226] ----------- \--* LCL_VAR int V10 cse0 u:1 <l:$215, c:$214>
Working on the replacement of the CSE #02 use at [000054] in BB11
Unmark CSE use #01 at [000116]: 1 -> 0
optValnumCSE morphed tree:
N002 ( 3, 3) [000055] DA--G------ * STORE_LCL_VAR int V05 loc4 d:4 <l:$28d, c:$28c>
N001 ( 3, 2) [000228] ----------- \--* LCL_VAR int V10 cse0 u:1 <l:$212, c:$213>
The value of V02._size
can change between the fetch in BB03 and the one in BB11, via this call
***** BB07
STMT00012 ( 0x033[E-] ... 0x03D )
N003 ( 16, 9) [000043] --CXG------ * CALL void Program:DoWork(System.Collections.Generic.List`1[NamedSet],int) $VN.Void
N001 ( 1, 1) [000041] ----------- arg0 in rcx +--* LCL_VAR ref V02 loc1 u:3 <l:$287, c:$103>
N002 ( 1, 1) [000042] ----------- arg1 in rdx \--* LCL_VAR int V04 loc3 u:3 $247
CSE availability propagates from BB03 through BB07 and BB10 to BB11.
If RBO's optJumpThreadPhi
is disabled, then, BB10 has an additional pred BB09 where the CSE is not available.
CSE then treats the instance in BB11 as a def instead of a use, and so no CSE happens.
CSE is matching liberal VNs here. So looking at VN:
BB03
N001 [000020] LCL_VAR V02 loc1 u:3 => <l:$287 {$404[$340]}, c:$103 {MemOpaque:L00}>
N002 [000132] CNS_INT 16 Fseq[_size] => $382 {LngCns: 16}
N003 [000133] ADD => <l:$3c1 {ADD($287, $382)}, c:$3c2 {ADD($103, $382)}>
VNForHandle(_size) is $44, fieldType is int, size = 4
VNForMapSelect($307, $44):mem returns $406 {$307[$44]}
VNForMapSelect($406, $287):int returns $20f {$406[$287]}
N004 [000092] IND => <l:$210 {norm=$20f {$406[$287]}, exc=$28a {NullPtrExc($287)}}, c:$211 {norm=$146 {MemOpaque:L01}, exc=$28b {NullPtrExc($103)}}>
N005 [000023] CNS_INT -2 => $cc {IntCns -2}
N006 [000024] ADD => <l:$215 {norm=$212 {ADD($cc, $20f)}, exc=$28a {NullPtrExc($287)}}, c:$214 {norm=$213 {ADD($cc, $146)}, exc=$28b {NullPtrExc($103)}}>
BB11
N001 [000050] LCL_VAR V02 loc1 u:3 => <l:$287 {$404[$340]}, c:$103 {MemOpaque:L00}>
N002 [000161] CNS_INT 16 Fseq[_size] => $382 {LngCns: 16}
N003 [000162] ADD => <l:$3c1 {ADD($287, $382)}, c:$3c2 {ADD($103, $382)}>
VNForHandle(_size) is $44, fieldType is int, size = 4
VNForMapSelect($307, $44):mem returns $406 {$307[$44]}
VNForMapSelect($406, $287):int returns $20f {$406[$287]}
N004 [000116] IND => <l:$210 {norm=$20f {$406[$287]}, exc=$28a {NullPtrExc($287)}}, c:$218 {norm=$147 {MemOpaque:L00}, exc=$28b {NullPtrExc($103)}}>
N005 [000053] CNS_INT -2 => $cc {IntCns -2}
N006 [000054] ADD => <l:$215 {norm=$212 {ADD($cc, $20f)}, exc=$28a {NullPtrExc($287)}}, c:$21a {norm=$219 {ADD($cc, $147)}, exc=$28b {NullPtrExc($103)}}>
We see the same field _size
fetched from the heap off of the same base $287.
As noted, the path from BB03 to BB11 includes call that takes V02
as an arg.
***** BB07, STMT00012(before)
N003 ( 16, 9) [000043] --CXG------ * CALL void Program:DoWork(System.Collections.Generic.List`1[NamedSet],int)
N001 ( 1, 1) [000041] ----------- arg0 in rcx +--* LCL_VAR ref V02 loc1 u:3
N002 ( 1, 1) [000042] ----------- arg1 in rdx \--* LCL_VAR int V04 loc3 u:3
N001 [000041] LCL_VAR V02 loc1 u:3 => <l:$287 {$404[$340]}, c:$103 {MemOpaque:L00}>
N002 [000042] LCL_VAR V04 loc3 u:3 => $247 {PhiDef($4, $3, $209)}
fgCurMemoryVN[GcHeap] assigned for CALL at [000043] to VN: $1c8.
N003 [000043] CALL => $VN.Void
Which should trash any indirs in the GC heap based off of V02
. It introduces a new memory VN.
But this memory VN merges back into the memory PHI in BB10 creating memory VN $307 which is used in the CSEs above. BB10 is pred to both BB03 and BB11. So per VN, they seem to be reading from the same memory.
Building phi application: $cb = SSA# 9.
Building phi application: $c6 = SSA# 6.
Building phi application: $305 = phi($c6, $cb).
Building phi application: $c5 = SSA# 5.
Building phi application: $306 = phi($c5, $305).
The SSA definition for GcHeap (#6) at start of BB10 is $307 {PhiMemoryDef($43, $306)}
There is a loop here from BB03->BB10, with entry at BB09. But inside there we see an unrecognized loop that excludes BB09?
=======================
So, options?
CSE #02
along the path from BB03 to BB11.
One question I've been asking myself here without any clear insight is why this issue is specific to VNs involving memory. Could we have the same situation with just a simple local that's conditionally defined? If so, it seems like for sure we would have hit this case already.
Also not clear how narrow/fragile the repro is; Egor suggests it is very sensitive to this code shape. Perhaps some experimentation will suggest which of the elements above is necessary?
Deleted the bot comment since it contained the large source code example and was bogging down viewing the bug.
Still not seeing a really clean fix here.
MapSelect
but we don't record info for loads that bottom out on opaque memory VNs or get blocked by PHIs. And it's not easy to walk the VN tree abstractly in general because of all those crazy special encodings for some VN func args (seems like we ought to fix this someday so we can have say a VN tree walker of sorts).
[13:35:47] --------------------------------------------------------------------------------
[13:35:47] 2,608 contexts with diffs (657 improvements, 1,608 regressions, 343 same size)
[13:35:47] -6,300/+19,074 bytes
Here are two graphs showing memory SSA states (MI = memory in, MO = memory out) and VNs. One is before RBO and the other after
After VN | After RBO |
Btw, when I was investigating this I limitted RBO to only run for one specific BB only (the one that cuased problems) and only do Jump-threading-Phi on it to minimze unrelated code movement, but maybe it's what you did here as well
Diffs from blocking optJumpThreadPhi
for blocks with any kind of memory phi:
I am going to test out an alternate fix, where if RBO bypasses a block with memory PHI, we mark that block so that it kills off any incoming CSEs. This seems fairly surgical but also a bit harder to justify as a solid fix. The assumption here is that this issue can only arise if there is a path from one CSE instance to another that passes through a memory PHI, so if we block propagation of all CSEs then that second instance will not be a use of the first.
Ideally perhaps we'd kill off only the impacted CSEs, but that seems harder to do...
@AndyAyersMS I am confused by one thing in the memory SSA picture:
How does the MI
(I am assuming "Memory Incoming") PHI end up having the value of $307
here, given $307
is one of its input arguments (or is that an artifact of the dumping method)?
How does the
MI
(I am assuming "Memory Incoming") PHI end up having the value of$307
here, given$307
is one of its input arguments (or is that an artifact of the dumping method)?
It is a result of the BB10->BB03->BB10 path where the code can (apparently) loop around without modifying memory.
Note that when we actually value number BB10 we won't have that VN available, but the VN memory phis hold only the SSA numbers, not the value numbers, so after the fact we can't be sure which VNs where there already when building the phi, and which weren't.
The following program throws
ArgumentOutOfRangeException
in Release and finishes normally in Debug:https://gist.github.com/EgorBo/bdd1c3c2c65a80681e96f840df40cd2a
cc @AndyAyersMS @dotnet/jit-contrib