dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
15.04k stars 4.68k forks source link

Program throws ArgumentOutOfRangeException in Release #88091

Closed EgorBo closed 1 year ago

EgorBo commented 1 year ago

The following program throws ArgumentOutOfRangeException in Release and finishes normally in Debug:

https://gist.github.com/EgorBo/bdd1c3c2c65a80681e96f840df40cd2a

cc @AndyAyersMS @dotnet/jit-contrib

AndyAyersMS commented 1 year ago

Running with DOTNET_JitCloneLoops=0 to simplify things a little.

The problem goes away if optJumpThreadPhi is disabled. The problem goes away if CSE is diabled.

The actual bug is caused by this CSE:

Considering CSE #02 {$212, $28a} [def=52.182163, use=52.182163, cost=  6, call]
CSE Expression : 
N006 (  6,  6) CSE #02 (def)[000024] ---XG------                         *  ADD       int    <l:$215, c:$214>
N004 (  4,  4) CSE #01 (def)[000092] ---XG------                         +--*  IND       int    <l:$210, c:$211>
N003 (  2,  2)              [000133] -------N---                         |  \--*  ADD       byref  <l:$3c1, c:$3c2>
N001 (  1,  1)              [000020] -----------                         |     +--*  LCL_VAR   ref    V02 loc1         u:3 <l:$287, c:$103>
N002 (  1,  1)              [000132] -----------                         |     \--*  CNS_INT   long   16 Fseq[_size] $382
N005 (  1,  1)              [000023] -----------                         \--*  CNS_INT   int    -2 $cc

Moderate CSE Promotion (CSE is live across a call) (156.546490 >= 100.000000)
cseRefCnt=156.546490, aggressiveRefCnt=200.000000, moderateRefCnt=100.000000
defCnt=52.182163, useCnt=52.182163, cost=6, size=6, LiveAcrossCall
def_cost=2, use_cost=1, extra_no_cost=10, extra_yes_cost=100
CSE cost savings check (323.092979 >= 256.546490) passes

Promoting CSE:

lvaGrabTemp returning 10 (V10 rat0) (a long lifetime temp) called for CSE - moderate.
CSE #02 is single-def, so associated CSE temp V10 will be in SSA
New refCnts for V10: refCnt =  2, refCntWtd = 1.04
New refCnts for V10: refCnt =  3, refCntWtd = 1.57

CSE #02 def at [000024] replaced in BB03 with def of V10
optValnumCSE morphed tree:
N010 ( 13, 11)              [000025] DA-XG------                         *  STORE_LCL_VAR int    V04 loc3         d:2 <l:$28d, c:$28c>
N009 ( 13, 11)              [000227] -A-XG------                         \--*  COMMA     int    <l:$215, c:$214>
N007 ( 10,  9) CSE #02 (def)[000225] DA-XG------                            +--*  STORE_LCL_VAR int    V10 cse0         d:1 $VN.Void
N006 (  6,  6)              [000024] ---XG------                            |  \--*  ADD       int    <l:$215, c:$214>
N004 (  4,  4) CSE #01 (def)[000092] ---XG------                            |     +--*  IND       int    <l:$210, c:$211>
N003 (  2,  2)              [000133] -------N---                            |     |  \--*  ADD       byref  <l:$3c1, c:$3c2>
N001 (  1,  1)              [000020] -----------                            |     |     +--*  LCL_VAR   ref    V02 loc1         u:3 <l:$287, c:$103>
N002 (  1,  1)              [000132] -----------                            |     |     \--*  CNS_INT   long   16 Fseq[_size] $382
N005 (  1,  1)              [000023] -----------                            |     \--*  CNS_INT   int    -2 $cc
N008 (  3,  2)              [000226] -----------                            \--*  LCL_VAR   int    V10 cse0         u:1 <l:$215, c:$214>

Working on the replacement of the CSE #02 use at [000054] in BB11
Unmark CSE use #01 at [000116]:   1 ->   0
optValnumCSE morphed tree:
N002 (  3,  3)              [000055] DA--G------                         *  STORE_LCL_VAR int    V05 loc4         d:4 <l:$28d, c:$28c>
N001 (  3,  2)              [000228] -----------                         \--*  LCL_VAR   int    V10 cse0         u:1 <l:$212, c:$213>

The value of V02._size can change between the fetch in BB03 and the one in BB11, via this call

***** BB07
STMT00012 ( 0x033[E-] ... 0x03D )
N003 ( 16,  9) [000043] --CXG------                         *  CALL      void   Program:DoWork(System.Collections.Generic.List`1[NamedSet],int) $VN.Void
N001 (  1,  1) [000041] ----------- arg0 in rcx             +--*  LCL_VAR   ref    V02 loc1         u:3 <l:$287, c:$103>
N002 (  1,  1) [000042] ----------- arg1 in rdx             \--*  LCL_VAR   int    V04 loc3         u:3 $247

CSE availability propagates from BB03 through BB07 and BB10 to BB11.

image

If RBO's optJumpThreadPhi is disabled, then, BB10 has an additional pred BB09 where the CSE is not available. CSE then treats the instance in BB11 as a def instead of a use, and so no CSE happens.

CSE is matching liberal VNs here. So looking at VN:

BB03

N001 [000020]   LCL_VAR   V02 loc1         u:3 => <l:$287 {$404[$340]}, c:$103 {MemOpaque:L00}>
N002 [000132]   CNS_INT   16 Fseq[_size] => $382 {LngCns:  16}
N003 [000133]   ADD       => <l:$3c1 {ADD($287, $382)}, c:$3c2 {ADD($103, $382)}>
    VNForHandle(_size) is $44, fieldType is int, size = 4
    VNForMapSelect($307, $44):mem returns $406 {$307[$44]}
    VNForMapSelect($406, $287):int returns $20f {$406[$287]}
N004 [000092]   IND       => <l:$210 {norm=$20f {$406[$287]}, exc=$28a {NullPtrExc($287)}}, c:$211 {norm=$146 {MemOpaque:L01}, exc=$28b {NullPtrExc($103)}}>
N005 [000023]   CNS_INT   -2 => $cc {IntCns -2}
N006 [000024]   ADD       => <l:$215 {norm=$212 {ADD($cc, $20f)}, exc=$28a {NullPtrExc($287)}}, c:$214 {norm=$213 {ADD($cc, $146)}, exc=$28b {NullPtrExc($103)}}>

BB11

N001 [000050]   LCL_VAR   V02 loc1         u:3 => <l:$287 {$404[$340]}, c:$103 {MemOpaque:L00}>
N002 [000161]   CNS_INT   16 Fseq[_size] => $382 {LngCns:  16}
N003 [000162]   ADD       => <l:$3c1 {ADD($287, $382)}, c:$3c2 {ADD($103, $382)}>
    VNForHandle(_size) is $44, fieldType is int, size = 4
    VNForMapSelect($307, $44):mem returns $406 {$307[$44]}
    VNForMapSelect($406, $287):int returns $20f {$406[$287]}
N004 [000116]   IND       => <l:$210 {norm=$20f {$406[$287]}, exc=$28a {NullPtrExc($287)}}, c:$218 {norm=$147 {MemOpaque:L00}, exc=$28b {NullPtrExc($103)}}>
N005 [000053]   CNS_INT   -2 => $cc {IntCns -2}
N006 [000054]   ADD       => <l:$215 {norm=$212 {ADD($cc, $20f)}, exc=$28a {NullPtrExc($287)}}, c:$21a {norm=$219 {ADD($cc, $147)}, exc=$28b {NullPtrExc($103)}}>

We see the same field _size fetched from the heap off of the same base $287.

As noted, the path from BB03 to BB11 includes call that takes V02 as an arg.

***** BB07, STMT00012(before)
N003 ( 16,  9) [000043] --CXG------                         *  CALL      void   Program:DoWork(System.Collections.Generic.List`1[NamedSet],int)
N001 (  1,  1) [000041] ----------- arg0 in rcx             +--*  LCL_VAR   ref    V02 loc1         u:3
N002 (  1,  1) [000042] ----------- arg1 in rdx             \--*  LCL_VAR   int    V04 loc3         u:3

N001 [000041]   LCL_VAR   V02 loc1         u:3 => <l:$287 {$404[$340]}, c:$103 {MemOpaque:L00}>
N002 [000042]   LCL_VAR   V04 loc3         u:3 => $247 {PhiDef($4, $3, $209)}
  fgCurMemoryVN[GcHeap] assigned for CALL at [000043] to VN: $1c8.
N003 [000043]   CALL      => $VN.Void

Which should trash any indirs in the GC heap based off of V02. It introduces a new memory VN.

But this memory VN merges back into the memory PHI in BB10 creating memory VN $307 which is used in the CSEs above. BB10 is pred to both BB03 and BB11. So per VN, they seem to be reading from the same memory.

  Building phi application: $cb = SSA# 9.
  Building phi application: $c6 = SSA# 6.
  Building phi application: $305 = phi($c6, $cb).
  Building phi application: $c5 = SSA# 5.
  Building phi application: $306 = phi($c5, $305).
The SSA definition for GcHeap (#6) at start of BB10 is $307 {PhiMemoryDef($43, $306)}

There is a loop here from BB03->BB10, with entry at BB09. But inside there we see an unrecognized loop that excludes BB09?

=======================

So, options?

One question I've been asking myself here without any clear insight is why this issue is specific to VNs involving memory. Could we have the same situation with just a simple local that's conditionally defined? If so, it seems like for sure we would have hit this case already.

Also not clear how narrow/fragile the repro is; Egor suggests it is very sensitive to this code shape. Perhaps some experimentation will suggest which of the elements above is necessary?

AndyAyersMS commented 1 year ago

Deleted the bot comment since it contained the large source code example and was bogging down viewing the bug.

AndyAyersMS commented 1 year ago

Still not seeing a really clean fix here.

AndyAyersMS commented 1 year ago

Here are two graphs showing memory SSA states (MI = memory in, MO = memory out) and VNs. One is before RBO and the other after

After VN After RBO
image image
EgorBo commented 1 year ago

Btw, when I was investigating this I limitted RBO to only run for one specific BB only (the one that cuased problems) and only do Jump-threading-Phi on it to minimze unrelated code movement, but maybe it's what you did here as well

AndyAyersMS commented 1 year ago

Diffs from blocking optJumpThreadPhi for blocks with any kind of memory phi:

Overall (+117,686 bytes)
|Collection|Base size (bytes)|Diff size (bytes)| |---|--:|--:| |aspnet.run.windows.x64.checked.mch|38,046,237|+30,260| |benchmarks.run.windows.x64.checked.mch|7,293,703|+3,252| |benchmarks.run_pgo.windows.x64.checked.mch|27,312,724|+24,278| |benchmarks.run_tiered.windows.x64.checked.mch|10,236,085|+2,421| |coreclr_tests.run.windows.x64.checked.mch|369,434,284|+26,015| |libraries.crossgen2.windows.x64.checked.mch|6,295,320|+217| |libraries.pmi.windows.x64.checked.mch|59,196,125|+10,403| |libraries_tests.pmi.windows.x64.checked.mch|115,763,833|+19,722| |realworld.run.windows.x64.checked.mch|13,758,228|+1,118| https://gist.github.com/AndyAyersMS/5f8bd9ce5f68d55e94346da3da76db98
AndyAyersMS commented 1 year ago

I am going to test out an alternate fix, where if RBO bypasses a block with memory PHI, we mark that block so that it kills off any incoming CSEs. This seems fairly surgical but also a bit harder to justify as a solid fix. The assumption here is that this issue can only arise if there is a path from one CSE instance to another that passes through a memory PHI, so if we block propagation of all CSEs then that second instance will not be a use of the first.

Ideally perhaps we'd kill off only the impacted CSEs, but that seems harder to do...

SingleAccretion commented 1 year ago

@AndyAyersMS I am confused by one thing in the memory SSA picture:

image

How does the MI (I am assuming "Memory Incoming") PHI end up having the value of $307 here, given $307 is one of its input arguments (or is that an artifact of the dumping method)?

AndyAyersMS commented 1 year ago

How does the MI (I am assuming "Memory Incoming") PHI end up having the value of $307 here, given $307 is one of its input arguments (or is that an artifact of the dumping method)?

It is a result of the BB10->BB03->BB10 path where the code can (apparently) loop around without modifying memory.

AndyAyersMS commented 1 year ago

Note that when we actually value number BB10 we won't have that VN available, but the VN memory phis hold only the SSA numbers, not the value numbers, so after the fact we can't be sure which VNs where there already when building the phi, and which weren't.