Closed kevin-montrose closed 1 year ago
Tagging subscribers to this area: @dotnet/gc See info in area-owners.md if you want to be subscribed.
| Author: | kevin-montrose |
|---|---|
| Assignees: | - |
| Labels: | `area-GC-coreclr`, `untriaged` |
| Milestone: | - |
For what it's worth, it reliably crashes in a few seconds with a SIGSEGV on a macOS M1, but only in release mode. This was on .NET 8 from a nightly a few days ago.
More notably, it crashes for me even with `USE_POH = false`. So this may indicate the problem is not exclusive to the POH.
I was able to repro on net6 and net7-rc2 on Windows, Ryzen 9 5900HS in POH mode (only); I was ultimately unable to make any significant inroads into what was actually happening, other than "bad things"
I tried a few scenarios:
- … `ConcurrentDictionary<,>.TryUpdateInternal`
- `dict[keyIx] = newArr;` - fails (after more iterations, but often less time - indexer is much faster) with an assertion error on the value - typically seeing values like `2565199768464` where the zero was expected
- `dict.AddOrUpdate(keyIx, newArr, static (_, passed) => passed);` - again much faster, but fails eventually
It certainly feels like a GC tracking error that is specific to POH reachability checks for values only reachable by interior managed pointers; however, multiple simpler scenarios that targeted exactly that did nothing interesting and worked as expected.
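For readers comparing the update styles listed above, here is a minimal sketch of three ways to replace a value in a `ConcurrentDictionary<int, byte[]>`. It is not the repro from the gist; the names `dict`, `keyIx`, and the array sizes are stand-ins chosen for illustration.

```csharp
using System;
using System.Collections.Concurrent;

// Hypothetical stand-ins for the names used in this thread.
var dict = new ConcurrentDictionary<int, byte[]>();
int keyIx = 0;
byte[] original = new byte[8];
dict[keyIx] = original;

// 1. Compare-and-swap: only succeeds if the stored value still equals `original`
//    (for byte[] the default comparer is reference equality).
byte[] viaTryUpdate = new byte[8];
bool swapped = dict.TryUpdate(keyIx, viaTryUpdate, original);

// 2. Indexer: unconditional overwrite.
byte[] viaIndexer = new byte[8];
dict[keyIx] = viaIndexer;

// 3. AddOrUpdate with a factory argument, so the static lambdas need no captures
//    and the new array is what ends up stored for both the add and update paths.
byte[] viaAddOrUpdate = new byte[8];
dict.AddOrUpdate(
    keyIx,
    static (_, arg) => arg,      // add path: key was absent
    static (_, _, arg) => arg,   // update path: ignore the existing value
    viaAddOrUpdate);

Console.WriteLine(swapped && ReferenceEquals(dict[keyIx], viaAddOrUpdate));
```

Note that in the three-argument `AddOrUpdate(key, addValue, updateValueFactory)` overload, the factory's second parameter is the key's existing value, so a lambda of the form `(_, passed) => passed` keeps the existing array when the key is already present. The sketch uses the `TArg` overload instead so that the new array is what gets stored.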
Perhaps relevant:
- … `Unsafe.As` instead of the `MemoryMarshal.Cast` for the punning: made no difference (but a little faster - sacrificing a length test)
- `GC.KeepAlive(newArr);` after the assignment changed nothing, so presumably the issue isn't related to tracking the inbound value... but this is speculation
- … `2565199768464` value) it started taking significantly more iterations to fault (but was still unstable); that may suggest some interplay with JIT, or could just be related to inlining changes making it perform differently
Anecdotal, but this might be JIT related. If I set `COMPlus_JITMinOpts=1`, it does not reproduce anymore. It's been running several minutes without crashing.
@vcsjones still reproduces for me (.NET 7 RC, using POH, x64) with `COMPlus_JITMinOpts=1` 😞 - just a couple iterations, though they are slower iterations (as expected).
(adding a note to myself to check with Mono)
Update: didn't repro on desktop mono (JIT and interp). I tried some variants with adding enough recursion to confound conservative stack scanning, too. So it seems unlikely it's some logic error in at least the shared bits of CoreLib.
Random thought (I'm not at PC to test) - maybe a JIT size miscalculation due to the explicit struct layout? (not as random as it sounds - the last time I found a JIT bug was the "fixed buffers" size miscalculation)
Usually a heap corruption of this sort could be related to JIT optimizations. @AndyAyersMS in case you have seen cases like this
> Random thought (I'm not at PC to test) - maybe a JIT size miscalculation due to the explicit struct layout? (not as random as it sounds - the last time I found a JIT bug was the "fixed buffers" size miscalculation)
In my experimentation removing explicit layout, or adding kinda random padding, made no difference. I default to explicit layouts when doing tricky things with structs since I can never remember what's actually guaranteed by the compiler - given how shrunk down this repro is, it could probably be removed safely...
> Usually a heap corruption of this sort could be related to JIT optimizations. @AndyAyersMS in case you have seen cases like this
If this repros with minopts and seems to be related to POH it's a bit less likely to be a JIT issue. But still possible.
cc @dotnet/jit-contrib
Judging by the stack trace, it crashes in `ObjectEqualityComparer.Equals`.
It hits an assert with `DOTNET_HeapVerify=1` when validating a `ConcurrentDictionary`'s `Node` object, namely its members in `ValidateObjectMember`.
I assume this issue has not been fixed in the latest .NET 8 preview?
Description
This was a bear to diagnose, and I'm still not 100% on what exactly is happening but the scenario is:
- … `byte[]`s allocated on the POH
- … `byte[]`s are referenced by a `ConcurrentDictionary`
- … `byte[]`s, and punning it via `MemoryMarshal.Cast`
  - … `ref byte` and some unsafe code, but I've removed the unsafe code to eliminate it as a possible cause
- … `byte[]`s from the `ConcurrentDictionary`
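As a rough illustration of the moving parts in that scenario (not the actual repro, which lives in the gist), the sketch below allocates a `byte[]` on the POH, stores it in a `ConcurrentDictionary`, puns it with `MemoryMarshal.Cast`, and then replaces it. The use of `GC.AllocateArray` with `pinned: true`, the `long` element type, and the constant values are assumptions made for the example; `USE_POH` and `ALLOC_SIZE` simply mirror the toggle names mentioned in this issue.

```csharp
using System;
using System.Collections.Concurrent;
using System.Runtime.InteropServices;

// Illustrative values only; named after the toggles mentioned in the issue.
const bool USE_POH = true;
const int ALLOC_SIZE = 64;

var dict = new ConcurrentDictionary<int, byte[]>();

// pinned: true asks the GC to place the array on the Pinned Object Heap.
byte[] Allocate() => GC.AllocateArray<byte>(ALLOC_SIZE, pinned: USE_POH);

dict[0] = Allocate();

// Reader side: reinterpret the bytes as longs without copying.
// Cast checks lengths, so the view has ALLOC_SIZE / sizeof(long) elements.
Span<long> view = MemoryMarshal.Cast<byte, long>(dict[0].AsSpan());
view[0] = 0;

// Writer side: swap in a fresh array. The old array is now reachable only
// through `view`, an interior pointer held on the stack.
byte[] replacement = Allocate();
dict[0] = replacement;

Console.WriteLine(view.Length);
```

In the real scenario the reader and writer run concurrently on different threads and loop for many iterations; this single-threaded version only shows which APIs are involved.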
I first discovered this as random-looking pointers getting written into those `byte[]` arrays, but in the process of winnowing down to a smaller reproduction, null reference exceptions, seg faults, and other "you've corrupted the process"-style errors became more likely. I interpret this as the same corruption happening, but because my punned arrays are smaller the corruption is more likely to hit something else.
I first noticed this in .NET 7 RC (`7.0.0-rc.1.22427.1` specifically) but it has also been reproduced in .NET 6.
Reproduction Steps
I have a gist I used to winnow down the repro some.
Latest is copied here:
This will fail either in `Check`, with a null ref in an impossible place (usually `AddOrUpdate`), or with some variant of "runtime has become corrupt". The NRE is most common with the above, but earlier revisions usually failed in `Check`.
In my testing this only happens if the POH is used (toggle `USE_POH` to verify), and at all (legal) sizes for the `byte[]`s (change `ALLOC_SIZE` to verify).
Expected behavior
I would expect the attached code to run fine forever.
Actual behavior
Crashes with some sort of data corruption.
Regression?
No, this reproduces (at least in part) on .NET 6.
Known Workarounds
Don't use the POH I guess?
Configuration
This was first noticed on:
It is also reproducing, at least in part, on .NET 6.
It has been reproduced on a colleague's machine as well, but I don't have the specifics beyond also x64, Windows, and .NET 7 & 6.
Other information
When I've found a corrupted `byte[]` (instead of an NRE or other crash), it looks very pointer-y but seems to point to memory outside of any heap.
This makes me think some sort of GC bug, perhaps as part of growing or shrinking the POH, but that is ~98% guesswork.