Tagging subscribers to this area: @dotnet/gc See info in area-owners.md if you want to be subscribed.
I managed to repro the AVE (the same stacktrace) by running a given test in a loop, but this time there is no heap corruption:
0:012> !verifyheap
No heap corruption detected.
Also, my question from point 5 about potential problems with using onCancel directly instead of the callback argument is not relevant - it reproduced without it.
Here is the memory dump: https://drive.google.com/file/d/1EosHaFy4KDfggT2q3adXYF3jhLDNq2uV/view?usp=sharing
It was created as follows:
procdump.exe -ma -e 1 -f C0000005.ACCESS_VIOLATION -g -x C:\workspace\memory-dumps "C:\workspace\ravendb-5.2\test\Tryouts\bin\Release\net6.0\Tryouts.exe"
I will continue the investigation on my side.
Thanks for reporting the issue. We will look at the dump. Is the standalone repro easily shareable?
Thanks for looking into this. In order to reproduce (it's not consistent), please run the `Tryouts` project in Release mode. The repro is basically running the same test multiple times. See my last commit there - https://github.com/arekpalinski/ravendb/commit/7b9fd16c930699593c917360acd30a3da3f3c3d5. On my end it usually reproduced after 2500 - 3000 iterations (on a machine with 32 GB of memory), but I have just run it on a different machine (16 GB of memory) and it reproduced after 600 iterations.
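To make the shape of the repro concrete, here is a minimal, hypothetical harness - the real loop and test live in the Tryouts project from the linked commit, and `RunSingleTestIteration` is a placeholder name, not an actual method in that repo:

```csharp
using System;

public static class ReproHarness
{
    // Placeholder for one execution of the actual test from the Tryouts project.
    static void RunSingleTestIteration()
    {
    }

    public static void Main()
    {
        // Run the same test over and over; per the comment above, the AVE showed up
        // somewhere between roughly 600 and 3000 iterations depending on the machine.
        for (int i = 0; i < 10_000; i++)
        {
            if (i % 100 == 0)
                Console.WriteLine($"Iteration {i}");

            RunSingleTestIteration();
        }
    }
}
```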
Hi @mangod9. Did you have a chance to check if my repro steps result in an AVE on your end?
No, haven't had a chance to investigate further. Will try to look into it next week.
Thank you. We got another failure; this time it happened on an x86 build running the tests. The stacktrace is a bit different but the GC is still involved:
EXCEPTION_RECORD: (.exr -1)
ExceptionAddress: 7251b419 (coreclr!WKS::my_get_size+0x0000000b)
ExceptionCode: c0000005 (Access violation)
ExceptionFlags: 00000001
NumberParameters: 2
Parameter[0]: 00000000
Parameter[1]: 00000000
Attempt to read from address 00000000
coreclr!WKS::gc_heap::mark_object_simple+0x69
coreclr!WKS::gc_heap::mark_through_cards_for_segments+0x413
coreclr!WKS::gc_heap::mark_phase+0x1bf
coreclr!WKS::gc_heap::gc1+0x6b
coreclr!WKS::gc_heap::garbage_collect+0x175
coreclr!WKS::GCHeap::GarbageCollectGeneration+0xea
coreclr!WKS::gc_heap::trigger_gc_for_alloc+0x19
coreclr!WKS::gc_heap::try_allocate_more_space+0x19c
coreclr!WKS::gc_heap::allocate_more_space+0x18
coreclr!WKS::GCHeap::Alloc+0x4f
coreclr!AllocateObject+0x117
coreclr!JIT_New+0xbe
I can share the dump if needed but I think the repro steps that I shared are more valuable.
I think we started to see this since we moved to .NET 6. But as I said, we do unmanaged memory allocations on our own, so this might be us causing it as well (we investigated it on our side and don't see how we could cause it, but it's still an option).
I think it might be important that my repro resulted in an AVE in some GC code, but the heap was not corrupted (yet?).
Hi, any update on this?
I just cloned the repro and seem to be able to repro it after some time. When I look at the dump you shared and the local repro, the corrupted object pointer seems to be located between these two objects. Could you please confirm that for all repros that is the case, and if so is there anything special about these?
0:320> !lno 0x000001a5`d7f9c5e0
Before: 000001a5d7f9c350 664 (0x298) Raven.Server.Documents.Queries.AST.MatchPath[]
After: 000001a5d7f9c5e8 32 (0x20) System.Collections.Generic.List`1[[Sparrow.StringSegment, Sparrow]]
Heap local consistency confirmed.
From looking closer, this doesn't look like a heap corruption but rather an issue with reporting gcinfo on the stack, possibly some race condition leading to an off-by-8 condition.
Exactly, the repro I provided doesn't cause a heap corruption. That's why I commented:
I think it might be important that my repro resulted in an AVE in some GC code, but the heap was not corrupted (yet?).
We use unmanaged code, so potentially heap corruption could be the root cause. But given that it reproduces without corrupting the heap, I think the issue isn't caused by our unmanaged code.
Hi @mangod9. Do you have any leads on that? Is it a race condition as you suspected?
Yeah, appears to be an issue with gcinfo. Still trying to track down the issue; will try enabling stress logs and see if we can figure it out. Btw, do you happen to know if this started happening after moving to .NET 6?
I think we didn't see it before (on .NET 5), but I'm not absolutely sure that's the case. It happens occasionally, in contrast to the repro code, which fails pretty consistently.
Actually, I noticed that it hadn't happened to us for some time. Meanwhile we upgraded the dotnet version, so I thought it could be fixed. However, I have run the repro code again (using dotnet 6.0.3) and it's still happening.
We run a massive app on asp.net core and we started seeing similar random errors like this too after upgrading from .NET 5 to .NET 6:
Application: w3wp.exe
CoreCLR Version: 6.0.322.12309
.NET Version: 6.0.3
Description: The process was terminated due to an internal error in the .NET Runtime at IP 00007FF932A2A5A9 (00007FF932940000) with exit code 80131506.
I'm currently reverting some nuget packages we upgraded to narrow down the possible causes.
UPD: it turns out in our case it was ImageSharp 2.1; reverting to 1.0.4 resolved the issue. I'm going to leave this comment here in case anyone bumps into the same issue.
Got back to investigating this again. Looking through the repro, it appears to be a codegen optimization issue with how gcinfo is reported. If I comment out this line: https://github.com/arekpalinski/ravendb/blob/7b9fd16c930699593c917360acd30a3da3f3c3d5/src/Raven.Server/Documents/Queries/Parser/QueryParser.cs#L391
and inline-define `prev` here: https://github.com/arekpalinski/ravendb/blob/7b9fd16c930699593c917360acd30a3da3f3c3d5/src/Raven.Server/Documents/Queries/Parser/QueryParser.cs#L465 with
if (_synteticWithQueries?.TryGetValue(alias, out SynteticWithQuery prev) == true && !prev.IsEdge)
the failure didn't repro for me (at least with 10k iterations). @arekpalinski, could you please try that and check if it still reproduces for you? Thanks.
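For readers without the repo open, here is a minimal sketch of the shape of that change - the type and member names below are simplified stand-ins, not the actual RavenDB code:

```csharp
using System.Collections.Generic;

// Simplified stand-in for SynteticWithQuery: a struct whose fields include object
// references, which the GC must track while the struct lives on the stack.
struct WithQuery
{
    public List<string> Edges;
    public bool IsEdge;
}

class WorkaroundSketch
{
    Dictionary<string, WithQuery> _queries;   // may be null, hence the '?.' below

    // Original shape: 'prev' is declared (hoisted) up front and assigned later
    // via an 'out' argument behind a null-conditional call.
    bool OriginalShape(string alias)
    {
        WithQuery prev = default;
        if (_queries?.TryGetValue(alias, out prev) == true && !prev.IsEdge)
            return true;
        return false;
    }

    // Workaround shape: drop the hoisted declaration and declare the local inline
    // at the call site, as in the snippet quoted above (relies on C# 10's improved
    // definite-assignment analysis, the default for net6.0 projects).
    bool WorkaroundShape(string alias)
    {
        if (_queries?.TryGetValue(alias, out WithQuery prev) == true && !prev.IsEdge)
            return true;
        return false;
    }
}
```

The point of the change is to remove the hoisted, conditionally-assigned struct local, which matches the later diagnosis of missing initialization of local structs in certain cases.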
I can confirm that it didn't repro on my end as well after making those changes. I have also opened the PR to our repo with that workaround (https://github.com/ravendb/ravendb/pull/13968).
That indeed looks like it resolves the problem. Thank you for narrowing it down.
Ok, good to know. We will continue to investigate what's causing the bug, though I haven't been able to narrow down the repro.
Tagging subscribers to this area: @JulieLeeMSFT See info in area-owners.md if you want to be subscribed.
| Author: | arekpalinski |
|---|---|
| Assignees: | - |
| Labels: | `area-CodeGen-coreclr` |
| Milestone: | 6.0.x |
Moving to the codegen area, since this looks to be an issue with missing initialization of local structs in certain cases.
I have a prospective fix and a simpler repro case.
6.0.6 is out now, so if you get a chance, please verify it properly fixes this issue.
@arekpalinski I'm going to assume this is fixed; please re-open if you discover that's not the case.
We're approaching the release of a new version soon. So for now we'll stay with the workaround just in case. Afterwards I will revert it and will get back to you if I see any issues.
When running our test suite we got a crash:
The process was terminated due to an internal error in the .NET Runtime at IP 00007FFDC50AB5FC (00007FFDC5010000) with exit code 80131506.
We have configured automatic memory dump creation, which resulted in the following memory dump:
https://drive.google.com/file/d/19S1k74Foe9V6A03hRwIuebE42GVQUirI/view?usp=sharing
In our project (github.com/ravendb/ravendb) we use unmanaged memory directly, so it might be that it's because of our code.
The following analysis has been done so far in WinDbg.
!analyze -v
I see an `onCancel` member in the crashing stacktrace, so it's likely the following from `TimeoutManager.cs`:
https://github.com/ravendb/ravendb/blob/193624d559fe2e6525cc383de362c83d19aacffd/src/Sparrow/Utils/TimeoutManager.cs#L139
It is a System.Action`1[[System.Object, System.Private.CoreLib]], so my suspicion is that it's this action: `tcs => onCancel.TrySetCanceled()`. The attempt to get its `_target` gives an address that matches the output of `verifyheap` - `bad member 0000022D04C05821 at 0000022DA4010000` - so we know that the corrupted member is `_target`.
We use the `onCancel` variable in `tcs => onCancel.TrySetCanceled()` instead of using the callback action `tcs => ((TaskCompletionSource<object>)tcs).TrySetCanceled()`, but effectively it's the same thing. Could it cause any GC problems and result in something like that?
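For context, here is a minimal sketch of the two registration variants being contrasted - the names are illustrative and simplified, not copied from the TimeoutManager.cs code linked above:

```csharp
using System.Threading;
using System.Threading.Tasks;

static class CancelRegistrationSketch
{
    // Variant in question: the lambda captures 'onCancel' from the enclosing scope,
    // so the registration allocates a closure and ignores its state argument.
    public static void RegisterCapturing(CancellationToken token, TaskCompletionSource<object> onCancel)
    {
        token.Register(tcs => onCancel.TrySetCanceled(), null);
    }

    // Alternative: pass the TCS as the callback state and cast it back inside the callback.
    // As the question notes, the two are effectively equivalent; another comment in this
    // thread reports the crash reproduced without this pattern, so it is not the root cause.
    public static void RegisterWithState(CancellationToken token, TaskCompletionSource<object> onCancel)
    {
        token.Register(state => ((TaskCompletionSource<object>)state!).TrySetCanceled(), onCancel);
    }
}
```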