dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.

Memory dump of AccessViolationException on gc_heap::mark_object_simple and heap corruption #65694

Closed arekpalinski closed 2 years ago

arekpalinski commented 2 years ago

When running our test suite we got the following crash: The process was terminated due to an internal error in the .NET Runtime at IP 00007FFDC50AB5FC (00007FFDC5010000) with exit code 80131506:

Faulting application name: dotnet.exe, version: 6.0.222.6406, time stamp: 0x61e1d8df
Faulting module name: coreclr.dll, version: 6.0.222.6406, time stamp: 0x61e1d09e
Exception code: 0xc0000005
Fault offset: 0x000000000009b5fc
Faulting process id: 0x22b4
Faulting application start time: 0x01d823e188848226
Faulting application path: C:\Program Files\dotnet\dotnet.exe
Faulting module path: C:\Program Files\dotnet\shared\Microsoft.NETCore.App\6.0.2\coreclr.dll

We have configured automatic memory dump creation, which produced the following memory dump:

https://drive.google.com/file/d/19S1k74Foe9V6A03hRwIuebE42GVQUirI/view?usp=sharing

In our project (github.com/ravendb/ravendb) we use unmanaged memory directly, so the cause might be in our code.

The following analysis has been done so far in WinDbg.

  1. Based on !analyze -v, the crashing stack trace is:
EXCEPTION_RECORD:  (.exr -1)
ExceptionAddress: 00007ffdc50ab5fc (coreclr!WKS::gc_heap::mark_object_simple+0x000000000000011c)
   ExceptionCode: c0000005 (Access violation)
  ExceptionFlags: 00000001
NumberParameters: 2
   Parameter[0]: 0000000000000000
   Parameter[1]: 0000022da7b73000
Attempt to read from address 0000022da7b73000
coreclr!WKS::gc_heap::mark_object_simple+0x11c
coreclr!WKS::GCHeap::Promote+0x74
coreclr!GcEnumObject+0x76
coreclr!GcInfoDecoder::EnumerateLiveSlots+0x792
coreclr!EECodeManager::EnumGcRefs+0xe9
coreclr!GcStackCrawlCallBack+0x12f
coreclr!Thread::StackWalkFramesEx+0xee
coreclr!Thread::StackWalkFrames+0xae
coreclr!ScanStackRoots+0x7a
coreclr!GCToEEInterface::GcScanRoots+0x9f
coreclr!WKS::gc_heap::mark_phase+0x291
coreclr!WKS::gc_heap::gc1+0x98
coreclr!WKS::gc_heap::garbage_collect+0x1ad
coreclr!WKS::GCHeap::GarbageCollectGeneration+0x14f
coreclr!WKS::gc_heap::trigger_gc_for_alloc+0x2b
coreclr!WKS::gc_heap::try_allocate_more_space+0x5c141
coreclr!WKS::gc_heap::allocate_more_space+0x31
coreclr!WKS::GCHeap::Alloc+0x84
coreclr!JIT_NewArr1+0x4bd
0x00007ffd`778e62c6
0x00007ffd`6f3e3770
0x00007ffd`741c1efb
...
0x00007ffd`67d765da
0x00007ffd`67deeff2
coreclr!CallDescrWorkerInternal+0x83
coreclr!DispatchCallSimple+0x80
coreclr!ThreadNative::KickOffThread_Worker+0x63
coreclr!ManagedThreadBase_DispatchMiddle+0x85
coreclr!ManagedThreadBase_DispatchOuter+0xae
coreclr!ThreadNative::KickOffThread+0x79
kernel32!BaseThreadInitThunk+0x14
ntdll!RtlUserThreadStart+0x21
  2. The heap is corrupted:
0:340> !verifyheap
object 0000022da400fff8: bad member 0000022D04C05821 at 0000022DA4010000
Last good object: 0000022DA400FFE0.
  3. The last good object is:
0:340> !do 0000022DA400FFE0
Name:        Sparrow.Utils.TimeoutManager+<>c__DisplayClass6_0
MethodTable: 00007ffd67a1a6e0
EEClass:     00007ffd67a24988
Tracked Type: false
Size:        24(0x18) bytes
File:        c:\Jenkins\workspace\PR_Tests\s\test\SlowTests\bin\Release\net6.0\Sparrow.dll
Fields:
              MT    Field   Offset                 Type VT     Attr            Value Name
00007ffd679b44a0  40002eb        8 ...Private.CoreLib]]  0 instance 0000022da4010160 onCancel

I see the onCancel member, so it's likely the following from TimeoutManager.cs:

var onCancel = new TaskCompletionSource<object>(TaskCreationOptions.RunContinuationsAsynchronously);
using (token.Register(tcs => onCancel.TrySetCanceled(), onCancel))
{
}

https://github.com/ravendb/ravendb/blob/193624d559fe2e6525cc383de362c83d19aacffd/src/Sparrow/Utils/TimeoutManager.cs#L139

  4. The bad object is:
0:340> !do 0000022da400fff8
Name:        System.Action`1[[System.Object, System.Private.CoreLib]]
MethodTable: 00007ffd66a69428
EEClass:     00007ffd658f6788
Tracked Type: false
Size:        64(0x40) bytes
File:        C:\Program Files\dotnet\shared\Microsoft.NETCore.App\6.0.2\System.Private.CoreLib.dll
Fields:
              MT    Field   Offset                 Type VT     Attr            Value Name
00007ffd655a5678  40001ec        8        System.Object  0 instance 0000022d04c05821 _target
00007ffd655a5678  40001ed       10        System.Object  0 instance 0000000000000000 _methodBase
00007ffd65654228  40001ee       18        System.IntPtr  1 instance 00007FFD67569160 _methodPtr
00007ffd65654228  40001ef       20        System.IntPtr  1 instance 0000000000000000 _methodPtrAux
00007ffd655a5678  4000273       28        System.Object  0 instance 0000000000000000 _invocationList
00007ffd65654228  4000274       30        System.IntPtr  1 instance 0000000000000000 _invocationCount

It is System.Action`1[[System.Object, System.Private.CoreLib]], so my suspicion is that it is the action tcs => onCancel.TrySetCanceled().

The attempt to get its _target results in:

!DumpObj /d 0000022d04c05821
<Note: this object has an invalid CLASS field>
Invalid object

The address matches the !verifyheap output (bad member 0000022D04C05821 at 0000022DA4010000), so we know that the corrupted member is _target.

  5. I see that in our code we use the onCancel variable directly in tcs => onCancel.TrySetCanceled() instead of using the callback argument, tcs => ((TaskCompletionSource<object>)tcs).TrySetCanceled(), but effectively it's the same thing (see the side-by-side sketch just below). Could it cause any GC problems and result in something like this?
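
For context, here is a side-by-side of the two registration styles being discussed. This is a hedged illustration based on the snippet above; both forms are safe from the GC's perspective, and the difference is only in what the compiler generates:

var onCancel = new TaskCompletionSource<object>(TaskCreationOptions.RunContinuationsAsynchronously);

// Style A (as in TimeoutManager.cs): the lambda captures onCancel, so the compiler
// emits a closure class (the Sparrow.Utils.TimeoutManager+<>c__DisplayClass6_0 seen
// in the dump above) and the state argument passed to Register goes unused.
using (token.Register(tcs => onCancel.TrySetCanceled(), onCancel))
{
}

// Style B: the lambda only touches its state argument, captures nothing, and the
// compiler can cache a single static delegate instance across calls.
using (token.Register(tcs => ((TaskCompletionSource<object>)tcs).TrySetCanceled(), onCancel))
{
}

Either way, the delegate and its target are ordinary managed objects that the GC tracks; neither form should be able to corrupt the heap on its own.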
ghost commented 2 years ago

Tagging subscribers to this area: @dotnet/gc See info in area-owners.md if you want to be subscribed.

Author: arekpalinski
Assignees: -
Labels: `area-GC-coreclr`, `untriaged`
Milestone: -
arekpalinski commented 2 years ago

I managed to repro the AVE (the same stack trace) by running a given test in a loop, but this time there is no heap corruption:

0:012> !verifyheap
No heap corruption detected.

Also, my question in point 5 about potential problems from using onCancel directly instead of the callback argument is no longer relevant - it reproduced without that code in place.

Here is the memory dump: https://drive.google.com/file/d/1EosHaFy4KDfggT2q3adXYF3jhLDNq2uV/view?usp=sharing

It was created as follows:

procdump.exe -ma -e 1 -f C0000005.ACCESS_VIOLATION -g -x C:\workspace\memory-dumps "C:\workspace\ravendb-5.2\test\Tryouts\bin\Release\net6.0\Tryouts.exe"
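
For reference, the Sysinternals procdump switches used here: -ma writes a full memory dump, -e 1 dumps on first-chance exceptions, -f filters to exceptions whose name contains C0000005.ACCESS_VIOLATION, -g runs as a native debugger in a managed process, and -x launches the given executable, writing dumps to C:\workspace\memory-dumps.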

I will continue the investigation on my side.

mangod9 commented 2 years ago

Thanks for reporting the issue. We will look at the dump. Is the standalone repro easily shareable?

arekpalinski commented 2 years ago

Thanks for looking into this. To reproduce (it's not consistent), please:

  1. Clone: https://github.com/arekpalinski/ravendb/tree/RavenDB-18075
  2. Run the Tryouts project in Release mode

The repro is basically running the same test multiple times. See my last commit there: https://github.com/arekpalinski/ravendb/commit/7b9fd16c930699593c917360acd30a3da3f3c3d5. On my end it usually reproduced after 2500-3000 iterations (on a machine with 32GB of memory), but I have just run it on a different machine (16GB of memory) and it reproduced after 600 iterations.

arekpalinski commented 2 years ago

Hi @mangod9. Did you have a chance to check whether my repro steps result in an AVE on your end?

mangod9 commented 2 years ago

No, I haven't had a chance to investigate further. I will try to look into it next week.

arekpalinski commented 2 years ago

Thank you. We got another failure; this time it happened on an x86 build running the tests. The stack trace is a bit different, but the GC is still involved:

EXCEPTION_RECORD:  (.exr -1)
ExceptionAddress: 7251b419 (coreclr!WKS::my_get_size+0x0000000b)
   ExceptionCode: c0000005 (Access violation)
  ExceptionFlags: 00000001
NumberParameters: 2
   Parameter[0]: 00000000
   Parameter[1]: 00000000
Attempt to read from address 00000000

coreclr!WKS::gc_heap::mark_object_simple+0x69
coreclr!WKS::gc_heap::mark_through_cards_for_segments+0x413
coreclr!WKS::gc_heap::mark_phase+0x1bf
coreclr!WKS::gc_heap::gc1+0x6b
coreclr!WKS::gc_heap::garbage_collect+0x175
coreclr!WKS::GCHeap::GarbageCollectGeneration+0xea
coreclr!WKS::gc_heap::trigger_gc_for_alloc+0x19
coreclr!WKS::gc_heap::try_allocate_more_space+0x19c
coreclr!WKS::gc_heap::allocate_more_space+0x18
coreclr!WKS::GCHeap::Alloc+0x4f
coreclr!AllocateObject+0x117
coreclr!JIT_New+0xbe

I can share the dump if needed, but I think the repro steps that I shared are more valuable.

arekpalinski commented 2 years ago

I think we started to see this after we moved to .NET 6. But as I said, we do unmanaged memory allocations on our own, so this might be caused by us as well (we investigated it on our side and don't see how we could cause it, but it's still an option).

Although I think it might be important that my repro resulted in an AVE in some GC code while the heap was not corrupted (yet?).

arekpalinski commented 2 years ago

Hi, any update on this?

mangod9 commented 2 years ago

I just cloned the repro and seem to be able to reproduce it after some time. When I look at the dump you shared and the local repro, the corrupted object pointer seems to be located between these two objects. Could you please confirm that this is the case for all repros, and if so, is there anything special about these?

0:320> !lno 0x000001a5`d7f9c5e0
Before:  000001a5d7f9c350          664 (0x298)  Raven.Server.Documents.Queries.AST.MatchPath[]
After:   000001a5d7f9c5e8           32 (0x20)   System.Collections.Generic.List`1[[Sparrow.StringSegment, Sparrow]]
Heap local consistency confirmed.
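
(For readers following along: !lno / ListNearObj is the SOS command that prints the objects immediately before and after a given address, which is how the two neighbors above were identified.)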
mangod9 commented 2 years ago

From looking closer, this doesn't look like heap corruption but rather an issue with reporting gcinfo on the stack, possibly some race condition leading to an off-by-8 condition.

arekpalinski commented 2 years ago

Exactly, the repro I provided doesn't cause a heap corruption. That's why I commented:

Although I think it might be important that my repro resulted in an AVE in some GC code while the heap was not corrupted (yet?).

We use unmanaged code, so potentially heap corruption could be the root cause. But given that it reproduces without corrupting the heap, I think the issue isn't caused by our unmanaged code.

arekpalinski commented 2 years ago

Hi @mangod9. Do you have any leads on this? Is it a race condition as you suspected?

mangod9 commented 2 years ago

Yeah, it appears to be an issue with gcinfo. Still trying to track down the issue; we will try enabling stress logs and see if we can figure it out. Btw, do you happen to know if this started happening after moving to .NET 6?

arekpalinski commented 2 years ago

I think we didn't see it before (on .NET 5), but I'm not absolutely sure that's the case. It happens occasionally, in contrast to the repro code, which fails pretty consistently.

arekpalinski commented 2 years ago

Actually, I noticed that it hadn't happened to us for some time. Meanwhile we upgraded the dotnet version, so I thought it could be fixed. However, I have just run the repro code again (using dotnet 6.0.3) and it's still happening.

alex-jitbit commented 2 years ago

We run a massive app on ASP.NET Core and we started seeing similar random errors like this too after upgrading from .NET 5 to .NET 6:

Application: w3wp.exe
CoreCLR Version: 6.0.322.12309
.NET Version: 6.0.3
Description: The process was terminated due to an internal error in the .NET Runtime at IP 00007FF932A2A5A9 (00007FF932940000) with exit code 80131506.

I'm currently reverting some NuGet packages we upgraded to narrow down the possible causes.

UPD: it turns out in our case it was ImageSharp 2.1; reverting to 1.0.4 resolved the issue. I'm going to leave this comment here in case anyone bumps into the same issue.

mangod9 commented 2 years ago

I got back to investigating this again. Looking through the repro, it appears to be a codegen optimization issue with how gcinfos are reported. If I comment out this line: https://github.com/arekpalinski/ravendb/blob/7b9fd16c930699593c917360acd30a3da3f3c3d5/src/Raven.Server/Documents/Queries/Parser/QueryParser.cs#L391

and inline-define prev here: https://github.com/arekpalinski/ravendb/blob/7b9fd16c930699593c917360acd30a3da3f3c3d5/src/Raven.Server/Documents/Queries/Parser/QueryParser.cs#L465 with

 if (_synteticWithQueries?.TryGetValue(alias, out SynteticWithQuery prev) == true && !prev.IsEdge)

the failure didn't repro for me (at least with 10k iterations). @arekpalinski, could you please try that and check if it still reproduces for you? Thx
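
To make the workaround concrete, here is a minimal sketch of the suggested change (the surrounding code shape is assumed from the linked lines, not reproduced exactly):

// Before: prev is declared once earlier in the method (QueryParser.cs#L391)
// and reused at the later call site (QueryParser.cs#L465).
SynteticWithQuery prev;
// ...
if (_synteticWithQueries?.TryGetValue(alias, out prev) == true && !prev.IsEdge)

// After: the earlier declaration is removed and the local is scoped to the
// call site with an inline out-variable declaration.
if (_synteticWithQueries?.TryGetValue(alias, out SynteticWithQuery prev) == true && !prev.IsEdge)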

arekpalinski commented 2 years ago

I can confirm that it didn't repro on my end either after making those changes. I have also opened a PR to our repo with that workaround (https://github.com/ravendb/ravendb/pull/13968).

That indeed looks like it resolves the problem. Thank you for narrowing it down.

mangod9 commented 2 years ago

Ok, good to know. We will continue to investigate what's causing the bug, though I haven't been able to narrow down the repro.

ghost commented 2 years ago

Tagging subscribers to this area: @JulieLeeMSFT See info in area-owners.md if you want to be subscribed.

Author: arekpalinski
Assignees: -
Labels: `area-CodeGen-coreclr`
Milestone: 6.0.x
mangod9 commented 2 years ago

Moving to the codegen area, since this looks to be an issue with missing initialization of local structs in certain cases.
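
To illustrate the suspected failure mode, here is a hedged, hypothetical sketch (not the actual ravendb sources): if the JIT reports the GC-tracked slot of a struct local as live before the local has been initialized, a GC triggered in that window can interpret leftover stack bits as an object reference and crash while marking.

using System.Collections.Generic;

// Hypothetical struct, loosely modeled on SynteticWithQuery: it contains a
// reference-type field, so the GC must scan the local's stack slot while it is live.
struct QueryLike
{
    public object Alias;   // GC-tracked reference inside the struct
    public bool IsEdge;
}

class Repro
{
    private Dictionary<string, QueryLike> _queries = new();

    void Example(string alias)
    {
        QueryLike prev;             // the stack slot holds garbage until assigned
        var buffer = new byte[256]; // an allocation here can trigger a GC; if the
                                    // slot for prev.Alias is (incorrectly) reported
                                    // live at this point, the GC scans a junk pointer
        if (_queries.TryGetValue(alias, out prev) && !prev.IsEdge)
        {
            // ...
        }
    }
}

This is only an illustration of the bug class; with correct codegen the slot is either zero-initialized or reported dead until the assignment.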

AndyAyersMS commented 2 years ago

I have a prospective fix and a simpler repro case.

AndyAyersMS commented 2 years ago

6.0.6 is out now, so if you get a chance, please verify it properly fixes this issue.

AndyAyersMS commented 2 years ago

@arekpalinski I'm going to assume this is fixed; please re-open if you discover that's not the case.

arekpalinski commented 2 years ago

We're approaching the release of a new version soon, so for now we'll stay with the workaround just in case. Afterwards I will revert it and get back to you if I see any issues.