dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
14.94k stars 4.64k forks source link

AccessViolationException in .Net Core 2.2.203 #12691

Closed myarichuk closed 4 years ago

myarichuk commented 5 years ago

The exception happens intermittently when running specific RavenDB test in a 64-bit Windows process.

dotnet --info returns the following:

.NET Core SDK (reflecting any global.json):
 Version:   2.2.203
 Commit:    e5bab63eca

Runtime Environment:
 OS Name:     Windows
 OS Version:  10.0.17134
 OS Platform: Windows
 RID:         win10-x64
 Base Path:   C:\Program Files\dotnet\sdk\2.2.203\

Host (useful for support):
  Version: 2.2.4
  Commit:  f95848e524

.NET Core SDKs installed:
  2.2.203 [C:\Program Files\dotnet\sdk]

.NET Core runtimes installed:
  Microsoft.AspNetCore.All 2.1.0 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.All]
  Microsoft.AspNetCore.All 2.1.2 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.All]
  Microsoft.AspNetCore.All 2.1.4 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.All]
  Microsoft.AspNetCore.All 2.2.4 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.All]
  Microsoft.AspNetCore.App 2.1.0 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App]
  Microsoft.AspNetCore.App 2.1.2 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App]
  Microsoft.AspNetCore.App 2.1.4 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App]
  Microsoft.AspNetCore.App 2.2.4 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App]
  Microsoft.NETCore.App 2.2.4 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]

The following code was used to reproduce this. On my machine reproduction takes between 30 minutes to several hours to happen (the code for the test itself can be found here: https://github.com/ravendb/ravendb/blob/v4.2/test/SlowTests/Server/Documents/PeriodicBackup/RavenDB-11139.cs)

for (int i = 0; i < 500; i++)
{
    Parallel.For(0, 1000, new ParallelOptions
    {
        MaxDegreeOfParallelism = 2
    }, _ =>
    {
        using (var test = new RavenDB_11139())
        {
        test.CreateFullAndIncrementalBackupWithCompareExchange().Wait();
        }
    });
}

The stack trace of the AccessViolationException typically looks like this:

 # Child-SP          RetAddr           Call Site
00 (Inline Function) --------`-------- coreclr!Volatile<enum AppDomain::Stage>::Load+0x5 [e:\a\_work\572\s\src\inc\volatile.h @ 292] 
01 (Inline Function) --------`-------- coreclr!Volatile<enum AppDomain::Stage>::operator enum AppDomain::Stage+0x5 [e:\a\_work\572\s\src\inc\volatile.h @ 346] 
02 (Inline Function) --------`-------- coreclr!AppDomain::NoAccessToHandleTable+0x5 [e:\a\_work\572\s\src\vm\appdomain.hpp @ 3138] 
03 0000001a`f023a4c8 00007ffc`606af2f6 coreclr!MethodTable::GetLoaderAllocatorObjectForGC+0x11 [e:\a\_work\572\s\src\vm\methodtable.cpp @ 10100] 
04 0000001a`f023a4d0 00007ffc`6056e292 coreclr!WKS::gc_heap::mark_object_simple1+0x140d76 [e:\a\_work\572\s\src\gc\gc.cpp @ 17800] 
05 0000001a`f023a560 00007ffc`6056fd39 coreclr!WKS::gc_heap::mark_object_simple+0x1b2 [e:\a\_work\572\s\src\gc\gc.cpp @ 18285] 
06 0000001a`f023a5b0 00007ffc`6056f39a coreclr!WKS::GCHeap::Promote+0x79 [e:\a\_work\572\s\src\gc\gc.cpp @ 33830] 
07 (Inline Function) --------`-------- coreclr!GcInfoDecoder::ReportStackSlotToGC+0x49 [e:\a\_work\572\s\src\vm\gcinfodecoder.cpp @ 1841] 
08 (Inline Function) --------`-------- coreclr!GcInfoDecoder::ReportSlotToGC+0x6a [e:\a\_work\572\s\src\inc\gcinfodecoder.h @ 679] 
09 (Inline Function) --------`-------- coreclr!GcInfoDecoder::ReportUntrackedSlots+0x80 [e:\a\_work\572\s\src\vm\gcinfodecoder.cpp @ 1027] 
0a 0000001a`f023a610 00007ffc`6056d4a1 coreclr!GcInfoDecoder::EnumerateLiveSlots+0x59a [e:\a\_work\572\s\src\vm\gcinfodecoder.cpp @ 976] 
0b 0000001a`f023aa60 00007ffc`6056c4cc coreclr!EECodeManager::EnumGcRefs+0x141 [e:\a\_work\572\s\src\vm\eetwain.cpp @ 5130] 
0c 0000001a`f023ac00 00007ffc`6056ab8e coreclr!GcStackCrawlCallBack+0x11c [e:\a\_work\572\s\src\vm\gcenv.ee.common.cpp @ 304] 
0d (Inline Function) --------`-------- coreclr!Thread::MakeStackwalkerCallback+0x73 [e:\a\_work\572\s\src\vm\stackwalk.cpp @ 877] 
0e 0000001a`f023adb0 00007ffc`60567c64 coreclr!Thread::StackWalkFramesEx+0x106 [e:\a\_work\572\s\src\vm\stackwalk.cpp @ 958] 
0f 0000001a`f023b1a0 00007ffc`6055bed1 coreclr!Thread::StackWalkFrames+0xc0 [e:\a\_work\572\s\src\vm\stackwalk.cpp @ 1042] 
10 (Inline Function) --------`-------- coreclr!ScanStackRoots+0x84 [e:\a\_work\572\s\src\vm\gcenv.ee.cpp @ 149] 
11 (Inline Function) --------`-------- coreclr!GCToEEInterface::GcScanRoots+0x14a [e:\a\_work\572\s\src\vm\gcenv.ee.cpp @ 178] 
12 (Inline Function) --------`-------- coreclr!GCScan::GcScanRoots+0x14a [e:\a\_work\572\s\src\gc\gcscan.cpp @ 155] 
13 0000001a`f023c2b0 00007ffc`605eb3e1 coreclr!WKS::gc_heap::mark_phase+0x361 [e:\a\_work\572\s\src\gc\gc.cpp @ 19624] 
14 0000001a`f023c530 00007ffc`605edf3f coreclr!WKS::gc_heap::gc1+0xb1 [e:\a\_work\572\s\src\gc\gc.cpp @ 15390] 
15 (Inline Function) --------`-------- coreclr!GCToOSInterface::GetLowPrecisionTimeStamp+0x5 [e:\a\_work\572\s\src\vm\gcenv.os.cpp @ 704] 
16 0000001a`f023c630 00007ffc`605aba5e coreclr!WKS::gc_heap::garbage_collect+0x29f [e:\a\_work\572\s\src\gc\gc.cpp @ 16967] 
17 0000001a`f023c730 00007ffc`6057240e coreclr!WKS::GCHeap::GarbageCollectGeneration+0x19e [e:\a\_work\572\s\src\gc\gc.cpp @ 35107] 
18 (Inline Function) --------`-------- coreclr!WKS::gc_heap::try_allocate_more_space+0x233 [e:\a\_work\572\s\src\gc\gc.cpp @ 13197] 
19 (Inline Function) --------`-------- coreclr!WKS::gc_heap::allocate_more_space+0x233 [e:\a\_work\572\s\src\gc\gc.cpp @ 13490] 
1a (Inline Function) --------`-------- coreclr!WKS::gc_heap::allocate+0x5a9 [e:\a\_work\572\s\src\gc\gc.cpp @ 13521] 
1b (Inline Function) --------`-------- coreclr!WKS::GCHeap::Alloc+0x5d1 [e:\a\_work\572\s\src\gc\gc.cpp @ 34419] 
1c (Inline Function) --------`-------- coreclr!Alloc+0x6ad [e:\a\_work\572\s\src\vm\gchelpers.cpp @ 241] 
1d (Inline Function) --------`-------- coreclr!AllocateObject+0x701 [e:\a\_work\572\s\src\vm\gchelpers.cpp @ 1156] 
1e 0000001a`f023c780 00007ffc`554e4bfb coreclr!JIT_New+0x7ee [e:\a\_work\572\s\src\vm\jithelpers.cpp @ 2810] 
1f 0000001a`f023cb50 00007ffc`554e6e68 System_Private_CoreLib!System.Attribute.InternalGetCustomAttributes(System.Reflection.PropertyInfo, System.Type, Boolean)$##60004EF+0x5b
20 0000001a`f023cbc0 00007ffc`00ba2f50 System_Private_CoreLib!System.Reflection.CustomAttributeExtensions.GetCustomAttributes[System.__Canon](System.Reflection.MemberInfo)$##600333B+0x38
21 0000001a`f023cc10 00007ffc`703ac3ed 0x00007ffc`00ba2f50
22 0000001a`f023cc70 00007ffc`703afb03 System_Linq!System.Linq.Enumerable+RangeIterator.TryGetLast(Boolean ByRef)$##6000151+0xd9d
23 0000001a`f023ccc0 00007ffc`703b5ca6 System_Linq!System.Linq.Enumerable+RangeIterator.TryGetLast(Boolean ByRef)$##6000151+0x44b3
24 0000001a`f023cd50 00007ffc`703b8e2d System_Linq!System.Linq.Enumerable+RangeIterator.TryGetLast(Boolean ByRef)$##6000151+0xa656
25 0000001a`f023cda0 00007ffc`703ad6ff System_Linq!System.Linq.Error.NotSupported()$##60002B0+0x2c6d
26 0000001a`f023ce00 00007ffc`00ba1aee System_Linq!System.Linq.Enumerable+RangeIterator.TryGetLast(Boolean ByRef)$##6000151+0x20af
27 0000001a`f023ce50 00007ffc`00b9f6fa 0x00007ffc`00ba1aee
28 0000001a`f023d740 00007ffc`00b9ed4a 0x00007ffc`00b9f6fa
29 0000001a`f023d860 00007ffc`01d79f92 0x00007ffc`00b9ed4a
2a 0000001a`f023dca0 00007ffc`01d7928f 0x00007ffc`01d79f92
2b 0000001a`f023dda0 00007ffc`02311f9d 0x00007ffc`01d7928f
2c 0000001a`f023dee0 00007ffc`01d8877a 0x00007ffc`02311f9d
2d 0000001a`f023e070 00007ffc`01d87633 0x00007ffc`01d8877a
2e 0000001a`f023e100 00007ffc`01d87210 0x00007ffc`01d87633
2f 0000001a`f023e420 00007ffc`01b513ed 0x00007ffc`01d87210
30 0000001a`f023e460 00007ffc`01d87086 0x00007ffc`01b513ed
31 0000001a`f023e4d0 00007ffc`01d4d7ef 0x00007ffc`01d87086
32 0000001a`f023e590 00007ffc`01cf36d1 0x00007ffc`01d4d7ef
33 0000001a`f023e600 00007ffc`55422019 0x00007ffc`01cf36d1
34 0000001a`f023e670 00007ffc`55510052 System_Private_CoreLib!System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object)$##600288E+0x89
35 0000001a`f023e6f0 00007ffc`6064a673 System_Private_CoreLib!System.Threading.ThreadPoolWorkQueue.Dispatch()$##6002670+0x1a2
36 0000001a`f023e790 00007ffc`60522c49 coreclr!CallDescrWorkerInternal+0x83 [E:\A\_work\572\s\src\vm\amd64\CallDescrWorkerAMD64.asm @ 101] 
37 (Inline Function) --------`-------- coreclr!CallDescrWorkerWithHandler+0x53 [e:\a\_work\572\s\src\vm\callhelpers.cpp @ 78] 
38 0000001a`f023e7d0 00007ffc`605228a8 coreclr!MethodDescCallSite::CallTargetWorker+0x2b5 [e:\a\_work\572\s\src\vm\callhelpers.cpp @ 628] 
39 (Inline Function) --------`-------- coreclr!MethodDescCallSite::Call_RetBool+0x15 [e:\a\_work\572\s\src\vm\callhelpers.h @ 433] 
3a 0000001a`f023e920 00007ffc`6052273c coreclr!QueueUserWorkItemManagedCallback+0x98 [e:\a\_work\572\s\src\vm\comthreadpool.cpp @ 451] 
3b (Inline Function) --------`-------- coreclr!ManagedThreadBase_DispatchInner+0x5b [e:\a\_work\572\s\src\vm\threads.cpp @ 8850] 
3c 0000001a`f023ea10 00007ffc`605225c3 coreclr!ManagedThreadBase_DispatchMiddle+0xe4 [e:\a\_work\572\s\src\vm\threads.cpp @ 8901] 
3d 0000001a`f023eb40 00007ffc`605224c0 coreclr!ManagedThreadBase_DispatchOuter+0xaf [e:\a\_work\572\s\src\vm\threads.cpp @ 9140] 
3e (Inline Function) --------`-------- coreclr!ManagedThreadBase_FullTransitionWithAD+0x56 [e:\a\_work\572\s\src\vm\threads.cpp @ 9199] 
3f (Inline Function) --------`-------- coreclr!ManagedThreadBase::ThreadPool+0x56 [e:\a\_work\572\s\src\vm\threads.cpp @ 9241] 
40 0000001a`f023ebe0 00007ffc`60521d1f coreclr!ManagedPerAppDomainTPCount::DispatchWorkItem+0xe0 [e:\a\_work\572\s\src\vm\threadpoolrequest.cpp @ 758] 
41 (Inline Function) --------`-------- coreclr!ThreadpoolMgr::ExecuteWorkRequest+0x139 [e:\a\_work\572\s\src\vm\win32threadpool.cpp @ 1560] 
42 0000001a`f023ed40 00007ffc`605aa34a coreclr!ThreadpoolMgr::WorkerThreadStart+0x2df [e:\a\_work\572\s\src\vm\win32threadpool.cpp @ 2025] 
43 0000001a`f023ef40 00007ffc`8e024034 coreclr!Thread::intermediateThreadProc+0x8a [e:\a\_work\572\s\src\vm\threads.cpp @ 2255] 
44 0000001a`f023fa80 00007ffc`8fe63691 kernel32!BaseThreadInitThunk+0x14
45 0000001a`f023fab0 00000000`00000000 ntdll!RtlUserThreadStart+0x21

I have captured several dumps from when the AVE happened. They are available to download from this link: https://drive.google.com/file/d/1bNY6GZw3BhpsK73i8naiEB-AY_-LdNjy/view?usp=sharing

AndyAyersMS commented 5 years ago

Looking at dump dotnet/coreclr#3, the heap is corrupted:

0:077> !VerifyHeap
object 000001f1b1bc0108: bad member 000001F1B1B97E41 at 000001F1B1BC0118
Last good object: 000001F1B1BC00E8.
object 000001f1b1bc0340: bad member 000001F1B1BC0118 at 000001F1B1BC0348
Last good object: 000001F1B1BC0318.

The corrupted addresses are very near valid addresses:

!do 000001f1b1bc0340
Name:        Voron.Util.PageTable+PagesBuffer
MethodTable: 00007ffbf4091728
EEClass:     00007ffbf408dcd0
Size:        32(0x20) bytes
File:        C:\Work\ravendb4\test\Tryouts\bin\Debug\netcoreapp2.2 - Copy\Voron.dll
Fields:
              MT    Field   Offset                 Type VT     Attr            Value Name
00007ffbf40b8008  40004ab        8 ...pl.PagePosition[]  0 instance 000001f1b1bc0118 PagePositions
00007ffc45156a70  40004ac       10         System.Int32  1 instance                0 Start
00007ffc45156a70  40004ad       14         System.Int32  1 instance                1 End
0:077> !DumpObj /d 000001f1b1bc0118
<Note: this object has an invalid CLASS field>
Invalid object

I suspect this perhaps should be 000001f1b1bc0108 ( vs ...18) which means perhaps one bit flipped in the address value:

0:077> !DumpArray /d 000001f1b1bc0108
Name:        Voron.Impl.PagePosition[]
MethodTable: 00007ffbf40b8008
EEClass:     00007ffc447f6c08
Size:        40(0x28) bytes
Array:       Rank 1, Number of elements 2, Type CLASS
Element Methodtable: 00007ffbf407edb0
[0] 000001f1b1b97e41
[1] null

array element at 0 is also corrupt, looks like here perhaps the low bit got flipped:

0:077> !DumpObj /d 000001f1b1b97e41
<Note: this object has an invalid CLASS field>
Invalid object

0:077> !DumpObj /d 000001f1b1b97e40
Name:        Voron.Impl.PagePosition
MethodTable: 00007ffbf407edb0
EEClass:     00007ffbf408cac0
Size:        48(0x30) bytes
File:        C:\Work\ravendb4\test\Tryouts\bin\Debug\netcoreapp2.2 - Copy\Voron.dll
Fields:
              MT    Field   Offset                 Type VT     Attr            Value Name
00007ffc45156b70  40001dd        8         System.Int64  1 instance 2 TransactionId
00007ffc45156b70  40001de       10         System.Int64  1 instance 1 JournalNumber
00007ffc45156b70  40001df       18         System.Int64  1 instance 414 ScratchPage
00007ffc45156a70  40001e0       20         System.Int32  1 instance                0 ScratchNumber
00007ffc45154bb0  40001e1       24       System.Boolean  1 instance                0 IsFreedPageMarker
00007ffc45154bb0  40001e2       25       System.Boolean  1 instance                0 UnusedInPTT

My guess would be that perhaps something in the app is corrupting these values. These single bit flips do not look like the pattern we'd see if we had an issue in GC reporting.

Does this app do single-bit updates to memory (say via pointers)?

AndyAyersMS commented 5 years ago

Dump 1 looks similar -- same stack and a bit is cleared in a method table address for an object in an array.

Dump 2 seems to have more severe corruption and a different stack.

leculver commented 5 years ago

@myarichuk Could I trouble you to re-run this test with stress logs enabled? This would be "set COMPlus_StressLog=1" as an environment variable before running the test and reproducing the crash. (The logs themselves are captured inside the crash dump, so you only have to share the dump...no extra files.)

This greatly helps us track down this kind of failure. Thank you!

myarichuk commented 5 years ago

Thanks for looking into this! Sure, I will rerun this with this env variable and share the results asap.

myarichuk commented 5 years ago

Does this app do single-bit updates to memory (say via pointers)?

There are couple of places we do single-bit flips through pointers, but those are heavily tested areas and judging by the corrupted objects pretty far from those code areas.

myarichuk commented 5 years ago

The "single bit flipping" sounded suspicious to me so I checked my memory with memtest86 and tried to reproduce this issue on another machine. It appears this was happening due to hardware issue in memory sticks because: A) it didn't reproduce on another machine B) memtest86 has shown I had hardware issues in my memory

Thus, I am closing this issue because I think the underlying reason for this access violation exception is faulty memory sticks. What do you think? Can there be any other issue besides faulty memory?

leculver commented 5 years ago

What do you think? Can there be any other issue besides faulty memory?

It's very likely this is due to faulty memory. Even if it's not, the very first step will be to reproduce this issue on a machine where the memory doesn't fault. =)

Please do feel free to file another issue if you see this again on another machine!