dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
14.89k stars 4.63k forks source link

Access Violation Error in GC DiagTraceGCSegments #42743

Open mgasparel opened 3 years ago

mgasparel commented 3 years ago

Description

This week our application crashed with an access violation error. We were able to retrieve an automatic crash dump, and it appears that this exception originated in the GC at coreclr!SVR::GCHeap::DiagTraceGCSegments+0x94 [f:\workspace\_work\1\s\src\gc\gcee.cpp @ 444]

Configuration

CLR Version: 4.700.20.36602

OS_VERSION: 10.0.14393.2430

OSPLATFORM_TYPE: x64

OSNAME: Windows 10

Other information

I did some digging around in this repos issues and could not find anything that seemed relevant to this case.

crash dump analysis from windbg below:

0:415> !analyze -v
*******************************************************************************
* *
* Exception Analysis *
* *
*******************************************************************************
Failed to request MethodData, not in JIT code range
KEY_VALUES_STRING: 1
Key : AV.Fault
 Value: Read
Key : Analysis.CPU.Sec
 Value: 22
Key : Analysis.DebugAnalysisProvider.CPP
 Value: Create: 8007007e on [omitted]
Key : Analysis.DebugData
 Value: CreateObject
Key : Analysis.DebugModel
 Value: CreateObject
Key : Analysis.Elapsed.Sec
 Value: 21
Key : Analysis.Memory.CommitPeak.Mb
 Value: 322
Key : Analysis.System
 Value: CreateObject
Key : CLR.Engine
 Value: CORECLR
Key : CLR.Version
 Value: 4.700.20.36602
Key : Timeline.OS.Boot.DeltaSec
 Value: 194834
Key : Timeline.Process.Start.DeltaSec
 Value: 194540

NTGLOBALFLAG: 0
PROCESS_BAM_CURRENT_THROTTLED: 0
PROCESS_BAM_PREVIOUS_THROTTLED: 0
APPLICATION_VERIFIER_FLAGS: 0
CONTEXT: (.ecxr)
rax=0000000000000000 rbx=000001600c5f0000 rcx=0000015a189796a0
rdx=0000015efded1000 rsi=0000000000000010 rdi=0000000000000002
rip=00007ff867f2eda4 rsp=000000617b07f0e0 rbp=0000015a18930360
 r8=000000000ffff000 r9=0000000000000000 r10=00000fff0cfe5da2
r11=0000000400000000 r12=0000000000003f00 r13=0000000000000001
r14=0000000000000001 r15=0000000000000000
iopl=0 nv up ei pl nz na po nc
cs=0033 ss=002b ds=002b es=002b fs=0053 gs=002b efl=00010206
coreclr!SVR::GCHeap::DiagTraceGCSegments+0x94:
00007ff8`67f2eda4 488b5320 mov rdx,qword ptr [rbx+20h] ds:00000160`0c5f0020=????????????????
Resetting default scope
EXCEPTION_RECORD: (.exr -1)
ExceptionAddress: 00007ff867f2eda4 (coreclr!SVR::GCHeap::DiagTraceGCSegments+0x0000000000000094)
 ExceptionCode: c0000005 (Access violation)
 ExceptionFlags: 00000000
NumberParameters: 2
 Parameter[0]: 0000000000000000
 Parameter[1]: 000001600c5f0020
Attempt to read from address 000001600c5f0020
PROCESS_NAME: [omitted].exe
READ_ADDRESS: 000001600c5f0020
ERROR_CODE: (NTSTATUS) 0xc0000005 - The instruction at 0x%p referenced memory at 0x%p. The memory could not be %s.
EXCEPTION_CODE_STR: c0000005
EXCEPTION_PARAMETER1: 0000000000000000
EXCEPTION_PARAMETER2: 000001600c5f0020
STACK_TEXT:
00000061`7b07f0e0 00007ff8`67e2e173 : 00000000`80000000 00000061`7b07f231 00000000`00000001 00000000`00000001 : coreclr!SVR::GCHeap::DiagTraceGCSegments+0x94
00000061`7b07f120 00007ff8`67e363e5 : 00000000`00000000 00000000`00000001 00000000`00000001 00000000`00000001 : coreclr!ETW::GCLog::GCSettingsEvent+0x127
00000061`7b07f170 00007ff8`67e3690c : 00000000`00010000 00007ff8`6812d690 00000000`00000001 00000000`00000001 : coreclr!EtwCallback+0x14d
00000061`7b07f280 00007ff8`88875d8d : 00000000`00000004 00000061`7b07f400 00000000`00000000 00000000`00000000 : coreclr!McGenControlCallbackV2+0x12c
00000061`7b07f2d0 00007ff8`88875a15 : 0000015e`a909b2b0 00000061`7b07f7f8 0000015a`183c19d0 00007ff8`87784f49 : ntdll!EtwpEventApiCallback+0xbd
00000061`7b07f360 00007ff8`8884c9fb : 0000015a`183c1900 00000000`00000101 00000061`7b07f438 00007ff8`8778002d : ntdll!EtwpUpdateEnableInfoAndCallback+0xb9
00000061`7b07f3a0 00007ff8`8884c7bb : 0000015a`183c19d0 00000000`00000000 00000000`00000000 00000000`00000100 : ntdll!EtwpProcessNotification+0x43
00000061`7b07f3d0 00007ff8`8884c4ec : 0000015e`9e980800 00000061`00000000 00000000`00000178 00000061`7b07f470 : ntdll!EtwDeliverDataBlock+0x8b
00000061`7b07f430 00007ff8`88885b70 : 0000015a`18330980 00000061`7b07f7f8 0000015a`18330980 00000000`00000000 : ntdll!EtwpNotificationThread+0x6c
00000061`7b07f5a0 00007ff8`8886a073 : 0000015a`18330b08 0000015e`a52e7910 00000000`00000000 00000000`00000010 : ntdll!TppExecuteWaitCallback+0x8c
00000061`7b07f5e0 00007ff8`860384d4 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : ntdll!TppWorkerThread+0x923
00000061`7b07f9e0 00007ff8`888ae871 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : kernel32!BaseThreadInitThunk+0x14
00000061`7b07fa10 00000000`00000000 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : ntdll!RtlUserThreadStart+0x21

FAULTING_SOURCE_LINE: f:\workspace\_work\1\s\src\gc\gcee.cpp
FAULTING_SOURCE_FILE: f:\workspace\_work\1\s\src\gc\gcee.cpp
FAULTING_SOURCE_LINE_NUMBER: 444
SYMBOL_NAME: coreclr!SVR::GCHeap::DiagTraceGCSegments+94
MODULE_NAME: coreclr
IMAGE_NAME: coreclr.dll
STACK_COMMAND: ~415s ; .ecxr ; kb
FAILURE_BUCKET_ID: INVALID_POINTER_READ_ETWCALLBACK_c0000005_coreclr.dll!SVR::GCHeap::DiagTraceGCSegments
OS_VERSION: 10.0.14393.2430
BUILDLAB_STR: rs1_release_inmarket_aim
OSPLATFORM_TYPE: x64
OSNAME: Windows 10
FAILURE_ID_HASH: {f5456149-b552-c332-6b94-272446e7c6e7}
Followup: MachineOwner
---------
0:415> ~415s ; .ecxr ; kb
ntdll!NtWaitForMultipleObjects+0x14:
00007ff8`888e6554 c3 ret
rax=0000000000000000 rbx=000001600c5f0000 rcx=0000015a189796a0
rdx=0000015efded1000 rsi=0000000000000010 rdi=0000000000000002
rip=00007ff867f2eda4 rsp=000000617b07f0e0 rbp=0000015a18930360
 r8=000000000ffff000 r9=0000000000000000 r10=00000fff0cfe5da2
r11=0000000400000000 r12=0000000000003f00 r13=0000000000000001
r14=0000000000000001 r15=0000000000000000
iopl=0 nv up ei pl nz na po nc
cs=0033 ss=002b ds=002b es=002b fs=0053 gs=002b efl=00010206
coreclr!SVR::GCHeap::DiagTraceGCSegments+0x94:
00007ff8`67f2eda4 488b5320 mov rdx,qword ptr [rbx+20h] ds:00000160`0c5f0020=????????????????
 *** Stack trace for last set context - .thread/.cxr resets it
 # RetAddr : Args to Child : Call Site
00 00007ff8`67e2e173 : 00000000`80000000 00000061`7b07f231 00000000`00000001 00000000`00000001 : coreclr!SVR::GCHeap::DiagTraceGCSegments+0x94 [f:\workspace\_work\1\s\src\gc\gcee.cpp @ 444]
01 00007ff8`67e363e5 : 00000000`00000000 00000000`00000001 00000000`00000001 00000000`00000001 : coreclr!ETW::GCLog::GCSettingsEvent+0x127 [f:\workspace\_work\1\s\src\vm\eventtrace.cpp @ 449]
02 00007ff8`67e3690c : 00000000`00010000 00007ff8`6812d690 00000000`00000001 00000000`00000001 : coreclr!EtwCallback+0x14d [f:\workspace\_work\1\s\src\vm\eventtrace.cpp @ 4512]
03 00007ff8`88875d8d : 00000000`00000004 00000061`7b07f400 00000000`00000000 00000000`00000000 : coreclr!McGenControlCallbackV2+0x12c [f:\workspace\_work\1\s\bin\obj\windows_nt.x64.release\src\inc\etw\clretwall.h @ 226]
04 00007ff8`88875a15 : 0000015e`a909b2b0 00000061`7b07f7f8 0000015a`183c19d0 00007ff8`87784f49 : ntdll!EtwpEventApiCallback+0xbd
05 00007ff8`8884c9fb : 0000015a`183c1900 00000000`00000101 00000061`7b07f438 00007ff8`8778002d : ntdll!EtwpUpdateEnableInfoAndCallback+0xb9
06 00007ff8`8884c7bb : 0000015a`183c19d0 00000000`00000000 00000000`00000000 00000000`00000100 : ntdll!EtwpProcessNotification+0x43
07 00007ff8`8884c4ec : 0000015e`9e980800 00000061`00000000 00000000`00000178 00000061`7b07f470 : ntdll!EtwDeliverDataBlock+0x8b
08 00007ff8`88885b70 : 0000015a`18330980 00000061`7b07f7f8 0000015a`18330980 00000000`00000000 : ntdll!EtwpNotificationThread+0x6c
09 00007ff8`8886a073 : 0000015a`18330b08 0000015e`a52e7910 00000000`00000000 00000000`00000010 : ntdll!TppExecuteWaitCallback+0x8c
0a 00007ff8`860384d4 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : ntdll!TppWorkerThread+0x923
0b 00007ff8`888ae871 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : kernel32!BaseThreadInitThunk+0x14
0c 00000000`00000000 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : ntdll!RtlUserThreadStart+0x21
0:415> !address 000001600c5f0020
Usage: <unknown>
Base Address: 0000015f`ff9b2000
End Address: 00000160`1c5f0000
Region Size: 00000000`1cc3e000 ( 460.242 MB)
State: <info not present at the target>
Protect: <info not present at the target>
Type: <info not present at the target>
Allocation Base: <info not present at the target>
Allocation Protect: <info not present at the target>

Content source: 0 (invalid), length: fffffe0
0:415> !heap -s
************************************************************************************************************************
 NT HEAP STATS BELOW
************************************************************************************************************************
LFH Key : 0x96cf1311aace65df
Termination on corruption : ENABLED
 Heap Flags Reserv Commit Virt Free List UCR Virt Lock Fast
 (k) (k) (k) (k) length blocks cont. heap
-------------------------------------------------------------------------------------
0000015a18330000 00000002 146068 137564 145868 10636 1033 18 4 e549 LFH
0000015a18170000 00008000 64 4 64 2 1 1 0 0
0000015a18310000 00001002 3324 2208 3124 183 36 3 0 0 LFH
0000015a18da0000 00001002 64168 48340 63968 77 247 8 1 1 LFH
0000015a18d20000 00041002 60 8 60 5 1 1 0 0
0000015ea5f50000 00041002 60 8 60 2 1 1 0 0
0000015e993e0000 00001002 1280 88 1080 20 10 2 0 0 LFH
0000015a10470000 00001002 1280 148 1080 12 4 2 0 0 LFH
-------------------------------------------------------------------------------------

Happy to assist if you need any further information or analysis on the dump.

ghost commented 3 years ago

Tagging subscribers to this area: @dotnet/ncl See info in area-owners.md if you want to be subscribed.

ghost commented 3 years ago

Tagging subscribers to this area: @dotnet/gc See info in area-owners.md if you want to be subscribed.

danmoseley commented 3 years ago

@mgasparel I don't have a CLR version- release version mapping handy. What version of .NET Core / .NET 5 is this?

mgasparel commented 3 years ago

@danmosemsft This should be 3.1.7

cshung commented 3 years ago

@mgasparel, where is the dump?

danmoseley commented 3 years ago

@mgasparel one possible secure way to submit a dump to @cshung is to open a dummy feedback issue here https://developercommunity.visualstudio.com/spaces/61/index.html and share the ID.

mgasparel commented 3 years ago

@danmosemsft Our security team has indicated that we can not share the raw dump. Is there anything we can do without sharing the dump? Any specific commands/analysis you would want to run? I'm happy to be your remote hands and provide output/results.

danmoseley commented 3 years ago

@mgasparel i will let @cshung guide you as he’s the area owner.

cshung commented 3 years ago

I had a call with Mike, and he indicated that:

  1. The dump is not currently accessible.
  2. They are going to move to Linux.
  3. It might be hard to reproduce this bug.

For now, there is not much we can do, and we are closing this issue for now. We can revisit it if it happens again or when we have more information to work on.

cshung commented 3 years ago

The customer reported back that he can access the dump again. We will take a look and see if it is actionable.

mgasparel commented 3 years ago

Just wanted to let you know that we had this issue on another server this week

PerfView arguments:

PerfView.exe collect -CollectMultiple=1000000 -MaxCollectSec=10 -AcceptEULA -NoView -NoGui -CircularMB=1024 -BufferSize=1024 -CPUSampleMSec:10 -ClrEvents=JITSymbols+GC+GCHeapSurvivalAndMovement+Stack+Exception -KernelEvents=process+thread+ImageLoad+Profile+ThreadTime -DotNetAllocSampled -NoNGenRundown -NoV2Rundown -LogFile:perfviewlog-$timestamp.txt -DataFile:PerfView-$timestamp.etl
mgasparel commented 3 years ago

@cshung

cshung commented 3 years ago

The bug is understood.

When a new ETW session is established, coreclr!EtwCallback will be called on a thread. Under certain conditions, coreclr!EtwCallback calls coreclr!ETW::GCLog::GCSettingsEvent which then calls GCHeap::DiagTraceGCSegments. It will access the segment list without synchronizing with the GC. By chance, we might hit a case where the GC alters the segment list (e.g. when the GC decommit a segment), and that would cause an access violation as we witness in this issue.

With regard to whether or not it will repro with perfcollect, can you check if you have a similar call to coreclr!ETW::GCLog::GCSettingsEvent when perfcollect happens?

@dotnet/dotnet-diag

baal2000 commented 3 years ago

@cshung Could you please confirm if this is considered PerfView issue or the framework's (GC) or both? If the framework's: is it going to be resolved by 6.0, or earlier in the version 5 of the framework?

sywhang commented 3 years ago

With regard to whether or not it will repro with perfcollect, can you check if you have a similar call to coreclr!ETW::GCLog::GCSettingsEvent when perfcollect happens?

As far as I can tell GCLog::GCSettingsEvent is only called from EtwCallback. Using perfcollect or EventPipe does not trigger the runtime to call back into EtwCallback as this callback is specifically for ETW only.

cshung commented 3 years ago

@baal2000 The bug is in the runtime. @mangod9 can help with the question about when and where we will deliver the fix.

baal2000 commented 3 years ago

@mangod9 Is there an ETA on this issue update?

mangod9 commented 3 years ago

Hi @baal2000, we hope to get to it in the next few .net 6 previews. Hope that works for you?

baal2000 commented 3 years ago

@mangod9 Ideally this is fixed in version 5 because preview frameworks are not for production use. The hope is that the fix is narrow enough to not represent much of a regression risk. Thank you.

mangod9 commented 3 years ago

Yeah once we have a fix ready we can determine risk of a backport. Thx.

mangod9 commented 3 years ago

@cshung, is this issue and/or fix well understood?

mangod9 commented 3 years ago

Moving to 7 since this is a corner case.

Maoni0 commented 2 years ago

I don't think we have time for this 7.0. moving this out; let me know if you disagree.

Maoni0 commented 2 years ago

(sorry accidentally closed)

baal2000 commented 2 years ago

I disagree this is a corner case.

indy-singh commented 11 months ago

Hi,

We rolled out PerfView to help with this issue and one of the things we've seen is that PerfView on occasion itself causes the app the crash:

access-violation-exception

This isn't great. Probably happens once every two weeks.

I don't know if it is the runtime or PerfView at fault here.

Cheers, Indy