dotnet / diagnostics

This repository contains the source code for various .NET Core runtime diagnostic tools and documents.
MIT License

NativeAOT crash dumps bucket all user exceptions as APPLICATION_FAULT #4099

Closed 10 months ago; opened by agocke

agocke commented 2 years ago

Here's what !analyze -v sees for a simple console app that throws an ApplicationException when there are no symbols/PDB:

PROCESS_NAME:  aotExc.exe

ERROR_CODE: (NTSTATUS) 0xa8527f6c - <Unable to get error code text>

EXCEPTION_CODE_STR:  a8527f6c

STACK_TEXT:  
0000000c`9e2feac8 00007ffe`1f3e5eb4     : 0000000c`9e2ff118 00000000`000001c0 ffffffff`f70f2e80 0000000c`9e2ff310 : ntdll!NtRaiseException+0x14
0000000c`9e2fead0 00007ff6`b2eaf8d4     : 00000000`00000002 00000000`00000064 00007ff6`b2ecb0b4 00000000`00000000 : KERNELBASE!RaiseFailFastException+0x144
0000000c`9e2ff0b0 00007ff6`b2e008f2     : 000001cf`80005f30 00007ff6`b30c4020 0000000c`9e2ff2d0 00007ff6`b2dfa6b8 : aotExc+0x14f8d4
0000000c`9e2ff200 00007ff6`b2e00762     : 00000028`ffffffff 000001cf`80005f30 ffffffff`ffffffff ffffffff`00000118 : aotExc+0xa08f2
0000000c`9e2ff270 00007ff6`b2e8562c     : 000001cf`80100000 00007ff6`b2e62993 00007ff6`b30075f8 00007ff6`b2f3d8b5 : aotExc+0xa0762
0000000c`9e2ff2e0 00007ff6`b2e85b3a     : 000001cf`80005f30 ffffffff`ffffffff 0000000c`9e2ff940 0000000c`9e2ff920 : aotExc+0x12562c
0000000c`9e2ff830 00007ff6`b2e859a1     : 00000000`00000019 000001cf`80017d18 00740061`ffffffff 002e006e`006f0069 : aotExc+0x125b3a
0000000c`9e2ff8d0 00007ff6`b2d64cc7     : 000001cf`80005f30 003d006e`0065006b 00ff00ff`00ff00ff 00ff00ff`00ff00ff : aotExc+0x1259a1
0000000c`9e2ff900 00007ff6`b2ecb0b4     : 000001cf`80005808 00000000`00000000 00000000`00000000 00007ff6`b2eaebb9 : aotExc+0x4cc7
0000000c`9e2ffc90 00007ff6`b2ecb074     : 00000000`00000001 000001cf`80003180 000001cf`00000000 00007ff6`b30505d8 : aotExc+0x16b0b4
0000000c`9e2ffcc0 00007ff6`b2f2cf53     : 000001cf`e064a530 00000000`00000000 00007ff6`0000000a 00007ff6`b2de9241 : aotExc+0x16b074
0000000c`9e2ffcf0 00007ff6`b2de5bee     : 000001cf`e064a530 00007ff6`b30740a0 00007ff6`b2d60000 00000000`00000001 : aotExc+0x1ccf53
0000000c`9e2ffd50 00007ff6`b2dc9918     : 000001cf`e064a530 00000000`00000000 00000000`00000000 00000000`00000000 : aotExc+0x85bee
0000000c`9e2ffda0 00007ffe`200154e0     : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : aotExc+0x69918
0000000c`9e2ffde0 00007ffe`217c485b     : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : kernel32!BaseThreadInitThunk+0x10
0000000c`9e2ffe10 00000000`00000000     : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : ntdll!RtlUserThreadStart+0x2b

STACK_COMMAND:  ~0s; .ecxr ; kb

SYMBOL_NAME:  aotExc+16b0b4

MODULE_NAME: aotExc

IMAGE_NAME:  aotExc.exe

FAILURE_BUCKET_ID:  APPLICATION_FAULT_a8527f6c_aotExc.exe!Unknown

OS_VERSION:  10.0.22000.1

BUILDLAB_STR:  co_release

OSPLATFORM_TYPE:  x64

OSNAME:  Windows 10

IMAGE_VERSION:  1.0.0.0

FAILURE_ID_HASH:  {1ccda2eb-5034-ce7f-51d9-e68f8dbe83a9}

Followup:     MachineOwner
---------

The failure bucket ID is APPLICATION_FAULT_a8527f6c_aotExc.exe!Unknown, and the top of the stack is in RaiseFailFastException. This seems to indicate that !analyze can't see the actual exception type.
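To make the shape of that bucket ID concrete, here is a minimal sketch that splits it into its apparent parts. The layout (problem class, 8-digit hex exception code, image name, symbol after `!`) is an assumption inferred from the sample output above, not a documented format, and `parse_bucket_id` is a hypothetical helper.

```python
import re

# Assumed layout, inferred from the sample output above (not a documented format):
#   <problem class>_<8 hex digit exception code>_<image name>!<symbol>
BUCKET_RE = re.compile(
    r"(?P<problem_class>.+)_(?P<code>[0-9a-fA-F]{8})_(?P<image>[^!]+)!(?P<symbol>.+)"
)


def parse_bucket_id(bucket_id):
    """Split a FAILURE_BUCKET_ID string into its apparent components."""
    match = BUCKET_RE.fullmatch(bucket_id)
    if match is None:
        raise ValueError(f"unrecognized bucket ID: {bucket_id!r}")
    return match.groupdict()


parts = parse_bucket_id("APPLICATION_FAULT_a8527f6c_aotExc.exe!Unknown")
for name, value in parts.items():
    print(f"{name}: {value}")
```

None of the parsed fields names the managed exception type: only the generic problem class, the exception code, and the image name remain, and the symbol is Unknown.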

dotnet-issue-labeler[bot] commented 2 years ago

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

ghost commented 2 years ago

Tagging subscribers to this area: @tommcdon See info in area-owners.md if you want to be subscribed.

Issue Details
Author: agocke
Assignees: -
Labels: `area-Diagnostics-coreclr`
Milestone: -

agocke commented 2 years ago

fyi @MichalStrehovsky

MichalStrehovsky commented 2 years ago

Is it expected we would be able to find anything useful without symbols?

A symbol-less dump for a C++ app that crashes with an unhandled exception is bucketed as FAILURE_BUCKET_ID: MISSING_CRITICAL_SYMBOLS_ntdll.dll_FAIL_FAST_FATAL_APP_EXIT_c0000409_main.exe!Unknown.

0:000> !analyze -v
*******************************************************************************
*                                                                             *
*                        Exception Analysis                                   *
*                                                                             *
*******************************************************************************

Warning: Not BugCheck 0x14C 
Error: Cannot perform bugcheck analysis for 0x14C, necessary tools are missing
       Dexter, stacks.wds

KEY_VALUES_STRING: 1

    Key  : Analysis.CPU.mSec
    Value: 780

    Key  : Analysis.Elapsed.mSec
    Value: 79004

    Key  : Analysis.Init.CPU.mSec
    Value: 2593

    Key  : Analysis.Init.Elapsed.mSec
    Value: 132782

    Key  : Analysis.Memory.CommitPeak.Mb
    Value: 197

    Key  : FailFast.Name
    Value: FATAL_APP_EXIT

    Key  : FailFast.Type
    Value: 7

    Key  : Memory.Job.PrivateCommit.MB
    Value: 10044

    Key  : Memory.Job.PrivateLimit.MB
    Value: 0

    Key  : Memory.Job.SharedCommit.MB
    Value: 1138

    Key  : Memory.Job.TotalLimit.MB
    Value: 0

    Key  : Memory.System.Available.MB
    Value: 2549

    Key  : Memory.System.CommitDelta.MB
    Value: 4606

    Key  : Memory.System.CommitLimit.MB
    Value: 39467

    Key  : Memory.System.Committed.MB
    Value: 34861

    Key  : Memory.System.PeakCommitment.MB
    Value: 48170

    Key  : Memory.System.RAM.MB
    Value: 4165

    Key  : Stack.Best.Hash
    Value: 898f9e285cc78272a2e1acf7860973ac1fce2ff6

    Key  : Stack.Best.SymbolType
    Value: None

    Key  : Statistics.LastEvent.Exception.Code
    Value: 0xC0000409

    Key  : Statistics.LastEvent.Exception.Param.0
    Value: 7

    Key  : Statistics.LastEvent.Process.Image
    Value: main.exe

    Key  : Statistics.LastEvent.Process.Modules.Count
    Value: 4

    Key  : Statistics.LastEvent.Process.Threads.Count
    Value: 1

    Key  : Statistics.Processes.Count
    Value: 1

    Key  : Statistics.Processes.Max.MemoryUsage
    Value: 458752

    Key  : Statistics.Processes.Max.MemoryUsageImage
    Value: main.exe

    Key  : Statistics.Processes.Max.ProcessCount
    Value: 1

    Key  : Statistics.Processes.Max.ProcessCountImage
    Value: main.exe

    Key  : Statistics.Processors.Count
    Value: 8

    Key  : Statistics.Processors.Id.ArchRev
    Value: 0

    Key  : Statistics.Processors.Id.Architecture
    Value: X64

    Key  : Statistics.Processors.Id.FMS
    Value: 6,70,1

    Key  : Statistics.Processors.Id.Revision
    Value: 0

    Key  : Statistics.Processors.Id.Vendor
    Value: Unknown

    Key  : Statistics.Threads.Count
    Value: 1

    Key  : Timeline.OS.Boot.DeltaSec
    Value: 1728185

    Key  : Timeline.Process.Day.DeltaSec
    Value: 32248

    Key  : Timeline.Process.Start.DeltaSec
    Value: 45

    Key  : Timeline.Timezone.Standard.Bias
    Value: -540

    Key  : Timeline.Timezone.Standard.Name
    Value: Tokyo Standard Time

    Key  : Timeline.Zulu
    Value: 2022-08-01T23:57:28.130Z

    Key  : WER.OS.Branch
    Value: vb_release

    Key  : WER.OS.Locale
    Value: en-US

    Key  : WER.OS.Platform
    Value: Windows

    Key  : WER.OS.Timestamp
    Value: 2019-12-06T14:06:00Z

    Key  : WER.OS.Version
    Value: 10.0.19044.1

    Key  : WER.Process.Name
    Value: main.exe

FILE_IN_CAB:  main.dmp

PROCESS_NAME:  main.exe

APPLICATION_VERIFIER_FLAGS:  0

CONTEXT:  00000093a7aff2c0 -- (.cxr 0x93a7aff2c0)
rax=00000000000004e4 rbx=00007ff6004469a0 rcx=0000000000000000
rdx=0000000029000029 rsi=00000093a7affbe0 rdi=0000000019930520
rip=00007fff004b4fd9 rsp=00000093a7affa70 rbp=0000000000000000
 r8=0000019947780150  r9=0000000000000100 r10=0000000000000100
r11=0000000000000100 r12=0000000000000000 r13=0000000000000000
r14=0000000000000000 r15=0000000000000000
iopl=0         nv up ei pl nz na po nc
cs=0033  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00000206
KERNELBASE!RaiseException+0x69:
00007fff`004b4fd9 0f1f440000      nop     dword ptr [rax+rax]
Resetting default scope

EXCEPTION_RECORD:  (.exr -1)
ExceptionAddress: 00007ff6004346dd (main+0x00000000000046dd)
   ExceptionCode: c0000409 (Security check failure or stack buffer overrun)
  ExceptionFlags: 00000001
NumberParameters: 1
   Parameter[0]: 0000000000000007
Subcode: 0x7 FAST_FAIL_FATAL_APP_EXIT 

ERROR_CODE: (NTSTATUS) 0xc0000409 - The system detected an overrun of a stack-based buffer in this application. This overrun could potentially allow a malicious user to gain control of this application.

EXCEPTION_CODE_STR:  c0000409

EXCEPTION_PARAMETER1:  0000000000000007

STACK_TEXT:  
00000093`a7affa70 00007ff6`00431d20     : 00000000`00000000 00000000`00000001 00000000`00000fa0 00007ff6`00430000 : KERNELBASE!RaiseException+0x69
00000093`a7affb50 00007ff6`0043104b     : 05100800`00040661 bfcbfbff`fedaf383 00000199`477867e0 00000000`00000000 : main+0x1d20
00000093`a7affbb0 00007ff6`004312a8     : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : main+0x104b
00000093`a7affc00 00007fff`00797034     : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : main+0x12a8
00000093`a7affc40 00007fff`02742651     : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : kernel32!BaseThreadInitThunk+0x14
00000093`a7affc70 00000000`00000000     : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : ntdll!RtlUserThreadStart+0x21

STACK_COMMAND:  .cxr 0x93a7aff2c0 ; kb

SYMBOL_NAME:  main+1d20

MODULE_NAME: main

IMAGE_NAME:  main.exe

FAILURE_BUCKET_ID:  MISSING_CRITICAL_SYMBOLS_ntdll.dll_FAIL_FAST_FATAL_APP_EXIT_c0000409_main.exe!Unknown

OS_VERSION:  10.0.19044.1

BUILDLAB_STR:  vb_release

OSPLATFORM_TYPE:  x64

OSNAME:  Windows 10

FAILURE_ID_HASH:  {02031277-0b2a-4dbf-6710-3fd12dd1df4e}

FAILURE_ID_REPORT_LINK: https://go.microsoft.com/fwlink/?LinkID=397724&FailureSearchText=02031277-0b2a-4dbf-6710-3fd12dd1df4e

Followup:     MachineOwner
---------

agocke commented 2 years ago

I would expect that the only useful thing we might get is a differentiated exception type -- otherwise I assume everything would go in the same failure bucket ID.

MichalStrehovsky commented 2 years ago

Isn't the bucket determined primarily by the stack? I would expect these to be bucketed into different categories, even if we lack symbols. Interpreting the dump without symbols would still be a big challenge though.

tommcdon commented 2 years ago

Each bucket is stored by a hash. It's my understanding that the hash is calculated using the Exception type and the callstack.

agocke commented 2 years ago

That doesn't seem too bad then -- you would only see mixed buckets if things had the same callstack but different exceptions. Seems pretty rare.

agocke commented 2 years ago

Eh, I guess it's not that rare if you have some sort of async situation with AggregateException where everything gets thrown out of one method, but I guess that can't be helped.
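The bucketing behavior discussed above can be sketched as follows. This is only an illustration of the idea that a bucket key is a hash over the exception type and the call stack; the real hashing scheme is not described in this thread, so SHA-1, the `bucket_hash` helper, and the exception type names used here are all assumptions. The frames are taken from the stack in the original report.

```python
import hashlib


def bucket_hash(frames, exception_type=None):
    """Hash a call stack, optionally with an exception type, into a bucket key.

    Hypothetical helper: SHA-1 is an arbitrary choice for illustration.
    """
    digest = hashlib.sha1()
    if exception_type is not None:
        digest.update(exception_type.encode("utf-8"))
        digest.update(b"\0")  # separator so adjacent fields cannot run together
    for frame in frames:
        digest.update(frame.encode("utf-8"))
        digest.update(b"\0")
    return digest.hexdigest()


# Symbol-less frames as they appear in the report (module+offset only).
stack = ["aotExc+0x14f8d4", "aotExc+0xa08f2", "aotExc+0xa0762"]

# With the exception type included, the same stack lands in different buckets.
with_type_a = bucket_hash(stack, "System.ApplicationException")
with_type_b = bucket_hash(stack, "System.InvalidOperationException")
print(with_type_a != with_type_b)

# Without the exception type, both failures share one bucket.
without_type_a = bucket_hash(stack)
without_type_b = bucket_hash(stack)
print(without_type_a == without_type_b)
```

Under this model, buckets only mix when different exceptions are thrown from an identical stack, which is the case raised above for exceptions that all surface from one method.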

tommcdon commented 1 year ago

NativeAOT createdump work was completed as part of https://github.com/dotnet/runtime/issues/88904.
We will use this issue to track the remaining work in Watson. Since there is no more remaining untracked runtime work, moving this issue to the diagnostics repo.

am11 commented 1 year ago

Is this issue also tracking remaining work for Unix?

(lldb) clrstack
Failed to find runtime module (libcoreclr.so), 0x80004002
Extension commands need it in order to have something to do.

Looks like DotnetRuntimeInfo etc. are not present in ELF and Mach-O objects. SOS looks for it here: https://github.com/dotnet/diagnostics/blob/ffd489e909eefd11aeaa73e3f03bb1084009bb42/src/dbgshim/dbgshim.cpp#L1404

If this is unexpected and the issue is that they are not retained after the final linkage, I can take a look in runtime to keep them (we have some symbols explicitly kept in AOT build integration).

mikem8361 commented 1 year ago

dbgshim.cpp is part of managed debugger launching, not part of SOS. SOS actually uses CLRMD to enumerate the runtimes in the process, and part of that is looking for DotnetRuntimeInfo, but this applies only to .NET Core and not to Native AOT apps, mainly because Native AOT doesn't have a DAC.

This issue addresses hooking up !analyze in Watson with the crash info JSON blob that is currently implemented for Native AOT runtimes.

tommcdon commented 10 months ago

.NET work is completed, closing.