dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
15.29k stars 4.74k forks source link

UnwindManagedExceptionPass1 crashes on virtual dispatch stub when handling System.NullReferenceException on macOS ARM64 #62058

Closed k15tfu closed 2 years ago

k15tfu commented 2 years ago

Hi!

I faced with the following crash running managed .NET 6 app on Apple Silicon:

Exception Type:        EXC_BAD_ACCESS (SIGSEGV)
Exception Codes:       KERN_INVALID_ADDRESS at 0x0000000000000000
Exception Note:        EXC_CORPSE_NOTIFY

Termination Signal:    Segmentation fault: 11
Termination Reason:    Namespace SIGNAL, Code 0xb
Terminating Process:   exc handler [29403]

VM Regions Near 0:
--> 
    __TEXT                      102b88000-102b94000    [   48K] r-x/r-x SM=COW  /Users/*/Downloads/*

Thread 0 Crashed:: Dispatch queue: com.apple.main-thread
0   libcoreclr.dylib                0x0000000102eb82b0 EECodeInfo::GetFunctionEntry() + 40
1   libcoreclr.dylib                0x000000010304afbc UnwindManagedExceptionPass1(PAL_SEHException&, _CONTEXT*) + 208
2   libcoreclr.dylib                0x000000010304afbc UnwindManagedExceptionPass1(PAL_SEHException&, _CONTEXT*) + 208
3   libcoreclr.dylib                0x000000010304b2fc DispatchManagedException(PAL_SEHException&, bool) + 68
4   libcoreclr.dylib                0x0000000102fac08c IL_Throw(Object*) + 532
5   ???                             0x0000000280e40110 0 + 10752360720
6   ???                             0x0000000280e3ffa8 0 + 10752360360
7   ???                             0x0000000280e3fe38 0 + 10752359992
8   ???                             0x0000000280e3edbc 0 + 10752355772
9   libcoreclr.dylib                0x00000001030d3d88 CallDescrWorkerInternal + 132
[...]

Looking into this, I found that the crash happens while handling System.NullReferenceException in the catch block, which is called via CallEHFunclet. Here is its stack and registers on enter:

Process 30925 stopped                                                                                                                                                                                                            [7/141]* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 1.1
    frame #0: 0x00000001022ef33c libcoreclr.dylib`CallEHFunclet
libcoreclr.dylib`CallEHFunclet:
->  0x1022ef33c <+0>:  stp    x29, x30, [sp, #-0x60]!
    0x1022ef340 <+4>:  stp    x19, x20, [sp, #0x10]
    0x1022ef344 <+8>:  stp    x21, x22, [sp, #0x20]
    0x1022ef348 <+12>: stp    x23, x24, [sp, #0x30]
Target 0: (dotnet) stopped.
(lldb) clrstack
OS Thread Id: 0x1f459cb (1)
        Child SP               IP Call Site
000000016FDFD410 00000001022ef33c [FaultingExceptionFrame: 000000016fdfd410]
000000016FDFE860 0000000280E30218 Program+<>c.<Main>b__2_1() [/Users/*/Program.cs @ 24]
000000016FDFE8C0 0000000280E30030 Program.CatchRethrow(System.Action) [/Users/*/Program.cs @ 33]
000000016FDFE920 0000000280E2FFA8 Program+<>c.<Main>b__2_0() [/Users/*/Program.cs @ 19]
000000016FDFE950 0000000280E2FE38 Program.CatchIgnore(System.Action) [/Users/*/Program.cs @ 47]
000000016FDFE990 0000000280E2EDBC Program.Main(System.String[]) [/Users/*/Program.cs @ 18]
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 1.1
  * frame #0: 0x00000001022ef33c libcoreclr.dylib`CallEHFunclet
    frame #1: 0x0000000102261b3c libcoreclr.dylib`ExceptionTracker::CallHandler(this=0x000000010101b210, uHandlerStartPC=10752294988, sf=<unavailable>, pEHClause=0x000000010101b2a0, pMD=0x0000000280ebe740, funcletType=Catch, pContextRecord=0x000000016fdfc6d0) at exceptionhandling.cpp:3338:22 [opt]
    frame #2: 0x000000010226195c libcoreclr.dylib`ExceptionTracker::CallCatchHandler(this=0x000000010101b210, pContextRecord=0x000000016fdfc6d0, pfAborting=0x000000016fdfc520) at exceptionhandling.cpp:657:21 [opt]
    frame #3: 0x000000010226252c libcoreclr.dylib`::ProcessCLRException(pExceptionRecord=<unavailable>, MemoryStackFp=6171912480, pContextRecord=0x000000016fdfc6d0, pDispatcherContext=0x000000016fdfc660) at exceptionhandling.cpp:1136:35 [opt]
    frame #4: 0x0000000102266894 libcoreclr.dylib`UnwindManagedExceptionPass2(ex=0x000000016fdfd898, unwindStartContext=<unavailable>) at exceptionhandling.cpp:4452:27 [opt]
    frame #5: 0x0000000102266ca8 libcoreclr.dylib`UnwindManagedExceptionPass1(ex=0x000000016fdfd898, frameContext=0x000000016fdfd030) at exceptionhandling.cpp:4640:17 [opt]
    frame #6: 0x000000010226702c libcoreclr.dylib`DispatchManagedException(ex=0x000000016fdfd898, isHardwareException=<unavailable>) at exceptionhandling.cpp:4821:17 [opt]
    frame #7: 0x00000001022610c0 libcoreclr.dylib`HandleHardwareException(ex=0x000000016fdfd898) at exceptionhandling.cpp:5308:9 [opt]
    frame #8: 0x0000000102009840 libcoreclr.dylib`SEHProcessException(exception=0x000000016fdfd898) at seh.cpp:269:21 [opt]
    frame #9: 0x000000010203fddc libcoreclr.dylib`::PAL_DispatchException(pContext=0x000000016fdfe4c0, pExRecord=0x000000016fdfe428, pMachExceptionInfo=0x000000016fdfdee8) at machexception.cpp:394:29 [opt]
    frame #10: 0x000000010203fa00 libcoreclr.dylib`PAL_DispatchExceptionWrapper + 16
    frame #11: 0x0000000280100070
    frame #12: 0x0000000280e30030
    frame #13: 0x0000000280e2ffa8
    frame #14: 0x0000000280e2fe38
    frame #15: 0x0000000280e2edbc
    frame #16: 0x00000001022efa88 libcoreclr.dylib`CallDescrWorkerInternal + 132
[...]
(lldb) f 9
libcoreclr.dylib was compiled with optimization - stepping may behave oddly; variables may not be available.
frame #9: 0x000000010203fddc libcoreclr.dylib`::PAL_DispatchException(pContext=0x000000016fdfe4c0, pExRecord=0x000000016fdfe428, pMachExceptionInfo=0x000000016fdfdee8) at machexception.cpp:394:29 [opt]
-> 394          continueExecution = SEHProcessException(&exception);
   395          if (continueExecution)
   396          {
   397              // Make a copy of the exception records so that we can free them before restoring the context
(lldb) register read
General Purpose Registers:
       x19 = 0x000000016fdfe4c0
       x20 = 0x000000016fdfdee8
       x21 = 0x000000016fdfe428
       x22 = 0x0000000101019800
       x23 = 0x000000010181b800
       x24 = 0x000000016fdfedd0
       x25 = 0x000000016fdfea80
       x26 = 0x0000000000000008
       x27 = 0x000000016fdfed10
       x28 = 0x000000016fdfeaf0
        fp = 0x000000016fdfded0
        sp = 0x000000016fdfd880
        pc = 0x000000010203fddc  libcoreclr.dylib`::PAL_DispatchException(PCONTEXT, PEXCEPTION_RECORD, MachExceptionInfo *) + 244 at machexception.cpp:395:13
21 registers were unavailable.

(lldb) f 10
frame #10: 0x000000010203fa00 libcoreclr.dylib`PAL_DispatchExceptionWrapper + 16
libcoreclr.dylib`PAL_DispatchExceptionWrapper:
->  0x10203fa00 <+16>: brk    #0
    0x10203fa04 <+20>: ldp    x29, x30, [sp], #0x10
    0x10203fa08 <+24>: ret

libcoreclr.dylib`CorUnix::CPalThread::EnableMachExceptions:
    0x10203fa0c <+0>:  stp    x22, x21, [sp, #-0x30]!
(lldb) register read
General Purpose Registers:
       x19 = 0x000000016fdfeb80
       x20 = 0x000000016fdfeb68
       x21 = 0x000000016fdfed48
       x22 = 0x0000000000000001
       x23 = 0x000000016fdfed10
       x24 = 0x000000016fdfedd0
       x25 = 0x000000016fdfea80
       x26 = 0x0000000000000008
       x27 = 0x000000016fdfed10
       x28 = 0x000000016fdfeaf0
        fp = 0x000000016fdfe850
        sp = 0x000000016fdfdee0
        pc = 0x000000010203fa00  libcoreclr.dylib`PAL_DispatchExceptionWrapper + 16
21 registers were unavailable.

(lldb) f 11
frame #11: 0x0000000280100070
->  0x280100070: ldr    x13, [x0]
    0x280100074: adr    x9, #0x1c
    0x280100078: ldp    x10, x12, [x9]
    0x28010007c: cmp    x13, x10
(lldb) mem read --size 4 --format x $pc
0x280100070: 0xf940000d 0x100000e9 0xa940312a 0xeb0a01bf
0x280100080: 0x54000041 0xd61f0180 0x580000c9 0xd61f0120
(lldb) register read
General Purpose Registers:
       x19 = 0x000000016fdfeb80
       x20 = 0x000000016fdfeb68
       x21 = 0x000000016fdfed48
       x22 = 0x0000000000000001
       x23 = 0x000000016fdfed10
       x24 = 0x000000016fdfedd0
       x25 = 0x000000016fdfea80
       x26 = 0x0000000000000008
       x27 = 0x000000016fdfed10
       x28 = 0x000000016fdfeaf0
        fp = 0x000000016fdfe860
        lr = 0x0000000280100070
        sp = 0x000000016fdfe860
        pc = 0x0000000280100070
20 registers were unavailable.

(lldb) f 12
frame #12: 0x0000000280e30030
->  0x280e30030: nop
    0x280e30034: nop
    0x280e30038: nop
    0x280e3003c: b      0x280e30040
(lldb) register read
General Purpose Registers:
       x19 = 0x000000016fdfeb80
       x20 = 0x000000016fdfeb68
       x21 = 0x000000016fdfed48
       x22 = 0x0000000000000001
       x23 = 0x000000016fdfed10
       x24 = 0x000000016fdfedd0
       x25 = 0x000000016fdfea80
       x26 = 0x0000000000000008
       x27 = 0x000000016fdfed10
       x28 = 0x000000016fdfeaf0
        fp = 0x000000016fdfe8c0
        sp = 0x000000016fdfe870
        pc = 0x0000000280e30030
21 registers were unavailable.

(lldb) ip2md 0x0000000280e30030
MethodDesc:   0000000280ebe740
Method Name:          Program.CatchRethrow(System.Action)
Class:                0000000280eae390
MethodTable:          0000000280ebe780
mdToken:              0000000006000002
Module:               0000000280ebc258
IsJitted:             yes
Current CodeAddr:     0000000280e2ffd0
Version History:
  ILCodeVersion:      0000000000000000
  ReJIT ID:           0
  IL Addr:            00000001001f4290
Code Version History:
     CodeAddr:           0000000280e2ffd0  (Tier 0)
     NativeCodeVersion:  0000000000000000
Source file:  /Users/*/Program.cs @ 33
(lldb) disas -s 0000000280e2ffd0 -e 0x280e30050
    0x280e2ffd0: stp    x29, x30, [sp, #-0x60]!
    0x280e2ffd4: mov    x29, sp
    0x280e2ffd8: add    x9, x29, #0x10            ; =0x10
    0x280e2ffdc: movi.16b v16, #0x0
    0x280e2ffe0: stp    q16, q16, [x9]
    0x280e2ffe4: stp    q16, q16, [x9, #0x20]
    0x280e2ffe8: add    x1, sp, #0x60             ; =0x60
    0x280e2ffec: str    x1, [x29, #0x58]
    0x280e2fff0: str    x0, [x29, #0x50]
    0x280e2fff4: mov    x0, #0xc740
    0x280e2fff8: movk   x0, #0x80eb, lsl #16
    0x280e2fffc: movk   x0, #0x2, lsl #32
    0x280e30000: ldr    w0, [x0]
    0x280e30004: cbz    w0, 0x280e3000c
    0x280e30008: bl     0x280df64d0
    0x280e3000c: nop
    0x280e30010: nop
    0x280e30014: ldr    x0, [x29, #0x50]
    0x280e30018: str    x0, [x29, #0x10]
    0x280e3001c: ldr    x0, [x29, #0x10]
    0x280e30020: ldr    x0, [x0, #0x8]
    0x280e30024: ldr    x1, [x29, #0x10]
    0x280e30028: ldr    x1, [x1, #0x18]
    0x280e3002c: blr    x1
->  0x280e30030: nop
    0x280e30034: nop
    0x280e30038: nop
    0x280e3003c: b      0x280e30040
    0x280e30040: nop
    0x280e30044: ldp    x29, x30, [sp], #0x60
    0x280e30048: ret
    0x280e3004c: stp    x29, x30, [sp, #-0x20]!

The first exception (i.e. System.NullReferenceException) happens in a virtual dispatch stub because x0 is 0:

(lldb) f 11
frame #11: 0x0000000280100070
->  0x280100070: ldr    x13, [x0]
    0x280100074: adr    x9, #0x1c
    0x280100078: ldp    x10, x12, [x9]
    0x28010007c: cmp    x13, x10
(lldb) mem read --size 4 --format x $pc
0x280100070: 0xf940000d 0x100000e9 0xa940312a 0xeb0a01bf  <-- DISPATCH_STUB_FIRST_DWORD
0x280100080: 0x54000041 0xd61f0180 0x580000c9 0xd61f0120

After that, when we throw an exception from the catch block, the runtime starts unwinding the stack from the Program.CatchRethrow() frame and finally segfaults in UnwindManagedExceptionPass1(), trying to EECodeInfo::GetFunctionEntry() for CallEHFunclet:

UnwindManagedExceptionPass1 4552 pc 0000000280E30110 ExceptionRecord 000000010181E990 ContextRecord.pc 000000010209AAC0 IsManagedCode? 0
UnwindManagedExceptionPass1 4573 pc 0000000280E30110 ExceptionRecord 000000010181E990  <-- Program.CatchRethrow
UnwindManagedExceptionPass1 4632 pc = 00000001022EF378  <-- CallEHFunclet + 60
UnwindManagedExceptionPass1 4706 pc isManagedCode? 0
UnwindManagedExceptionPass1 4737 pc = 0000000102261B3C sp 000000016FDFC320 lr 00000001022EF378 fp 000000016FDFC3C0  <-- ExceptionTracker::CallHandler
UnwindManagedExceptionPass1 4737 pc = 000000010226195C sp 000000016FDFC3D0 lr 00000001022EF378 fp 000000016FDFC450  <-- ExceptionTracker::CallCatchHandler
UnwindManagedExceptionPass1 4737 pc = 000000010226252C sp 000000016FDFC460 lr 00000001022EF378 fp 000000016FDFC5F0  <-- ProcessCLRException
UnwindManagedExceptionPass1 4737 pc = 0000000102266894 sp 000000016FDFC600 lr 00000001022EF378 fp 000000016FDFCAC0  <-- UnwindManagedExceptionPass2
UnwindManagedExceptionPass1 4737 pc = 0000000102266CA8 sp 000000016FDFCAD0 lr 00000001022EF378 fp 000000016FDFD020  <-- UnwindManagedExceptionPass1
UnwindManagedExceptionPass1 4737 pc = 000000010226702C sp 000000016FDFD030 lr 00000001022EF378 fp 000000016FDFD3E0  <-- DispatchManagedException
UnwindManagedExceptionPass1 4737 pc = 00000001022610C0 sp 000000016FDFD3F0 lr 00000001022EF378 fp 000000016FDFD810  <-- HandleHardwareException
UnwindManagedExceptionPass1 4737 pc = 0000000102009840 sp 000000016FDFD820 lr 00000001022EF378 fp 000000016FDFD870  <-- SEHProcessException
UnwindManagedExceptionPass1 4737 pc = 000000010203FDDC sp 000000016FDFD880 lr 00000001022EF378 fp 000000016FDFDED0  <-- PAL_DispatchException
UnwindManagedExceptionPass1 4737 pc = 000000010203FA00 sp 000000016FDFDEE0 lr 00000001022EF378 fp 000000016FDFE850  <-- PAL_DispatchExceptionWrapper
UnwindManagedExceptionPass1 4737 pc = 0000000280100070 sp 000000016FDFE860 lr 00000001022EF378 fp 000000016FDFE860  <-- virtual dispatch stub (SK_DISPATCH)
AdjustContextForVirtualStub 1189 pc 0000000280100070 StubKind 2, sp 000000016FDFE860 lr 00000001022EF378 fp 000000016FDFE860  <-- takes lr
UnwindManagedExceptionPass1 4713 pc = 00000001022EF374  <-- CallEHFunclet + 56
UnwindManagedExceptionPass1 4573 pc 00000001022EF374 ExceptionRecord 000000010181E990
Process 30925 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
    frame #0: 0x00000001020d3f08 libcoreclr.dylib`EECodeInfo::GetFunctionEntry(this=0x000000016fdfb888) at jitinterface.cpp:15034:35 [opt]
   15031            SUPPORTS_DAC;
   15032
   15033            if (m_pFunctionEntry == NULL)
-> 15034                m_pFunctionEntry = m_pJM->LazyGetFunctionEntry(this);
   15035            return m_pFunctionEntry;
   15036        }
   15037
Target 0: (dotnet) stopped.
(lldb) print *this
(EECodeInfo) $1 = {
  m_codeAddress = 4331598708
  m_methodToken = {
    m_pRangeSection = 0x0000000100322510
    m_pCodeHeader = 10752294856
  }
  m_pMD = 0x0000000000000000
  m_pJM = 0x0000000000000000
  m_relOffset = 0
  m_pFunctionEntry = 0x0000000000000000
}
(lldb) image lookup -a 4331598708
      Address: libcoreclr.dylib[0x00000000002ef374] (libcoreclr.dylib.__TEXT.__text + 3060276)
      Summary: libcoreclr.dylib`CallEHFunclet + 56
(lldb) clrstack
OS Thread Id: 0x1f459cb (1)
        Child SP               IP Call Site
000000016FDFC158 00000001020d3f08 [HelperMethodFrame: 000000016fdfc158]
000000016FDFC2A0 0000000280E30110 Program.CatchRethrow(System.Action) [/Users/*/Program.cs @ 39]
000000016FDFD410 00000001022ef378 [FaultingExceptionFrame: 000000016fdfd410]
000000016FDFE860 0000000280E30218 Program+<>c.<Main>b__2_1() [/Users/*/Program.cs @ 24]
000000016FDFE8C0 0000000280E30030 Program.CatchRethrow(System.Action) [/Users/*/Program.cs @ 33]
000000016FDFE920 0000000280E2FFA8 Program+<>c.<Main>b__2_0() [/Users/*/Program.cs @ 19]
000000016FDFE950 0000000280E2FE38 Program.CatchIgnore(System.Action) [/Users/*/Program.cs @ 47]
000000016FDFE990 0000000280E2EDBC Program.Main(System.String[]) [/Users/*/Program.cs @ 18]

We have seen the same crash in https://github.com/dotnet/runtime/issues/51250#issuecomment-888293088 no. 3 because there were a lot of System.NullReferenceException exceptions.

P.S. Ultimately I was able to create a simple program that reproduces this problem:

using System;

public class Program
{
    private interface IFoo
    {
        bool IsValid { get; }
    }

    private class Foo : IFoo
    {
        public bool IsValid { get; set; }
    }

    public static void Main(string[] args)
    {
        bool warmup = new Foo().IsValid;
        CatchIgnore(() =>
        CatchRethrow(() =>
        {
            IFoo[] foos = {new Foo(), null};
            foreach (var foo in foos)
            {
                bool check = foo.IsValid;
            }
        }));
    }

    public static void CatchRethrow(Action action)
    {
        try
        {
            action.Invoke();
        }
        catch (Exception e)
        {
            Console.Out.WriteLine("catch");
            Console.Out.Flush();
            throw new Exception("catch", e);  // throw; doesn't work either
        }
    }

    public static void CatchIgnore(Action action)
    {
        try
        {
            action.Invoke();
        }
        catch (Exception)
        {
            Console.Out.WriteLine("ignore");
            Console.Out.Flush();
        }
    }
}

Linked issues: https://github.com/dotnet/runtime/issues/51250, https://github.com/dotnet/runtime/issues/49070.

janvorli commented 2 years ago

My guts feeling is that it is yet another problem similar to #61486 caused by tail calling a VSD stub. I'll investigate it.

janvorli commented 2 years ago

@k15tfu thank you very much for the analysis and the simple repro!

janvorli commented 2 years ago

My feeling was right, I have verified that it is a variant of the #61486. The only difference is that this time the caller was CallEHFunclet instead of the CallDescrWorkerInternal.

janvorli commented 2 years ago

Fixed by #62719

k15tfu commented 2 years ago

@janvorli @jakobbotsch @jkotas Hi! Unfortunately my repro still doesn't work with that fix #62719, but the repro from https://github.com/dotnet/runtime/issues/61486#issuecomment-966825047 works fine now. Could you please reopen this issue and take another look at it?

jakobbotsch commented 2 years ago

I don't see where the repro you posted would be doing any tailcalls so there might be another related issue here. @janvorli do you have time to check?

k15tfu commented 2 years ago

@janvorli Friendly ping.

janvorli commented 2 years ago

I am sorry for the late response, I was OOF since December 15. @jakobbotsch when I was trying to repro this issue about a month ago, the symptoms looked the same as with the other issue. I'll retry the repro it again to see what's wrong.

jakobbotsch commented 2 years ago

FWIW, I tried to repro this on Ubuntu arm64 last month but I was unable to, so it might be MacOS only unless I missed something.

janvorli commented 2 years ago

I've found the cuprit, it was completely unrelated to the issue @jakobbotsch fixed. While the exception occured in a VSD stub, the CallEHFunclet I've seen on the stack was a red herring. It occurred there due to a bug in the hardware exception handling specific to arm64 macOS.

danmoseley commented 2 years ago

reactivating as fix was reverted

k15tfu commented 2 years ago

I see that it is fixed by https://github.com/dotnet/runtime/pull/63596 and backported to 6.0 in https://github.com/dotnet/runtime/pull/64262, thanks.