dotnet / diagnostics

This repository contains the source code for various .NET Core runtime diagnostic tools and documents.

Crash when debugging with lldb on MacOS #4769

Open UnityAlex opened 1 week ago

UnityAlex commented 1 week ago

Description

When a native lldb debugger is attached to CoreCLR on macOS (ARM64), breakpoints in certain locations can cause the process to crash.

Reproduction Steps

Sample code:

class Program
{
    static void Main(string[] args)
    {
        Console.WriteLine("Hello, World!");
        Console.ReadKey();
        string foo = null;
        Console.WriteLine($"foo: {foo.Length}");
    }
}

The idea of the sample is to trigger the native exception handling for a null reference exception, which is where we have our breakpoint in lldb.

  1. Run the sample.
  2. Attach the lldb debugger to the process.
  3. Set a breakpoint on the function PAL_DispatchException: breakpoint set --name PAL_DispatchException
  4. Press a key in the console of the running process to trigger the exception.
  5. See the breakpoint hit in lldb, usually in some memmove on an access violation.
  6. Attempt to continue; a silent crash occurs. If you wait long enough, macOS will usually show a dialog with a crash report. It looks like there might be a stack overflow in the exception handling.
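
Steps 2 and 3 above can be combined into a single lldb invocation. The following is a minimal sketch, not something taken from the report: it only prints the command so it can be inspected before use. The function name `lldb_repro_command` and the process name `Sample` are hypothetical; `-n` (attach by process name) and `-o` (run a command after startup) are standard lldb driver options.

```shell
# Print (but do not run) an lldb invocation that attaches by process name,
# sets the breakpoint from step 3, and resumes the process.
lldb_repro_command() {
    # $1: name of the running process to attach to
    printf 'lldb -n %s -o "breakpoint set --name PAL_DispatchException" -o "continue"\n' "$1"
}

# "Sample" is a hypothetical process name; substitute the name of the running app.
lldb_repro_command Sample
```

After running the printed command, continue with step 4 (press a key in the app's console).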

Expected behavior

No crash

Actual behavior

Silent crash.

Regression?

No response

Known Workarounds

No response

Configuration

.NET version 8.0.201, macOS 14.5, M1 (ARM64). Does not happen on Windows. I haven't tried Linux yet.

Other information

If it helps, the first few frames of what I suspect is a stack overflow look like this:

0   libcoreclr.dylib                         0x3289a5d4c CorUnix::GetCurrentPalThread() + 0 (thread.hpp:684) [inlined]
1   libcoreclr.dylib                         0x3289a5d4c CorUnix::InternalGetCurrentThread() + 0 (thread.hpp:689) [inlined]
2   libcoreclr.dylib                         0x3289a5d4c PAL_DispatchException + 36 (machexception.cpp:428)
3   libcoreclr.dylib                         0x3289a5a2c PAL_DispatchExceptionWrapper + 16 (dispatchexceptionwrapper.S:39)
4   libcoreclr.dylib                         0x3289a5d4c PAL_DispatchException + 36 (machexception.cpp:422)
5   libcoreclr.dylib                         0x3289a5a2c PAL_DispatchExceptionWrapper + 16 (dispatchexceptionwrapper.S:39)
6   libcoreclr.dylib                         0x3289a5d4c PAL_DispatchException + 36 (machexception.cpp:422)
7   libcoreclr.dylib                         0x3289a5a2c PAL_DispatchExceptionWrapper + 16 (dispatchexceptionwrapper.S:39)
8   libcoreclr.dylib                         0x3289a5d4c PAL_DispatchException + 36 (machexception.cpp:422)
9   libcoreclr.dylib                         0x3289a5a2c PAL_DispatchExceptionWrapper + 16 (dispatchexceptionwrapper.S:39)
10  libcoreclr.dylib                         0x3289a5d4c PAL_DispatchException + 36 (machexception.cpp:422)
11  libcoreclr.dylib                         0x3289a5a2c PAL_DispatchExceptionWrapper + 16 (dispatchexceptionwrapper.S:39)

This is followed by 500-ish more frames of the same thing.
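
One quick way to confirm that a backtrace like this is dominated by a repeating pair of frames is to count how often each symbol appears. This is a rough sketch that assumes the column layout shown above (frame index, image, address, symbol); the function name `count_frames` is made up for illustration.

```shell
# Count how often each symbol appears in a macOS crash-report backtrace.
# Assumes the layout shown above: index, image, address, symbol, ...
count_frames() {
    awk 'NF >= 4 { seen[$4]++ } END { for (s in seen) print seen[s], s }' | sort -rn
}

# Example with four frames copied from the report above.
count_frames <<'EOF'
2   libcoreclr.dylib   0x3289a5d4c PAL_DispatchException + 36 (machexception.cpp:428)
3   libcoreclr.dylib   0x3289a5a2c PAL_DispatchExceptionWrapper + 16 (dispatchexceptionwrapper.S:39)
4   libcoreclr.dylib   0x3289a5d4c PAL_DispatchException + 36 (machexception.cpp:422)
5   libcoreclr.dylib   0x3289a5a2c PAL_DispatchExceptionWrapper + 16 (dispatchexceptionwrapper.S:39)
EOF
```

On a full crash report, two symbols with nearly identical, very large counts would be consistent with the mutual recursion suspected here.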

dotnet-policy-service[bot] commented 1 week ago

Tagging subscribers to this area: @tommcdon See info in area-owners.md if you want to be subscribed.

MichalPetryka commented 1 week ago

Do you have the SOS plugin installed in your lldb?

UnityAlex commented 1 week ago

I don't. I can do that if it would help though.

MichalPetryka commented 1 week ago

I was wondering if the issue might be related to the presence of the plugin or the lack thereof.

UnityAlex commented 1 week ago

I am having difficulties getting this plugin working on my machine. When I install it following the instructions here: https://github.com/dotnet/diagnostics/blob/main/documentation/installing-sos-instructions.md, it appears to break my lldb:

~ % lldb
zsh: killed     lldb

If I uninstall it (dotnet-sos uninstall), lldb works fine again. I see some mentions in the documentation that I might need to build the SOS plugin myself and install that. Do you know if that's still true for M1 macOS machines?

tommcdon commented 1 week ago

This issue is tracked on https://github.com/dotnet/runtime/issues/99977.

UnityAlex commented 1 week ago

@tommcdon The issue you linked appears to be specific to the SOS plugin. Sorry for the delay; it took me a bit to find @lambdageek's workaround (https://github.com/dotnet/diagnostics/issues/4551#issuecomment-2142927236) to get lldb working with the plugin. I can still reproduce the crash both with and without the plugin installed.

vvuk commented 1 week ago

Here's a full set of steps to reproduce:

  1. Set up a copy of lldb that can load dotnet-sos, as described here: https://github.com/dotnet/diagnostics/issues/4551#issuecomment-2181262810
  2. Create and publish the sample app:

    mkdir Foo
    cd Foo
    dotnet new console
    cat <<EOF > Program.cs
    class Program
    {
        static void Main(string[] args)
        {
            Console.WriteLine("Hello, World!");
            Console.ReadKey();
            string foo = null;
            Console.WriteLine($"foo: {foo.Length}");
        }
    }
    EOF
    dotnet build
    dotnet publish --sc

  3. In one window/tab: `./bin/Release/net8.0/osx-arm64/publish/Foo`
  4. In another: `~/lldb -n Foo`
  5. When lldb attaches, set a breakpoint: `breakpoint set --name PAL_DispatchException` (_note: this seems to be required to hit the issue; without a breakpoint, I haven't been able to reproduce_)
  6. Hit enter in the first window
  7. Observe crash in CLR runtime inside Foo in platform_memmove:

    ~/lldb -n Foo
    Current symbol store settings:
    -> Cache: /Users/vladimir/.dotnet/symbolcache
    -> Server: https://msdl.microsoft.com/download/symbols/ Timeout: 4 RetryCount: 0
    (lldb) process attach --name "Foo"
    Process 13444 stopped
    * thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
        frame #0: 0x000000019c182db4 libsystem_kernel.dylib`read + 8
    libsystem_kernel.dylib`read:
    ->  0x19c182db4 <+8>:  b.lo   0x19c182dd4    ; <+40>
        0x19c182db8 <+12>: pacibsp
        0x19c182dbc <+16>: stp    x29, x30, [sp, #-0x10]!
        0x19c182dc0 <+20>: mov    x29, sp
    Target 0: (Foo) stopped.
    Executable module set to "/Users/vladimir/tmp/Foo/bin/Release/net8.0/osx-arm64/publish/Foo".
    Architecture set to: arm64-apple-macosx-.
    (lldb) breakpoint set --name PAL_DispatchException
    Breakpoint 1: 2 locations.
    (lldb) c
    Process 13444 resuming
    Process 13444 stopped
    * thread #2, stop reason = EXC_BAD_ACCESS (code=2, address=0x16a8e3c08)
        frame #0: 0x000000019c1f3248 libsystem_platform.dylib`_platform_memmove + 168
    libsystem_platform.dylib`:
    ->  0x19c1f3248 <+168>: stp    q2, q3, [x0]
        0x19c1f324c <+172>: subs   x2, x2, #0x40
        0x19c1f3250 <+176>: b.ls   0x19c1f326c    ; <+204>
        0x19c1f3254 <+180>: stp    q0, q1, [x3]
    Target 0: (Foo) stopped.
    (lldb) bt
    * thread #2, stop reason = EXC_BAD_ACCESS (code=2, address=0x16a8e3c08)
      * frame #0: 0x000000019c1f3248 libsystem_platform.dylib`_platform_memmove + 168
        frame #1: 0x0000000105854414 libcoreclr.dylib`SEHExceptionThread(void*) + 1368
        frame #2: 0x000000019c1c2f94 libsystem_pthread.dylib`_pthread_start + 136
    (lldb)

tommcdon commented 1 week ago

@vvuk thanks for providing the repro steps. We have a few clarifying questions:

  1. Does this issue only reproduce when following the directions on https://github.com/dotnet/diagnostics/issues/4551#issuecomment-2181262810 (skipping step 1 in the repro steps above)?
  2. Does this issue reproduce when launching the app from lldb?
  3. Does this issue reproduce only when setting a breakpoint on PAL_DispatchException?

vvuk commented 1 week ago

Does this issue only reproduce when following the directions on https://github.com/dotnet/diagnostics/issues/4551#issuecomment-2181262810 (skipping step 1 in the repro steps above)?

I can reproduce it without loading libsosplugin at all, using non-modified lldb. It seems like just attaching causes an issue.

Does this issue reproduce when launching the app from lldb?

It doesn't seem to (both with and without libsosplugin). But I've also heard that there are cases where it's not 100% reproducible like it seems to be with the steps above (though I suppose you can skip libsosplugin).

Does this issue only reproduce only when setting a breakpoint on PAL_DispatchException?

Without any breakpoints set, the debugger correctly stops in pthread_kill. If I try to set other breakpoints after attaching, for example on CallDescrWorkerInternal, then weird things happen. I think that CallDescrWorkerInternal is already on the stack, so the breakpoint shouldn't be hit, but the process seems to hang instead of crashing.

vvuk commented 1 week ago

This might be already understood, but it seems like there is a bad interaction with the mach exception handler thread that CoreCLR creates and the mechanism by which lldb attaches to an existing process.

If I build a debug runtime and set NONPAL_TRACING=1 and I run the little hello world program above, here's what happens. On process launch:

NONPAL_TRACE: SEHInitializeMachExceptions: TASK PORT count 1
NONPAL_TRACE: SEHInitializeMachExceptions: TASK PORT mask 0000007e handler: 00000000 behavior 00000000 flavor 0
NONPAL_TRACE: Enabling handlers for thread 00000103 exception mask 0000007e exception port 00001c03
NONPAL_TRACE: EnableMachExceptions: THREAD PORT count 1
NONPAL_TRACE: EnableMachExceptions: THREAD PORT mask 0000007e handler: 00000000 behavior 00000000 flavor 0
... bunch of threads ...
Hello World  [the process waits for a keypress at this point]

Then I attach lldb at this point and type finish. Note: finish, not continue. I need the debugger to actually manipulate the process, which is likely what setting the breakpoint on PAL_x was doing. The following trace logs show up after the finish:

NONPAL_TRACE: Enabling handlers for thread 00001f03 exception mask 0000007e exception port 00000c03
NONPAL_TRACE: EnableMachExceptions: THREAD PORT count 1
NONPAL_TRACE: EnableMachExceptions: THREAD PORT mask 0000007e handler: 00000000 behavior 00000000 flavor 0
NONPAL_TRACE: Received message EXCEPTION_RAISE_64 (00000965) from (remote) 00007307 to (local) 00000c03
NONPAL_TRACE: ExceptionNotification EXC_BREAKPOINT (6) thread 00000103 flavor 5
NONPAL_TRACE: ExceptionNotification subcode[0] = 1
NONPAL_TRACE: ExceptionNotification subcode[1] = 19c1c355c
NONPAL_TRACE: ExceptionNotification actual  lr 0x661d00019c1c355c sp 000000016cfb3fe0 fp 000000016cfb4070 pc 0x19c1c355c cpsr 60001000
NONPAL_TRACE: ExceptionNotification far 0000000000000000 esr f2000000 exception 00000000
NONPAL_TRACE: HijackFaultingThread thread 00000103
NONPAL_TRACE: ReplyToNotification KERN_SUCCESS thread 00000103 port 00007307
NONPAL_TRACE: Received message EXCEPTION_RAISE_64 (00000965) from (remote) 0000730b to (local) 00000c03
NONPAL_TRACE: ExceptionNotification EXC_BREAKPOINT (6) thread 00002903 flavor 5
NONPAL_TRACE: ExceptionNotification subcode[0] = 1
NONPAL_TRACE: ExceptionNotification subcode[1] = 19c1c355c
NONPAL_TRACE: ExceptionNotification actual  lr 0xca7580019c1c355c sp 000000016d4e9a00 fp 000000016d4e9a90 pc 0x19c1c355c cpsr 60001000
NONPAL_TRACE: ExceptionNotification far 0000000000000000 esr f2000000 exception 00000000
NONPAL_TRACE: HijackFaultingThread thread 00002903
NONPAL_TRACE: ReplyToNotification KERN_SUCCESS thread 00002903 port 0000730b
NONPAL_TRACE: Received message EXCEPTION_RAISE_64 (00000965) from (remote) 0000730f to (local) 00000c03
NONPAL_TRACE: ExceptionNotification EXC_BREAKPOINT (6) thread 00003d03 flavor 5
NONPAL_TRACE: ExceptionNotification subcode[0] = 1
NONPAL_TRACE: ExceptionNotification subcode[1] = 19c1c355c
NONPAL_TRACE: ExceptionNotification actual  lr 0xc23880019c1c355c sp 000000016d949a70 fp 000000016d949b00 pc 0x19c1c355c cpsr 60001000
NONPAL_TRACE: ExceptionNotification far 0000000000000000 esr f2000000 exception 00000000
NONPAL_TRACE: HijackFaultingThread thread 00003d03

Assert failure(PID 84850 [0x00014b72], Thread: 12211081 [0xba5389]): fWasAttached
    File: /opt/UnitySrc/u/runtime/src/coreclr/debug/ee/controller.cpp Line: 4234
    Image: /opt/UnitySrc/u/runtime/artifacts/bin/testhost/net8.0-osx-Debug-arm64/dotnet

NONPAL_TRACE: ReplyToNotification KERN_SUCCESS thread 00003d03 port 0000730f
NONPAL_TRACE: Received message EXCEPTION_RAISE_64 (00000965) from (remote) 00007313 to (local) 00000c03
NONPAL_TRACE: ExceptionNotification EXC_BREAKPOINT (6) thread 00001207 flavor 5
NONPAL_TRACE: ExceptionNotification subcode[0] = 1
NONPAL_TRACE: ExceptionNotification subcode[1] = 19c1c355c
NONPAL_TRACE: ExceptionNotification actual  lr 0xaf6200019c1c355c sp 000000016d2fa090 fp 000000016d2fa120 pc 0x19c1c355c cpsr 60001000
NONPAL_TRACE: ExceptionNotification far 0000000000000000 esr f2000000 exception 00000000

Assert failure(PID 84850 [0x00014b72], Thread: 12211223 [0xba5417]): fWasAttached
    File: /opt/UnitySrc/u/runtime/src/coreclr/debug/ee/controller.cpp Line: 4234
    Image: /opt/UnitySrc/u/runtime/artifacts/bin/testhost/net8.0-osx-Debug-arm64/dotnet

NONPAL_TRACE: HijackFaultingThread thread 00001207
NONPAL_TRACE: ReplyToNotification KERN_SUCCESS thread 00001207 port 00007313

Assert failure(PID 84850 [0x00014b72], Thread: 12211248 [0xba5430]): fWasAttached
    File: /opt/UnitySrc/u/runtime/src/coreclr/debug/ee/controller.cpp Line: 4234
    Image: /opt/UnitySrc/u/runtime/artifacts/bin/testhost/net8.0-osx-Debug-arm64/dotnet

Assert failure(PID 84850 [0x00014b72], Thread: 12211088 [0xba5390]): fWasAttached
    File: /opt/UnitySrc/u/runtime/src/coreclr/debug/ee/controller.cpp Line: 4234
    Image: /opt/UnitySrc/u/runtime/artifacts/bin/testhost/net8.0-osx-Debug-arm64/dotnet

NONPAL_TRACE: Received message EXCEPTION_RAISE_64 (00000965) from (remote) 00007317 to (local) 00000c03
NONPAL_TRACE: ExceptionNotification EXC_BREAKPOINT (6) thread 00007e03 flavor 5
NONPAL_TRACE: ExceptionNotification subcode[0] = 1
NONPAL_TRACE: ExceptionNotification subcode[1] = 19c1c355c
NONPAL_TRACE: ExceptionNotification actual  lr 0x19c1c355c sp 000000016dec1cd0 fp 000000016dec1d60 pc 0x19c1c355c cpsr 40001000
NONPAL_TRACE: ExceptionNotification far 0000000000000000 esr f2000000 exception 00000000
NONPAL_TRACE: HijackFaultingThread thread 00007e03
NONPAL_TRACE: ReplyToNotification KERN_SUCCESS thread 00007e03 port 00007317

Assert failure(PID 84850 [0x00014b72], Thread: 12211503 [0xba552f]): fWasAttached
    File: /opt/UnitySrc/u/runtime/src/coreclr/debug/ee/controller.cpp Line: 4234
    Image: /opt/UnitySrc/u/runtime/artifacts/bin/testhost/net8.0-osx-Debug-arm64/dotnet

NONPAL_TRACE: Received message EXCEPTION_RAISE_64 (00000965) from (remote) 0000731b to (local) 00000c03
NONPAL_TRACE: ExceptionNotification EXC_BREAKPOINT (6) thread 00003d03 flavor 5
NONPAL_TRACE: ExceptionNotification subcode[0] = 1
NONPAL_TRACE: ExceptionNotification subcode[1] = 19c1c355c
NONPAL_TRACE: ExceptionNotification actual  lr 0x19c1c355c sp 000000016d946c60 fp 000000016d946cf0 pc 0x19c1c355c cpsr 40001000
NONPAL_TRACE: ExceptionNotification far 0000000000000000 esr f2000000 exception 00000000
NONPAL_TRACE: HijackFaultingThread thread 00003d03
NONPAL_TRACE: ReplyToNotification KERN_SUCCESS thread 00003d03 port 0000731b

Assert failure(PID 84850 [0x00014b72], Thread: 12211248 [0xba5430]): pOldContext == NULL
    File: /opt/UnitySrc/u/runtime/src/coreclr/debug/ee/controller.cpp Line: 4169
    Image: /opt/UnitySrc/u/runtime/artifacts/bin/testhost/net8.0-osx-Debug-arm64/dotnet

NONPAL_TRACE: Received message EXCEPTION_RAISE_64 (00000965) from (remote) 0000731f to (local) 00000c03
NONPAL_TRACE: ExceptionNotification EXC_BREAKPOINT (6) thread 00000103 flavor 5
NONPAL_TRACE: ExceptionNotification subcode[0] = 1

Other times when I did this, I didn't get any of the assertion failures, just a stream of EXC_BREAKPOINT exception notifications. At this point lldb is still waiting for finish to finish; attempting to interact with the process gives me error: Command requires a process which is currently stopped. (because it's not stopped). If I hit enter in the process itself, I get another EXC_BREAKPOINT notice, followed by the proper EXC_BAD_ACCESS, which prints an Unhandled exception message.

The dotnet process doesn't exit at that point; it's hung, and lldb still thinks it's not stopped.
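
To see at a glance how many notifications each thread received, the trace can be tallied with a short awk filter. This is only a sketch: it assumes the NONPAL_TRACE output has been captured somewhere and that the lines look exactly like the ones quoted above. The function name `count_notifications` is made up for illustration.

```shell
# Tally ExceptionNotification lines per (exception type, thread).
# Assumes the line layout shown in the trace above:
#   NONPAL_TRACE: ExceptionNotification EXC_BREAKPOINT (6) thread 00000103 flavor 5
count_notifications() {
    awk '$2 == "ExceptionNotification" && $3 ~ /^EXC_/ && $5 == "thread" { n[$3 " " $6]++ }
         END { for (k in n) print n[k], k }' | sort -rn
}

# Example with lines copied from the trace above.
count_notifications <<'EOF'
NONPAL_TRACE: ExceptionNotification EXC_BREAKPOINT (6) thread 00003d03 flavor 5
NONPAL_TRACE: ExceptionNotification subcode[0] = 1
NONPAL_TRACE: ExceptionNotification EXC_BREAKPOINT (6) thread 00002903 flavor 5
NONPAL_TRACE: ExceptionNotification EXC_BREAKPOINT (6) thread 00003d03 flavor 5
EOF
```

Each output line has the form "count exception-type thread", so a thread that keeps receiving EXC_BREAKPOINT stands out at the top.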

vvuk commented 1 week ago

Ah ha. If I set PAL_MachExceptionMode=2 (MachException_SuppressDebugging) then everything works as it should on attach. When lldb actually launches the process this is checked and exception handling doesn't grab EXC_MASK_BREAKPOINT | EXC_MASK_SOFTWARE. @tommcdon I guess this is why you were asking if the issue is reproducible if lldb launches the process?
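
For reference, the mask arithmetic behind this can be worked through in a few lines of shell. The exception type numbers (EXC_SOFTWARE = 5, EXC_BREAKPOINT = 6) come from `<mach/exception_types.h>`, where each mask bit is `1 << type`. The last value printed is only what the mask would be if nothing else changed; that is an assumption and was not confirmed in this thread.

```shell
# Exception type numbers from <mach/exception_types.h>; each mask bit is 1 << type.
EXC_MASK_SOFTWARE=$(( 1 << 5 ))     # EXC_SOFTWARE   = 5
EXC_MASK_BREAKPOINT=$(( 1 << 6 ))   # EXC_BREAKPOINT = 6

full_mask=$(( 0x7e ))               # "exception mask 0000007e" from the trace above
debug_bits=$(( EXC_MASK_BREAKPOINT | EXC_MASK_SOFTWARE ))

printf 'debug bits:        0x%02x\n' "$debug_bits"
# Assumption: all other bits stay as they are when the debug bits are dropped.
printf 'mask without them: 0x%02x\n' "$(( full_mask & ~debug_bits ))"
```

This prints 0x60 for the two debugging-related bits and 0x1e for the remaining mask.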

tommcdon commented 2 days ago

Thanks for the details @vvuk! It seems we should document the PAL_MachExceptionMode=2 workaround which seems to disable PAL handling of breakpoint exceptions. I'll move this issue to the dotnet/diagnostics repo and mark this as a documentation issue.
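
Until the workaround is documented, a minimal sketch of applying it might look like the following. The wrapper name `run_suppress_debugging` is hypothetical; the variable name and value are the ones reported by @vvuk above, and the app path in the comment is the one from the repro steps.

```shell
# Run a command with PAL_MachExceptionMode=2 set only for that command.
run_suppress_debugging() {
    PAL_MachExceptionMode=2 "$@"
}

# Hypothetical usage with the app from the repro steps in this thread:
#   run_suppress_debugging ./bin/Release/net8.0/osx-arm64/publish/Foo
# Harmless demonstration that the variable reaches the child process:
run_suppress_debugging sh -c 'echo "PAL_MachExceptionMode=$PAL_MachExceptionMode"'
```

Setting the variable per command rather than exporting it keeps the change scoped to the process that lldb will attach to.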