UnityAlex opened 1 week ago
Do you have the SOS plugin installed in your lldb?
I don't. I can do that if it would help though.
I was wondering if the issue might be related to the presence of the plugin or the lack thereof.
I am having difficulties getting this plugin working on my machine. When I install it following the instructions here: https://github.com/dotnet/diagnostics/blob/main/documentation/installing-sos-instructions.md it appears to break my lldb:
~ % lldb
zsh: killed lldb
If I uninstall it with `dotnet-sos uninstall`, lldb works fine again. I see some mentions in the documentation that I might need to build the SOS plugin myself and install that. Do you know if that's still true for macOS M1 machines?
This issue is tracked on https://github.com/dotnet/runtime/issues/99977.
@tommcdon The issue you linked appears to be specific to the SOS plugin. Sorry for the delay; it took me a while to find @lambdageek's workaround (https://github.com/dotnet/diagnostics/issues/4551#issuecomment-2142927236) to get lldb working with the plugin, but I can still reproduce the crash both with and without the plugin installed.
Here's a full set of steps to reproduce:
mkdir Foo
cd Foo
dotnet new console
cat <
dotnet build
dotnet publish --sc
3. In one window/tab: `./bin/Release/net8.0/osx-arm64/publish/Foo`
4. In another: `~/lldb -n Foo`
5. When lldb attaches, set a breakpoint: `breakpoint set --name PAL_DispatchException` (_note: this seems to be required to hit the issue; without a breakpoint, I haven't been able to reproduce_)
6. Hit enter in the first window
7. Observe a crash in the CLR runtime inside Foo, in `_platform_memmove`:
~/lldb -n Foo
Current symbol store settings:
-> Cache: /Users/vladimir/.dotnet/symbolcache
-> Server: https://msdl.microsoft.com/download/symbols/ Timeout: 4 RetryCount: 0
(lldb) process attach --name "Foo"
Process 13444 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
    frame #0: 0x000000019c182db4 libsystem_kernel.dylib`read + 8
libsystem_kernel.dylib`read:
->  0x19c182db4 <+8>:  b.lo   0x19c182dd4 ; <+40>
    0x19c182db8 <+12>: pacibsp
    0x19c182dbc <+16>: stp    x29, x30, [sp, #-0x10]!
    0x19c182dc0 <+20>: mov    x29, sp
Target 0: (Foo) stopped.
Executable module set to "/Users/vladimir/tmp/Foo/bin/Release/net8.0/osx-arm64/publish/Foo".
Architecture set to: arm64-apple-macosx-.
(lldb) breakpoint set --name PAL_DispatchException
Breakpoint 1: 2 locations.
(lldb) c
Process 13444 resuming
Process 13444 stopped
* thread #2, stop reason = EXC_BAD_ACCESS (code=2, address=0x16a8e3c08)
    frame #0: 0x000000019c1f3248 libsystem_platform.dylib`_platform_memmove + 168
libsystem_platform.dylib`_platform_memmove:
->  0x19c1f3248 <+168>: stp    q2, q3, [x0]
    0x19c1f324c <+172>: subs   x2, x2, #0x40
    0x19c1f3250 <+176>: b.ls   0x19c1f326c ; <+204>
    0x19c1f3254 <+180>: stp    q0, q1, [x3]
Target 0: (Foo) stopped.
(lldb) bt
* thread #2, stop reason = EXC_BAD_ACCESS (code=2, address=0x16a8e3c08)
  * frame #0: 0x000000019c1f3248 libsystem_platform.dylib`_platform_memmove + 168
    frame #1: 0x0000000105854414 libcoreclr.dylib`SEHExceptionThread(void*) + 1368
    frame #2: 0x000000019c1c2f94 libsystem_pthread.dylib`_pthread_start + 136
(lldb)
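For readers decoding the stop reason: for an `EXC_BAD_ACCESS` stop, the `code=` value lldb prints is usually a Mach `kern_return_t`, so `code=2` is `KERN_PROTECTION_FAILURE` (the address is mapped, but its protection forbids the access). Since the faulting instruction `stp q2, q3, [x0]` is a store inside `_platform_memmove` called from `SEHExceptionThread`, this reads as the runtime's exception thread writing to memory that is mapped but not writable. The helper below is a hypothetical illustration; only the two constants are copied from `<mach/kern_return.h>`.

```python
# A few kern_return_t values from <mach/kern_return.h>. For an EXC_BAD_ACCESS
# stop, lldb's "code=" field usually carries one of these.
KERN_RETURN_NAMES = {
    1: "KERN_INVALID_ADDRESS",     # address is not currently valid (e.g. unmapped)
    2: "KERN_PROTECTION_FAILURE",  # address is valid but forbids this access
}


def describe_bad_access(code: int, address: int) -> str:
    """Hypothetical helper: render an lldb EXC_BAD_ACCESS stop reason in words."""
    name = KERN_RETURN_NAMES.get(code, f"unknown kern_return_t {code}")
    return f"{name} at {address:#x}"


# Values copied from the stop reason in the session above.
print(describe_bad_access(2, 0x16A8E3C08))
```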
@vvuk thanks for providing the repro steps. We have a few clarifying questions:
Does this issue only reproduce when following the directions in diagnostics#4551 (comment), "libsosplugin.dylib: CoreCLR host crash on macOS Sonoma 14.4 on arm64" (skipping step 1 in the repro steps above)?
I can reproduce it without loading libsosplugin at all, using an unmodified lldb. It seems like just attaching causes an issue.
Does this issue reproduce when launching the app from lldb?
It doesn't seem to (both with and without libsosplugin). However, I've also heard of cases where it's not 100% reproducible, whereas with the steps above it seems to be (though I suppose you can skip libsosplugin).
Does this issue only reproduce when setting a breakpoint on PAL_DispatchException?
Without any breakpoints set, the debugger correctly stops in `pthread_kill`. If I try to set other breakpoints after attaching, for example on `CallDescrWorkerInternal`, then weird things happen. I think `CallDescrWorkerInternal` is already on the stack, so the breakpoint shouldn't be hit, but the process seems to hang instead of crashing.
This might already be understood, but there seems to be a bad interaction between the Mach exception handler thread that CoreCLR creates and the mechanism lldb uses to attach to an existing process.
If I build a debug runtime, set `NONPAL_TRACING=1`, and run the little hello world program above, here's what happens. On process launch:
NONPAL_TRACE: SEHInitializeMachExceptions: TASK PORT count 1
NONPAL_TRACE: SEHInitializeMachExceptions: TASK PORT mask 0000007e handler: 00000000 behavior 00000000 flavor 0
NONPAL_TRACE: Enabling handlers for thread 00000103 exception mask 0000007e exception port 00001c03
NONPAL_TRACE: EnableMachExceptions: THREAD PORT count 1
NONPAL_TRACE: EnableMachExceptions: THREAD PORT mask 0000007e handler: 00000000 behavior 00000000 flavor 0
... bunch of threads ...
Hello World [the process waits for a keypress at this point]
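The `mask 0000007e` in the launch trace is a bitmask of Mach exception types, where each `EXC_MASK_x` equals `1 << EXC_x`. A small sketch that decodes it, assuming the exception type numbers from Apple's `<mach/exception_types.h>` (the helper name is hypothetical), shows that the runtime registers for `EXC_MASK_BREAKPOINT` and `EXC_MASK_SOFTWARE` as well as the fault types:

```python
# Mach exception type numbers from <mach/exception_types.h> (subset).
# Each EXC_MASK_<name> bit is 1 << <type number>.
EXC_TYPES = {
    1: "EXC_BAD_ACCESS",
    2: "EXC_BAD_INSTRUCTION",
    3: "EXC_ARITHMETIC",
    4: "EXC_EMULATION",
    5: "EXC_SOFTWARE",
    6: "EXC_BREAKPOINT",
    7: "EXC_SYSCALL",
    8: "EXC_MACH_SYSCALL",
    9: "EXC_RPC_ALERT",
    10: "EXC_CRASH",
}


def decode_exception_mask(mask: int) -> list[str]:
    """Hypothetical helper: list the EXC_MASK_* names whose bits are set in `mask`."""
    return [
        "EXC_MASK_" + name.removeprefix("EXC_")
        for exc_type, name in sorted(EXC_TYPES.items())
        if mask & (1 << exc_type)
    ]


# The value printed as "mask 0000007e" in the NONPAL_TRACE output.
print(decode_exception_mask(0x7E))
```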
I then attach lldb at this point and type `finish` (note: not `continue`; I need the debugger to actually manipulate the process, which is likely what setting the breakpoint on `PAL_x` achieved). The following trace logs show up after the `finish`:
NONPAL_TRACE: Enabling handlers for thread 00001f03 exception mask 0000007e exception port 00000c03
NONPAL_TRACE: EnableMachExceptions: THREAD PORT count 1
NONPAL_TRACE: EnableMachExceptions: THREAD PORT mask 0000007e handler: 00000000 behavior 00000000 flavor 0
NONPAL_TRACE: Received message EXCEPTION_RAISE_64 (00000965) from (remote) 00007307 to (local) 00000c03
NONPAL_TRACE: ExceptionNotification EXC_BREAKPOINT (6) thread 00000103 flavor 5
NONPAL_TRACE: ExceptionNotification subcode[0] = 1
NONPAL_TRACE: ExceptionNotification subcode[1] = 19c1c355c
NONPAL_TRACE: ExceptionNotification actual lr 0x661d00019c1c355c sp 000000016cfb3fe0 fp 000000016cfb4070 pc 0x19c1c355c cpsr 60001000
NONPAL_TRACE: ExceptionNotification far 0000000000000000 esr f2000000 exception 00000000
NONPAL_TRACE: HijackFaultingThread thread 00000103
NONPAL_TRACE: ReplyToNotification KERN_SUCCESS thread 00000103 port 00007307
NONPAL_TRACE: Received message EXCEPTION_RAISE_64 (00000965) from (remote) 0000730b to (local) 00000c03
NONPAL_TRACE: ExceptionNotification EXC_BREAKPOINT (6) thread 00002903 flavor 5
NONPAL_TRACE: ExceptionNotification subcode[0] = 1
NONPAL_TRACE: ExceptionNotification subcode[1] = 19c1c355c
NONPAL_TRACE: ExceptionNotification actual lr 0xca7580019c1c355c sp 000000016d4e9a00 fp 000000016d4e9a90 pc 0x19c1c355c cpsr 60001000
NONPAL_TRACE: ExceptionNotification far 0000000000000000 esr f2000000 exception 00000000
NONPAL_TRACE: HijackFaultingThread thread 00002903
NONPAL_TRACE: ReplyToNotification KERN_SUCCESS thread 00002903 port 0000730b
NONPAL_TRACE: Received message EXCEPTION_RAISE_64 (00000965) from (remote) 0000730f to (local) 00000c03
NONPAL_TRACE: ExceptionNotification EXC_BREAKPOINT (6) thread 00003d03 flavor 5
NONPAL_TRACE: ExceptionNotification subcode[0] = 1
NONPAL_TRACE: ExceptionNotification subcode[1] = 19c1c355c
NONPAL_TRACE: ExceptionNotification actual lr 0xc23880019c1c355c sp 000000016d949a70 fp 000000016d949b00 pc 0x19c1c355c cpsr 60001000
NONPAL_TRACE: ExceptionNotification far 0000000000000000 esr f2000000 exception 00000000
NONPAL_TRACE: HijackFaultingThread thread 00003d03
Assert failure(PID 84850 [0x00014b72], Thread: 12211081 [0xba5389]): fWasAttached
File: /opt/UnitySrc/u/runtime/src/coreclr/debug/ee/controller.cpp Line: 4234
Image: /opt/UnitySrc/u/runtime/artifacts/bin/testhost/net8.0-osx-Debug-arm64/dotnet
NONPAL_TRACE: ReplyToNotification KERN_SUCCESS thread 00003d03 port 0000730f
NONPAL_TRACE: Received message EXCEPTION_RAISE_64 (00000965) from (remote) 00007313 to (local) 00000c03
NONPAL_TRACE: ExceptionNotification EXC_BREAKPOINT (6) thread 00001207 flavor 5
NONPAL_TRACE: ExceptionNotification subcode[0] = 1
NONPAL_TRACE: ExceptionNotification subcode[1] = 19c1c355c
NONPAL_TRACE: ExceptionNotification actual lr 0xaf6200019c1c355c sp 000000016d2fa090 fp 000000016d2fa120 pc 0x19c1c355c cpsr 60001000
NONPAL_TRACE: ExceptionNotification far 0000000000000000 esr f2000000 exception 00000000
Assert failure(PID 84850 [0x00014b72], Thread: 12211223 [0xba5417]): fWasAttached
File: /opt/UnitySrc/u/runtime/src/coreclr/debug/ee/controller.cpp Line: 4234
Image: /opt/UnitySrc/u/runtime/artifacts/bin/testhost/net8.0-osx-Debug-arm64/dotnet
NONPAL_TRACE: HijackFaultingThread thread 00001207
NONPAL_TRACE: ReplyToNotification KERN_SUCCESS thread 00001207 port 00007313
Assert failure(PID 84850 [0x00014b72], Thread: 12211248 [0xba5430]): fWasAttached
File: /opt/UnitySrc/u/runtime/src/coreclr/debug/ee/controller.cpp Line: 4234
Image: /opt/UnitySrc/u/runtime/artifacts/bin/testhost/net8.0-osx-Debug-arm64/dotnet
Assert failure(PID 84850 [0x00014b72], Thread: 12211088 [0xba5390]): fWasAttached
File: /opt/UnitySrc/u/runtime/src/coreclr/debug/ee/controller.cpp Line: 4234
Image: /opt/UnitySrc/u/runtime/artifacts/bin/testhost/net8.0-osx-Debug-arm64/dotnet
NONPAL_TRACE: Received message EXCEPTION_RAISE_64 (00000965) from (remote) 00007317 to (local) 00000c03
NONPAL_TRACE: ExceptionNotification EXC_BREAKPOINT (6) thread 00007e03 flavor 5
NONPAL_TRACE: ExceptionNotification subcode[0] = 1
NONPAL_TRACE: ExceptionNotification subcode[1] = 19c1c355c
NONPAL_TRACE: ExceptionNotification actual lr 0x19c1c355c sp 000000016dec1cd0 fp 000000016dec1d60 pc 0x19c1c355c cpsr 40001000
NONPAL_TRACE: ExceptionNotification far 0000000000000000 esr f2000000 exception 00000000
NONPAL_TRACE: HijackFaultingThread thread 00007e03
NONPAL_TRACE: ReplyToNotification KERN_SUCCESS thread 00007e03 port 00007317
Assert failure(PID 84850 [0x00014b72], Thread: 12211503 [0xba552f]): fWasAttached
File: /opt/UnitySrc/u/runtime/src/coreclr/debug/ee/controller.cpp Line: 4234
Image: /opt/UnitySrc/u/runtime/artifacts/bin/testhost/net8.0-osx-Debug-arm64/dotnet
NONPAL_TRACE: Received message EXCEPTION_RAISE_64 (00000965) from (remote) 0000731b to (local) 00000c03
NONPAL_TRACE: ExceptionNotification EXC_BREAKPOINT (6) thread 00003d03 flavor 5
NONPAL_TRACE: ExceptionNotification subcode[0] = 1
NONPAL_TRACE: ExceptionNotification subcode[1] = 19c1c355c
NONPAL_TRACE: ExceptionNotification actual lr 0x19c1c355c sp 000000016d946c60 fp 000000016d946cf0 pc 0x19c1c355c cpsr 40001000
NONPAL_TRACE: ExceptionNotification far 0000000000000000 esr f2000000 exception 00000000
NONPAL_TRACE: HijackFaultingThread thread 00003d03
NONPAL_TRACE: ReplyToNotification KERN_SUCCESS thread 00003d03 port 0000731b
Assert failure(PID 84850 [0x00014b72], Thread: 12211248 [0xba5430]): pOldContext == NULL
File: /opt/UnitySrc/u/runtime/src/coreclr/debug/ee/controller.cpp Line: 4169
Image: /opt/UnitySrc/u/runtime/artifacts/bin/testhost/net8.0-osx-Debug-arm64/dotnet
NONPAL_TRACE: Received message EXCEPTION_RAISE_64 (00000965) from (remote) 0000731f to (local) 00000c03
NONPAL_TRACE: ExceptionNotification EXC_BREAKPOINT (6) thread 00000103 flavor 5
NONPAL_TRACE: ExceptionNotification subcode[0] = 1
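Every `EXC_BREAKPOINT` notification in the trace above reports the same pc (`0x19c1c355c`) and the same ESR (`f2000000`), across several different threads. Bits [31:26] of the ARMv8 ESR are the exception class, and `0xf2000000 >> 26 == 0x3C`, the class for a BRK instruction executed in AArch64 state. That is consistent with threads repeatedly executing a software breakpoint at one address, though the trace alone doesn't show who placed it. The sketch below summarizes such a trace; the regexes are assumptions fitted to the log lines shown here, not a documented format.

```python
import re
from collections import Counter

# Patterns for three kinds of NONPAL_TRACE lines, e.g.
#   ExceptionNotification EXC_BREAKPOINT (6) thread 00000103 flavor 5
#   ExceptionNotification actual lr 0x... sp ... fp ... pc 0x19c1c355c cpsr ...
#   ExceptionNotification far 0000000000000000 esr f2000000 exception 00000000
NOTIFY_RE = re.compile(
    r"ExceptionNotification (?P<name>EXC_\w+) \(\d+\) thread (?P<thread>[0-9a-f]+)"
)
PC_RE = re.compile(r"ExceptionNotification actual .* pc (?P<pc>0x[0-9a-f]+)")
ESR_RE = re.compile(r"ExceptionNotification far [0-9a-f]+ esr (?P<esr>[0-9a-f]+)")


def summarize(trace: str) -> tuple[Counter, Counter, Counter]:
    """Count notifications per (exception, thread), faulting pcs, and ESR classes."""
    kinds: Counter = Counter()
    pcs: Counter = Counter()
    exception_classes: Counter = Counter()
    for line in trace.splitlines():
        if m := NOTIFY_RE.search(line):
            kinds[(m["name"], m["thread"])] += 1
        elif m := PC_RE.search(line):
            pcs[m["pc"]] += 1
        elif m := ESR_RE.search(line):
            # Bits [31:26] of the ARMv8 ESR are the exception class (EC);
            # EC 0x3C means a BRK instruction was executed in AArch64 state.
            exception_classes[int(m["esr"], 16) >> 26] += 1
    return kinds, pcs, exception_classes
```

Usage would be along the lines of `summarize(open("trace.log").read())`, where `trace.log` is a hypothetical file holding the captured stderr.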
At other times when I did this, I didn't get any of the assertion failures, just a stream of `EXC_BREAKPOINT` exception notifications. At this point `lldb` is still waiting for `finish` to complete; attempting to interact with the process gives me `error: Command requires a process which is currently stopped.` (because it's not stopped). If I hit enter in the process itself, I get another `EXC_BREAKPOINT` notice, followed by the proper `EXC_BAD_ACCESS`, which prints an Unhandled exception message. The `dotnet` process doesn't exit at that point; it's hung, and `lldb` still thinks it's not stopped.
Ah ha. If I set `PAL_MachExceptionMode=2` (`MachException_SuppressDebugging`), then everything works as it should on attach. When `lldb` actually launches the process, this is checked and exception handling doesn't grab `EXC_MASK_BREAKPOINT | EXC_MASK_SOFTWARE`. @tommcdon I guess this is why you were asking whether the issue reproduces when lldb launches the process?
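To make the effect concrete: dropping `EXC_MASK_BREAKPOINT | EXC_MASK_SOFTWARE` from the launch-time mask `0x7e` leaves `0x1e`, so breakpoint and software exceptions are no longer routed to the runtime's handler. The function below is a simplified, hypothetical model of that statement, not the PAL's actual code, and `0x1e` is derived arithmetic rather than a value observed in a trace.

```python
# Bit values from <mach/exception_types.h> (EXC_MASK_x == 1 << EXC_x).
EXC_MASK_BAD_ACCESS = 1 << 1
EXC_MASK_BAD_INSTRUCTION = 1 << 2
EXC_MASK_ARITHMETIC = 1 << 3
EXC_MASK_EMULATION = 1 << 4
EXC_MASK_SOFTWARE = 1 << 5
EXC_MASK_BREAKPOINT = 1 << 6

# The default mask seen in the NONPAL_TRACE output ("mask 0000007e").
DEFAULT_MASK = (
    EXC_MASK_BAD_ACCESS
    | EXC_MASK_BAD_INSTRUCTION
    | EXC_MASK_ARITHMETIC
    | EXC_MASK_EMULATION
    | EXC_MASK_SOFTWARE
    | EXC_MASK_BREAKPOINT
)


def pal_exception_mask(suppress_debugging: bool) -> int:
    """Hypothetical model: drop the debugger-related bits when suppression is on."""
    mask = DEFAULT_MASK
    if suppress_debugging:
        mask &= ~(EXC_MASK_BREAKPOINT | EXC_MASK_SOFTWARE)
    return mask


print(hex(pal_exception_mask(False)))  # 0x7e
print(hex(pal_exception_mask(True)))   # 0x1e
```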
Thanks for the details @vvuk! It seems we should document the `PAL_MachExceptionMode=2` workaround, which seems to disable PAL handling of breakpoint exceptions. I'll move this issue to the dotnet/diagnostics repo and mark this as a documentation issue.
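The workaround presumably has to be in the target process's environment when the runtime starts, since that is when the trace shows the exception ports being set up. From a shell that is just `PAL_MachExceptionMode=2 ./bin/Release/net8.0/osx-arm64/publish/Foo`. A minimal Python sketch of the same thing, with hypothetical helper names, for anyone scripting the repro:

```python
import os
import subprocess
from typing import Mapping, Optional


def workaround_env(base: Optional[Mapping[str, str]] = None) -> dict[str, str]:
    """Copy `base` (default: this process's environment) and add the workaround."""
    env = dict(os.environ if base is None else base)
    # 2 == MachException_SuppressDebugging, per the discussion above.
    env["PAL_MachExceptionMode"] = "2"
    return env


def launch_with_workaround(app_path: str) -> subprocess.Popen:
    """Start the app with the workaround set, so a debugger can attach to it later."""
    return subprocess.Popen([app_path], env=workaround_env())
```

For example, `proc = launch_with_workaround("./bin/Release/net8.0/osx-arm64/publish/Foo")`, then attach with `lldb -p <proc.pid>` or `~/lldb -n Foo` as in the repro steps.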
Description
When a native lldb debugger is attached to CoreCLR on macOS (ARM64), breakpoints in certain locations can cause the process to crash.
Reproduction Steps
Sample code:
The idea of the sample is to trigger the native exception handling for a null reference exception, which is where we have our breakpoint in lldb:
breakpoint set --name PAL_DispatchException
Expected behavior
No crash
Actual behavior
Silent crash.
Regression?
No response
Known Workarounds
No response
Configuration
.NET version 8.0.201, macOS 14.5, M1 (ARM64). Does not happen on Windows. I haven't tried Linux yet.
Other information
If it helps, the first few frames of what I suspect is an overflow look like:
This is followed by roughly 500 more frames of the same thing.