dotnet / diagnostics

This repository contains the source code for various .NET Core runtime diagnostic tools and documents.
MIT License

libsosplugin cannot be loaded into lldb (macos arm64) #4259

Open YuraAka opened 1 year ago

YuraAka commented 1 year ago

Description

I am trying to install dotnet-sos so that I can debug .NET programs on macOS, but after installing it according to this manual (https://github.com/dotnet/diagnostics/blob/main/documentation/building/osx-instructions.md), lldb shows an error on startup:

$ lldb
error: this file does not represent a loadable dylib
error: 'setsymbolserver' is not a valid command.
(lldb) 

Further context:

$ cat .lldbinit 
#START - ADDED BY SOS INSTALLER
plugin load /Users/yuraaka/.dotnet/sos/libsosplugin.dylib
setsymbolserver -ms
#END - ADDED BY SOS INSTALLER

I also tried building libsosplugin from source (https://github.com/dotnet/diagnostics), but got the same result. What am I doing wrong?

Reproduction Steps

  1. Install dotnet-sos
  2. Run lldb
  3. Get an error

Expected behavior

No error appears.

Actual behavior

Error "this file does not represent a loadable dylib".

Regression?

No response

Known Workarounds

No response

Configuration

  - Apple M2 Max
  - macOS Ventura 13.5.2

$ lldb -v
lldb-1403.0.17.67
Apple Swift version 5.8.1 (swiftlang-5.8.0.124.5 clang-1403.0.22.11.100)
$ dotnet-sos --version
7.0.442301+6245a3eeff5a12218eb5b615788d776027133e91

Other information

No response

ghost commented 1 year ago

Tagging subscribers to this area: @tommcdon See info in area-owners.md if you want to be subscribed.

Labels: `area-Diagnostics-coreclr`

filipnavara commented 1 year ago

This is a known regression in Xcode. I remember talking with @EgorBo about it, so maybe he remembers the details?

filipnavara commented 1 year ago

I tracked down the Discord thread where we discussed it. Apparently it was broken in Xcode 14.3.1 and worked in Xcode 15 (which has different problems, so beware). A more detailed description was also cross-posted to the Apple forum.

EgorBo commented 1 year ago

cc @hoyosjs

hoyosjs commented 1 year ago

Apologies - I should have updated the forum thread after our discussion with Apple. They very promptly solved the issue in their codebase. It's just now becoming available with Xcode 15 (although things seem to have some issues there right now that need some investigation cc @mikem8361 )

tommcdon commented 1 year ago

> Apparently it was broken in Xcode 14.3.1 and worked in Xcode 15 (which has different problems, so beware). A more detailed description was also cross-posted to the Apple forum.

@filipnavara are any of the different problems issues we should investigate, and if so, would you mind opening new issues to track them? It also sounds like we can close this particular bug because SOS loading is now fixed in Xcode 15, so I will close this issue.

filipnavara commented 1 year ago

> are any of the different problems issues we should investigate, and if so would you mind opening new issues to track it?

I am not sure. I will file a separate issue if necessary. With some Xcode 15 beta I was getting the following error on lldb startup: Error: Fail to initialize coreclr 80070008.

tommcdon commented 1 year ago

> With some Xcode 15 beta I was getting the following error on lldb startup: Error: Fail to initialize coreclr 80070008.

If you are still seeing this error, we will re-activate this issue to track that particular problem

mikem8361 commented 1 year ago

Yes, I've reproduced the "Error: Fail to initialize coreclr 80070008" failure on our M1. Initializing the managed hosting layer is failing for some reason, which means managed commands like dumpheap and eeheap won't work. We will continue to investigate, but Apple has made debugging this scenario difficult (we cannot attach to lldb to debug SOS).

EgorBo commented 1 year ago

I worked around the initial issue on my Apple M2 machine by compiling LLDB from source.

ylatuya commented 11 months ago

I can still reproduce the issue with Xcode 15.0.1 (15A507):

➜  ~ dotnet-sos --version
8.0.452401+966acd12b91675a4d06a7572ff47c587f827beaf
➜  ~ lldb --version
lldb-1500.0.22.8
Apple Swift version 5.9 (swiftlang-5.9.0.128.108 clang-1500.0.40.1)
➜  ~ lldb
error: this file does not represent a loadable dylib
error: 'setsymbolserver' is not a valid command.
(lldb) 

> I worked around the initial issue on my Apple M2 machine by compiling LLDB from source.

Building lldb from source does not fix the issue for me.

ylatuya commented 11 months ago

I think the problem is that libsosplugin.dylib is an x86_64-only library:

➜  llvm-build file /Users/andoni/.dotnet/sos/libsosplugin.dylib
/Users/andoni/.dotnet/sos/libsosplugin.dylib: Mach-O 64-bit dynamically linked shared library x86_64

A workaround is forcing lldb to run as x86_64:

➜  llvm-build arch -arch x86_64 lldb
Current symbol store settings:
-> Cache: /Users/andoni/.dotnet/symbolcache
-> Server: https://msdl.microsoft.com/download/symbols/ Timeout: 4 RetryCount: 0
(lldb)

The fix for this issue is to provide libsosplugin.dylib and libsos.dylib as fat (universal) libraries with both x86_64 and arm64 support.
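
The check above can be sketched as a small script. This is only an illustration under a few assumptions: the helper names (sos_plugin_path, desc_has_arch, check_sos_arch) are made up for this example, the plugin path is taken from the "plugin load" line the SOS installer writes to .lldbinit, and the path contains no spaces. Note that uname -m reports x86_64 in a terminal running under Rosetta.

```shell
# Sketch: report whether the SOS plugin named in .lldbinit matches the host CPU.
# All function names here are hypothetical helpers, not part of dotnet-sos or lldb.

# Print the path given to "plugin load" in the lldbinit file ($1).
sos_plugin_path() {
    awk '$1 == "plugin" && $2 == "load" { print $3; exit }' "$1"
}

# Succeed if the `file` description ($1) mentions the architecture ($2).
desc_has_arch() {
    case "$1" in
        *"$2"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Compare the plugin's architecture with the host's and print the result.
check_sos_arch() {
    plugin=$(sos_plugin_path "${1:-$HOME/.lldbinit}")
    host=$(uname -m)        # arm64 on Apple silicon, x86_64 on Intel or Rosetta
    desc=$(file "$plugin")
    if desc_has_arch "$desc" "$host"; then
        echo "ok: $plugin matches $host"
    else
        echo "mismatch: $desc (host is $host)"
    fi
}
```

After sourcing the file, running check_sos_arch with no arguments inspects ~/.lldbinit. A universal library would list both architectures in the file output, so it would pass on either host.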

filipnavara commented 11 months ago

dotnet-sos has a command-line parameter to install the extension for a specific architecture.

mikem8361 commented 11 months ago

dotnet-sos install --arch Arm64 is the command. I'm pretty sure this will fix your issue.

ylatuya commented 11 months ago

Thanks, I didn't know dotnet-sos already provided builds for different architectures. It's strange that the x64 version was installed by default instead of the arm64 one. I was probably using the x64 dotnet version rather than the arm64 one.

Using the arm64 version fixes the issue and I can now reproduce the Error: Fail to initialize coreclr 80070008 issue.
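
For anyone scripting this, a minimal sketch of picking the --arch value from the host CPU follows. The function name sos_arch_for is hypothetical, and the X64 spelling is an assumption here (only Arm64 is confirmed in this thread), so check dotnet-sos install --help before relying on it.

```shell
# Sketch: map the machine name from `uname -m` to a dotnet-sos --arch value.
# sos_arch_for is a hypothetical helper; "X64" is an assumed value.
sos_arch_for() {
    case "$1" in
        arm64|aarch64) echo Arm64 ;;
        x86_64)        echo X64 ;;
        *)             echo "unsupported machine: $1" >&2; return 1 ;;
    esac
}

# Example (not executed here):
#   dotnet-sos install --arch "$(sos_arch_for "$(uname -m)")"
```

Because uname -m reports x86_64 inside a Rosetta terminal, this would still select the x64 plugin there, which may be how the wrong build got installed in the first place.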

am11 commented 1 month ago

With Xcode 15.3, lldb goes straight to SIGKILL:

$ xcodebuild -version   
Xcode 15.3
Build version 15E204a

$ lldb helloworld/dist/helloworld 
zsh: killed     lldb helloworld/dist/helloworld

~/Library/Logs/DiagnosticReports/lldb-2024-09-17-073242.ips shows:

"exception" : {"port":0,"signal":"SIGKILL","guardId":0,"codes":"0x0000000000000000, 0x0000000000000000","violations":["SET_EXCEPTION_BEHAVIOR"],"message":" SET_EXCEPTION_BEHAVIOR on mach port 0 (guarded with 0x0000000000000000)","subtype":"GUARD_TYPE_MACH_PORT","type":"EXC_GUARD","rawCodes":[0,0]},

With PAL_MachExceptionMode=7, it just fails to load the plugin:

$ PAL_MachExceptionMode=7 lldb helloworld/dist/helloworld
SOS_HOSTING: Fail to initialize hosting runtime '/usr/local/share/dotnet/shared/Microsoft.NETCore.App/6.0.29/libcoreclr.dylib' 80004005
Unrecognized command 'setsymbolserver' because managed hosting failed or was disabled. See sethostruntime command for details.
(lldb) target create "helloworld/dist/helloworld"
Current executable set to '/Users/adeel/projects/helloworld/dist/helloworld' (arm64).

Both are arm64 binaries, so is this caused by something else?

$ file ~/.dotnet/sos/libsosplugin.dylib /usr/local/share/dotnet/shared/Microsoft.NETCore.App/6.0.29/libcoreclr.dylib
/Users/adeel/.dotnet/sos/libsosplugin.dylib:                                  Mach-O 64-bit dynamically linked shared library arm64
/usr/local/share/dotnet/shared/Microsoft.NETCore.App/6.0.29/libcoreclr.dylib: Mach-O 64-bit dynamically linked shared library arm64

tommcdon commented 1 month ago

I believe folks have tried the following workaround with some success - https://github.com/dotnet/diagnostics/issues/4551#issuecomment-2142927236.

am11 commented 1 month ago

Same issue on macOS Sequoia 15.0 and Xcode / llvm / lldb 16.0 (released on Monday the 16th).

Better workaround with Apple's lldb (standard installation):

$ sudo cp /Applications/Xcode.app/Contents/Developer/usr/bin/lldb /usr/local/bin
$ sudo install_name_tool -add_rpath /Applications/Xcode.app/Contents/SharedFrameworks /usr/local/bin/lldb
$ sudo codesign --force --sign - /usr/local/bin/lldb

(I chose /usr/local/bin/lldb since it is in PATH before /usr/bin)

Now open a new terminal and start using lldb with libsosplugin (clrstack -f etc. are working). There is no need to specify entitlements or to set PAL_MachExceptionMode. Apple's lldb doesn't have any entitlements set, so a plugin dylib with a different signature fails to load. With an ad hoc signature, apparently it's not required to specify the entitlements.
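
The three commands above can be wrapped in a function with a dry-run switch, so the steps can be reviewed before anything is copied or re-signed. This is a sketch only: the paths are the ones from this comment, while DRY_RUN, run and resign_lldb are names invented for the example.

```shell
# Sketch: re-sign a private copy of Apple's lldb so it can load libsosplugin.
# DRY_RUN, run and resign_lldb are hypothetical names used for illustration.
SRC=/Applications/Xcode.app/Contents/Developer/usr/bin/lldb
FRAMEWORKS=/Applications/Xcode.app/Contents/SharedFrameworks
DEST=/usr/local/bin/lldb

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Copy lldb, add the rpath to Xcode's shared frameworks, then ad hoc sign it.
resign_lldb() {
    run sudo cp "$SRC" "$DEST" &&
    run sudo install_name_tool -add_rpath "$FRAMEWORKS" "$DEST" &&
    run sudo codesign --force --sign - "$DEST"
}
```

With DRY_RUN=1 set, calling resign_lldb only prints the three commands. Since the copy is a snapshot of the Xcode binary, the steps presumably need to be repeated after an Xcode update.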

cc @lambdageek @janvorli

janvorli commented 1 month ago

@am11, that's awesome, thank you so much for sharing this workaround!

mikem8361 commented 1 month ago

Do the C# commands like dumpheap -stat work? We have seen problems initializing the .NET hosting on arm64 macOS.

am11 commented 1 month ago

> Do the C# commands like dumpheap -stat work?

Apparently working (when the program is stopped at the exception):

(lldb) dumpheap -stat
Statistics:
          MT Count TotalSize Class Name
000102ed9f50     1        24 System.Reflection.Metadata.TypeNameParseOptions
000102958248     1        24 System.Collections.Generic.StringEqualityComparer
00010295be88     1        24 System.OrdinalCaseSensitiveComparer
00010295b488     1        24 System.Collections.Generic.NonRandomizedStringEqualityComparer+OrdinalIgnoreCaseComparer
...
Total 1,307 objects, 133,745 bytes

@mikem8361 by the way, https://github.com/dotnet/diagnostics/issues/52 is still relevant for lldb on Unix. For example, if a class name contains a Unicode character, dumpheap -stat renders it as ? (NörttiNirvana became N?rttiNirvana), while Console.WriteLine output in the lldb REPL prints it correctly. So it is probably related to direct vs. indirect stdout (via the lldb APIs). I had to switch to en-US to get , as the number grouping separator, because the Finnish locale uses a non-breaking space (char code 160), which was rendered like this:

0001040a92e0     5     8?392 System.Object[]
000105193d58    39    12?424 System.Int32[]
000105196e50   805    71?272 System.String
Total 1?320 objects, 140?633 bytes

(same goes for any non-ASCII char)
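
A quick way to see why the two separators behave differently is to look at their bytes. The sketch below assumes UTF-8 output, and utf8_hex is a hypothetical helper name: the en-US comma is a single ASCII byte, while the non-breaking space (char code 160) is encoded as two bytes outside the ASCII range.

```shell
# Sketch: print the bytes of a string as hex, to compare ASCII and non-ASCII separators.
# utf8_hex is a hypothetical helper name.
utf8_hex() {
    printf '%s' "$1" | od -An -tx1 | tr -d ' \n'
    echo
}

utf8_hex ","                      # en-US grouping separator: one ASCII byte (2c)
utf8_hex "$(printf '\302\240')"   # non-breaking space in UTF-8: two bytes (c2a0)
```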

tommcdon commented 1 month ago

Limitation due to default signing and entitlements on macOS lldb. Workaround documented here: https://github.com/dotnet/diagnostics/blob/main/documentation/FAQ.md. Moving to Future and marking as tracking-external-issue.