dotnet / diagnostics

This repository contains the source code for various .NET Core runtime diagnostic tools and documents.
MIT License

support dumps on OSX #577

Closed danmoseley closed 3 years ago

danmoseley commented 4 years ago

@mikem8361, @tommcdon mentioned to me that although we may not have a clean way of creating a dump on macOS today, it might be possible to add a quick-and-dirty way that just captures the entire process memory and could still be opened by lldb and examined with SOS.

Is that correct? If so, is it feasible?

Right now, if a test hangs or crashes in a test run on macOS there is little we can do without a repro. So even a quick-and-dirty facility might be worth having and hopefully @epananth could help wire it up into our test infrastructure.

cc @stephentoub @bartonjs

mikem8361 commented 4 years ago

OSX can generate "system" coredumps. Run "ulimit -c unlimited" in the shell the app is going to run in, and the dumps show up in /cores with the name "core.".

The problem is on the lldb side. lldb for macOS has a bug that makes it really hard to use SOS: the lldb API that SOS uses to get the "OS thread id" returns the thread index and not the "id", so SOS/DAC can't match it and find the thread data.
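
For a test harness that launches the app itself, the "ulimit -c unlimited" step can also be applied from the launcher process. The following is only a minimal sketch in Python, not something from this thread: the function name `enable_core_dumps` is hypothetical, and it raises the soft core-file limit only as far as the hard limit, so it matches "ulimit -c unlimited" only when the hard limit is itself unlimited.

```python
import resource


def enable_core_dumps():
    """Raise the soft core-file size limit to the hard limit.

    Hypothetical helper. Processes started afterwards inherit the limit.
    Equivalent to `ulimit -c unlimited` only if the hard limit is unlimited.
    """
    _soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    # Raising the soft limit up to the hard limit needs no extra privileges.
    resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))
    return resource.getrlimit(resource.RLIMIT_CORE)


if __name__ == "__main__":
    soft, hard = enable_core_dumps()
    if soft == resource.RLIM_INFINITY:
        print("core file size: unlimited")
    else:
        print(f"core file size limited to {soft} bytes")
```

Whether a dump is actually written still depends on the system (for example, on /cores being writable), which this sketch does not check.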

k15tfu commented 3 years ago

@mikem8361 Hi! Okay, so we have to map each OSID to an LLDB thread index manually, and I think I know how to do it for the main thread:

(lldb) thread list
Process 0 stopped
* thread #1: tid = 0x0000, 0x00007fff7995b2c6 libsystem_kernel.dylib`__pthread_kill + 10, stop reason = signal SIGSTOP
  thread #2: tid = 0x0001, 0x00007fff7995522a libsystem_kernel.dylib`mach_msg_trap + 10, stop reason = signal SIGSTOP
  thread #3: tid = 0x0002, 0x00007fff7995d36e libsystem_kernel.dylib`poll + 10, stop reason = signal SIGSTOP
  thread #4: tid = 0x0003, 0x00007fff799561ee libsystem_kernel.dylib`__open + 10, stop reason = signal SIGSTOP
  thread #5: tid = 0x0004, 0x00007fff7995886a libsystem_kernel.dylib`__psynch_cvwait + 10, stop reason = signal SIGSTOP
  thread #6: tid = 0x0005, 0x00007fff7995886a libsystem_kernel.dylib`__psynch_cvwait + 10, stop reason = signal SIGSTOP
  thread #7: tid = 0x0006, 0x00007fff7995886a libsystem_kernel.dylib`__psynch_cvwait + 10, stop reason = signal SIGSTOP
  thread #8: tid = 0x0007, 0x00007fff7995886a libsystem_kernel.dylib`__psynch_cvwait + 10, stop reason = signal SIGSTOP
  thread #9: tid = 0x0008, 0x00007fff799591ea libsystem_kernel.dylib`__accept + 10, stop reason = signal SIGSTOP
  thread #10: tid = 0x0009, 0x00007fff7995886a libsystem_kernel.dylib`__psynch_cvwait + 10, stop reason = signal SIGSTOP
  thread #11: tid = 0x000a, 0x00007fff7995886a libsystem_kernel.dylib`__psynch_cvwait + 10, stop reason = signal SIGSTOP
  thread #12: tid = 0x000b, 0x00007fff7995886a libsystem_kernel.dylib`__psynch_cvwait + 10, stop reason = signal SIGSTOP
  thread #13: tid = 0x000c, 0x00007fff7995522a libsystem_kernel.dylib`mach_msg_trap + 10, stop reason = signal SIGSTOP
  thread #14: tid = 0x000d, 0x00007fff79a0d3f0 libsystem_pthread.dylib`start_wqthread, stop reason = signal SIGSTOP
  thread #15: tid = 0x000e, 0x0000000000000000, stop reason = signal SIGSTOP
  thread #16: tid = 0x000f, 0x00007fff7995886a libsystem_kernel.dylib`__psynch_cvwait + 10, stop reason = signal SIGSTOP
  thread #17: tid = 0x0010, 0x00007fff79956ef2 libsystem_kernel.dylib`read + 10, stop reason = signal SIGSTOP
(lldb) clrthreads
ThreadCount:      7
UnstartedThread:  0
BackgroundThread: 5
PendingThread:    0
DeadThread:       1
Hosted Runtime:   no
                                                                                                            Lock
 DBG   ID     OSID ThreadOBJ           State GC Mode     GC Alloc Context                  Domain           Count Apt Exception
XXXX    1    dda48 00007FE51880A200    20020 Preemptive  000000018F1CBE40:000000018F1CBFD0 00007FE518814600 0     Ukn System.InvalidOperationException 000000018f196b60
XXXX    2    dda4f 00007FE518836400    21220 Preemptive  0000000000000000:0000000000000000 00007FE518814600 0     Ukn (Finalizer)
XXXX    3    dda51 00007FE519017800  1020220 Preemptive  0000000000000000:0000000000000000 00007FE518814600 0     Ukn (Threadpool Worker)
XXXX    4    dda52 00007FE519025400  1021220 Preemptive  000000028F0E7C80:000000028F0E7FD0 00007FE518814600 0     Ukn (Threadpool Worker)
XXXX    5    dda56 00007FE51885A800    21220 Preemptive  000000018F0FA370:000000018F0FBFD0 00007FE518814600 0     Ukn
XXXX    6    dda5c 00007FE51885FE00  1021220 Preemptive  000000018F0FE248:000000018F0FFFD0 00007FE518814600 0     Ukn (Threadpool Worker)
   1    7        0 00007FE519046E00    31820 Preemptive  0000000000000000:0000000000000000 00007FE518814600 0     Ukn
(lldb) clrstack
OS Thread Id: 0x0 (1)
Failed to start stack walk: 8000ffff

In this case it would be setsostid dda48 1. But what about the other threads? For example, how do I find the right OSID for #13? (After trying all the cases, I know it's setsostid dda56 d.)

mikem8361 commented 3 years ago

Using setsostid is a hit-and-miss operation. You need to "guess" which thread # corresponds to which OSID, and the command only works on one thread at a time.
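
The guessing described above can at least be made systematic. The sketch below is an illustration only, not part of SOS: the helper names `parse_osids` and `candidate_commands` are hypothetical, the column layout is assumed from the `clrthreads` output pasted earlier in this thread, and the index is written in hex because the thread pairs lldb thread #13 with `setsostid dda56 d`. Rows with an OSID of 0 are skipped on the assumption that they have no live OS thread to map.

```python
import string

HEX_DIGITS = set(string.hexdigits)

# Two thread rows copied from the clrthreads output in this thread, plus the
# header and the OSID 0 row, to show what gets filtered out.
SAMPLE = """\
 DBG   ID     OSID ThreadOBJ           State GC Mode     GC Alloc Context
XXXX    1    dda48 00007FE51880A200    20020 Preemptive  000000018F1CBE40:000000018F1CBFD0
XXXX    5    dda56 00007FE51885A800    21220 Preemptive  000000018F0FA370:000000018F0FBFD0
   1    7        0 00007FE519046E00    31820 Preemptive  0000000000000000:0000000000000000
"""


def parse_osids(clrthreads_output):
    """Return the non-zero OSID values (as hex strings) from clrthreads text."""
    osids = []
    for line in clrthreads_output.splitlines():
        fields = line.split()
        # Thread rows begin with DBG, ID, OSID, ThreadOBJ; ID is a decimal number.
        if len(fields) < 4 or not fields[1].isdigit():
            continue
        osid = fields[2]
        if set(osid) <= HEX_DIGITS and int(osid, 16) != 0:
            osids.append(osid)
    return osids


def candidate_commands(osids, lldb_thread_index):
    """Build one `setsostid <OSID> <index>` guess per OSID for one lldb thread."""
    # The index is formatted in hex without a prefix, as in `setsostid dda56 d`.
    return [f"setsostid {osid} {lldb_thread_index:x}" for osid in osids]


if __name__ == "__main__":
    for command in candidate_commands(parse_osids(SAMPLE), 13):
        print(command)
```

Each printed line is one guess to try by hand in lldb for that thread; the sketch does not tell you which guess is right.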

mikem8361 commented 3 years ago

Not sure what we can do here for regular MachO dumps. We now have ELF dumps generated on OSX that can be opened by dotnet-dump, which allows SOS commands to be run and the managed state to be inspected.