benesch / cgosymbolizer

cgosymbolizer teaches the Go runtime to include cgo frames in backtraces
Other
28 stars 2 forks source link

Package deadlock the whole Go process, 90% of the time, under recent macOS #3

Open folays opened 1 year ago

folays commented 1 year ago

FYI, This package nearly doesn't work since some recent versions of MacOS (at least Intel ones).

The symptom is : 90% of the runs, the progression of the Go programs stops, yet use 100% CPU. Like a deadlock, or livelock more so.

It probably depends on the "activity" of the Go program, but as an example, in my case, on a 20 seconds run, I have ~90 % of failure. On ~10% of my runs, it gets to completion, and when it does complete, I get the deep C symbols in my profiles.

See https://github.com/golang/go/issues/45558#issuecomment-820764029

[...] deadlock is happening because the program is trying to fetch a stack trace while calling the C function calloc. That can happen if a signal is received during the call to calloc. That will cause a deadlock if the cgosymbolizer code itsef calls calloc or malloc. The github.com/benesch/cgosymbolizer code uses macOS libunwind, and at least some versions of macOS libunwind call malloc. So it's not surprising that a deadlock can occur.

Also https://github.com/golang/go/issues/45558#issuecomment-822017019

On macOS, unwind.h is an Apple/LLVM-provided implementation that was introduced with Big Sur according to this comment in unwind.h [...]

The above issue also suggests :

You might be able to fix the problem by changing the file cgosymbolizer_darwin.c, the function cgo_traceback, to simply return if arg->sig_ctx != 0.

Which indeed fix the symptoms, but the traces (at least in my case) won't have C symbols, as if I didn't use this package.

I don't have any solution, I just figured out it should be mentioned somewhere.

This package is still useful, but sadly "luck" is involved, have to run it multiple times and avoid the "hang" symptom...

FWIW I'm on MacOS Ventura 13.4.1 (Intel CPU).

benesch commented 1 year ago

Thanks for the heads up, Eric. I’m afraid I won’t be able to help. I no longer work much with Go. If you figure out a fix, please feel free to fork. I’d be happy to mark this package as deprecated in favor of a working alternative.

On Wed, Sep 13, 2023 at 9:25 PM Eric Gouyer @.***> wrote:

FYI, This package nearly doesn't work since some recent versions of MacOS (at least Intel ones).

The symptom is : 90% of the runs, the progression of the Go programs stops, yet use 100% CPU. Like a deadlock, or livelock more so.

It probably depends on the "activity" of the Go program, but as an example, in my case, on a 20 seconds run, I have ~90 % of failure. On ~10% of my runs, it gets to completion, and when it does complete, I get the deep C symbols in my profiles.

See golang/go#45558 (comment) https://github.com/golang/go/issues/45558#issuecomment-820764029

[...] deadlock is happening because the program is trying to fetch a stack trace while calling the C function calloc. That can happen if a signal is received during the call to calloc. That will cause a deadlock if the cgosymbolizer code itsef calls calloc or malloc. The github.com/benesch/cgosymbolizer code uses macOS libunwind, and at least some versions of macOS libunwind call malloc. So it's not surprising that a deadlock can occur.

Also golang/go#45558 (comment) https://github.com/golang/go/issues/45558#issuecomment-822017019

On macOS, unwind.h is an Apple/LLVM-provided implementation that was introduced with Big Sur according to this comment in unwind.h [...]

The above issue also suggests :

You might be able to fix the problem by changing the file cgosymbolizer_darwin.c, the function cgo_traceback, to simply return if arg->sig_ctx != 0.

Which indeed fix the symptoms, but the traces (at least in my case) won't have C symbols, as if I didn't use this package.

I don't have any solution, I just figured out it should be mentioned somewhere.

This package is still useful, but sadly "luck" is involved, have to run it multiple times and avoid the "hang" symptom...

FWIW I'm on MacOS Ventura 13.4.1 (Intel CPU).

— Reply to this email directly, view it on GitHub https://github.com/benesch/cgosymbolizer/issues/3, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAGXSIBCNUA4IBEEUOP2K23X2JMJHANCNFSM6AAAAAA4XKGL7Y . You are receiving this because you are subscribed to this thread.Message ID: @.***>

folays commented 1 year ago

I just come back to trying multiple hours to find a solution, but I failed.

Hope you don't mind me layout here what I've read / tried ; I'm ok to talk here "alone" and not expect an answer from you, since you're no longer using Go as much. It's for future readers and hope someone other will find a solution.

https://maskray.me/blog/2020-11-08-stack-unwinding

/!\ I've read somewhere that .eh_frame_hdr could prevent malloc() from libunwind, because .eh_frame_hdr enable optimisations (binary search tree) that would let libunwind not needing to allocate anymore.

Of course clang (a least MacOS's default one) does not support --eh-frame-hdr :(

I think I understood that it's the linker which do all the work of generating those debug sections. I'm not even sure.

https://clang.llvm.org/docs/UsersManual.html

-femit-dwarf-unwind= When to emit DWARF unwind (EH frame) info. This is a Mach-O-specific option

  • I've also tried -femit-dwarf-unwind=always

I'm not even sure IF llvm/libunwind is == to https://www.nongnu.org/libunwind/

https://reviews.llvm.org/D11897

All compiler flags I told above, I've tried to put them as much as possible in CGO_CFLAGS and the rest in CGO_LDFLAGS. With a go build -v -x which made me see that everything was recompiled each time.

https://faultlore.com/blah/compact-unwinding/

https://lists.nongnu.org/archive/html/libunwind-devel/2016-02/msg00027.html

The only known work around is to implement your own dl_iterate_phdr that doesn't call malloc by hooking into low level APIs that notify you every time a new shared object is loaded.

The attached patch adds the function 'unw_set_iterate_phdr_function' to the libunwind API, allowing a custom implementation of dl_iterate_phdr() to be hooked in. That implementation would need to maintain a local cache (or however else you wanted to manage it) to remove the need for taking the glibc loader lock.

  • plus 2 people which for this dl_iterate_phdr cached the results (using the patch)

https://www.mail-archive.com/libunwind-devel@nongnu.org/msg02305.html

We recently ran into an issue on where a signal handler was calling unw_step() and deadlocked because the thread it interrupted was inside of dl_iterate_phdr (holding its lock) and unw_step ended up calling dl_iterate_phdr too, causing a deadlock. We found a few threads mentioning this in the past: [...]

I confirm that "returning early" in cgo_traceback(), just after either unw_getcontext() and unw_init_local() and the first 3 unw_step() do not expose the bug. Some more unw_step() triggers it, so I guess that the diagnosis above is correct :

As said in the link above :

It's not resolved as far as I know. The only known work around is to implement your own dl_iterate_phdr that doesn't call malloc by hooking into low level APIs that notify you every time a new shared object is loaded. The main trade-off is upfront/fixed memory cost vs pay-for-what-you-use (current glibc approach). It's been a while, so I may not be remembering all the details right.

So I think that the libunwind choose to not pay an upfront cost, and will at run-time call dl_iterate_phdr only when needed.

Which makes me realise that maaaaybe at Go init() time, call a C function which would look over all dynamic librairies linked to the program, and call some unw_*() functions to force an upfront call to dl_iterate_phdr()... ?

Other random ideas / things to try

My computer : MacOS Ventura 13.4.1, Intel CPU, clang 14.0.3.

folays commented 1 year ago

(on macOS Intel, latest Ventura, Xcode/clang are the latest official from Apple)

Sorry, still laying out here some random attempts and discussion here.

I'm still showing at the bottom how to recompile and link with a locally git fetched uptream llvm's libunwind there, with sources then showing in the lldb debugger.

attempt to hook and divert reentrant calls to malloc() to a mmap(NULL)

I've worked on https://github.com/folays/cgosymbolizer_macos/tree/test which could be used on a local build with a :

git clone https://github.com/folays/cgosymbolizer_macos.git
cd cgosymbolizer_macos/
git checkout test

and a go work use /path/to/cgosymbolizer_macos/

The intention was to hook *alloc() calls at RUNTIME (WHICH WORK) without LD_PRELOAD equivalent. I've used a (roughly speaking) static array of spinlocks[65536] :

The intention was that if an reentrant *alloc() was ongoing, and if it was "rare" (only for a few unlucky calls to malloc() from deep dl_iterate_phdr() , to return some allocations from a mmap(NULL) ...

lldb (macOS equiv. to gdb) shown me some SEGFAULT in libunwind at address 0x0 when unwinding deep from LuaJIT calls. I guess there is also maybe with LuaJIT not correctly generating unwinding data for JITed code.

(I'm sure I get my calls to LuaJIT right tho)

attempt to recompile libunwind to have debug symbols of it, when using lldb

cd /tmp/ # please do it there, some path below reuses it
git clone https://github.com/llvm/llvm-project.git
cd llvm-project/

cmake -G Ninja -S runtimes -B build -DLLVM_ENABLE_RUNTIMES="libunwind" -DCMAKE_BUILD_TYPE="Debug" # Configure
ninja -C build unwind # Build
ninja -C build check-unwind # Test

Build the Go program with this libunwind instead of clang's apple's distributed one :

# probably some flags are not needed here. I'm beginning to be tired.
#  it mainly uses -rpath and -L and -l to try to force /tmp/llvm-project/build/lib/libunwind
# it compiles ./cmd/your-go-package-main/ to /tmp/program.bin
CGO_CFLAGS="-g -fno-omit-frame-pointer" CGO_LDFLAGS="-unwindlib=libunwind -Wl,-rpath,/tmp/llvm-project/build/lib/ -L/tmp/llvm-project/build/lib/ -lunwind" go build -gcflags=all='-N -l' -v -x -o /tmp/program.bin ./cmd/your-go-package-main/

# now run /tmp/program.bin ; Personally I run it like this :
GODEBUG=asyncpreemptoff=1,cgocheck=0 /tmp/program.bin

I get a little more output with this locally rebuilt libunwind. I've got on stdout (or stderr, didn't checked) some :

libunwind: malformed __unwind_info at 0x1019FFE68 bad second level page

  • The program gets execution farther than with default libunwind (it still hangs after some time, but works longer, and sometimes I get lucky and it completes before hanging)
  • I get the full C stack in the profiles output when it completes
  • I have roughly 25 of those messages

Also, when it hangs (which still happens regularly, but less often), lldb nows show me the source code :

# note : not always thread #1.
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x0)

# top of the stack
    frame #0: 0x000000010214f460 libunwind.1.0.dylib`libunwind::LocalAddressSpace::get32(this=0x000000010215c9b9, addr=0) at AddressSpace.hpp:175:5
   156  /// LocalAddressSpace is used as a template parameter to UnwindCursor when
   157  /// unwinding a thread in the same process.  The wrappers compile away,
   158  /// making local unwinds fast.
   159  class _LIBUNWIND_HIDDEN LocalAddressSpace {
   160  public:
   [...]
   173    uint32_t         get32(pint_t addr) { <<< in this function of the class
   174      uint32_t val;
-> 175      memcpy(&val, (void *)addr, sizeof(val));
   176      return val;
   177    }

# called from :
frame #1: 0x000000010214a908 libunwind.1.0.dylib`libunwind::CFI_Parser<libunwind::LocalAddressSpace>::decodeFDE(addressSpace=0x000000010215c9b9, fdeStart=0, fdeInfo=0x000000c000010d68, cieInfo=0x000000c000010d30, useCIEInfo=false) at DwarfParser.hpp:177:43
   168  /// Parse a FDE into a CIE_Info and an FDE_Info. If useCIEInfo is
   169  /// true, treat cieInfo as already-parsed CIE_Info (whose start offset
   170  /// must match the one specified by the FDE) rather than parsing the
   171  /// one indicated within the FDE.
   172  template <typename A>
   173  const char *CFI_Parser<A>::decodeFDE(A &addressSpace, pint_t fdeStart,
   174                                       FDE_Info *fdeInfo, CIE_Info *cieInfo,
   175                                       bool useCIEInfo) {
   176    pint_t p = fdeStart;
-> 177    pint_t cfiLength = (pint_t)addressSpace.get32(p);
   178    p += 4;
   179    if (cfiLength == 0xffffffff) {
   180      // 0xffffffff means length is really next 8 bytes
   181      cfiLength = (pint_t)addressSpace.get64(p);
   182      p += 8;
   183    }
   [...]

(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
  * frame #0: 0x000000010214f460 libunwind.1.0.dylib`libunwind::LocalAddressSpace::get32(this=0x000000010215c9b9, addr=0) at AddressSpace.hpp:175:5
    frame #1: 0x000000010214a908 libunwind.1.0.dylib`libunwind::CFI_Parser<libunwind::LocalAddressSpace>::decodeFDE(addressSpace=0x000000010215c9b9, fdeStart=0, fdeInfo=0x000000c000010d68, cieInfo=0x000000c000010d30, useCIEInfo=false) at DwarfParser.hpp:177:43
    frame #2: 0x000000010214cc32 libunwind.1.0.dylib`libunwind::DwarfInstructions<libunwind::LocalAddressSpace, libunwind::Registers_x86_64>::stepWithDwarf(addressSpace=0x000000010215c9b9, pc=140703308999942, fdeStart=0, registers=0x000000c0000114a8, isSignalFrame=0x000000c000011599, stage2=false) at DwarfInstructions.hpp:196:7
    frame #3: 0x000000010214cb0d libunwind.1.0.dylib`libunwind::UnwindCursor<libunwind::LocalAddressSpace, libunwind::Registers_x86_64>::stepWithDwarfFDE(this=0x000000c000011498, stage2=false) at UnwindCursor.hpp:1020:12
    frame #4: 0x000000010214ca1c libunwind.1.0.dylib`libunwind::UnwindCursor<libunwind::LocalAddressSpace, libunwind::Registers_x86_64>::stepWithCompactEncoding(this=0x000000c000011498, stage2=false) at UnwindCursor.hpp:1032:14
    frame #5: 0x000000010214bb46 libunwind.1.0.dylib`libunwind::UnwindCursor<libunwind::LocalAddressSpace, libunwind::Registers_x86_64>::step(this=0x000000c000011498, stage2=false) at UnwindCursor.hpp:2903:20
    frame #6: 0x000000010214a29c libunwind.1.0.dylib`::__unw_step(cursor=0x000000c000011498) at libunwind.cpp:182:14
    frame #7: 0x0000000100e78fda loop.bin`cgo_traceback + 394
    frame #8: 0x0000000100e79f57 loop.bin`x_cgo_callers + 71
    frame #9: 0x00007ff80ade75ed libsystem_platform.dylib`_sigtramp + 29
    frame #10: 0x00007ff80ac06d06 libsystem_malloc.dylib`_malloc_zone_malloc + 1
    frame #11: 0x0000000100f8861b loop.bin`sqlite3MemMalloc + 27
    frame #12: 0x0000000100ebefe6 loop.bin`mallocWithAlarm + 230
    frame #13: 0x0000000100e9884a loop.bin`sqlite3Malloc + 90
    frame #14: 0x0000000100e987d4 loop.bin`sqlite3_malloc + 68
    frame #15: 0x0000000100e9402a loop.bin`decNumberToSQLite3Blob + 26
    frame #16: 0x0000000100e95f17 loop.bin`decimalAdd + 263
    frame #17: 0x0000000100e90631 loop.bin`decimalAddFunc + 113
    frame #18: 0x0000000100eec958 loop.bin`sqlite3VdbeExec + 41368
    frame #19: 0x0000000100ea443c loop.bin`sqlite3Step + 588
    frame #20: 0x0000000100e9f387 loop.bin`sqlite3_step + 103
    frame #21: 0x0000000100e7b63d loop.bin`_lfsq_sqlite3_step + 29
    frame #22: 0x0000000100e7af15 loop.bin`l_lfsq_stmt_execute + 133
    frame #23: 0x00000001021ff3d6 libluajit-5.1.2.1.1694940244.dylib`___lldb_unnamed_symbol343 + 68
    frame #24: 0x000000010221926a libluajit-5.1.2.1.1694940244.dylib`lua_pcall + 154
    frame #25: 0x0000000100e7c697 loop.bin`ylua_pcall + 39
    frame #26: 0x0000000100e7cbde loop.bin`_cgo_7e8f3114fa7f_Cfunc_ylua_pcall + 62
    frame #27: 0x000000010007ed08 loop.bin`runtime.asmcgocall.abi0 + 104
    frame #28: 0x000000010000d77f loop.bin`runtime.cgocall + 127
    frame #29: 0x0000000100b3390b loop.bin`github.com/folays/luajit-go._Cfunc_ylua_pcall.abi0 + 75
    frame #30: 0x0000000100b3b8b8 loop.bin`github.com/folays/luajit-go.(*State).Y_lua_pcall + 88
    frame #31: 0x0000000100b3d679 loop.bin`github.com/folays/luajit-go.(*State)._run + 569
    frame #32: 0x0000000100b3d1f2 loop.bin`github.com/folays/luajit-go.(*State)._runFuncWithResults + 242
    frame #33: 0x0000000100b3d065 loop.bin`github.com/folays/luajit-go.(*State).RunFuncWithResults + 165
[... my Go program stack...]
    frame #41: 0x000000010029eeed loop.bin`github.com/spf13/cobra.(*Command).execute + 3341
    frame #42: 0x000000010029ff4f loop.bin`github.com/spf13/cobra.(*Command).ExecuteC + 2511
    frame #43: 0x000000010029f3f2 loop.bin`github.com/spf13/cobra.(*Command).Execute + 50
    frame #44: 0x0000000100e78dc6 loop.bin`main.main + 38
    frame #45: 0x000000010004a107 loop.bin`runtime.main + 615
    frame #46: 0x000000010007f061 loop.bin`runtime.goexit.abi0 + 1

So... two mixed problems there I think :

  1. lots of Go projects (which do not use LuaJIT) stopped to use https://github.com/benesch/cgosymbolizer/ due to some deadlock issues (or more so CPU hang 100%)
  2. I may have an additional program with LuaJIT which, even if I were to find a solution to (1), would still exhibit the program due to using LuaJIT with JIT'ed code in the stack

Maybe the rebuilt llvm which outputs some libunwind: malformed __unwind_info at 0x.... bad second level page could be of interest (skip continuing on the stack ?).

I also don't know why the CPU goes to 100% whereas the symptoms seems to be libunwind catching a SEGFAULT due to doing a memcpy(NULL, ...) + segfault. I guess it don't die for some reasons, whilst other threads are maybe in a retry loop.

Note that every time that I had the problem, lldb shown me a EXC_BAD_ACCESS (code=1, address=0x0).