DynamoRIO / dynamorio

Dynamic Instrumentation Tool Platform
Other
2.61k stars 554 forks source link

HANG in elfutils libdw in looking up line numbers with drsyms #6611

Closed derekbruening closed 7 months ago

derekbruening commented 7 months ago

Noticed this in Dr. Memory where with the update to use elfutils libdw, release build seem to hang when looking up line numbers:

(gdb) bt
#0  0x00007fffe3c15ad0 in ?? ()
#1  0x00007fffe3e3a088 in Dwarf_Abbrev_Hash_find (htab=htab@entry=0x7ffde464b8f0, hval=hval@entry=1)
    at /home/runner/work/drmemory/drmemory/dynamorio/third_party/elfutils/lib/dynamicsizehash_concurrent.c:475
#2  0x00007fffe3e3c35e in __libdw_getabbrev (dbg=0x7ffde3dd5050, cu=cu@entry=0x7ffde464b8a8, offset=0, lengthp=lengthp@entry=0x7ffde3c9c610, result=result@entry=0x0)
    at /home/runner/work/drmemory/drmemory/dynamorio/third_party/elfutils/libdw/dwarf_getabbrev.c:86
#3  0x00007fffe3e3bb54 in __libdw_findabbrev (cu=0x7ffde464b8a8, code=5) at /home/runner/work/drmemory/drmemory/dynamorio/third_party/elfutils/libdw/dwarf_tag.c:56
#4  0x00007fffe3e3a8aa in __libdw_dieabbrev (readp=<synthetic pointer>, die=0x7ffde3c9c830)
    at /home/runner/work/drmemory/drmemory/dynamorio/third_party/elfutils/libdw/libdwP.h:762
#5  __libdw_find_attr (die=die@entry=0x7ffde3c9c830, search_name=search_name@entry=16, codep=codep@entry=0x7ffde3c9c6d0, formp=formp@entry=0x7ffde3c9c6d4)
    at /home/runner/work/drmemory/drmemory/dynamorio/third_party/elfutils/libdw/dwarf_child.c:49
#6  0x00007fffe3e3a22f in dwarf_attr (die=die@entry=0x7ffde3c9c830, search_name=search_name@entry=16, result=result@entry=0x7ffde3c9c6d0)
    at /home/runner/work/drmemory/drmemory/dynamorio/third_party/elfutils/libdw/dwarf_attr.c:45
#7  0x00007fffe3e397f4 in dwarf_getsrclines (cudie=0x7ffde3c9c830, lines=0x7ffde3c9c758, nlines=0x7ffde3c9c750)
    at /home/runner/work/drmemory/drmemory/build_drmemory-debug-64/dynamorio/ext/drsyms/dwarf_getsrclines.c:1243
#8  0x00007fffe3df62fa in get_lines_from_cu (mod=0x7fffe41230d8, cu_die=0x7ffde3c9c830, lines_out=0x7ffde3c9c7a8)
    at /home/runner/work/drmemory/drmemory/dynamorio/ext/drsyms/drsyms_dw.c:236
#9  0x00007fffe3df63dd in search_addr2line_in_cu (mod=0x7fffe41230d8, pc=4464, cu_die=0x7ffde3c9c830, sym_info=0x7ffde3c9cc60)
    at /home/runner/work/drmemory/drmemory/dynamorio/ext/drsyms/drsyms_dw.c:262
#10 0x00007fffe3df61c4 in drsym_dwarf_search_addr2line (mod_in=0x7fffe41230d8, pc=4464, sym_info=0x7ffde3c9cc60)
    at /home/runner/work/drmemory/drmemory/dynamorio/ext/drsyms/drsyms_dw.c:199
#11 0x00007fffe3df1d6f in drsym_unix_lookup_address (mod_in=0x7fffe3fcdda0, modoffs=4464, out=0x7ffde3c9cc60, flags=1)
    at /home/runner/work/drmemory/drmemory/dynamorio/ext/drsyms/drsyms_unix_common.c:683
#12 0x00007fffe3df03aa in drsym_lookup_address_local (modpath=0x7fffe4017510 "dr/test/drmem-user", modoffs=4464, out=0x7ffde3c9cc60, 
    flags=1) at /home/runner/work/drmemory/drmemory/dynamorio/ext/drsyms/drsyms_unix_frontend.c:153
#13 0x00007fffe3df058d in drsym_lookup_address (modpath=0x7fffe4017510 "dr/test/drmem-user", modoffs=4464, out=0x7ffde3c9cc60, flags=1)
    at /home/runner/work/drmemory/drmemory/dynamorio/ext/drsyms/drsyms_unix_frontend.c:244
#14 0x00007fffe3cddb97 in lookup_func_and_line (frame=0x7fffe41130d0, name_info=0x7fffe40174b8, modoffs=4464)
    at /home/runner/work/drmemory/drmemory/common/callstack.c:548
#15 0x00007fffe3cf5ced in packed_frame_to_symbolized (pcs=0x7fffe4161440, frame=0x7fffe41130d0, idx=0) at /home/runner/work/drmemory/drmemory/common/callstack.c:2303
#16 0x00007fffe3cf6ae9 in packed_callstack_to_symbolized (pcs=0x7fffe4161440, scs=0x7ffde3c9ce60) at /home/runner/work/drmemory/drmemory/common/callstack.c:2347
#17 0x00007fffe3d2d53d in report_error (etp=0x7ffde3c9d1c0, mc=0x7ffde3c9ddb0, pcs=0x0) at /home/runner/work/drmemory/drmemory/drmemory/report.c:2797
#18 0x00007fffe3d38ae4 in report_unaddressable_access (loc=0x7ffde3c9db00, addr=0x7ffff3c00330 "", sz=4, access_type=2, container_start=0x7ffff3c00330 "", 
    container_end=0x7ffff3c00334 "", mc=0x7ffde3c9ddb0) at /home/runner/work/drmemory/drmemory/drmemory/report.c:3092
#19 0x00007fffe3c5089b in handle_mem_ref_internal (flags=9, loc=0x7ffde3c9db00, addr=0x7ffff3c00330 "", sz=4, mc=0x7ffde3c9ddb0, opnum=0, comb=0x7ffde3c9db40)
    at /home/runner/work/drmemory/drmemory/drmemory/slowpath.c:1741
#20 0x00007fffe3c515d5 in check_mem_opnd (opc=56, flags=9, loc=0x7ffde3c9db00, opnd=..., sz=4, mc=0x7ffde3c9ddb0, opnum=0, comb=0x7ffde3c9db40)
    at /home/runner/work/drmemory/drmemory/drmemory/slowpath.c:1815
#21 0x00007fffe3c46460 in slow_path_with_mc (drcontext=0x7ffde3c84080, pc=0x7ffff3bfb170 <incomplete sequence \307>, 
    decode_pc=0x7ffff3bfb170 <incomplete sequence \307>, mc=0x7ffde3c9ddb0) at /home/runner/work/drmemory/drmemory/drmemory/slowpath.c:878
#22 0x00007fffe3c47043 in slow_path (pc=0x7ffff3bfb170 <incomplete sequence \307>, decode_pc=0x7ffff3bfb170 <incomplete sequence \307>)
    at /home/runner/work/drmemory/drmemory/drmemory/slowpath.c:952

Let it run w/ breakpoints in drsyms_dw.c -- never returns out of libdw.

(gdb) bt
#0  resize_worker (htab=htab@entry=0x7ffde464b8f0) at /home/runner/work/drmemory/drmemory/dynamorio/third_party/elfutils/lib/dynamicsizehash_concurrent.c:306
#1  0x00007fffe3e3a080 in Dwarf_Abbrev_Hash_find (htab=htab@entry=0x7ffde464b8f0, hval=hval@entry=1)
    at /home/runner/work/drmemory/drmemory/dynamorio/third_party/elfutils/lib/dynamicsizehash_concurrent.c:476
#2  0x00007fffe3e3c35e in __libdw_getabbrev (dbg=0x7ffde3dd5050, cu=cu@entry=0x7ffde464b8a8, offset=0, lengthp=lengthp@entry=0x7ffde3c9c610, result=result@entry=0x0)
    at /home/runner/work/drmemory/drmemory/dynamorio/third_party/elfutils/libdw/dwarf_getabbrev.c:86

Seems to be stuck here inside Dwarf_Abbrev_Hash_find at third_party/elfutils/lib/dynamicsizehash_concurrent.c:476:

  while (pthread_rwlock_tryrdlock(&htab->resize_rwl) != 0)
    resize_worker(htab);

htab is maybe not initialized properly?

(gdb) p htab
$6 = (Dwarf_Abbrev_Hash *) 0x7ffde464b8f0
(gdb) p *htab
$2 = {
  size = 41,
  old_size = 7308831422233340207,
  filled = 0,
  table = 0x7ffde47b8050,
  old_table = 0x64756c636e692f72,
  resizing_state = 0,
  next_init_block = 0,
  num_initialized_blocks = 0,
  next_move_block = 0,
  num_moved_blocks = 0,
  resize_rwl = {
    __data = {
      __readers = 11,
      __writers = 0,
      __wrphase_futex = 0,
      __writers_futex = 0,
      __pad3 = 0,
      __pad4 = 0,
      __cur_writer = 0,
      __shared = 0,
      __rwelision = 0 '\000',
      __pad1 = "\000\000\000\000\000\000",
      __pad2 = 0,
      __flags = 0
    },
    __size = "\v", '\000' <repeats 54 times>,
    __align = 11
  }
}
derekbruening commented 7 months ago

Debug build of DR works fine.

DrMemory doesn't seem to be running release-build tests pre-commit? What about DR which does do release-build tests: did drcov tests all pass? What is different here?

derekbruening commented 7 months ago

I spent some time digging further into this. There is some weird pthread lock behavior.

__readers == 0xb has 2 set for this single-threaded app!

It is set to 0x8 and then 0xb here:

(gdb) p &cu_die->cu.abbrev_hash.resize_rwl.__data.__readers
$8 = (unsigned int *) 0x7ffde464b940
(gdb) p sizeof(cu_die->cu.abbrev_hash.resize_rwl.__data.__readers)
$9 = 4
(gdb) watch *(int*)0x7ffde464b940
Hardware watchpoint 6: *(int*)0x7ffde464b940
(gdb) x/2wx 0x7ffde464b940
0x7ffde464b940: 0x00000000      0x00000000
(gdb) c
Continuing.

Hardware watchpoint 6: *(int*)0x7ffde464b940

Old value = 0
New value = 8
0x00007ffff78d41c2 in ___pthread_rwlock_tryrdlock (rwlock=rwlock@entry=0x7ffde464b940) at ./nptl/pthread_rwlock_tryrdlock.c:88
88      ./nptl/pthread_rwlock_tryrdlock.c: No such file or directory.
(gdb) x/2wx 0x7ffde464b940
0x7ffde464b940: 0x00000008      0x00000000
(gdb) c
Continuing.

Hardware watchpoint 6: *(int*)0x7ffde464b940

Old value = 8
New value = 11
0x00007ffff78d43ba in __pthread_rwlock_wrunlock (rwlock=0x7ffde464b940) at ./nptl/pthread_rwlock_common.c:568
568     ./nptl/pthread_rwlock_common.c: No such file or directory.
(gdb) bt
#0  0x00007ffff78d43ba in __pthread_rwlock_wrunlock (rwlock=0x7ffde464b940) at ./nptl/pthread_rwlock_common.c:568
#1  ___pthread_rwlock_unlock (rwlock=rwlock@entry=0x7ffde464b940) at ./nptl/pthread_rwlock_unlock.c:40
#2  0x00007fffe3e3a15b in Dwarf_Abbrev_Hash_find (htab=htab@entry=0x7ffde464b8f0, hval=hval@entry=5)
    at /home/runner/work/drmemory/drmemory/dynamorio/third_party/elfutils/lib/dynamicsizehash_concurrent.c:486
#3  0x00007fffe3e3baa4 in __libdw_findabbrev (cu=0x7ffde464b8a8, code=5) at /home/runner/work/drmemory/drmemory/dynamorio/third_party/elfutils/libdw/dwarf_tag.c:48
#4  0x00007fffe3e3a8aa in __libdw_dieabbrev (readp=<synthetic pointer>, die=0x7ffde3c9c830)
    at /home/runner/work/drmemory/drmemory/dynamorio/third_party/elfutils/libdw/libdwP.h:762
#5  __libdw_find_attr (die=die@entry=0x7ffde3c9c830, search_name=search_name@entry=16, codep=codep@entry=0x7ffde3c9c6d0, formp=formp@entry=0x7ffde3c9c6d4)
(gdb) memquery $pc
7ffff786c000-7ffff79c1000 r-xp 00026000 fe:01 21234345                   /usr/lib/x86_64-linux-gnu/libc.so.6

That's that same Dwarf_Abbrev_Hash_find calling:

      pthread_rwlock_unlock(&htab->resize_rwl);

Which apparently will unlock either a write or read lock. But it seems weird that it did tryrdlock and then a write unlock??

derekbruening commented 7 months ago

A higher-level approach: it reproduces on DR on the client.drcallstack-test test on my local machine in release build only; works in debug there. That test passes on GA CI in release build, however, so it is limited to certain libraries or toolchains: and as later work shows it seems to depend on the glibc version.

I tracked it down by tweaking debug build to look like release. When I disabled DEBUG_MEMORY and HEAP_ACCOUNTING, debug build then reproduced the hang, even at -checklevel 0. I then narrowed it down to the memset to 0xab in heap_mmap_ex() which is what prevents the hang. Enabling just that memset in release build: works! A memset to 0 hangs; any non-0 value avoids the hang.

Narrowing further: it's the memset for the mmap in privload_tls_init(). Narrowing within that 2-page region: it's offset 0x19d0, which turns into %fs:0x2d0, which is the pthread tid field.

It seems that if the tid field is zero, the pthread lock checks for whether a lock is owned incorrectly think the lock is indeed owned since the owner field of the lock is zero and that matches the (invalid) tid of zero. This explains the weird lock behavior in the callstack above at the hang point.

derekbruening commented 7 months ago

I thought this may be related to #5437 and the libc/ld GLRO vars: but the variables initialized there are not in the TLS mmap and are unrelated. This is related to that issue in the sense that it's coming from the merging of pthreads into libc and having too tight of a coupling with ld.so with undocumented initialization between them, making if very hard to replace ld.so as DR is doing.

For a short term solution, my plan is to locate this field (%fs:0x2d0 for my glibc 2.37) programmatically by decoding an exported routine known to reference it up front such as pthread_mutex_consistent. Long-term solutions are discussed in #5437.