apache / brpc

brpc is an industrial-grade RPC framework written in C++, often used in high-performance systems such as search, storage, machine learning, advertising, and recommendation. "brpc" means "better RPC".
https://brpc.apache.org
Apache License 2.0

Fix pthread_mutex_trylock deadlock in jemalloc #2727

Closed chenBright closed 3 months ago

chenBright commented 3 months ago

What problem does this PR solve?

Issue Number: resolve #2726

Problem Summary:

#2692 did not use _dl_sym because the UT could not run with it and failed with: symbol lookup error: ./libbrpc.so: undefined symbol: pthread_mutex_trylock. Related issues: #2266 #1086.

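Cause of the error, in short: libpthread.so is loaded before libbrpc.dbg.so, so a _dl_sym lookup with RTLD_NEXT cannot find the pthread_mutex_trylock symbol in the shared libraries loaded afterwards.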

There are two ways to fix this:

  1. Load libbrpc.dbg.so before libpthread.so.
  2. Statically link the bRPC static library into the final executable.

Detailed analysis:

The man page describes RTLD_NEXT as follows:

Find the next occurrence of the desired symbol in the search order after the current object.

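In this scenario, that roughly means the pthread_mutex_* symbols are searched for only in the shared libraries that come after libbrpc.dbg.so in the load order. So libbrpc.dbg.so must be loaded before libpthread.so for the pthread_mutex_* family of symbols to be found.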
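To make the RTLD_NEXT behaviour concrete, here is a minimal sketch that compares a lookup over the whole search order (RTLD_DEFAULT) with a lookup restricted to objects after the caller (RTLD_NEXT). The helper names lookup_next, lookup_default, and report are hypothetical and are not part of bRPC; older glibc versions may also need -ldl at link time.

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // RTLD_NEXT and RTLD_DEFAULT are GNU extensions in <dlfcn.h>
#endif
#include <dlfcn.h>
#include <cassert>
#include <cstdio>

// Hypothetical helper: search only the objects that come AFTER the object
// containing this function in the search order.
void* lookup_next(const char* name) {
    assert(name != nullptr);
    return dlsym(RTLD_NEXT, name);
}

// Hypothetical helper: search the whole global search order from the start.
void* lookup_default(const char* name) {
    assert(name != nullptr);
    return dlsym(RTLD_DEFAULT, name);
}

// Prints whether each kind of lookup can see the symbol.
void report(const char* name) {
    std::printf("%-24s RTLD_DEFAULT=%s RTLD_NEXT=%s\n", name,
                lookup_default(name) ? "found" : "missing",
                lookup_next(name) ? "found" : "missing");
}
```

If these helpers were compiled into a shared library, report("pthread_mutex_trylock") would show whether the symbol is still visible from that library's position in the search order, which is the property the hook depends on.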

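The brpc_channel_unittest binary built from the master branch is used for debugging. For readability, the output below has been lightly processed (filtered and trimmed).

LD_DEBUG=libs shows the load order of the shared libraries: libpthread.so is loaded before libbrpc.dbg.so.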

LD_DEBUG=libs ./brpc_channel_unittest 2>&1 | grep 'needed by\|generating link map'

file=libgflags.so.2.2 [0];  needed by ./brpc_channel_unittest [0]
file=libgflags.so.2.2 [0];  generating link map
file=libprotobuf.so.17 [0];  needed by ./brpc_channel_unittest [0]
file=libprotobuf.so.17 [0];  generating link map
==================================================
file=libpthread.so.0 [0];  needed by ./brpc_channel_unittest [0]
file=libpthread.so.0 [0];  generating link map
==================================================
file=libssl.so.1.1 [0];  needed by ./brpc_channel_unittest [0]
file=libssl.so.1.1 [0];  generating link map
file=libcrypto.so.1.1 [0];  needed by ./brpc_channel_unittest [0]
file=libcrypto.so.1.1 [0];  generating link map
file=libdl.so.2 [0];  needed by ./brpc_channel_unittest [0]
file=libdl.so.2 [0];  generating link map
file=libz.so.1 [0];  needed by ./brpc_channel_unittest [0]
file=libz.so.1 [0];  generating link map
file=librt.so.1 [0];  needed by ./brpc_channel_unittest [0]
file=librt.so.1 [0];  generating link map
file=libleveldb.so.1d [0];  needed by ./brpc_channel_unittest [0]
file=libleveldb.so.1d [0];  generating link map
file=libtcmalloc_and_profiler.so.4 [0];  needed by ./brpc_channel_unittest [0]
file=libtcmalloc_and_profiler.so.4 [0];  generating link map
==================================================
file=libbrpc.dbg.so [0];  needed by ./brpc_channel_unittest [0]
file=libbrpc.dbg.so [0];  generating link map
==================================================
file=libstdc++.so.6 [0];  needed by ./brpc_channel_unittest [0]
file=libstdc++.so.6 [0];  generating link map
file=libm.so.6 [0];  needed by ./brpc_channel_unittest [0]
file=libm.so.6 [0];  generating link map
file=libgcc_s.so.1 [0];  needed by ./brpc_channel_unittest [0]
file=libgcc_s.so.1 [0];  generating link map
file=libc.so.6 [0];  needed by ./brpc_channel_unittest [0]
file=libc.so.6 [0];  generating link map
file=libsnappy.so.1 [0];  needed by /usr/lib/x86_64-linux-gnu/libleveldb.so.1d [0]
file=libsnappy.so.1 [0];  generating link map
file=libunwind.so.8 [0];  needed by /usr/lib/x86_64-linux-gnu/libtcmalloc_and_profiler.so.4 [0]
file=libunwind.so.8 [0];  generating link map
file=libprotoc.so.17 [0];  needed by ./libbrpc.dbg.so [0]
file=libprotoc.so.17 [0];  generating link map
file=liblzma.so.5 [0];  needed by /lib/x86_64-linux-gnu/libunwind.so.8 [0]
file=liblzma.so.5 [0];  generating link map
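The ordering read off the log above can also be checked mechanically. The following is a small sketch, assuming log lines of the form shown in this output; load_order and loaded_before are hypothetical helper names and are not part of bRPC or glibc.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Collects library names, in order of appearance, from LD_DEBUG=libs lines of
// the form "file=NAME [0];  generating link map".
std::vector<std::string> load_order(const std::vector<std::string>& lines) {
    const std::string prefix = "file=";
    const std::string marker = "generating link map";
    std::vector<std::string> order;
    for (const std::string& line : lines) {
        if (line.find(marker) == std::string::npos) continue;  // skip "needed by" lines
        std::size_t start = line.find(prefix);
        if (start == std::string::npos) continue;
        start += prefix.size();
        const std::size_t end = line.find(' ', start);  // the name ends at the first space
        order.push_back(line.substr(
            start, end == std::string::npos ? end : end - start));
    }
    return order;
}

// Returns true only if both names occur in `order` and `first` occurs earlier.
bool loaded_before(const std::vector<std::string>& order,
                   const std::string& first, const std::string& second) {
    assert(first != second);
    const auto a = std::find(order.begin(), order.end(), first);
    const auto b = std::find(order.begin(), order.end(), second);
    return a != order.end() && b != order.end() && a < b;
}
```

Fed the lines above, loaded_before(order, "libpthread.so.0", "libbrpc.dbg.so") is the condition under which the RTLD_NEXT lookup from libbrpc.dbg.so cannot reach libpthread.so.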

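It also turns out that dlsym hits the same error. Unlike _dl_sym, however, dlsym does not make the process exit; it reports the error through dlerror instead (the deadlock in #2726 is caused by the memory allocation on this path).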

calling init: ./libbrpc.dbg.so
./libbrpc.dbg.so: error: symbol lookup error: undefined symbol: pthread_mutex_trylock (fatal)

So at this point sys_pthread_mutex_trylock is NULL. The UT most likely did not crash because none of the UTs, nor the libraries they depend on, call pthread_mutex_trylock.
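As an illustration of why a NULL sys_pthread_mutex_trylock only matters once something calls it, here is a minimal sketch of the dlsym/dlerror protocol together with a guarded call. resolve_next and call_or_enosys are hypothetical names, and returning ENOSYS is an assumption made for this example rather than what bRPC does. Note that the dlerror path is the one the description above links to the memory allocation behind #2726.

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // RTLD_NEXT is a GNU extension in <dlfcn.h>
#endif
#include <dlfcn.h>
#include <pthread.h>
#include <cassert>
#include <cerrno>

typedef int (*MutexOp)(pthread_mutex_t*);

// Hypothetical helper: resolves `name` with RTLD_NEXT. On failure it returns
// NULL and, if `error` is non-NULL, stores the dlerror() message there.
MutexOp resolve_next(const char* name, const char** error) {
    assert(name != nullptr);
    dlerror();                        // discard any stale error message
    void* sym = dlsym(RTLD_NEXT, name);
    const char* message = dlerror();  // non-NULL if the lookup failed
    if (error != nullptr) *error = message;
    return reinterpret_cast<MutexOp>(sym);
}

// Hypothetical guard: calls through `op` when it was resolved, otherwise
// returns ENOSYS instead of jumping through a NULL function pointer.
int call_or_enosys(MutexOp op, pthread_mutex_t* mutex) {
    return op != nullptr ? op(mutex) : ENOSYS;
}
```

Without such a guard, the first call through an unresolved pointer would crash, which matches the observation that the UT survives only because nothing calls pthread_mutex_trylock.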

On the other hand, there are no errors for pthread_mutex_lock or pthread_mutex_unlock; in other words, their symbols can be found. So where do these two symbols come from?

Add one line of code to make the binding information for pthread_mutex_lock and pthread_mutex_unlock easier to pick out.

static void init_sys_mutex_lock() {
    if (_dl_sym) {
        // This lookup fails; its error marks the start of the bindings of interest in the LD_DEBUG output.
        sys_pthread_mutex_trylock = (MutexOp)dlsym(RTLD_NEXT, "pthread_mutex_trylock");

        sys_pthread_mutex_lock = (MutexOp)_dl_sym(RTLD_NEXT, "pthread_mutex_lock", (void*)init_sys_mutex_lock);
        sys_pthread_mutex_unlock = (MutexOp)_dl_sym(RTLD_NEXT, "pthread_mutex_unlock", (void*)init_sys_mutex_lock);

        // This lookup fails the same way and marks the end of that region.
        sys_pthread_mutex_trylock = (MutexOp)dlsym(RTLD_NEXT, "pthread_mutex_trylock");
    }
    ...
}

LD_DEBUG=bindings,libs shows that the pthread_mutex_lock and pthread_mutex_unlock symbols come from libc.so.6 (see the output between the two pthread_mutex_trylock errors).

LD_DEBUG=bindings,libs ./brpc_channel_unittest 

......
7240:   calling init: ./libbrpc.dbg.so
7240:   binding file ./libbrpc.dbg.so [0] to /usr/lib/x86_64-linux-gnu/libpthread.so.0 [0]: normal symbol `pthread_mutex_lock'
7240:   binding file ./libbrpc.dbg.so [0] to /usr/lib/x86_64-linux-gnu/libpthread.so.0 [0]: normal symbol `pthread_mutex_unlock'
==================================================
7240:   ./libbrpc.dbg.so: error: symbol lookup error: undefined symbol: pthread_mutex_trylock (fatal)
7240:   binding file ./libbrpc.dbg.so [0] to /usr/lib/x86_64-linux-gnu/libc.so.6 [0]: normal symbol `pthread_mutex_lock'
7240:   binding file ./libbrpc.dbg.so [0] to /usr/lib/x86_64-linux-gnu/libc.so.6 [0]: normal symbol `pthread_mutex_unlock'
7240:   ./libbrpc.dbg.so: error: symbol lookup error: undefined symbol: pthread_mutex_trylock (fatal)
==================================================

Searching libc.so.6 for pthread_mutex_* symbols confirms that it has no pthread_mutex_trylock symbol.

nm -D /usr/lib/x86_64-linux-gnu/libc.so.6 | grep pthread_mutex

0000000000094480 T pthread_mutex_destroy
00000000000944b0 T pthread_mutex_init
00000000000944e0 T pthread_mutex_lock
0000000000094510 T pthread_mutex_unlock

The pthread_mutex_* functions in libc.so are most likely stub functions; see [1] [2] [3].

stub function is a function which cannot be implemented on a particular machine or operating system. Stub functions always return an error, and set errno to ENOSYS (Function not implemented).

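In this scenario, even though pthread_mutex_lock and pthread_mutex_unlock resolve to the wrong functions and pthread_mutex_trylock is NULL, the process keeps running normally. Because libpthread.so is loaded first, every pthread_mutex_* symbol the process uses comes from libpthread.so; in other words, the hook in libbrpc.dbg.so has no effect.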

What is changed and the side effects?

Changed:

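  1. Use _dl_sym to load pthread_mutex_trylock, which avoids the deadlock in the malloc library. This requires at least one of the following:
    1. libbrpc.dbg.so is loaded before libpthread.so. (The UT uses this approach.)
    2. The final executable statically links the bRPC static library.
  2. For cases like #2266, where the link order cannot be changed or the static library cannot be used, the NO_PTHREAD_MUTEX_HOOK macro can be defined to disable the pthread_mutex_* hooks. With the hooks disabled, the only cost is that the contention profiler can no longer capture pthread_mutex contention, which is acceptable.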
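The effect of a compile-time switch like NO_PTHREAD_MUTEX_HOOK can be sketched as follows. This is an illustration only: profiled_mutex_lock and g_contended_locks are hypothetical names, and the real bRPC hook overrides the pthread_mutex_* symbols themselves rather than exposing a wrapper like this one.

```cpp
#include <pthread.h>
#include <atomic>
#include <cassert>
#include <cerrno>

// Hypothetical counter standing in for the samples a contention profiler
// would collect.
static std::atomic<long> g_contended_locks(0);

// Hypothetical wrapper, not bRPC's real hook. With the hook compiled in, it
// records a contention event whenever the mutex is already held; with
// -DNO_PTHREAD_MUTEX_HOOK it is a plain pthread_mutex_lock call.
int profiled_mutex_lock(pthread_mutex_t* mutex) {
    assert(mutex != nullptr);
#ifndef NO_PTHREAD_MUTEX_HOOK
    const int rc = pthread_mutex_trylock(mutex);
    if (rc != EBUSY) return rc;  // acquired without waiting, or a real error
    g_contended_locks.fetch_add(1, std::memory_order_relaxed);
#endif
    return pthread_mutex_lock(mutex);
}
```

Building with -DNO_PTHREAD_MUTEX_HOOK removes the profiling path entirely, which mirrors the trade-off described above: locking still works, but contention is no longer recorded.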

Side effects:


Check List:

chenBright commented 3 months ago

@wwbmmm please take a look when you have time.