Open lintanghui opened 3 years ago
Stack traces at the time of the deadlock:
Thread 26 (Thread 0x7f20e3744700 (LWP 7233)):
#0 0x00007f20f015728d in nanosleep () at ../sysdeps/unix/syscall-template.S:84
#1 0x00007f20f0180dc4 in usleep (useconds=<optimized out>) at ../sysdeps/posix/usleep.c:32
#2 0x0000558cca6d5378 in bthread::TaskGroup::push_rq(unsigned long) ()
#3 0x0000558cca6d37bc in bthread::TaskGroup::ready_to_run(unsigned long, bool) ()
#4 0x0000558cca6d5ac8 in int bthread::TaskGroup::start_background<false>(unsigned long*, bthread_attr_t const*, void* (*)(void*), void*) ()
#5 0x0000558cca6c97db in bthread_start_background ()
#6 0x0000558cca4d5f4c in braft::run_closure_in_bthread_nosig(google::protobuf::Closure*, bool) ()
#7 0x0000558cca4df204 in braft::ClosureQueue::clear() ()
#8 0x0000558cca4e57a3 in braft::BallotBox::clear_pending_tasks() ()
#9 0x0000558cca46bf01 in braft::NodeImpl::step_down(long, bool, butil::Status const&) ()
#10 0x0000558cca4657d0 in braft::NodeImpl::check_dead_nodes(braft::Configuration const&, long) ()
#11 0x0000558cca465a85 in braft::NodeImpl::handle_stepdown_timeout() ()
#12 0x0000558cca476b94 in braft::StepdownTimer::run() ()
#13 0x0000558cca482833 in braft::RepeatedTimerTask::on_timedout() ()
#14 0x0000558cca482adc in braft::RepeatedTimerTask::run_on_timedout_in_new_thread(void*) ()
#15 0x0000558cca6d2700 in bthread::TaskGroup::task_runner(long) ()
#16 0x0000558cca6f8861 in bthread_make_fcontext ()
#17 0x0000000000000000 in ?? ()

Thread 25 (Thread 0x7f20e3f45700 (LWP 7232)):
#0 __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1 0x00007f20f1378bb5 in __GI___pthread_mutex_lock (mutex=0x558ccea9bb58) at ../nptl/pthread_mutex_lock.c:80
#2 0x0000558cca6e3d01 in pthread_mutex_lock ()
#3 0x0000558cc9e93ed2 in butil::Mutex::lock() ()
#4 0x0000558cc9e95990 in std::lock_guard<butil::Mutex>::lock_guard(butil::Mutex&) ()
#5 0x0000558cca490a7e in braft::NodeImpl::leader_id() ()
#6 0x0000558cca48f72b in braft::Node::leader_id() ()
#7 0x0000558cc9de18df in node::Replica::leader[abi:cxx11]() const ()
#8 0x0000558cc9eb2c72 in node::ApplyClosure::Run() ()
#9 0x0000558cca4d5dcd in braft::run_closure(void*) ()
#10 0x0000558cca6d2700 in bthread::TaskGroup::task_runner(long) ()
#11 0x0000558cca6f8861 in bthread_make_fcontext ()
#12 0x0000000000000000 in ?? ()

Thread 24 (Thread 0x7f20e4746700 (LWP 7231)):
#0 __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1 0x00007f20f1378bb5 in __GI___pthread_mutex_lock (mutex=0x558ccea9bb58) at ../nptl/pthread_mutex_lock.c:80
#2 0x0000558cca6e3d01 in pthread_mutex_lock ()
#3 0x0000558cc9e93ed2 in butil::Mutex::lock() ()
#4 0x0000558cc9e95990 in std::lock_guard<butil::Mutex>::lock_guard(butil::Mutex&) ()
#5 0x0000558cca490a7e in braft::NodeImpl::leader_id() ()
#6 0x0000558cca48f72b in braft::Node::leader_id() ()
#7 0x0000558cc9de18df in node::Replica::leader[abi:cxx11]() const ()
#8 0x0000558cc9eb2c72 in node::ApplyClosure::Run() ()
#9 0x0000558cca4d5dcd in braft::run_closure(void*) ()
#10 0x0000558cca6d2700 in bthread::TaskGroup::task_runner(long) ()
#11 0x0000558cca6f8861 in bthread_make_fcontext ()
#12 0x0000000000000000 in ?? ()
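Inspecting the mutex 0x558ccea9bb58 in gdb shows that its owner is LWP 7233. Under high concurrency, if other threads (such as threads 24 and 25 above) are calling leader_id() at the same time, a deadlock can occur during step_down.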
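For readers who want to repeat the owner check: a minimal sketch, assuming Linux with glibc, where a locked `pthread_mutex_t` records the owning thread's kernel thread id (the LWP number gdb prints) in the internal field `__data.__owner`. That field is an implementation detail of glibc, not part of POSIX, and the helper names `mutex_owner_lwp` and `current_lwp` are made up for this illustration. In gdb the same field can usually be printed by casting the mutex address to `pthread_mutex_t*`, provided glibc type information is available.

```cpp
#include <cassert>
#include <pthread.h>
#include <sys/syscall.h>
#include <unistd.h>

// Hypothetical helper: returns the LWP recorded as the owner of the mutex,
// or 0 when it is unlocked. Relies on glibc's internal layout (assumption).
inline long mutex_owner_lwp(const pthread_mutex_t& m) {
    return m.__data.__owner;
}

// Hypothetical helper: kernel thread id (LWP) of the calling thread.
inline long current_lwp() {
    return syscall(SYS_gettid);
}
```

Comparing the value against the `LWP` numbers in the thread headers identifies which thread is holding the lock.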
The code where the problem occurs is as follows:
void NodeImpl::handle_stepdown_timeout() {
    BAIDU_SCOPED_LOCK(_mutex);
    // ...
}
void ClosureQueue::clear() {
    std::deque<Closure*> saved_queue;
    {
        BAIDU_SCOPED_LOCK(_mutex);
        saved_queue.swap(_queue);
        _first_index = 0;
    }
    bool run_bthread = false;
    for (std::deque<Closure*>::iterator it = saved_queue.begin();
            it != saved_queue.end(); ++it) {
        if (*it) {
            (*it)->status().set_error(EPERM, "leader stepped down");
            // If _rq is full here, the thread is switched out and then deadlocks.
            run_closure_in_bthread_nosig(*it, _usercode_in_pthread);
            run_bthread = true;
        }
    }
    // ...
}
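handle_stepdown_timeout first acquires _mutex. clear_pending_tasks then creates a bthread for each pending closure. If _rq is full at that point, push_rq sleeps and retries. While it sleeps, the thread is switched out with the mutex still held and is never switched back in, which results in a deadlock.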
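The mechanism can be reproduced with a toy model. This is only a sketch under assumptions, not bthread's actual scheduler: `BoundedQueue` and `push_while_holding_lock` are hypothetical names, the queue has capacity 1, and the retry loop is capped so the demo terminates instead of hanging. It reflects one reading of the stacks above, namely that the workers that could drain the queue are themselves blocked on the same mutex.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <deque>
#include <mutex>
#include <thread>

// Hypothetical stand-in for a bounded run queue such as _rq.
class BoundedQueue {
public:
    explicit BoundedQueue(std::size_t capacity) : capacity_(capacity) {}

    // Returns false instead of blocking when the queue is full.
    bool try_push(int task) {
        std::lock_guard<std::mutex> guard(mutex_);
        if (queue_.size() >= capacity_) return false;
        queue_.push_back(task);
        return true;
    }

    // Returns false when there is nothing to pop.
    bool try_pop(int* task) {
        std::lock_guard<std::mutex> guard(mutex_);
        if (queue_.empty()) return false;
        *task = queue_.front();
        queue_.pop_front();
        return true;
    }

private:
    std::mutex mutex_;
    std::deque<int> queue_;
    const std::size_t capacity_;
};

// Tries to push one task into a full queue while holding node_mutex.
// The only worker that could drain the queue needs node_mutex first,
// so every retry fails. Retries are capped so the demo terminates.
bool push_while_holding_lock(int max_retries) {
    std::mutex node_mutex;  // plays the role of NodeImpl::_mutex
    BoundedQueue run_queue(1);
    const bool filled = run_queue.try_push(0);  // the queue is now full
    assert(filled);
    (void)filled;

    std::unique_lock<std::mutex> held(node_mutex);  // like handle_stepdown_timeout
    std::thread worker([&] {
        // Like ApplyClosure::Run -> leader_id(): blocks until node_mutex is free.
        std::lock_guard<std::mutex> guard(node_mutex);
        int task = 0;
        run_queue.try_pop(&task);
    });

    bool pushed = false;
    for (int i = 0; i < max_retries && !pushed; ++i) {
        pushed = run_queue.try_push(1);
        if (!pushed) {
            std::this_thread::sleep_for(std::chrono::milliseconds(1));
        }
    }

    held.unlock();  // only now can the worker make progress
    worker.join();
    return pushed;
}
```

However large `max_retries` is, the push never succeeds while the lock is held. With an unbounded retry loop, as in the trace of thread 26, the program would never get past this point.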
Summary: creating a bthread while holding a mutex can, under high load, cause the thread to be switched out and therefore deadlock.
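One common way to avoid this class of problem is to detach the pending work while holding the lock and dispatch it only after the lock has been released. The sketch below illustrates that pattern with standard C++ only. `ToyNode`, `add_pending` and its `step_down` are hypothetical and are not braft's API, and this is not necessarily how braft addresses the issue.

```cpp
#include <cassert>
#include <functional>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical toy node; not braft's NodeImpl.
class ToyNode {
public:
    explicit ToyNode(std::string leader) : leader_(std::move(leader)) {}

    // Takes the same mutex that step_down() uses.
    std::string leader_id() const {
        std::lock_guard<std::mutex> guard(mutex_);
        return leader_;
    }

    // Registers a callback to run when the node steps down.
    void add_pending(std::function<void()> done) {
        std::lock_guard<std::mutex> guard(mutex_);
        pending_.push_back(std::move(done));
    }

    // Updates state and detaches the pending callbacks under the lock,
    // then runs them only after the lock has been released.
    void step_down() {
        std::vector<std::function<void()>> saved;
        {
            std::lock_guard<std::mutex> guard(mutex_);
            leader_.clear();
            saved.swap(pending_);
        }
        for (auto& done : saved) {
            if (done) done();  // may block or call leader_id() without deadlocking
        }
    }

private:
    mutable std::mutex mutex_;
    std::string leader_;
    std::vector<std::function<void()>> pending_;
};
```

Because no lock is held while the callbacks run, a callback that blocks, sleeps, or calls `leader_id()` cannot stall other threads waiting on the node's mutex.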