baidu / braft

An industrial-grade C++ implementation of RAFT consensus algorithm based on brpc, widely used inside Baidu to build highly-available distributed systems.

brpc + user compute thread pool hangs #299

Closed ChenChuang closed 3 years ago

ChenChuang commented 3 years ago

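Describe the bug: We have a compute engine that must be invoked from its own dedicated thread pool, so we adopted the following design:

  1. brpc sends and receives messages. Inside the RPC handler, the message is converted into a compute task and pushed onto a global queue; the handler then waits for the task to finish using bthread::Mutex + bthread::ConditionVariable (the Wait method in the code below).
  2. The compute thread pool (N pthreads) keeps taking tasks off the global queue, runs the computation, and then calls the Done method to notify the RPC handler bthread that is waiting.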
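#include <mutex>                         // std::unique_lock
#include <bthread/condition_variable.h>  // bthread::ConditionVariable
#include <bthread/mutex.h>               // bthread::Mutex

class Task {
 public:
  // Called by a compute pthread once the computation has finished.
  void Done() {
    {
      std::unique_lock<bthread::Mutex> lock(mutex_);
      done_ = true;
    }
    cond_.notify_one();
  }

  // Called by the RPC handler bthread; blocks until Done() has run.
  void Wait() {
    std::unique_lock<bthread::Mutex> lock(mutex_);
    while (!done_) {
      cond_.wait(lock);
    }
  }

 private:
  bthread::Mutex mutex_;
  bthread::ConditionVariable cond_;
  bool done_ = false;
};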
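For anyone who wants to experiment with this handoff without linking brpc, here is a minimal sketch of the same design using only standard-library primitives. It is a stand-in, not our production code: `std::mutex` and `std::condition_variable` replace the bthread types, and `ComputePool`, `HandleRequest`, the `input`/`result` fields, and the doubling "computation" are hypothetical names invented for illustration. Because the task lives on the caller's stack here, the sketch notifies while still holding the lock so that the waiter cannot destroy the task while `Done()` is still touching it.

```cpp
#include <condition_variable>
#include <deque>
#include <mutex>
#include <thread>
#include <vector>

// Stand-in for Task, built on std primitives instead of bthread ones.
struct Task {
  int input = 0;   // hypothetical request payload
  int result = 0;  // hypothetical computation result

  void Done() {
    std::lock_guard<std::mutex> lock(mutex_);
    done_ = true;
    cond_.notify_one();  // notify under the lock; see the note above
  }

  void Wait() {
    std::unique_lock<std::mutex> lock(mutex_);
    cond_.wait(lock, [this] { return done_; });
  }

 private:
  std::mutex mutex_;
  std::condition_variable cond_;
  bool done_ = false;
};

// Hypothetical global queue drained by N compute threads.
class ComputePool {
 public:
  explicit ComputePool(int n) {
    for (int i = 0; i < n; ++i) threads_.emplace_back([this] { Loop(); });
  }

  ~ComputePool() {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      stop_ = true;
    }
    cond_.notify_all();
    for (auto& t : threads_) t.join();
  }

  void Submit(Task* task) {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      queue_.push_back(task);
    }
    cond_.notify_one();
  }

 private:
  void Loop() {
    for (;;) {
      Task* task = nullptr;
      {
        std::unique_lock<std::mutex> lock(mutex_);
        cond_.wait(lock, [this] { return stop_ || !queue_.empty(); });
        if (queue_.empty()) return;  // stop requested and nothing left to do
        task = queue_.front();
        queue_.pop_front();
      }
      task->result = task->input * 2;  // placeholder for the real computation
      task->Done();                    // wake the waiting handler
    }
  }

  std::mutex mutex_;
  std::condition_variable cond_;
  std::deque<Task*> queue_;
  bool stop_ = false;
  std::vector<std::thread> threads_;  // declared last: started after the rest
};

// What the RPC handler does in this design: enqueue, then block until done.
int HandleRequest(ComputePool& pool, int input) {
  Task task;
  task.input = input;
  pool.Submit(&task);
  task.Wait();
  return task.result;
}
```

With a pool created as `ComputePool pool(4);`, calling `HandleRequest(pool, 21)` blocks the caller until one of the four compute threads has processed the task, which is exactly the blocking step our RPC handlers perform.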

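We have two questions:

  1. Is it reasonable for the same bthread::Mutex/ConditionVariable to be used by pthreads and bthreads at the same time?
  2. Under high load we see the service hang. Is that related to this usage pattern?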

While the service is hung, the following log lines scroll continuously:

[ERROR] [2021-06-10 11:00:25.103] [52858#57225] [task_group_inl.h:92(push_rq)] _rq is full, capacity=4096
[ERROR] [2021-06-10 11:00:26.082] [52858#57107] [task_group.cpp:673(ready_to_run_remote)] _remote_rq is full, capacity=2048
[ERROR] [2021-06-10 11:00:26.103] [52858#57195] [task_group_inl.h:92(push_rq)] _rq is full, capacity=4096
[ERROR] [2021-06-10 11:00:27.082] [52858#57152] [task_group.cpp:673(ready_to_run_remote)] _remote_rq is full, capacity=2048
[ERROR] [2021-06-10 11:00:27.103] [52858#57225] [task_group_inl.h:92(push_rq)] _rq is full, capacity=4096
[ERROR] [2021-06-10 11:00:28.082] [52858#57122] [task_group.cpp:673(ready_to_run_remote)] _remote_rq is full, capacity=2048
[ERROR] [2021-06-10 11:00:28.103] [52858#57225] [task_group_inl.h:92(push_rq)] _rq is full, capacity=4096

Typical stack [1] (an RPC handler thread stuck in the Wait method above):

Thread 20 (Thread 0x7f84ee7fc700 (LWP 57277)):
#0 0x00007f8c40dbe809 in syscall () from /lib64/libc.so.6
#1 0x0000000001268b23 in futex_wait_private (timeout=0x0, expected=0, addr1=0x7f84ee7f5a40) at ./src/bthread/sys_futex.h:42
#2 bthread::wait_pthread (pw=..., ptimeout=ptimeout@entry=0x0) at src/bthread/butex.cpp:142
#3 0x0000000001269abc in butex_wait_from_pthread (abstime=0x0, expected_value=0, b=0x7f84dc801a40, g=) at src/bthread/butex.cpp:589
#4 bthread::butex_wait (arg=0x7f84dc801a40, expected_value=expected_value@entry=0, abstime=abstime@entry=0x0) at src/bthread/butex.cpp:622
#5 0x000000000118910e in bthread_cond_wait (c=0x7f84dc84d590, m=0x7f84dc84d578) at src/bthread/condition_variable.cpp:101
#6 0x0000000000c70310 in bthread::ConditionVariable::wait (this=0x7f84dc84d590, lock=...) at /brpc/include/bthread/condition_variable.h:60
#7 0x0000000000c7034b in common::Task::Wait (this=0x7f84dc84d578) at /src/common/pool/execute_queue.h:39
Python Exception <type 'exceptions.IndexError'> list index out of range:
#8 0x0000000000c6d38f in Searcher::Search (this=0x7f84ee7f5f80, group_candidates=std::map with 0 elements) at /src/retrieve/searcher.cpp:229
#9 0x0000000000c5e6d5 in SearchLogic::Retrieve (this=0x7ffd15ff74f8, request=0x7f84dc84bcc0, response=0x7f84dc84cea0) at /src/retrieve/search_logic.cpp:127
#10 0x0000000000c848c4 in RetrieveServiceImpl::Retrieve (this=0x7ffd15ff74f0, controller=0x7f84dc84ba90, request=0x7f84dc84bcc0, response=0x7f84dc84cea0, done=0x7f84dc84cef0) at /src/retrieve/service_impl.cpp:16
#11 0x0000000000d5f47d in RetrieveService::CallMethod (this=0x7ffd15ff74f0, method=0x49f9570, controller=0x7f84dc84ba90, request=0x7f84dc84bcc0, response=0x7f84dc84cea0, done=0x7f84dc84cef0) at /src/proto/retrieve_api.pb.cc:245
#12 0x0000000001323755 in brpc::policy::ProcessRpcRequest (msg_base=) at src/brpc/policy/baidu_rpc_protocol.cpp:499
#13 0x00000000012cb8ba in brpc::ProcessInputMessage (void_arg=) at src/brpc/input_messenger.cpp:136
#14 0x000000000118fb5f in bthread::TaskGroup::task_runner (skip_remained=skip_remained@entry=1) at src/bthread/task_group.cpp:297
#15 0x000000000119001b in bthread::TaskGroup::run_main_task (this=this@entry=0x7f84dc0008c0) at src/bthread/task_group.cpp:158
#16 0x0000000001266536 in bthread::TaskControl::worker_thread (arg=0x49df570) at src/bthread/task_control.cpp:77
#17 0x00007f8c41cd2e25 in start_thread () from /lib64/libpthread.so.0
#18 0x00007f8c40dc435d in clone () from /lib64/libc.so.6

Typical stack [2] (a compute thread stuck in the Done method above):

Thread 196 (Thread 0x7f8bb080d700 (LWP 57094)):
#0 0x00007f8c40d8b1bd in nanosleep () from /lib64/libc.so.6
#1 0x00007f8c40dbbed4 in usleep () from /lib64/libc.so.6
#2 0x000000000118e046 in bthread::TaskGroup::ready_to_run_remote (this=0x7f85980008c0, tid=tid@entry=51539635585, nosignal=nosignal@entry=false) at src/bthread/task_group.cpp:675
#3 0x000000000126910a in bthread::butex_wake (arg=) at src/bthread/butex.cpp:287
#4 0x0000000001189071 in bthread_cond_signal (c=) at src/bthread/condition_variable.cpp:69
#5 0x0000000000bf85b8 in bthread::ConditionVariable::notify_one (this=0x7f85dc28f680) at /data/devops/workspace/yt-industry-ai/zeus/p-8ab35777b3814c8e843aa982bee6e16a/third_path/brpc/include/bthread/condition_variable.h:94
#6 0x0000000000bf86e6 in common::Task::Done (this=0x7f85dc28f668, task_ret=0) at /src/common/pool/execute_queue.h:33
#7 0x0000000000c75cb5 in common::ExecuteQueue::ThreadLoop (this=0x4bf4d90, idx=3) at /src/common/pool/execute_queue.h:229
#8 0x0000000000c72608 in common::ExecuteQueue::InitAndStartThreads()::{lambda()#1}::operator()() const (__closure=0x4c2c130) at /src/common/pool/execute_queue.h:142
#9 0x0000000000c7f8b2 in std::_Bind_simple<common::ExecuteQueue::InitAndStartThreads()::{lambda()#1} ()>::_M_invoke<>(std::_Index_tuple<>) (this=0x4c2c130) at /usr/include/c++/4.8.2/functional:1732
#10 0x0000000000c7f7bf in std::_Bind_simple<common::ExecuteQueue::InitAndStartThreads()::{lambda()#1} ()>::operator()() (this=0x4c2c130) at /usr/include/c++/4.8.2/functional:1720
#11 0x0000000000c7f61e in std::thread::_Impl<std::_Bind_simple<common::ExecuteQueue::InitAndStartThreads()::{lambda()#1} ()> >::_M_run() (this=0x4c2c118) at /usr/include/c++/4.8.2/thread:115
#12 0x00007f8c4165d220 in ?? () from /lib64/libstdc++.so.6
#13 0x00007f8c41cd2e25 in start_thread () from /lib64/libpthread.so.0
#14 0x00007f8c40dc435d in clone () from /lib64/libc.so.6
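Reading stack [2] together with the log, our understanding is that the waker (the compute thread calling Done) tries to put the woken bthread onto a bounded run queue and, when the queue is full, sleeps and tries again. The toy model below illustrates that reading only; it is not brpc's implementation. `BoundedQueue`, `PushWithRetry`, and `max_retries` are hypothetical names, and the retry cap exists solely so the sketch terminates, whereas the thread in the trace appears to keep retrying for as long as the queue stays full.

```cpp
#include <chrono>
#include <cstddef>
#include <deque>
#include <mutex>
#include <thread>

// Toy bounded queue standing in for a fixed-capacity run queue.
class BoundedQueue {
 public:
  explicit BoundedQueue(std::size_t capacity) : capacity_(capacity) {}

  // Returns false instead of blocking when the queue is full.
  bool TryPush(int id) {
    std::lock_guard<std::mutex> lock(mutex_);
    if (items_.size() >= capacity_) return false;
    items_.push_back(id);
    return true;
  }

  // Returns false when the queue is empty.
  bool TryPop(int* id) {
    std::lock_guard<std::mutex> lock(mutex_);
    if (items_.empty()) return false;
    *id = items_.front();
    items_.pop_front();
    return true;
  }

 private:
  const std::size_t capacity_;
  std::mutex mutex_;
  std::deque<int> items_;
};

// Producer side: retry with a short sleep while the queue is full.
// Returns the number of failed attempts before success, or -1 if the
// queue was still full after max_retries retries.
int PushWithRetry(BoundedQueue& queue, int id, int max_retries) {
  for (int retries = 0; retries <= max_retries; ++retries) {
    if (queue.TryPush(id)) return retries;
    std::this_thread::sleep_for(std::chrono::milliseconds(1));
  }
  return -1;
}
```

If nothing ever pops from the queue, `PushWithRetry` can never succeed, however long it sleeps. That matches what we observe: the handler threads in stack [1] are blocked in Wait, the compute threads in stack [2] are stuck retrying inside Done, and neither side makes progress.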

To Reproduce: The hang may occur once the service is under high load.

Expected behavior: Once the load drops, the service should recover on its own instead of staying hung.

Versions: OS: CentOS 7; Compiler: gcc 4.8.5; brpc: 0.9.6; protobuf: 3.6.1

Additional context/screenshots

AdiaLoveTrance commented 10 months ago

Was this ever resolved?