An industrial-grade C++ implementation of RAFT consensus algorithm based on brpc, widely used inside Baidu to build highly-available distributed systems.
0 0x00007f8c40dbe809 in syscall () from /lib64/libc.so.6
1 0x0000000001268b23 in futex_wait_private (timeout=0x0, expected=0, addr1=0x7f84ee7f5a40) at ./src/bthread/sys_futex.h:42
2 bthread::wait_pthread (pw=..., ptimeout=ptimeout@entry=0x0) at src/bthread/butex.cpp:142
3 0x0000000001269abc in butex_wait_from_pthread (abstime=0x0, expected_value=0, b=0x7f84dc801a40, g=) at src/bthread/butex.cpp:589
4 bthread::butex_wait (arg=0x7f84dc801a40, expected_value=expected_value@entry=0, abstime=abstime@entry=0x0) at src/bthread/butex.cpp:622
5 0x000000000118910e in bthread_cond_wait (c=0x7f84dc84d590, m=0x7f84dc84d578) at src/bthread/condition_variable.cpp:101
6 0x0000000000c70310 in bthread::ConditionVariable::wait (this=0x7f84dc84d590, lock=...) at /brpc/include/bthread/condition_variable.h:60
7 0x0000000000c7034b in common::Task::Wait (this=0x7f84dc84d578) at /src/common/pool/execute_queue.h:39
Python Exception <type 'exceptions.IndexError'> list index out of range:
8 0x0000000000c6d38f in Searcher::Search (this=0x7f84ee7f5f80, group_candidates=std::map with 0 elements) at /src/retrieve/searcher.cpp:229
9 0x0000000000c5e6d5 in SearchLogic::Retrieve (this=0x7ffd15ff74f8, request=0x7f84dc84bcc0, response=0x7f84dc84cea0) at /src/retrieve/search_logic.cpp:127
10 0x0000000000c848c4 in RetrieveServiceImpl::Retrieve (this=0x7ffd15ff74f0, controller=0x7f84dc84ba90, request=0x7f84dc84bcc0, response=0x7f84dc84cea0, done=0x7f84dc84cef0)
at /src/retrieve/service_impl.cpp:16
11 0x0000000000d5f47d in RetrieveService::CallMethod (this=0x7ffd15ff74f0, method=0x49f9570, controller=0x7f84dc84ba90, request=0x7f84dc84bcc0, response=0x7f84dc84cea0, done=0x7f84dc84cef0)
at /src/proto/retrieve_api.pb.cc:245
12 0x0000000001323755 in brpc::policy::ProcessRpcRequest (msg_base=) at src/brpc/policy/baidu_rpc_protocol.cpp:499
13 0x00000000012cb8ba in brpc::ProcessInputMessage (void_arg=) at src/brpc/input_messenger.cpp:136
14 0x000000000118fb5f in bthread::TaskGroup::task_runner (skip_remained=skip_remained@entry=1) at src/bthread/task_group.cpp:297
15 0x000000000119001b in bthread::TaskGroup::run_main_task (this=this@entry=0x7f84dc0008c0) at src/bthread/task_group.cpp:158
16 0x0000000001266536 in bthread::TaskControl::worker_thread (arg=0x49df570) at src/bthread/task_control.cpp:77
17 0x00007f8c41cd2e25 in start_thread () from /lib64/libpthread.so.0
18 0x00007f8c40dc435d in clone () from /lib64/libc.so.6
0 0x00007f8c40d8b1bd in nanosleep () from /lib64/libc.so.6
1 0x00007f8c40dbbed4 in usleep () from /lib64/libc.so.6
2 0x000000000118e046 in bthread::TaskGroup::ready_to_run_remote (this=0x7f85980008c0, tid=tid@entry=51539635585, nosignal=nosignal@entry=false) at src/bthread/task_group.cpp:675
3 0x000000000126910a in bthread::butex_wake (arg=) at src/bthread/butex.cpp:287
4 0x0000000001189071 in bthread_cond_signal (c=) at src/bthread/condition_variable.cpp:69
5 0x0000000000bf85b8 in bthread::ConditionVariable::notify_one (this=0x7f85dc28f680) at /data/devops/workspace/yt-industry-ai/zeus/p-8ab35777b3814c8e843aa982bee6e16a/third_path/brpc/include/bthread/condition_variable.h:94
6 0x0000000000bf86e6 in common::Task::Done (this=0x7f85dc28f668, task_ret=0) at /src/common/pool/execute_queue.h:33
7 0x0000000000c75cb5 in common::ExecuteQueue::ThreadLoop (this=0x4bf4d90, idx=3) at /src/common/pool/execute_queue.h:229
8 0x0000000000c72608 in common::ExecuteQueue::InitAndStartThreads()::{lambda()#1}::operator()() const (__closure=0x4c2c130)
at /src/common/pool/execute_queue.h:142
9 0x0000000000c7f8b2 in std::_Bind_simple<common::ExecuteQueue::InitAndStartThreads()::{lambda()#1} ()>::_M_invoke<>(std::_Index_tuple<>) (this=0x4c2c130) at /usr/include/c++/4.8.2/functional:1732
10 0x0000000000c7f7bf in std::_Bind_simple<common::ExecuteQueue::InitAndStartThreads()::{lambda()#1} ()>::operator()() (this=0x4c2c130) at /usr/include/c++/4.8.2/functional:1720
11 0x0000000000c7f61e in std::thread::_Impl<std::_Bind_simple<common::ExecuteQueue::InitAndStartThreads()::{lambda()#1} ()> >::_M_run() (this=0x4c2c118) at /usr/include/c++/4.8.2/thread:115
12 0x00007f8c4165d220 in ?? () from /lib64/libstdc++.so.6
13 0x00007f8c41cd2e25 in start_thread () from /lib64/libpthread.so.0
14 0x00007f8c40dc435d in clone () from /lib64/libc.so.6
Describe the bug (描述bug) 我们有一个计算引擎,需要在单独的线程池调用。因此,我们采用了如下的设计方案
我们想请教两个问题
卡死时持续滚动如下日志: [ERROR] [2021-06-10 11:00:25.103] [52858#57225] [task_group_inl.h:92(push_rq)] _rq is full, capacity=4096 [ERROR] [2021-06-10 11:00:26.082] [52858#57107] [task_group.cpp:673(ready_to_run_remote)] _remote_rq is full, capacity=2048 [ERROR] [2021-06-10 11:00:26.103] [52858#57195] [task_group_inl.h:92(push_rq)] _rq is full, capacity=4096 [ERROR] [2021-06-10 11:00:27.082] [52858#57152] [task_group.cpp:673(ready_to_run_remote)] _remote_rq is full, capacity=2048 [ERROR] [2021-06-10 11:00:27.103] [52858#57225] [task_group_inl.h:92(push_rq)] _rq is full, capacity=4096 [ERROR] [2021-06-10 11:00:28.082] [52858#57122] [task_group.cpp:673(ready_to_run_remote)] _remote_rq is full, capacity=2048 [ERROR] [2021-06-10 11:00:28.103] [52858#57225] [task_group_inl.h:92(push_rq)] _rq is full, capacity=4096
典型堆栈 [1](rpc处理线程卡在上面的 Wait 方法): Thread 20 (Thread 0x7f84ee7fc700 (LWP 57277)):
0 0x00007f8c40dbe809 in syscall () from /lib64/libc.so.6
1 0x0000000001268b23 in futex_wait_private (timeout=0x0, expected=0, addr1=0x7f84ee7f5a40) at ./src/bthread/sys_futex.h:42
2 bthread::wait_pthread (pw=..., ptimeout=ptimeout@entry=0x0) at src/bthread/butex.cpp:142
3 0x0000000001269abc in butex_wait_from_pthread (abstime=0x0, expected_value=0, b=0x7f84dc801a40, g=) at src/bthread/butex.cpp:589
4 bthread::butex_wait (arg=0x7f84dc801a40, expected_value=expected_value@entry=0, abstime=abstime@entry=0x0) at src/bthread/butex.cpp:622
5 0x000000000118910e in bthread_cond_wait (c=0x7f84dc84d590, m=0x7f84dc84d578) at src/bthread/condition_variable.cpp:101
6 0x0000000000c70310 in bthread::ConditionVariable::wait (this=0x7f84dc84d590, lock=...) at /brpc/include/bthread/condition_variable.h:60
7 0x0000000000c7034b in common::Task::Wait (this=0x7f84dc84d578) at /src/common/pool/execute_queue.h:39
Python Exception <type 'exceptions.IndexError'> list index out of range:
8 0x0000000000c6d38f in Searcher::Search (this=0x7f84ee7f5f80, group_candidates=std::map with 0 elements) at /src/retrieve/searcher.cpp:229
9 0x0000000000c5e6d5 in SearchLogic::Retrieve (this=0x7ffd15ff74f8, request=0x7f84dc84bcc0, response=0x7f84dc84cea0) at /src/retrieve/search_logic.cpp:127
10 0x0000000000c848c4 in RetrieveServiceImpl::Retrieve (this=0x7ffd15ff74f0, controller=0x7f84dc84ba90, request=0x7f84dc84bcc0, response=0x7f84dc84cea0, done=0x7f84dc84cef0)
11 0x0000000000d5f47d in RetrieveService::CallMethod (this=0x7ffd15ff74f0, method=0x49f9570, controller=0x7f84dc84ba90, request=0x7f84dc84bcc0, response=0x7f84dc84cea0, done=0x7f84dc84cef0)
12 0x0000000001323755 in brpc::policy::ProcessRpcRequest (msg_base=) at src/brpc/policy/baidu_rpc_protocol.cpp:499
13 0x00000000012cb8ba in brpc::ProcessInputMessage (void_arg=) at src/brpc/input_messenger.cpp:136
14 0x000000000118fb5f in bthread::TaskGroup::task_runner (skip_remained=skip_remained@entry=1) at src/bthread/task_group.cpp:297
15 0x000000000119001b in bthread::TaskGroup::run_main_task (this=this@entry=0x7f84dc0008c0) at src/bthread/task_group.cpp:158
16 0x0000000001266536 in bthread::TaskControl::worker_thread (arg=0x49df570) at src/bthread/task_control.cpp:77
17 0x00007f8c41cd2e25 in start_thread () from /lib64/libpthread.so.0
18 0x00007f8c40dc435d in clone () from /lib64/libc.so.6
典型堆栈 [2](计算线程卡在上面的 Done 方法): Thread 196 (Thread 0x7f8bb080d700 (LWP 57094)):
0 0x00007f8c40d8b1bd in nanosleep () from /lib64/libc.so.6
1 0x00007f8c40dbbed4 in usleep () from /lib64/libc.so.6
2 0x000000000118e046 in bthread::TaskGroup::ready_to_run_remote (this=0x7f85980008c0, tid=tid@entry=51539635585, nosignal=nosignal@entry=false) at src/bthread/task_group.cpp:675
3 0x000000000126910a in bthread::butex_wake (arg=) at src/bthread/butex.cpp:287
4 0x0000000001189071 in bthread_cond_signal (c=) at src/bthread/condition_variable.cpp:69
5 0x0000000000bf85b8 in bthread::ConditionVariable::notify_one (this=0x7f85dc28f680) at /data/devops/workspace/yt-industry-ai/zeus/p-8ab35777b3814c8e843aa982bee6e16a/third_path/brpc/include/bthread/condition_variable.h:94
6 0x0000000000bf86e6 in common::Task::Done (this=0x7f85dc28f668, task_ret=0) at /src/common/pool/execute_queue.h:33
7 0x0000000000c75cb5 in common::ExecuteQueue::ThreadLoop (this=0x4bf4d90, idx=3) at /src/common/pool/execute_queue.h:229
8 0x0000000000c72608 in common::ExecuteQueue::InitAndStartThreads()::{lambda()#1}::operator()() const (__closure=0x4c2c130)
9 0x0000000000c7f8b2 in std::_Bind_simple<common::ExecuteQueue::InitAndStartThreads()::{lambda()#1} ()>::_M_invoke<>(std::_Index_tuple<>) (this=0x4c2c130) at /usr/include/c++/4.8.2/functional:1732
10 0x0000000000c7f7bf in std::_Bind_simple<common::ExecuteQueue::InitAndStartThreads()::{lambda()#1} ()>::operator()() (this=0x4c2c130) at /usr/include/c++/4.8.2/functional:1720
11 0x0000000000c7f61e in std::thread::_Impl<std::_Bind_simple<common::ExecuteQueue::InitAndStartThreads()::{lambda()#1} ()> >::_M_run() (this=0x4c2c118) at /usr/include/c++/4.8.2/thread:115
12 0x00007f8c4165d220 in ?? () from /lib64/libstdc++.so.6
13 0x00007f8c41cd2e25 in start_thread () from /lib64/libpthread.so.0
14 0x00007f8c40dc435d in clone () from /lib64/libc.so.6
To Reproduce (复现方法) 高负载后可能出现
Expected behavior (期望行为) 负载降低后,服务可自动恢复正常,不要一直卡住
Versions (各种版本) OS: centos7 Compiler: gcc 4.8.5 brpc: 0.9.6 protobuf: 3.6.1
Additional context/screenshots (更多上下文/截图)