apache / brpc

brpc is an industrial-grade RPC framework written in C++, often used in high-performance systems such as search, storage, machine learning, advertisement, and recommendation. "brpc" means "better RPC".
https://brpc.apache.org
Apache License 2.0

Coredump in bthread::id_create_impl #1188

Closed acelyc111 closed 1 year ago

acelyc111 commented 4 years ago

Describe the bug: A Doris process that uses the brpc library crashed with the following coredump stack:

Core was generated by `/home/work/app/doris/c3prc-bigbi/be/package/be/lib/palo_be'.
Program terminated with signal 11, Segmentation fault.
#0  bthread::id_create_impl (id=id@entry=0x7f09b1140290, data=data@entry=0x7ac5688, on_error=on_error@entry=0x0,
    on_error2=on_error2@entry=0x1b9fea0 <brpc::Controller::HandleSocketFailed(bthread_id_t, void*, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)>)
    at /root/doris/doris-dev/thirdparty/src/incubator-brpc-0.9.5/src/bthread/id.cpp:333
333 /root/doris/doris-dev/thirdparty/src/incubator-brpc-0.9.5/src/bthread/id.cpp: No such file or directory.
Missing separate debuginfos, use: debuginfo-install glibc-2.17-157.el7_3.1.x86_64 libgcc-4.8.5-28.el7_5.1.x86_64 zlib-1.2.7-17.el7.x86_64
(gdb) bt
#0  bthread::id_create_impl (id=id@entry=0x7f09b1140290, data=data@entry=0x7ac5688, on_error=on_error@entry=0x0,
    on_error2=on_error2@entry=0x1b9fea0 <brpc::Controller::HandleSocketFailed(bthread_id_t, void*, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)>)
    at /root/doris/doris-dev/thirdparty/src/incubator-brpc-0.9.5/src/bthread/id.cpp:333
#1  0x0000000001d1387d in bthread_id_create2 (id=id@entry=0x7f09b1140290, data=data@entry=0x7ac5688,
    on_error=on_error@entry=0x1b9fea0 <brpc::Controller::HandleSocketFailed(bthread_id_t, void*, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)>)
    at /root/doris/doris-dev/thirdparty/src/incubator-brpc-0.9.5/src/bthread/id.cpp:693
#2  0x0000000001b9a86d in brpc::Controller::call_id (this=this@entry=0x7ac5688) at /root/doris/doris-dev/thirdparty/src/incubator-brpc-0.9.5/src/brpc/controller.cpp:1213
#3  0x0000000001b9634d in brpc::Channel::CallMethod (this=0x1a31af00, method=0x21555800, controller_base=0x7ac5688, request=0x1efb80260, response=0x7ac58d8, done=0x7ac5680) at /root/doris/doris-dev/thirdparty/src/incubator-brpc-0.9.5/src/brpc/channel.cpp:394
#4  0x00000000013659bf in palo::PInternalService_Stub::transmit_data (this=<optimized out>, controller=0x7ac5688, request=0x1efb80260, response=0x7ac58d8, done=0x7ac5680) at /builds/olap/doris/gensrc/build/gen_cpp/palo_internal_service.pb.cc:319
#5  0x00000000015fb4a1 in doris::DataStreamSender::Channel::send_batch (this=this@entry=0x1efb80160, batch=batch@entry=0x0, eos=eos@entry=true) at /builds/olap/doris/be/src/runtime/data_stream_sender.cpp:232
#6  0x00000000015fc03a in doris::DataStreamSender::Channel::close_internal (this=0x1efb80160) at /builds/olap/doris/be/src/runtime/data_stream_sender.cpp:289
#7  0x00000000015fc215 in close (state=0x905baa00, this=<optimized out>) at /builds/olap/doris/be/src/runtime/data_stream_sender.cpp:296
#8  doris::DataStreamSender::close (this=0xad029c0, state=0x905baa00, exec_status=...) at /builds/olap/doris/be/src/runtime/data_stream_sender.cpp:607
#9  0x00000000010208d3 in doris::PlanFragmentExecutor::open_internal (this=this@entry=0x2655c5930) at /builds/olap/doris/be/src/runtime/plan_fragment_executor.cpp:326
#10 0x0000000001020acc in doris::PlanFragmentExecutor::open (this=this@entry=0x2655c5930) at /builds/olap/doris/be/src/runtime/plan_fragment_executor.cpp:259
#11 0x0000000000fb1267 in doris::FragmentExecState::execute (this=0x2655c58c0) at /builds/olap/doris/be/src/runtime/fragment_mgr.cpp:211
#12 0x0000000000fb2d16 in doris::FragmentMgr::exec_actual(std::shared_ptr<doris::FragmentExecState>, std::function<void (doris::PlanFragmentExecutor*)>) (this=0x692fc00, exec_state=..., cb=...) at /builds/olap/doris/be/src/runtime/fragment_mgr.cpp:394
#13 0x0000000000fb96b8 in __invoke_impl<void, void (doris::FragmentMgr::*&)(std::shared_ptr<doris::FragmentExecState>, std::function<void(doris::PlanFragmentExecutor*)>), doris::FragmentMgr*&, std::shared_ptr<doris::FragmentExecState>&, std::function<void(doris::PlanFragmentExecutor*)>&> (__t=@0x20fbf210: 0x692fc00, __f=
    @0x20fbf1d0: (void (doris::FragmentMgr::*)(doris::FragmentMgr * const, std::shared_ptr<doris::FragmentExecState>, std::function<void(doris::PlanFragmentExecutor*)>)) 0xfb2cf0 <doris::FragmentMgr::exec_actual(std::shared_ptr<doris::FragmentExecState>, std::function<void (doris::PlanFragmentExecutor*)>)>) at /usr/include/c++/7.3.0/bits/invoke.h:73
#14 __invoke<void (doris::FragmentMgr::*&)(std::shared_ptr<doris::FragmentExecState>, std::function<void(doris::PlanFragmentExecutor*)>), doris::FragmentMgr*&, std::shared_ptr<doris::FragmentExecState>&, std::function<void(doris::PlanFragmentExecutor*)>&> (__fn=
    @0x20fbf1d0: (void (doris::FragmentMgr::*)(doris::FragmentMgr * const, std::shared_ptr<doris::FragmentExecState>, std::function<void(doris::PlanFragmentExecutor*)>)) 0xfb2cf0 <doris::FragmentMgr::exec_actual(std::shared_ptr<doris::FragmentExecState>, std::function<void (doris::PlanFragmentExecutor*)>)>) at /usr/include/c++/7.3.0/bits/invoke.h:95
#15 __call<void, 0, 1, 2> (__args=..., this=0x20fbf1d0) at /usr/include/c++/7.3.0/functional:632
#16 operator()<> (this=0x20fbf1d0) at /usr/include/c++/7.3.0/functional:718
#17 boost::detail::function::void_function_obj_invoker0<std::_Bind_result<void, void (doris::FragmentMgr::*(doris::FragmentMgr*, std::shared_ptr<doris::FragmentExecState>, std::function<void (doris::PlanFragmentExecutor*)>))(std::shared_ptr<doris::FragmentExecState>, std::function<void (doris::PlanFragmentExecutor*)>)>, void>::invoke(boost::detail::function::function_buffer&) (function_obj_ptr=...) at /var/local/thirdparty/installed/include/boost/function/function_template.hpp:159
#18 0x0000000000fb24d4 in operator() (this=0x3b1cd01c0) at /var/local/thirdparty/installed/include/boost/function/function_template.hpp:759
#19 doris::fragment_executor (param=0x3b1cd01c0) at /builds/olap/doris/be/src/runtime/fragment_mgr.cpp:419
#20 0x00007f0b08218dc5 in start_thread () from /lib64/libpthread.so.0
#21 0x00007f0b0852473d in clone () from /lib64/libc.so.6
(gdb) f 0
#0  bthread::id_create_impl (id=id@entry=0x7f09b1140290, data=data@entry=0x7ac5688, on_error=on_error@entry=0x0,
    on_error2=on_error2@entry=0x1b9fea0 <brpc::Controller::HandleSocketFailed(bthread_id_t, void*, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)>)
    at /root/doris/doris-dev/thirdparty/src/incubator-brpc-0.9.5/src/bthread/id.cpp:333
333 in /root/doris/doris-dev/thirdparty/src/incubator-brpc-0.9.5/src/bthread/id.cpp
(gdb) p butex
$1 = (uint32_t *) 0x0
(gdb)

There is also a second, similar stack:

Core was generated by `/home/work/app/doris/c3prc-whalecore/be/package/be/lib/palo_be'.
Program terminated with signal 6, Aborted.
#0  0x00007fafe031f1d7 in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install glibc-2.17-157.el7_3.1.x86_64 libgcc-4.8.5-28.el7_5.1.x86_64 zlib-1.2.7-17.el7.x86_64
(gdb) bt
#0  0x00007fafe031f1d7 in raise () from /lib64/libc.so.6
#1  0x00007fafe03208c8 in abort () from /lib64/libc.so.6
#2  0x000000000230f3b6 in google::DumpStackTraceAndExit () at src/utilities.cc:147
#3  0x00000000023066bd in google::LogMessage::Fail () at src/logging.cc:1599
#4  0x0000000002308544 in google::LogMessage::SendToLog (this=0x7faf5a0f28a0) at src/logging.cc:1553
#5  0x00000000023061e4 in google::LogMessage::Flush (this=0x7faf5a0f28a0) at src/logging.cc:1422
#6  0x0000000002308f79 in google::LogMessageFatal::~LogMessageFatal (this=<optimized out>, __in_chrg=<optimized out>) at src/logging.cc:2125
#7  0x000000000259b0a0 in bthread::id_create_impl (id=id@entry=0x7faf5a0f2900, data=data@entry=0x83f09408, on_error=on_error@entry=0x0,
    on_error2=on_error2@entry=0x2427bb0 <brpc::Controller::HandleSocketFailed(bthread_id_t, void*, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)>)
    at /var/local/incubator-doris/thirdparty/src/incubator-brpc-0.9.5/src/bthread/id.cpp:331
#8  0x000000000259b5cd in bthread_id_create2 (id=id@entry=0x7faf5a0f2900, data=data@entry=0x83f09408,
    on_error=on_error@entry=0x2427bb0 <brpc::Controller::HandleSocketFailed(bthread_id_t, void*, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)>)
    at /var/local/incubator-doris/thirdparty/src/incubator-brpc-0.9.5/src/bthread/id.cpp:693
#9  0x000000000242257d in brpc::Controller::call_id (this=this@entry=0x83f09408) at /var/local/incubator-doris/thirdparty/src/incubator-brpc-0.9.5/src/brpc/controller.cpp:1213
#10 0x000000000241e05d in brpc::Channel::CallMethod (this=0xcd14600, method=0x10bd2400, controller_base=0x83f09408, request=0x139d77180, response=0x83f09658, done=0x83f09400) at /var/local/incubator-doris/thirdparty/src/incubator-brpc-0.9.5/src/brpc/channel.cpp:394
#11 0x000000000134fbff in palo::PInternalService_Stub::transmit_data (this=<optimized out>, controller=0x83f09408, request=0x139d77180, response=0x83f09658, done=0x83f09400) at /builds/olap/doris/gensrc/build/gen_cpp/palo_internal_service.pb.cc:319
#12 0x00000000015d8a91 in doris::DataStreamSender::Channel::send_batch (this=this@entry=0x139d77080, batch=batch@entry=0x139d77138, eos=eos@entry=true) at /builds/olap/doris/be/src/runtime/data_stream_sender.cpp:232
#13 0x00000000015d8d64 in doris::DataStreamSender::Channel::send_current_batch (this=this@entry=0x139d77080, eos=eos@entry=true) at /builds/olap/doris/be/src/runtime/data_stream_sender.cpp:275
#14 0x00000000015d9661 in doris::DataStreamSender::Channel::close_internal (this=0x139d77080) at /builds/olap/doris/be/src/runtime/data_stream_sender.cpp:287
#15 0x00000000015d9805 in close (state=0x1712ed800, this=<optimized out>) at /builds/olap/doris/be/src/runtime/data_stream_sender.cpp:296
#16 doris::DataStreamSender::close (this=0x48cc6820, state=0x1712ed800, exec_status=...) at /builds/olap/doris/be/src/runtime/data_stream_sender.cpp:607
#17 0x0000000001054f13 in doris::PlanFragmentExecutor::open_internal (this=this@entry=0x863465f0) at /builds/olap/doris/be/src/runtime/plan_fragment_executor.cpp:351
#18 0x0000000001055114 in doris::PlanFragmentExecutor::open (this=this@entry=0x863465f0) at /builds/olap/doris/be/src/runtime/plan_fragment_executor.cpp:284
#19 0x0000000000fdc7d7 in doris::FragmentExecState::execute (this=0x86346580) at /builds/olap/doris/be/src/runtime/fragment_mgr.cpp:209
#20 0x0000000000fde5f6 in doris::FragmentMgr::exec_actual(std::shared_ptr<doris::FragmentExecState>, std::function<void (doris::PlanFragmentExecutor*)>) (this=0x6e9b180, exec_state=..., cb=...) at /builds/olap/doris/be/src/runtime/fragment_mgr.cpp:393
#21 0x0000000000fe4724 in operator() (a2=<error reading variable: access outside bounds of object referenced via synthetic pointer>, a1=..., p=<optimized out>, this=<optimized out>) at /var/local/thirdparty/installed/include/boost/bind/mem_fn_template.hpp:280
#22 operator()<boost::_mfi::mf2<void, doris::FragmentMgr, std::shared_ptr<doris::FragmentExecState>, std::function<void(doris::PlanFragmentExecutor*)> >, boost::_bi::list0> (a=<synthetic pointer>, f=..., this=<optimized out>)
    at /var/local/thirdparty/installed/include/boost/bind/bind.hpp:398
#23 operator() (this=<optimized out>) at /var/local/thirdparty/installed/include/boost/bind/bind.hpp:1294
#24 boost::detail::function::void_function_obj_invoker0<boost::_bi::bind_t<void, boost::_mfi::mf2<void, doris::FragmentMgr, std::shared_ptr<doris::FragmentExecState>, std::function<void (doris::PlanFragmentExecutor*)> >, boost::_bi::list3<boost::_bi::value<doris::FragmentMgr*>, boost::_bi::value<std::shared_ptr<doris::FragmentExecState> >, boost::_bi::value<std::function<void (doris::PlanFragmentExecutor*)> > > >, void>::invoke(boost::detail::function::function_buffer&) (function_obj_ptr=...)
    at /var/local/thirdparty/installed/include/boost/function/function_template.hpp:159
#25 0x0000000000edc7e8 in operator() (this=0x7faf5a0f2fc0) at /var/local/thirdparty/installed/include/boost/function/function_template.hpp:759
#26 doris::ThreadPool::work_thread (this=0x6e9b200, thread_id=<optimized out>) at /builds/olap/doris/be/src/util/thread_pool.hpp:120
#27 0x0000000001a20a1d in thread_proxy ()
#28 0x00007fafe00d5dc5 in start_thread () from /lib64/libpthread.so.0
#29 0x00007fafe03e173d in clone () from /lib64/libc.so.6
(gdb)

Related code: https://github.com/apache/incubator-brpc/blob/a6ccc96aeb92d178b38885dc7ca3c525e5699648/src/bthread/id.cpp#L321-L345

To Reproduce: There are no clear reproduction steps, but the crash occurs fairly frequently.

Expected behavior: The process runs normally.

Versions: OS: Compiler: brpc: 0.9.5 protobuf:

Additional context/screenshots

acelyc111 commented 4 years ago

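https://github.com/apache/incubator-brpc/issues/1179 What the two issues have in common is that both obtain a resource from a resource pool, and in both cases the resource they get back appears to be bad.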

jamesge commented 4 years ago

It would be best to rule out the TimerThread problem first (already updated on master). Memory problems in C++ can have wide-reaching effects.

lorinlee commented 3 years ago

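1325 My guess is that ResourcePool returned an id greater than 65535. The reasoning: ResourcePool is organized as 65536 groups, each with 65536 blocks of 256 elements, so its capacity is far larger than 2^32. bthread_id_t is computed as resource_id << 32 | version, so resource_id == 0 and resource_id == 65536 would produce the same bthread_id_t value. When these two ids go through return_resource, the element with resource_id == 0 is returned twice, and later get_resource calls will misbehave. #1179 may have the same cause, but in 1179 the TimerTasks are not only identical but also linked together, which seems unlikely to come from this; I have not figured that part out yet.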
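To make the arithmetic in this guess easier to check, here is a minimal sketch of the packing formula quoted above. It assumes a 64-bit id and a 64-bit resource id; `make_id_sketch` and `slot_of_sketch` are hypothetical names, not brpc's actual functions, and the real field widths in brpc should be confirmed against id.cpp.

```cpp
#include <cstdint>

// Hypothetical stand-in for the packing described in the comment above.
// ASSUMPTION: the id is 64 bits wide, the high 32 bits hold the resource id
// and the low 32 bits hold the version. This is not brpc's actual code.
inline uint64_t make_id_sketch(uint64_t resource_id, uint32_t version) {
    // Bits of resource_id at position 32 and above are shifted out and lost.
    return (resource_id << 32) | static_cast<uint64_t>(version);
}

// Recover the resource id (slot) from a packed id.
inline uint64_t slot_of_sketch(uint64_t id) {
    return id >> 32;
}

// Capacity figure quoted in the comment: 65536 groups * 65536 blocks * 256 items.
constexpr uint64_t kCapacitySketch = 65536ULL * 65536ULL * 256ULL;  // 2^40
static_assert(kCapacitySketch > (1ULL << 32), "capacity exceeds 2^32");
```

Under these assumptions, two resource ids pack to the same value only when they differ by a multiple of 2^32, for example `make_id_sketch(0, v)` and `make_id_sketch(1ULL << 32, v)`. Whether resource_id == 65536 collides with resource_id == 0 therefore depends on the actual widths used in brpc, which this sketch does not establish.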

The other problem in this issue is that butex is nullptr. I suspect the creation itself failed and nothing checked for that before; this PR adds a check: #1326

@jamesge @acelyc111 Could you please review this and tell me whether my guess is reasonable? Thanks.

wwbmmm commented 1 year ago

We close this issue because it is irreproducible and inactive for a long time. If you can reproduce this issue with the latest version of bRPC, please reopen this issue and tell us how to reproduce.

helloworld0xff commented 3 months ago

Has this been resolved? It still occurs fairly often on version 1.10.

helloworld0xff commented 3 months ago

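> 1325 My guess is that ResourcePool returned an id greater than 65535. The reasoning: ResourcePool is organized as 65536 groups, each with 65536 blocks of 256 elements, so its capacity is far larger than 2^32. bthread_id_t is computed as resource_id << 32 | version, so resource_id == 0 and resource_id == 65536 would produce the same bthread_id_t value. When these two ids go through return_resource, the element with resource_id == 0 is returned twice, and later get_resource calls will misbehave. #1179 may have the same cause, but in 1179 the TimerTasks are not only identical but also linked together, which seems unlikely to come from this; I have not figured that part out yet.
>
> The other problem in this issue is that butex is nullptr. I suspect the creation itself failed and nothing checked for that before; this PR adds a check: #1326
>
> @jamesge @acelyc111 Could you please review this and tell me whether my guess is reasonable? Thanks.

That does not match what I see. My process also crashes at this location, but the value of _ngroup is 1 and the id does not exceed int max.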