baidu / sofa-pbrpc

A light-weight RPC implement of google protobuf RPC framework.

RpcClientImpl hangs when Stop is called #99

Closed baimushan closed 8 years ago

baimushan commented 8 years ago

Stack state:

#0 0x00007f1834e7b22d in pthread_join () from /lib64/libpthread.so.0

#1 0x00000000006215ab in sofa::pbrpc::ThreadGroupImpl::stop (this=0x228f6b0) at src/sofa/pbrpc/thread_group_impl.h:182

#2 0x00000000006176c7 in sofa::pbrpc::RpcClientImpl::Stop (this=0x22be000) at src/sofa/pbrpc/rpc_client_impl.cc:109

The in-memory state of the io_service is as follows:

(gdb) p (boost::asio::detail::task_io_service * const) 0x22738e0

$26 = {boost::asio::detail::service_base = {boost::asio::io_service::service = {boost::noncopyable_::noncopyable = {}, _vptr.service = 0xa24a90 <vtable for boost::asio::detail::task_io_service+16>, key_ = {type_info_ = 0xa245a0 <typeinfo for boost::asio::detail::typeid_wrapper>, id_ = 0x0}, owner_ = @0x228f6d0, next_ = 0x0}, static id = {boost::asio::io_service::id = {boost::noncopyable_::noncopyable = {}, }, }}, one_thread_ = false, mutex_ = {boost::noncopyable_::noncopyable = {}, mutex_ = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 7, __kind = 0, __spins = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = '\000' <repeats 12 times>, "\a", '\000' <repeats 26 times>, __align = 0}}, task_ = 0x2266e10, task_operation_ = {boost::asio::detail::task_io_service_operation = {next_ = 0x0, func_ = 0x0, task_result_ = 0}, }, task_interrupted_ = false, outstanding_work_ = {value_ = 3}, op_queue_ = {boost::noncopyable_::noncopyable = {}, front_ = 0x0, back_ = 0x0}, stopped_ = false, shutdown_ = false, first_idle_thread_ = 0x7f182991fce0}

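My understanding is that once Stop has been called, the outstanding_work_ counter of task_io_service should be decremented to 0 and its run function should exit, which in turn lets pthread_join return successfully. Where might the problem be?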
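The expectation described above can be illustrated with a small model. This is only a sketch using the C++ standard library, not Boost.Asio or sofa-pbrpc code: `MiniLoop`, `run_exits_within` and `scenario` are hypothetical names invented for illustration. It assumes a loop whose `run()` returns only when the outstanding work count reaches zero or a stop flag is set, which mirrors the `outstanding_work_ = {value_ = 3}` and `stopped_ = false` values in the dump.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <future>
#include <mutex>
#include <thread>

// Hypothetical stand-in for an io_service-like loop; NOT the Boost.Asio API.
class MiniLoop {
public:
    void work_started() {
        std::lock_guard<std::mutex> guard(mutex_);
        ++outstanding_work_;
    }
    void work_finished() {
        std::lock_guard<std::mutex> guard(mutex_);
        if (--outstanding_work_ == 0) cv_.notify_all();
    }
    void stop() {
        std::lock_guard<std::mutex> guard(mutex_);
        stopped_ = true;
        cv_.notify_all();
    }
    // Blocks until no work is outstanding or stop() has been called.
    void run() {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [this] { return stopped_ || outstanding_work_ == 0; });
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    int outstanding_work_ = 0;
    bool stopped_ = false;
};

// Returns true if a thread executing loop.run() exits within `timeout`.
// If it does not, stop() is called so the worker can still be joined.
bool run_exits_within(MiniLoop& loop, std::chrono::milliseconds timeout) {
    std::promise<void> done;
    std::future<void> finished = done.get_future();
    std::thread worker([&] {
        loop.run();
        done.set_value();
    });
    bool exited = finished.wait_for(timeout) == std::future_status::ready;
    if (!exited) loop.stop();
    worker.join();
    return exited;
}

// Registers three work items and releases only `released` of them.
bool scenario(int released) {
    MiniLoop loop;
    for (int i = 0; i < 3; ++i) loop.work_started();
    for (int i = 0; i < released; ++i) loop.work_finished();
    return run_exits_within(loop, std::chrono::milliseconds(200));
}
```

In this model, `scenario(3)` lets `run()` return, while `scenario(0)` leaves the worker blocked until `stop()` is forced, so a plain join on that worker would wait indefinitely. Whether the real hang has the same cause is not established by the thread.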

baimushan commented 8 years ago

At the time of the stop, stream_map shows a value of 3. Could that be related?

baimushan commented 8 years ago

I also noticed something else:

#0 0x00007f1833fb2163 in epoll_wait () from /lib64/libc.so.6

#1 0x000000000061f888 in boost::asio::detail::epoll_reactor::run (this=0x2266e10, block=, ops=...) at /usr/local/include/boost/asio/detail/impl/epoll_reactor.ipp:392

#2 0x0000000000624671 in boost::asio::detail::task_io_service::do_run_one (ec=..., this_thread=..., lock=..., this=0x22738e0) at /usr/local/include/boost/asio/detail/impl/task_io_service.ipp:396

#3 boost::asio::detail::task_io_service::run (this=0x22738e0, ec=...) at /usr/local/include/boost/asio/detail/impl/task_io_service.ipp:153

#4 0x000000000062521e in boost::asio::io_service::run (this=0x228f6d0) at /usr/local/include/boost/asio/impl/io_service.ipp:59

#5 sofa::pbrpc::ThreadGroupImpl::thread_run (param=0x22bf100) at src/sofa/pbrpc/thread_group_impl.h:263

#6 0x00007f1834e7a9d1 in start_thread () from /lib64/libpthread.so.0

#7 0x00007f1833fb1b6d in clone () from /lib64/libc.so.6

The thread is hanging in epoll_wait. However, I printed the state inside epoll_reactor:

(gdb) p timer_fd_

$6 = 515

So timeout should not be passed as -1. It is very strange that it hangs at this point.

zd-double commented 8 years ago

@baimushan Could you describe how you are using it and in what scenario? I will try to reproduce it locally.

baimushan commented 8 years ago

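The server is started and stopped over and over. Before exiting we stop the RpcClient instance, and that is when we hit this case. Judging from how epoll_wait is used, it should not be able to hang, should it?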

zd-double commented 8 years ago

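I could not reproduce the hang in the scenario you described. Could you leave an email address or some other contact details so that we can communicate more easily?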

baimushan commented 8 years ago

My QQ: 406455861

zd-double commented 8 years ago

@baimushan We recently reproduced the problem you described in our environment. The fix has been merged into the master branch. Thanks!