drogonframework / drogon

Drogon: A C++14/17/20 based HTTP web application framework running on Linux/macOS/Unix/Windows
MIT License

Invalid callback function address? #2133

Open shong99 opened 1 month ago

shong99 commented 1 month ago

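In a containerized runtime environment we have hit a crash inside the trantor library several times. From the stack and memory in the core dump, it looks like the callback function address being called is invalid, but I cannot confirm anything more specific. The environment handles very frequent network requests with large volumes of data.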

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x000056107971fed0 in ?? ()
[Current thread is 1 (Thread 0x7f552c07d700 (LWP 35934))]
(gdb) bt
#0  0x000056107971fed0 in ?? ()
#1  0x00007f552e60bbcc in trantor::Channel::handleEventSafely() () from /usr/local/lib/CET/libtrantor.so.1
#2  0x00007f552e60bc7f in trantor::Channel::handleEvent() () from /usr/local/lib/CET/libtrantor.so.1
#3  0x00007f552e600080 in trantor::EventLoop::loop() () from /usr/local/lib/CET/libtrantor.so.1
#4  0x00007f552e602342 in trantor::EventLoopThread::loopFuncs() () from /usr/local/lib/CET/libtrantor.so.1
#5  0x00007f55350e4b2f in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x00007f5534cb0fa3 in start_thread (arg=<optimized out>) at pthread_create.c:486
#7  0x00007f5534dc2eff in __init_misc (argc=<optimized out>, argv=0x7f552c07d700, envp=0x561078c5c900) at init-misc.c:33
#8  0x0000000000000000 in ?? ()

Below is what I found by walking the stack pointer. From the disassembly, my guess is that offset 92 in handleEventSafely is roughly the call to readCallback_?

(gdb) x /34a $rsp
0x7f552c07cd18: 0x7f552e60bbcc <_ZN7trantor7Channel17handleEventSafelyEv+92>    0x56107bcf3748
0x7f552c07cd28: 0x7f552e60bc7f <_ZN7trantor7Channel11handleEventEv+111> 0x7f552c07ce10
0x7f552c07cd38: 0x34e84ca07af18718      0x56107af18720
0x7f552c07cd48: 0x7f552c07ce10  0x7f552c07ce20
0x7f552c07cd58: 0x7f552e600080 <_ZN7trantor9EventLoop4loopEv+144>       0x561078a5f8d0
0x7f552c07cd68: 0x561078954208  0x561078a5f7e0
0x7f552c07cd78: 0x7f552c07cdf0  0x7f552c07ce10
0x7f552c07cd88: 0x7f552e602342 <_ZN7trantor15EventLoopThread9loopFuncsEv+626>   0x0
0x7f552c07cd98: 0x100000000000000       0x7f552c07ce10
0x7f552c07cda8: 0x561078a5f880  0x7f552c07cdd0
0x7f552c07cdb8: 0x7f552c07cd9f  0x7f552e93d510 <_ZNSt13__future_base13_State_baseV29_M_do_setEPSt8functionIFSt10unique_ptrINS_12_Result_baseENS3_8_DeleterEEvEEPb>
0x7f552c07cdc8: 0x0     0x561078a5f808
0x7f552c07cdd8: 0x7f552c07cda0  0x7f552e602c90 <_ZNSt14_Function_base13_Base_managerINSt13__future_base13_State_baseV27_SetterIPN7trantor9EventLoopEOS6_EEE10_M_managerERSt9_Any_dataRKSA_St18_Manager_operation>
0x7f552c07cde8: 0x7f552e602d20 <_ZNSt17_Function_handlerIFSt10unique_ptrINSt13__future_base12_Result_baseENS2_8_DeleterEEvENS1_13_State_baseV27_SetterIPN7trantor9EventLoopEOSA_EEE9_M_invokeERKSt9_Any_data> 0x0
0x7f552c07cdf8: 0x7f552c07cda8  0x7f552c07cdb0
0x7f552c07ce08: 0x7f552c07cdb8  0x1
0x7f552c07ce18: 0x7f552c07d700  0x0

Below are the register and stack frame information.

(gdb) i r
rax            0x56107971fed0      94628756979408
rbx            0x56107afbe990      94628782795152
rcx            0x56107a93d9b0      94628775975344
rdx            0x561078c5c900      94628745693440
rsi            0x56107a93d9b0      94628775975344
rdi            0x56107bcf3750      94628796643152
rbp            0x56107bcf3740      0x56107bcf3740
rsp            0x7f552c07cd18      0x7f552c07cd18
r8             0x56107ad76a40      94628780403264
r9             0x56107ad76a20      94628780403232
r10            0x7                 7
r11            0x246               582
r12            0x7f552c07ce20      140003787656736
r13            0x7f552c07ce30      140003787656752
r14            0x1                 1
r15            0x561078a5f8b0      94628743608496
rip            0x56107971fed0      0x56107971fed0
eflags         0x10246             [ PF ZF IF RF ]
cs             0x33                51
ss             0x2b                43
ds             0x0                 0
es             0x0                 0
fs             0x0                 0
gs             0x0                 0
k0             0x0                 0
k1             0x0                 0
k2             0x0                 0
k3             0x0                 0
k4             0x0                 0
k5             0x0                 0
k6             0x0                 0
k7             0x0                 0
(gdb) i f
Stack level 0, frame at 0x7f552c07cd20:
 rip = 0x56107971fed0; saved rip = 0x7f552e60bbcc
 called by frame at 0x7f552c07cd30
 Arglist at 0x7f552c07cd10, args: 
 Locals at 0x7f552c07cd10, Previous frame's sp is 0x7f552c07cd20
 Saved registers:
  rip at 0x7f552c07cd18

Version info: drogon 1.7.5, trantor 1.5.5

an-tao commented 1 month ago

Try upgrading to the latest version first; this version is too old.

shong99 commented 1 month ago

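We have this version of drogon deployed in many production environments, so upgrading has to go through a process.

What I want to ask for now is whether this kind of invalid callback function address has been seen before under high load. It has already happened 3 times at one site.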

an-tao commented 1 month ago

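This problem has not been reported before. There may be a race condition, which high load makes more likely to trigger. Are you using drogon as a client or as a server in your environment?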

shong99 commented 1 month ago

As a server.

Can this still be analyzed? The callback code is giving me a bit of a headache...

an-tao commented 1 month ago

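Yes. You need a debug build; then look at the call stack in the core dump to see where it crashes, and then consider a fix. But this version is two years old, so even if you fix it, the patch probably cannot be applied to the new version. You can only report the cause of the error, and I will then review the new version to see whether it has the same problem...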

fantasy-peak commented 1 month ago

Can it be reproduced with a local stress test?

shong99 commented 1 month ago

I could not reproduce it locally; it has only occurred on site.

shong99 commented 3 weeks ago

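By modifying the source code I have now reproduced the problem. I changed two things: in the socket destructor I disabled the code that releases the socket, and I disabled the epoll_ctl call that unregisters the TCP channel. Simulating the failure this way reproduces the problem.

In addition, on site we swapped in a build of drogon's trantor library with extra logging. The logs show that the channel pointer should, in theory, already have been freed, yet the read callback is still being called. So epoll probably failed to remove this pointer, and the socket was probably also not released successfully and is still receiving messages.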

2024-08-28 18:03:03.672 - INFO - 139939166525184 - [drogon Info] connectDestroyed
2024-08-28 18:03:03.672 - INFO - 139939166525184 - [drogon Info] Channel: remove, chn ptr=0x5573D118F890 owner:0x5573D14C93D0 TcpConnectionImpl
2024-08-28 18:03:03.672 - INFO - 139939166525184 - [drogon Info] EventLoop: removeChannel
2024-08-28 18:03:03.672 - INFO - 139939166525184 - [drogon Info] EpollPoller::removeChannel
2024-08-28 18:03:03.672 - INFO - 139939166525184 - [drogon Info] ~TcpConnectionImpl: free ptr: 0x5573D14C93D0
2024-08-28 18:03:03.672 - INFO - 139939166525184 - [drogon Info] handleEventSafely: handle read:0x5573D118F8B0 chnP=0x5573D118F890 owner:0x5573D14C93D0 

@an-tao @fantasy-peak
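
The log pattern above (channel removed, connection freed, then a read event on the same channel pointer) matches what happens when an epoll registration outlives the object it points to. The following is a minimal, Linux-only sketch of one documented way that can occur: epoll registrations belong to the underlying open file description, so closing a descriptor without `EPOLL_CTL_DEL` does not drop the registration while a duplicate of it is still open. The function name `staleEventAfterClose`, the `marker` variable, and the use of a pipe with `dup()` are illustrative assumptions; nothing in this thread establishes that this is the mechanism seen on site.

```cpp
#include <sys/epoll.h>
#include <unistd.h>
#include <cassert>

// Hypothetical demo, not trantor code.
// Returns true if epoll still reports an event for a descriptor that was
// closed without EPOLL_CTL_DEL. keepDuplicate simulates "the socket was not
// really released": a dup() keeps the open file description alive.
bool staleEventAfterClose(bool keepDuplicate)
{
    int epfd = ::epoll_create1(0);
    if (epfd < 0)
        return false;
    int fds[2];
    if (::pipe(fds) != 0)
    {
        ::close(epfd);
        return false;
    }

    int marker = 0;  // stands in for a Channel object owned by a connection
    epoll_event ev{};
    ev.events = EPOLLIN;
    ev.data.ptr = &marker;
    bool ok = ::epoll_ctl(epfd, EPOLL_CTL_ADD, fds[0], &ev) == 0;

    // Make the read end readable before it is closed.
    char byte = 'x';
    ssize_t written = ::write(fds[1], &byte, 1);
    ok = ok && (written == 1);

    int dupFd = keepDuplicate ? ::dup(fds[0]) : -1;
    ::close(fds[0]);  // closed WITHOUT a preceding EPOLL_CTL_DEL

    epoll_event out{};
    int n = ::epoll_wait(epfd, &out, 1, 0);  // timeout 0: do not block
    bool stale = ok && n == 1 && out.data.ptr == &marker;

    if (dupFd >= 0)
        ::close(dupFd);
    ::close(fds[1]);
    ::close(epfd);
    return stale;
}
```

In a real server the pointer delivered by `epoll_wait` would be the already-destroyed channel, which is why a missed deregistration turns into a jump through a dangling callback.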

fantasy-peak commented 3 weeks ago

Does the latest trantor have this problem too?

nqf commented 3 weeks ago

I just took a look. Do you mean that ::epoll_ctl(epollfd_, operation, fd, &event) in the EpollPoller::update function failed? https://github.com/an-tao/trantor/blob/65f245539215a8c25e04cd475c13d16044209a66/trantor/net/inner/poller/EpollPoller.cc#L183 https://github.com/an-tao/trantor/blob/65f245539215a8c25e04cd475c13d16044209a66/trantor/net/inner/poller/EpollPoller.cc#L203

shong99 commented 3 weeks ago

Yes, that is the current guess.
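
One way to test this guess would be to record the result of the deregistration call. The sketch below is Linux-only and uses plain `epoll_ctl`; the helper names `addReadInterest` and `removeAndReport` are hypothetical and are not trantor functions. It only shows how the `errno` values distinguish the likely failure cases: `ENOENT` when the descriptor is not registered with that epoll instance, and `EBADF` when the descriptor number is no longer valid.

```cpp
#include <sys/epoll.h>
#include <unistd.h>
#include <cassert>
#include <cerrno>
#include <cstdio>
#include <cstring>

// Hypothetical helper: register fd for read events.
// Returns 0 on success, otherwise the errno value.
int addReadInterest(int epfd, int fd)
{
    epoll_event ev{};
    ev.events = EPOLLIN;
    ev.data.fd = fd;
    return ::epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) < 0 ? errno : 0;
}

// Hypothetical helper: unregister fd and log why it failed, if it did.
// Returns 0 on success, otherwise the errno value.
int removeAndReport(int epfd, int fd)
{
    epoll_event ev{};  // ignored for EPOLL_CTL_DEL; passed non-null for old kernels
    if (::epoll_ctl(epfd, EPOLL_CTL_DEL, fd, &ev) < 0)
    {
        const int err = errno;
        std::fprintf(stderr,
                     "epoll_ctl(DEL) failed, fd=%d errno=%d (%s)\n",
                     fd,
                     err,
                     std::strerror(err));
        return err;
    }
    return 0;
}
```

If the field logs showed such a failure right before `~TcpConnectionImpl`, that would support the guess; if the call succeeded, the stale read event would have to come from somewhere else, such as an event already collected in the current loop iteration.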

nqf commented 3 weeks ago
1.5.5:
void TcpServer::handleCloseInLoop(const TcpConnectionPtr &connectionPtr)
{
    size_t n = connSet_.erase(connectionPtr);
    (void)n;
    assert(n == 1);
    auto connLoop = connectionPtr->getLoop();
    if (connLoop == loop_)
    {
        static_cast<TcpConnectionImpl *>(connectionPtr.get())
            ->connectDestroyed();
    }
    else
    {
        connLoop->queueInLoop([connectionPtr]() {
            static_cast<TcpConnectionImpl *>(connectionPtr.get())
                ->connectDestroyed();
        });
    }
}
Latest:
void TcpServer::handleCloseInLoop(const TcpConnectionPtr &connectionPtr)
{
    size_t n = connSet_.erase(connectionPtr);
    (void)n;
    assert(n == 1);
    auto connLoop = connectionPtr->getLoop();

    // NOTE: always queue this operation in connLoop, because this connection
    // may be in loop_'s current active channels, waiting to be processed.
    // If `connectDestroyed()` is called here, we will be using an wild pointer
    // later.
    connLoop->queueInLoop(
        [connectionPtr]() { connectionPtr->connectDestroyed(); });
}

https://github.com/an-tao/trantor/pull/206 may have already fixed this in the latest version.
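
The difference between the two versions of `handleCloseInLoop` can be illustrated with a toy model. This is a sketch under stated assumptions, not trantor's implementation: `ToyLoop`, `ToyConnection`, and `staleReadsSeen` are made-up names, the loop is assumed to run queued functors only after it has dispatched the current batch of active events, and a `destroyed` flag stands in for real destruction so the stale access can be counted instead of crashing.

```cpp
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

// Toy stand-in for a connection; a flag replaces real destruction.
struct ToyConnection
{
    bool destroyed = false;
    int readsAfterDestroy = 0;
    void connectDestroyed() { destroyed = true; }
    void handleRead()
    {
        if (destroyed)
            ++readsAfterDestroy;  // would be a use-after-free in real code
    }
};

// Toy event loop: dispatch active events first, then queued functors.
class ToyLoop
{
  public:
    using Func = std::function<void()>;
    void queueInLoop(Func f) { pending_.push_back(std::move(f)); }
    void runOnce(const std::vector<Func> &activeEvents)
    {
        for (const auto &handler : activeEvents)
            handler();
        std::vector<Func> funcs;
        funcs.swap(pending_);
        for (const auto &f : funcs)
            f();
    }

  private:
    std::vector<Func> pending_;
};

// One iteration in which a close event and a read event for the same
// connection are both active. Returns how many reads hit a destroyed
// connection.
int staleReadsSeen(bool deferDestroy)
{
    ToyLoop loop;
    ToyConnection conn;
    std::vector<ToyLoop::Func> active;
    active.push_back([&loop, &conn, deferDestroy] {
        if (deferDestroy)
            loop.queueInLoop([&conn] { conn.connectDestroyed(); });
        else
            conn.connectDestroyed();  // old behavior when loops match
    });
    active.push_back([&conn] { conn.handleRead(); });
    loop.runOnce(active);
    return conn.readsAfterDestroy;
}
```

In this model `staleReadsSeen(false)` counts one read on a destroyed connection and `staleReadsSeen(true)` counts none, which is the ordering argument made in the NOTE comment of the newer code.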

fantasy-peak commented 2 weeks ago

@shong99 Have you upgraded to the latest version?