Drachtio server seg fault

elbow commented 4 years ago

Hi,

Maybe related to the leak, but getting a seg fault in drachtio:

Core was generated by `/usr/local/bin/drachtio --daemon -f /etc/drachtio/drachtio.conf.xml'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  std::__shared_ptr<drachtio::IIP, (__gnu_cxx::_Lock_policy)2>::__shared_ptr (this=<synthetic pointer>) at /usr/include/c++/5/bits/shared_ptr_base.h:923
923       __shared_ptr(const __shared_ptr&) noexcept = default;
[Current thread is 1 (Thread 0x7fc9c483e740 (LWP 30868))]
(gdb) bt
#0  std::__shared_ptr<drachtio::IIP, (__gnu_cxx::_Lock_policy)2>::__shared_ptr (this=<synthetic pointer>) at /usr/include/c++/5/bits/shared_ptr_base.h:923
#1  std::shared_ptr<drachtio::IIP>::shared_ptr (this=<synthetic pointer>) at /usr/include/c++/5/bits/shared_ptr.h:107
#2  drachtio::SipDialogController::clearIIP (this=this@entry=0x210cb00, leg=leg@entry=0x8f829950) at ../src/sip-dialog-controller.cpp:1691
#3  0x00000000004ba035 in drachtio::SipDialogController::endRetransmitFinalResponse (this=0x210cb00, irq=0x8cb0b360, tp=<optimized out>, dlg=std::shared_ptr (count 3, weak 1) 0x8904c720) at ../src/sip-dialog-controller.cpp:1830
#4  0x00000000004ce5d9 in std::_Mem_fn_base<void (drachtio::SipDialogController::*)(nta_incoming_s*, tport_s*, std::shared_ptr<drachtio::SipDialog>), true>::operator()<nta_incoming_s*&, tport_s*&, std::shared_ptr<drachtio::SipDialog>&, void> (__object=<optimized out>, this=<optimized out>) at /usr/include/c++/5/functional:600
#5  std::_Bind<std::_Mem_fn<void (drachtio::SipDialogController::*)(nta_incoming_s*, tport_s*, std::shared_ptr<drachtio::SipDialog>)> (drachtio::SipDialogController*, nta_incoming_s*, tport_s*, std::shared_ptr<drachtio::SipDialog>)>::__call<void, void*&&, 0ul, 1ul, 2ul, 3ul>(std::tuple<void*&&>&&, std::_Index_tuple<0ul, 1ul, 2ul, 3ul>) (
    __args=<optimized out>, this=<optimized out>) at /usr/include/c++/5/functional:1074
#6  std::_Bind<std::_Mem_fn<void (drachtio::SipDialogController::*)(nta_incoming_s*, tport_s*, std::shared_ptr<drachtio::SipDialog>)> (drachtio::SipDialogController*, nta_incoming_s*, tport_s*, std::shared_ptr<drachtio::SipDialog>)>::operator()<void*, void>(void*&&) (this=<optimized out>) at /usr/include/c++/5/functional:1133
#7  std::_Function_handler<void (void*), std::_Bind<std::_Mem_fn<void (drachtio::SipDialogController::*)(nta_incoming_s*, tport_s*, std::shared_ptr<drachtio::SipDialog>)> (drachtio::SipDialogController*, nta_incoming_s*, tport_s*, std::shared_ptr<drachtio::SipDialog>)> >::_M_invoke(std::_Any_data const&, void*&&) (__functor=..., 
    __args#0=<optimized out>) at /usr/include/c++/5/functional:1871
#8  0x00000000004f1333 in std::function<void (void*)>::operator()(void*) const (__args#0=0x0, this=0x93a7b028) at /usr/include/c++/5/functional:2267
#9  drachtio::TimerQueue::doTimer (this=0x210d048, timer=<optimized out>) at ../src/timer-queue.cpp:283
#10 0x000000000056faff in su_timer_expire ()
#11 0x00000000005b4125 in su_base_port_run ()
#12 0x0000000000564c30 in su_port_run ()
#13 0x0000000000565d7e in su_root_run ()
#14 0x0000000000434867 in drachtio::DrachtioController::run (this=0x20cc280) at ../src/controller.cpp:1160
#15 0x00000000004120f3 in main (argc=4, argv=0x7ffe0084c118) at ../src/main.cpp:47

davehorton commented 4 years ago

can you post the server log and describe the scenario? This is a native build with the latest from develop branch?

elbow commented 4 years ago

It is built yesterday from develop (but the docker image was also doing it). I have a core dump (2.4GB). You want the whole log - its a lot - or the part where it crashed and restarted?

davehorton commented 4 years ago

I think if you can zip and email me the whole log that would be best

davehorton commented 4 years ago

I do not think this is directly related to the memory leak issue so I would proceed with that testing

elbow commented 4 years ago

OK - in your email, thanks

davehorton commented 4 years ago

Thanks for the log. I see the issue. It does look like a bug in drachtio server

The specific call that caused the problem went like this:

received INVITE from A
sent INVITE on to B as B2bUA
100 Trying and then 180 Ringing from B
received CANCEL from A
sent 200 OK to Cancel and 487 back to A
received ACK from A
sent CANCEL to B
received 200 OK to INVITE from B
ONLY THEN received 200 OK to CANCEL from B
sent 200 OK to INVITE to A (this is the server bug)
resent 200 OK a bunch of times
crashed when retransmit timer expired since the dialog it expected to find was reclaimed way early when we got the CANCEL

So, definitely a bug in the server I will address. Strange that "B" sent a 200 OK to invite after we sent it a CANCEL.

Also strange the drachio-srf tried to send the 200 OK also, since it had received the cancel. This may be something I can look at fixing in drachtio-srf as well.

elbow commented 4 years ago

Hi Dave,

If you have a callid or similar I'll look up the details of the call flow. Looks like glare to me - OK to the INVITE and the CANCEL crossed on the wire.

Its obviously rare or we'd be seeing more crashes.

Steve

davehorton commented 4 years ago

Call-ID:

Call-ID: 63036270242d87c1714ef1ed6e0d1048@197.155.248.133:5060

I will send you the entire trace of the call.

It is not a glare condition....the 200 OK comes many seconds after we sent the CANCEL. And then the 200 OK to the CANCEL quite a few secs after that. Strange

elbow commented 4 years ago

Haven't forgotten about this - wew are getting this from time to time but we've been scrambling and I didn't have a chance to gather all the logs. Please bear with me.

davehorton commented 4 years ago

OK. I think I see what the problem is from the earlier log. When I have spare time I will try to look into a fix.

elbow commented 4 years ago

Thanks!

davehorton commented 4 years ago

steve, quick question on this. I have been trying to recreate using Srf#createB@BUA but cannot recreeate the crash. In your code, what are you using to handle the B2BUA messaging? Are you doing something different than a simple createB2BUA?

The key thing I need to recreate is incorrectly sending a 200 OK to A after we have previously sent them a 487 final response, and it seems like the createB2BUA actually guards against doing that..

elbow commented 4 years ago

Hi Dave,

You can by all means look at our code, I'll add you on the repository.

The code uses a simringer for "inbound" calls, so is it possible that is involved?

While we try to nail down the root issue is there a way to avoid the segfault. We are getting it quite frequently

root@ct-kam01:/home/steve# (cat /var/log/syslog.2 /var/log/syslog.1 /var/log/syslog) | grep 'in drachtio'
Mar 24 10:52:02 ct-kam01 kernel: [393134.847581] drachtio[26053]: segfault at 10 ip 00000000004b4c4f sp 00007ffcd6021fb0 error 4 in drachtio[400000+354000]
Mar 24 10:55:34 ct-kam01 kernel: [393346.864212] drachtio[5328]: segfault at 2e6 ip 000000000053fd8a sp 00007fffc2aaa970 error 4 in drachtio[400000+354000]
Mar 24 12:06:10 ct-kam01 kernel: [397582.997556] traps: drachtio[5574] general protection ip:53fd8a sp:7ffd7ef19e30 error:0 in drachtio[400000+354000]
Mar 24 12:17:33 ct-kam01 kernel: [398265.654733] drachtio[10512]: segfault at 10 ip 00000000004b4c4f sp 00007ffc701e3cc0 error 4 in drachtio[400000+354000]
Mar 24 12:26:26 ct-kam01 kernel: [398798.906688] drachtio[11209]: segfault at 10 ip 00000000004b4c4f sp 00007fff9040fce0 error 4 in drachtio[400000+354000]
Mar 24 12:45:14 ct-kam01 kernel: [399926.535706] drachtio[11799]: segfault at 10 ip 00000000004b4c4f sp 00007fffd00b2690 error 4 in drachtio[400000+354000]
Mar 24 12:46:42 ct-kam01 kernel: [400015.001200] drachtio[13243]: segfault at 10 ip 00000000004b4c4f sp 00007ffc25f4c7c0 error 4 in drachtio[400000+354000]
Mar 24 15:13:30 ct-kam01 kernel: [408822.509335] drachtio[13343]: segfault at 6e00000081 ip 000000000053fd8a sp 00007ffebf2f04d0 error 4 in drachtio[400000+354000]
Mar 27 08:10:11 ct-kam01 kernel: [642621.606968] traps: drachtio[9588] general protection ip:53fd8a sp:7fffd5112630 error:0 in drachtio[400000+354000]
Mar 27 08:57:48 ct-kam01 kernel: [645478.683342] drachtio[26861]: segfault at 10 ip 00000000004b4c4f sp 00007ffcd05719b0 error 4 in drachtio[400000+354000]
Mar 27 09:42:25 ct-kam01 kernel: [648155.639012] drachtio[29374]: segfault at 10 ip 00000000004b4c4f sp 00007ffcf1b61fb0 error 4 in drachtio[400000+354000]
Mar 27 09:43:55 ct-kam01 kernel: [648245.666130] drachtio[31845]: segfault at a6 ip 000000000053fd8a sp 00007ffd74d30120 error 4 in drachtio[400000+354000]
Mar 27 09:45:53 ct-kam01 kernel: [648363.786066] traps: drachtio[31934] general protection ip:531cbe sp:7ffe50454e00 error:0 in drachtio[400000+354000]
Mar 27 09:49:35 ct-kam01 kernel: [648585.875335] drachtio[32043]: segfault at 10 ip 00000000004b4c4f sp 00007ffe4466cc30 error 4 in drachtio[400000+354000]
Mar 27 09:53:40 ct-kam01 kernel: [648831.046355] traps: drachtio[32238] general protection ip:531cbe sp:7ffc37310110 error:0 in drachtio[400000+354000]
Mar 27 09:55:12 ct-kam01 kernel: [648922.658110] traps: drachtio[32459] general protection ip:53fd8a sp:7ffdfda286d0 error:0 in drachtio[400000+354000]
Mar 27 10:48:47 ct-kam01 kernel: [652138.303311] drachtio[32543]: segfault at 10 ip 00000000004b4c4f sp 00007ffde1059380 error 4 in drachtio[400000+354000]
Mar 27 11:37:48 ct-kam01 kernel: [655078.799465] traps: drachtio[3191] general protection ip:531cbe sp:7ffcb7ef1e10 error:0 in drachtio[400000+354000]
Mar 27 11:40:38 ct-kam01 kernel: [655248.529328] drachtio[5911]: segfault at 10 ip 00000000004b4c4f sp 00007ffe190a0e00 error 4 in drachtio[400000+354000]
Mar 27 11:43:39 ct-kam01 kernel: [655430.380642] drachtio[6066]: segfault at 10 ip 00000000004b4c4f sp 00007fff5f7effe0 error 4 in drachtio[400000+354000]
Mar 27 13:34:24 ct-kam01 kernel: [662075.230445] drachtio[6254]: segfault at 10 ip 00000000004b4c4f sp 00007fff2ae91f50 error 4 in drachtio[400000+354000]
Mar 27 13:36:27 ct-kam01 kernel: [662197.643523] drachtio[12450]: segfault at 10 ip 00000000004b4c4f sp 00007ffdaafa71d0 error 4 in drachtio[400000+354000]
Mar 27 13:52:01 ct-kam01 kernel: [663132.054710] drachtio[12591]: segfault at 10 ip 00000000004b4c4f sp 00007fff58522750 error 4 in drachtio[400000+354000]
Mar 27 15:29:40 ct-kam01 kernel: [668990.692409] drachtio[13411]: segfault at 10 ip 00000000004b4c4f sp 00007ffdd6a06f10 error 4 in drachtio[400000+354000]
Mar 27 15:31:40 ct-kam01 kernel: [669110.932571] drachtio[18629]: segfault at 10 ip 00000000004b4c4f sp 00007ffd77cdd3a0 error 4 in drachtio[400000+354000]
Mar 27 16:15:01 ct-kam01 kernel: [671712.328839] drachtio[18740]: segfault at 10 ip 00000000004b4c4f sp 00007ffe2212cf20 error 4 in drachtio[400000+354000]
Mar 27 16:16:31 ct-kam01 kernel: [671801.760627] drachtio[21146]: segfault at 76 ip 000000000053fd8a sp 00007ffcb563d2e0 error 4 in drachtio[400000+354000]
Mar 27 16:40:12 ct-kam01 kernel: [673223.106774] drachtio[21226]: segfault at 10 ip 00000000004b4c4f sp 00007ffc75b298c0 error 4 in drachtio[400000+354000]
Mar 27 19:07:46 ct-kam01 kernel: [682077.132578] drachtio[22591]: segfault at 10 ip 00000000004b4c4f sp 00007ffc5dd03530 error 4 in drachtio[400000+354000]
Mar 27 22:43:36 ct-kam01 kernel: [695026.837867] drachtio[30548]: segfault at 10 ip 00000000004b4c4f sp 00007ffe467c1860 error 4 in drachtio[400000+354000]
Mar 28 11:08:07 ct-kam01 kernel: [739697.756504] drachtio[10135]: segfault at 10 ip 00000000004b4c4f sp 00007ffec11e08f0 error 4 in drachtio[400000+354000]

davehorton commented 4 years ago

can you send me another log (and stack trace if you have it) ? I spent half a day trying to recreate the crash, replicating the signaling I was seeing in your log and for some reason was not able to. I'd like to see one (or a couple) of other similar logs to make sure I am understanding the scenario

elbow commented 4 years ago

Hi,

Info from two crashes in your mailbox - one the segfault in SipDialogController::clearIIP, the other is something different.

You are already a contributor on our rtcproxy project on github.

Let me know if a login on the server running drachtio would be useful.

Thanks very much, Steve

davehorton commented 4 years ago

One thing I notice from your logs is that for some reason you are reusing the same SIp Call-ID on completely different calls. This is not a good idea at all, and I strongly suggest you stop doing it. Whatever is generating the SIP Call-IDs should generate a unique id for every call.

For instance, in the drachtio-crash-2-drachtio.log the SIP Call-ID which is used on the incoming INVITE that eventually leads to the crash:

Call-ID: 4ec011541ff92c7b1a585c8124d508c9@197.155.250.130:5060

is first used for an incoming invite at 2020-03-29 14:41:34 is next used on a completely differerent invite at 14:42:13

Why are you doing this? While it is not strictly against the RFC 3261 it is a tremendously bad idea, makes troubleshooting more difficult, and may possibly create issues for the sip server state machine.

Please track down where you are doing this and fix it.

elbow commented 4 years ago

Hi Dave,

There was only one call - 4ec011541ff92c7b1a585c8124d508c9@197.155.250.130:5060 came from Telviva at 14:41:34.

Via the simringer we sent this to the phone as callid NtdXJxbHU~rp~ 4ec011541ff92c7b1a585c8124d508c9@197.155.250.130:5060 - we use a different but related callid as you see.

At 14:42:13 the webrtc client phone send a reinvite as such:

INVITE sip:197.155.250.156:4433;transport=wss SIP/2.0 Via: SIP/2.0/WSS 3m86ph4f85m0.invalid;branch=z9hG4bK9607140;received=105.12.3.219;rport=15660 Max-Forwards: 69 To: sip:0665841170@197.155.250.130;tag=as6be1f05c From: "324" sip:smurqlu6@3m86ph4f85m0.invalid;tag=ljul9hf8g7 Call-ID: NtdXJxbHU~rp~4ec011541ff92c7b1a585c8124d508c9@197.155.250.130:5060 CSeq: 7711 INVITE Contact: sip:smurqlu6@3m86ph4f85m0.invalid;transport=ws Content-Type: application/sdp Session-Expires: 90;refresher=uac Allow: INVITE,ACK,CANCEL,BYE,UPDATE,MESSAGE,OPTIONS,REFER,INFO Supported: timer,ice,replaces,outbound User-Agent: JsSIP 3.3.3 Content-Length: 2098

You can see its in-dialog since it has both tags.

It is surprising to see this but clearly the server side things this call is cancelled and the client side thinks it is answered. Obviously messages have been lost or are queued somehere.

This re-invite was sent to Telviva via b2bua as 4ec011541ff92c7b1a585c8124d508c9@197.155.250.130:5060 which is as it should be since its an in-dialog request.

INVITE sip:0665841170@197.155.250.130:5060 SIP/2.0 Via: SIP/2.0/UDP 197.155.250.156:6060;rport;branch=z9hG4bKr2NNvymXFUmpa Route: <sip:197.155.248.129;lr=on> Max-Forwards: 70 From: sip:6598324@197.155.248.129:5060;tag=BBH27y3r1HgrH To: "Player" sip:0665841170@197.155.250.130;tag=as6be1f05c Call-ID: 4ec011541ff92c7b1a585c8124d508c9@197.155.250.130:5060 CSeq: 103 INVITE Contact: Content-Type: application/sdp Content-Length: 765

Hope that helps.

Steve

On Mon, 30 Mar 2020 at 19:25, Dave Horton notifications@github.com wrote:

One thing I notice from your logs is that for some reason you are reusing the same SIp Call-ID on completely different calls. This is not a good idea at all, and I strongly suggest you stop doing it. Whatever is generating the SIP Call-IDs should generate a unique id for every call.

For instance, in the drachtio-crash-2-drachtio.log the SIP Call-ID which is used on the incoming INVITE that eventually leads to the crash:

Call-ID: 4ec011541ff92c7b1a585c8124d508c9@197.155.250.130:5060

is first used for an incoming invite at 2020-03-29 14:41:34 is next used on a completely differerent invite at 14:42:13

Why are you doing this? While it is not strictly against the RFC 3261 it is a tremendously bad idea, makes troubleshooting more difficult, and may possibly create issues for the sip server state machine.

Please track down where you are doing this and fix it.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/davehorton/drachtio-server/issues/120#issuecomment-606133716, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAA2IVVRCLTOP5Y76VA6YSDRKDIZXANCNFSM4LR32SSA .

-- Steve Davies: Technical Director, Connection Telecom (Pty) Ltd Email is preferred, but: Phone: 0878200200

davehorton commented 4 years ago

do you have the ability to test a source build, or do you need docker image?

elbow commented 4 years ago

I'm running the develop branch that I built from source. Checked it out last week. So source is 100% - you can tell me what branch and I'm right there.

Thanks, Steve

On Mon, 30 Mar 2020 at 20:33, Dave Horton notifications@github.com wrote:

do you have the ability to test a source build, or do you need docker image?

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/davehorton/drachtio-server/issues/120#issuecomment-606168368, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAA2IVU5VBIUXLBX7LEMAQDRKDQYJANCNFSM4LR32SSA .

-- Steve Davies: Technical Director, Connection Telecom (Pty) Ltd Email is preferred, but: Phone: 0878200200

davehorton commented 4 years ago

I have not been able to recreate the crash unfortunately. However it appears to me that the problem exists because the sip stack allows the app to send a 200 OK to an incoming INVITE that has already been canceled, so I have committed a fix for this on the develop branch.

Please build and test the latest develop branch with the commit above. Also, after running for a while, when this scenario happens you should see a new log message:

DrachtioController::setupLegForIncomingRequest - app provided a response but INVITE has already been canceled

After running for a while (hopefully without a crash) check the logs to see if you see any instances of that log message.

elbow commented 4 years ago

Hi Dave - thank you, I'm building it now.

elbow commented 4 years ago

Hi,

I'm going to close this with thanks for your help.

When I built drachtio-server unoptimised then the seg faults stopped.

However I am getting assert failures leading to aborts - I'm going to open a separate issue for that since I have done some investigation.

drachtio / drachtio-server

Drachtio server seg fault #120