Closed elbow closed 4 years ago
can you post the server log and describe the scenario? This is a native build with the latest from develop branch?
It is built yesterday from develop (but the docker image was also doing it). I have a core dump (2.4GB). You want the whole log - its a lot - or the part where it crashed and restarted?
I think if you can zip and email me the whole log that would be best
I do not think this is directly related to the memory leak issue so I would proceed with that testing
OK - in your email, thanks
Thanks for the log. I see the issue. It does look like a bug in drachtio server
The specific call that caused the problem went like this:
So, definitely a bug in the server I will address. Strange that "B" sent a 200 OK to invite after we sent it a CANCEL.
Also strange the drachio-srf tried to send the 200 OK also, since it had received the cancel. This may be something I can look at fixing in drachtio-srf as well.
Hi Dave,
If you have a callid or similar I'll look up the details of the call flow. Looks like glare to me - OK to the INVITE and the CANCEL crossed on the wire.
Its obviously rare or we'd be seeing more crashes.
Steve
Call-ID:
Call-ID: 63036270242d87c1714ef1ed6e0d1048@197.155.248.133:5060
I will send you the entire trace of the call.
It is not a glare condition....the 200 OK comes many seconds after we sent the CANCEL. And then the 200 OK to the CANCEL quite a few secs after that. Strange
Haven't forgotten about this - wew are getting this from time to time but we've been scrambling and I didn't have a chance to gather all the logs. Please bear with me.
OK. I think I see what the problem is from the earlier log. When I have spare time I will try to look into a fix.
Thanks!
steve, quick question on this. I have been trying to recreate using Srf#createB@BUA but cannot recreeate the crash. In your code, what are you using to handle the B2BUA messaging? Are you doing something different than a simple createB2BUA?
The key thing I need to recreate is incorrectly sending a 200 OK to A after we have previously sent them a 487 final response, and it seems like the createB2BUA actually guards against doing that..
Hi Dave,
You can by all means look at our code, I'll add you on the repository.
The code uses a simringer for "inbound" calls, so is it possible that is involved?
While we try to nail down the root issue is there a way to avoid the segfault. We are getting it quite frequently
root@ct-kam01:/home/steve# (cat /var/log/syslog.2 /var/log/syslog.1 /var/log/syslog) | grep 'in drachtio'
Mar 24 10:52:02 ct-kam01 kernel: [393134.847581] drachtio[26053]: segfault at 10 ip 00000000004b4c4f sp 00007ffcd6021fb0 error 4 in drachtio[400000+354000]
Mar 24 10:55:34 ct-kam01 kernel: [393346.864212] drachtio[5328]: segfault at 2e6 ip 000000000053fd8a sp 00007fffc2aaa970 error 4 in drachtio[400000+354000]
Mar 24 12:06:10 ct-kam01 kernel: [397582.997556] traps: drachtio[5574] general protection ip:53fd8a sp:7ffd7ef19e30 error:0 in drachtio[400000+354000]
Mar 24 12:17:33 ct-kam01 kernel: [398265.654733] drachtio[10512]: segfault at 10 ip 00000000004b4c4f sp 00007ffc701e3cc0 error 4 in drachtio[400000+354000]
Mar 24 12:26:26 ct-kam01 kernel: [398798.906688] drachtio[11209]: segfault at 10 ip 00000000004b4c4f sp 00007fff9040fce0 error 4 in drachtio[400000+354000]
Mar 24 12:45:14 ct-kam01 kernel: [399926.535706] drachtio[11799]: segfault at 10 ip 00000000004b4c4f sp 00007fffd00b2690 error 4 in drachtio[400000+354000]
Mar 24 12:46:42 ct-kam01 kernel: [400015.001200] drachtio[13243]: segfault at 10 ip 00000000004b4c4f sp 00007ffc25f4c7c0 error 4 in drachtio[400000+354000]
Mar 24 15:13:30 ct-kam01 kernel: [408822.509335] drachtio[13343]: segfault at 6e00000081 ip 000000000053fd8a sp 00007ffebf2f04d0 error 4 in drachtio[400000+354000]
Mar 27 08:10:11 ct-kam01 kernel: [642621.606968] traps: drachtio[9588] general protection ip:53fd8a sp:7fffd5112630 error:0 in drachtio[400000+354000]
Mar 27 08:57:48 ct-kam01 kernel: [645478.683342] drachtio[26861]: segfault at 10 ip 00000000004b4c4f sp 00007ffcd05719b0 error 4 in drachtio[400000+354000]
Mar 27 09:42:25 ct-kam01 kernel: [648155.639012] drachtio[29374]: segfault at 10 ip 00000000004b4c4f sp 00007ffcf1b61fb0 error 4 in drachtio[400000+354000]
Mar 27 09:43:55 ct-kam01 kernel: [648245.666130] drachtio[31845]: segfault at a6 ip 000000000053fd8a sp 00007ffd74d30120 error 4 in drachtio[400000+354000]
Mar 27 09:45:53 ct-kam01 kernel: [648363.786066] traps: drachtio[31934] general protection ip:531cbe sp:7ffe50454e00 error:0 in drachtio[400000+354000]
Mar 27 09:49:35 ct-kam01 kernel: [648585.875335] drachtio[32043]: segfault at 10 ip 00000000004b4c4f sp 00007ffe4466cc30 error 4 in drachtio[400000+354000]
Mar 27 09:53:40 ct-kam01 kernel: [648831.046355] traps: drachtio[32238] general protection ip:531cbe sp:7ffc37310110 error:0 in drachtio[400000+354000]
Mar 27 09:55:12 ct-kam01 kernel: [648922.658110] traps: drachtio[32459] general protection ip:53fd8a sp:7ffdfda286d0 error:0 in drachtio[400000+354000]
Mar 27 10:48:47 ct-kam01 kernel: [652138.303311] drachtio[32543]: segfault at 10 ip 00000000004b4c4f sp 00007ffde1059380 error 4 in drachtio[400000+354000]
Mar 27 11:37:48 ct-kam01 kernel: [655078.799465] traps: drachtio[3191] general protection ip:531cbe sp:7ffcb7ef1e10 error:0 in drachtio[400000+354000]
Mar 27 11:40:38 ct-kam01 kernel: [655248.529328] drachtio[5911]: segfault at 10 ip 00000000004b4c4f sp 00007ffe190a0e00 error 4 in drachtio[400000+354000]
Mar 27 11:43:39 ct-kam01 kernel: [655430.380642] drachtio[6066]: segfault at 10 ip 00000000004b4c4f sp 00007fff5f7effe0 error 4 in drachtio[400000+354000]
Mar 27 13:34:24 ct-kam01 kernel: [662075.230445] drachtio[6254]: segfault at 10 ip 00000000004b4c4f sp 00007fff2ae91f50 error 4 in drachtio[400000+354000]
Mar 27 13:36:27 ct-kam01 kernel: [662197.643523] drachtio[12450]: segfault at 10 ip 00000000004b4c4f sp 00007ffdaafa71d0 error 4 in drachtio[400000+354000]
Mar 27 13:52:01 ct-kam01 kernel: [663132.054710] drachtio[12591]: segfault at 10 ip 00000000004b4c4f sp 00007fff58522750 error 4 in drachtio[400000+354000]
Mar 27 15:29:40 ct-kam01 kernel: [668990.692409] drachtio[13411]: segfault at 10 ip 00000000004b4c4f sp 00007ffdd6a06f10 error 4 in drachtio[400000+354000]
Mar 27 15:31:40 ct-kam01 kernel: [669110.932571] drachtio[18629]: segfault at 10 ip 00000000004b4c4f sp 00007ffd77cdd3a0 error 4 in drachtio[400000+354000]
Mar 27 16:15:01 ct-kam01 kernel: [671712.328839] drachtio[18740]: segfault at 10 ip 00000000004b4c4f sp 00007ffe2212cf20 error 4 in drachtio[400000+354000]
Mar 27 16:16:31 ct-kam01 kernel: [671801.760627] drachtio[21146]: segfault at 76 ip 000000000053fd8a sp 00007ffcb563d2e0 error 4 in drachtio[400000+354000]
Mar 27 16:40:12 ct-kam01 kernel: [673223.106774] drachtio[21226]: segfault at 10 ip 00000000004b4c4f sp 00007ffc75b298c0 error 4 in drachtio[400000+354000]
Mar 27 19:07:46 ct-kam01 kernel: [682077.132578] drachtio[22591]: segfault at 10 ip 00000000004b4c4f sp 00007ffc5dd03530 error 4 in drachtio[400000+354000]
Mar 27 22:43:36 ct-kam01 kernel: [695026.837867] drachtio[30548]: segfault at 10 ip 00000000004b4c4f sp 00007ffe467c1860 error 4 in drachtio[400000+354000]
Mar 28 11:08:07 ct-kam01 kernel: [739697.756504] drachtio[10135]: segfault at 10 ip 00000000004b4c4f sp 00007ffec11e08f0 error 4 in drachtio[400000+354000]
can you send me another log (and stack trace if you have it) ? I spent half a day trying to recreate the crash, replicating the signaling I was seeing in your log and for some reason was not able to. I'd like to see one (or a couple) of other similar logs to make sure I am understanding the scenario
Hi,
Info from two crashes in your mailbox - one the segfault in SipDialogController::clearIIP, the other is something different.
You are already a contributor on our rtcproxy project on github.
Let me know if a login on the server running drachtio would be useful.
Thanks very much, Steve
One thing I notice from your logs is that for some reason you are reusing the same SIp Call-ID on completely different calls. This is not a good idea at all, and I strongly suggest you stop doing it. Whatever is generating the SIP Call-IDs should generate a unique id for every call.
For instance, in the drachtio-crash-2-drachtio.log the SIP Call-ID which is used on the incoming INVITE that eventually leads to the crash:
Call-ID: 4ec011541ff92c7b1a585c8124d508c9@197.155.250.130:5060
is first used for an incoming invite at 2020-03-29 14:41:34 is next used on a completely differerent invite at 14:42:13
Why are you doing this? While it is not strictly against the RFC 3261 it is a tremendously bad idea, makes troubleshooting more difficult, and may possibly create issues for the sip server state machine.
Please track down where you are doing this and fix it.
Hi Dave,
There was only one call - 4ec011541ff92c7b1a585c8124d508c9@197.155.250.130:5060 came from Telviva at 14:41:34.
Via the simringer we sent this to the phone as callid NtdXJxbHU~rp~ 4ec011541ff92c7b1a585c8124d508c9@197.155.250.130:5060 - we use a different but related callid as you see.
At 14:42:13 the webrtc client phone send a reinvite as such:
INVITE sip:197.155.250.156:4433;transport=wss SIP/2.0 Via: SIP/2.0/WSS 3m86ph4f85m0.invalid;branch=z9hG4bK9607140;received=105.12.3.219;rport=15660 Max-Forwards: 69 To: sip:0665841170@197.155.250.130;tag=as6be1f05c From: "324" sip:smurqlu6@3m86ph4f85m0.invalid;tag=ljul9hf8g7 Call-ID: NtdXJxbHU~rp~4ec011541ff92c7b1a585c8124d508c9@197.155.250.130:5060 CSeq: 7711 INVITE Contact: sip:smurqlu6@3m86ph4f85m0.invalid;transport=ws Content-Type: application/sdp Session-Expires: 90;refresher=uac Allow: INVITE,ACK,CANCEL,BYE,UPDATE,MESSAGE,OPTIONS,REFER,INFO Supported: timer,ice,replaces,outbound User-Agent: JsSIP 3.3.3 Content-Length: 2098
You can see its in-dialog since it has both tags.
It is surprising to see this but clearly the server side things this call is cancelled and the client side thinks it is answered. Obviously messages have been lost or are queued somehere.
This re-invite was sent to Telviva via b2bua as 4ec011541ff92c7b1a585c8124d508c9@197.155.250.130:5060 which is as it should be since its an in-dialog request.
INVITE sip:0665841170@197.155.250.130:5060 SIP/2.0
Via: SIP/2.0/UDP 197.155.250.156:6060;rport;branch=z9hG4bKr2NNvymXFUmpa
Route: <sip:197.155.248.129;lr=on>
Max-Forwards: 70
From: sip:6598324@197.155.248.129:5060;tag=BBH27y3r1HgrH
To: "Player" sip:0665841170@197.155.250.130;tag=as6be1f05c
Call-ID: 4ec011541ff92c7b1a585c8124d508c9@197.155.250.130:5060
CSeq: 103 INVITE
Contact:
Hope that helps.
Steve
On Mon, 30 Mar 2020 at 19:25, Dave Horton notifications@github.com wrote:
One thing I notice from your logs is that for some reason you are reusing the same SIp Call-ID on completely different calls. This is not a good idea at all, and I strongly suggest you stop doing it. Whatever is generating the SIP Call-IDs should generate a unique id for every call.
For instance, in the drachtio-crash-2-drachtio.log the SIP Call-ID which is used on the incoming INVITE that eventually leads to the crash:
Call-ID: 4ec011541ff92c7b1a585c8124d508c9@197.155.250.130:5060
is first used for an incoming invite at 2020-03-29 14:41:34 is next used on a completely differerent invite at 14:42:13
Why are you doing this? While it is not strictly against the RFC 3261 it is a tremendously bad idea, makes troubleshooting more difficult, and may possibly create issues for the sip server state machine.
Please track down where you are doing this and fix it.
— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/davehorton/drachtio-server/issues/120#issuecomment-606133716, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAA2IVVRCLTOP5Y76VA6YSDRKDIZXANCNFSM4LR32SSA .
-- Steve Davies: Technical Director, Connection Telecom (Pty) Ltd Email is preferred, but: Phone: 0878200200
do you have the ability to test a source build, or do you need docker image?
I'm running the develop branch that I built from source. Checked it out last week. So source is 100% - you can tell me what branch and I'm right there.
Thanks, Steve
On Mon, 30 Mar 2020 at 20:33, Dave Horton notifications@github.com wrote:
do you have the ability to test a source build, or do you need docker image?
— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/davehorton/drachtio-server/issues/120#issuecomment-606168368, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAA2IVU5VBIUXLBX7LEMAQDRKDQYJANCNFSM4LR32SSA .
-- Steve Davies: Technical Director, Connection Telecom (Pty) Ltd Email is preferred, but: Phone: 0878200200
I have not been able to recreate the crash unfortunately. However it appears to me that the problem exists because the sip stack allows the app to send a 200 OK to an incoming INVITE that has already been canceled, so I have committed a fix for this on the develop branch.
Please build and test the latest develop branch with the commit above. Also, after running for a while, when this scenario happens you should see a new log message:
DrachtioController::setupLegForIncomingRequest - app provided a response but INVITE has already been canceled
After running for a while (hopefully without a crash) check the logs to see if you see any instances of that log message.
Hi Dave - thank you, I'm building it now.
Hi,
I'm going to close this with thanks for your help.
When I built drachtio-server unoptimised then the seg faults stopped.
However I am getting assert failures leading to aborts - I'm going to open a separate issue for that since I have done some investigation.
Hi,
Maybe related to the leak, but getting a seg fault in drachtio: