Open elbow opened 4 years ago
this possibly looks to be a race condition where you get an incoming REGISTER before everything is initialized. Are you able to recreate the crash outside of that condition?
also, can you provide a log at debug level for both drachtio server and sofia
Traffic comes thick and fast - there are quite a few thousand clients. So if drachtio crashes at a busy time you can be sure that as soon as it reads incoming traffic it will get a packet.
It is curious that it doesn't happen with the local request handler - every time it gets into this cycle of crashes we can change the request handler back to the local one and it comes up no problem.
I'll provide the extra logging etc but bear with me, we need to do some roll-back since we can't complete the move to the new k8s cluster.
Thanks, Steve
For completeness, in another example it didn't crash immediately - responded to 55 registers and then crashed on the next
2020-04-20 05:59:16.885322 Starting drachtio version v0.8.4-2-gb9cf802
...startup logs...
2020-04-20 05:59:16.934375 Starting sofia event loop in main thread: 140495091922752
2020-04-20 05:59:16.934447 tport_type_udp.c:519 tport_udp_error() tport_udp_error: Connection refused (111) [icmp type=3 code=3]
2020-04-20 05:59:16.934488 tport_type_udp.c:524 tport_udp_error() reported by [127.0.0.1]:0
2020-04-20 05:59:16.934530 nta.c:2867 agent_tp_error() nta_agent: tport: 127.0.0.1:6060: Connection refused
2020-04-20 05:59:17.073057 recv 656 bytes from wss/[105.242.164.9]:49280 at 05:59:17.072802:
REGISTER sip:rtc.telviva.com SIP/2.0
Via: SIP/2.0/WSS totdd2i7688f.invalid;branch=z9hG4bK5669083
Max-Forwards: 69
To: <sip:7642307@rtc.telviva.com>
From: "2307" <sip:7642307@rtc.telviva.com>;tag=bb5k9uq447
Call-ID: h09vii4mg3p9761eujlsu5
CSeq: 1158 REGISTER
X-PushId-Platform: web
X-PushId: 0d386491-360b-4589-a228-a5af7261f7af
Contact: <sip:f1kr7m4k@totdd2i7688f.invalid;transport=ws>;+sip.ice;reg-id=1;+sip.instance="<urn:uuid:4fa5d347-ef52-4ef6-ad15-15a6dd94c044>";expires=600
Expires: 600
Allow: INVITE,ACK,CANCEL,BYE,UPDATE,MESSAGE,OPTIONS,REFER,INFO
Supported: path,gruu,outbound
User-Agent: TelvivaOne web 1.0 000
Content-Length: 0
2020-04-20 05:59:17.074767 RequestHandler::startRequest: sending http POST: http://10.151.192.13:8080/sip4.fibrephone.telviva.com/?method=REGISTER&domain=rtc.telviva.com&protocol=tcp&source_address=105.242.164.9&fromUser=7642307&toUser=7642307&uriUser=&contentType=&uri=sip%3artc.telviva.com
2020-04-20 05:59:17.159073 http 200 response received from server in 0.0828 secs: {"action":"route","data":{"tag":"rtcproxy"}}
2020-04-20 05:59:17.159359 No connected clients found to handle incoming register request
2020-04-20 05:59:17.159464 ClientController::selectClientForTag - no clients registered for tag: rtcproxy
2020-04-20 05:59:17.221003 SipDialogController::addIncomingRequestTransaction - adding transactionId bb6fab1a-1148-40be-bc36-ba878d2c6ffd for irq:0x1a8a7d0
2020-04-20 05:59:17.222293 send 320 bytes to wss/[105.242.164.9]:49280 at 05:59:17.221953:
SIP/2.0 480 Temporarily Unavailable
Via: SIP/2.0/WSS totdd2i7688f.invalid;branch=z9hG4bK5669083;received=105.242.164.9;rport=49280
From: "2307" <sip:7642307@rtc.telviva.com>;tag=bb5k9uq447
To: <sip:7642307@rtc.telviva.com>;tag=4jaFXUp90008j
Call-ID: h09vii4mg3p9761eujlsu5
CSeq: 1158 REGISTER
Content-Length: 0
54 more registers arrive and are responded to
then:
2020-04-20 05:59:29.406623 recv 655 bytes from wss/[129.205.174.139]:56818 at 05:59:29.406433:
REGISTER sip:rtc.telviva.com SIP/2.0
Via: SIP/2.0/WSS 0puo2tutr873.invalid;branch=z9hG4bK1066410
Max-Forwards: 69
To: <sip:9240115@rtc.telviva.com>
From: "115" <sip:9240115@rtc.telviva.com>;tag=4jktn6br06
Call-ID: pglmhgrc6lp4n62a82dl7o
CSeq: 1086 REGISTER
X-PushId-Platform: web
X-PushId: d2f4b2e5-a4b6-4f33-bca0-1bebae00fb9b
Contact: <sip:ma2s3qcg@0puo2tutr873.invalid;transport=ws>;+sip.ice;reg-id=1;+sip.instance="<urn:uuid:215f487d-34d6-41a7-a53f-59abe2e36dea>";expires=600
Expires: 600
Allow: INVITE,ACK,CANCEL,BYE,UPDATE,MESSAGE,OPTIONS,REFER,INFO
Supported: path,gruu,outbound
User-Agent: XxxOne web 1.0 000
Content-Length: 0
2020-04-20 05:59:29.406967 RequestHandler::startRequest: sending http POST: http://10.151.192.13:8080/sip4.fibrephone.xxx.com/?method=REGISTER&domain=rtc.xxx.com&protocol=tcp&source_address=129.205.174.139&fromUser=9240115&toUser=9240115&uriUser=&contentType=&uri=sip%3artc.xxx.com
2020-04-20 05:59:29.447798 http 200 response received from server in 0.0395 secs: {"action":"route","data":{"tag":"rtcproxy"}}
<<<CRASH>>>
2020-04-20 05:59:29.845084 Starting drachtio version v0.8.4-2-gb9cf802
My colleague reports:
I applied the iptables rules and waited 5 minutes. By that time rtcproxy had established a connection. As soon as I removed them I got crashes.
So from that report it can't be a race during initialisation? drachtio has been running a long time and the client inbound connection is all set up. It still crashes.
agree, I'll wait for the logs ...
any luck in gathering logs?
Hi Dave,
Because it is disruptive to the service we have to wait until our late evening to do the testing again.
That is in say 6 hours time.
@davehorton You won't be happy with this update. However:
If we run drachtio-server against our remote request-handler with normal logging we get that segfault very often:
Tue 2020-04-21 18:33:04 SAST 15145 0 1 11 * /usr/local/bin/drachtio
Tue 2020-04-21 18:33:05 SAST 15155 0 1 11 * /usr/local/bin/drachtio
Tue 2020-04-21 18:33:09 SAST 15166 0 1 11 * /usr/local/bin/drachtio
Tue 2020-04-21 18:33:33 SAST 15198 0 1 11 * /usr/local/bin/drachtio
Tue 2020-04-21 18:33:33 SAST 15184 0 1 11 * /usr/local/bin/drachtio
Tue 2020-04-21 18:33:35 SAST 15208 0 1 11 * /usr/local/bin/drachtio
Tue 2020-04-21 18:33:52 SAST 15244 0 1 11 * /usr/local/bin/drachtio
Tue 2020-04-21 18:33:52 SAST 15221 0 1 11 * /usr/local/bin/drachtio
Tue 2020-04-21 18:33:53 SAST 15254 0 1 11 * /usr/local/bin/drachtio
Tue 2020-04-21 18:34:21 SAST 15265 0 1 11 * /usr/local/bin/drachtio
Tue 2020-04-21 18:34:21 SAST 15286 0 1 11 * /usr/local/bin/drachtio
Tue 2020-04-21 18:34:23 SAST 15297 0 1 11 * /usr/local/bin/drachtio
If we push the sofia logging up to 9 and switch drachtio logging to debug:
<!-- sofia (internal sip library) log level, from 0 (minimal) to 9 (verbose) -->
<sofia-loglevel>9</sofia-loglevel>
<!-- general process log level: notice, warning, error, info, debug. Default: info -->
<loglevel>debug</loglevel>
Then we don't get any crashes.
All I can guess is that the time taken to do the logging is enough to slow something down so that the race is avoided.
So we can't give you debug level logs of the crash since there is no crash if the logging is set to debug level.
We restarted Drachtio a bunch of times in case it was intermittent and couldn't get any crash to happen.
How do you want us to proceed?
Thanks, Steve
could you try with drachtio at debug and sofia at log level 3?
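For reference, that combination might look like the following in the drachtio config, reusing the same two elements quoted earlier in this thread. This is only a sketch of the two settings; the rest of the file is assumed to stay unchanged and is omitted here.

```xml
<!-- sofia (internal sip library) log level lowered from 9 to 3 -->
<sofia-loglevel>3</sofia-loglevel>
<!-- drachtio process log level kept at debug -->
<loglevel>debug</loglevel>
```

The idea is to keep drachtio's own debug output while cutting most of the sofia logging overhead, in case the level 9 logging is what masks the crash.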
also, your logs are showing an error right after startup, before any calls:
2020-04-20 20:31:38.385746 Starting sofia event loop in main thread: 139745329432384
2020-04-20 20:31:38.385807 tport_type_udp.c:519 tport_udp_error() tport_udp_error: Connection refused (111) [icmp type=3 code=3]
2020-04-20 20:31:38.385841 tport_type_udp.c:524 tport_udp_error() reported by [127.0.0.1]:0
I'd like to get to the bottom of this as well. Can you just show me a log with debug level (drachtio and sofia) after startup? In that log it should not be necessary to receive any calls
This is from startup for the first few seconds. As it starts, a bunch of packets arrive and I cut the log after the first few replies go back.
My drachtio config binds to "*" like so, which I guess is where the loopback address comes from:
<contacts>
<contact>sip:*:6060;transport=udp,tcp</contact>
<contact>sips:*:6061;transport=tls</contact>
<contact>sips:111.222.159:4433;transport=wss</contact>
</contacts>
Steve
We are using a request-handler with our drachtio.
We are trying out k8s and switched this to another instance that is about 25 milliseconds away.
In every other respect the service returns the same response - we tested by constructing requests with curl, e.g.:
When we switch over then drachtio segfaults on the first packet it tries to process (or maybe first register?)
and so on and so forth.
The coredump says that the crash is like so:
digging into the stack frames: