Non-stop segfaults when we switch to a non-local

elbow commented 4 years ago

We are using a request-handler with our drachtio.

    <request-handlers>
        <request-handler sip-method="*" http-method="POST">http://10.95.16.234:8080/sip4.xxx.com</request-handler>
    </request-handlers>

We are trying our k8s and switched this to another instance that is about 25 milliseconds away.

In every other respect the service returns the same response - we tested with contructing requests with curl, eg:

curl -X POST -H 'Content-Type: text/plain' -d "`cat /tmp/packet`"  'http://10.151.192.13:8080/sip4.xxx.com/?method=REGISTER&domain=rtc.xxx.com&protocol=tcp&source_address=10.0.98.192&fromUser=12345&toUser=12345&uriUser=&contentType=&uri=sip:rtc.xxx.com

When we switch over then drachtio segfaults on the first packet it tries to process (or maybe first register?)

2020-04-20 20:31:38.340011 Starting drachtio version v0.8.4-9-g608a588
2020-04-20 20:31:38.340176 Logging threshold:                     4
2020-04-20 20:31:38.340252 Route for outbound connection:         sip-method: *, http-method: POST, http-url: http://10.151.192.13:8080/sip4.fibrephone.telviva.com
2020-04-20 20:31:38.340297 DrachtioController::run tls key file:         /etc/letsencrypt/live/sip.fibrephone.telviva.com/privkey.pem
2020-04-20 20:31:38.340342 DrachtioController::run tls certificate file: /etc/letsencrypt/live/sip.fibrephone.telviva.com/cert.pem
2020-04-20 20:31:38.340374 DrachtioController::run tls chain file:       /etc/letsencrypt/live/sip.fibrephone.telviva.com/fullchain.pem
2020-04-20 20:31:38.340404 DrachtioController::run listening for applications on tcp port 9022 and tls port 0
2020-04-20 20:31:38.341065 DrachtioController::run mtu size for udp packets: 4096
2020-04-20 20:31:38.341130 DrachtioController::run - sipcapture/Homer enabled: udp:10.155.250.80:9060;hep=3;capture_id=104
2020-04-20 20:31:38.341163 Prometheus support disabled
2020-04-20 20:31:38.341194 tcp keep alives will be sent to clients every 45 seconds
2020-04-20 20:31:38.341234 ClientController::threadFunc - ClientController: io_context run loop started (or restarted)
2020-04-20 20:31:38.341334 DrachtioController::run: starting sip stack on sip:*:6060;transport=udp,tcp
2020-04-20 20:31:38.344682 SipTransport::addTransports - creating transport: 0x16a7420: udp/197.155.250.159:6060
2020-04-20 20:31:38.346675 SipTransport::addTransports - creating transport: 0x16a78d0: tcp/197.155.250.159:6060
2020-04-20 20:31:38.348680 SipTransport::addTransports - creating transport: 0x16a7db0: udp/10.155.250.159:6060
2020-04-20 20:31:38.350634 SipTransport::addTransports - creating transport: 0x16a8290: tcp/10.155.250.159:6060
2020-04-20 20:31:38.352587 SipTransport::addTransports - creating transport: 0x16b6720: udp/172.17.0.1:6060
2020-04-20 20:31:38.354563 SipTransport::addTransports - creating transport: 0x16b6bd0: tcp/172.17.0.1:6060
2020-04-20 20:31:38.356511 SipTransport::addTransports - creating transport: 0x16b7080: udp/127.0.0.1:6060
2020-04-20 20:31:38.358502 SipTransport::addTransports - creating transport: 0x16b7530: tcp/127.0.0.1:6060
2020-04-20 20:31:38.360498 SipTransport::addTransports - creating transport: 0x16b79e0: udp/[::1]:6060
2020-04-20 20:31:38.362499 SipTransport::addTransports - creating transport: 0x16b7e90: tcp/[::1]:6060
2020-04-20 20:31:38.364522 DrachtioController::run: adding additional sip address sips:*:6061;transport=tls
2020-04-20 20:31:38.365539 tport_tls.c:439 tls_init_context() tls: initialized ECDH
2020-04-20 20:31:38.366256 tport_tls.c:439 tls_init_context() tls: initialized ECDH
2020-04-20 20:31:38.366917 tport_tls.c:439 tls_init_context() tls: initialized ECDH
2020-04-20 20:31:38.367570 tport_tls.c:439 tls_init_context() tls: initialized ECDH
2020-04-20 20:31:38.368218 tport_tls.c:439 tls_init_context() tls: initialized ECDH
2020-04-20 20:31:38.370549 SipTransport::addTransports - creating transport: 0x16bd0a0: tls/197.155.250.159:6061
2020-04-20 20:31:38.372523 SipTransport::addTransports - creating transport: 0x16c0570: tls/10.155.250.159:6061
2020-04-20 20:31:38.374478 SipTransport::addTransports - creating transport: 0x16c8790: tls/172.17.0.1:6061
2020-04-20 20:31:38.376463 SipTransport::addTransports - creating transport: 0x16d0910: tls/127.0.0.1:6061
2020-04-20 20:31:38.378457 SipTransport::addTransports - creating transport: 0x16d8670: tls/[::1]:6061
2020-04-20 20:31:38.380453 DrachtioController::run: adding additional sip address sips:197.155.250.159:4433;transport=wss
2020-04-20 20:31:38.383072 SipTransport::addTransports - creating transport: 0x16e5d80: wss/197.155.250.159:4433
2020-04-20 20:31:38.385074 SipTransport::logTransports - there are : 16 transports
2020-04-20 20:31:38.385128 SipTransport::logTransports - wss/197.155.250.159:4433 (sips:197.155.250.159:4433;transport=wss, external-ip: , local-net: )
2020-04-20 20:31:38.385163 SipTransport::logTransports - tls/127.0.0.1:6061 (sips:*:6061;transport=tls, external-ip: , local-net: 127.0.0.1/32)
2020-04-20 20:31:38.385194 SipTransport::logTransports - tls/172.17.0.1:6061 (sips:*:6061;transport=tls, external-ip: , local-net: )
2020-04-20 20:31:38.385226 SipTransport::logTransports - tls/10.155.250.159:6061 (sips:*:6061;transport=tls, external-ip: , local-net: 10.0.0.0/8)
2020-04-20 20:31:38.385257 SipTransport::logTransports - tls/197.155.250.159:6061 (sips:*:6061;transport=tls, external-ip: , local-net: )
2020-04-20 20:31:38.385288 SipTransport::logTransports - tls/[::1]:6061 (sips:*:6061;transport=tls, external-ip: , local-net: )
2020-04-20 20:31:38.385322 SipTransport::logTransports - udp/197.155.250.159:6060 (sip:*:6060;transport=udp,tcp, external-ip: , local-net: ), mtu size: 4096
2020-04-20 20:31:38.385356 SipTransport::logTransports - udp/10.155.250.159:6060 (sip:*:6060;transport=udp,tcp, external-ip: , local-net: 10.0.0.0/8), mtu size: 4096
2020-04-20 20:31:38.385388 SipTransport::logTransports - tcp/127.0.0.1:6060 (sip:*:6060;transport=udp,tcp, external-ip: , local-net: 127.0.0.1/32)
2020-04-20 20:31:38.385419 SipTransport::logTransports - tcp/197.155.250.159:6060 (sip:*:6060;transport=udp,tcp, external-ip: , local-net: )
2020-04-20 20:31:38.385450 SipTransport::logTransports - tcp/10.155.250.159:6060 (sip:*:6060;transport=udp,tcp, external-ip: , local-net: 10.0.0.0/8)
2020-04-20 20:31:38.385482 SipTransport::logTransports - udp/172.17.0.1:6060 (sip:*:6060;transport=udp,tcp, external-ip: , local-net: ), mtu size: 4096
2020-04-20 20:31:38.385514 SipTransport::logTransports - tcp/172.17.0.1:6060 (sip:*:6060;transport=udp,tcp, external-ip: , local-net: )
2020-04-20 20:31:38.385546 SipTransport::logTransports - udp/127.0.0.1:6060 (sip:*:6060;transport=udp,tcp, external-ip: , local-net: 127.0.0.1/32), mtu size: 4096
2020-04-20 20:31:38.385580 SipTransport::logTransports - udp/[::1]:6060 (sip:*:6060;transport=udp,tcp, external-ip: , local-net: ), mtu size: 4096
2020-04-20 20:31:38.385611 SipTransport::logTransports - tcp/[::1]:6060 (sip:*:6060;transport=udp,tcp, external-ip: , local-net: )
2020-04-20 20:31:38.385746 Starting sofia event loop in main thread: 139745329432384
2020-04-20 20:31:38.385807 tport_type_udp.c:519 tport_udp_error() tport_udp_error: Connection refused (111) [icmp type=3 code=3]
2020-04-20 20:31:38.385841 tport_type_udp.c:524 tport_udp_error()       reported by [127.0.0.1]:0
2020-04-20 20:31:38.385876 nta.c:2878 agent_tp_error() nta_agent: tport: 127.0.0.1:6060: Connection refused
2020-04-20 20:31:38.588883 recv 654 bytes from wss/[105.233.226.138]:49220 at 20:31:38.588281:
REGISTER sip:rtc.telviva.com SIP/2.0
Via: SIP/2.0/WSS 71eppljkhela.invalid;branch=z9hG4bK1157744
Max-Forwards: 69
To: <sip:6598323@rtc.telviva.com>
From: "323" <sip:6598323@rtc.telviva.com>;tag=d9pt58vst6
Call-ID: iodlpr6jmo75bseg11v5nq
CSeq: 229 REGISTER
X-PushId-Platform: web
X-PushId: ca879a05-899f-4dd7-9770-89d34e5b3bb9
Contact: <sip:a1effamk@71eppljkhela.invalid;transport=ws>;+sip.ice;reg-id=1;+sip.instance="<urn:uuid:c73aac43-e6e5-4c23-b661-9db463c11628>";expires=600
Expires: 600
Allow: INVITE,ACK,CANCEL,BYE,UPDATE,MESSAGE,OPTIONS,REFER,INFO
Supported: path,gruu,outbound
User-Agent: TelvivaOne web 1.0 000
Content-Length: 0

2020-04-20 20:31:38.591674 RequestHandler::startRequest: sending http POST: http://10.151.192.13:8080/sip4.fibrephone.telviva.com/?method=REGISTER&domain=rtc.telviva.com&protocol=tcp&source_address=105.233.226.138&fromUser=6598323&toUser=6598323&uriUser=&contentType=&uri=sip%3artc.telviva.com
2020-04-20 20:31:38.654289 http 200 response received from server in 0.0611 secs: {"action":"route","data":{"tag":"rtcproxy"}}
2020-04-20 20:31:45.334359 Starting drachtio version v0.8.4-9-g608a588

and so on and so forth.

The coredump says that the crash is like so:

(gdb) bt
#0  0x000000000059fe1d in drachtio::SipTransport::isInNetwork (this=0xc0, address=0x7fffb31e3330 "sip.telviva.com") at ../src/sip-transports.cpp:123
#1  0x00000000005a2660 in drachtio::SipTransport::<lambda(const std::shared_ptr<drachtio::SipTransport>&, const std::shared_ptr<drachtio::SipTransport>&)>::operator()(const std::shared_ptr<drachtio::SipTransport> &, const std::shared_ptr<drachtio::SipTransport> &) const (__closure=0x7fffb31e3320, pA=std::shared_ptr (count 2, weak 1) 0x16bc110, pB=<error reading variable: Cannot access memory at address 0x128>)
    at ../src/sip-transports.cpp:466
#2  0x00000000005a6246 in __gnu_cxx::__ops::_Val_comp_iter<drachtio::SipTransport::findAppropriateTransport(char const*, char const*)::<lambda(const std::shared_ptr<drachtio::SipTransport>&, const std::shared_ptr<drachtio::SipTransport>&)> >::operator()<std::shared_ptr<drachtio::SipTransport>, __gnu_cxx::__normal_iterator<std::shared_ptr<drachtio::SipTransport>*, std::vector<std::shared_ptr<drachtio::SipTransport> > > >(std::shared_ptr<drachtio::SipTransport> &, __gnu_cxx::__normal_iterator<std::shared_ptr<drachtio::SipTransport>*, std::vector<std::shared_ptr<drachtio::SipTransport>, std::allocator<std::shared_ptr<drachtio::SipTransport> > > >) (
    this=0x7fffb31e3320, __val=std::shared_ptr (count 2, weak 1) 0x16bc110, __it=<error reading variable: Cannot access memory at address 0x128>) at /usr/include/c++/5/bits/predefined_ops.h:169
#3  0x00000000005a5e12 in std::__unguarded_linear_insert<__gnu_cxx::__normal_iterator<std::shared_ptr<drachtio::SipTransport>*, std::vector<std::shared_ptr<drachtio::SipTransport> > >, __gnu_cxx::__ops::_Val_comp_iter<drachtio::SipTransport::findAppropriateTransport(char const*, char const*)::<lambda(const std::shared_ptr<drachtio::SipTransport>&, const std::shared_ptr<drachtio::SipTransport>&)> > >(__gnu_cxx::__normal_iterator<std::shared_ptr<drachtio::SipTransport>*, std::vector<std::shared_ptr<drachtio::SipTransport>, std::allocator<std::shared_ptr<drachtio::SipTransport> > > >, __gnu_cxx::__ops::_Val_comp_iter<drachtio::SipTransport::findAppropriateTransport(char const*, char const*)::<lambda(const std::shared_ptr<drachtio::SipTransport>&, const std::shared_ptr<drachtio::SipTransport>&)> >) (__last=std::shared_ptr (empty) 0x0, __comp=...)
    at /usr/include/c++/5/bits/stl_algo.h:1823
#4  0x00000000005a576a in std::__insertion_sort<__gnu_cxx::__normal_iterator<std::shared_ptr<drachtio::SipTransport>*, std::vector<std::shared_ptr<drachtio::SipTransport> > >, __gnu_cxx::__ops::_Iter_comp_iter<drachtio::SipTransport::findAppropriateTransport(char const*, char const*)::<lambda(const std::shared_ptr<drachtio::SipTransport>&, const std::shared_ptr<drachtio::SipTransport>&)> > >(__gnu_cxx::__normal_iterator<std::shared_ptr<drachtio::SipTransport>*, std::vector<std::shared_ptr<drachtio::SipTransport>, std::allocator<std::shared_ptr<drachtio::SipTransport> > > >, __gnu_cxx::__normal_iterator<std::shared_ptr<drachtio::SipTransport>*, std::vector<std::shared_ptr<drachtio::SipTransport>, std::allocator<std::shared_ptr<drachtio::SipTransport> > > >, __gnu_cxx::__ops::_Iter_comp_iter<drachtio::SipTransport::findAppropriateTransport(char const*, char const*)::<lambda(const std::shared_ptr<drachtio::SipTransport>&, const std::shared_ptr<drachtio::SipTransport>&)> >) (__first=std::shared_ptr (empty) 0x0, __last=std::shared_ptr (count 1, weak 1) 0x16bb360, __comp=...)
    at /usr/include/c++/5/bits/stl_algo.h:1850
#5  0x00000000005a4ff2 in std::__final_insertion_sort<__gnu_cxx::__normal_iterator<std::shared_ptr<drachtio::SipTransport>*, std::vector<std::shared_ptr<drachtio::SipTransport> > >, __gnu_cxx::__ops::_Iter_comp_iter<drachtio::SipTransport::findAppropriateTransport(char const*, char const*)::<lambda(const std::shared_ptr<drachtio::SipTransport>&, const std::shared_ptr<drachtio::SipTransport>&)> > >(__gnu_cxx::__normal_iterator<std::shared_ptr<drachtio::SipTransport>*, std::vector<std::shared_ptr<drachtio::SipTransport>, std::allocator<std::shared_ptr<drachtio::SipTransport> > > >, __gnu_cxx::__normal_iterator<std::shared_ptr<drachtio::SipTransport>*, std::vector<std::shared_ptr<drachtio::SipTransport>, std::allocator<std::shared_ptr<drachtio::SipTransport> > > >, __gnu_cxx::__ops::_Iter_comp_iter<drachtio::SipTransport::findAppropriateTransport(char const*, char const*)::<lambda(const std::shared_ptr<drachtio::SipTransport>&, const std::shared_ptr<drachtio::SipTransport>&)> >) (__first=std::shared_ptr (empty) 0x0, __last=std::shared_ptr (count 1, weak 1) 0x16bb360, __comp=...)
    at /usr/include/c++/5/bits/stl_algo.h:1885
#6  0x00000000005a4b5b in std::__sort<__gnu_cxx::__normal_iterator<std::shared_ptr<drachtio::SipTransport>*, std::vector<std::shared_ptr<drachtio::SipTransport> > >, __gnu_cxx::__ops::_Iter_comp_iter<drachtio::SipTransport::findAppropriateTransport(char const*, char const*)::<lambda(const std::shared_ptr<drachtio::SipTransport>&, const std::shared_ptr<drachtio::SipTransport>&)> > >(__gnu_cxx::__normal_iterator<std::shared_ptr<drachtio::SipTransport>*, std::vector<std::shared_ptr<drachtio::SipTransport>, std::allocator<std::shared_ptr<drachtio::SipTransport> > > >, __gnu_cxx::__normal_iterator<std::shared_ptr<drachtio::SipTransport>*, std::vector<std::shared_ptr<drachtio::SipTransport>, std::allocator<std::shared_ptr<drachtio::SipTransport> > > >, __gnu_cxx::__ops::_Iter_comp_iter<drachtio::SipTransport::findAppropriateTransport(char const*, char const*)::<lambda(const std::shared_ptr<drachtio::SipTransport>&, const std::shared_ptr<drachtio::SipTransport>&)> >) (__first=std::shared_ptr (empty) 0x0, __last=std::shared_ptr (count 1, weak 1) 0x16bb360, __comp=...)
    at /usr/include/c++/5/bits/stl_algo.h:1966
#7  0x00000000005a463a in std::sort<__gnu_cxx::__normal_iterator<std::shared_ptr<drachtio::SipTransport>*, std::vector<std::shared_ptr<drachtio::SipTransport> > >, drachtio::SipTransport::findAppropriateTransport(char const*, char const*)::<lambda(const std::shared_ptr<drachtio::SipTransport>&, const std::shared_ptr<drachtio::SipTransport>&)> >(__gnu_cxx::__normal_iterator<std::shared_ptr<drachtio::SipTransport>*, std::vector<std::shared_ptr<drachtio::SipTransport>, std::allocator<std::shared_ptr<drachtio::SipTransport> > > >, __gnu_cxx::__normal_iterator<std::shared_ptr<drachtio::SipTransport>*, std::vector<std::shared_ptr<drachtio::SipTransport>, std::allocator<std::shared_ptr<drachtio::SipTransport> > > >, drachtio::SipTransport::<lambda(const std::shared_ptr<drachtio::SipTransport>&, const std::shared_ptr<drachtio::SipTransport>&)>) (__first=std::shared_ptr (empty) 0x0,
    __last=std::shared_ptr (count 1, weak 1) 0x16bb360, __comp=...) at /usr/include/c++/5/bits/stl_algo.h:4729
#8  0x00000000005a3012 in drachtio::SipTransport::findAppropriateTransport (remoteHost=0x179fa40 "sip:sip.telviva.com", proto=0x741266 "udp") at ../src/sip-transports.cpp:482
#9  0x0000000000538ebe in drachtio::SipDialogController::doSendRequestOutsideDialog (this=0x16e6100, pData=0x7f18f0006ea8) at ../src/sip-dialog-controller.cpp:456
#10 0x0000000000533537 in (anonymous namespace)::cloneSendSipRequest (p=0x16a57e0, msg=0x7fffb31e40e0, arg=0x7f18f0006ea8) at ../src/sip-dialog-controller.cpp:51
#11 0x0000000000675aef in su_base_port_execute_msgs ()
#12 0x0000000000675828 in su_base_port_getmsgs ()
#13 0x0000000000675c07 in su_base_port_run ()
#14 0x0000000000626758 in su_port_run ()
#15 0x00000000006278a6 in su_root_run ()
#16 0x0000000000425417 in drachtio::DrachtioController::run (this=0x16a57e0) at ../src/controller.cpp:1160
#17 0x0000000000413fc9 in main (argc=4, argv=0x7fffb31e4c98) at ../src/main.cpp:47

digging into the stack frames:

(gdb) frame 0
#0  0x000000000059fe1d in drachtio::SipTransport::isInNetwork (this=0xc0, address=0x7fffb31e3330 "sip.telviva.com") at ../src/sip-transports.cpp:123
123     if (!m_netmask) return false;
(gdb) p m_netmask
Cannot access memory at address 0x150
(gdb) p this
$1 = (const drachtio::SipTransport * const) 0xc0
(gdb) p *this
Cannot access memory at address 0xc0
(gdb) frame 1
#1  0x00000000005a2660 in drachtio::SipTransport::<lambda(const std::shared_ptr<drachtio::SipTransport>&, const std::shared_ptr<drachtio::SipTransport>&)>::operator()(const std::shared_ptr<drachtio::SipTransport> &, const std::shared_ptr<drachtio::SipTransport> &) const (__closure=0x7fffb31e3320, pA=std::shared_ptr (count 2, weak 1) 0x16bc110, pB=<error reading variable: Cannot access memory at address 0x128>)
    at ../src/sip-transports.cpp:466
466       if (pB->isInNetwork(host.c_str())) {
(gdb) p pB
$2 = <error reading variable: Cannot access memory at address 0x128>
(gdb) p *pB
Could not find operator*.
(gdb) p host
$3 = "sip.telviva.com"

davehorton commented 4 years ago

this possibly looks to be a race condition where you get an incoming REGISTER before everything is initialized. Are you able to recreate the crash outside of that condition?

davehorton commented 4 years ago

also, can you provide a log at debug level for both drachtio server and sofia

elbow commented 4 years ago

Traffic comes thick and fast - there are quite a few thousand clients. So if drachtio crashes at a busy time you can be sure that as soon as it reads incoming traffic it will get a packet.

It is curious that it doesn't happen with the local request handler - every time it gets into this cycle of crashes we can change the request handler back to the local one and it comes up no problem.

I'll provide the extra logging etc but bear with me, we need to do some roll-back since we can't complete the move to the new k8s cluster.

Thanks, Steve

elbow commented 4 years ago

For completeness, in another example it didn't crash immediately - responded to 55 registers and then crashed on the next

2020-04-20 05:59:16.885322 Starting drachtio version v0.8.4-2-gb9cf802
...startup logs...
2020-04-20 05:59:16.934375 Starting sofia event loop in main thread: 140495091922752
2020-04-20 05:59:16.934447 tport_type_udp.c:519 tport_udp_error() tport_udp_error: Connection refused (111) [icmp type=3 code=3]
2020-04-20 05:59:16.934488 tport_type_udp.c:524 tport_udp_error()       reported by [127.0.0.1]:0
2020-04-20 05:59:16.934530 nta.c:2867 agent_tp_error() nta_agent: tport: 127.0.0.1:6060: Connection refused
2020-04-20 05:59:17.073057 recv 656 bytes from wss/[105.242.164.9]:49280 at 05:59:17.072802:
REGISTER sip:rtc.telviva.com SIP/2.0
Via: SIP/2.0/WSS totdd2i7688f.invalid;branch=z9hG4bK5669083
Max-Forwards: 69
To: <sip:7642307@rtc.telviva.com>
From: "2307" <sip:7642307@rtc.telviva.com>;tag=bb5k9uq447
Call-ID: h09vii4mg3p9761eujlsu5
CSeq: 1158 REGISTER
X-PushId-Platform: web
X-PushId: 0d386491-360b-4589-a228-a5af7261f7af
Contact: <sip:f1kr7m4k@totdd2i7688f.invalid;transport=ws>;+sip.ice;reg-id=1;+sip.instance="<urn:uuid:4fa5d347-ef52-4ef6-ad15-15a6dd94c044>";expires=600
Expires: 600
Allow: INVITE,ACK,CANCEL,BYE,UPDATE,MESSAGE,OPTIONS,REFER,INFO
Supported: path,gruu,outbound
User-Agent: TelvivaOne web 1.0 000
Content-Length: 0

2020-04-20 05:59:17.074767 RequestHandler::startRequest: sending http POST: http://10.151.192.13:8080/sip4.fibrephone.telviva.com/?method=REGISTER&domain=rtc.telviva.com&protocol=tcp&source_address=105.242.164.9&fromUser=7642307&toUser=7642307&uriUser=&contentType=&uri=sip%3artc.telviva.com
2020-04-20 05:59:17.159073 http 200 response received from server in 0.0828 secs: {"action":"route","data":{"tag":"rtcproxy"}}
2020-04-20 05:59:17.159359 No connected clients found to handle incoming register request
2020-04-20 05:59:17.159464 ClientController::selectClientForTag - no clients registered for tag: rtcproxy
2020-04-20 05:59:17.221003 SipDialogController::addIncomingRequestTransaction - adding transactionId bb6fab1a-1148-40be-bc36-ba878d2c6ffd for irq:0x1a8a7d0
2020-04-20 05:59:17.222293 send 320 bytes to wss/[105.242.164.9]:49280 at 05:59:17.221953:
SIP/2.0 480 Temporarily Unavailable
Via: SIP/2.0/WSS totdd2i7688f.invalid;branch=z9hG4bK5669083;received=105.242.164.9;rport=49280
From: "2307" <sip:7642307@rtc.telviva.com>;tag=bb5k9uq447
To: <sip:7642307@rtc.telviva.com>;tag=4jaFXUp90008j
Call-ID: h09vii4mg3p9761eujlsu5
CSeq: 1158 REGISTER
Content-Length: 0

54 more registers arrive and are responded to

then:

2020-04-20 05:59:29.406623 recv 655 bytes from wss/[129.205.174.139]:56818 at 05:59:29.406433:
REGISTER sip:rtc.telviva.com SIP/2.0
Via: SIP/2.0/WSS 0puo2tutr873.invalid;branch=z9hG4bK1066410
Max-Forwards: 69
To: <sip:9240115@rtc.telviva.com>
From: "115" <sip:9240115@rtc.telviva.com>;tag=4jktn6br06
Call-ID: pglmhgrc6lp4n62a82dl7o
CSeq: 1086 REGISTER
X-PushId-Platform: web
X-PushId: d2f4b2e5-a4b6-4f33-bca0-1bebae00fb9b
Contact: <sip:ma2s3qcg@0puo2tutr873.invalid;transport=ws>;+sip.ice;reg-id=1;+sip.instance="<urn:uuid:215f487d-34d6-41a7-a53f-59abe2e36dea>";expires=600
Expires: 600
Allow: INVITE,ACK,CANCEL,BYE,UPDATE,MESSAGE,OPTIONS,REFER,INFO
Supported: path,gruu,outbound
User-Agent: XxxOne web 1.0 000
Content-Length: 0

2020-04-20 05:59:29.406967 RequestHandler::startRequest: sending http POST: http://10.151.192.13:8080/sip4.fibrephone.xxx.com/?method=REGISTER&domain=rtc.xxx.com&protocol=tcp&source_address=129.205.174.139&fromUser
=9240115&toUser=9240115&uriUser=&contentType=&uri=sip%3artc.xxx.com
2020-04-20 05:59:29.447798 http 200 response received from server in 0.0395 secs: {"action":"route","data":{"tag":"rtcproxy"}}
<<<CRASH>>>
2020-04-20 05:59:29.845084 Starting drachtio version v0.8.4-2-gb9cf802

elbow commented 4 years ago

My colleague reports:

I applied the iptables rules but and waited 5 minutes. By that time rtcproxy had established a connection. As soon as I removed them I got crashes

So from that report it can't be a race during initialisation? - drachtio been running a long time and the client inbound connection all setup. Still crashes.

davehorton commented 4 years ago

agree, I'll wait for the logs ...

davehorton commented 4 years ago

any luck in gathering logs?

elbow commented 4 years ago

Hi Dave,

Because it is disruptive to the service we have to wait until our late evening to do the testing again.

That is in say 6 hours time.

elbow commented 4 years ago

@davehorton You won't be happy with this update. However:

If we run drachtio-server against our remote request-handler with normal logging we get that segfault very often:

Tue 2020-04-21 18:33:04 SAST  15145     0     1  11 * /usr/local/bin/drachtio
Tue 2020-04-21 18:33:05 SAST  15155     0     1  11 * /usr/local/bin/drachtio
Tue 2020-04-21 18:33:09 SAST  15166     0     1  11 * /usr/local/bin/drachtio
Tue 2020-04-21 18:33:33 SAST  15198     0     1  11 * /usr/local/bin/drachtio
Tue 2020-04-21 18:33:33 SAST  15184     0     1  11 * /usr/local/bin/drachtio
Tue 2020-04-21 18:33:35 SAST  15208     0     1  11 * /usr/local/bin/drachtio
Tue 2020-04-21 18:33:52 SAST  15244     0     1  11 * /usr/local/bin/drachtio
Tue 2020-04-21 18:33:52 SAST  15221     0     1  11 * /usr/local/bin/drachtio
Tue 2020-04-21 18:33:53 SAST  15254     0     1  11 * /usr/local/bin/drachtio
Tue 2020-04-21 18:34:21 SAST  15265     0     1  11 * /usr/local/bin/drachtio
Tue 2020-04-21 18:34:21 SAST  15286     0     1  11 * /usr/local/bin/drachtio
Tue 2020-04-21 18:34:23 SAST  15297     0     1  11 * /usr/local/bin/drachtio

If we push the sofia logging up to 9 and switch drachtio logging to debug:

        <!-- sofia (internal sip library) log level, from 0 (minimal) to 9 (verbose) -->
        <sofia-loglevel>9</sofia-loglevel>

        <!-- general process log level: notice, warning, error, info, debug.  Default: info -->
        <loglevel>debug</loglevel>

Then we don't get any crashes.

All I can guess is that the time taken to do the logging is enough to slow something down so that the race is avoided.

So we can't give you debug level logs of the crash since there is no crash if the logging is set to debug level.

We restarted Drachtio a bunch of times in case it was intermittent and cou;dn't get any crash to happen.

How do you want us to proceed?

Thanks, Steve

davehorton commented 4 years ago

could you try with drachtio at debug and sofia at log level 3?

davehorton commented 4 years ago

also, your logs are showing an error right after startup, before any calls:

2020-04-20 20:31:38.385746 Starting sofia event loop in main thread: 139745329432384
2020-04-20 20:31:38.385807 tport_type_udp.c:519 tport_udp_error() tport_udp_error: Connection refused (111) [icmp type=3 code=3]
2020-04-20 20:31:38.385841 tport_type_udp.c:524 tport_udp_error()       reported by [127.0.0.1]:0

I'd like to get to the bottom of this as well. Can you just show me a log with debug level (drachtio and sofia) after startup? In that log it should not be necessary to receive any calls

elbow commented 4 years ago

drachtio.log

This is from startup for the first few seconds. As it start a bunch of packets arrive and I cut the log after the first few replies go back.

My drachtio config binds to "*" like so, which I guess is where the loopback address come from:

       <contacts>
                <contact>sip:*:6060;transport=udp,tcp</contact>
                <contact>sips:*:6061;transport=tls</contact>
                <contact>sips:111.222.159:4433;transport=wss</contact>
        </contacts>

Steve

drachtio / drachtio-server

Non-stop segfaults when we switch to a non-local #124