drachtio / drachtio-server

A SIP call processing server that can be controlled via nodejs applications
https://drachtio.org
MIT License

Drachtio quit on assert after reporting a client disconnected and then a message arrived for a transaction being processed on that client #140

Open elbow opened 4 years ago

elbow commented 4 years ago

Hi Dave,

I experienced drachtio quitting on an assert in drachtio::ClientController::addDialogForTransaction

Last message logged by drachtio before the restart was:

2020-09-10 12:24:19.183611 ClientController::addDialogForTransaction - transaction id 1372cb9c-581c-46a5-81d7-38a595532109 not found

Backtrace from the core dump:

#1  0x00007f342383402a in __GI_abort () at abort.c:89
#2  0x00007f342382abd7 in __assert_fail_base (fmt=<optimized out>, assertion=assertion@entry=0x7349ec "false", file=file@entry=0x7349cf "../src/client-controller.cpp", line=line@entry=406, 
    function=function@entry=0x7364c0 <drachtio::ClientController::addDialogForTransaction(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)::__PRETTY_FUNCTION__> "void drachtio::ClientController::addDialogForTransaction(const string&, const string&)") at assert.c:92
#3  0x00007f342382ac82 in __GI___assert_fail (assertion=0x7349ec "false", file=0x7349cf "../src/client-controller.cpp", line=406, 
    function=0x7364c0 <drachtio::ClientController::addDialogForTransaction(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)::__PRETTY_FUNCTION__> "void drachtio::ClientController::addDialogForTransaction(const string&, const string&)") at assert.c:101
#4  0x00000000004c891e in drachtio::ClientController::addDialogForTransaction (this=0x2a0a9d0, transactionId="1372cb9c-581c-46a5-81d7-38a595532109", 
    dialogId="syNzg0MjM~rp~i7papflflnesa8mpfbea;from-tag=pfqhd64hq9") at ../src/client-controller.cpp:406
#5  0x000000000054f20f in drachtio::SipDialogController::addDialog (this=0x2a46e00, dlg=std::shared_ptr (count 5, weak 1) 0x36c9ca80) at ../src/sip-dialog-controller.hpp:301
#6  0x000000000053d563 in drachtio::SipDialogController::processResponseOutsideDialog (this=0x2a46e00, orq=0x4c092e60, sip=0x4fc88258) at ../src/sip-dialog-controller.cpp:771
#7  0x0000000000533fd1 in (anonymous namespace)::response_to_request_outside_dialog (p=0x2a062a0, request=0x4c092e60, sip=0x4fc88258) at ../src/sip-dialog-controller.cpp:79
#8  0x00000000005fc591 in outgoing_recv ()
#9  0x00000000005e80b0 in agent_recv_response ()
#10 0x00000000005e48eb in agent_recv_message ()
#11 0x000000000063d100 in tport_base_deliver ()
#12 0x000000000063d043 in tport_deliver ()
#13 0x000000000063ca92 in tport_parse ()
#14 0x000000000063c75a in tport_recv_event ()
#15 0x000000000063c341 in tport_base_wakeup ()
#16 0x000000000063c0f5 in tport_wakeup_pri ()
#17 0x00000000006770eb in su_epoll_port_wait_events ()
#18 0x0000000000675cb1 in su_base_port_run ()
#19 0x000000000062675e in su_port_run ()
#20 0x00000000006278ac in su_root_run ()
#21 0x0000000000425417 in drachtio::DrachtioController::run (this=0x2a062a0) at ../src/controller.cpp:1160
#22 0x0000000000413fc9 in main (argc=4, argv=0x7ffc7da61048) at ../src/main.cpp:47

I grepped for the transaction id in the log and found:

2020-09-10 12:24:10.370914 ClientController::route_response_inside_transaction - client managing transaction has disconnected: 1372cb9c-581c-46a5-81d7-38a595532109
2020-09-10 12:24:19.183611 ClientController::addDialogForTransaction - transaction id 1372cb9c-581c-46a5-81d7-38a595532109 not found

So drachtio seems to be saying that my client disconnected.

My client is a K8S pod. When I checked, it hadn't initiated the disconnect, but it also reported one:

2020-09-10T10:24:21.351Z drachtio:agent wp#on close xxx.:9022
2020-09-10T10:24:21.352Z drachtio:socket _initServer: removed socket: 9022:undefined, count now: 0
2020-09-10T10:24:21.611Z drachtio:agent wp#on error - Error: connect ECONNREFUSED xxx.xxx.xxx.xxx:9022 xxx:9022
[2020-09-10 10:24:21.611 +0000] ERROR (24 on rtcproxy-6b4c8b9985-dnxct): Error connecting to drachtio server: connect ECONNREFUSED xxx.xxx.xxx.xxx:9022
[2020-09-10 10:24:21.613 +0000] INFO  (24 on rtcproxy-6b4c8b9985-dnxct): exitting: deleting callcount row
npm ERR! code ELIFECYCLE
npm ERR! errno 1
npm ERR! drachtio-rtpengine-webrtcproxy@0.1.1 start: `node app.js`
npm ERR! Exit status 1
npm ERR! 
npm ERR! Failed at the drachtio-rtpengine-webrtcproxy@0.1.1 start script.
npm ERR! This is probably not a problem with npm. There is likely additional logging output above.

npm ERR! A complete log of this run can be found in:
npm ERR!     /root/.npm/_logs/2020-09-10T10_24_21_629Z-debug.log

I suppose something between drachtio and the K8S pod might have gone wrong, but when the client restarted everything carried on fine.

Can drachtio not just reply 500 to the incoming requests, or drop the replies, rather than crashing?

Thanks, Steve

davehorton commented 4 years ago

This is interesting, and I would like to see the entire drachtio log leading up to the crash if possible.

I agree that a client disconnecting should not cause an assert, and that is not actually why it asserted. It asserted because, for each client connection, there are a couple of pieces of information that should either all be present or all be absent; for some reason, in this case, some were there and some were missing. I need to examine the events that led up to this from the server side, if possible.
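To make the failure mode concrete, here is a minimal JavaScript sketch of the non-asserting alternative. This is purely illustrative, not drachtio's actual C++ code; the class and map names are hypothetical.

```javascript
// Hypothetical sketch: when a client disconnects, its transaction entry
// may already be gone by the time a dialog is added for that transaction.
// Instead of asserting, log and drop the dialog gracefully.
class ClientController {
  constructor() {
    this.transactions = new Map(); // transactionId -> client connection
    this.dialogs = new Map();      // dialogId -> client connection
  }

  addDialogForTransaction(transactionId, dialogId) {
    const client = this.transactions.get(transactionId);
    if (!client) {
      // Previously this path hit assert(false); here we just warn and bail.
      console.warn(`transaction id ${transactionId} not found; dropping dialog ${dialogId}`);
      return false;
    }
    this.dialogs.set(dialogId, client);
    return true;
  }
}
```

With this shape, a late response for a disconnected client's transaction becomes a logged no-op rather than a process abort.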

elbow commented 4 years ago

Hi Dave,

Thanks for the reply - I'll send the full log by email. Drachtio was restarted at 5am (to help with a memory leak), so the log runs from then up to this crash.

Thanks, Steve

davehorton commented 4 years ago

Something strange was done by the client app in this case. I'm still tracking down how this led to the server asserting, but here is the sequence:

It sent a BYE and then immediately tried to reINVITE; that's the weird part.

We get a 200 OK from the far end for both the BYE and the reINVITE. We don't respond with an ACK to the 200 OK to the reINVITE (since we flushed the dialog after sending the BYE), but continue to get retransmitted 200 OKs. Then comes the assertion and the crash.

davehorton commented 4 years ago

It looks like a bit of a race condition. On the A leg (the above was the B leg) I see the INVITE, and at 12:22:31 we get a re-INVITE followed immediately by a BYE, both from the caller, with the BYE arriving before we have answered the re-INVITE. The app then sends the INVITE followed immediately by the BYE. In this case, what should happen, I suppose, is that the far end, having received the BYE, should respond with a non-success response to the INVITE; instead it responds 200 OK.

It seems like I need to address this race condition in the drachtio server in some fashion, probably by doing an ACK-BYE if the reINVITE is answered with a 200 OK.
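The proposed ACK-BYE handling might look roughly like this. A JavaScript sketch for illustration only; the function name, the `dialogs` set, and the `send` callback are assumptions, not drachtio's actual API.

```javascript
// Sketch: when a 200 OK arrives for a reINVITE whose dialog has already
// been destroyed by an earlier BYE, absorb it with ACK followed by BYE
// instead of leaving the far end retransmitting its 200 OK.
// `dialogs` is a Set of live dialog ids; `send` transmits a SIP message.
function handleInviteResponse(dialogs, dialogId, status, send) {
  if (status !== 200) return 'ignored';
  if (dialogs.has(dialogId)) {
    send('ACK');            // normal case: dialog alive, just ACK the 200
    return 'acked';
  }
  // Dialog already flushed by a BYE: ACK stops the far end's 200 OK
  // retransmission timer, then BYE tears down the zombie leg.
  send('ACK');
  send('BYE');
  return 'ack-bye';
}
```

The key point is that the ACK alone silences the retransmissions, while the follow-up BYE ensures the far end does not keep a half-established dialog around.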

elbow commented 4 years ago

Hi Dave,

I've got another race problem for you too - I'm not sure if it's related, but maybe you should look at it before trying to fix this one.

I have the issue on my own project; Dan was going to look it over, but tomorrow I'll try to post it for you to take a look, if you don't mind.

davehorton commented 4 years ago

OK - if you can send the same info as before (log / stack trace, if available), that would be great.

elbow commented 4 years ago

> ok if you can same info as before (log / stack trace if available) that would be great

Sure. I'm at my desk now so I'll get all the details together.

elbow commented 4 years ago

Hi,

So I checked the IP of the client phone at the time, and it is a cellular mobile IP address.

Unfortunately it's now too long ago to retrieve the logs from the client software; I can only go back a week.

The call seemed to start fine - answered at 12:20:05.

Then at 12:22:31.899791 there is an "a=sendonly" re-INVITE,

and immediately (while the B2BUA is still dealing with the re-INVITE) we receive a BYE at 12:22:31.900400.

What would cause this to happen? I checked for another incoming Telviva call and there wasn't one.

Perhaps the caller received a cellular call and answered it, which would account for the on-hold. But if that's what happened, then something went wrong, since the call was immediately BYEd.

In any event, drachtio now has the re-INVITE in flight through the B2BUA machinery, and also a BYE which needs the same handling.

The BYE immediately deletes the dialog, and the B2BUA sends it on to the B side.

Meanwhile the same thing happens to the re-INVITE.

When the OK to the re-INVITE comes back from the B side, the matching dialog can't be found.

The B side sends the OK repeatedly, fruitlessly looking for the ACK it wants.

But all the fun is over by 12:23:03.435147: that's the last OK received; the B side's timer expires and it gives up. (The roughly 32 seconds of retransmissions matches RFC 3261's 64*T1 limit for an unacknowledged 2xx.)

I checked the captured SIP trace from Telviva (the B side system) and indeed the last OK was sent at that time.

So why, 76 seconds (1 minute and 15 seconds) later, do we suddenly have the log entry:

2020-09-10 12:24:19.183518 SipDialogController::processResponse - adding dialog id: syNzg0MjM~rp~i7papflflnesa8mpfbea;from-tag=pfqhd64hq9

?

The Call-ID there is the one that was used on the B side of that call. No packet arrived from the B side, so what provoked it to suddenly log that?

Here's how it looked from the point of view of the Telviva system:

[screenshot: SIP ladder trace from the Telviva capture]

(etc etc with the OKs)

left hand side is Drachtio, in the middle a Kamailio proxy and on the right an Asterisk 13 system.

I agree that Asterisk didn't behave too well; processing of a BYE is async, so the channel can be dying but not yet dead, which is why both the BYE and the re-INVITE were answered with 200 OK.

elbow commented 4 years ago

A suggestion, for what it's worth (which is probably not much):

Why not keep your dialogs around for, say, a minute after you would otherwise delete them, in order to allow any stray packets to be processed?

Clients can have unreliable connections; for us they are WebRTC, which runs over TCP. So if connectivity wedges up for a while, a whole lot of packets can turn up at once.
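The linger-before-delete idea could be sketched like this. A JavaScript illustration under stated assumptions: the store, its method names, and the lazy-purge strategy are all hypothetical, not drachtio's design.

```javascript
// Sketch: instead of erasing a dialog immediately on BYE, mark it
// terminated and keep it findable for a grace period, so late-arriving
// responses and retransmissions still match something.
class DialogStore {
  constructor(lingerMs = 60000) {
    this.lingerMs = lingerMs;
    this.dialogs = new Map(); // dialogId -> { state, expiresAt }
  }

  add(dialogId) {
    this.dialogs.set(dialogId, { state: 'active', expiresAt: Infinity });
  }

  terminate(dialogId, now = Date.now()) {
    const d = this.dialogs.get(dialogId);
    if (d) {
      d.state = 'terminated';             // tombstone, not deletion
      d.expiresAt = now + this.lingerMs;  // purge after the grace period
    }
  }

  find(dialogId, now = Date.now()) {
    const d = this.dialogs.get(dialogId);
    if (!d) return null;
    if (now > d.expiresAt) {
      this.dialogs.delete(dialogId);      // lazily purge expired tombstones
      return null;
    }
    return d;
  }
}
```

A stray 200 OK arriving within the linger window would then find a terminated dialog (and could be ACK-BYEd) instead of tripping a not-found path.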

Consider this scenario:

(We don't currently re-INVITE the VoIP call in this case, so this is a possible reason why an on-hold and a BYE happened at the same time; we will try to reproduce to see.)

elbow commented 4 years ago

On the other example: I invited you to the project, where it is issue 124.

davehorton commented 4 years ago

I have removed the assertion in a recent commit on the develop branch, since this condition can in fact happen in the race where a re-INVITE and a BYE are received at more or less the same time.