Open elbow opened 4 years ago
This is interesting and I should like to see the entire drachtio log leading up to the crash if possible.
I agree that a client disconnecting should not cause an assert, and that is not actually why it asserted. It asserted because, for each client connection, there are a couple of pieces of information that should either all be there or all be absent, and for some reason in this case some were there and some were missing. I need to examine the events that led up to this from the server side, if possible.
Hi Dave,
Thanks for the reply - I'll send the full log by email. Drachtio was restarted at 5am (to help with a memory leak), so it's from then to this crash.
Thanks, Steve
Something strange was done by the client app in this case. I'm still tracking down how this led to the server asserting, but here is the sequence:
syNzg0MjM~rp~i7papflflnesa8mpfbea
, CSeq: 25312804) That's the weird part. It sent a BYE and then immediately tried to reINVITE.
We get a 200 OK from the far end for both the BYE and the reINVITE. We don't respond with ACK to the 200 OK to the INVITE (since we flushed the dialog after sending BYE) but continue to get retransmitted 200 OKs. Then there is the assertion and the crash.
It looks like a bit of a race condition. On the A leg (the above was the B leg) I see the INVITE, and at 12:22:31 we get a re-INVITE followed immediately by a BYE, both from the caller, with the BYE arriving before we have answered the re-INVITE. The app then sends the INVITE followed immediately by the BYE. In this case, what should happen, I suppose, is that the far end, having received the BYE, should respond with a non-success response to the INVITE, but instead it responds 200 OK.
It seems like I need to address this race condition in the drachtio server in some fashion, probably by doing an ACK-BYE if the reINVITE is answered with a 200 OK.
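For illustration, here is a minimal sketch of that ACK-then-BYE idea. It is written in Python purely for readability and is not drachtio code; the `DialogTable` class, its fields, and its method names are all hypothetical. The assumption is that the server remembers which Call-IDs it has already sent a BYE on, so a late 200 OK to a re-INVITE can still be ACKed (which stops the retransmissions) and then answered with a BYE.

```python
from dataclasses import dataclass, field


@dataclass
class DialogTable:
    """Hypothetical bookkeeping for the re-INVITE / BYE race (not drachtio's API)."""
    active: set = field(default_factory=set)      # Call-IDs with a live dialog
    terminated: set = field(default_factory=set)  # Call-IDs we already sent a BYE on

    def send_bye(self, call_id):
        # Sending a BYE tears the dialog down, but we keep a note of the Call-ID.
        self.active.discard(call_id)
        self.terminated.add(call_id)

    def on_invite_200(self, call_id):
        """Return the requests to send in reaction to a 200 OK to an INVITE."""
        if call_id in self.active:
            return ["ACK"]            # normal case: just acknowledge
        if call_id in self.terminated:
            return ["ACK", "BYE"]     # race case: acknowledge, then hang up
        return []                     # unknown dialog: nothing sensible to send


table = DialogTable(active={"call-1"})
normal = table.on_invite_200("call-1")
table.send_bye("call-1")
raced = table.on_invite_200("call-1")
print(normal, raced)
```

In a real server the `terminated` set would need to be expired after a while, otherwise it grows without bound.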
Hi Dave,
I've got another race problem for you too - I'm not sure if it's related, but maybe you should look at it before trying to fix this one.
I have the issue on my own project. Dan was going to look over it, but tomorrow I'll try to post it for you to take a look at, if you don't mind.
OK, if you can send the same info as before (log / stack trace if available), that would be great.
Sure. I'm at my desk now so I'll get all the details together.
Hi,
So I checked the IP of the client phone at the time and this is a cellular mobile IP address.
Unfortunately it's now too long ago to retrieve the logs from the client software - I can only go back a week.
The call seemed to start fine - answered at 12:20:05.
Then at 12:22:31.899791 there is an "a=sendonly" re-INVITE
and immediately (while the b2bua is still dealing with the re-INVITE) we receive a BYE: 12:22:31.900400
What would cause this to happen? I checked for another incoming Telviva call and there wasn't one.
What if the caller received a cellular call and answered it? That would account for the on-hold. But if that's what happened, then something went wrong, since the call was immediately BYEd.
In any event drachtio now has the re-invite in flight through the b2bua stuff, and also a BYE which needs the same.
The BYE immediately deletes the dialog, and b2bua sends it up to the B side.
Meanwhile the same thing happens to the re-INVITE.
When the OK to the reinvite comes back from the B-side the matching dialog can't be found.
B-side sends the OK repeatedly and fruitlessly looking for the ACK it wants.
But all the fun is over by 12:23:03.435147 - that's the last OK received and the B side timer expires and it gives up.
I checked the captured SIP trace from Telviva (the B side system) and indeed the last OK was sent at that time.
So why do we suddenly have, about 76 seconds (roughly 1 minute and 16 seconds) later, the log entry:
2020-09-10 12:24:19.183518 SipDialogController::processResponse - adding dialog id: syNzg0MjM~rp~i7papflflnesa8mpfbea;from-tag=pfqhd64hq9
?
The callid there is the one that was used on the B-side of that call. There was no packet arriving from the B-side, so what provoked it to suddenly log that?
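To pin down the size of that gap, the two timestamps quoted above can simply be subtracted. A small Python check, using only the times from the log (both on 2020-09-10, so the date is left out):

```python
from datetime import datetime

FMT = "%H:%M:%S.%f"

# Last 200 OK received from the B side, and the unexpected "adding dialog id" entry.
last_ok = datetime.strptime("12:23:03.435147", FMT)
log_entry = datetime.strptime("12:24:19.183518", FMT)

gap = (log_entry - last_ok).total_seconds()
print(f"{gap:.3f} s")  # prints "75.748 s"
```

So the entry was logged a little under 76 seconds after the last OK arrived.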
Here's how it looked from the point of view of the Telviva system:
(etc etc with the OKs)
The left-hand side is Drachtio, in the middle is a Kamailio proxy, and on the right is an Asterisk 13 system.
I agree that Asterisk didn't behave too well; processing of a BYE is async, so the channel can be dying but not dead yet, which is why the BYE and the re-INVITE were both OKed.
A suggestion, for what it's worth, which is probably not much:
Why not keep your dialogs around for say a minute after you would otherwise delete them in order to allow any stray packets to be processed?
Clients can have unreliable connections - for us they are WebRTC, which is TCP. So if there is a big wedge-up of connectivity, then a whole lot of packets can turn up at once.
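To make the suggestion concrete, here is a rough sketch of what I mean by keeping dialogs around. It is Python and purely illustrative, not how drachtio stores dialogs; the `LingeringDialogs` class, its method names, and the 60-second default are all assumptions of mine. Terminating a dialog moves it to an "expiring" map instead of deleting it, so a stray response can still be matched until the grace period runs out.

```python
import time


class LingeringDialogs:
    """Hypothetical dialog store that keeps terminated dialogs for a grace period."""

    def __init__(self, linger_seconds=60.0, clock=time.monotonic):
        self._linger = linger_seconds
        self._clock = clock      # injectable so the behaviour can be tested
        self._live = {}          # dialog id -> dialog state
        self._expiring = {}      # dialog id -> (dialog state, deadline)

    def add(self, dialog_id, state):
        self._live[dialog_id] = state

    def terminate(self, dialog_id):
        # Instead of deleting outright, park the dialog until its deadline.
        state = self._live.pop(dialog_id, None)
        if state is not None:
            self._expiring[dialog_id] = (state, self._clock() + self._linger)

    def find(self, dialog_id):
        # Live dialogs first, then terminated ones still inside the grace period.
        self._purge()
        if dialog_id in self._live:
            return self._live[dialog_id]
        entry = self._expiring.get(dialog_id)
        return entry[0] if entry else None

    def _purge(self):
        now = self._clock()
        expired = [d for d, (_, deadline) in self._expiring.items() if deadline <= now]
        for d in expired:
            del self._expiring[d]


# Usage with a fake clock: the dialog is still findable 30 s after the BYE,
# and gone once the 60 s grace period has passed.
now = [0.0]
dialogs = LingeringDialogs(60.0, clock=lambda: now[0])
dialogs.add("b-leg", "confirmed")
dialogs.terminate("b-leg")
now[0] = 30.0
found_during_grace = dialogs.find("b-leg")
now[0] = 61.0
found_after_grace = dialogs.find("b-leg")
```

The cost is some extra memory for each recently ended call, which may matter on a busy server.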
Consider this scenario:
(We don't currently re-invite the VoIP call in this case, so this is a possible reason why an on-hold and a BYE happened at the same time - we will try to reproduce it to see.)
On the other example: I invited you to the project where it is issue 124
I have removed the assertion in a recent commit on the develop branch, since this condition can in fact happen in the race condition case where a re-invite and a bye are received at more or less the same time.
Hi Dave,
I experienced drachtio quitting on an assert in drachtio::ClientController::addDialogForTransaction
Last message logged by drachtio before the restart was:
Backtrace from the core dump:
I grepped for the transaction id in the log and find:
So drachtio seems to be saying that my client disconnected.
My client is a K8s pod. When I check, it didn't initiate the disconnect, but it also reported it:
I suppose something between drachtio and the k8s pod might have gone wrong but when the client restarted all carried on fine.
Can drachtio not just 500 reply to the incoming requests or drop replies rather than crashing?
Thanks, Steve