Closed brstgt closed 4 years ago
I had this crash many times today; it seems to happen whenever the server gets a bit busier. Unfortunately, I don't have more details right now.
If this error happens, as an emergency measure it's possible to downgrade libnice to the version available in the Kurento Release repo (libnice version 0.1.13.1.xenial~20170725160546.81.eebfdab in Xenial). The old, "stable" 0.1.13 version carries a custom modification from a previous Kurento developer that removes the assertion (it replaces "g_assert" with the more permissive "g_warn_if_fail"). This modification was lost when updating to the latest libnice upstream, as I'm finding out now. I'll make sure this makes its way back to the experimental versions!
If possible, before downgrading please try to get libnice debug logs (you'll need to launch KMS from the console; the "service init" won't log libnice messages yet). Commands such as these should do it:
{
sudo service kurento-media-server stop
source /etc/default/kurento-media-server
export GST_DEBUG="${GST_DEBUG:-3},kmsiceniceagent:5,kmswebrtcsession:5,webrtcendpoint:4"
export G_MESSAGES_DEBUG="libnice,libnice-stun"
export NICE_DEBUG="$G_MESSAGES_DEBUG"
/usr/bin/kurento-media-server
}
libnice, when running with debug enabled, outputs a message such as this:
"Agent %p : stream %u component %u STATE-CHANGE %s -> %s."
(with placeholders filled with actual values). We'd benefit from knowing what's the state change that's causing this crash.
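To make those state changes easier to spot in a long log, something like the following could be used. It is only a sketch: the function name summarize_state_changes is made up for illustration, and it assumes the log lines contain the "STATE-CHANGE old -> new." text exactly as in the format string quoted above.

```shell
# Summarize libnice STATE-CHANGE messages read from standard input.
# Prints one line per distinct transition: "<count> <old> -> <new>".
# Assumes the "STATE-CHANGE %s -> %s." format quoted above.
summarize_state_changes() {
    # Keep only the "<old> -> <new>" part of each matching line.
    sed -n 's/.*STATE-CHANGE \([^ ]*\) -> \([^ .]*\).*/\1 -> \2/p' |
        sort |
        uniq -c |
        # uniq pads the count with leading spaces; strip them.
        sed 's/^ *//'
}
```

For example, if the console output was saved to a file (kms.log is a hypothetical name), running `summarize_state_changes < kms.log` lists each transition with how often it occurred; the last transitions logged before the crash are still best read directly from the log.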
For completeness, if possible, also repeat the same procedure with the older version of libnice; it won't crash, but will still print a warning with the exact same format. We need to know which state changes libnice does not expect (see here) so that we can report the issue to the libnice maintainer (I'll probably open a bug report on your behalf, unless you prefer to do it yourself).
Hi Juan,
We were able to catch this error twice on the same server:
KMS: 6.7.2~19.g181284d
libnice: 0.1.15.xenial~20180709152002.83.28531a4
I've attached the logs captured when launching KMS on the console.
We were able to reproduce this with a user that was inside a corporate network in France. The crash occurred when that user stopped publishing his stream.
He was nice enough to help us reproduce this, but we no longer have access to him; he was an external speaker for our customer.
It seems that the nature of corporate networks might have something to do with this particular bug.
"it won't crash, but will still print some warning"
We did not see any logs when using the same KMS version with 0.1.13.1.xenial~20170725160546.81.eebfdab
But there was definitely weird behavior going on:
So 0.1.13 still has the issue (as you mentioned), but it simply does not crash Kurento. It's a step better than a crash, but still problematic for our customer.
I really hope this helps!
Best regards, Jorge
@jmaiquez I've checked your linked debug logs, but it turns out that the reason for the crash in those two cases is not the assert problem described in this issue report and in your groups message; instead, KMS crashed due to the well-known bug in the libnice socket code (Kurento report, libnice report).
The key to telling them apart is that the assert has this stack trace:
/lib/x86_64-linux-gnu/libc.so.6:0x35428
[abort]
/lib/x86_64-linux-gnu/libc.so.6:0x3702A
[g_assertion_message]
/build/glib2.0-b4FPyK/glib2.0-2.48.2/./glib/gtestutils.c:2429
[g_assertion_message_expr]
/build/glib2.0-b4FPyK/glib2.0-2.48.2/./glib/gtestutils.c:2453
[agent_signal_component_state_change]
/opt/libnice/agent/agent.c:2353
And the socket crash has this other one:
[g_socket_send_message]
/usr/lib/x86_64-linux-gnu/libgio-2.0.so.0:0x7B044
[socket_send_message]
/opt/libnice/socket/tcp-bsd.c:309
[socket_send_messages]
/opt/libnice/socket/tcp-bsd.c:362
[nice_agent_send_messages_nonblocking_internal]
/opt/libnice/agent/agent.c:4833
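As a rough aid for telling the two crashes apart, a stack trace can be checked for the frame names shown above. This is only a sketch: classify_kms_crash is a made-up helper name, and it assumes the trace is available as plain text containing the same function names as in these two examples.

```shell
# Classify a KMS crash from a stack trace read on standard input.
# Prints "assert", "socket", or "unknown".
# The frame names are the ones in the two traces quoted above.
classify_kms_crash() {
    trace="$(cat)"
    case "$trace" in
        # Invalid state transition caught by the assertion.
        *agent_signal_component_state_change*) echo "assert" ;;
        # Crash in the libnice TCP socket code.
        *socket_send_message*) echo "socket" ;;
        *) echo "unknown" ;;
    esac
}
```

With a trace saved to a file (trace.txt is a hypothetical name), `classify_kms_crash < trace.txt` prints which of the two crashes it looks like.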
Please keep monitoring your servers to see if we can catch the invalid state transition (remember that in the old Release version of libnice, 0.1.13, the assert was replaced by a warning, so you won't find a Kurento crash when the invalid state change occurs).
I'll attach your debug logs to the libnice socket crash issue report; maybe they will be helpful for the libnice team.
Hi Juan,
Understood. Thanks for the clarification. Our customer has been running webinars all week, and we have also been testing. Here are our observations from this week.
Monday, 17 Sep
Tuesday, 18 Sep
Tuesday, 18 Sep
Wednesday, 19 Sep
LOW LOAD SITUATIONS
HIGHER LOAD SITUATIONS
SUMMARY
CONCLUSIONS
We have been so swamped with crisis management that we have not been able to put together an effective load testing platform, but (ignoring the special client/network configurations for a while) it seems to me that the key to reproducing the socket crash is load testing. How much load have you guys been able to put on KMS while running a session for over an hour?
Best regards, Jorge
We have replaced the undesired assert() in our fork with a saner warning that won't break if the condition applies. Please check my comment in the other issue.
By the way, it turns out that the assert() is actually part of the original libnice sources!
If you ask me, using assertions in production code is a Very Bad practice, to say the least. OK for debug builds, but totally unacceptable for releases. That's my personal policy, anyway. We could probably file a bug report in libnice to ask for removing that assert...
The latest Pre-Release builds don't have the assertion any more, and can be installed with a simple apt-get install kurento-media-server.
You can check whether the latest version of libnice is installed on your system by running this command:
apt-cache policy libnice10
and the result looks like this:
libnice10:
Installed: 0.1.15-1kurento1~20181018[...]
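If this check needs to run on several servers, the installed version can be pulled out of that output with a small helper. This is a sketch only: installed_version is a made-up name, and it assumes the "Installed:" line looks like the one shown above.

```shell
# Print the value of the "Installed:" field from "apt-cache policy"
# output read on standard input.
installed_version() {
    sed -n 's/^ *Installed: *//p' | head -n 1
}
```

Usage would be `apt-cache policy libnice10 | installed_version`; the printed string can then be compared against the version shown above.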
This version will stay in Pre-Release for some time to give people time to try it, and will be promoted to Release if no regressions or blocking bugs are found.
Kurento 6.14 uses libnice 0.1.17, which has received a ton of fixes and improvements. So this is probably fixed by now.
KMS Version: dev-master, 2018-06-29
What steps will reproduce the problem? Happened in production, can't tell how to repro
What is the expected result? No Crash
What happens instead? Crash
Does it happen with one of the tutorials? No