Closed j1elo closed 5 years ago
Hi Puneet,
Thanks so much for this. We are still trying to figure out how to reproduce the crash reliably using our video collaboration application. Once we can semi-reliably reproduce it, we will use your patch to see if it becomes less (or not at all) reproducible.
I will report my findings here when I have them.
Thanks again, Jorge
Hi,
I would like to know if this would be released in Kurento 6.8? and also if the Kurento 6.8 version would be considered a stable version?
Thanks,
Dario
@guzmand Please wait until official announcement for Kurento 6.8.1, where there are already some minor problems fixed (like the KMS truncating the error log on each restart) and improvements added to aid in working with the libnice issues (libnice logging integrated with Kurento logging, all will be logged now to the same output files). I'll write an official announcement on the Kurento mail list and the homepage blog (kurento.org)
@puneet89 I have used your test but wasn't able to reproduce the crash, after several repetitions libnice never crashed, so I couldn't see either the assert or the socket issues... I have tried several versions of KMS:
I think these issues are strongly related to the actual network configuration where the failure happens, and is not so much related to load, because it seems strange that some are able to easily reproduce the issue with their systems, and some other people have full system load and never see the issue...
Latest Kurento Media Server 6.8.1 brings some performance improvements that were merged into upstream libnice. We've load tested this version and couldn't make libnice crash.
Our load test consists of 13 rooms with 7 browser participants on each room. This means a total of 91 participants:
If the crash from this report still happens in your systems, please provide a detailed explanation of the network topology (firewalls? using TURN? using TCP vs. UDP streams?) and organization of participants (all in same room? multiple rooms? etc).
Worth noticing that this load test has been executed in a AWS machine with 4 cores and 8GB of RAM.
@j1elo what great news, thanks! a small question: were you able to reproduce the crash earlier with the described test environment?
@j1elo Hi,
I will pick up this activity again in next week then will report if i can still reproduce the crash with latest build 6.8.1. Regards Puneet Singh
@puneet89 thank you; also please remember to update Olivier in libnice's issue #33
@Kukunin Not really, we finished development of our load tests just last week. We'll be doing various testing in the coming days.
I can reproduce this on a fresh AWS instance of 6.8.1 every time as follows:
@j1elo this issue is coming frequently with KMS 6.8.1 ( no performance test) ,even in normal testing this issue is coming . for issue https://gitlab.freedesktop.org/libnice/libnice/issues/33 i will update Olivier soon .
pkg Info: root@imp-desk-101:/var/log/kurento-media-server# dpkg -l | grep kms ii kms-core 6.8.1.xenial.20181023170759.64317be amd64 Kurento Core module ii kms-elements 6.8.1.xenial.20181023171349.0f655c6 amd64 Kurento Elements module ii kms-filters 6.8.1.xenial.20181023172029.83f02d4 amd64 Kurento Filters module ii kms-jsonrpc 6.8.1.xenial.20181023170316.9ad5ec7 amd64 Kurento JSON-RPC library ii kmsjsoncpp 1.6.3.xenial.20181023152331.d78deb7 amd64 Kurento jsoncpp library root@imp-desk-101:/var/log/kurento-media-server# dpkg -l | grep libnice ii libnice10:amd64 0.1.15-1kurento1 amd64 ICE library (shared library)
Attached error.log file for KMS 6.8.1. errors.log
Stack trace: [34;1m[g_socket_send_message][0m /usr/lib/x86_64-linux-gnu/libgio-2.0.so.0[32;1m:0x7B044[0m [34;1m[nice_output_stream_new][0m /usr/lib/x86_64-linux-gnu/libnice.so.10[32;1m:0x2769E[0m [34;1m[nice_output_stream_new][0m /usr/lib/x86_64-linux-gnu/libnice.so.10[32;1m:0x27813[0m [34;1m[nice_agent_recv_nonblocking][0m /usr/lib/x86_64-linux-gnu/libnice.so.10[32;1m:0x11879[0m [34;1m[gst_nice_src_get_type][0m /usr/lib/x86_64-linux-gnu/gstreamer-1.5/libgstnice15.so[32;1m:0x3902[0m [34;1m[gst_nice_sink_get_type][0m /usr/lib/x86_64-linux-gnu/gstreamer-1.5/libgstnice15.so[32;1m:0x4203[0m [34;1m[gst_base_sink_do_preroll][0m /usr/lib/x86_64-linux-gnu/libgstbase-1.5.so.0[32;1m:0x2A1B2[0m [34;1m[gst_base_sink_do_preroll][0m /usr/lib/x86_64-linux-gnu/libgstbase-1.5.so.0[32;1m:0x2B620[0m [34;1m[gst_flow_get_name][0m /usr/lib/x86_64-linux-gnu/libgstreamer-1.5.so.0[32;1m:0x6E5CF[0m [34;1m[gst_pad_push][0m /usr/lib/x86_64-linux-gnu/libgstreamer-1.5.so.0[32;1m:0x76533[0m [34;1m[gst_proxy_pad_chain_default][0m /usr/lib/x86_64-linux-gnu/libgstreamer-1.5.so.0[32;1m:0x5F5E3[0m [34;1m[gst_flow_get_name][0m /usr/lib/x86_64-linux-gnu/libgstreamer-1.5.so.0[32;1m:0x6E5CF[0m [34;1m[gst_pad_push][0m /usr/lib/x86_64-linux-gnu/libgstreamer-1.5.so.0[32;1m:0x76533[0m [32;1m0x1B48D[0m at /usr/lib/x86_64-linux-gnu/gstreamer-1.5/libgstcoreelements.so [34;1m[gst_flow_get_name][0m /usr/lib/x86_64-linux-gnu/libgstreamer-1.5.so.0[32;1m:0x6E5CF[0m [34;1m[gst_pad_push][0m /usr/lib/x86_64-linux-gnu/libgstreamer-1.5.so.0[32;1m:0x76533[0m 0:00:00.610831422 [334m 8757[00m 0x226eaa0 [33;01mWARN
@j1elo is in vacancy right now. It will take care of this issue as soon as he comes (in two days).
Having a way to reproduce the crash we will be able to fix it very soon.
This is currently blocking us from launching a feature with Kurento in production. Any ETA on what the turn around time will be once he's back? Is this a difficult fix?
We will work hard on having it fixed ASAP. Maybe at the beginning of the next week.
Thanks for the quick response @micaelgallego. Eagerly awaiting the fix, let me know if can use a nightly build or something right away
@chevonc we will notify when probable fix is in nightly.
Are you able to reproduce the crash easily?
@micaelgallego yes I can. Can I work with you to get it fixed? Will do w/e necessary :)
FYI - looks like Olivier was nice enough to push another set of fixes. I'm willing to work with whoever to get this verified/merged.
https://gitlab.freedesktop.org/libnice/libnice/issues/33#note_80203
Currently trying to figure out how to test out a patched lib with his changes if anyone has pointers
Validated this fix no longer causes the crash 🍾 . If we can merge/get this a released patched version of Kurento that'd be awesome. Currently just running with src-built patched version of the libnice
Tested with the procedure suggested by @American-Ledger and it works. Curiously enough, it didn't work with the simpler Hello World tutorial. After some investigation I found the reason: when the page is closed / reloaded, the recorder tutorials abort abruptly by deleting the client objects, instead of doing a controlled stop. This is actually a bug in the tutorial code ;-)
But funnily enough this "bug" does actually trigger the libnice crash; a controlled release of resources from the application wouldn't cause the crash...
I've imported this behavior to a customized version of kurento-hello-world (Java tutorial). This can be used to easily reproduce the crash by anyone interested (you need to install Java JDK and Maven, like for any of the other Kurento Java tutorials).
I've tested latest commits from libnice PR 15 and the issue doesn't happen any more with this test application. So I've merged these patches into our fork of libnice and packages are already available in Kurento's pre-release repository.
Please test with the pre-release version of Kurento and let us know if these patches do actually fix the crashes!
@j1elo Thanks ! I have upgraded all my setups with above kurento build, now will do all kind of sanity test on same .
@j1elo with latest dev repo i am getting following error on Kurento when more than 12 user joining conference.
(kurento-media-server:1251): GStreamer-CRITICAL **: gst_mini_object_unlock: assertion 'state >= SHARE_ONE' failed
(kurento-media-server:1251): GStreamer-CRITICAL **: gst_mini_object_unlock: assertion 'state >= SHARE_ONE' failed
(kurento-media-server:1251): GStreamer-CRITICAL **: gst_mini_object_unlock: assertion 'state >= SHARE_ONE' failed
(kurento-media-server:1251): GStreamer-CRITICAL **: gst_mini_object_unlock: assertion 'state >= SHARE_ONE' failed
(kurento-media-server:1251): GStreamer-CRITICAL **: gst_mini_object_unlock: assertion 'state >= SHARE_ONE' failed
(kurento-media-server:1251): GStreamer-CRITICAL **: gst_mini_object_unlock: assertion 'state >= SHARE_ONE' failed
(kurento-media-server:1251): GStreamer-CRITICAL **: gst_mini_object_unlock: assertion 'state >= SHARE_ONE' failed
(kurento-media-server:1251): GStreamer-CRITICAL **: gst_mini_object_unlock: assertion 'state >= SHARE_ONE' failed
(kurento-media-server:1251): GStreamer-CRITICAL **: gst_mini_object_unlock: assertion 'state >= SHARE_ONE' failed
(kurento-media-server:1251): GStreamer-CRITICAL **: gst_mini_object_unlock: assertion 'state >= SHARE_ONE' failed
(kurento-media-server:1251): GStreamer-CRITICAL **: gst_mini_object_unlock: assertion 'state >= SHARE_ONE' failed
(kurento-media-server:1251): GStreamer-CRITICAL **: gst_mini_object_unlock: assertion 'state >= SHARE_ONE' failed
(kurento-media-server:1251): GStreamer-CRITICAL **: gst_mini_object_unlock: assertion 'state >= SHARE_ONE' failed
(kurento-media-server:1251): GStreamer-CRITICAL **: gst_mini_object_unlock: assertion 'state >= SHARE_ONE' failed
(kurento-media-server:1251): GStreamer-CRITICAL **: gst_mini_object_unlock: assertion 'state >= SHARE_ONE' failed
(kurento-media-server:1251): GStreamer-CRITICAL **: gst_mini_object_unlock: assertion 'state >= SHARE_ONE' failed
(kurento-media-server:1251): GStreamer-CRITICAL **: gst_mini_object_unlock: assertion 'state >= SHARE_ONE' failed
I am raising different issue for this GStreamer error , as libnice issue is closed and working.
That assertion is happening at this point in GStreamer code.
The docs of g_return_if_fail provide a way to debug this: you would need to follow these steps:
G_DEBUG=fatal-criticals
.# Install KMS and debug symbols
apt-get update && apt-get install [...]
$ export G_DEBUG=fatal-criticals
$ export GST_DEBUG="3,Kurento*:4,kms*:4"
$ gdb /usr/bin/kurento-media-server
(gdb) run
# Now run until the assertion happens, GDB should break at that point
(gdb) backtrace
Then copy the backtrace so we can see it and know where it comes from. Please use Markdown formatting when you copy stuff from logs or from a terminal... (use ``` for quoting blocks)
@j1elo okay thanks i will take all logs and will share with you.
Closing as it seems to be fixed since upstream merged their Merge Request !15
Fix release: Kurento 6.9.0
we can still reproduce it, please check below link:
Mail thread: https://groups.google.com/d/topic/kurento/_rf1ANq5Cm8/discussion Related: https://github.com/Kurento/bugtracker/issues/208
Analysis It seems that the crash is happening in the 3rd-party library libnice, during socket access:
Crash occurs because of g_socket_send_message (socket=0x0). For some reason,
g_socket_send_message
is called with NULL socket, so it breaks in line 4255 of glib/gsocket.c.Seems that libnice is famous for having random crashes in socket code. I could find more information about similar issues here and here; in the second link, they mention that some of the crashes were fixed in 2016 with this commit, so I assume that the fix is included in the libnice release version 0.1.14, on Apr 3, 2017.
Kurento uses libnice 0.1.13, so it suffers the bug.
Solution Proposed solution is to update the version of libnice, however this is a big update, and we tried it already in the past but several integration tests broke. Currently, updating libnice to 0.1.14 will break some use cases due to the changes it brings. So updating this library requires careful planning and lots of tests.
Update 2018-08-22 This is a bug in libnice, reported 9 months ago and currently tracked here: libnice issue #33 segfault in g_socket_send_message
We still don't have enough information about the cause of the issue, and have been unable to reproduce it in debugging sessions (as it seems to be a crash that only happens with some amount of load on production servers). As it is, the best option is to show interest in solving that bug in the upstream project's bug tracker. They have all the context and the know-how about libnice's code base and will probably be able to solve the issue.