Kurento / bugtracker

[ARCHIVED] Contents migrated to monorepo: https://github.com/Kurento/kurento
46 stars 10 forks source link

libnice crashes in socket code: g_socket_send_message (socket=0x0) #247

Closed j1elo closed 5 years ago

j1elo commented 6 years ago

Mail thread: https://groups.google.com/d/topic/kurento/_rf1ANq5Cm8/discussion Related: https://github.com/Kurento/bugtracker/issues/208

Analysis It seems that the crash is happening in the 3rd-party library libnice, during socket access:

#5  0x00007f87c1272044 in g_socket_send_message (socket=0x0, address=address@entry=0x0, vectors=0x7f875c3f9840, num_vectors=2, messages=messages@entry=0x0, num_messages=num_messages@entry=0, flags=0, cancellable=0x0, error=0x7f86dc6951b0) at /build/glib2.0-prJhLS/glib2.0-2.48.2/./gio/gsocket.c:4255
#6  0x00007f87bedb8cbf in socket_send_message (sock=sock@entry=0x7f87940596e0, message=message@entry=0x7f86dc6952a0, reliable=reliable@entry=0) at tcp-bsd.c:306
#7  0x00007f87bedb8f3b in socket_send_messages (sock=0x7f87940596e0, to=<optimized out>, messages=<optimized out>, n_messages=1) at tcp-bsd.c:360
#8  0x00007f87beda0ae9 in nice_agent_send_messages_nonblocking_internal (agent=0x7f875c104650 [NiceAgent], stream_id=<optimized out>, component_id=<optimized out>, messages=0x7f875c3bfdb0, n_messages=5, allow_partial=0, error=0x0) at agent.c:4748

Crash occurs because of g_socket_send_message (socket=0x0). For some reason, g_socket_send_message is called with NULL socket, so it breaks in line 4255 of glib/gsocket.c.

Seems that libnice is famous for having random crashes in socket code. I could find more information about similar issues here and here; in the second link, they mention that some of the crashes were fixed in 2016 with this commit, so I assume that the fix is included in the libnice release version 0.1.14, on Apr 3, 2017.

Kurento uses libnice 0.1.13, so it suffers the bug.

Solution Proposed solution is to update the version of libnice, however this is a big update, and we tried it already in the past but several integration tests broke. Currently, updating libnice to 0.1.14 will break some use cases due to the changes it brings. So updating this library requires careful planning and lots of tests.

Update 2018-08-22 This is a bug in libnice, reported 9 months ago and currently tracked here: libnice issue #33 segfault in g_socket_send_message

We still don't have enough information about the cause of the issue, and have been unable to reproduce it in debugging sessions (as it seems to be a crash that only happens with some amount of load on production servers). As it is, the best option is to show interest in solving that bug in the upstream project's bug tracker. They have all the context and the know-how about libnice's code base and will probably be able to solve the issue.

jmaiquez commented 6 years ago

Hi Puneet,

Thanks so much for this. We are still trying to figure out how to reproduce the crash reliably using our video collaboration application. Once we can semi-reliably reproduce it, we will use your patch to see if it becomes less (or not at all) reproducible.

I will report my findings here when I have them.

Thanks again, Jorge

guzmand commented 6 years ago

Hi,

I would like to know if this would be released in Kurento 6.8? and also if the Kurento 6.8 version would be considered a stable version?

Thanks,

Dario

j1elo commented 6 years ago

@guzmand Please wait until official announcement for Kurento 6.8.1, where there are already some minor problems fixed (like the KMS truncating the error log on each restart) and improvements added to aid in working with the libnice issues (libnice logging integrated with Kurento logging, all will be logged now to the same output files). I'll write an official announcement on the Kurento mail list and the homepage blog (kurento.org)

@puneet89 I have used your test but wasn't able to reproduce the crash, after several repetitions libnice never crashed, so I couldn't see either the assert or the socket issues... I have tried several versions of KMS:

I think these issues are strongly related to the actual network configuration where the failure happens, and is not so much related to load, because it seems strange that some are able to easily reproduce the issue with their systems, and some other people have full system load and never see the issue...

j1elo commented 5 years ago

Latest Kurento Media Server 6.8.1 brings some performance improvements that were merged into upstream libnice. We've load tested this version and couldn't make libnice crash.

Our load test consists of 13 rooms with 7 browser participants on each room. This means a total of 91 participants:

If the crash from this report still happens in your systems, please provide a detailed explanation of the network topology (firewalls? using TURN? using TCP vs. UDP streams?) and organization of participants (all in same room? multiple rooms? etc).

pabloFuente commented 5 years ago

Worth noticing that this load test has been executed in a AWS machine with 4 cores and 8GB of RAM.

Kukunin commented 5 years ago

@j1elo what great news, thanks! a small question: were you able to reproduce the crash earlier with the described test environment?

puneet89 commented 5 years ago

@j1elo Hi,

I will pick up this activity again in next week then will report if i can still reproduce the crash with latest build 6.8.1. Regards Puneet Singh

j1elo commented 5 years ago

@puneet89 thank you; also please remember to update Olivier in libnice's issue #33

@Kukunin Not really, we finished development of our load tests just last week. We'll be doing various testing in the coming days.

American-Ledger commented 5 years ago

I can reproduce this on a fresh AWS instance of 6.8.1 every time as follows:

  1. Use one client of Tutorial Recorder in Chrome Browser.
  2. Modify index.js to ignore all IceCandidates that are not TCP.
  3. Load tutorial in browser. Press Play. Stream will appear on right via TCP.
  4. Press F5 to break the connection... Boom! Kurento crashes with: Segmentation fault (thread 140198302856960, pid 8089)
    • Every time.
puneet89 commented 5 years ago

@j1elo this issue is coming frequently with KMS 6.8.1 ( no performance test) ,even in normal testing this issue is coming . for issue https://gitlab.freedesktop.org/libnice/libnice/issues/33 i will update Olivier soon .

pkg Info: root@imp-desk-101:/var/log/kurento-media-server# dpkg -l | grep kms ii kms-core 6.8.1.xenial.20181023170759.64317be amd64 Kurento Core module ii kms-elements 6.8.1.xenial.20181023171349.0f655c6 amd64 Kurento Elements module ii kms-filters 6.8.1.xenial.20181023172029.83f02d4 amd64 Kurento Filters module ii kms-jsonrpc 6.8.1.xenial.20181023170316.9ad5ec7 amd64 Kurento JSON-RPC library ii kmsjsoncpp 1.6.3.xenial.20181023152331.d78deb7 amd64 Kurento jsoncpp library root@imp-desk-101:/var/log/kurento-media-server# dpkg -l | grep libnice ii libnice10:amd64 0.1.15-1kurento1 amd64 ICE library (shared library)

Attached error.log file for KMS 6.8.1. errors.log

Stack trace: [g_socket_send_message] /usr/lib/x86_64-linux-gnu/libgio-2.0.so.0:0x7B044 [nice_output_stream_new] /usr/lib/x86_64-linux-gnu/libnice.so.10:0x2769E [nice_output_stream_new] /usr/lib/x86_64-linux-gnu/libnice.so.10:0x27813 [nice_agent_recv_nonblocking] /usr/lib/x86_64-linux-gnu/libnice.so.10:0x11879 [gst_nice_src_get_type] /usr/lib/x86_64-linux-gnu/gstreamer-1.5/libgstnice15.so:0x3902 [gst_nice_sink_get_type] /usr/lib/x86_64-linux-gnu/gstreamer-1.5/libgstnice15.so:0x4203 [gst_base_sink_do_preroll] /usr/lib/x86_64-linux-gnu/libgstbase-1.5.so.0:0x2A1B2 [gst_base_sink_do_preroll] /usr/lib/x86_64-linux-gnu/libgstbase-1.5.so.0:0x2B620 [gst_flow_get_name] /usr/lib/x86_64-linux-gnu/libgstreamer-1.5.so.0:0x6E5CF [gst_pad_push] /usr/lib/x86_64-linux-gnu/libgstreamer-1.5.so.0:0x76533 [gst_proxy_pad_chain_default] /usr/lib/x86_64-linux-gnu/libgstreamer-1.5.so.0:0x5F5E3 [gst_flow_get_name] /usr/lib/x86_64-linux-gnu/libgstreamer-1.5.so.0:0x6E5CF [gst_pad_push] /usr/lib/x86_64-linux-gnu/libgstreamer-1.5.so.0:0x76533 0x1B48D at /usr/lib/x86_64-linux-gnu/gstreamer-1.5/libgstcoreelements.so [gst_flow_get_name] /usr/lib/x86_64-linux-gnu/libgstreamer-1.5.so.0:0x6E5CF [gst_pad_push] /usr/lib/x86_64-linux-gnu/libgstreamer-1.5.so.0:0x76533 0:00:00.610831422  8757 0x226eaa0 WARN

micaelgallego commented 5 years ago

@j1elo is in vacancy right now. It will take care of this issue as soon as he comes (in two days).

Having a way to reproduce the crash we will be able to fix it very soon.

chevonc commented 5 years ago

This is currently blocking us from launching a feature with Kurento in production. Any ETA on what the turn around time will be once he's back? Is this a difficult fix?

micaelgallego commented 5 years ago

We will work hard on having it fixed ASAP. Maybe at the beginning of the next week.

chevonc commented 5 years ago

Thanks for the quick response @micaelgallego. Eagerly awaiting the fix, let me know if can use a nightly build or something right away

micaelgallego commented 5 years ago

@chevonc we will notify when probable fix is in nightly.

Are you able to reproduce the crash easily?

chevonc commented 5 years ago

@micaelgallego yes I can. Can I work with you to get it fixed? Will do w/e necessary :)

chevonc commented 5 years ago

FYI - looks like Olivier was nice enough to push another set of fixes. I'm willing to work with whoever to get this verified/merged.

https://gitlab.freedesktop.org/libnice/libnice/issues/33#note_80203

Currently trying to figure out how to test out a patched lib with his changes if anyone has pointers

chevonc commented 5 years ago

Validated this fix no longer causes the crash 🍾 . If we can merge/get this a released patched version of Kurento that'd be awesome. Currently just running with src-built patched version of the libnice

j1elo commented 5 years ago

Tested with the procedure suggested by @American-Ledger and it works. Curiously enough, it didn't work with the simpler Hello World tutorial. After some investigation I found the reason: when the page is closed / reloaded, the recorder tutorials abort abruptly by deleting the client objects, instead of doing a controlled stop. This is actually a bug in the tutorial code ;-)

But funnily enough this "bug" does actually trigger the libnice crash; a controlled release of resources from the application wouldn't cause the crash...

I've imported this behavior to a customized version of kurento-hello-world (Java tutorial). This can be used to easily reproduce the crash by anyone interested (you need to install Java JDK and Maven, like for any of the other Kurento Java tutorials).

I've tested latest commits from libnice PR 15 and the issue doesn't happen any more with this test application. So I've merged these patches into our fork of libnice and packages are already available in Kurento's pre-release repository.

Please test with the pre-release version of Kurento and let us know if these patches do actually fix the crashes!

puneet89 commented 5 years ago

@j1elo Thanks ! I have upgraded all my setups with above kurento build, now will do all kind of sanity test on same .

puneet89 commented 5 years ago

@j1elo with latest dev repo i am getting following error on Kurento when more than 12 user joining conference.

(kurento-media-server:1251): GStreamer-CRITICAL **: gst_mini_object_unlock: assertion 'state >= SHARE_ONE' failed

(kurento-media-server:1251): GStreamer-CRITICAL **: gst_mini_object_unlock: assertion 'state >= SHARE_ONE' failed

(kurento-media-server:1251): GStreamer-CRITICAL **: gst_mini_object_unlock: assertion 'state >= SHARE_ONE' failed

(kurento-media-server:1251): GStreamer-CRITICAL **: gst_mini_object_unlock: assertion 'state >= SHARE_ONE' failed

(kurento-media-server:1251): GStreamer-CRITICAL **: gst_mini_object_unlock: assertion 'state >= SHARE_ONE' failed

(kurento-media-server:1251): GStreamer-CRITICAL **: gst_mini_object_unlock: assertion 'state >= SHARE_ONE' failed

(kurento-media-server:1251): GStreamer-CRITICAL **: gst_mini_object_unlock: assertion 'state >= SHARE_ONE' failed

(kurento-media-server:1251): GStreamer-CRITICAL **: gst_mini_object_unlock: assertion 'state >= SHARE_ONE' failed

(kurento-media-server:1251): GStreamer-CRITICAL **: gst_mini_object_unlock: assertion 'state >= SHARE_ONE' failed

(kurento-media-server:1251): GStreamer-CRITICAL **: gst_mini_object_unlock: assertion 'state >= SHARE_ONE' failed

(kurento-media-server:1251): GStreamer-CRITICAL **: gst_mini_object_unlock: assertion 'state >= SHARE_ONE' failed

(kurento-media-server:1251): GStreamer-CRITICAL **: gst_mini_object_unlock: assertion 'state >= SHARE_ONE' failed

(kurento-media-server:1251): GStreamer-CRITICAL **: gst_mini_object_unlock: assertion 'state >= SHARE_ONE' failed

(kurento-media-server:1251): GStreamer-CRITICAL **: gst_mini_object_unlock: assertion 'state >= SHARE_ONE' failed

(kurento-media-server:1251): GStreamer-CRITICAL **: gst_mini_object_unlock: assertion 'state >= SHARE_ONE' failed

(kurento-media-server:1251): GStreamer-CRITICAL **: gst_mini_object_unlock: assertion 'state >= SHARE_ONE' failed

(kurento-media-server:1251): GStreamer-CRITICAL **: gst_mini_object_unlock: assertion 'state >= SHARE_ONE' failed

puneet89 commented 5 years ago

I am raising different issue for this GStreamer error , as libnice issue is closed and working.

j1elo commented 5 years ago

That assertion is happening at this point in GStreamer code.

The docs of g_return_if_fail provide a way to debug this: you would need to follow these steps:

  1. Install KMS and the debug symbols, as explained here.
  2. Set an environment variable: G_DEBUG=fatal-criticals.
  3. Run KMS with GDB, until the same assertion happens.
  4. Generate a backtrace.
# Install KMS and debug symbols
apt-get update && apt-get install [...]

$ export G_DEBUG=fatal-criticals
$ export GST_DEBUG="3,Kurento*:4,kms*:4"
$ gdb /usr/bin/kurento-media-server
(gdb) run
# Now run until the assertion happens, GDB should break at that point
(gdb) backtrace

Then copy the backtrace so we can see it and know where it comes from. Please use Markdown formatting when you copy stuff from logs or from a terminal... (use ``` for quoting blocks)

puneet89 commented 5 years ago

@j1elo okay thanks i will take all logs and will share with you.

j1elo commented 5 years ago

Closing as it seems to be fixed since upstream merged their Merge Request !15

Fix release: Kurento 6.9.0

erantoker commented 5 years ago

we can still reproduce it, please check below link:

https://groups.google.com/forum/#!topic/kurento/exdEbg2GM9g