MaxIasat opened this issue 4 years ago
For 6.14.0 we worked on a performance issue that would cause CPU exhaustion when an Endpoint tried to disconnect from N Endpoints, for somewhat big N (depends on the machine, but somewhere around N > 20 we could observe the issues in our case).
With some optimizations the CPU usage decreased significantly, we talked a bit about this in the 6.14.0 blog post.
Your issue seems very similar to the one we solved. Maybe there are still some cases when CPU exhaustion occurs... Can you observe if the 200% CPU spikes are happening when one sender PlayerEndpoint is being disconnected from a big number of receiver WebRtcEndpoints?
Another idea is to avoid the delete-and-create of new PlayerEndpoints each time the test video is repeated. What happens if, instead, you call setPosition(0) before the file finishes playing? (e.g. if the video file is 30 seconds long, call player.setPosition(0) every 25 seconds).
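The rewind idea above could be sketched roughly like this. Note that the `Player` interface here is a hypothetical stand-in for the kurento-client `PlayerEndpoint` (which is what would really receive the `setPosition(0)` call); only the scheduling pattern is shown:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class RewindLoop {

    /** Hypothetical stand-in for org.kurento.client.PlayerEndpoint's setPosition. */
    public interface Player {
        void setPosition(long positionMs);
    }

    /**
     * Rewinds the player to the start every periodMs milliseconds,
     * instead of releasing and recreating the PlayerEndpoint each cycle.
     */
    public static ScheduledExecutorService startRewindLoop(Player player, long periodMs) {
        ScheduledExecutorService exec = Executors.newSingleThreadScheduledExecutor();
        exec.scheduleAtFixedRate(() -> player.setPosition(0),
                periodMs, periodMs, TimeUnit.MILLISECONDS);
        return exec;
    }
}
```

With the real kurento-client, `player` would be the live `PlayerEndpoint` and the period would be something like 25 000 ms for a 30-second file, so the rewind always happens before EndOfStream.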
PS: the -asan images are made to run Kurento with AddressSanitizer, to find memory corruption bugs. This doesn't look like a memory corruption issue, so I don't think we'll get anything useful from it. A better tool to use would be Callgrind, to generate profiling output and then inspect it with KCacheGrind, but we'll leave that for later if we don't find a solution.
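For reference, a minimal sketch of such a profiling run, assuming valgrind is installed on the host and kurento-media-server is at its default install path (the exact binary path and output file name are placeholders):

```shell
# Run the media server under Callgrind (expect a heavy slowdown while profiling).
valgrind --tool=callgrind \
         --callgrind-out-file=kms-callgrind.out \
         /usr/bin/kurento-media-server

# Afterwards, inspect the profile interactively:
kcachegrind kms-callgrind.out
```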
Hi Juan,
thank you for your reply.
I'll try to explain my application in a more precise way.
Thank you and best regards Maxim
Hi All,
I would like to give an update on my observations:
Conclusion:
So this seems to come from GStreamer, which makes it a bit harder to debug. However, I'll add this issue to the list of GStreamer 1.14-related topics, so we know that it needs to be looked at when working on Kurento 7.0.0 with GStreamer 1.14.
The HTTP plugin should work fine now that we're going to release version 6.15.0, so I'd recommend updating when it gets published in the coming weeks.
Hi All,
I have installed 6.15.0 on our system and let it run for many weeks. Unfortunately, this version crashed after ca. 6 days! I observed such crashes many times (segmentation fault).
I also noticed increasing memory consumption by kms. I suppose there is a memory leak somewhere, because in all cases no free memory was left. So, unfortunately, 6.15.0 is not usable for me.
In general, the situation with this version is much better. The previous version, 6.14.0, crashed after a couple of hours. So I think the fixes are working, but memory is leaking somewhere, and that causes the crash of the system.
The experimental kms 7.0.0 version is working more stably: it ran for more than 18 days without any problems. So my question is also: when is the final release of kms 7.0.0 with the new GStreamer? I suppose many issues come from the old GStreamer branch.
Thank you and best regards
I'm sorry, but there is no ETA for the 7.0.0 version.
We are a small team working on Kurento and OpenVidu, and right now we are focused on providing scalability and fault tolerance to OpenVidu.
Regards
@MaxIasat Although I said that this doesn't look like a memory corruption bug, just in case, could you confirm whether there were any further logs after the -asan image crashed?
Long term behaviour kurento 6.14.0-asan: So, for further analysis (maybe memory leaks) I used the asan version of Kurento as a Docker container with the same configuration described above. I was very surprised, but this asan version was running for more than 3 days on my system. But then the Kurento Media Server crashed! Here are the last log lines:
{"log":"63:45:14.991258108 9 0x61d0090e0630 WARN dtlsconnection gstdtlsconnection.c:312:handle_timeout:\u003cGstDtlsConnection@0x621006c38460\u003e handling timeout failed\n","stream":"stderr","time":"2020-07-30T22:24:26.902977789Z"}
{"log":"63:45:25.128369440 9 0x61d0090e04f0 WARN dtlsconnection gstdtlsconnection.c:312:handle_timeout:\u003cGstDtlsConnection@0x621006a43560\u003e handling timeout failed\n","stream":"stderr","time":"2020-07-30T22:24:26.902996642Z"}
{"log":"63:45:25.128247109 9 0x61d0090e0540 WARN dtlsconnection gstdtlsconnection.c:312:handle_timeout:\u003cGstDtlsConnection@0x621006d06ae0\u003e handling timeout failed\n","stream":"stderr","time":"2020-07-30T22:24:26.903015638Z"}
{"log":"63:45:40.736984796 9 0x61d008b848a0 WARN dtlsconnection gstdtlsconnection.c:312:handle_timeout:\u003cGstDtlsConnection@0x621006c381e0\u003e handling timeout failed\n","stream":"stderr","time":"2020-07-30T22:24:26.903034851Z"}
{"log":"63:45:40.736981220 9 0x61d0090e0400 WARN dtlsconnection gstdtlsconnection.c:312:handle_timeout:\u003cGstDtlsConnection@0x621006b70560\u003e handling timeout failed\n","stream":"stderr","time":"2020-07-30T22:24:26.903053751Z"}
{"log":"63:45:45.413900652 9 0x619009775890 WARN KurentoWorkerPool WorkerPool.cpp:122:checkThreads: Worker thread pool is lagging! (CPU exhausted?)\n","stream":"stderr","time":"2020-07-30T22:24:26.903083716Z"}
{"log":"63:45:47.777772174 9 0x61d008f360f0 WARN dtlsconnection gstdtlsconnection.c:312:handle_timeout:\u003cGstDtlsConnection@0x6210075a2470\u003e handling timeout failed\n","stream":"stderr","time":"2020-07-30T22:24:26.903102773Z"}
{"log":"63:45:47.777753526 9 0x61d0090e04f0 WARN dtlsconnection gstdtlsconnection.c:312:handle_timeout:\u003cGstDtlsConnection@0x621006ed67a0\u003e handling timeout failed\n","stream":"stderr","time":"2020-07-30T22:24:26.90313725Z"}
{"log":"/entrypoint.sh: line 84: 9 Killed LD_PRELOAD=\"$PWD/libasan.so\" ASAN_OPTIONS='detect_odr_violation=0 detect_leaks=1 detect_invalid_pointer_pairs=2 strict_string_checks=1 detect_stack_use_after_return=0 check_initialization_order=1 strict_init_order=1' LD_LIBRARY_PATH=\"$PWD\" ./kurento-media-server --modules-path=\"$PWD\" --gst-plugin-path=\"$PWD\"\n","stream":"stderr","time":"2020-07-30T22:24:27.647452597Z"}
This behaviour I could reproduce twice.
Normally when the asan image crashes, it should print some final logs with all the memory information and details about where the crash happened. I could review those, just in case there is something to be found.
Prerequisites
[X] I have read the SUPPORT document
[X] I have checked the Troubleshooting Guide
[X] I have tested with the latest version of Kurento
Issue description
Application: My application is similar to the “one-to-many” tutorial, but instead of the presenter I have used a PlayerEndpoint to stream video files (mkv files). As soon as one client is connected to the application server, the streaming is started. Other clients are connected to the running stream. Some business logic selects, in an infinite loop, the video files and triggers the creation of a new PlayerEndpoint, which then plays the new video file. Before that, the old PlayerEndpoint is released. The existing WebRtcEndpoints are connected to the new PlayerEndpoint. All clients show the same video file.
So, I have one pipeline, one PlayerEndpoint, and many WebRtcEndpoints connected (in my case max. 5 clients). The files are provided to the PlayerEndpoint with a file:// URI.
The video files have a duration of 30 seconds; every 30 s the business logic triggers a new video file. I have tested the Kurento 6.14.0 and 6.14.0-asan versions, installed in Docker on a Debian 8 machine with a 4-core CPU and 4 GB RAM. The mkv video files have 720p resolution and are coded as H.264 with 25 fps. I have disabled the use of the VP8 codec in Kurento, so only H.264 can be chosen; in this case the video files are not transcoded. I also log this information in my application by subscribing to the MediaTranscodingStateChange event. Audio is disabled; the video files don't have any audio tracks.
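(For context: disabling VP8 so that only H.264 is negotiated is typically done in Kurento's SdpEndpoint.conf.json, by default under /etc/kurento/modules/kurento/. A minimal sketch of the relevant fragment, assuming the stock file layout, with VP8 removed from the codec list:)

```json
{
  "videoCodecs": [
    { "name": "H264/90000" }
  ]
}
```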
General behaviour: The general behaviour is as expected. As soon as one client connects to the server, the streaming starts, and every 30 seconds I see a new video file. Other connected clients immediately see the running stream. The stream runs synchronously. From a functionality point of view, it works great.
Long term behaviour kurento 6.14.0: In the above-mentioned configuration, the system runs stably for a couple of hours (ca. 8 h). At the beginning I noticed the CPU usage of kurento-media-server hovering around 90% (with the top tool). After some time I observed CPU spikes (about 200%). Finally, after ca. 8 h the CPU usage was huge and the system didn't respond; only a reboot helped. I also noticed memory usage increasing over time (with the top tool).
Long term behaviour kurento 6.14.0-asan: So, for further analysis (maybe memory leaks) I used the asan version of Kurento as a Docker container with the same configuration described above. I was very surprised, but this asan version was running for more than 3 days on my system. But then the Kurento Media Server crashed! Here are the last log lines:
I could reproduce this behaviour twice. I can provide the container with the crashed Kurento inside as a tar file. Can you please help with analysing this issue? Thank you and best regards
Context
I am trying to implement a video player which streams a different video every 30 seconds.
How to reproduce?
Do a long-term test with a PlayerEndpoint and many clients, streaming different files in an infinite loop. See the description above.
Expected & current behavior
Stable system without crashes.
(Optional) Possible solution
I suspect some kind of race condition blocks the released version of Kurento. In the asan version, I think the timing is slightly different, and this may be why Kurento runs longer.
INFO about Kurento Media Server
INFO about your Application Server
INFO about end-user clients
INFO about your environment
No STUN or TURN server is configured. The system runs in the local network with DHCP.
I have used the following docker run command:
Commands output