AirenSoft / OvenMediaEngine

OvenMediaEngine (OME) is a Sub-Second Latency Live Streaming Server with Large-Scale and High-Definition. #WebRTC #LLHLS
https://airensoft.com/ome.html
GNU Affero General Public License v3.0

SIGSEGV occurs in Edge's dynamic application #1209

Closed web3omega closed 1 year ago

web3omega commented 1 year ago

Describe the bug

I'm running an edge server on AWS with Amazon Linux 2. It crashes randomly and intermittently with SIGABRT and SIGSEGV errors. The Edge server is connected to 4 Origin servers, and I mainly see the issue when more than one Origin server is streaming.

To Reproduce

Steps to reproduce the behavior:

  1. Configure Server.xml as an edge server with 4 OVT origins (everything else left at the default config)
  2. Use a basic x264 encoder on the Origin servers
  3. Start client streaming
  4. It works fine for several minutes, then crashes with SIGABRT/SIGSEGV at random intervals, typically several minutes after start.
  5. See error

Logs

[2023-05-02 09:32:57.169] W [StreamMotor:251] ov.Queue | queue.h:268  | [0x7f741c77c858] #default#stream2/m9-MR-Outbound size has exceeded the threshold: queue: 135221, threshold: 100, peak: 135221
OvenMediaEngine: ../nptl/pthread_mutex_lock.c:81: __pthread_mutex_lock: Assertion `mutex->__data.__owner == 0' failed.
[2023-05-02 09:32:57.569] C [StreamMotor:251] OvenMediaEngine | signals.cpp:120  | OME received signal 6 (SIGABRT), interrupt.

and

[2023-05-02 06:56:18.918] I [SPRtcSig-t3333:10] Publisher | application.cpp:184  | LLHLS Publisher Application has deleted [#default#stream1] application
[2023-05-02 06:56:19.037] C [StreamMotor:141] OvenMediaEngine | signals.cpp:120  | OME received signal 11 (SIGSEGV), interrupt.

Server (please complete the following information):

Additional context

I had similar problems when using Ubuntu 20.04 on AWS, so I switched to Amazon Linux 2. That didn't solve the issue, so it seems to be OME-related. Also, the Origin servers, with ~6 streams each, have run fine on their own for several weeks with more than 20 simultaneous users streaming.

Please let me know if further info is needed.

getroot commented 1 year ago

Please upload the Server.xml of both the origin and the edge, along with the full log files.

web3omega commented 1 year ago

Edge: Server.txt Origin: Origin.txt

Here are the config files. It will take me a bit to get the logs from AWS.

web3omega commented 1 year ago

log2.csv log1.csv

Here are the two log files, one for each crash. I changed the hostnames in all files because I don't want them to be publicly available.

(I now see that these aren't the complete logs; I'll fetch the full log files later today.)

Keukhan commented 1 year ago

@web3omega

Thanks for reporting :) I have one question: are all the SIGABRT/SIGSEGV errors occurring on the edge server?

web3omega commented 1 year ago

Yes, only on the edge server. All 4 of my origin servers have been running for 4 weeks without a crash or restart, so I'm very happy with their stability. I'm just trying to set up a simple edge server that only passes the streams through.

Keukhan commented 1 year ago

@web3omega

Thanks. I will try to analyze the cause and let you know when the problem is resolved.

web3omega commented 1 year ago

Thanks for your patience. A small update: I just realized my edge server was still running v0.15.8 while my origin servers are on v0.15.1. I updated the edge to v0.15.10 and then nothing worked, because the origin servers need to be updated first (versions below v0.15.9 are incompatible). Tomorrow I will do several more test runs with everything in sync on v0.15.10, which will hopefully solve it.
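A version mix-up like this can be caught before deploying. Below is a minimal Python sketch, assuming versions are tagged like `v0.15.10` and taking the v0.15.9 threshold from the comment above; `parse_version`, `incompatible_origins`, `versions_in_sync` and the origin names are hypothetical helpers for illustration, not part of OME.

```python
# Hypothetical pre-deployment check; not part of OvenMediaEngine.
# Assumption taken from this thread: origins below v0.15.9 were
# incompatible with an edge running v0.15.10.
MIN_ORIGIN_VERSION = (0, 15, 9)


def parse_version(tag: str) -> tuple[int, ...]:
    """Turn a tag such as 'v0.15.10' into a comparable tuple (0, 15, 10)."""
    return tuple(int(part) for part in tag.lstrip("v").split("."))


def incompatible_origins(origins: dict[str, str],
                         minimum: tuple[int, ...] = MIN_ORIGIN_VERSION) -> list[str]:
    """Return the sorted names of origins running a version below `minimum`."""
    return sorted(name for name, tag in origins.items()
                  if parse_version(tag) < minimum)


def versions_in_sync(edge: str, origins: dict[str, str]) -> bool:
    """True when every origin runs exactly the same version as the edge."""
    edge_version = parse_version(edge)
    return all(parse_version(tag) == edge_version for tag in origins.values())


# Hypothetical origin names, with the origin version reported in this thread.
origins = {f"origin{i}": "v0.15.1" for i in range(1, 5)}
print(incompatible_origins(origins))          # ['origin1', 'origin2', 'origin3', 'origin4']
print(versions_in_sync("v0.15.10", origins))  # False
```

Comparing integer tuples rather than raw strings matters here: as strings, "v0.15.10" sorts before "v0.15.9".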

web3omega commented 1 year ago

So far so good: I updated both the Edge and the Origins to v0.15.10 and I'm not able to reproduce the issue. It seems the latest 'breaking changes' solved it. I will close this issue for now, since I can no longer reproduce it, and I will always keep the edge and origin versions in sync.

web3omega commented 1 year ago

I need to re-open this. The problem still exists with both Edge and Origin running v0.15.10. The edge server crashes after a couple of hours.

Here is the log from the edge server:

OME_SIGSEGV.txt

getroot commented 1 year ago

Thank you for reporting.

From the logs you posted, I suspect that this problem is happening on the dynamic application side (that is, when the application name is *). Currently, the stream1, stream2, stream3, and stream4 applications are being dynamically created and deleted on your edge. Can you please check whether the problem still reproduces if these applications are created in Server.xml in advance?
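To illustrate the difference between the two setups, here is a toy Python model. It assumes (this is not taken from OME's source) that a dynamic application is created when its first stream is pulled and deleted when its last stream stops, while an application listed in Server.xml stays for the lifetime of the process. `AppRegistry` and `churn` are hypothetical names, and the model only counts application creations and deletions; it does not reproduce the crash itself.

```python
import threading


class AppRegistry:
    """Toy model of applications on an edge; NOT OvenMediaEngine's code.

    Assumption: a predefined app (listed in Server.xml) lives for the whole
    process, while a dynamic app is created when its first stream arrives
    and deleted when its last stream stops.
    """

    def __init__(self, predefined):
        self._lock = threading.Lock()      # serialize updates from multiple threads
        self._predefined = set(predefined)
        self._streams = {name: 0 for name in predefined}  # app name -> active streams
        self.created = 0                   # dynamic apps created so far
        self.deleted = 0                   # dynamic apps deleted so far

    def stream_started(self, app):
        with self._lock:
            if app not in self._streams:   # unknown app: create it on the fly
                self._streams[app] = 0
                self.created += 1
            self._streams[app] += 1

    def stream_stopped(self, app):
        with self._lock:
            self._streams[app] -= 1
            # Only dynamic apps are torn down when their last stream stops.
            if self._streams[app] == 0 and app not in self._predefined:
                del self._streams[app]
                self.deleted += 1

    def apps(self):
        with self._lock:
            return sorted(self._streams)


def churn(registry, cycles=3):
    """Start and stop one stream per app, `cycles` times; return (created, deleted)."""
    for _ in range(cycles):
        for app in ("stream1", "stream2", "stream3", "stream4"):
            registry.stream_started(app)
            registry.stream_stopped(app)
    return registry.created, registry.deleted


print(churn(AppRegistry(predefined=[])))  # (12, 12)
print(churn(AppRegistry(predefined=["stream1", "stream2", "stream3", "stream4"])))  # (0, 0)
```

With dynamic applications, every start/stop cycle creates and deletes an application; predefining stream1 to stream4 removes that churn entirely, which is what the suggested workaround relies on.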

web3omega commented 1 year ago

OK, I hardcoded the 4 applications and so far I'm not able to reproduce the issue. Thanks! I will now do further stress testing to see whether I can still reproduce it.

web3omega commented 1 year ago

My edge server has now been running for 2 days without issues, so the dynamically created applications are indeed what causes the edge server to crash. For me, the problem is solved with the above workaround.

getroot commented 1 year ago

Thanks for checking. I will fix the problem in the dynamic application handling.

getroot commented 1 year ago

I have committed a patch for this issue. You can test it with the latest master branch. I would be grateful if you could confirm that your problem has been resolved.