AirenSoft / OvenMediaEngine

OvenMediaEngine (OME) is a Sub-Second Latency Live Streaming Server with Large-Scale and High-Definition. #WebRTC #LLHLS
https://airensoft.com/ome.html
GNU Affero General Public License v3.0

OME server crashes regularly, while server is barely breaking a sweat #887

Closed TalkingIsland closed 2 years ago

TalkingIsland commented 2 years ago

Describe the bug
We use OME by streaming via RTMP, running the Docker image of the OME server. In different situations, the stream disconnects (the player says "Connection with low-latency (OME) terminated unexpectedly") and the server stops serving video to the viewers. At the same time, statistics show that there are still several Mbit/s of incoming traffic from the RTMP stream, while all outgoing traffic (hundreds of Mbit/s) is gone. The only thing that fixes it is to run stop.sh and start.sh.

We noticed this issue occurring in 2 different ways.

1) When there are more than ~200 concurrent viewers, it simply stops serving the livestream within several minutes; we then have to restart via stop.sh / start.sh, it resumes for a few more minutes, and then crashes again. [screenshot: connection counts during that period] I'd like to highlight that the server has a 10 Gbit port and barely reaches ~500 Mbit/s before it crashes. CPU usage is between 10-20%, and RAM usage is extremely low as well. The screenshots below show the bandwidth and CPU usage from a livestream where ~400 people were attempting to watch at the same time at the beginning; it only remained stable once that number dwindled down to around ~200 viewers. The more viewers there are, the faster the crash occurs between restarts. [screenshots: bandwidth and CPU usage]

2) It just happens randomly once every few weeks, when viewership is under 150 concurrent viewers. All subsequent stream attempts then also fail, even a day later, until stop.sh and start.sh are run.

Expected behavior
I expected the Docker instance of OME to use far more of the server's resources before slowing down or crashing. Our hardware should be able to handle streaming to up to 1000 people, based on our estimates, and Docker doesn't seem to be limited in any way, as far as we can see.

Server:

getroot commented 2 years ago

Thanks for reporting.

Please provide additional information.

  1. Does OvenMediaEngine crash? If OvenMediaEngine crashes, there is a crash dump file in /opt/ovenmediaengine/bin. Sending it to us will help us analyze the problem.
  2. Please upload the /var/log/ovenmediaengine/ovenmediaengine.log file from when the situation occurs. If there is any private information in the log file, please send it to support@airensoft.com instead.
  3. Have you tuned your kernel settings? https://airensoft.gitbook.io/ovenmediaengine/troubleshooting#5-2.-tuning-your-linux-kernel (a rough sketch follows after this list)
  4. For reference, you can simulate hundreds of WebRTC playbacks with the performance tester we provide. https://airensoft.gitbook.io/ovenmediaengine/performance-tuning
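
Regarding point 3: the guide above lists the exact values, but as a rough illustration (these numbers are examples only, not the ones from our documentation), the kind of kernel and socket tuning involved looks like this:

# Illustrative tuning for a host serving many concurrent connections;
# use the values from the OME troubleshooting guide for real deployments.
sudo sysctl -w net.core.somaxconn=65535                    # larger TCP accept backlog
sudo sysctl -w net.core.netdev_max_backlog=65535           # queue more packets at the NIC before dropping
sudo sysctl -w net.ipv4.ip_local_port_range="1024 65535"   # widen the ephemeral port range
sudo sysctl -w fs.file-max=1048576                         # raise the system-wide open-file limit
ulimit -n 1048576                                          # raise the per-process file-descriptor limit in this shell
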
TalkingIsland commented 2 years ago

Hi!

  1. It doesn't seem to crash; the Docker container just stops doing its work until it is restarted. Also, there is no such folder/file. We have /opt/ome, but it doesn't have a bin folder.
  2. Will do once I can replicate it.
  3. No, but I have just done that now, thank you.
  4. Thank you - would that simulation equate to a real load, in the sense that if the OME server glitches out with real people, it would also glitch out with that testing tool?

On another note, I just upgraded to Ubuntu 22.04.1 LTS, as I have read that there were some kernel issues with 20.04 in relation to OME. I will be testing soon.

krakow10 commented 2 years ago

Is the memory filled up when it crashes? That's what happens for me when it's been running for a long time.

TalkingIsland commented 2 years ago

For us, no: memory usage never goes above 15-20% (we have 128 GB).

TalkingIsland commented 2 years ago

Hi @getroot, I have sent the log to the email you provided. It came from the "docker logs ovenmediaengine" command.

We have made the recommended config changes and updated the Linux kernel to 5.15 (Ubuntu 22.04), and we are still running into the same issue, consistently, once we hit the ~150-160 concurrent viewer mark. It is unrelated to bandwidth: even when streaming a still image, the stream crashes at ~100 Mbit/s. It seems to be related only to the number of concurrent viewers.

At the time of the crash, CPU usage was 9%, RAM usage was 2.9%, and bandwidth usage was 100 Mbit/s.

Please let us know how we can address this issue that has been plaguing us ever since we started using OME half a year ago.

getroot commented 2 years ago

Have you tried tuning OvenMediaEngine performance?

https://airensoft.gitbook.io/ovenmediaengine/performance-tuning

The term "crash" you mentioned confuses me, but looking at your logs it seems that this is not a crash and can be improved with performance tuning. Refer to the manual and adjust the number of threads. The performance test tool generates real traffic and can generate hundreds of sessions on 8 cores CPU machine.

TalkingIsland commented 2 years ago

Yes, we tried changing the settings there: under Publishers, AppWorkerCount is set to 1 and StreamWorkerCount is set to 24.
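
For context, that corresponds to the <Publishers> block of Server.xml, roughly like this (trimmed to the relevant elements; element names as I understand them from the docs, values as set on our server):

<Publishers>
    <!-- worker threads per application for session/signalling handling -->
    <AppWorkerCount>1</AppWorkerCount>
    <!-- worker threads that deliver media to viewers -->
    <StreamWorkerCount>24</StreamWorkerCount>
</Publishers>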

I am trying to run the testing tool and I am having trouble connecting with these commands:

go run OvenRtcTester.go -url ws://OURHOSTNAME:3333/app/stream -n 5
go run OvenRtcTester.go -url ws://OURHOSTNAME:3334/app/stream -n 5
go run OvenRtcTester.go -url ws://OURHOSTNAME:1935/app/stream -n 5

I either get:

client_0 failed to run (reason - invalid offer message received : ws://OURHOSTNAME:3333/app/stream ({"code":404,"error":"Cannot create offer"}))

or:

client_0 failed to run (reason - could not connect signaling server : ws://OURHOSTNAME:3334/app/stream (reason : websocket: bad handshake))

How do I confirm the correct stream name? With our integration we authenticate via ?policy=xxxx and &signature=xxxx.

Thanks

getroot commented 2 years ago

404 means your OME server does not have an "app/stream" stream, but you tried to play one.

The bad handshake is a TLS error; you have to connect to your 3334/TLS port with wss://.
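
For example, something like this (just an illustrative invocation: replace STREAM_NAME with the key of a stream that is actually live, and keep your policy/signature query parameters if your application requires them):

go run OvenRtcTester.go -url "wss://OURHOSTNAME:3334/app/STREAM_NAME?policy=xxxx&signature=xxxx" -n 5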

getroot commented 2 years ago

Measure performance to see which threads are the bottleneck.

TalkingIsland commented 2 years ago

Ok, testing now, thank you.

TalkingIsland commented 2 years ago

[screenshot: performance measurement results] and docker is showing this: [screenshot: docker output]

TalkingIsland commented 2 years ago

Each of the stream workers seems to use a different core, just very little of each.

TalkingIsland commented 2 years ago

Ok, so it "crashed" again when I connected 168 clients.

Here is the log from the performance tool:

udp4 relay 1.1.1.1:14090 related IP:45318 candidate found
client_166 connection state has changed connected
client_166 track has started, of type 110: audio/OPUS
client_166 track has started, of type 98: video/H264
client_167 connection state has changed checking
client_167 has started
udp4 relay 1.1.1.1:14090 related IP:45334 candidate found
client_167 connection state has changed connected
client_167 track has started, of type 110: audio/OPUS
client_167 track has started, of type 98: video/H264
client_168 failed to run (reason - could not connect signaling server : wss://HOSTNAME:3334/app/stream (reason : websocket: bad handshake))

After that, it disconnected all 168 sessions. Now docker is unresponsive if I try to run a new stream; it just sits at this, even though a new stream has been initiated: [screenshot]

getroot commented 2 years ago

Hmm, I think something other than OvenMediaEngine is limiting your server's performance. We process over 4 Gbps of output on our 8-core server, and the bottleneck is the CPU. Please let me know how the "ulimit" and kernel tuning settings are applied. It also seems necessary to measure your network performance with "iperf".
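
For example (generic Linux and iperf3 usage, adjust the hostname to your setup; this is only a sketch of the kind of checks I mean):

ulimit -n                                                # open-file limit of the shell that launches OME
sysctl net.core.somaxconn net.ipv4.ip_local_port_range   # spot-check a couple of the tuned kernel values
iperf3 -s                                                # on the OME server
iperf3 -c OME_SERVER_IP -P 8 -t 30                       # on the test machine: 8 parallel TCP streams toward the server
iperf3 -c OME_SERVER_IP -P 8 -t 30 -R                    # reverse direction: server-to-client throughput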

TalkingIsland commented 2 years ago

Perhaps, but I looked everywhere that I could.

ulimit is set to unlimited, and the kernel tuning settings are applied.

I can hit 1 Gbps on a livestream without crashing, and I can also crash it while streaming a still picture with only 100 Mbit/s of traffic. The only thing that matters is hitting approximately 170 simultaneous connections, which then brings it down.

TalkingIsland commented 2 years ago

The result is 1.65 Gbit/s when running iperf with the OME server as the server and the testing-tool machine as the client.

However, running it the other way around, with the OME server connecting as the client and the testing-tool machine acting as the server, it is closer to 500 Mbit/s.

Regardless, I know that it also crashes with only 100 Mbit/s in use.

TalkingIsland commented 2 years ago

Could it be because docker is running through snap?

getroot commented 2 years ago

It is impossible to know precisely from the given information alone. However, it would be good to test by installing OME directly on the host instead of in Docker. Alternatively, it might be helpful to give it a try on an AWS instance as well. If I think of another possibility, I'll comment again.

getroot commented 2 years ago

How about measuring performance with the test tool on the same server? That would narrow down the cause of the problem.
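
For example, running the tool on the OME host itself against the loopback address takes the external network out of the equation (same flags as before; use a stream key that is actually live, and keep in mind the tool will compete with OME for the same CPU):

go run OvenRtcTester.go -url ws://127.0.0.1:3333/app/STREAM_NAME -n 200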

TalkingIsland commented 2 years ago

Hi @getroot, I reinstalled it directly on the host, avoiding Docker, as per your advice, and it is much better now, thank you.

It is still not perfect though; we need to figure this out:

We went from a hard limit of 168 sessions, after which the Docker instance "crashed" and dropped the stream for all viewers, to a hard limit of 383 concurrent viewers, after which it fails to connect any new sessions. All existing sessions keep working fine, and if a session is terminated, another can take its place.

The testing tool says the following when attempting to connect more than 383 concurrent viewers:

client_0 connection state has changed connected
client_1 failed to run (reason - could not connect signaling server : wss://OURHOSTNAME:3334/app/u9nzhEsCHgwV (reason : websocket: bad handshake))
client_0 track has started, of type 98: video/H264
client_0 track has started, of type 110: audio/OPUS

At this point it also seems unrelated to bandwidth; it looks like a fixed cap on the number of WebSocket connections. Any idea how we can raise this limit?

Our CPU usage at the 383-session peak is 20%, RAM usage is 7%, and it uses 1.4 Gbit/s (our port is 10 Gbit).

I am using 2 different servers in different locations to run the testing tool; each can typically connect at least 250 sessions on its own, but when connecting together, the total comes out to 383, as mentioned above.

Any thoughts? Thanks

TalkingIsland commented 2 years ago

On another run, we maxed out at 380 concurrent viewers. Average video delay is just 10 ms with no frames dropped, so it's not the server being overloaded.

Here is another error when reaching the 380th total client:

client_129 failed to run (reason - could not connect signaling server : wss://OURHOSTNAME:3334/app/u9nzhEsCHgwV (reason : EOF))

While connecting, both test-tool servers were throwing these errors multiple times a second at some point before stabilizing:

turnc ERROR: 2022/09/25 05:31:00 fail to refresh permissions: all retransmissions failed for H4QfsmxQgyMDVBmd
turnc ERROR: 2022/09/25 05:31:03 fail to refresh permissions: all retransmissions failed for ChhumxwVlyt2jEK4
turnc ERROR: 2022/09/25 05:31:04 fail to refresh permissions: all retransmissions failed for Uy10E4Dst1uY8h/l

The OME log shows the following line from when the 381st client was attempting to connect:

[2022-09-25 05:30:56.883] W [SPAPISvr-T8081:215582] APIController | controller.h:154 | HTTP error occurred: [HTTP] Could not find the stream: [default/#default#app/NbVScAm83hVk] (404)

And another time, when we reached the limit on another stream, it was exactly the same:

[2022-09-25 05:54:55.032] W [SPAPISvr-T8081:1340] APIController | controller.h:154 | HTTP error occurred: [HTTP] Could not find the stream: [default/#default#app/NbVScAm83hVk] (404)

TalkingIsland commented 2 years ago

OK, I have identified the issue. We were running it through an nginx proxy, and nginx didn't have enough worker connections available.
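
For anyone hitting the same wall, the relevant nginx settings are roughly these (values are illustrative, not our exact config). If nginx runs with a single worker at the stock worker_connections of 768, and each proxied WebSocket session holds two connections open (one to the client, one upstream), that would cap sessions at roughly 768 / 2 = 384, which is suspiciously close to the 383 we saw:

worker_processes auto;
worker_rlimit_nofile 65535;        # raise nginx's own file-descriptor limit
events {
    worker_connections 16384;      # each proxied WebSocket session uses two of these
}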

Now I was able to hit 750 concurrent viewers (3 test servers with 250 each), but we still have an issue with the testing tool. After a minute or a few of serving 250 connections normally, with no errors, no packet loss, etc., it just drops all the bandwidth and starts throwing these errors:

turnc ERROR: 2022/09/25 06:31:23 fail to refresh permissions: turn: failed to retransmit transaction oJoHN2pUJDGJdjI5
turnc ERROR: 2022/09/25 06:31:23 fail to refresh permissions: turn: failed to retransmit transaction ZqtmhisKIpvzQ2f2
turnc ERROR: 2022/09/25 06:31:23 fail to refresh permissions: turn: failed to retransmit transaction e1jvQUTHKyYr5qIW
turnc ERROR: 2022/09/25 06:31:24 fail to refresh permissions: turn: failed to retransmit transaction fc/ImTAQ4y8Wy7+O
turnc ERROR: 2022/09/25 06:31:24 fail to refresh permissions: write tcp4 TestingServerIP:42422->LivestreamServerIP:3478: write: broken pipe
turnc ERROR: 2022/09/25 06:31:24 fail to refresh permissions: turn: failed to retransmit transaction r8dxY2s7G8uriBau
turnc ERROR: 2022/09/25 06:31:24 fail to refresh permissions: turn: failed to retransmit transaction 35NC+DLn4BYbXx/4

What can be the reason for this issue?

TalkingIsland commented 2 years ago

OK, I was able to hit a maximum of 1800 total viewers. The testing tool only seems to stay stable for a long time with about 200 sessions; if I go over ~200, say to 300, then within a few minutes it crashes all 300 sessions on that test-tool server with the errors mentioned above.

The server itself handled 1800 viewers (6 test servers * 300 sessions) with a peak of 6 Gbit/s, 43% CPU usage, and 13% RAM usage. It is much better now; if only the "fail to refresh permissions" error didn't inevitably appear in the OME test tool, it would be perfect.
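
If the ~200-session ceiling is per tester process rather than per machine, one thing that might help (we have not verified this) is splitting each machine's load across two smaller processes instead of one large one:

go run OvenRtcTester.go -url wss://OURHOSTNAME:3334/app/STREAM_NAME -n 150 &
go run OvenRtcTester.go -url wss://OURHOSTNAME:3334/app/STREAM_NAME -n 150 &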

getroot commented 2 years ago

It was a great experiment, and it will help others as well. The key to the problem was that nginx's performance was limited. If I had known about this structure earlier, the problem would have been analyzed more quickly.

The performance testing tool relies on Pion/WebRTC. The "fail to refresh permissions" error is presumably due to the TURN request timing out (a network issue, performance issue, etc.). That is, it is an error that appears when the hardware, software, or network performance of the machine running the measurement tool is insufficient.

Do you think it's okay to close this issue now?

TalkingIsland commented 2 years ago

I see, ok, thank you! Yes, I will close this issue with this comment and I hope it helps others in the future.