AirenSoft / OvenMediaEngine

OvenMediaEngine (OME) is a sub-second-latency live streaming server for large-scale, high-definition streaming. #WebRTC #LLHLS
https://airensoft.com/ome.html
GNU Affero General Public License v3.0

Streams dropping intermittently #729

Closed - naanlizard closed this issue 2 years ago

naanlizard commented 2 years ago

Describe the bug
I'm still investigating, but it appears that OME freezes and stops receiving input from all RTMP streams. It recovers fairly quickly (less than a few seconds), but it can still cause some buffering for viewers. Here is a graph showing CPU usage (orange) and connected streams (green) - CPU usage drops to near 0 periodically, at the moments when streams are disconnected.

I've checked, and it does not appear to be our host's network connection.

Screen Shot 2022-04-11 at 5 10 35 PM

Attached here is an excerpt of our log file, including a few minutes before everyone is disconnected and probably 10-15 minutes after.

cleanlog.txt

Server.xml

Aside from the actual bug, if there are any performance recommendations we're happy to hear them :)

getroot commented 2 years ago

Thank you. You seem to have found a very serious problem. Although increasing <Bind><Providers><RTMP><WorkerCount> may partially mitigate this problem, it will not solve the underlying issue. I'll try to analyze this problem soon and blow it up to another galaxy.

naanlizard commented 2 years ago

Increasing what may partially solve it?

This is with the latest release btw

getroot commented 2 years ago

I mistyped the code tag in the comment. It's corrected.

naanlizard commented 2 years ago

What would you recommend as an optimum value? Should other bind provider worker counts be increased as well?

getroot commented 2 years ago

<Bind>~<WorkerCount> sets the number of threads that receive data on the corresponding port. Usually one is enough. However, if you receive a lot of RTMP stream input, check the CPU usage of each thread; if the SPRTMP-XXX threads are using a lot of a CPU core, increase the number of worker threads.

https://airensoft.gitbook.io/ovenmediaengine/performance-tuning#monitoring-the-usage-of-threads

When I said above that increasing WorkerCount might help, my assumption was that the problem you report is the RTMP module blocking when a session ends, which then affects other sessions as well. With multiple worker threads, such a block only affects the few sessions handled by that thread.
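
For reference, this is roughly where that setting lives in Server.xml (the port value is just an example; adjust to your own file):

<Bind>
    <Providers>
        <RTMP>
            <Port>1935</Port>
            <!-- Number of socket worker threads receiving RTMP data on this
                 port. One is usually enough; raise it only if the SPRTMP-XXX
                 threads are saturating a CPU core. -->
            <WorkerCount>1</WorkerCount>
        </RTMP>
    </Providers>
</Bind>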

naanlizard commented 2 years ago

Thank you, I understand the worker count better.

I am not sure the bug is triggered by a stream ending - the "connected streams" graph is from nginx-rtmp, and we forward streams from OME to nginx-rtmp, so all nginx-rtmp streams are dropped when the bug happens. The graph simply sometimes does not capture that.

Anything else you would recommend changing in our config? We'll try the new settings and see if it improves things

naanlizard commented 2 years ago

Oh, I should also say, this is on a very powerful 16c/32t CPU, and the problem happens even with very few streams connected. If you think the problem is that a stream disconnecting spikes CPU usage, I am not sure that is correct.

getroot commented 2 years ago

Please check if my understanding is correct.

  1. You are running OME on a 16c/32t CPU.

  2. Dozens of RTMP streams are input to OME.

  3. OME redirects all streams to nginx-rtmp.

  4. Intermittently, OME freezes and stops receiving all RTMP packets, and it recovers after a few seconds. (Are all connections disconnected and then reconnected?)

  5. At this time, the CPU usage of OME instantly drops to 0% and then recovers.

  6. Intermittently, all streams pushed from OME to nginx-rtmp stop and reconnect? So all nginx-rtmp streams are dropped when the bug happens. -> Are there any logs for this? And does this have anything to do with number 4?

  7. The ORANGE line in the graph you uploaded is the CPU usage of OME.

  8. The GREEN line of the graph you uploaded is the number of streams obtained from nginx-rtmp.

naanlizard commented 2 years ago

Comments added inline, if I do not clarify or add anything it is because you are correct

Please check if my understanding is correct.

  1. You are running OME on a 16c/32t CPU.
  2. Dozens of RTMP streams are input to OME.

At times, dozens, sometimes only two or three. They are user-generated, so you can assume they start and stop randomly.

  3. OME redirects all streams to nginx-rtmp.
  4. Intermittently, OME freezes and stops receiving all RTMP packets, and it recovers after a few seconds. (Are all connections disconnected and then reconnected?)

Yes, at least all of the nginx-rtmp streams disconnect. I don't believe OBS (used by the user-streamers) disconnects; it just drops frames for a few seconds.

  5. At this time, the CPU usage of OME instantly drops to 0% and then recovers.

Not exactly zero, but very close

  6. Intermittently, all streams pushed from OME to nginx-rtmp stop and reconnect? So all nginx-rtmp streams are dropped when the bug happens. -> Are there any logs for this? And does this have anything to do with number 4?

This is the same bug as 4 - I've never seen them happen separately. The only log I've got is already posted but I can try and capture another if it helps. I do not have logs from nginx-rtmp

  7. The ORANGE line in the graph you uploaded is the CPU usage of OME.

It is the CPU usage of OME in arbitrary units. I can provide a graph of CPU percentage usage if you want.

  8. The GREEN line of the graph you uploaded is the number of streams obtained from nginx-rtmp.

getroot commented 2 years ago

(4) How did you check that OME freezes, stops receiving packets, and drops frames? RTMP is a TCP connection, so unless the connection is completely disconnected, lost packets will be retransmitted and eventually received. Did you find the problem with HLS playback from OME?

(6) It may be a problem with the RTMP Push Publisher. I couldn't find a log related to this in the log you posted, so I asked whether there are other related logs.

naanlizard commented 2 years ago

4 - all playback is currently through OME; apologies, our setup is complex. Both HLS and WebRTC. We know OME drops frames because I have watched it happen in my OBS while streaming at the time. We have not directly tested that OME freezes, but it seemed the best word to use.

getroot commented 2 years ago

You watched OBS drop frames. I understood.

I haven't experienced this problem yet, so it might take a while to reproduce this. Let's try it first with your Server.xml. I would appreciate it if you could provide additional information while I reproduce this issue.

naanlizard commented 2 years ago

Absolutely, just let me know what you need. If you'd like to try our setup, it runs in Docker - I can email you how to set it up.

getroot commented 2 years ago

To check all possibilities,

Please let me know the dmesg -L -T output and kernel version (uname -r) of the host server, and the same from inside the Docker instance.

getroot commented 2 years ago

And if you haven't tuned your kernel, this information might be of some help.

https://airensoft.gitbook.io/ovenmediaengine/troubleshooting#5-2.-tuning-your-linux-kernel

naanlizard commented 2 years ago

Host: 5.10.0-051000rc6-generic cleandmesg.txt

Docker: 5.10.0-051000rc6-generic (dmesg isn't permitted inside the container, it just prints "dmesg: read kernel buffer failed: Operation not permitted", but I've listed the kernel version just in case)

I had no idea about some of those tuning suggestions. I've applied them, and I'll hopefully rewrite that section soon to be a bit clearer about what they do.

naanlizard commented 2 years ago

I captured logs for a longer freeze (and a few minutes before and after) - probably more useful, as this one includes errors and such! I didn't realize how > worked with docker: you must use &> to capture errors as well as normal logs!

cleanlongfreeze.txt

naanlizard commented 2 years ago

As of 9pm local time we have disabled stream forwarding from OME to nginx-rtmp - its use is incidental, and perhaps the forwarding is the cause of the issue. Will report back tomorrow to see if it helped overnight.

getroot commented 2 years ago

I haven't been able to reproduce this problem yet.

Does inputting 1 RTMP stream reproduce the problem? If so, I think the scope of analysis should be extended to firewall, Linux kernel, NIC, NIC driver, etc.

Once upon a time, I had a network freezing problem on a server providing a commercial service, and after several days of overnight analysis I found out it was a bug in the kernel. https://bugzilla.kernel.org/show_bug.cgi?id=205933

You are using kernel version 5.10, so this is probably not an issue.

naanlizard commented 2 years ago

We haven't tried reproducing it under test conditions; this is all in production. We've just deployed the WorkerCount change you recommended (to 16) and we'll see how that goes.

Disabling RTMP forwarding did not help (though we may still be sending API calls - unclear, and I'm checking with our dev, but they're asleep).

I suspect it is not the firewall at least; our firewall config is the same as what we used successfully with nginx-rtmp for many years. The same goes for the NIC, NIC driver, kernel, etc., though perhaps those things interact differently with OME.

getroot commented 2 years ago

I understood. I will continue to analyze. Oh, I didn't ask the most important thing. What version of OME are you using?

naanlizard commented 2 years ago

airensoft/ovenmediaengine:0.13.2

getroot commented 2 years ago

I haven't been able to reproduce it on my development server yet. Also, I did not find this problem in the commercial service that I am providing technical support for (15,000 concurrent users, 20 Edges). If this problem had been reproduced on that service, it would have received huge complaints.

naanlizard commented 2 years ago

Yeah, I assumed it was something we're doing differently from everyone else's deployments.

It is fairly frequent: once every two or three hours, but sometimes many times in a row, and sometimes with long periods of good behavior.

Will check the NIC stuff later, as I'm out now.

naanlizard commented 2 years ago

Attached is my netstat -s output - perhaps it means more to you than it does to me; I'm no networking genius. netstatsoutput.txt

After changing the worker count in the provider section (for all providers) to 16, things are at least better: my test stream hasn't been affected after ~4 hours and there are no obvious CPU usage drops in the CPU graph, though if the drops are now 1/16th as extreme they might simply not be noticeable.

naanlizard commented 2 years ago

Spoke too soon - users are still getting buffering from time to time, though the CPU graph looks smoother. I'm not sure if streams still get dropped frames.

We'll be switching back to nginx-rtmp until we can sort this out

getroot commented 2 years ago

Does "users are still getting buffering time" happen in HLS or WebRTC? Or is it both?

If frames are dropped at the front end (OBS -> OME), there may be frame drops (video stuttering) during playback, but I still don't know exactly why the buffering occurs. At first I suspected the RTMP ingest part of OME, as you analyzed, but I think you should check other parts as well. OBS's frame drops and the viewers' buffering problem may not be related.

Please check the contents below.

Buffering problem in WebRTC

Buffering problem in HLS

getroot commented 2 years ago

For reference, legacy HLS is not a protocol that can reliably handle very small segments for low latency; Apple recommends segment durations of 5 to 10 seconds. LLHLS was released to solve this and supports latencies of 2-3 seconds.
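
If you do stay on legacy HLS for now, the segment settings live under the Publishers section of Server.xml. A rough sketch using Apple's recommended range instead of 1-second segments (the SegmentDuration/SegmentCount names are as I recall them from the manual, so please verify them for your version):

<Publishers>
    <HLS>
        <!-- Length of each segment in seconds; 5-10 s is the range
             recommended for stable legacy HLS playback. -->
        <SegmentDuration>5</SegmentDuration>
        <!-- Number of segments kept in the playlist. -->
        <SegmentCount>3</SegmentCount>
    </HLS>
</Publishers>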

naanlizard commented 2 years ago

Does "users are still getting buffering time" happen in HLS or WebRTC? Or is it both?

What happens for our viewers is that WebRTC will simply drop the connection entirely; the player then reloads and tries to connect again but cannot - this leaves a play button on screen, but clicking it does nothing.

On HLS it just leads to endless buffering (but that's using a videojs player, so I can't fault OME for that necessarily, other than the initial buffering)

For WebRTC, of course, streamers have B-frames set to 0. The B-frame stuttering is a separate problem we're well aware of - this is very different.

We play back over TCP; we just want the option to play over UDP in the future without updating our OME config.

I'll try to capture the video graph you requested, but it will take some time. Again, it is OME completely stopping, as above, simply less common now.

Re: HLS segment size - nginx-rtmp worked flawlessly for many years with a segment size of 1. We can try increasing it to 4, but it would be unfortunate to have to change. I do not think this is the issue either; again, this is the same behavior as the initial bug report.

naanlizard commented 2 years ago
Screen Shot 2022-04-15 at 10 55 11 AM

Here is a graph of CPU usage (ignore the axis labels on the chart). This is with the increased worker count - clearly OME is still having trouble.

getroot commented 2 years ago

A 1-second segment in HLS is very unstable. As you well know, nobody recommends it; Apple, Google, and everyone else avoid it. If it worked well, there would have been no reason to release the LLHLS protocol. For smooth playback, the viewer's network jitter must be very low, since every segment must always be downloaded in less than a second - and not all viewers have such a good network environment.

Furthermore, the OME 0.13.2 release has a big problem when using 1-second segments: it did not support HTTP/1.1 persistent connections.

https://github.com/AirenSoft/OvenMediaEngine/issues/279#issuecomment-1075284675

This is now patched in the latest master. However, as you know, the master branch is a development version and cannot be used in a commercial environment.

This problem is very difficult. It cannot be reproduced in either my development environment or the commercial environment that I support. @bchah Have you ever experienced this problem? Or do you have any suspicions?

naanlizard commented 2 years ago

When I'm home from my trip I will work to reproduce this somehow for you.

Re: segment size, perhaps the actual segments created are cut on keyframes? Most of our streamers use 4 as their keyframe interval. I can't speak to what Google and Apple recommend, only to our experience running nginx-rtmp for over 4 years now. This could explain some of the buffering at least, but not the ingest dropping frames and WebRTC disconnecting.

getroot commented 2 years ago

Thanks for continuing to help with the analysis. I think you are running into several problems.

  1. Frame drops in OBS. This seems to be related to the CPU-stuck problem in OME. I suspect this may also happen with certain Docker versions, so I'm looking into it. It would be very helpful if you could let me know whether the problem reproduces when OME is installed on a non-Docker host. Also, tell us the Docker version you are using and the docker run command. (If you have made any special settings in Docker, please let us know as well.)

  2. WebRTC disconnect problems. Related to (1): if OME freezes for more than a few seconds, connections may be cut off due to an ICE timeout. (How many seconds does it freeze? If it is more than 30 seconds, all clients will be disconnected.) See the sketch after this list for where that timeout lives.

  3. HLS buffering issues. This seems to be a different problem from (1) and (2). Let me focus on (1) and (2) first.
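
For reference on (2): the ICE timeout corresponds to the Timeout value of the WebRTC publisher in Server.xml. A rough sketch from memory of the manual (the element name and default may differ by version, so please verify):

<Publishers>
    <WebRTC>
        <!-- ICE (STUN request/response) timeout in milliseconds. If a session
             sees no ICE traffic for this long it is considered disconnected;
             the default of 30000 (30 s) matches the behavior described in (2). -->
        <Timeout>30000</Timeout>
    </WebRTC>
</Publishers>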

naanlizard commented 2 years ago

The HLS buffering happens for every stream at the same time; I think OME entirely stops responding - serving HLS files, handling WebRTC connections, and servicing incoming RTMP connections.

The freezes are typically short: several seconds, not thirty.

getroot commented 2 years ago

Yes, you seem to suspect that the network is completely stuck for a few seconds. I haven't been able to find this happening in OME in the logs you posted, so I'm also looking for a problem with Docker or something else.

naanlizard commented 2 years ago

That's fine, I just wanted to be clear that the problem affects all playback methods.

Perhaps it is a network problem with Docker. The streams don't disconnect from OME though, they just drop frames, so I assumed it was a processing problem in OME.

I understand the frustration of trying to solve a bug you can't reproduce. We'll be focusing on reproducing it in the coming month.

getroot commented 2 years ago

This kind of problem (whether it's a bug in OME or an external one) pushes OME and the OME community further. Problems like this are not easy to discover without the help of the community. Thank you for your contribution.

bchah commented 2 years ago

I only have anecdotal evidence to contribute on this one so far. I think @naanlizard has a more rigorous use case, since my application only handles about ~5 streams and ~5 viewers max per stream. We only use WebRTC for playback, with HLS as an emergency failover. For HLS we use 4-second segments with a count of 4, so 16-20 seconds of real-world latency. I don't know how HLS would survive with a 1-second segment - perhaps if the encoding parameters were optimized for very low bandwidth.

Testing with a single RTMP stream (OME 0.13.2) there is no sign of a chronic disconnect issue on the OME side. We leave a test stream going 24/7 and while it does "reset" once in a while, this is to be expected after running for days and days in a row:

IMG_8139

A user contacted me recently asking why their stream was disconnecting and reconnecting from time to time, with a similar description to this issue. When I investigated I noticed they were sending a stream with B-frames and it was logging errors on every frame. Every few hours OME appeared to be restarting on its own but there were no critical errors or SIGABRTs in the log, so it is not clear what the actual cause was. The log file was big, something like 600MB in just one day. Once they stopped sending B-frames, the issue went away.

Playback quality is one of the most important things so I have an offer to suggest to @naanlizard - once you are able to reproduce the issue on your own systems (outside of production), I can provide a test endpoint where you can run the same stream that caused the issue and @getroot can access the OME instance directly, to better observe the problem.

naanlizard commented 2 years ago

For the record, I'm happy to let you tinker with our live server if you'd like, @getroot - just let me know where to email. If it would be helpful, of course.

naanlizard commented 2 years ago

Figured I'd show a clearer example with the updated config (as above) - https://i.imgur.com/nZuHe8c.png

OME CPU usage is still going to near 0 - perhaps it is getting caught on recording or something? The hard drive we record to isn't particularly busy aside from writing streams and serving streams to be downloaded (rare)

@getroot do you have recording enabled on your big deployment? What about start and stop stream callbacks?

My assumption right now is that we are using something that you aren't, and that is what causes the trouble.

getroot commented 2 years ago

I run several video platforms.

The platform mentioned at https://github.com/AirenSoft/OvenMediaEngine/issues/738 operates 1 channel and provides WebRTC streaming to 15,000 concurrent users. It operates on an Origin-Edge architecture, and Origin does not use AdmissionWebhooks or Recording. Edge doesn't do anything but play.

On another platform I record 15 channels. It has been running non-disruptively for over a year already.

On some other platforms it runs with the Origin-Edge structure, and although it is not large, Origin uses almost every feature, such as AdmissionWebhooks and Recording. Edge, of course, doesn't do anything but play.

It's hard to imagine CPU usage going to zero intermittently other than when the network is stuck or the kernel soft-locks up. If there is no input stream, CPU usage may drop sharply because there is nothing to process at the back end. And since a soft lockup locks a core, CPU usage of course drops sharply.

The input network can also appear stuck if the socket uses one WorkerThread and AdmissionWebhooks gets a late response, because the socket thread blocks until AdmissionWebhooks receives a response. So the ControlServer should respond very quickly. I never considered this because AdmissionWebhooks is not set in the Server.xml you uploaded. Are you using AdmissionWebhooks?
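
If you are, it is configured per application in Server.xml, roughly like this (the element names - ControlServerUrl, SecretKey, Timeout, Enables - are as I recall them from the manual, so treat this as a sketch and verify; the URL is just an example):

<AdmissionWebhooks>
    <ControlServerUrl>http://example.com/admission</ControlServerUrl>
    <SecretKey>1234</SecretKey>
    <!-- Milliseconds OME waits for the ControlServer's response. While it
         waits, the socket worker thread that accepted the connection is
         blocked, which is why a slow ControlServer can freeze the other
         sessions handled by the same worker. -->
    <Timeout>3000</Timeout>
    <Enables>
        <Providers>rtmp</Providers>
        <Publishers>webrtc,hls</Publishers>
    </Enables>
</AdmissionWebhooks>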

Soft-lockup issues are outside of OME; for example, you can refer to https://github.com/moby/moby/issues/42895. How is your system's memory usage? <Rtx>true</Rtx> can be memory-intensive (depending on the bitrates of the input streams, of course).
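
For reference, <Rtx> is an option of the WebRTC publisher; a minimal sketch (the surrounding <Publishers>/<WebRTC> structure is as I recall it from the manual, so verify against your Server.xml):

<Publishers>
    <WebRTC>
        <!-- RTX keeps retransmission buffers per session, so memory use grows
             with the bitrate of the input streams. Disable it if memory is a
             concern and you can live without retransmission-based recovery. -->
        <Rtx>false</Rtx>
    </WebRTC>
</Publishers>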

naanlizard commented 2 years ago

Plenty of available memory - we've got about 120 GB, with 40 GB free most of the time.

It's hard to imagine CPU usage going to zero intermittently other than when the network is stuck or the kernel soft-locks up. If there is no input stream, CPU usage may drop sharply because there is nothing to process at the back end. And since a soft lockup locks a core, CPU usage of course drops sharply.

I don't believe a kernel lockup is happening - everything else works reliably (a realtime chat server, file serving, loading the frontend, backend calls, etc.).

Similarly, I don't think it's the network. This is the same setup that has been running nginx-rtmp for years.

The input network can also appear stuck if the socket uses one WorkerThread and AdmissionWebhooks gets a late response, because the socket thread blocks until AdmissionWebhooks receives a response. So the ControlServer should respond very quickly. I never considered this because AdmissionWebhooks is not set in the Server.xml you uploaded. Are you using AdmissionWebhooks?

new 1.txt

Here is our Server.xml again - we definitely use AdmissionWebhooks. I would not be surprised if our backend server sometimes responds slowly. Could this cause dropped frames for streams that were already connected, and playback problems for viewers of those same streams? I would expect it only to cause a slow start for the stream being checked.

Can you clarify - "the socket uses 1 WorkerThread" - would increasing a value somewhere in our Server.xml potentially fix this?

getroot commented 2 years ago

I want to make sure that the environment in which you ran nginx-rtmp for many years was also Docker.

As I said before, you have already increased the socket threads for RTMP to 16 with the settings below. Your server now has 16 workers receiving RTMP connections, and each RTMP connection a user makes through OBS, etc. is assigned to one of these workers. If your ControlServer responds late when an RTMP connection is being opened or closed, it will freeze the other RTMP connections handled by the same worker.

<RTMP>
    <Port>${env:OME_RTMP_PROV_PORT:1951}</Port>
    <WorkerCount>16</WorkerCount>
</RTMP>

By the way... you created too many threads. Too many threads cause excessive context switching, which can actually degrade performance. I recommend increasing WorkerCount only after looking at the CPU core usage per thread, as I guided previously.

naanlizard commented 2 years ago

I'll have to discuss later which configuration values we should change in the future, and to what.

Yes, nginx-rtmp ran only in Docker (everything we run is in Docker).

getroot commented 2 years ago

The thread for RTMP input and the thread for WebRTC output are independent threads. Therefore, freezing the RTMP thread does not cause the WebRTC thread to freeze. But I'm not sure how the WebRTC Player will react if there is no WebRTC output for a few seconds because there is no input from RTMP for several seconds.

naanlizard commented 2 years ago

We have switched back to nginx-rtmp ingest and webrtc playback via OME

So:

Previously:

OME for all features - ingest, playback (HLS and WebRTC), access control for ingest and I believe playback, thumbnails, screenshots, recordings

Now:

nginx-rtmp ingest, HLS playback, thumbnails, screenshots, recordings, access control

ffmpeg forwarding rtmp streams from nginx-rtmp to OME for WebRTC playback only.

Currently nginx-rtmp seems far more performant than OME was under the same conditions.

We will be experimenting with multiple containers to compare CPU usage across nginx-rtmp and OME, and hopefully debug the dropping streams (still happening with our same config and the ffmpeg forwarding - we'll be testing change by change to find what causes it).

getroot commented 2 years ago

I fixed some code that was badly affecting performance today. In my tests, when I created 100 streams, it used 450% CPU before improvement and 300% CPU after improvement.

It would be very helpful if you test it with the latest master code in your environment and report the results.

Ah, the thumbnail side hasn't been improved yet. This improvement is in the stream parsing library.

bchah commented 2 years ago

@getroot +100 for this!

Just compiled your latest changes and CPU usage now sits roughly 25% lower under the same workload.

🥇 🥇 🥇 🥇 🥇

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.