CasparCG / server

CasparCG Server is Windows and Linux software used to play out professional graphics, audio and video to multiple outputs. It has been in 24/7 broadcast production since 2006. Ready-to-use downloads are available under the Releases tab and at https://casparcg.com.
GNU General Public License v3.0

Really high CPU load over time #1356

Closed dotarmin closed 1 year ago

dotarmin commented 3 years ago

Expected behaviour

Be able to play clips, both long and short without having to worry about the CPU load.

Current behaviour

When playing shorter clips using v2.3.0 LTS (and also v2.2.0), the CPU load climbs to 90-92% over time and stays there. I have attached some screenshots to show what it looks like. We do not see this behaviour with longer clips.

Shorter clips = around 20 seconds
Longer clips = hours

I think it has to do with the number of commands sent and that it's not related to the actual file length, but it's just a theory.

Used commands (from automation system)

LOAD
PLAY

LOAD
PLAY

Environment


Screenshots

image01

image02

image03

image04

TondaKrist commented 3 years ago

We are experiencing that too. After some time the CasparCG 2.3 LTS process gets stuck at 99% and then fails, even after STOPping all layers and then playing only one.

ronag commented 3 years ago

I have seen this too.

ronag commented 3 years ago

Does anyone have reliable repro steps?

Julusian commented 3 years ago

@scriptorian is able to reproduce this and is having a look into the cause

hummelstrand commented 3 years ago

Seems like it can be reproduced by issuing multiple LOAD and PLAY commands over time.

TondaKrist commented 3 years ago

Reproducible after multiple PLAY and LOADBG commands over time, as @hummelstrand mentioned, even on a single layer. I will prepare a command log to reproduce it.

scriptorian commented 3 years ago

As mentioned I have managed to reproduce this with a test script that repeatedly LOADs a clip onto a channel/layer (using the ffmpeg producer). No PLAY is required to provoke the fault. For testing I have made the script loop every 200ms and this makes the problem apparent in a reasonable amount of time. The first symptom is the process working set increasing linearly, then after a few minutes the CPU load starts increasing too.
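For anyone who wants to try a similar stress test, here is a minimal sketch of such a loop. It is not the script described above, only an illustration under stated assumptions: the server is assumed to listen for AMCP on the default TCP port 5250, and the host, the channel/layer (1-10) and the clip name (AMB) are placeholders to replace with your own.

```python
import socket
import time

AMCP_PORT = 5250  # assumed default AMCP port; change it if your server config differs


def build_load_command(channel: int, layer: int, clip: str) -> bytes:
    """Return one AMCP LOAD command, terminated with CRLF."""
    return f'LOAD {channel}-{layer} "{clip}"\r\n'.encode("utf-8")


def hammer_load(host: str, clip: str, iterations: int,
                interval_s: float = 0.2, channel: int = 1, layer: int = 10) -> None:
    """Send LOAD repeatedly to one channel/layer, pausing interval_s between sends.

    channel, layer and clip are placeholder values, not taken from the thread.
    """
    with socket.create_connection((host, AMCP_PORT), timeout=5) as sock:
        for _ in range(iterations):
            sock.sendall(build_load_command(channel, layer, clip))
            sock.recv(4096)  # read the reply so the receive buffer does not fill up
            time.sleep(interval_s)
```

Calling `hammer_load("127.0.0.1", "AMB", iterations=10_000)` would send a LOAD roughly every 200 ms; watch the process working set and CPU load in your usual monitoring tool while it runs.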

I have analysed the application using various tools and confirmed that it is working well and not leaking any threads or objects on the heap (with the exception of one rare bug that I have addressed - not relevant to this problem) which is great news but frustrating in terms of finding the problem. I recently tried running Windows Performance Analyzer and finally found a clue. By comparing CPU usage early and late in a run it was apparent that an increasing amount of time was spent in the TBB library and with cleaning up thread local storage. With some very simple (and not production ready!) hacking I removed the TBB thread parallel optimisations in the ffmpeg producer and the memory and CPU growth problem disappeared.

I don't believe there is anything wrong with the CasparCG code that uses this library so my next step will be to get an updated version of the TBB library and try again with that. The release notes mention some bugfixes that may be relevant. Intel have now wrapped it into their new oneAPI product and installing that failed for me just now. If anyone here has experience of this library (@ronag?) I'd be grateful for any pointers for how you cooked it / downloaded it last time.

ronag commented 3 years ago

Try skipping the custom tbb stuff and use the regular ffmpeg thread pool?

scriptorian commented 3 years ago

Thanks @ronag. If you are referring to the override of AVFilterGraph::execute that is currently using TBB as the custom multithreading implementation then yes, I have turned this off. The real difference with this problem, though, is in the tbb::parallel_invoke and tbb::parallel_for_each calls in av_producer and av_util. Removing these stops the problem, and removing just one of them halves the rate of growth!

ronag commented 3 years ago

For now just remove the tbb stuff. We can follow up with another PR with an updated tbb version later.

ronag commented 3 years ago

I don't know how to update tbb at the moment since intel wrapped it into oneAPI.

ronag commented 3 years ago

On Windows you can also try https://docs.microsoft.com/en-us/cpp/parallel/concrt/how-to-write-a-parallel-for-loop?view=msvc-160

ronag commented 3 years ago

Do we know if this problem occurs on Linux?

scriptorian commented 3 years ago

Thanks for the suggestions. I've got hold of the latest tbb now and I think the best approach is to push through with trying that. If the problem has gone away then there are no code changes (any tbb interface changes notwithstanding) and linux should continue to work - hopefully without any problems. Any other approach would require a fair amount of code changes with potentially surprising impacts on performance and that seems like something to avoid if possible.

TondaKrist commented 3 years ago

Sorry, is it something we can fix via some TBB tweaking in Windows, or not?

scriptorian commented 3 years ago

I have now downloaded and built with the latest TBB library from the Intel oneAPI product. There were some API changes but dealing with these was straightforward and should be safe. The good news is that this completely fixed the growing CPU and memory problems. I have left my test script running for a good long time and everything stayed very steady.
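If you want to put a number on "steady" in your own long-running test, one option is to log the working set (or CPU load) at regular intervals and fit a straight line through the samples. The helper below is a rough sketch, not something used in this investigation; the function name and the sample values are made up for illustration.

```python
def growth_per_sample(samples: list[float]) -> float:
    """Least-squares slope of the samples against their index (units per sample)."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    num = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(samples))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


# Hypothetical working-set readings in MB, taken at a fixed interval
leaking = [100.0, 110.0, 120.0, 130.0]
steady = [100.0, 100.0, 100.0, 100.0]
```

A slope that stays near zero over a long run suggests no linear growth, while a clearly positive slope is the pattern described earlier in this thread. Short-term noise can still hide a slow leak, so longer runs give a more trustworthy figure.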

TondaKrist commented 3 years ago

Awesome, will it be included in a future build of CasparCG? Or can you please provide your build for long-term testing?

scriptorian commented 3 years ago

We are just discussing how to progress with testing this change and whether to make a beta version. Does anyone here have any thoughts? I'll update this thread when we have a plan!

hummelstrand commented 3 years ago

Please beta test and report any issues here! https://github.com/CasparCG/server/releases/tag/v2.3.2-lts-beta

dimitry-ishenko commented 3 years ago

Is this something to worry about on Linux? (Running NRK version).

scriptorian commented 3 years ago

It's not clear whether the TBB bug also exists in the Linux version. The TBB release notes include some mentions of fixing relevant bugs in the Windows version so there is reasonable hope that this problem won't affect Linux. The updated TBB library is available for Linux so it should be straightforward to make an updated build if problems appear.

hummelstrand commented 3 years ago

Is this something to worry about on Linux? (Running NRK version).

The latest NRK version of CasparCG Server is v2.1, so it is not affected by this bug which seems to have been introduced in v2.2.

dimitry-ishenko commented 3 years ago

OK I get it. Thank you @scriptorian and @hummelstrand

martastain commented 3 years ago

Just FYI: It seems there is no problem with increasing CPU load on 2.3.2 beta on Windows 10 (yellow lines). There is just a slight memory usage increase over time but from my experience, it will eventually drop.

Green lines belong to a custom 2.3.0 build running on Debian. Both servers use LOADBG/AUTO to play mixed (Linux) and XDCAM HD (Windows) playlists.

shot-justreadtheinstructions-20210129-111714

TondaKrist commented 3 years ago

I can confirm that this build fixes the CPU usage leak on Windows (both Intel and AMD machines, currently running 24/7 for 5 days). Thanks guys, awesome job on the investigation and fix.

Unfortunately I have experienced a memory leak on the GPU when HTML template GPU acceleration is enabled. I will start a new thread for that.

ronag commented 3 years ago

Unfortunately I have experienced a memory leak on the GPU when HTML template GPU acceleration is enabled.

I have also encountered this.

dotarmin commented 3 years ago

@TondaKrist or @ronag, can you please create an issue for this if not already done? Thanks

Never mind, already done, thanks!

sendust commented 3 years ago

This is off-topic, but the beta version v2.3.2-lts-beta also has audio issues on systems that use a 1001-based standard. #1326 already has a solution to the audio issue, and I hope users on NTSC can take part in this test. Thanks!