ffmpeginteropx / FFmpegInteropX

FFmpeg decoding library for Windows 10 UWP and WinUI 3 Apps
Apache License 2.0

Hardware Decoding 10x faster than Software Decoding? #443

Open softworkz opened 23 hours ago

softworkz commented 23 hours ago

I'm afraid it's not even close...

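| Scenario | PC Speed | PC CPU | PC GPU (3D) | Laptop Speed | Laptop CPU | Laptop GPU (3D) |
|---|---|---|---|---|---|---|
| SW Decoding | 9.05 | 100% | 0% | 3.75 | 100% | 0% |
| Intel | | | | | | |
| HW Decoding | 7.87 | 25% | 0% | 7.66 | 20% | 0% |
| HW Decoding + HW Download | 3.68 | 32% | 80% | 5.5 | 51% | 61% |
| HW Decode, HW Download, Subtitles, HW Upload, HW Encode | 1.95 | 40% | 80% | 1.89 | 48% | 61% |
| SW Decode, Subtitles, HW Upload, HW Encode | 3.54 | 75% | 60% | 2.79 | 100% | 40% |
| Nvidia | | | | | | |
| HW Decoding | 7.6 | 8% | 79% | | | |
| HW Decoding + HW Download | 7.58 | 12% | 90% | | | |
| HW Decode, HW Download, Subtitles, HW Upload, HW Encode | 1.34 | 25% | 20% | | | |
| SW Decode, Subtitles, HW Upload, HW Encode | 4.65 | 60% | 100% | | | |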
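
To put the table in terms of factors, here is a small sketch that divides each Intel speed by the SW decoding speed of the same machine. It assumes that the first three numbers of each row belong to the PC and the next three to the laptop; the dictionary keys are made up for this example.

```python
# Speed values copied from the results table (ffmpeg "speed" factor).
# Assumption: first three numbers per row = PC, next three = laptop.
SPEED = {
    "pc": {
        "sw_decode": 9.05,
        "intel_hw_decode": 7.87,
        "intel_hw_decode_download": 3.68,
    },
    "laptop": {
        "sw_decode": 3.75,
        "intel_hw_decode": 7.66,
        "intel_hw_decode_download": 5.5,
    },
}


def ratio_vs_sw(machine: str, scenario: str) -> float:
    """Return the scenario speed divided by the SW decoding speed (< 1 means slower)."""
    row = SPEED[machine]
    return row[scenario] / row["sw_decode"]


for machine in SPEED:
    for scenario in ("intel_hw_decode", "intel_hw_decode_download"):
        factor = ratio_vs_sw(machine, scenario)
        print(f"{machine:7s} {scenario:26s} {factor:.2f}x of SW decoding")
```

On the PC both factors are below 1, and on the laptop neither gets anywhere near 10.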

Reproducing

Here's an Excel file including all the ffmpeg commands: SubtitleBurnInTests.xlsx

The test.mkv is "Samsung Dubai", which you can find on DemoLandia.net. Subtitles file: subs.zip

Assessing the Results

General

First of all, this aligns exactly with what I wrote in this comment (https://github.com/ffmpeginteropx/FFmpegInteropX/issues/439#issuecomment-2484469788) and the subsequent posts below it.

My laptop (Tiger Lake) has similar graphics to my PC (Rocket Lake), which is why the results are similar. Unfortunately, the other laptop I have is too old for this. Feel free to run these tests on weaker machines. You will see somewhat different results and relations, but all of the following conclusions will generally hold true (exceptions are always possible).

Transcoding performance results often appear odd and unexpected. It is important to understand that:

This is why I said that you cannot reasonably talk in factors when trying to make comparisons in this area.

Observations

Yet my statements aren't based on synthetic test results. Up until 2 or 3 years ago, we regularly received user reports about stuttering audio, which turned out to be caused by transcodes with subtitle burn-in. After half a year of research and testing, we changed the stable release to use sw decoding instead of hw decoding in those specific cases. Since that change, we have rarely seen such reports. Many users run our server on NAS devices with older, non-high-end CPUs, and this change has helped lift the transcoding speed over the critical bar (1.0x; anything below cannot play fluidly) for many of them.

Final Notes

Nr 1

If it was in the context of FFmpegInteropX playback that you got that "10x faster" impression, then you might have overlooked the following:

When comparing decoding speed while switching FFmpegInteropX between hw decoding and ffmpeg sw decoding, you are not comparing "sw decoding" to "hw decoding". Instead you are actually comparing "sw decoding + hw uploading" to "hw decoding without data transfer".

Nr 2

"Some things you wrote just do not sound right."

Yes, that's why I wanted to tell you about it. There's not much point in telling you things you already know 😆

But it's not about "I know something that you don't know" - it's about knowledge transfer. Since FFmpegInteropX is driving our Xbox app now, we have a natural interest in making it even better.

brabebhin commented 21 hours ago

What does "speed" mean? Is it GHz? Time it takes to do something?

softworkz commented 21 hours ago

The speed is a value that ffmpeg outputs. It indicates how fast ffmpeg progresses through the file relative to realtime playback. At 1.0x, processing the file takes the same amount of time as its playback duration. So anything below 1.0x cannot be played smoothly without interruption.
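
As a rough illustration of that definition, the sketch below reads the `speed=` field from an ffmpeg progress line and checks it against the 1.0x bar. The function names and the sample line are made up for this example; only the `speed=...x` field itself is what ffmpeg prints.

```python
import re
from typing import Optional

# Matches the "speed=" field of an ffmpeg progress line, e.g. "speed=3.68x".
# ffmpeg may pad the number with spaces, and prints "speed=N/A" early on.
_SPEED_RE = re.compile(r"speed=\s*([0-9]*\.?[0-9]+)x")


def parse_speed(progress_line: str) -> Optional[float]:
    """Return the speed factor from a progress line, or None if not available."""
    match = _SPEED_RE.search(progress_line)
    return float(match.group(1)) if match else None


def speed_factor(media_seconds: float, wall_seconds: float) -> float:
    """Seconds of media processed per second of wall-clock time."""
    return media_seconds / wall_seconds


def can_play_in_realtime(speed: float) -> bool:
    """Below 1.0x the transcode falls behind playback."""
    return speed >= 1.0


# Hypothetical progress line in the usual ffmpeg layout.
sample = "frame= 1200 fps= 92 q=-0.0 size=   10240kB time=00:00:48.00 bitrate=1747.6kbits/s speed=3.68x"
speed = parse_speed(sample)
print(speed, can_play_in_realtime(speed))
```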

softworkz commented 21 hours ago

We actually have a plugin for the server which is able to generate (depending on available GPUs) and run such tests automatically. Here's another set of figures for subtitle burn-in:

[Image: test results for subtitle burn-in]

...showing that sw decoding is faster than hwdecoding+download.

softworkz commented 21 hours ago

The two videos here show these tests in action, including a visualization of the transcoding pipeline topology: https://github.com/softworkz/SubtitleFilteringDemos/tree/master/TestRun1?rgh-link-date=2024-11-19T10%3A47%3A59Z

softworkz commented 19 hours ago

For convenience, all-in-one (ff binary + test files): https://1drv.ms/u/c/8a9863d7afb15f9b/Ebkn1YXBEBlMutUN5NuzO1oB7NSZNnrKyNdpJtJqKkrxhw?e=jqUq9B

Just unzip and you can run the commands in the Excel file.

lukasf commented 18 hours ago

@softworkz These are interesting results. But you are totally missing the point. The discussion was about efficiency (power consumption) - not decoding speed. I never said that a HW unit can decode 10x as fast as a CPU. That would be ridiculous - why would they add such an overpowered HW unit, wasting silicon area? I said that it uses 10x less energy during playback, thus saving battery and keeping the device cool. And I gave you proof that the numbers are actually even higher for most modern codecs.

It is a rule of thumb that an ASIC can perform an algorithm about an order of magnitude more efficiently than a general purpose CPU (of course, strongly depending on the actual algorithm and implementation details). That's why they are used so often. Modern CPUs have also started integrating ASICs for crypto algorithms, since these are becoming more and more common and would eat a considerable amount of CPU power without ASIC support. CPU manufacturers surely would not do that if it did not have a considerable effect. And crypto mining of the popular coins is mainly done on ASICs now, since they are so much more efficient, allowing much higher revenues than GPU mining.

lukasf commented 18 hours ago

I wonder if it would be possible to speed up hwdownload and hwupload by parallelizing it (doing multiple downloads and uploads at the same time). Theoretically, the speed of PCIe 3.0 x16 should be high enough for download+upload of 4K frames in real time. And when running on an iGPU, things do not even have to go through PCIe. Why is it so slow then?
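
A back-of-the-envelope check of the PCIe claim, as a sketch only. The assumptions are mine, not from the tests above: 3840x2160 frames in NV12 (8-bit 4:2:0, 1.5 bytes per pixel), 60 fps, and the raw PCIe 3.0 line rate (8 GT/s per lane with 128b/130b encoding), ignoring protocol overhead.

```python
def frame_bytes(width: int, height: int, bytes_per_pixel: float = 1.5) -> int:
    """Size of one uncompressed frame; 1.5 bytes/pixel corresponds to NV12."""
    return int(width * height * bytes_per_pixel)


def pcie3_bytes_per_second(lanes: int = 16) -> float:
    """Raw PCIe 3.0 rate per direction: 8 GT/s per lane, 128b/130b encoding."""
    bits_per_second_per_lane = 8e9 * (128 / 130)
    return bits_per_second_per_lane / 8 * lanes


def roundtrip_bytes_per_second(width: int, height: int, fps: float) -> float:
    """Bytes moved per second when every frame is downloaded and uploaded once."""
    return 2 * frame_bytes(width, height) * fps


needed = roundtrip_bytes_per_second(3840, 2160, 60)
available = pcie3_bytes_per_second(16)
print(f"needed:    {needed / 1e9:.2f} GB/s")
print(f"available: {available / 1e9:.2f} GB/s")
print(f"share of link used: {needed / available:.1%}")
```

Under these assumptions the round trip needs only a small fraction of the link, which matches the point that raw PCIe bandwidth alone does not explain the slowdown.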

FFmpeg 7 does run filters in a graph in parallel, but a single hwdownload or hwupload does only process one frame at a time.

brabebhin commented 18 hours ago

The PCIe 3 port is not the bottleneck; a 4090 can barely saturate it. Some of the overhead comes from DirectX11 itself. There are some optimizations that can be done with iGPUs, but these are only available in DirectX12 IIRC. Even so, DRAM and memory buses to the CPU are significantly slower than what a dedicated GPU can achieve.

softworkz commented 18 hours ago

"I wonder if it would be possible to speed up hwdownload and hwupload by parallelizing it (doing multiple downloads and uploads at the same time). Theoretically, the speed of PCIe 3.0 x16 should be high enough for download+upload of 4K frames in real time."

There's locking for D3D11 frame access in ffmpeg. I have removed that in our ffmpeg, because ffmpeg filtering is (was) single-threaded, but that brought just a small improvement in certain cases. Anyway, D3D11 doesn't support multi-threading (access from multiple threads yes, but the calls must be serialized AFAIR). D3D12 supports real multi-threading.
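
To illustrate what "access from multiple threads, but serialized" means in practice, here is a toy sketch. `SerializedDevice` is a hypothetical stand-in, not ffmpeg or D3D11 code: every call takes the same lock, so adding threads never results in two copies running at once.

```python
import threading


class SerializedDevice:
    """Hypothetical stand-in for a device context whose calls must be serialized."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._active = 0
        self.max_concurrent = 0  # highest number of simultaneous copies observed
        self.copies_done = 0

    def copy_frame(self, frame: bytearray) -> bytes:
        # Every caller has to take the same lock, so copies run one at a time.
        with self._lock:
            self._active += 1
            self.max_concurrent = max(self.max_concurrent, self._active)
            result = bytes(frame)  # stands in for the actual frame transfer
            self.copies_done += 1
            self._active -= 1
            return result


def run_workers(device: SerializedDevice, n_threads: int = 4, frames_per_thread: int = 50) -> None:
    """Start several threads that all push frames through the same device."""
    frame = bytearray(1024)

    def worker() -> None:
        for _ in range(frames_per_thread):
            device.copy_frame(frame)

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


device = SerializedDevice()
run_workers(device)
print(device.copies_done, device.max_concurrent)  # prints "200 1"
```

All 200 copies complete, but never more than one at a time, so the extra threads only add contention for the lock.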

"I wonder"

What I've often wondered is why they can't just remap the memory instead of copying in the case of iGPUs. It's the same memory anyway.

"FFmpeg 7 does run filters in a graph in parallel, but a single hwdownload or hwupload does only process one frame at a time."

I have not worked with the code from newer versions, but running filters in parallel can only mean that multiple filters can execute in parallel. From the architecture, it's not possible to have a single filter executing in parallel.
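
A simplified model of what that implies for throughput, under the assumption that each filter handles one frame at a time and filters can overlap with each other: the pipeline can then go no faster than its slowest stage. The per-frame times below are made-up illustration values, not measurements.

```python
from typing import Dict


def sequential_fps(stage_seconds: Dict[str, float]) -> float:
    """Frames per second when all stages run one after another for each frame."""
    return 1.0 / sum(stage_seconds.values())


def pipelined_fps(stage_seconds: Dict[str, float]) -> float:
    """Frames per second when stages overlap but each handles one frame at a time."""
    return 1.0 / max(stage_seconds.values())


def bottleneck(stage_seconds: Dict[str, float]) -> str:
    """Name of the stage that limits the pipelined throughput."""
    return max(stage_seconds, key=stage_seconds.get)


# Hypothetical per-frame times in seconds (NOT measured values).
stages = {
    "decode": 0.002,
    "hwdownload": 0.008,
    "subtitles": 0.003,
    "hwupload": 0.006,
    "encode": 0.004,
}

print(f"sequential: {sequential_fps(stages):.1f} fps")
print(f"pipelined:  {pipelined_fps(stages):.1f} fps, limited by {bottleneck(stages)}")
```

In this model, running filters in parallel helps, but a slow hwdownload or hwupload still caps the whole graph.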

brabebhin commented 17 hours ago

Remapping for iGPUs is available in DirectX12.

softworkz commented 17 hours ago

Oh, and there's a doubling involved. When you upload or download a D3D texture, you get a pointer in CPU memory for accessing the data, but you don't "own" the data, so you need to copy it to or from your own memory range.

It doesn't double PCIe bandwidth, but memory bandwidth. And CPU time for copying.
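
The ownership problem can be mimicked in a few lines. This is only an analogy: `StagingTexture` is a hypothetical class, and a Python `memoryview` plays the role of the mapped pointer, which is valid only while the texture is mapped.

```python
from contextlib import contextmanager
from typing import Iterator


class StagingTexture:
    """Hypothetical stand-in for a CPU-readable texture owned by the driver."""

    def __init__(self, data: bytes) -> None:
        self._storage = bytearray(data)  # memory we do not own

    @contextmanager
    def mapped(self) -> Iterator[memoryview]:
        """Hand out a view that is only valid until the texture is unmapped."""
        view = memoryview(self._storage)
        try:
            yield view
        finally:
            view.release()  # "unmap": the view must not be used afterwards


def download(texture: StagingTexture) -> bytes:
    """Copy the mapped data into memory we own; this is the extra copy."""
    with texture.mapped() as view:
        return bytes(view)


tex = StagingTexture(b"\x10" * 16)
frame = download(tex)  # our own copy, still usable after the unmap

with tex.mapped() as leaked:
    pass  # keeping the view past the unmap instead of copying

try:
    leaked[0]
except ValueError:
    print("view is invalid after unmap, so the copy is required")
```

The copy in `download` is the second pass over the frame data: it costs memory bandwidth and CPU time, but no additional PCIe traffic.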

softworkz commented 17 hours ago

There's also the requirement of using array textures with D3D11, which is why it's slower than DXVA2. Or wait, I think that was just a requirement for QSV with D3D11.