Poorer than expected performance on ffmpeg nvtegra decoder

averne commented 1 week ago

Reposting emailed issue by @theofficialgman:

I have built the nvtegra decoders from https://github.com/averne/FFmpeg/tree/nvtegra-upstreaming on Switchroot Ubuntu Noble 24.04 and am benchmarking them against a ffmpeg implementation of a wrapper around nvidia's nvv4l2 (written primarily by CTCaer) https://github.com/theofficialgman/FFmpeg/tree/6.1.1-nvv4l2

Performance on your decoder is very subpar and results in a lot of CPU utilization (100% on one thread). I believe the issue is indicated by the following repeated log

[AVHWFramesContext @ 0x7f805a55a0] Frame address/pitch not aligned to 256, falling back to cpu transfer

On nvv4l2 there are no such logs or high CPU utilization.

For reference, here are the performance numbers I am measuring from https://repo.jellyfin.org/jellyfish/ test videos

https://repo.jellyfin.org/jellyfish/media/jellyfish-120-mbps-4k-uhd-hevc-10bit.mkv

nvtegra: fps 31

clocks: NVDEC 488
command: ./ffmpeg -hwaccel nvtegra -i ../jellyfish-120-mbps-4k-uhd-hevc-10bit.mkv -f null -

nvv4l2: fps 66

clocks: NVDEC 716, VIC 192

https://repo.jellyfin.org/jellyfish/media/jellyfish-120-mbps-4k-uhd-h264.mkv

nvtegra: fps 56

clocks: NVDEC 716
command: ./ffmpeg -hwaccel nvtegra -i ../jellyfish-120-mbps-4k-uhd-h264.mkv -f null -

nvv4l2: fps 62

clocks: NVDEC 716, VIC 307

If there is a better place to have this discussion let me know because your github repo does not have issues enabled.

Thanks,

theofficialgman

theofficialgman commented 1 week ago

the performance gap is larger with lower resolutions (maybe bitrate too?) https://repo.jellyfin.org/jellyfish/media/jellyfish-10-mbps-hd-hevc-10bit.mkv

nvtegra: fps 118

clocks: NVDEC 448
command: ./ffmpeg -hwaccel nvtegra -i ../jellyfish-10-mbps-hd-hevc-10bit.mkv -f null -

nvv4l2: fps 273
clocks: NVDEC 716 VIC 192

theofficialgman commented 1 week ago

if you have links to the video files used in your benchmarking here https://lists.mplayerhq.hu/pipermail/ffmpeg-devel/2024-May/328549.html that would be interesting for me as well

averne commented 1 week ago

If there is a better place to have this discussion let me know because your github repo does not have issues enabled.

Sorry, I forked this repo from the FFmpeg mirror which doesn't have issues enabled, and didn't notice until now.

[AVHWFramesContext @ 0x7f805a55a0] Frame address/pitch not aligned to 256, falling back to cpu transfer

Indeed that would be the problematic code path. FFmpeg needs to "download" the decoded data from the hardware engine, in this case going through a block linear to pitch linear layout conversion. At the moment two methods are implemented: when some constraints are satisfied, the transfer is accelerated using the VIC engine. Otherwise, it falls back to a cpu copy, which of course is far less performant. See here: https://github.com/averne/FFmpeg/blob/nvtegra-upstreaming/libavutil/hwcontext_nvtegra.c#L1009-L1015

To enable the fast path, you can apply the following patch to FFmpeg:

diff --git a/libavutil/frame.c b/libavutil/frame.c
index 0775e2abd9..0b0a34f0e8 100644
--- a/libavutil/frame.c
+++ b/libavutil/frame.c
@@ -212,7 +212,7 @@ static int get_video_buffer(AVFrame *frame, int align)
         total_size += sizes[i];
     }

-    frame->buf[0] = av_buffer_alloc(total_size);
+    frame->buf[0] = av_buffer_aligned_alloc(total_size, 0x100);
     if (!frame->buf[0]) {
         ret = AVERROR(ENOMEM);
         goto fail;
diff --git a/libavutil/hwcontext.c b/libavutil/hwcontext.c
index 8dd05147a4..a650d66c9a 100644
--- a/libavutil/hwcontext.c
+++ b/libavutil/hwcontext.c
@@ -416,7 +416,7 @@ static int transfer_data_alloc(AVFrame *dst, const AVFrame *src, int flags)
     frame_tmp->width  = ctx->width;
     frame_tmp->height = ctx->height;

-    ret = av_frame_get_buffer(frame_tmp, 0);
+    ret = av_frame_get_buffer(frame_tmp, 0x100);
     if (ret < 0)
         goto fail;

And if you're using mpv, you can add this:

diff --git a/video/mp_image.c b/video/mp_image.c
index a89762b..af98703 100644
--- a/video/mp_image.c
+++ b/video/mp_image.c
@@ -175,7 +175,7 @@ static bool mp_image_alloc_planes(struct mp_image *mpi)
         return false;

     // Note: mp_image_pool assumes this creates only 1 AVBufferRef.
-    mpi->bufs[0] = av_buffer_alloc(size + align);
+    mpi->bufs[0] = av_buffer_aligned_alloc(size + align, 0x100);
     if (!mpi->bufs[0])
         return false;

diff --git a/video/mp_image.h b/video/mp_image.h
index af0d9fd..c047a45 100644
--- a/video/mp_image.h
+++ b/video/mp_image.h
@@ -32,7 +32,7 @@
 // libraries except libavcodec don't really know what alignment they want.
 // Things will randomly crash or get slower if the alignment is not satisfied.
 // Whatever. This value should be pretty safe with current CPU architectures.
-#define MP_IMAGE_BYTE_ALIGN 64
+#define MP_IMAGE_BYTE_ALIGN 256

 #define MP_IMGFIELD_TOP_FIRST 0x02
 #define MP_IMGFIELD_REPEAT_FIRST 0x04

Another consideration regarding your samples is that 10-bit hwdownloads are split in two VIC transfers due to hardware limitations, which impacts the throughput. If you want to check the actual performance of the decode engine, you can add an early exit to the transfer function. With this I find:

jellyfish-120-mbps-4k-uhd-hevc-10bit.mkv: 75fps
jellyfish-10-mbps-hd-hevc-10bit.mkv: 350fps (added -stream_loop -1 since the sample is too short to stabilize)

Ultimately though, for the best integration within a video player or another application, you really want a zero-copy pipeline, ie. directly importing decoded frames in your gpu api as a texture. I've written such code for mpv and deko3d on hos, but not for opengl/vulkan on linux. It's probably possible to adapt the dmabuf code for vaapi, but that would need a bit of research.

theofficialgman commented 1 week ago

[AVHWFramesContext @ 0x7f805a55a0] Frame address/pitch not aligned to 256, falling back to cpu transfer

Indeed that would be the problematic code path. FFmpeg needs to "download" the decoded data from the hardware engine, in this case going through a block linear to pitch linear layout conversion. At the moment two methods are implemented: when some constraints are satisfied, the transfer is accelerated using the VIC engine. Otherwise, it falls back to a cpu copy, which of course is far less performant. See here: https://github.com/averne/FFmpeg/blob/nvtegra-upstreaming/libavutil/hwcontext_nvtegra.c#L1009-L1015

Maybe you will find this commit in CTCaer's nvv4l2 decoder relevant for that https://github.com/theofficialgman/FFmpeg/commit/1dc58f75d97751daf38ec16aaed5b9fa2e2d40cb

See the original post for link to the branch itself where you can see the current decoder and encoder implementations.

averne commented 1 week ago

64B alignments are relevant for block linear layouts (and I use it here). 256B is required for pitch linear, which is the format FFmpeg wants when "downloading" frames.

In any case, I can assist with the set up of a nvtegra->opengl interop backend for mpv. You can contact me @ avhe on discord.

theofficialgman commented 1 week ago

To enable the fast path, you can apply the following patch to FFmpeg:

diff --git a/libavutil/frame.c b/libavutil/frame.c
index 0775e2abd9..0b0a34f0e8 100644
--- a/libavutil/frame.c
+++ b/libavutil/frame.c
@@ -212,7 +212,7 @@ static int get_video_buffer(AVFrame *frame, int align)
         total_size += sizes[i];
     }

-    frame->buf[0] = av_buffer_alloc(total_size);
+    frame->buf[0] = av_buffer_aligned_alloc(total_size, 0x100);
     if (!frame->buf[0]) {
         ret = AVERROR(ENOMEM);
         goto fail;
diff --git a/libavutil/hwcontext.c b/libavutil/hwcontext.c
index 8dd05147a4..a650d66c9a 100644
--- a/libavutil/hwcontext.c
+++ b/libavutil/hwcontext.c
@@ -416,7 +416,7 @@ static int transfer_data_alloc(AVFrame *dst, const AVFrame *src, int flags)
     frame_tmp->width  = ctx->width;
     frame_tmp->height = ctx->height;

-    ret = av_frame_get_buffer(frame_tmp, 0);
+    ret = av_frame_get_buffer(frame_tmp, 0x100);
     if (ret < 0)
         goto fail;

would this not cause issues with other ffmpeg decoders? It seems to me like you should be modifying the default output of av_frame_get_buffer if the input is 0 (which currently chooses an optimal value based on CPU https://github.com/FFmpeg/FFmpeg/blob/e61fed8280ccf2fb9e69c8d4e1849be2dcfebd89/libavutil/frame.h#L858-L881, or at least used to until they hardcoded it https://github.com/FFmpeg/FFmpeg/blame/e61fed8280ccf2fb9e69c8d4e1849be2dcfebd89/libavutil/frame.c#L185)

averne commented 1 week ago

I don't see how that could break anything. The first part of the patch sets the alignment of the buffer which is irrelevant for software decoders. The second part is only relevant for hardware decoders, of which there is currently only mine available on tegra, since vaapi and cuvid are (shamefully) not supported.

theofficialgman commented 1 week ago

I don't see how that could break anything. The first part of the patch sets the alignment of the buffer which is irrelevant for software decoders. The second part is only relevant for hardware decoders, of which there is currently only mine available on tegra, since vaapi and cuvid are (shamefully) not supported.

I'd like something that is upstream worthy and included in your patchset https://lists.mplayerhq.hu/pipermail/ffmpeg-devel/2024-May/328549.html.

The point (of me testing this) is to see if its worth including as an option to the (already very well functioning) CTCaer's nvv4l2. We currently force nvv4l2's use (on switchroot ubuntu bionic/jammy/noble) for all suitable decoding which allows every application that uses system ffmpeg (vlc, mpv, dolphin-emu, obs-studio, firefox (soon)) without any changes at the upstream applications. These aren't zero-copy either but they work well enough.

averne commented 1 week ago

Unfortunately the patchset in its current state will not use accelerated transfers unless by chance your media width is aligned to 256 and you luck out on buffer allocations, so I would not consider it suitable yet. Another option would be to use the gpu copy engine for frame transfers, which doesn't suffer from the constraints VIC does. But that would add a heavy amount of code.

averne / FFmpeg

Poorer than expected performance on ffmpeg nvtegra decoder #1