NVIDIA / VideoProcessingFramework

Set of Python bindings to C++ libraries which provides full HW acceleration for video decoding, encoding and GPU-accelerated color space and pixel format conversions
Apache License 2.0

Decoding speed is not as good as expected #570

Open brucechin opened 4 months ago

brucechin commented 4 months ago

I want to leverage this framework to accelerate an old CPU FFmpeg workflow, so I did some benchmarking.

The ffmpeg execution flow is:

        # Built with the ffmpeg-python bindings ("import ffmpeg"); file_path is the input video.
        command = (
            ffmpeg.input(file_path)
            .filter("select", "not(mod(n, 3))")
            .filter("scale", w="if(gt(iw,ih),-1,360)", h="if(gt(iw,ih),360,-1)")
            .output(
                "pipe:1",
                format="image2pipe",
                vcodec="mjpeg",
                vsync="vfr",
                qscale=2,
                threads=4,
            )
            .compile()
        )
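
For reference, a minimal sketch of how the compiled argument list above might be consumed; the subprocess call and the naive JPEG-marker splitting are illustrative assumptions, not part of the original benchmark:

    import subprocess

    # Run the compiled ffmpeg command and capture the MJPEG byte stream from pipe:1.
    proc = subprocess.run(command, capture_output=True, check=True)

    # Naive split of the image2pipe output on JPEG SOI/EOI markers.
    # Fine for a sketch; production code should parse the stream incrementally.
    frames = []
    data, start = proc.stdout, 0
    while True:
        soi = data.find(b"\xff\xd8", start)    # JPEG start-of-image
        if soi < 0:
            break
        eoi = data.find(b"\xff\xd9", soi + 2)  # JPEG end-of-image
        if eoi < 0:
            break
        frames.append(data[soi:eoi + 2])
        start = eoi + 2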

For simplicity, on the GPU implementation side I did:

        import numpy as np
        import PyNvCodec as nvc  # VPF Python bindings

        decoder = nvc.PyNvDecoder(video_file, gpu_id)
        width = decoder.Width()
        height = decoder.Height()
        frame_count = 0
        # Host-side buffer; DecodeSingleFrame copies each decoded frame to system memory.
        raw_frame = np.zeros((height, width, 3), np.uint8)
        while True:
            # Decode the frame
            success = decoder.DecodeSingleFrame(raw_frame)
            if not success:
                break
            frame_count += 1

For a ~10000-frame 1920x1080 video, I ran the code on a 64-core Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz and an NVIDIA T4 GPU, and obtained the following results:

Total videos decoded: 1
Total frames decoded: 9712
Total NV Framework decoding time: 15.64 seconds
Total ffmpeg CPU decoding time: 14.27 seconds

According to nvidia-smi dmon, the T4's dec (NVDEC) unit utilization is quite high throughout the decode run:

gpu   pwr gtemp mtemp    sm   mem   enc   dec  mclk  pclk
Idx     W     C     C     %     %     %     %   MHz   MHz
    0    52    57     -    48    20     0    74  5000  1590
    0    55    57     -    35    18     0    84  5000  1590
    0    61    58     -    30    19     0   100  5000  1590
    0    52    57     -    21    17     0    99  5000  1590
    0    53    58     -    17    15     0    93  5000  1590
    0    48    58     -    46    24     0   100  5000  1590
    0    60    60     -    37    19     0    90  5000  1590
    0    51    59     -    52    23     0    96  5000  1590
    0    56    59     -    36    18     0    83  5000  1590
    0    55    59     -    67    28     0   100  5000  1590
    0    60    60     -    46    23     0    97  5000  1590
    0    53    59     -    25    17     0    96  5000  1590
    0    56    59     -    63    25     0   100  5000  1590

Please note that on the ffmpeg side, a single run cannot fully leverage the 64 cores; if I create a thread pool and use it to process many videos in parallel, the overall speed could be ~7-8 times faster. May I ask if this is expected? On the T4 side I only called DecodeSingleFrame and did not implement the full video transformation logic, yet the speed is not as good as I expected. I thought the T4 could be ~10x faster than the current number; otherwise, switching from the CPU ffmpeg workflow to GPU decoding does not bring much benefit.

cc @RomanArzumanyan @gedoensmax. If this single-frame GPU decode performance is expected, do you have any other GPU acceleration suggestions for me? I thought there might be something like batch processing to improve the overall decoding throughput, but I have not found it. I really appreciate your help!
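
A minimal sketch of what batch-style throughput could look like, assuming several independent decode sessions are run concurrently on one GPU (the PyNvCodec import name follows the VPF samples; the worker count and video paths are illustrative assumptions, not a documented batch API):

    from concurrent.futures import ThreadPoolExecutor
    import PyNvCodec as nvc

    def count_frames(path, gpu_id=0):
        # Each worker owns its own decoder, i.e. its own decode session.
        dec = nvc.PyNvDecoder(path, gpu_id)
        n = 0
        while True:
            surf = dec.DecodeSingleSurface()
            if surf.Empty():
                break
            n += 1
        return n

    videos = ["video_0.mp4", "video_1.mp4", "video_2.mp4"]  # hypothetical paths
    with ThreadPoolExecutor(max_workers=4) as pool:
        totals = list(pool.map(count_frames, videos))

How well this scales depends on whether the bindings release the GIL during decode calls; if they do not, a process pool would be the fallback.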

RomanArzumanyan commented 4 months ago

Hi @brucechin

There's no active development going on in the VPF repo. Please check out VALI, which is a VPF spin-off: https://github.com/RomanArzumanyan/VALI. It's actively developed and maintained, and has compatible API and module naming.

> If this GPU decode single frame performance is expected

In your code snippet you are decoding video frames to system memory. That drastically slows down the decoding speed because every time you get a decoded surface, you're doing a blocking CUDA DtoH memcpy. Please keep your decoded frames in GPU memory as long as possible:

decoder = nvc.PyNvDecoder(video_file, gpu_id)
frame_count = 0
while True:
    # The decoded surface stays in GPU memory; no DtoH copy is made.
    surface = decoder.DecodeSingleSurface()
    if surface.Empty():
        break
    frame_count += 1

PyNvDecoder has a DecodeSingleFrame method for convenience, e.g. when you need to dump raw frames to disk or as a fallback. It's not designed for performance.
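
If the frames do eventually need to reach the CPU, one possible pattern (sketched below under the assumption that the resize/convert/download constructors and Execute signatures match the VPF samples; they can differ between versions) is to resize and convert to RGB on the GPU first, so the only DtoH copy per frame is the small final frame:

    import numpy as np
    import PyNvCodec as nvc

    gpu_id = 0
    dec = nvc.PyNvDecoder(video_file, gpu_id)  # video_file as in the snippet above
    dst_w, dst_h = 640, 360  # assumed target size, mirroring the 360p CPU workflow

    # GPU-side resize (NV12) and NV12 -> RGB conversion, then one small DtoH copy per frame.
    resizer = nvc.PySurfaceResizer(dst_w, dst_h, dec.Format(), gpu_id)
    to_rgb = nvc.PySurfaceConverter(dst_w, dst_h, nvc.PixelFormat.NV12, nvc.PixelFormat.RGB, gpu_id)
    cc_ctx = nvc.ColorspaceConversionContext(nvc.ColorSpace.BT_601, nvc.ColorRange.MPEG)
    downloader = nvc.PySurfaceDownloader(dst_w, dst_h, nvc.PixelFormat.RGB, gpu_id)

    rgb_frame = np.ndarray(shape=(dst_h * dst_w * 3,), dtype=np.uint8)
    while True:
        surf = dec.DecodeSingleSurface()
        if surf.Empty():
            break
        small = resizer.Execute(surf)
        rgb = to_rgb.Execute(small, cc_ctx)
        if not downloader.DownloadSingleSurface(rgb, rgb_frame):
            break
        # rgb_frame now holds the resized RGB frame in host memory.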

brucechin commented 4 months ago

@RomanArzumanyan Thank you for your reply. Yes, that is true: via Nsight Systems profiling I observe that the DtoH copy is the dominant overhead inside the GPU processing (see figure 1). But the Python-level overhead seems to be ~10x larger than the CUDA-level execution (see figure 2).

[Figure 1: Nsight Systems timeline showing the DtoH memcpy dominating GPU-side time]

[Figure 2: Nsight Systems timeline showing Python-level time ~10x the CUDA-level execution]

Have you encountered a similar bottleneck from Python before? Is it possible to use the C++ API to write the decoding pipeline in order to avoid the Python-level overhead?

I will also check the new repo!

RomanArzumanyan commented 4 months ago

Hi @brucechin

I’ve never seen such high Python-side latency before. BTW, there’s a wiki page with a performance analysis: https://github.com/NVIDIA/VideoProcessingFramework/wiki/VPF-Performance-analysis

Have you seen it?

brucechin commented 4 months ago

@RomanArzumanyan I think it was caused by disk IO. I need to somehow pipeline the stages. Let me check this doc too! Thank you for your help here!
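
For the pipelining part, a minimal producer/consumer sketch using the standard library, assuming the goal is to overlap decoding with the disk I/O for the decoded output (the queue size and the two-stage split are arbitrary choices for illustration, and the host-memory decode path is used only to keep the example short):

    import queue
    import threading
    import numpy as np
    import PyNvCodec as nvc

    frame_q = queue.Queue(maxsize=64)    # bounded queue applies back-pressure to the decoder

    def decode_stage(path, gpu_id=0):
        dec = nvc.PyNvDecoder(path, gpu_id)
        raw = np.ndarray(shape=(0,), dtype=np.uint8)
        while dec.DecodeSingleFrame(raw):
            frame_q.put(raw.copy())      # copy so the reusable buffer can be overwritten
        frame_q.put(None)                # sentinel: end of stream

    def io_stage():
        while True:
            frame = frame_q.get()
            if frame is None:
                break
            # JPEG-encode / write to disk here, off the decode thread.

    # video_file: path to the input video
    t_dec = threading.Thread(target=decode_stage, args=(video_file,))
    t_io = threading.Thread(target=io_stage)
    t_dec.start(); t_io.start()
    t_dec.join(); t_io.join()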