brucechin opened this issue 4 months ago
Hi @brucechin
There's no active development going on in the VPF repo. Please check out VALI, which is a VPF spin-off: https://github.com/RomanArzumanyan/VALI. It's actively developed and maintained, and has a compatible API and module naming.
> If this GPU decode single frame performance is expected
In your code snippet you are decoding video frames to system memory. That drastically slows down the decoding speed because every time you get a decoded surface, you're doing a blocking CUDA DtoH memcpy. Please keep your decoded frames in GPU memory as long as possible:
```python
import PyNvCodec as nvc

decoder = nvc.PyNvDecoder(video_file, gpu_id)
frame_count = 0
while True:
    surface = decoder.DecodeSingleSurface()
    if surface.Empty():
        break
    frame_count += 1
```
`PyNvDecoder` has a `DecodeSingleFrame` method for convenience, e.g. when you need to dump raw frames to disk or as a fallback. It's not designed for performance.
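The cost difference is easy to see even with a host-side analogy: copying a full 1080p frame buffer on every iteration (the analogue of the per-frame DtoH memcpy) dwarfs the cost of leaving the data where it is. A rough, hedged sketch using plain host memory as a stand-in for device memory (the sizes and loop are illustrative only):

```python
import time

FRAME_BYTES = 1920 * 1080 * 3 // 2   # one NV12 1080p frame
frame = bytearray(FRAME_BYTES)       # stand-in for a decoded surface
N = 100

t0 = time.perf_counter()
for _ in range(N):
    touched = frame[0]               # leave the frame in place, just touch it
in_place = time.perf_counter() - t0

t0 = time.perf_counter()
for _ in range(N):
    host_copy = bytes(frame)         # per-frame blocking copy of ~3 MB
copy_each = time.perf_counter() - t0
```

The per-frame copy loop moves hundreds of megabytes; the in-place loop moves almost nothing, which is why keeping surfaces on the GPU matters.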
@RomanArzumanyan Thank you for your reply. Yes, that is true; I observe that DtoH copies are the dominant overhead inside the GPU processing via Nsight Systems profiling (see figure 1). But the Python-level overhead seems to be ~10x larger than the CUDA-level execution (see figure 2).
Have you encountered a similar bottleneck from Python before? Is it possible to use the C++ API to write the decoding pipeline in order to avoid the Python-level overhead?
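One way to separate Python-level overhead from the per-frame decode call itself is to time both inside the same loop. A minimal, hedged sketch; `decode_one` is a hypothetical stand-in for the real `DecodeSingleSurface` call:

```python
import time

def decode_one():
    """Hypothetical stand-in for decoder.DecodeSingleSurface()."""
    time.sleep(0.001)  # pretend the decode call takes ~1 ms
    return object()

def timed_loop(n_frames):
    inner = 0.0                       # time spent inside the per-frame call
    t_total = time.perf_counter()
    for _ in range(n_frames):
        t0 = time.perf_counter()
        surface = decode_one()
        inner += time.perf_counter() - t0
    total = time.perf_counter() - t_total
    return inner, total - inner       # (call time, pure Python loop overhead)

call_time, py_overhead = timed_loop(50)
```

If `py_overhead` really comes out larger than `call_time` with the real decoder dropped in, the bottleneck is outside the decode call (e.g. IO or object churn in the loop body).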
I will also check the new repo!
Hi @brucechin
I’ve never seen such high Python-side latency before. BTW, there’s a wiki page with performance analysis: https://github.com/NVIDIA/VideoProcessingFramework/wiki/VPF-Performance-analysis
Have you seen it?
@RomanArzumanyan I think it was caused by disk IO. I need to somehow pipeline the stages. Let me check this doc too! Thank you for your help here!
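Overlapping decode with disk IO can hide most of that latency. A hedged sketch of a two-stage pipeline using a bounded queue and a writer thread; `decode_frames` and `write_to_disk` are hypothetical stand-ins for the real decode loop and file IO:

```python
import queue
import threading

def decode_frames(n):
    """Stand-in producer: replace with a DecodeSingleSurface loop."""
    for i in range(n):
        yield bytes([i % 256]) * 16

def write_to_disk(frame, sink):
    """Stand-in consumer: replace with real file IO."""
    sink.append(frame)

def pipeline(n_frames):
    q = queue.Queue(maxsize=8)    # bounded, so decode can't run far ahead of IO
    written = []

    def writer():
        while True:
            frame = q.get()
            if frame is None:     # sentinel: producer is done
                break
            write_to_disk(frame, written)

    t = threading.Thread(target=writer)
    t.start()
    for frame in decode_frames(n_frames):
        q.put(frame)              # decode keeps running while IO drains the queue
    q.put(None)
    t.join()
    return len(written)

n_written = pipeline(100)
```

The bounded queue keeps memory flat; with real stages, decode and disk writes proceed concurrently instead of serializing per frame.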
I want to leverage this framework to accelerate the old CPU-FFmpeg workflow and did some benchmarking.
The ffmpeg execution flow is:
For simplicity, on the GPU implementation side I did:
For a ~10000-frame 1920×1080 video, I run the code on 64 CPU cores (Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz) and an NVIDIA T4 GPU. I obtain the following results:

According to `nvidia-smi dmon`, the T4 GPU dec unit utilization is quite high throughout the decoding execution:
Please note that on the ffmpeg side, it cannot fully leverage the 64 cores. If I create a thread pool and use it to process many videos, the speed could be ~7-8 times faster. May I ask if this is expected? Here on the T4 side, I only called `DecodeSingleFrame` and did not implement the full video transformation logic, but the speed is not as good as I expected. I thought the T4 could be 10X faster than the current number. Otherwise, the switch from the CPU ffmpeg workflow to GPU decoding does not bring much benefit.

cc @RomanArzumanyan @gedoensmax. If this GPU decode single frame performance is expected, do you have any other GPU acceleration suggestions for me? I thought there could be something like batch processing to improve the overall decoding throughput, but I have not found it. Really appreciate your help!
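One common way to raise aggregate throughput is to decode several videos concurrently from a thread pool, so the hardware decoder interleaves multiple sessions instead of idling between per-frame Python calls. A hedged sketch; `decode_video` is a hypothetical stand-in for a per-video `PyNvDecoder` loop:

```python
from concurrent.futures import ThreadPoolExecutor

def decode_video(path):
    """Hypothetical stand-in: decode one video and return its frame count.
    Replace the body with a real PyNvDecoder DecodeSingleSurface loop."""
    return sum(1 for _ in range(1000))  # pretend we decoded 1000 frames

def decode_many(paths, workers=4):
    # Each worker owns its own decoder; sessions run concurrently, so total
    # throughput rises even though per-video speed is unchanged.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(zip(paths, pool.map(decode_video, paths)))

counts = decode_many([f"video_{i}.mp4" for i in range(8)])
```

The per-video decoder-per-thread layout matters: a single decoder instance is not safe to share across threads, so each worker should construct its own.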