PeterL1n / BackgroundMattingV2

Real-Time High-Resolution Background Matting

Improving video inference performance and increasing GPU utilization #178

Open h9419 opened 2 years ago

h9419 commented 2 years ago

Three improvements are made in this contribution:

  1. Removed the repeated copying of the background image to GPU memory on every frame, minimizing the impact of the memory-bandwidth bottleneck (see the first sketch after this list)
  2. Increased GPU utilization by offloading CPU video encoding to child threads as soon as each result is copied back to CPU memory, freeing the parent process to begin processing the next frame
  3. Further increased GPU utilization by offloading CPU video decoding to another thread so that the main thread can focus on feeding the GPU (items 2 and 3 are sketched after the benchmark numbers below)
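
For item 1, a minimal sketch of the idea in PyTorch (the loader and variable names are illustrative, not the actual code in this PR): upload the static background to the GPU once, before the loop, instead of on every frame.

```python
import torch

# Hypothetical loader returning the background as an HxWx3 uint8 numpy array.
bgr_np = load_background_image()

# Upload once, outside the per-frame loop: NCHW float in [0, 1] on the GPU.
bgr = torch.from_numpy(bgr_np).permute(2, 0, 1)[None].float().div(255).cuda()

with torch.no_grad():
    for src in frames:                  # src: source frames already on the GPU
        pha, fgr = model(src, bgr)[:2]  # bgr is reused; no per-frame host->device copy
```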

These modifications roughly tripled performance on my system with an R7 5800H and a mobile RTX 3060. Using the same 4K video on both the resnet50 and resnet101 models, the original version ran at 2.20 it/s whereas this version averages 7.5 it/s.
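
A rough sketch of the threaded producer/consumer structure from items 2 and 3, assuming OpenCV for decode/encode and a PyTorch `model`; the names, queue sizes, and output resolution are illustrative, not the PR's actual code (BGR/RGB channel-order handling is elided for brevity):

```python
import queue
import threading

import cv2
import torch

decode_q = queue.Queue(maxsize=4)  # decoded frames waiting for the GPU
encode_q = queue.Queue(maxsize=4)  # results waiting for the CPU encoder

def decode_worker(path):
    # Decode on its own thread so the main thread can keep the GPU busy.
    cap = cv2.VideoCapture(path)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        decode_q.put(frame)
    decode_q.put(None)  # end-of-stream sentinel

def encode_worker(writer):
    # Encode on its own thread as soon as a result reaches CPU memory.
    while True:
        frame = encode_q.get()
        if frame is None:
            break
        writer.write(frame)

threading.Thread(target=decode_worker, args=("input.mp4",), daemon=True).start()
writer = cv2.VideoWriter("out.mp4", cv2.VideoWriter_fourcc(*"mp4v"), 30, (3840, 2160))
threading.Thread(target=encode_worker, args=(writer,), daemon=True).start()

with torch.no_grad():
    while True:
        frame = decode_q.get()
        if frame is None:
            encode_q.put(None)
            break
        src = torch.from_numpy(frame).cuda().permute(2, 0, 1)[None].float().div(255)
        pha, fgr = model(src, bgr)[:2]           # bgr uploaded once, as sketched above
        out = (pha * fgr).mul(255).byte().cpu()  # only the result crosses back to the CPU
        encode_q.put(out[0].permute(1, 2, 0).numpy())
```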

h9419 commented 2 years ago

Although this is faster, one major bottleneck remains in VideoDataset. When inferring on a 4K HEVC video, around 80% of the execution time is spent in VideoDataset decoding. Future work could use NVDEC or another GPU-accelerated video loader.
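
As one possible direction (not part of this change), NVIDIA DALI ships a GPU video reader that decodes with NVDEC and returns batches that are already resident on the GPU. A minimal sketch, assuming DALI is installed; the file path and pipeline parameters are illustrative:

```python
from nvidia.dali import fn, pipeline_def

@pipeline_def(batch_size=1, num_threads=2, device_id=0)
def video_pipe():
    # Decodes on the GPU with NVDEC; frames never touch host memory.
    frames = fn.readers.video(
        device="gpu",
        filenames=["input.mp4"],  # illustrative path
        sequence_length=1,
    )
    return frames

pipe = video_pipe()
pipe.build()
(frames,) = pipe.run()  # a GPU-resident batch of decoded frames
```

DALI's PyTorch plugin (`nvidia.dali.plugin.pytorch.DALIGenericIterator`) can then feed these batches to the model as CUDA tensors.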

h9419 commented 1 year ago

> Although this is faster, one major bottleneck remains in VideoDataset. When inferring on a 4K HEVC video, around 80% of the execution time is spent in VideoDataset decoding. Future work could use NVDEC or another GPU-accelerated video loader.

I have made a version of it that works with NVIDIA's VideoProcessingFramework (VPF), which takes advantage of the NVENC and NVDEC hardware video engines and creates GPU tensors directly without involving the CPU. It works inside a Docker container under WSL.

However, I don't plan to publish the code, since I don't think I can redistribute NVENC/NVDEC/x264 binaries, and my glue code only works with the version of VPF I compiled myself at the time.
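
That said, the NVDEC decode side looks roughly like the sketch below. The PyNvCodec API has changed across VPF releases, so the class and method names here should be treated as approximate, not as the exact code:

```python
import PyNvCodec as nvc

gpu_id = 0
nv_dec = nvc.PyNvDecoder("input.mp4", gpu_id)  # demuxer + NVDEC decoder
w, h = nv_dec.Width(), nv_dec.Height()

# NV12 -> RGB conversion stays entirely on the GPU.
to_rgb = nvc.PySurfaceConverter(w, h, nvc.PixelFormat.NV12, nvc.PixelFormat.RGB, gpu_id)
cc = nvc.ColorspaceConversionContext(nvc.ColorSpace.BT_709, nvc.ColorRange.MPEG)

while True:
    surf = nv_dec.DecodeSingleSurface()
    if surf.Empty():
        break  # end of stream
    rgb = to_rgb.Execute(surf, cc)
    # rgb is a GPU surface; VPF's PytorchNvCodec extension can wrap it as a
    # torch.cuda tensor without a host round-trip.
```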

One thing I can verify is that the claimed inference speed is achievable on consumer-grade GPUs, and a GeForce RTX series GPU can be faster than a Quadro RTX simply because of its NVENC/NVDEC performance.