PeterL1n / BackgroundMattingV2

Real-Time High-Resolution Background Matting

Real Time Performance #48

Open NaeemKhan333 opened 3 years ago

NaeemKhan333 commented 3 years ago

Thanks for such nice work. I have a question related to real-time inference on video input data. I have tested the resnet50 backbone model on 1080p videos. The results are good, but inference is really slow. How can we speed up inference, other than with a better GPU (I have a GTX 1080)? Secondly, the webcam inference results are very poor; my laptop webcam has a resolution of 640x480. Can you guide me on how to improve the results and speed up inference? Thanks

PeterL1n commented 3 years ago

The inference_video.py script is not optimized for real-time use. You could use hardware decoding and encoding, parallelize data transfers, etc., but those would likely require C++, so we are not doing that.

640x480 is quite a low resolution. You can try increasing the backbone_scale to 0.5, or just use model_type=mattingbase without the refinement, but both approaches will slow down inference.
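For reference, here is a minimal sketch of those two options, assuming the class names and constructor arguments in this repo's model.py and a checkpoint filename from the model zoo (both may differ from your version):

```python
import torch
from model import MattingBase, MattingRefine  # classes from this repo's model.py

# Rough sketch of the two options above; argument and checkpoint names are
# assumptions based on a reading of the repo and may differ between versions.

# Option 1: keep the refiner but run the coarse pass at a larger scale.
model = MattingRefine(backbone='resnet50',
                      backbone_scale=0.5,
                      refine_mode='sampling',
                      refine_sample_pixels=80_000)

# Option 2: skip refinement entirely (model_type=mattingbase).
# model = MattingBase(backbone='resnet50')

model.load_state_dict(torch.load('pytorch_resnet50.pth'), strict=False)
model = model.cuda().eval()

with torch.no_grad():
    src = torch.rand(1, 3, 480, 640).cuda()  # webcam frame, RGB in [0, 1]
    bgr = torch.rand(1, 3, 480, 640).cuda()  # pre-captured background frame
    pha, fgr = model(src, bgr)[:2]           # alpha matte and foreground
```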

adeelabbas commented 3 years ago

I am having the same issue. I'm wondering how we can reproduce the results shown in the paper.

swelcker commented 3 years ago

I had success in real time for online meetings (MS Teams) with FPS >= 24. Quality is only acceptable at >= 720p with a GTX 1650. You have to take control of the webcam parameters (autofocus off, exposure, white balance, etc., input at 30 fps); the lighting, and especially the exposure, have a huge impact on the result and the frame rate. I use a thread for webcam capture that puts frames into a queue, multiprocessing for the matting with the 50 fps model (results also go into a queue), and a streamer process that feeds a virtual webcam (pyfakewebcam). The results are also streamed at 720p, with backbone_scale <= 0.25.

It works fine even when I replace the background with video frames; you need to optimize the code as much as possible. Here is a screenshot listing the cam settings: [image]
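For anyone trying to reproduce this, a rough sketch of the capture thread + queue + virtual webcam structure described above (not the exact code used here; the device index, the /dev/video2 loopback path, and the sizes are placeholders):

```python
import threading, queue
import cv2
import numpy as np
import pyfakewebcam

WIDTH, HEIGHT, FPS = 1280, 720, 30
frames = queue.Queue(maxsize=4)   # bounded so memory stays flat

def capture_loop():
    cap = cv2.VideoCapture(0)
    cap.set(cv2.CAP_PROP_FRAME_WIDTH, WIDTH)
    cap.set(cv2.CAP_PROP_FRAME_HEIGHT, HEIGHT)
    cap.set(cv2.CAP_PROP_FPS, FPS)
    cap.set(cv2.CAP_PROP_AUTOFOCUS, 0)   # lock camera params, as noted above
    while True:
        ok, frame = cap.read()
        if ok:
            frames.put(frame)            # blocks briefly if matting falls behind

threading.Thread(target=capture_loop, daemon=True).start()

fake_cam = pyfakewebcam.FakeWebcam('/dev/video2', WIDTH, HEIGHT)  # v4l2loopback device
background = np.zeros((HEIGHT, WIDTH, 3), dtype=np.uint8)

while True:
    frame = frames.get()
    # ... run the matting model here and composite onto `background` ...
    composited = frame                   # placeholder for the matted result
    fake_cam.schedule_frame(cv2.cvtColor(composited, cv2.COLOR_BGR2RGB))
```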

bluesky314 commented 3 years ago

@swelcker How is the quality of the output? How does it compare to Google Meet's background removal, which requires no background image?

swelcker commented 3 years ago

People always believe my background is real, but I also set the output resolution to 720p through pyfakewebcam, so MS Teams and the others can't choose a lower resolution. I get the same effect in real time, with the background showing between the hair or through my glasses. The real effort was the multiprocessing part to keep the FPS above 24. Also, as my background is a kind of white, the lighting needs to be very good if I wear a white shirt. Example screenshot: [image]

goodnessshare commented 2 years ago

Hi @swelcker

Do you have any tips or examples where you got multiprocessing working for real time performance?

Sorry, I'm still learning the basics, and all the attempts I've tried have just led to my entire machine freezing.

Thanks!

h9419 commented 2 years ago

Hi @goodnessshare

I managed to improve the inference_video.py script's performance by over 3x in my fork, making real-time HD video inference with encoding somewhat viable, but that still only tackles the video encoding scheduling. My tip here is to maximize GPU utilization by offloading serial CPU tasks to other processes or threads.

If you experience freezing, there's a high likelihood that you are using too much memory resulting in disk paging and high page faults. I would recommend checking the RAM usage first, then looking into performance differences with lower resolutions.

I tested on a laptop with a GTX 1060, and 360p webcam inference runs at 12 fps even with OBS capturing to a virtual webcam and streaming out to Zoom, and that already exceeds the quality we usually get with Zoom.
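To illustrate the offloading tip above, here is a minimal sketch of a bounded prefetch thread (the bound also prevents the runaway RAM usage that leads to freezing); the names and file path are illustrative, not from the fork:

```python
import threading
import queue
import cv2
import torch

# Sketch of a bounded prefetch thread: the reader decodes and converts frames
# on the CPU while the main loop keeps the GPU busy. Illustrative only.
prefetched = queue.Queue(maxsize=8)   # small bound keeps RAM usage flat

def reader(path):
    cap = cv2.VideoCapture(path)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        tensor = torch.from_numpy(frame).permute(2, 0, 1).float().div_(255)
        prefetched.put(tensor.unsqueeze(0))   # blocks when the GPU falls behind
    cap.release()
    prefetched.put(None)                      # end-of-stream marker

threading.Thread(target=reader, args=('input.mp4',), daemon=True).start()

while True:
    batch = prefetched.get()
    if batch is None:
        break
    batch = batch.cuda(non_blocking=True)
    # ... run the matting model on `batch` while the reader decodes ahead ...
```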

zinuoli commented 2 years ago

@h9419 Hi, I have the same question about improving the performance of inference_video.py. Even though I use a Tesla V100 for inference, it still runs very, very slowly. I use 4K video as the input with backbone-scale set to 0.125 and refine-sampling-pixels = 128000 to get higher-quality output. May I see your solution for inference_video.py? Could you please share it in your repository? Thanks a lot if you can reply.

zinuoli commented 2 years ago

@h9419 On a side note, the batch size is set to 8, but it still takes me 15 minutes to do the inference.

h9419 commented 2 years ago

My pull request is from my public fork of this repository.

I used the not-very-scientific methodology of running the script without inference and found that at least 80% of the execution time is spent in the data loader, the video encoder and, in the webcam's case, the OpenCV imshow window. It's not even the copying of frames from CPU to GPU and back that slows things down that much, but decoding frames into tensors. You can check out my fork to see how I used threads, processes, pipes and queues to handle loading and saving data asynchronously.

I'm just using more CPU cores to pipeline the preparation and saving of data while the main thread focuses on keeping the GPU busy. My video inference with the default settings achieves 25-33 fps at 1080p and more than 8 fps at 4K. I observe little to no performance difference between resnet50 and resnet101, or with higher sampling sizes, since the bottleneck is anything but the GPU.
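The saving side of that pipeline looks roughly like this (a general sketch of the idea, not the code in the fork; paths and sizes are placeholders):

```python
import multiprocessing as mp
import cv2
import numpy as np

# Sketch of a separate writer process so cv2.VideoWriter's sequential encoding
# does not stall the GPU loop; a general illustration, not the fork's code.
def writer_proc(out_queue, path, fps, size):
    writer = cv2.VideoWriter(path, cv2.VideoWriter_fourcc(*'mp4v'), fps, size)
    while True:
        frame = out_queue.get()
        if frame is None:          # sentinel: inference finished
            break
        writer.write(frame)        # BGR uint8, HxWx3
    writer.release()

if __name__ == '__main__':
    out_queue = mp.Queue(maxsize=16)   # bounded so RAM stays flat
    proc = mp.Process(target=writer_proc,
                      args=(out_queue, 'com.mp4', 30, (1920, 1080)))
    proc.start()

    for _ in range(300):               # stand-in for the real inference loop
        com = np.zeros((1080, 1920, 3), dtype=np.uint8)
        # ... `com` = composite from the matting model, copied back to the CPU ...
        out_queue.put(com)

    out_queue.put(None)
    proc.join()
```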

For the webcam, I made a custom version with the pyvirtualcam library that manages a stable 720p at 30 fps, which is as many frames as my webcam can provide. I don't plan to release it as it's not polished.

Currently, the bottleneck I face is still video decoding and encoding. If you have the know-how to use CUVID/NVDEC/NVENC in combination with CUDA without copying memory to the CPU, it could manage real-time 4K in theory. I simply haven't figured it out yet.

zinuoli commented 2 years ago

@h9419 Thank you for replying to me so quickly! I'll try it as soon as possible!

h9419 commented 2 years ago

@lzn1273180880 I think I have found my hardware limit for video inference using Nvidia's VideoProcessingFramework (VPF). Video decoding can be done on the GPU, and the frames are converted into tensors directly on the GPU without a CPU memory copy, so CPU usage actually decreased.
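In case it helps others, the decode path looks roughly like this, modeled on the VPF Torch samples (the helper names and the Execute() signature vary between VPF releases, so treat it as a guide rather than exact code):

```python
import PyNvCodec as nvc          # VideoProcessingFramework
import PytorchNvCodec as pnvc    # VPF's torch interop extension
import torch

# Sketch of GPU-only decode -> torch tensor, modeled on the VPF Torch samples.
gpu_id = 0
decoder = nvc.PyNvDecoder('input.mp4', gpu_id)
w, h = decoder.Width(), decoder.Height()

# NVDEC outputs NV12; convert to planar RGB on the GPU before wrapping as a tensor.
to_rgb = nvc.PySurfaceConverter(w, h, nvc.PixelFormat.NV12, nvc.PixelFormat.RGB, gpu_id)
to_planar = nvc.PySurfaceConverter(w, h, nvc.PixelFormat.RGB, nvc.PixelFormat.RGB_PLANAR, gpu_id)
cc = nvc.ColorspaceConversionContext(nvc.ColorSpace.BT_601, nvc.ColorRange.MPEG)

while True:
    surface = decoder.DecodeSingleSurface()
    if surface.Empty():
        break
    surface = to_planar.Execute(to_rgb.Execute(surface, cc), cc)
    plane = surface.PlanePtr()
    # Wrap the CUDA allocation as a uint8 tensor without copying to the CPU
    # (assumes pitch == width after the planar conversion, as in the samples;
    # the wrapping helper is named differently in some releases).
    frame = pnvc.makefromDevicePtrUint8(
        plane.GpuMem(), plane.Width(), plane.Height(), plane.Pitch(), plane.ElemSize())
    src = frame.resize_(3, h, w).unsqueeze(0).float().div_(255)  # (1, 3, H, W) in [0, 1]
    # ... run the matting model on `src` (plus a decoded background `bgr`) here ...
```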

On my 130 W mobile RTX 3060 with 6 GB of VRAM, running under WSL2 and Docker, the results with threaded NVDEC decoding and CUDA are as follows. Saving video is still done with my threaded CPU version, and I save all three streams: 'com', 'fgr' and 'pha'. Power and memory usage are logged with HWiNFO.

| Model | Samples | FPS | VRAM | Saving | Resolution | GPU Power |
|---|---|---|---|---|---|---|
| Resnet50 | 80,000 | 54 | 4.1 GB | No | 1080p | 110-115 W |
| Resnet101 | 80,000 | 39 | 4.5 GB | No | 1080p | 110-115 W |
| Resnet50 | 80,000 | 15.38 | 5.9 GB | No | 4K | 122 W |
| Resnet50 | 80,000 | 31 | 4.1 GB | Yes | 1080p | 90-95 W |
| Resnet101 | 80,000 | 28 | 4.5 GB | Yes | 1080p | 90-95 W |
| Resnet50 | 80,000 | 6.43 | 5.9 GB | Yes | 4K | 90-100 W |

I haven't figured out hardware encoding yet, but memory and memory bandwidth seem to be the limit when copying back to the CPU, which means my laptop cannot handle 4K inference in real time in any capacity.

At least it demonstrates that Docker inside WSL2 can access everything an Nvidia GPU has to offer at full speed.

zinuoli commented 2 years ago

Hi @h9419. As I mentioned before, I used a Tesla V100 for video_inference and it's slow too. I'm not sure if I can solve this simply by using Nvidia's VideoProcessingFramework; maybe I should try it when I'm free. If I make any progress, I'll tell you immediately. Thank you :)

zinuoli commented 2 years ago

@h9419 At least we know that hardware shouldn't be the limit for inference on a Tesla V100. There must be something to improve.

h9419 commented 2 years ago

@lzn1273180880

I have a few small but significant breakthroughs.

  1. Using both NVDEC and NVENC in VPF makes video encoding happen without any CPU memory involvement with the raw tensors.
  2. Using an fp16 model in TorchScript provided a significant performance improvement and a reduction in VRAM use (see the sketch after the table below).

| Model | Samples | FPS | VRAM | Saving | Resolution | GPU Power |
|---|---|---|---|---|---|---|
| Resnet50-fp16 | 100,000 | 56.6 | 3.6 GB | NVENC-h264 x3 | 720p | 72 W |
| Resnet50-fp16 | 100,000 | 45.6 | 3.8 GB | NVENC-h264 x3 | 1080p | 93 W |
| Resnet50-fp16 | 100,000 | 37.2 | 4.2 GB | NVENC-h264 x3 | 1440p | 106 W |
| Resnet50-fp16 | 100,000 | 17.8 | 5.4 GB | NVENC-h264 x3 | 4K | 111 W |
| Resnet101-fp16 | 100,000 | 37.1 | 3.8 GB | NVENC-h264 x3 | 1080p | 83 W |
| Resnet101-fp16 | 100,000 | 17.8 | 5.5 GB | NVENC-h264 x3 | 4K | 113 W |
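The fp16 TorchScript path from point 2 looks roughly like this (a sketch assuming the TorchScript fp16 export from this repo's model zoo; the filename and the mutable attributes follow the README as I remember it and may differ):

```python
import torch

# Sketch of fp16 TorchScript inference; the checkpoint name and the exposed
# attributes (backbone_scale, refine_mode, refine_sample_pixels) are
# assumptions based on the repo's TorchScript docs and may differ.
device = torch.device('cuda')
model = torch.jit.load('torchscript_resnet50_fp16.pth').to(device).eval()
model.backbone_scale = 0.25
model.refine_mode = 'sampling'
model.refine_sample_pixels = 100_000

# Frames should already live on the GPU as fp16 tensors in [0, 1], (B, 3, H, W).
src = torch.rand(1, 3, 1080, 1920, device=device, dtype=torch.float16)
bgr = torch.rand(1, 3, 1080, 1920, device=device, dtype=torch.float16)

with torch.no_grad():
    pha, fgr = model(src, bgr)[:2]                 # alpha matte and foreground
    green = torch.tensor([120, 255, 155], device=device,
                         dtype=torch.float16).div(255).view(1, 3, 1, 1)
    com = pha * fgr + (1 - pha) * green            # composite onto a flat color
```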

My implementation requires building from a release tag of VPF, but with /src/PytorchNvCodec.cpp taken from the master branch, because the current master branch does not compile for me, yet I needed the new feature for creating a GPU surface from a GPU tensor. One disadvantage is that the output consists of raw h264 streams without a container, which makes seeking during playback less convenient.

The high power usage indicates this is about the best I can get out of my laptop RTX 3060, and I am already outputting all three useful streams: com, pha and fgr. CPU usage is not high, and RAM usage is no longer tens of GB. The findings above show that once you manage to run the entire pipeline on the GPU, a V100 should in theory handle real-time 4K.

My hardware limits me to around 1440p in real time, and performance does not scale linearly with resolution beyond that point. It's no longer a memory limitation, or the CPU failing to keep the GPU busy, but the CUDA core count becoming the limit once a frame cannot be processed in one go. The frame rate drops to less than half when one frame no longer fits into one round of kernel execution.

As I mentioned before, video inference is slow because it spends more than 85% of its execution time in the data loader, sequentially decoding the video into frames and converting the frames into tensors, and in VideoWriter, which also saves the resulting video sequentially. My system ran the default Resnet50 with 80k samples at 4K at 2.16 fps, but my test results above show that it can be much faster once all tensor processing is moved to the GPU. VPF is a must if you want to run inference in real time.

Update: if I pipeline the encoding part, 1080p inference runs at 60+ fps with resnet50 on both a 130 W RTX 3060 laptop and an OEM desktop RTX 2070 Super. Curiously, the same 1080p video inference is only 40 fps on an 80 W RTX 5000 laptop. 4K inference is 30+ fps with mobilenetv2, provided I reduce the output streams to one or two.