Is there a flag to use GPU acceleration?

laurentopia commented 5 years ago

OpenCV has a build using CUDA I think, so I was wondering if GPU acceleration flag existed in DVR-scan

Breakthrough commented 5 years ago

Update (July 2022): The latest release of DVR-Scan also includes an experimental build with CUDA support. You can grab it from:

https://dvr-scan.readthedocs.io/en/latest/download/
https://github.com/Breakthrough/DVR-Scan/releases

You can then enable GPU mode with -b MOG2_CUDA, e.g.:

dvr-scan -i video.mp4 -b MOG2_CUDA

See the docs for details.

Thank you for your submission @laurentopia

laurentopia commented 5 years ago

Thanks. I'm sifting through 8 hours of footage and it's taking a while; what optimization flags can I use? The motion happens only in the lower third of the image, so a --region flag would help, but I couldn't find one.

Breakthrough commented 5 years ago

Hi @laurentopia;

Recently, there was a pull request merged which added the -roi flag which lets you set a region of interest. This should provide a decent performance improvement. See my post here for details.

Thank you!

Breakthrough commented 3 years ago

GPU acceleration should now be possible with the OpenCV Python bindings. It may complicate building & distributing DVR-Scan a bit, but that may be a good solution for now to keep DVR-Scan written in Python but still provide significant performance improvements. I don't have a timeline for this yet as I'm still having trouble compiling the module with CUDA support in a way suitable for distribution, but found this resource from the opencv-python package on pip which may be useful.

Edit: Also found this blog post which details the steps for compiling it on Windows. I need to understand what implications this will have for binary redistribution, but this indeed seems feasible. I plan on starting with having DVR-Scan support GPU-acceleration from the Python side, and once that's confirmed working, will then move onto releasing a Windows binary. I still need to come up with a more optimal workflow for releasing Windows binaries as the process is a bit unwieldy right now.

laurentopia commented 3 years ago

Thank you, I'll keep that in mind the next time I need to sift through hours of footage.

Breakthrough commented 3 years ago

Hey @laurentopia; I'd like to keep this issue open, as I do want to investigate adding CUDA support. Also note that the next release of DVR-Scan will include a faster algorithm, as per #48. Thanks!

CJPNActual commented 2 years ago

Hi @Breakthrough,

I hope this is the correct place for this missive. The TL;DR is that yes, indeed adding OpenCV CUDA support can, along with other important and significant optimisations, massively improve throughput. I.e. from 75fps to 1300fps.

One caveat: I completely recreated the fundamentals of DVR-Scan in C# (fully managed and unsafe as required), as my Python skills are essentially non-existent, and I needed something significantly more performant as a matter of urgency. However some if not all the optimisations should carry across directly to the Python version.

Rather than detail the specifics in one giant wall of text, let me know if/when you're currently looking at this issue, and I would be more than happy to share the exact details and mechanics of the performance gains.

Best regards,

CJPN

Edit: gfx is Nvidia RTX 2080 Super

Some screenies of DVR-Scan and the optimised setup working on the same source material (H.264 @ 1920x1080) with identical settings. C# version streams multiple files or separate timespans within the same file. Four streams, in this instance, is enough to saturate the GPU:

[Screenshot: DVRScan]

[Screenshot: CJPNMotionDetect]

Breakthrough commented 2 years ago

Awesome result - would be glad to learn more about how you went about it. I hope to eventually find something that targets both Nvidia and AMD GPUs, but even something CUDA based is better than nothing.

CJPNActual commented 2 years ago

> Awesome result - would be glad to learn more about how you went about it. I hope to eventually find something that targets both Nvidia and AMD GPUs, but even something CUDA based is better than nothing.

Indeed there are suggestions within OpenCV's docs - this being an imminent target of further research on this end - that utilising UMats as opposed to Mats automagically enables OpenCL processing with OpenCV choosing the fastest device available and falling back to CPU as required.
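As a rough illustration of the UMat approach described above, here is a minimal Python sketch using the OpenCV bindings. It is not DVR-Scan's implementation: the function names, the default threshold, and the choice of MOG2 with default parameters are assumptions made for the example. Wrapping a frame in cv2.UMat lets OpenCV dispatch the same calls through OpenCL when a device is available, falling back to the CPU otherwise.

```python
try:
    import cv2  # optional: only needed to actually scan a video
except ImportError:
    cv2 = None


def motion_percent(nonzero_pixels: int, width: int, height: int) -> float:
    """Share of the frame flagged as foreground, in percent."""
    return 100.0 * nonzero_pixels / (width * height)


def scan_with_opencl(path: str, threshold_percent: float = 1.0) -> list:
    """Return frame numbers whose foreground share exceeds the threshold.

    Hypothetical helper for illustration; the 1% default is an assumption.
    """
    if cv2 is None:
        raise RuntimeError("OpenCV (cv2) is required for scanning")
    cv2.ocl.setUseOpenCL(True)  # ask OpenCV to use OpenCL where it can
    subtractor = cv2.createBackgroundSubtractorMOG2()
    capture = cv2.VideoCapture(path)
    hits = []
    frame_number = 0
    try:
        while True:
            ok, frame = capture.read()
            if not ok:
                break
            height, width = frame.shape[:2]
            # Wrapping the frame in a UMat is the only change from CPU code.
            gray = cv2.cvtColor(cv2.UMat(frame), cv2.COLOR_BGR2GRAY)
            mask = subtractor.apply(gray)  # UMat in, UMat out
            share = motion_percent(cv2.countNonZero(mask), width, height)
            if share > threshold_percent:
                hits.append(frame_number)
            frame_number += 1
    finally:
        capture.release()
    return hits
```

Whether this actually runs on the GPU depends on the OpenCV build and the installed OpenCL drivers; cv2.ocl.haveOpenCL() reports whether OpenCL is usable at all.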

While CUDA is fun n'all that, its proprietary nature does bring about a certain discomfort. Particularly given current hardware shortages.

CJPNActual commented 2 years ago

Well. OpenCL running on the GPU does indeed automagically work. Performance is, however, nowhere close to the exact same algorithm operating explicitly in CUDA. It appears as if OpenCV is uploading/downloading the UMat to/from the GPU between every processing step - i.e. GPU bus usage is higher with the OpenCL version running at 350FPS than the CUDA version running at 1300FPS - something that one can easily and explicitly prevent from occurring with OpenCV/CUDA.

Nevertheless, 350fps is an improvement over 75FPS, eh? Further investigation required.

Edit: By jiggling around what runs on CPU and GPU this is up to 550FPS.

[Screenshot: CJPNMotionDetectOpenCL]

CJPNActual commented 2 years ago

The difference between CUDA and OpenCL isn't as bad as initially thought. And is somewhat counterintuitive.

4x 1080p

Running CUDA

Thread 1 @ frame 10622 (1920x1080) @ 307.44 FPS @ 0.64 Gpx/s.
Thread 2 @ frame 10667 (1920x1080) @ 321.28 FPS @ 0.67 Gpx/s.
Thread 3 @ frame 10243 (1920x1080) @ 303.71 FPS @ 0.63 Gpx/s.
Thread 4 @ frame 10951 (1920x1080) @ 334.74 FPS @ 0.69 Gpx/s.
Cumulative 1267.17 FPS.
Cumulative 2.63 Gpx/s.

4x 4K

Running CUDA

Thread 1 @ frame 1511 (3840x2560) @ 61.32 FPS @ 0.6 Gpx/s.
Thread 2 @ frame 1504 (3840x2560) @ 61.28 FPS @ 0.6 Gpx/s.
Thread 3 @ frame 1510 (3840x2560) @ 63 FPS @ 0.62 Gpx/s.
Thread 4 @ frame 1515 (3840x2560) @ 60.59 FPS @ 0.6 Gpx/s.
Cumulative 246.2 FPS.
Cumulative 2.42 Gpx/s.

4x 1080p

Running OpenCL

Thread 1 @ frame 2824 (1920x1080) @ 126.35 FPS @ 0.26 Gpx/s.
Thread 2 @ frame 2816 (1920x1080) @ 128.29 FPS @ 0.27 Gpx/s.
Thread 3 @ frame 2835 (1920x1080) @ 128.05 FPS @ 0.27 Gpx/s.
Thread 4 @ frame 2902 (1920x1080) @ 140.47 FPS @ 0.29 Gpx/s.
Cumulative 523.15 FPS.
Cumulative 1.08 Gpx/s.

4x 4K

Running OpenCL

Thread 1 @ frame 1716 (3840x2560) @ 34 FPS @ 0.33 Gpx/s.
Thread 2 @ frame 1721 (3840x2560) @ 34.27 FPS @ 0.34 Gpx/s.
Thread 3 @ frame 1725 (3840x2560) @ 34.6 FPS @ 0.34 Gpx/s.
Thread 4 @ frame 1707 (3840x2560) @ 35.72 FPS @ 0.35 Gpx/s.
Cumulative 138.59 FPS.
Cumulative 1.36 Gpx/s.

Breakthrough commented 2 years ago

@CJPNActual is there any source code you can share for your benchmarks? Would love to integrate something like this into DVR-Scan, but I'm not that familiar with OpenCL. Thanks!

eded333 commented 2 years ago

This seems extremely promising both for cuda and opencl. Any chance you can share the source code?

CJPNActual commented 2 years ago

Copied from recent email:

I rewrote the entire concept in C#, with sections of unsafe (C++ style) code for added memory voodoo where using FFMPEG for video decoding is concerned, so am not certain it translates directly to Python with a source-share...

However, getting the base-concept to run on CUDA or OpenCL is as simple as redefining some of the OpenCV textures as "graphics card resident" and then calling the same functions on them with the now GFX-resident textures/bitmaps as arguments as opposed to the "CPU-resident" ones. OpenCV will deal with all of the difficult copy-to-gfx-card stuff internally. The code on your end should look almost identical with minor changes as mentioned.

Does that make sense?

Breakthrough commented 2 years ago

Yes, totally. Could you point me to any code samples for that in OpenCV 4? (C++ is fine) I am considering rewriting DVR-Scan in Rust for the next version which should support most of the C++ style stuff. Thanks!

Edit: It looks like samples/gpu/video_reader.cpp and samples/gpu/bgfg_segm.cpp should be good enough for a first pass, but I'm curious what you mean by redefining as graphics card resident - do you mean using a GpuMat instead of a regular Mat?
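For readers following along in Python rather than C++, the Mat-to-GpuMat swap being discussed might look roughly like the sketch below. It assumes an OpenCV build compiled with CUDA support (the stock opencv-python wheel is not one), and the helper names cuda_available and foreground_pixels_cuda are hypothetical, chosen only for this example.

```python
try:
    import cv2  # must be a CUDA-enabled build for the GPU path to work
except ImportError:
    cv2 = None


def cuda_available() -> bool:
    """True only if cv2 imports and reports at least one CUDA device."""
    if cv2 is None:
        return False
    try:
        return cv2.cuda.getCudaEnabledDeviceCount() > 0
    except (AttributeError, cv2.error):
        return False


def foreground_pixels_cuda(frames) -> list:
    """Count foreground pixels per BGR frame, running MOG2 on the GPU."""
    if not cuda_available():
        raise RuntimeError("an OpenCV build with CUDA support is required")
    stream = cv2.cuda_Stream()
    subtractor = cv2.cuda.createBackgroundSubtractorMOG2()
    gpu_frame = cv2.cuda_GpuMat()  # the "graphics card resident" image
    counts = []
    for frame in frames:
        gpu_frame.upload(frame, stream=stream)  # host -> device copy
        gpu_gray = cv2.cuda.cvtColor(gpu_frame, cv2.COLOR_BGR2GRAY, stream=stream)
        # A learning rate of -1 lets OpenCV choose the rate automatically.
        gpu_mask = subtractor.apply(gpu_gray, -1, stream)
        mask = gpu_mask.download(stream=stream)  # device -> host copy
        stream.waitForCompletion()
        counts.append(cv2.countNonZero(mask))
    return counts
```

The intermediate images stay on the device between steps; only the final mask is copied back, which is the kind of explicit control over transfers mentioned earlier in the thread.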

Edit2: And sorry do you have examples of that for OpenCL? I assumed OpenCV only supported CUDA for certain things.

ijutta commented 2 years ago

Is there any way to run dvr-scan with GPU acceleration? Sorry, I'm a total n00b and can only run it on Windows or Linux, where it should be possible to get more performance out of the GPU.

I would be very interested in helping in developing/testing dvr-scan, but I totally don't know where to start...

Best regards, Adam

Breakthrough commented 2 years ago

@ijutta it's fairly straightforward to modify the scan_motion function to support the OpenCV CUDA module. For the most part, there are drop-in replacements for the CPU-bound versions.

That being said, the next release of DVR-Scan will include multithreading to improve performance. It will still be slower than using a GPU for calculation, but should be at least 50% faster than the current version. Once that's done, I'll pick this up for v1.6 and at least try to support versions of OpenCV that are compiled with CUDA support at minimum as an experimental feature.

In the meantime, I'm more than happy to help out or point you in the right direction for adding this to DVR-Scan. I'm working on refactoring the application as a whole to make it easier to integrate GPU/multithreaded support in general (the current application has accumulated a lot of technical debt), but any proofs of concept/PRs are more than welcome.

Breakthrough commented 2 years ago

@ijutta interestingly enough, I just happened to find someone distributing prebuilts for Windows: https://jamesbowley.co.uk/downloads/

I'll see how difficult it is to mock something up to at least get a decent performance comparison and get back to you. I still need to see how to best support both CPU + GPU scanning in the long run, but will keep this in mind during the v1.5 refactor to make that an explicit goal.

Edit: Wasn't able to get it to work due to some missing dependencies, but will give building a custom version a try. If that works, I'll create a branch where people can test this out.

Breakthrough commented 2 years ago

I managed to get this working! There's still plenty of work to be done, but you can download and install an experimental version to test out (make sure to uninstall any existing versions of DVR-Scan):

https://github.com/Breakthrough/DVR-Scan/releases/tag/v1.5-beta
- Supports GTX 900-series and above, requires display driver with CUDA support

Once installed or extracted, you can use -b mog_cuda to enable CUDA processing, e.g.:

dvr-scan -i video.mp4 -b mog_cuda -so

I would recommend comparing performance in scan-only mode (-so) as currently video encoding is not done in parallel, but this is being worked on as well. I get roughly twice as fast scanning performance with the GPU version. This should improve further when video decoding/encoding is done in parallel with motion detection (right now everything is still done in a single thread).

If folks could help out by testing this, that would be fantastic (esp. regarding performance). ~I'll see what I can do about a Windows build to test as well.~ Windows build now available, see link above.

CJPNActual commented 2 years ago

(((Strong disclaimer! I possess virtually no Python expertise.)))

Running the new experimental build immediately complains that:

dvr-scan.exe: error: argument -b/--bg-subtractor: invalid choice: mog_cuda (valid settings for TYPE are: 'mog', 'cnt')

CJPNActual commented 2 years ago

Apologies. Didn't see the edits until now. While, in my tests, OpenCL performs at ~50% of CUDA, obviously not everybody possesses an Nvidia card.

> Edit2: And sorry do you have examples of that for OpenCL? I assumed OpenCV only supported CUDA for certain things.

It's been a while since I tested OpenCL, yet my observation - at least using Emgu.CV, the .NET wrapper - is that rather than explicitly creating an OpenCL pipeline as one would with CUDA, one instead instructs OpenCV as a whole to utilise OpenCL via its "native" interface thusly:

CvInvoke.UseOpenCL = true;
CvInvoke.UseOptimized = true;

Subsequently one calls one's pipeline via the native interface, with the exception of the background subtractor, which one calls normally via its member .Apply() function. I believe there may be one caveat in that certain Mats need to be UMat, but otherwise it's relatively easy to port from CPU-only code. Easier than CUDA, one might posit.

Disclaimer: The CUDA code on this end isn't hard-coded to 1920x1080 and so on. :)

                CvInvoke.UseOpenCL = true;
                CvInvoke.UseOptimized = true;

                Emgu.CV.BackgroundSubtractorMOG2 backsub = new Emgu.CV.BackgroundSubtractorMOG2(120, 32, true);

                int FrameCount = 0;

                Emgu.CV.Mat resizedFrame = new Emgu.CV.Mat();

                Emgu.CV.Mat blurredFrame = new Emgu.CV.Mat();

                //This is OpenCV-speak for allocating memory, via OpenCL, on GPU.  UMat is important.
                Emgu.CV.UMat ForegroundMask = new Emgu.CV.UMat();
                ForegroundMask.Create(1080, 1920, DepthType.Cv8U, 1, UMat.Usage.AllocateDeviceMemory);

                //In case we want to view that which the background subtractor considers motion.  Not used here.
                UMat downloadedForegroundMask = new UMat();

                while (!Reader.QuitProcessing)
                {
                    VideoFrame thisFrame = Reader.GetNextFrame();

                    if (thisFrame != null && FrameCount % SkipFrames == 0)
                    {

                        thisFrame.GenerateMatFromFFMPEGFrame();  //Generates a Mat from the raw luminance data

                        Emgu.CV.CvInvoke.Resize(thisFrame.MatFrame, thisFrame.MatFrame, new System.Drawing.Size(thisFrame.MatFrame.Size.Width / divideFrameByFactor, thisFrame.MatFrame.Size.Height / divideFrameByFactor), divideFrameByFactor, divideFrameByFactor, Inter.Linear);

                        //Some CCTV cameras produce noisy footage.  Option to blur to prevent false detections.
                        Emgu.CV.CvInvoke.GaussianBlur(thisFrame.MatFrame, thisFrame.MatFrame, new System.Drawing.Size(5, 5), 5);

                        //Again, GPU requires UMat
                        thisFrame.GenerateUMatFromMATFrame();

                        backsub.Apply(thisFrame.uMatFrame, ForegroundMask);

                        //ForegroundMask is a UMat, so count via the generic CvInvoke call rather than the CUDA-only one
                        int nonZeroCount = Emgu.CV.CvInvoke.CountNonZero(ForegroundMask);
                        float nonZeroPercent = ((float)nonZeroCount / (ForegroundMask.Size.Width * ForegroundMask.Size.Height)) * 100.0f;

                        if (nonZeroPercent > nonZeroThreshold) //Some sensible threshold for detection
                        {
                            //Deal with a detection event
                        }
                    }
                }

CJPNActual commented 2 years ago

Other optimisations worth noting:

The general architecture is as follows:

-------------------------------n x Processing Container threads---------------------------------------------------------            
-   Decode Thread -> n-frame FIFO buffer -> Processing Thread -> n-frame FIFO buffer -> Optional Encode & ffmpeg cleanup Thread
- 
-   ffmpeg_frame created ---------------no YUV to RGB to YUV conversion------------------------> ffmpeg_frame freed.   
------------------------------------------------------------------------------------------------------------------------

- One generally requires n x 3 to equal the number of CPU cores in order to properly saturate the GPU.
- GPU driver/task manager lie. They report GPU activity at the GPU's current clock speed. Due to unknown bottlenecks, multiple concurrent processing containers are required in order to push the GPU to max frequency.
- While a royal pain in the backside* compared to using OpenCV's own video decoding/encoding functions, using ffmpeg and properly managing frame lifecycle not only avoids unnecessary YUV -> RGB -> Monochrome conversions and significantly conserves RAM bandwidth, but is also much more likely to be performance-optimised when compared to OpenCV, whose model tends towards ease of use.
- Indeed, ffmpeg is capable of a "fast-copy", without encoding, of video given timestamps of a source. One needs to be able to seek to an I-frame for this to work flawlessly, yet doing so almost completely mitigates the encoding overhead. Still a work in progress on this end as ffmpeg is difficult and often painful*. 50% complete on this end.

*To be fair to ffmpeg, the issue, perhaps, pertains to the bizarrely non-conformant video spat out by certain DVRs. Want a consistent framerate? Yeah, but nah. How about I-frames at a relatively useful interval? Again, above the DVR's pay grade.
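The decode -> FIFO -> process -> FIFO -> encode layout above can be sketched in Python with the standard library alone. This is only an illustration of the shape of one processing container: run_pipeline, the stage callables, and the buffer size of 8 are assumptions for the example, not code from either project.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the stream


def _stage(work, inbox, outbox):
    """Apply `work` to each item from inbox and forward the result."""
    while True:
        item = inbox.get()
        if item is _DONE:
            outbox.put(_DONE)
            return
        outbox.put(work(item))


def run_pipeline(frames, process, encode, buffer_size=8):
    """Run decode, process and encode stages, one thread per stage."""
    decoded = queue.Queue(maxsize=buffer_size)    # n-frame FIFO buffer
    processed = queue.Queue(maxsize=buffer_size)  # n-frame FIFO buffer
    results = queue.Queue()                       # unbounded output

    def decode():
        for frame in frames:
            decoded.put(frame)  # blocks while the buffer is full
        decoded.put(_DONE)

    threads = [
        threading.Thread(target=decode),
        threading.Thread(target=_stage, args=(process, decoded, processed)),
        threading.Thread(target=_stage, args=(encode, processed, results)),
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()

    output = []
    while True:
        item = results.get()
        if item is _DONE:
            return output
        output.append(item)
```

For example, run_pipeline(range(5), lambda x: x * 2, lambda x: x + 1) returns [1, 3, 5, 7, 9]. In Python, threads like these only help when the stages spend their time in native code that releases the interpreter lock, which is typically the case for OpenCV calls; running several such containers side by side would correspond to the "n x" in the diagram.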

Breakthrough commented 2 years ago

@CJPNActual are you sure you uninstalled any existing versions of DVR-Scan before installing the experimental version? What do you see if you run dvr-scan --version? This certainly has many more opportunities for optimization, so thanks for suggesting some ideas.

The main focus for v1.5 will be to implement a multithreaded model similar to what you've suggested, doing frame decoding in a separate thread, and offloading video encoding to ffmpeg in a subprocess (trying to have DVR-Scan output a list of cuts and video filters to overlay timestamps/bounding boxes where required).
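As a sketch of what handing a cut over to ffmpeg might look like, the helper below builds a stream-copy command for one detected event. The function name and argument layout are hypothetical; the ffmpeg options used (-ss, -i, -t, -c copy) are standard ones. Because no re-encoding takes place, ffmpeg can only start the cut on a keyframe, so the actual start may differ slightly from the requested time.

```python
def ffmpeg_copy_command(source, start_seconds, end_seconds, output):
    """Build an ffmpeg command that copies one time span without re-encoding."""
    if end_seconds <= start_seconds:
        raise ValueError("end_seconds must be greater than start_seconds")
    duration = end_seconds - start_seconds
    return [
        "ffmpeg",
        "-ss", f"{start_seconds:.3f}",  # seek in the input before decoding
        "-i", source,
        "-t", f"{duration:.3f}",        # length of the cut in seconds
        "-c", "copy",                   # stream copy: no re-encoding
        output,
    ]
```

The resulting list can be passed to subprocess.run(command, check=True), one call per detected event.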

There's much room for performance improvements though, and using Python for tighter ffmpeg/CUDA integration is difficult in the current package ecosystem. I want to investigate possibly rewriting the project in Rust, to allow tighter integration with the various C/C++ libraries and allow better control over memory management, threading/locking, and integration with the ffmpeg API. I suspect this would also bring performance much closer to the figures you've been able to get.

Doing all of this isn't impossible in Python per se, but it definitely feels like making all of those parts work together would be much easier in a statically typed compiled language.

CJPNActual commented 2 years ago

> @CJPNActual are you sure you uninstalled any existing versions of DVR-Scan before installing the experimental version? What do you see if you run dvr-scan --version? This certainly has much more opportunities for optimization, so thanks for suggesting some ideas.

Ah. Exactly what I didn't do. Still reading v1.3.

> The main focus for v1.5 will be to implement a multithreaded model similar to what you've suggested, doing frame decoding in a separate thread, and offloading video encoding to ffmpeg in a subprocess (trying to have DVR-Scan output a list of cuts and video filters to overlay timestamps/bounding boxes where required).

As mentioned, beware janky video from actual DVRs/NVRs. Asking ffmpeg to seek to a particular timestamp in a "perfect"-encoded video is trivial. The files I'm getting out of a mid-range HikVision NVR are unpredictable at best, so the only reliable timestamp one has is the frame number. Indeed, I have some code here attempting to convert between frame# and real-world timestamp, yet it remains to be perfected.

Incidentally, it may be faster to implement sub-frame regions of interest merely by applying a mask multiplication on the GPU, then leaving ffmpeg to extract the region of interest after the fact, should the end-user require only the sub-frame as video. The logic there is that one can define multiple regions of interest without having to expensively extract each one and process it individually.
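To make the mask idea concrete, here is a small CPU-side sketch using NumPy (the comment above proposes doing the multiplication on the GPU; the arithmetic is the same). The function names and the (x, y, w, h) region format are assumptions for illustration.

```python
import numpy as np


def build_roi_mask(height, width, regions):
    """Return a uint8 mask that is 1 inside any (x, y, w, h) region, else 0."""
    mask = np.zeros((height, width), dtype=np.uint8)
    for x, y, w, h in regions:
        mask[y:y + h, x:x + w] = 1
    return mask


def masked_motion_pixels(foreground, roi_mask):
    """Count foreground pixels that fall inside the regions of interest."""
    # Multiplying by a 0/1 mask zeroes everything outside the regions.
    return int(np.count_nonzero(foreground * roi_mask))
```

The mask is built once and reused for every frame, so adding more regions costs nothing extra per frame, which is the advantage over cropping and processing each region separately.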

> There's much room for performance improvements though, and using Python for tighter ffmpeg/CUDA integration is difficult in the current package ecosystem. I want to investigate possibly rewriting the project in Rust, to allow tighter integration with the various C/C++ libraries and allow better control over memory management, threading/locking, and integration with the ffmpeg API. I suspect this would also bring performance much closer to the figures you've been able to get.

> Doing all of this isn't impossible in Python per-se, but it definitely feels like making all of those parts work together would be much easier in a statically typed compiled language.

Yeah, precisely why I reimplemented it in C# with unsafe enabled. I've significant former work with real-time video, so it's a safe space. However, it's not every day that an interesting case occurs for processing video at gigabytes per second, so I entertained this as a continuation-of-training exercise on my part.

Breakthrough commented 2 years ago

I've also uploaded an experimental .exe build for 64-bit Windows systems (requires an Nvidia GTX 900-series or above GPU) which you can grab here. Make sure you have CUDA support enabled with your current GPU driver, e.g. when clicking System Information in the Nvidia Control Panel, you should see NVCUDA64.dll under the Components tab.

Feedback on both this and the experimental Python version above is most welcome.

Breakthrough commented 2 years ago

I have the multithreaded version ready for testing in the v1.5 branch; I just need to clean up a few things before making another release candidate. Compared to the experimental version above, I'm now getting closer to 150 FPS in CUDA mode, up from 100 FPS before. In CPU mode, using MOG, I can now get close to real-time full-frame processing (~60 FPS).

Edit: This is with a 1080p video on an i7 6700k w/ a Nvidia GTX 2070.

This opens up a lot of doors now, but for the time being this is as much optimization as I can commit to for v1.5. Of course, if there are any optimizations folks might find or be able to help out with, I would be more than happy to include those in v1.5. Now that there's GPU support in place, once 1.5 is released, I may close this issue and create a new one specifically to focus on optimization opportunities (including those brought up in this issue).

As always, folks can test it by grabbing one of the last passing v1.5 builds (.whl archives are uploaded on each builder under the artifacts).

CJPNActual commented 2 years ago

@Breakthrough Sorry for being a pedant, however given that we're currently and collectively engaged in performance optimisation and/or research, could we speak in Gigapixels/second or some such, as it removes input resolution and framerate from the equation. Also worth mentioning one's hardware setup, as I'm relatively certain a Raspberry PI performs differently to... :)

CJPNActual commented 2 years ago

F.ex. my research UI: [Screenshot: Motion Detect Pigapixels]

Obviously adding Frame Skipping would, at the very least, serve to double the framerate, while GPx/s remains solid and the only true determinant of system performance.
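For reference, the Gpx/s figure is just frame rate times frame area. A minimal sketch (the function name is made up for this example), checked against two per-thread figures from the CUDA benchmark logs earlier in the thread:

```python
def gigapixels_per_second(fps: float, width: int, height: int) -> float:
    """Pixel throughput in Gpx/s: frames per second times pixels per frame."""
    return fps * width * height / 1e9


# Figures taken from the CUDA benchmark logs earlier in this thread.
print(round(gigapixels_per_second(307.44, 1920, 1080), 2))  # 0.64
print(round(gigapixels_per_second(61.32, 3840, 2560), 2))   # 0.6
```

Both resolutions land at roughly 0.6 Gpx/s per thread even though the FPS figures differ by a factor of five, which is the point being made about FPS alone being a poor yardstick.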

Breakthrough commented 2 years ago

v1.5 has officially been released, looking forward to any feedback for the new CUDA builds. Will close this issue, but feel free to create new issues or discussions if any issues crop up. As mentioned previously there are likely some areas of improvement regarding performance still, so happy to have any new ideas as well. Will leave this issue pinned for the meantime.

sundeepgoel72 commented 2 years ago

If there are any particular PoCs you want to conduct on the performance side using CUDA, or look at some sections of code for optimizations - give me a shout.

Currently my 750 Ti is only around 5-10% utilised while processing large files using MOG2_CUDA, so I suspect more could be done. I'm out for two weeks, but will be happy to give it a bash once back.

Breakthrough / DVR-Scan

Is there a flag to use GPU acceleration? #12