Closed MorganGrundy closed 2 years ago
We will probably need to use OpenCV's cv::cuda::GpuMat so that we can still use the basic image operations such as resize, convertTo, cvtColor, etc...
Although we would then need to either copy the images into contiguous device memory or change the CUDA kernels to work with them.
Before making changes for this I need to do a series of performance tests.
I'm doing the initial benchmarking and profiling using the new Benchmark project (and it's CLI interface with NVIDIA Visual Profiler), although I'm having some trouble with NVVP. I have a set of profile data saved but NVVP becomes unresponsive during analysis.
With the current data I have it took my software around 61s to generate a Photomosaic (using default CLI settings), but the actual CUDA kernels are only running for about ~17s (and MemCpy of < 1.5s).
The individual kernels are at:
I don't remember why I wanted detailed profiling on the kernels. I've done some simple initial profiles (not using the benchmark project, but the full software), here are the cases: CUDA-Perf-Cases.txt Here are the initial results: CUDA-Perf-Results-Initial.txt
If I end up wanting more detail I can always take an older build.
Extract from dd12a0f:
WARNING: During a short performance test I noticed a deviation in the photomosaic result (test case 3). Next step I need to fix the tests to work with these changes and find what caused this change.
From the short performance test (test case 3) I did after dd12a0f I am seeing:
I'm trying to fix the tests but I'm getting CUDA errors. I'm struggling to get CUDA debugging working, even on the sample CUDA project they use in the debugging tutorial... this is a pain.
I changed the difference kernels to just take a single library image, so we now launch the kernel for each library image. I'm not sure if I'll keep it like this.
The tests now actually run but we are getting fails from the CPU vs CUDA generator tests. We aren't getting fails from the difference kernel comparison so the problem is likely in the CUDAPhotomosaicGenerator.
Whatever is going wrong, it does seem to be consistent. Running the tests multiple times is giving the exact same failures.
I compared the CUDAPhotomosaicGenerator from the experimental branch against the master to see what had changed.
I removed the use of the GpuMats (use preprocessLibraryImages() instead of preprocessCUDALibraryImages()) and now the tests are passing. So there is some difference between their functionality. From an initial look I'm not seeing any obvious difference.
I'll need to add some test cases:
So I am in the process of adding the above test cases and I narrowed the problem down to the ImageUtility mat resizing. The results from the cv::resize() and cv::cuda::resize() are different.
I found the following issue (apparently I've seen this before but I don't remember it): opencv#4728 It is a long standing issue that got closed as wontfix exactly 2 years and 1 day ago.
This is quite annoying... not sure where I will go from here.
I just ended up disabling the test case for the CUDAImageLibrary set size. I have no idea what variance is acceptable/expected for this... so I'll just ignore it. I'll still try to add an allowed variance to the generator CPU vs CUDA tests though.
Not sure if I'll add any test cases for the library preprocess functions. We can't test the resize only the colour space transform... which seems kind of pointless.
I'm thinking that I'll probably just end up scrapping the CUDAImageLibrary... we didn't really get any performance benefit from it and the GpuMat have just complicated things.
We could still pre-allocate the cuda memory, but still we won't see much benefit.
Yeah, okay, I've decided. I'll do one more commit to scrap the CUDAImageLibrary then close this issue. I'll continue work on improving the CUDA generator performance though.
CUDAImageLibrary has been removed and the CPU vs CUDA generator tests are passing again.
Currently the GPU memory allocation and data transfer happens after the Generate button is clicked. If we pre-allocate and pre-transfer then we can shave a bit of time off of photomosaic generation.