MorganGrundy / MosaicMagnifique

Application for generating Photomosaics. Turn your images into beautiful Photomosaics.
https://morgangrundy.github.io
GNU General Public License v3.0
14 stars 0 forks source link

[REQUEST] GPU Pre-allocate & Pre-transfer Images #22

Closed MorganGrundy closed 2 years ago

MorganGrundy commented 2 years ago

Currently the GPU memory allocation and data transfer happens after the Generate button is clicked. If we pre-allocate and pre-transfer then we can shave a bit of time off of photomosaic generation.

MorganGrundy commented 2 years ago

We will probably need to use OpenCV's cv::cuda::GpuMat so that we can still use the basic image operations such as resize, convertTo, cvtColor, etc...

Although we would then need to either copy the images into contiguous device memory or change the CUDA kernels to work with them.

MorganGrundy commented 2 years ago

Before making changes for this I need to do a series of performance tests.

MorganGrundy commented 2 years ago

I'm doing the initial benchmarking and profiling using the new Benchmark project (and it's CLI interface with NVIDIA Visual Profiler), although I'm having some trouble with NVVP. I have a set of profile data saved but NVVP becomes unresponsive during analysis.

With the current data I have it took my software around 61s to generate a Photomosaic (using default CLI settings), but the actual CUDA kernels are only running for about ~17s (and MemCpy of < 1.5s).

The individual kernels are at:

MorganGrundy commented 2 years ago

I don't remember why I wanted detailed profiling on the kernels. I've done some simple initial profiles (not using the benchmark project, but the full software), here are the cases: CUDA-Perf-Cases.txt Here are the initial results: CUDA-Perf-Results-Initial.txt

If I end up wanting more detail I can always take an older build.

MorganGrundy commented 2 years ago

Extract from dd12a0f:

WARNING: During a short performance test I noticed a deviation in the photomosaic result (test case 3). Next step I need to fix the tests to work with these changes and find what caused this change.

MorganGrundy commented 2 years ago

From the short performance test (test case 3) I did after dd12a0f I am seeing:

MorganGrundy commented 2 years ago

I'm trying to fix the tests but I'm getting CUDA errors. I'm struggling to get CUDA debugging working, even on the sample CUDA project they use in the debugging tutorial... this is a pain.

MorganGrundy commented 2 years ago

I changed the difference kernels to just take a single library image, so we now launch the kernel for each library image. I'm not sure if I'll keep it like this.

The tests now actually run but we are getting fails from the CPU vs CUDA generator tests. We aren't getting fails from the difference kernel comparison so the problem is likely in the CUDAPhotomosaicGenerator.

MorganGrundy commented 2 years ago

Whatever is going wrong, it does seem to be consistent. Running the tests multiple times is giving the exact same failures.

MorganGrundy commented 2 years ago

I compared the CUDAPhotomosaicGenerator from the experimental branch against the master to see what had changed.

I removed the use of the GpuMats (use preprocessLibraryImages() instead of preprocessCUDALibraryImages()) and now the tests are passing. So there is some difference between their functionality. From an initial look I'm not seeing any obvious difference.

I'll need to add some test cases:

MorganGrundy commented 2 years ago

So I am in the process of adding the above test cases and I narrowed the problem down to the ImageUtility mat resizing. The results from the cv::resize() and cv::cuda::resize() are different.

I found the following issue (apparently I've seen this before but I don't remember it): opencv#4728 It is a long standing issue that got closed as wontfix exactly 2 years and 1 day ago.

This is quite annoying... not sure where I will go from here.

  1. Scrap the CUDAImageLibrary?
  2. Accept the variation. Change CUDAImageLibrary test cases to allow a certain variance in the images. Generator CPU vs CUDA test cases are a bit more complicated:
    1. Remove the test cases
    2. For the best fits that don't match; get the difference between the cell image and the best fits, compare the differences allowing a certain variance.
MorganGrundy commented 2 years ago

I just ended up disabling the test case for the CUDAImageLibrary set size. I have no idea what variance is acceptable/expected for this... so I'll just ignore it. I'll still try to add an allowed variance to the generator CPU vs CUDA tests though.

Not sure if I'll add any test cases for the library preprocess functions. We can't test the resize only the colour space transform... which seems kind of pointless.

MorganGrundy commented 2 years ago

I'm thinking that I'll probably just end up scrapping the CUDAImageLibrary... we didn't really get any performance benefit from it and the GpuMat have just complicated things.

We could still pre-allocate the cuda memory, but still we won't see much benefit.

Yeah, okay, I've decided. I'll do one more commit to scrap the CUDAImageLibrary then close this issue. I'll continue work on improving the CUDA generator performance though.

MorganGrundy commented 2 years ago

CUDAImageLibrary has been removed and the CPU vs CUDA generator tests are passing again.