[REQUEST] GPU Pre-allocate & Pre-transfer Images

MorganGrundy commented 2 years ago

Currently the GPU memory allocation and data transfer happens after the Generate button is clicked. If we pre-allocate and pre-transfer then we can shave a bit of time off of photomosaic generation.

MorganGrundy commented 2 years ago

We will probably need to use OpenCV's cv::cuda::GpuMat so that we can still use the basic image operations such as resize, convertTo, cvtColor, etc...

Although we would then need to either copy the images into contiguous device memory or change the CUDA kernels to work with them.

MorganGrundy commented 2 years ago

Before making changes for this I need to do a series of performance tests.

MorganGrundy commented 2 years ago

I'm doing the initial benchmarking and profiling using the new Benchmark project (and it's CLI interface with NVIDIA Visual Profiler), although I'm having some trouble with NVVP. I have a set of profile data saved but NVVP becomes unresponsive during analysis.

With the current data I have it took my software around 61s to generate a Photomosaic (using default CLI settings), but the actual CUDA kernels are only running for about ~17s (and MemCpy of < 1.5s).

The individual kernels are at:

euclideanDifferenceKernel 89.1%
reduceAdd<1024> 7.6%
reduceAdd<2> 2.9%
findLowestKernel 0.4%
calculateRepeats 0.0%
memset 0.0%

MorganGrundy commented 2 years ago

I don't remember why I wanted detailed profiling on the kernels. I've done some simple initial profiles (not using the benchmark project, but the full software), here are the cases: CUDA-Perf-Cases.txt Here are the initial results: CUDA-Perf-Results-Initial.txt

If I end up wanting more detail I can always take an older build.

MorganGrundy commented 2 years ago

Extract from dd12a0f:

WARNING: During a short performance test I noticed a deviation in the photomosaic result (test case 3). Next step I need to fix the tests to work with these changes and find what caused this change.

MorganGrundy commented 2 years ago

From the short performance test (test case 3) I did after dd12a0f I am seeing:

MemCpy:
- HtoD - ~25% decrease of duration.
- DtoH - Not sure why but we are now sending far less data from device to host so duration is <1ms (did I have OPENCV_W_CUDA enabled for the initial tests? If so then that could explain this).
- DtoD - Significant increase in total bytes but also throughput. Duration is about the same, maybe slightly less.
Compute - Small increase in duration (likely due to reduced memory locality?).
An increase in the total generate time (but it doesn't seem to correspond with the above).

MorganGrundy commented 2 years ago

I'm trying to fix the tests but I'm getting CUDA errors. I'm struggling to get CUDA debugging working, even on the sample CUDA project they use in the debugging tutorial... this is a pain.

MorganGrundy commented 2 years ago

I changed the difference kernels to just take a single library image, so we now launch the kernel for each library image. I'm not sure if I'll keep it like this.

The tests now actually run but we are getting fails from the CPU vs CUDA generator tests. We aren't getting fails from the difference kernel comparison so the problem is likely in the CUDAPhotomosaicGenerator.

MorganGrundy commented 2 years ago

Whatever is going wrong, it does seem to be consistent. Running the tests multiple times is giving the exact same failures.

MorganGrundy commented 2 years ago

I compared the CUDAPhotomosaicGenerator from the experimental branch against the master to see what had changed.

I removed the use of the GpuMats (use preprocessLibraryImages() instead of preprocessCUDALibraryImages()) and now the tests are passing. So there is some difference between their functionality. From an initial look I'm not seeing any obvious difference.

I'll need to add some test cases:

In CUDAImageLibrary check that the Mat and GpuMat are identical after each operation.
Compare the preprocess results from CUDAPhotomosaicGenerator::preprocessCUDALibraryImages() and PhotomosaicGeneratorBase::preprocessLibraryImages().

MorganGrundy commented 2 years ago

So I am in the process of adding the above test cases and I narrowed the problem down to the ImageUtility mat resizing. The results from the cv::resize() and cv::cuda::resize() are different.

I found the following issue (apparently I've seen this before but I don't remember it): opencv#4728 It is a long standing issue that got closed as wontfix exactly 2 years and 1 day ago.

This is quite annoying... not sure where I will go from here.

Scrap the CUDAImageLibrary?
Accept the variation. Change CUDAImageLibrary test cases to allow a certain variance in the images. Generator CPU vs CUDA test cases are a bit more complicated:
1. Remove the test cases
2. For the best fits that don't match; get the difference between the cell image and the best fits, compare the differences allowing a certain variance.

MorganGrundy commented 2 years ago

I just ended up disabling the test case for the CUDAImageLibrary set size. I have no idea what variance is acceptable/expected for this... so I'll just ignore it. I'll still try to add an allowed variance to the generator CPU vs CUDA tests though.

Not sure if I'll add any test cases for the library preprocess functions. We can't test the resize only the colour space transform... which seems kind of pointless.

MorganGrundy commented 2 years ago

I'm thinking that I'll probably just end up scrapping the CUDAImageLibrary... we didn't really get any performance benefit from it and the GpuMat have just complicated things.

We could still pre-allocate the cuda memory, but still we won't see much benefit.

Yeah, okay, I've decided. I'll do one more commit to scrap the CUDAImageLibrary then close this issue. I'll continue work on improving the CUDA generator performance though.

MorganGrundy commented 2 years ago

CUDAImageLibrary has been removed and the CPU vs CUDA generator tests are passing again.

MorganGrundy / MosaicMagnifique

[REQUEST] GPU Pre-allocate & Pre-transfer Images #22