PERF: Leverage speedup over requested region subset in `FFTConvolutionImageFilter`

tbirdso commented 2 years ago

Description

A performance discrepancy is observed when blurring with FFTDiscreteGaussianImageFilter in the frequency domain via GPU-based FFTs vs DiscreteGaussianImageFilter in physical space via convolution with a separable kernel on CPU.

Impact analysis

In subplots below "Vk" represents the GPU implementation and "Vnl" represents the CPU implementation.

In the top row 3D cube images of variable "LargestPossibleRegionSize" are blurred. Filter performance follows a natural, smooth trend as expected.

In the bottom row 3D cube images of fixed "LargestPossibleRegion" size with variable "RequestedOutputRegion" sizes are blurred. GPU FFT convolution plateaus while CPU FFT shows a sharp performance increase over comparable input size in the top row. The result is that GPU FFT convolution offers minimal to no performance boost over CPU implementation for most image region subsets.

Single- vs multi-threaded results included to show that threading does not fully account for difference in performance curves.

blur-benchmark-multithreaded-plot-2

It is hypothesized that the performance difference is based on boundary handling in the respective filters. FFTConvolutionImageFilter::GenerateInputRequestedRegion always requests the largest possible input region, so every call processes the same region size irrespective of the requested output. This leads to the plateau observed in the lower row. By contrast, DiscreteGaussianImageFilter propagates the requested output region so that a subset of the input is actually processed. When the requested region is centered inside the input image, it is possible for boundary pixels to be obtained directly from the surrounding area in the input image rather than generated dynamically from the boundary condition. There is some evidence that the difference in boundary processing is a core issue here, setting the requested region index at (0,0,0) results in lessened CPU blurring performance.

Expected behavior

GPU and CPU blurring performance should follow a similar trend when OutputRequestedRegion is reduced from LargestPossibleRegion.

Actual behavior

CPU blurring performance exhibits significant speedup when processing image subsets while GPU blurring performance is independent of the requested region.

Versions

ITK v5.3rc04

Environment

Windows, CMake 3.22

Additional Information

Will investigate and benchmark adjustments to how FFTConvolutionImageFilter sets its requested input region size.

tbirdso commented 2 years ago

Investigation shows that underlying FFT classes also request the largest possible region in each FFT direction, which limits our ability to optimize for image subregions. Exploring the impact of propagating only the requested region for ITK pipeline optimization.

@thewtex Please let me know or update here if you are pursuing a similar investigation so that we can avoid duplicate effort.

tbirdso commented 2 years ago

Additional notes:

itk::FFTConvolutionImageFilter uses itk::RealToHalfHermitianForwardFFTImageFilter and itk::HalfHermitianToRealInverseFFTImageFilter, so it makes sense to focus on those first
forward FFT enlarges the requested output region to the largest possible region by default. While this may maximize description of the whole image, it is not performant in a pipeline with varying requested image regions.
As previously noted, forward FFT enlarges the input requested region to the largest possible region.

dzenanz commented 2 years ago

FFT enlarges the input requested region

This might not be due to performance improvement. It might be how the filter works. I think it needs the entire domain as it is equivalent to convolution with an infinite window size.

tbirdso commented 2 years ago

This might not be due to performance improvement. It might be how the filter works. I think it needs the entire domain as it is equivalent to convolution with an infinite window size.

Right, but at the same time it is reasonable to request only a subregion of the original image and act as if that is the "entire image" to process.

For instance, we want roughly the same result from applying forward FFT to the following two cases:

A 3x3 image of all ones, operating over the largest possible region;
A 9x9 image of zeros with a 3x3 grid of ones in the center, operating over the 3x3 grid of ones (index [3,3], size [3,3])

In both cases the FFT filter should "see" only the 3x3 grid of ones for its operations.

dzenanz commented 2 years ago

Does the non-FFT version ignore parts of the image outside of the requested region, or does it access the kernel-neighborhood-size worth of pixels outside of the requested region? If so, the FFT version's kernel size is the entire image, so they are equivalent in that regard.

dzenanz commented 2 years ago

FFT filter should "see" only the 3x3 grid

If you want that, you can first use region of interest filter and then apply FFT convolution. But I don't think the result will be equivalent.

SimonRit commented 2 years ago

FFT enlarges the input requested region to the largest possible region.

I might misunderstand the problem but I believe this is required: you need the whole input image to compute a piece of the FFT.

tbirdso commented 2 years ago

I see both of your points, this may be a snag in potential application to multiresolution speedup. Will experiment a bit with the region of interest filter to see how close we can get with cropping.

tbirdso commented 2 years ago

I have added a proof of concept notebook at https://github.com/InsightSoftwareConsortium/ITKVkFFTBackend/pull/47 demonstrating how we can use subregions in FFTs for convolution speedup. @dzenanz @SimonRit would you please take a look and let me know if you believe that procedure is sound?

The notebook does not explicitly rely on VkFFT speedup (I used Vnl backends for generating output), could be moved into ITKSphinxExamples in the future.

tbirdso commented 2 years ago

As a quick update, I am now actively working to apply the approved procedure in the notebook I linked above to itk::FFTConvolutionImageFilter to leverage the ITK pipeline for speedup over image subregions. The filter currently assumes in many places that the largest possible region is being used, so effort is nontrivial but will benefit us in allowing multi resolution pyramid generation to be sped up with accelerated FFTs.

tbirdso commented 2 years ago

Starting to see improved results where FFT convolution time decreases as subregion size decreases. However, multithreaded spatial convolution still narrowly beats out FFT convolution for centered subregions. Will need to investigate further internal pipeline optimizations.

fft-conv-benchmark

InsightSoftwareConsortium / ITK