Open psteinb opened 5 years ago
Notes on usage:
This isn't going to work (yet) on larger images, only smaller ones. The limit depends on your GPU.
The Custom Convolution and Richardson-Lucy Deconvolution here use real-space convolution, not the now-standard "FFT, then multiplication in frequency space" trick used to accelerate convolution on CPUs and, increasingly, on GPUs too.
Real-space convolution is very expensive in both memory and compute. Doing the convolution as a multiplication in frequency space works because the FFT method only needs a single multiplication for each pair of directly corresponding pixels in the FFTs of the input and kernel images, not every possible combination of input-image and kernel pixels as in real space.
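To see that the two routes really are equivalent, here is a minimal sketch using scipy (array sizes are illustrative, chosen small enough that the direct method is still fast):

```python
import numpy as np
from scipy.signal import convolve, fftconvolve

# A small 2D image and kernel, small enough for the direct method.
rng = np.random.default_rng(0)
image = rng.random((64, 64)).astype(np.float32)
kernel = rng.random((7, 7)).astype(np.float32)

# Real-space convolution: every output pixel touches every kernel pixel.
direct = convolve(image, kernel, mode="same", method="direct")

# Frequency-space route: FFT both, multiply pointwise, inverse FFT.
via_fft = fftconvolve(image, kernel, mode="same")

# Both routes give the same result up to floating-point error.
print(np.allclose(direct, via_fft, atol=1e-3))
```

The pointwise multiply in the FFT route is what replaces the all-pairs sum of the real-space version.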
Memory: Whatever way you do it, an "out of place" convolution needs at least twice the image size in memory (input and output images) plus the size of the convolution kernel image. GPUs still often have less memory than the main system RAM, so larger images and kernels become problematic for a real-space GPU method.
Compute: The FFT method reduces the complexity, in both operations and memory accesses, to a single multiplication between each corresponding pixel in the transformed image and the transformed kernel. So it is good on a CPU with access to large RAM, as long as the FFT operation itself is fast. A real-space "out of place" convolution done serially on one or a few CPU cores takes as many multiplication operations and memory accesses as there are combinations of input and kernel pixel locations - far more than the FFT method. Multi-threading on a modern multi-core CPU will make a real-space "out of place" convolution a few times more efficient, and on a GPU with enough compute cores it could even beat the FFT method, but that would need a huge number of cores. In 2019, the 3D biological images commonly acquired on modern microscopes are too large, in both memory and the number of cores required, for a real-space "out of place" GPU convolution in a single chunk. For small images and kernels, it's all good. Maybe there is a clever way around this limitation?
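The operation-count gap can be sketched numerically. The constants below (e.g. ~5 N log2 N per transform, three transforms) are common rough figures for a radix-2 FFT, not exact costs, and the image/kernel sizes are invented for illustration:

```python
import math

# Back-of-envelope multiply counts for the two routes.
N = 1024 * 1024 * 100   # voxels in a hypothetical 1024x1024x100 stack
K = 15 ** 3             # a 15^3 kernel

# Real space: every output voxel multiplies against every kernel voxel.
real_space = N * K

# FFT route: three transforms (image, padded kernel, inverse)
# at roughly 5 N log2 N each, plus N pointwise multiplies.
fft_route = 3 * 5 * N * math.log2(N) + N

print(f"real space : {real_space:.2e} multiplies")
print(f"FFT route  : {fft_route:.2e} operations (rough)")
print(f"ratio      : {real_space / fft_route:.1f}x")
```

The gap widens with the kernel size K, since the FFT route's cost does not depend on K at all once the kernel is padded to the image size.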
Compute-memory bandwidth: All of the above is complicated by the fact that, as well as the memory limitation, there is also a memory-to-compute bandwidth (transfer speed) bottleneck. The speed of data transfer between the GPU and the rest of the system depends heavily on the performance of the GPU and its components: a laptop with a GPU integrated on the motherboard's chip is likely to be weak compared to a powerful GPU card in a workstation, which has more memory, more cores, and higher bandwidth.
ping @tibuch
I propose to adjust the wiki a bit:
Not sure what you mean by that. In-place discrete convolutions don't work AFAIK: as you slide the kernel over the image, you alter the neighborhood that a neighboring pixel still needs as input. So one typically goes with an out-of-place convolution. In terms of memory, the naive approach requires as much memory as 2 times the image size (input and output) plus the size of the kernel. For e.g. 3D convolutions, there is a trick to accelerate the memory accesses by using a full plane as temporary storage. But I'd consider this an implementation detail and not relevant to the discussion.
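A tiny 1D sketch of why the naive in-place version goes wrong (the signal and kernel values are arbitrary, picked so the corruption is visible):

```python
import numpy as np
from scipy.signal import convolve

signal = np.array([1.0, 4.0, 2.0, 8.0, 3.0])
kernel = np.array([0.25, 0.5, 0.25])  # symmetric smoothing kernel

# Correct out-of-place result for the interior samples.
out_of_place = convolve(signal, kernel, mode="valid")

# Naive "in-place" version: overwrite the buffer as we go, so each
# output write corrupts a neighborhood a later pixel still needs.
buf = signal.copy()
for i in range(1, len(buf) - 1):
    buf[i] = kernel[0] * buf[i - 1] + kernel[1] * buf[i] + kernel[2] * buf[i + 1]
in_place = buf[1:-1]

print(out_of_place)  # correct: [2.75 4.   5.25]
print(in_place)      # differs: earlier writes polluted later reads
```

From the second output sample onwards, the in-place loop reads an already-smoothed left neighbor instead of the original value, so the results diverge.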
No idea what you mean here. Are you referring to sequential operations?
I'd say, "As images and/or convolution kernels get larger", as in practice images get bigger first; then people try larger kernels on the images they just made bigger.
I propose to write this out, e.g. like "So the FFT method reduces the complexity in operations and memory accesses to a multiplication operation between each pixel in the transformed image and the transformed kernel."
This is a problematic statement. The speed of the FFT transforms is not the only bottleneck you encounter. FFTs require temporary memory depending on the shape of your inputs; this can go up to 3-4x the input signal (that is at least the case for cufft). An FFT is basically a reduction operation: it is bound not by the computing speed of your hardware, but by how well/fast your cores can access memory. Further, the multiplication operation has a fixed complexity which is much lower than that of a discrete convolution - but it requires the transforms first.
I personally wouldn't make this claim without backing measurements. I did a benchmark 4-5 years ago on a Kepler server GPU. There, discrete convolutions of radius=1 (so 3x3x3 kernels of float32) performed just as well as an FFT-based convolution irrespective of the size of the input image. Anything larger than radius=1 favored the FFT convolution.
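A CPU-only scipy version of that kind of benchmark can be sketched as below. This runs on different hardware than the Kepler GPU measurement described above, so the absolute numbers are only indicative of the radius trend, and the 48^3 image size is an assumption chosen to keep the direct method tractable:

```python
import numpy as np
from scipy.signal import convolve
from time import perf_counter

rng = np.random.default_rng(1)
image = rng.random((48, 48, 48)).astype(np.float32)

for radius in (1, 3, 5):
    k = 2 * radius + 1
    kernel = rng.random((k, k, k)).astype(np.float32)

    t0 = perf_counter()
    direct = convolve(image, kernel, mode="same", method="direct")
    t_direct = perf_counter() - t0

    t0 = perf_counter()
    via_fft = convolve(image, kernel, mode="same", method="fft")
    t_fft = perf_counter() - t0

    # Sanity check: both methods agree up to float error.
    assert np.allclose(direct, via_fft, atol=1e-2)
    print(f"radius={radius}: direct {t_direct:.3f}s, fft {t_fft:.3f}s")
```

The expected pattern matches the comment above: at radius=1 the two are comparable, and the direct method's cost grows with the cube of the kernel radius while the FFT route's does not.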