clij / clij-custom-convolution-plugin


wiki feedback #2

Open psteinb opened 5 years ago

psteinb commented 5 years ago

I propose to adjust the wiki a bit:

> That trick works because real space convolution is very memory and compute expensive: You have to have as many memory addresses as the image size multiplied by the kernel size, if you do it all in one go on a GPU.

Not sure what you mean by that. In-place discrete convolutions don't work AFAIK: as you slide the kernel over the image, you alter the neighborhood that a neighboring pixel still needs as input. So one typically goes with an out-of-place convolution. In terms of memory, the naive approach requires as much memory as two times the image size (input and output) plus the size of the kernel. For e.g. 3D convolutions there is a trick to accelerate the memory accesses by using a full plane as temporary storage, but I'd consider this an implementation detail that is not relevant to the discussion.
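For illustration only, here is a minimal NumPy sketch of a naive out-of-place convolution (not the plugin's OpenCL code; names and shapes are made up). Input, output and kernel are all resident at the same time, which is the 2x image + kernel footprint mentioned above:

```python
import numpy as np

def convolve_out_of_place(image, kernel):
    """Naive out-of-place 3D convolution, illustration only (far too slow for real data).

    Memory footprint: input + output (same size as input) + kernel,
    i.e. roughly twice the image size plus the kernel size.
    Note: the kernel is not flipped, so strictly this is a cross-correlation;
    for symmetric kernels the result is identical.
    """
    output = np.zeros_like(image)                      # second full-size buffer
    kz, ky, kx = kernel.shape
    rz, ry, rx = kz // 2, ky // 2, kx // 2
    # Zero-pad the input so every neighbourhood read stays in bounds.
    padded = np.pad(image, ((rz, rz), (ry, ry), (rx, rx)))
    for z in range(image.shape[0]):
        for y in range(image.shape[1]):
            for x in range(image.shape[2]):
                patch = padded[z:z + kz, y:y + ky, x:x + kx]
                output[z, y, x] = np.sum(patch * kernel)
    return output
```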

> Doing it in series on a single or few CPUs, it takes that many multiplication operations.

No idea what you mean here. Are you referring to sequential operations?

> As images and convolution kernels get bigger this quickly spirals way out of control.

I'd say "As images and/or convolution kernels", since in practice images get bigger first; then people try larger kernels on the images they just made bigger.

> So the FFT method reduces the complexity to a single multiplication operation between each pixel in the FT of the image and kernel.

I propose to write this out, e.g. like "So the FFT method reduces the complexity in operations and memory accesses to a multiplication operation between each pixel in the transformed image and the transformed kernel."

> So long as the expensive FFT is faster than the real space convolution, the FFT method is faster.

This is a problematic statement. The speed of the FFT transforms is not the only bottleneck you encounter. FFTs require temporary memory depending on the shape of your inputs; this can go up to 3-4 times the size of the input signal (that is at least the case for cuFFT). An FFT is basically a reduction operation: it is bound not by the computing speed of your hardware, but by how well/fast your cores can access memory. Further, the multiplication operation has a fixed complexity which is much lower than the one for a discrete convolution, but it requires the transforms first.
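For comparison, a minimal NumPy sketch of the FFT route (not cuFFT; array names are made up). The padded kernel and the complex transform buffers are exactly the temporary memory mentioned above:

```python
import numpy as np

def fft_convolve(image, kernel):
    """FFT-based convolution: one pointwise multiplication per voxel in frequency space.

    Boundaries wrap around (circular convolution), unlike zero padding in real space.
    np.fft works in complex128, so the transforms are sizeable temporary buffers.
    """
    # Pad the kernel to the image size ...
    padded_kernel = np.zeros(image.shape)
    kz, ky, kx = kernel.shape
    padded_kernel[:kz, :ky, :kx] = kernel
    # ... and roll its centre to the origin so the result is not shifted.
    padded_kernel = np.roll(padded_kernel,
                            (-(kz // 2), -(ky // 2), -(kx // 2)),
                            axis=(0, 1, 2))
    # Two forward transforms, one pointwise multiply, one inverse transform.
    result = np.fft.ifftn(np.fft.fftn(image) * np.fft.fftn(padded_kernel))
    return np.real(result)
```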

> However, if you had a GPU with terabytes of RAM, and many thousands of parallel compute units, the real space convolution could be faster than the FFT trick, even for reasonably large sized images as we get in biology.

I personally wouldn't make this claim without backing it with measurements. I did a benchmark 4-5 years ago on a Kepler server GPU. There, discrete convolutions of radius=1 (so 3x3x3 kernels of float32) performed just as well as an FFT-based convolution irrespective of the size of the input image. Anything larger than radius=1 favored the FFT convolution.
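A quick CPU-side way to reproduce that kind of comparison (scipy here is only a stand-in for the GPU kernels, and the sizes are arbitrary):

```python
import time
import numpy as np
from scipy.ndimage import convolve      # direct, real space convolution
from scipy.signal import fftconvolve    # FFT-based convolution

image = np.random.rand(128, 128, 128).astype(np.float32)

for radius in (1, 2, 4):
    k = 2 * radius + 1
    kernel = np.ones((k, k, k), dtype=np.float32) / k ** 3

    t0 = time.perf_counter()
    convolve(image, kernel, mode='constant')
    t_direct = time.perf_counter() - t0

    t0 = time.perf_counter()
    fftconvolve(image, kernel, mode='same')
    t_fft = time.perf_counter() - t0

    print(f"radius={radius}: direct {t_direct:.3f} s, FFT {t_fft:.3f} s")
```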

chalkie666 commented 5 years ago

@psteinb @haesleinhuepf Is this better, less wrong, and/or approaching useful?

Notes on usage:

This isn't going to work (yet) on larger images, only smaller ones. The limit depends on your GPU.

The Custom Convolution and Richardson-Lucy Deconvolution here use real space convolution, not the now standard "FFT then multiplication in frequency space" trick used to accelerate convolution on CPUs and now sometimes also on GPUs.

Real space convolution is both very memory and very compute expensive. The trick of doing the convolution as a multiplication in frequency space works because the FFT method only needs a single multiplication for each pair of directly corresponding pixels in the FFTs of the input and kernel images, not all possible input-pixel vs kernel-pixel combinations as in real space.

Memory: Whichever way you do it, an "out of place" convolution needs at least as much memory as twice the image size (input and output images) plus the convolution kernel image size. GPUs still often have less memory than the main system RAM, so larger images and kernels become problematic in a real space GPU method.
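As a back-of-the-envelope example (float32 voxels and an arbitrarily chosen 1024 x 1024 x 100 image with a 15^3 kernel, purely for illustration):

```python
# Out-of-place real space convolution: input + output + kernel must all fit.
bytes_per_voxel = 4                                    # float32
image_bytes = 1024 * 1024 * 100 * bytes_per_voxel      # ~0.42 GB
kernel_bytes = 15 ** 3 * bytes_per_voxel               # ~13.5 kB
total = 2 * image_bytes + kernel_bytes                 # input + output + kernel
print(f"~{total / 1e9:.2f} GB of GPU memory needed")   # ~0.84 GB
```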

Compute: The FFT method reduces the complexity in operations and memory accesses to a single multiplication between each corresponding pixel in the transformed image and the transformed kernel. So this is good on a CPU with access to large RAM, so long as the FFT itself is fast. Doing a real space "out of place" convolution in series on a single or a few CPUs takes as many multiplication operations and memory accesses as there are combinations of input and kernel pixel locations, which is much more expensive than the FFT method. Multi-threading on a modern multi-core CPU makes a real space "out of place" convolution a few times more efficient; on a GPU with enough compute cores it could in principle beat the FFT method, but it would need a huge number of cores. In 2019, the 3D images commonly acquired on modern microscopes are too large for a real space "out of place" convolution on a GPU to handle in a single chunk, in terms of both memory and number of cores. For small images and kernels, it's all good. Maybe there is a clever way around this limitation?
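A rough multiplication count (ignoring constant factors, same hypothetical image size as above) illustrates why the FFT route wins once kernels grow beyond the smallest radii:

```python
import math

n = 1024 * 1024 * 100                 # number of voxels in the image
for k in (3, 7, 15):                  # kernel edge lengths
    real_space = n * k ** 3           # one multiply per voxel-kernel pair
    # Roughly two forward FFTs plus one inverse FFT plus one multiply per voxel.
    fft_based = 3 * n * math.log2(n) + n
    print(f"{k}^3 kernel: real space ~{real_space:.1e} multiplies, "
          f"FFT route ~{fft_based:.1e}")
```

For a 3^3 kernel the two counts are of the same order, which matches the radius=1 observation above; for larger kernels the real space count grows much faster.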

Compute-memory bandwidth: All of the above is complicated by the fact that, as well as the memory limitation, there is also a memory-to-compute bandwidth (transfer speed) bottleneck. The speed of data transfer between the GPU and the rest of the system depends strongly on the performance of the GPU and its components: a laptop with a GPU integrated on the motherboard's chip is likely to be weak compared to a powerful GPU card in a workstation, which has more memory, more cores and higher bandwidth.

frauzufall commented 5 years ago

ping @tibuch