JuliaImages / ImageFiltering.jl

Julia implementations of multidimensional array convolution and nonlinear stencil operations

Is there a practical example of using imfilter on GPUs? #209

Open jjgomezcadenas opened 3 years ago

jjgomezcadenas commented 3 years ago

Hello!

I am interested in using Julia/imfilter for a deconvolution problem in particle physics. In short, our experiment (http://next.ific.uv.es/next/) takes "electronic movies" of electrons propagating in dense xenon gas. We divide those pictures into a stack of about 100 2D images. Each image is "blurred" by the diffusion of the ionisation electrons on their way to our sensors, but the diffusion can be described very well (by an approximately Gaussian kernel), so we can deconvolve our pictures and reconstruct the track using a classical Richardson-Lucy deconvolution. If curious, see:

Boosting background suppression in the NEXT experiment through Richardson-Lucy deconvolution 2102.11931 [physics.ins-det]

The algorithm (and all our software) is implemented in Python (+numpy, scipy, etc.). The RL algorithm is implemented in scikit-image, one of the libraries of that ecosystem, and is pretty fast, but not fast enough for our problem. Typically we take about 2.5 seconds per image, so about 30 seconds per event... and we need to process zillions of events. Currently we solve the problem by brute force, running in large clusters, one event per node. Clearly not optimal.

The problem appears to be well suited for GPUs, since it involves loops and more loops of convolutions. I have implemented RL in Julia using imfilter (it was surprisingly easy, the interface and docs are very clever, congrats!) and I have checked that it runs pretty fast on the CPU (about 0.5 second on my Mac, in multi-threaded mode). Now I would like to port the algorithm to GPUs. The docs indicate that one could in principle call imfilter with CUDALibs() as the resource, but I have failed to find a single example of anyone trying this.

So my question (apologies for the long intro): can one run imfilter with CUDALibs() out of the box today? If yes, is there a practical example, a test, or a tutorial one can look at? If not, are there plans to implement it?

Thanks a lot, and congrats on a really clever library.

timholy commented 3 years ago

There isn't now, but it's definitely on the agenda: https://julialang.org/jsoc/gsoc/images/#gpu_support_for_many_algorithms_medium

That said, no one has claimed it yet, though I have my fingers crossed for this GSoC period. If you're interested in tackling it, we would love to have this!

jjgomezcadenas commented 3 years ago

Thanks for the input. We would certainly love to help, but we are newcomers to Julia (I moved all our software from C++ to Python a few years ago; if I were doing it now I would have tried Julia). My plan is to use Julia in specific areas where there is a distinct advantage that outweighs the burden of adding a second language to the software system, and image processing on GPUs would be one, if the software were mature enough. From what I can see, this may still take a while. One possibility that I will think about is to try a toy example and evaluate how hard it would be to extend it to the full problem.

Thanks again.

timholy commented 3 years ago

It's hard to say how hard it would be (Julia is exceptionally nice for GPU programming), but it is probably not the best project for a Julia beginner. Let's leave this open and you'll know it's done when it gets closed. Certainly don't expect anything in the next couple of months, but I'm hopeful it won't be too bad and that someone will find the time to tackle it.

rafaqz commented 3 years ago

Maybe relevant: DynamicGrids.jl can run direct convolutions on the GPU using KernelAbstractions.jl. A simple direct windowing implementation is pretty easy to write as a toy example (see the sketch below), but something faster, or other algorithms for larger window sizes, might be more difficult. Unfortunately the code is pretty abstract, as it handles running arbitrary functions over arbitrary neighborhood shapes, but maybe it gives an idea of what is involved.
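For concreteness, a minimal sketch of such a direct-windowing kernel might look like the following. This is a toy, not DynamicGrids.jl code: it assumes the KernelAbstractions.jl 0.9 API, a pre-padded image, and a "valid"-size output, and the names `direct_window!` and `direct_conv` are made up.

```julia
using KernelAbstractions

@kernel function direct_window!(out, @Const(img), @Const(kern))
    I = @index(Global, Cartesian)                  # one output pixel per work item
    acc = zero(eltype(out))
    @inbounds for J in CartesianIndices(kern)
        acc += img[I + J - oneunit(J)] * kern[J]   # slide the window over the (padded) image
    end
    @inbounds out[I] = acc
end

# Runs on whatever backend the arrays live on (CPU threads, CUDA, ...).
function direct_conv(img, kern)
    out = similar(img, size(img) .- size(kern) .+ 1)   # "valid" region only
    backend = get_backend(out)
    direct_window!(backend)(out, img, kern; ndrange = size(out))
    KernelAbstractions.synchronize(backend)
    return out
end
```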

In my limited experience, really fast GPU code does seem to require knowledge of GPU blocks, threads, and shared memory to optimize, which I don't have yet either, but I'm interested to see what you come up with. Even a toy example on the GPU is still pretty fast.

jjgomezcadenas commented 3 years ago

Hi again. I have run a simple exercise, convolving an image on the CPU/GPU using FFTs. You can find it here:

https://github.com/jjgomezcadenas/JImages/blob/main/nb/ConvolvingLenaGPUFFT.ipynb

Several remarkable points:

  1. The straightforward (and possibly pretty naive) algorithm on the CPU is sort of slow. It takes about 0.5 second to convolve a classical Lena (512, 512) image with a 5x5 Gaussian filter. I actually don't understand what I am doing wrong. I have tried the same exercise using a sliding window and I manage to go somewhat faster (0.1 to 0.3 seconds depending on the machine).
  2. The straight translation to the GPU (simply calling CUDA FFTs, roughly as in the sketch after this list) improves speed by a factor of 20 to 50. Not bad.
  3. But then, using imfilter out of the box I get similar performance, on the CPU with a single thread!
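(For reference, the GPU part of that notebook boils down to something like the sketch below; the data here are stand-ins, and the kernel is zero-padded to the image size and circularly shifted on the CPU before being sent to the GPU.)

```julia
using CUDA, FFTW   # FFTW supplies the generic fft/ifft; CUDA's CUFFT provides the CuArray methods

img  = rand(Float32, 512, 512)                                   # stand-in for the (512, 512) test image
kern = Float32[exp(-(x^2 + y^2) / 2) for x in -2:2, y in -2:2]
kern ./= sum(kern)                                               # normalized 5x5 Gaussian filter

# Zero-pad the kernel to the image size and shift it so its center sits at (1, 1).
padded = zeros(Float32, size(img))
padded[1:size(kern, 1), 1:size(kern, 2)] .= kern
padded = circshift(padded, .-(size(kern) .÷ 2))

img_d, kern_d = CuArray(img), CuArray(padded)
Kf = fft(kern_d)                                 # transform the kernel once; reuse it across images
conv_d = real.(ifft(fft(img_d) .* Kf))           # circular convolution via the convolution theorem
result = Array(conv_d)                           # copy back to the CPU if needed
```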

I thought @timholy might be interested in the exercise. If there is any obvious recipe that I can apply to improve the algorithm, I will be glad to try it. I am also amazed at how fast the imfilter algorithm runs (again, I would be happy to learn any tricks you care to share).

Thanks

rafaqz commented 3 years ago

Have you tried a much larger kernel with a larger image? These sizes are not favourable for FFTs or GPUs. I think imfilter chooses FFT when the kernel size is over 30!

You could also compare to imfilter specifying CPU1 and FFT to force it to use FFT.
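(Concretely, that comparison can be written as below; the image here is an arbitrary stand-in, and `CPU1` comes from ComputationalResources.)

```julia
using ImageFiltering, ComputationalResources   # ComputationalResources provides CPU1

img  = rand(Float32, 512, 512)        # any grayscale image will do
kern = Kernel.gaussian((2, 2))        # small, separable Gaussian kernel

out_fir = imfilter(CPU1(Algorithm.FIR()), img, kern)   # direct (factored) filtering
out_fft = imfilter(CPU1(Algorithm.FFT()), img, kern)   # force the FFT code path
out_fir ≈ out_fft                                      # same answer up to floating-point error
```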

jjgomezcadenas commented 3 years ago

I will try. The bottom line, in any case, is that it should be very simple to add the functionality of running imfilter on the GPU, at least in the case of FFTs, which, as you say, could be very useful out of the box for large image/kernel processing.

Our problem, however, goes in the opposite direction: we need to deconvolve many small images. Each electron in our detector is reconstructed as a stack of about 100 images, each image a square of about 300 pixels on a side (the exact image size is variable). Kernels are also small and of variable size. For each image we need to apply an iterative deconvolution procedure (Richardson-Lucy) which takes about 100 loops to converge, and each loop itself involves 2 convolutions. This repeats for many events. I suspect one should find a way to send the full stack of 100 images simultaneously to the GPU.
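(For readers unfamiliar with the procedure, a minimal Richardson-Lucy loop built on imfilter might look like the sketch below. This is not the NEXT production code; the function name and iteration count are illustrative, and a symmetric PSF is assumed so that imfilter's correlation coincides with true convolution.)

```julia
using ImageFiltering

function richardson_lucy(observed::AbstractMatrix, psf::AbstractMatrix; iterations = 100)
    kern = centered(psf)                    # give the PSF centered indices
    estimate = float.(observed)             # start from the observed image
    for _ in 1:iterations
        blurred  = imfilter(estimate, kern)                          # forward-blur the current estimate
        ratio    = observed ./ max.(blurred, eps(eltype(blurred)))   # guard against division by zero
        estimate = estimate .* imfilter(ratio, kern)                 # multiplicative update (2nd convolution)
    end
    return estimate
end

# Example on a single (300, 300) image with a 5x5 box PSF:
psf    = fill(1f0 / 25, 5, 5)
blurry = imfilter(rand(Float32, 300, 300), centered(psf))
sharp  = richardson_lucy(blurry, psf; iterations = 100)
```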

Thanks!

timholy commented 3 years ago

In that notebook, it's not obvious whether you're accounting for compilation time. Are those times you show after running it a second time?

imfilter uses at least one important trick: it attempts to factor the kernel, and if it is separable (as Gaussians are) then it computes a 2D convolution as a series of two 1-D convolutions. So rather than one convolution with 25 elements, it's two 5-element convolutions.
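(You can see the effect of that factorization from the user side; the snippet below is just an illustration comparing a dense Gaussian kernel with its explicitly factored form.)

```julia
using ImageFiltering

img = rand(Float32, 512, 512)

kdense = Kernel.gaussian((2, 2))           # dense 2-D Gaussian kernel
ksep   = KernelFactors.gaussian((2, 2))    # the same Gaussian as two 1-D factors

out1 = imfilter(img, kdense)   # imfilter factors the dense kernel internally when it can
out2 = imfilter(img, ksep)     # or you can pass the factored form directly
out1 ≈ out2                    # true up to floating-point error
```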

jjgomezcadenas commented 3 years ago

Yes, yes, I ran it several times; the times you see in the NB are after a few runs. The separation trick is nice indeed, and should account for at least a good portion of the huge difference with my naive FFT implementation. The fact that the GPU accelerates my toy calculation by a factor of 50 also suggests that simply using CUDA FFTs in imfilter (which should be easy, I guess) would result in extra gains, even for "smallish" kernels and images.

I have looked at the imfilter code to try to figure out if I could introduce these changes, but it is sort of convoluted itself... ;)!

timholy commented 3 years ago

To clarify, the factorization trick is only used for FIR filtering, not FFT filtering. Did you force FFT filtering with imfilter for comparison?

Since this is mostly about "why is my alternative implementation slow?", I confess I don't really have the bandwidth to help you figure that out. Try profiling your implementation and viewing it with ProfileView or equivalent to see if you can figure out what's going on; doing the same for imfilter's implementation will allow you to compare.
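(In case it helps, that profiling workflow is roughly the following; the image and kernel are arbitrary stand-ins.)

```julia
using Profile, ProfileView, ImageFiltering

img  = rand(Float32, 512, 512)
kern = Kernel.gaussian((2, 2))

imfilter(img, kern)             # warm up, so compilation doesn't get profiled
Profile.clear()
@profile for _ in 1:100         # repeat to collect enough samples
    imfilter(img, kern)
end
ProfileView.view()              # flame graph of where the time actually goes
```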

jjgomezcadenas commented 3 years ago

Hi Tim, thanks a lot for the input. Actually I am not that interested in my alternative implementation, which I did just to compare GPU vs CPU. That exercise yields an improvement using the GPU. But then, of course, what I would like to do is to adapt imfilter to the GPU, as originally stated. Alas, the code seems too complex for me to mess with.

rafaqz commented 3 years ago

I'm interested in adding GPU FFT to DynamicGrids.jl at some stage. It would be good if we could share a base package somehow, eventually... The DG convolutions code is even more convoluted - the price of generality.

Here it seems that imfilter just calls imfilter until all the arguments are clarified, so you should be able to find the method signature that GPU FFT can plug into. This is a very common coding style with multiple dispatch in Julia.
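(As a made-up illustration of that style, with names that are not ImageFiltering's actual internals: each method fills in one missing argument and calls back into the same function, so a GPU backend only needs to add a method for its own resource type.)

```julia
using ComputationalResources   # provides CPU1, CUDALibs, and friends

myfilter(img, kern)    = myfilter(CPU1(), img, kern)             # fill in a default resource
myfilter(r, img, kern) = myfilter(r, similar(img), img, kern)    # allocate the output
function myfilter(::CPU1, out, img, kern)
    out .= img     # stand-in for the real CPU filtering work
    return out
end
# myfilter(::CUDALibs, out, img, kern) = ...   # where a CuArray / GPU-FFT path could plug in

myfilter(rand(4, 4), ones(3, 3))   # dispatches down the chain to the CPU1 method
```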

yuvalwas commented 11 months ago

Hi, just wanted to check if there is any news on this. Is there any easy way to use imfilter on a CuArray?

ashwani-rathee commented 11 months ago

Not yet that I know of, but we want to move in that direction.