CUDA, for loop and batches

edouardoyallon commented 9 years ago

Hi,

That's it, CUDA is working in this implementation. However, there is some strange behavior that might be due to the tuning I am doing.

First, mixing low-level and high-level code through the FFI (LuaJIT's foreign function interface) is pretty dangerous and easily leads to segfaults. Secondly, for some small inputs CUDA does worse than the CPU. Thirdly, CUDA does not handle the Lua "for loop" well: calling the cuFFT routines inside a for loop crashes when the loop is too large. I believe some synchronization is required, but so far I do not know how to perform it. Finally, the guru FFT interface is not available in the NVIDIA implementation, so I will have to work around it slightly in the FFTs. This means that the batch is limited to 1 dimension (or, more precisely, to the dimensions on the left), and the transform is limited to 3 dimensions (which is fine unless a convolution along more than 3 variables is required).
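
To make the batch/transform restriction concrete, here is a minimal sketch in Python, kept independent of the actual Lua/FFI code in the CUDA branch. It splits a tensor shape into a single batch count (the dimensions on the left) and at most three transformed dimensions. The function name `split_batch_and_transform` and the convention that the last axis of size 2 holds the real and imaginary parts are assumptions made for illustration, not the repository's API.

```python
from math import prod

# cuFFT plans cover 1D, 2D, or 3D transforms.
MAX_TRANSFORM_RANK = 3


def split_batch_and_transform(shape, transform_rank):
    """Split a complex tensor shape into (batch_count, transform_dims).

    Assumption: `shape` ends with a size-2 axis holding (real, imag),
    as in torch.randn(128, 3, 256, 256, 2). The `transform_rank` axes
    just before it are transformed; every axis further left is
    flattened into one batch count.
    """
    if shape[-1] != 2:
        raise ValueError("last axis must have size 2 (real, imag)")
    if not 1 <= transform_rank <= MAX_TRANSFORM_RANK:
        raise ValueError("transform rank must be 1, 2, or 3")

    spatial = tuple(shape[:-1])
    if transform_rank > len(spatial):
        raise ValueError("not enough axes for the requested transform rank")

    split = len(spatial) - transform_rank
    batch_dims = spatial[:split]       # dimensions on the left
    transform_dims = spatial[split:]   # dimensions that get transformed
    return prod(batch_dims), transform_dims
```

For the shape `(128, 3, 256, 256, 2)` with a 2D transform, this yields a batch of 384 transforms of size 256x256. A rank of 4 is rejected, which mirrors the limitation described above.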

If anyone has ideas on how to fix this, please share them. Otherwise, the code is in the CUDA branch (you will have to install the cuFFTW.so library carefully, however).

edouardoyallon commented 9 years ago

Just to explain the interest of moving to CUDA, for this signal: x=torch.randn(128,3,256,256,2) (a batch of 128 colour images of size 256x256 in the complex domain), the timings are:

FFT computation time on CUDA: 0.047905778884888
FFT computation time on CPU: 1.4280892372131

That is roughly 30 times faster on the GPU.

We're getting closer to imagenet!

edouardoyallon commented 9 years ago

The code is working with CUDA!

To compute a batch of scattering coefficients of size 128x3x32x32:

CPU: 9.3523290157318
GPU: 1.194904088974

I have not investigated the bottlenecks yet, but I guess there is still plenty of room for optimization! I have only checked the values of the low-pass output and the first layer; I will double-check that soon.

However, there is a slight bug, which might be due to the way I do the FFT (FFI, ...?). I can't use a for loop with this CUDA implementation: inside a for loop, it segfaults.

edouardoyallon commented 9 years ago

Problem found, the GPU is running out of memory... uhuh, need to avoid reproducing what we did on MATLAB... :smile:
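
As a rough illustration of how quickly a loop can exhaust GPU memory when buffers are not released, here is a back-of-the-envelope sketch in Python. The 4-byte element size (single precision) and the 4 GiB budget in the usage note are assumptions chosen for illustration; the thread does not state the precision used or the memory of the actual card.

```python
from math import prod


def tensor_bytes(shape, bytes_per_element=4):
    """Size in bytes of a dense tensor (assumes 4-byte floats by default)."""
    return prod(shape) * bytes_per_element


def iterations_until_full(shape, budget_bytes, bytes_per_element=4):
    """Number of loop iterations that fit in `budget_bytes` if every
    iteration allocates one new tensor of `shape` and nothing is freed."""
    return budget_bytes // tensor_bytes(shape, bytes_per_element)
```

Under these assumptions, one tensor of shape `(128, 3, 256, 256, 2)` takes 201,326,592 bytes (192 MiB), so a hypothetical 4 GiB budget would be exhausted after 21 leaked allocations of that size.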

edouardoyallon commented 8 years ago

Fixed. Thank you Sergey!! The issue was that the original fftw3 script did not handle the garbage collector! Next, I am going to avoid all the calls to cufftPlanMany, which take half of the computation time!
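
One common way to avoid repeated plan creation is to cache plans by transform size and batch count, and to destroy them explicitly rather than leaving them to the garbage collector. The sketch below is in Python and is not the code from the repository: `PlanCache`, `create_plan`, and `destroy_plan` are hypothetical names, where the two callables stand in for whatever bindings wrap `cufftPlanMany` and `cufftDestroy`.

```python
class PlanCache:
    """Reuse FFT plans instead of recreating them on every call.

    `create_plan(dims, batch)` and `destroy_plan(plan)` are hypothetical
    callables supplied by the caller, e.g. thin wrappers around
    cufftPlanMany and cufftDestroy.
    """

    def __init__(self, create_plan, destroy_plan):
        self._create = create_plan
        self._destroy = destroy_plan
        self._plans = {}

    def get(self, dims, batch):
        """Return a cached plan for (dims, batch), creating it if needed."""
        key = (tuple(dims), batch)
        if key not in self._plans:
            self._plans[key] = self._create(key[0], batch)
        return self._plans[key]

    def clear(self):
        """Destroy every cached plan explicitly and empty the cache."""
        for plan in self._plans.values():
            self._destroy(plan)
        self._plans.clear()
```

With such a cache, a loop that repeatedly transforms tensors of the same shape would create a plan once and reuse it afterwards, and calling `clear()` releases the plans at a well-defined point.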

edouardoyallon / scatwave