edouardoyallon / scatwave

ScatWave is a Torch implementation of scattering using CUDA
http://www.di.ens.fr/~oyallon/

FFT Cuda #1

Closed edouardoyallon closed 9 years ago

edouardoyallon commented 9 years ago

Hi,

The most difficult function could be the FFT:

Should we make functions specific to 1D, 2D, ..., or should we keep it like MATLAB's fft(x,k)?

Help requested on that.
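For illustration, a MATLAB-style `fft(x,k)` boils down to a single entry point that takes the transform dimension as an argument, rather than separate 1d/2d functions. A minimal sketch in NumPy (standing in for the cuFFT/Torch backend, and 0-indexed where MATLAB is 1-indexed):

```python
import numpy as np

def fft_along(x, k):
    """MATLAB-style fft(x,k): 1d FFT of x along dimension k (0-indexed here).

    Illustration only; a real backend would dispatch to cuFFT.
    """
    return np.fft.fft(x, axis=k)

# A batch of 3 signals of length 8, stored as rows.
x = np.arange(24.0).reshape(3, 8)
X = fft_along(x, 1)          # transform each row
assert X.shape == (3, 8)
# Sanity check: the DC bin equals the sum along the transformed axis.
assert np.allclose(X[:, 0], x.sum(axis=1))
```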

lostanlen commented 9 years ago

Hi,

Being able to choose the subscript is essential to the algorithm. In 1d:

As regards 2d scattering, I believe the only use case is dimensions [1 2].

So I wouldn't mind having two separate methods for FFT (1d and 2d). I would mind a lot if performing a 1d FFT along the first, second, third, or fourth dimension ended up in four different branches of the code. Then again, is fft2d faster than two 1d FFTs applied sequentially? (Memory allocation concerns aside.)

Numeric types : it is crucial to have FFT R2C (real-to-complex) and IFFT C2C (complex-to-complex). Our signals are real in spatial domain but complex in the Fourier domain.
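The distinction can be sketched in NumPy, as a stand-in for cuFFT's transform types: `rfft` is the R2C transform and stores only the non-redundant half-spectrum, while the inverse on a complex Fourier-domain intermediate is C2C:

```python
import numpy as np

x = np.random.default_rng(0).standard_normal(16)  # real signal, length N = 16

# R2C forward: only N//2 + 1 bins are stored; the rest follow
# from conjugate symmetry of the spectrum of a real signal.
X_half = np.fft.rfft(x)
assert X_half.shape == (9,)

# C2C inverse on the full spectrum: the Fourier-domain intermediate is
# complex, so the pipeline inverts with a C2C transform rather than C2R.
X_full = np.fft.fft(x)
y = np.fft.ifft(X_full)
assert np.allclose(y.real, x)
assert np.allclose(y.imag, 0.0, atol=1e-10)
```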

Finally, an API for `cufftXtSetCallback()` should be available, to avoid the performance cost of creating a new FFT plan (`cufftMakePlan()`) each time we process a signal with the same scattering architecture.

FFT can segfault a lot due to constraints such as the dimension along which you want to apply it and the mini-batch size.

Can you elaborate ?

edouardoyallon commented 9 years ago

Being able to choose the subscript is essential to the algorithm. In 1d:

  • time scattering operates on the first subscript
  • 1d space scattering operates on the second subscript (cf Mathieu)
  • frequency scattering operates on the second subscript when the audio file is not "chunked" into windows, on the third subscript when there is chunking
  • octave scattering operates on the fourth subscript if there is chunking, on the third subscript if there is no chunking.

Fair enough. But if it appears that we can optimise those routines, it would be great to do it (or at least to think about it). For now, let's keep it simple and follow the "fft(x,k)" style.

As regards 2d scattering, I believe the only use case is dimensions [1 2].

Yeah.

So I wouldn't mind having two separate methods for FFT (1d and 2d). I would mind a lot if performing a 1d FFT along the first, second, third, or fourth dimension ended up in four different branches of the code.

They should be merged, but also optimised so that the output can be fed to a CUDA classifier (i.e. a CUDA-based deep net).

Then again, is fft2d faster than two 1d FFTs applied sequentially? (Memory allocation concerns aside.)

I don't know. I think it could be slightly slower; still, in the Torch setting there are many ways to avoid memory allocation, so that should be fine.
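The equivalence itself is easy to check numerically: by separability of the 2d DFT kernel, one 2d FFT equals a 1d FFT over rows followed by a 1d FFT over columns. A NumPy sketch (cuFFT behaves the same way, up to performance):

```python
import numpy as np

img = np.random.default_rng(1).standard_normal((32, 32))

# One 2d FFT ...
F2 = np.fft.fft2(img)

# ... equals two sequential 1d FFTs, one per axis.
F1 = np.fft.fft(np.fft.fft(img, axis=0), axis=1)

assert np.allclose(F1, F2)
```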

Numeric types : it is crucial to have FFT R2C (real-to-complex) and IFFT C2C (complex-to-complex). Our signals are real in spatial domain but complex in the Fourier domain.

Yeah, that's implemented. We do not need C2R however.

Finally, an API for `cufftXtSetCallback()` should be available, to avoid the performance cost of creating a new FFT plan (`cufftMakePlan()`) each time we process a signal with the same scattering architecture.

I was thinking of hardcoding it and building all those functions each time we load the software. What do you think?

FFT can segfault a lot due to constraints such as the dimension along which you want to apply it and the mini-batch size.

Can you elaborate?

The idea would be to input 128 images/signals each time we process the scattering. The memory is difficult to handle, so the implementation will have to be done with a lot of care; that's all I meant. The version I currently have sometimes segfaults and I don't fully understand why... We need unit tests.

You can look at the FFT code I wrote via FFTW. I think it's not optimal, but it's a first step. There is also Facebook's code here: https://github.com/facebook/fbcunn

lostanlen commented 9 years ago

But if it appears that we can optimise those routines, it would be great to do it.

We do not have the tools to optimize FFTs themselves. What we can optimize is the way we perform convolutions: in the Fourier domain for medium/large output sizes (time, 2d space, frequency), in the spatial domain for small output sizes (scales for images, octaves for audio). Convolution along angles could go either way. So there is a method dispatch upstream, in the convolution function, but not in the call to FFT.
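That upstream dispatch can be sketched as follows; the `threshold` value and function names are hypothetical, and NumPy stands in for the CUDA backend. Both branches compute the same circular convolution and differ only in cost:

```python
import numpy as np

def conv_spatial(x, h):
    # Direct circular convolution, O(N*M): cheap when the filter h is short.
    N, M = len(x), len(h)
    return np.array([sum(h[m] * x[(n - m) % N] for m in range(M))
                     for n in range(N)])

def conv_fourier(x, h):
    # Circular convolution via the FFT, O(N log N): cheap when h is long.
    H = np.fft.fft(h, n=len(x))          # zero-pad h to the signal length
    return np.fft.ifft(np.fft.fft(x) * H).real

def conv(x, h, threshold=8):
    # The method dispatch lives here, upstream of the FFT call itself.
    return conv_spatial(x, h) if len(h) <= threshold else conv_fourier(x, h)

x = np.random.default_rng(2).standard_normal(64)
h = np.random.default_rng(3).standard_normal(5)
# Both routes agree, so the choice is purely a performance decision.
assert np.allclose(conv(x, h), conv_fourier(x, h))
```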

I was thinking of hardcoding it and building all those functions each time we load the software. What do you think?

Good idea. Then we need to call `cufftMakePlan()` in the "setup" function of the toolbox, which is called only once. Ideally the architecture generated by setup (or whatever name you fancy) should be immutable and passed by value at the call site, but that's another story.
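A hedged sketch of that setup-time planning, with a plain dict as the plan cache and a closure standing in for a real cuFFT plan handle (all names here are hypothetical, not the toolbox's API):

```python
import numpy as np

# Cache one "plan" per (shape, axis), built once at setup time, as the
# thread suggests, instead of re-planning for every signal.
_plans = {}

def setup_plan(shape, axis):
    key = (shape, axis)
    if key not in _plans:
        # A real backend would call cufftMakePlan() here; NumPy needs no
        # plan, so the cached "plan" is just a closure standing in for one.
        _plans[key] = lambda x: np.fft.fft(x, axis=axis)
    return _plans[key]

plan = setup_plan((128, 32), 1)        # called once, in the toolbox "setup"
batch = np.ones((128, 32))
out = plan(batch)                      # reused for every batch thereafter
assert setup_plan((128, 32), 1) is plan   # no re-planning on reuse
assert out.shape == (128, 32)
```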

The idea would be to input 128 images/signals each we process the scattering.

That makes a lot of sense. It adds another dimension to the input tensor, however. So the convolution along angles would naturally be along the fourth dimension (not the third), and so forth. My toolbox scattering.m reasons about this at runtime. We may want to have this at compile-time to improve performance, if we manage to leverage immutability. What I would like to avoid is specifying these things at code-typing-time :)
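The axis shift can be made concrete with a small NumPy sketch (shapes are illustrative only):

```python
import numpy as np

# Unbatched 2d scattering input: (height, width, angle); angles on axis 2.
single = np.zeros((32, 32, 8))

# With a mini-batch of 128 images prepended, every subscript shifts by one:
# (batch, height, width, angle); angles are now on axis 3.
batch = np.zeros((128, 32, 32, 8))

angle_axis_single = single.ndim - 1
angle_axis_batched = batch.ndim - 1
assert angle_axis_single == 2 and angle_axis_batched == 3

# So an FFT along the angle dimension must be parameterised by axis,
# not hard-coded, if the same code is to serve both layouts.
F = np.fft.fft(batch, axis=angle_axis_batched)
assert F.shape == batch.shape
```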

lostanlen commented 9 years ago

After reading issue #5, I see you want that at code-typing-time. I can live with that, actually. Be aware that it demands a great deal of dedication, though.

edouardoyallon commented 9 years ago

But if it appears that we can optimise those routines, it would be great to do it.

We do not have the tools to optimize FFTs themselves.

We do. It simply means coding them in a nice way. No need to go too deep into CUDA.

What we can optimize is the way we perform convolutions: in the Fourier domain for medium/large output sizes (time, 2d space, frequency), in the spatial domain for small output sizes (scales for images, octaves for audio). Convolution along angles could go either way. So there is a method dispatch upstream, in the convolution function, but not in the call to FFT.

If the filters have more than 8 coefficients, then it also makes sense to do it via FFT. It would be great to implement, yet it is not the main bottleneck right now (which is the FFT of the signal along the space/time variable).
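A crude operation-count model makes the trade-off concrete. The constant `c` below is arbitrary and the real crossover is hardware-dependent, so this is only a sketch of the reasoning, not a benchmark:

```python
import math

def direct_ops(N, M):
    # Direct (spatial) convolution: roughly N*M multiply-adds.
    return N * M

def fft_ops(N, c=3.0):
    # FFT-based convolution: transforms plus a pointwise product,
    # roughly c * N * log2(N) per transform with a small constant c.
    return 2 * c * N * math.log2(N) + N

N = 1024
# A very short filter favours the direct method; a long one favours the
# Fourier route, in line with the filter-length remark above.
assert direct_ops(N, 8) < fft_ops(N)
assert direct_ops(N, 128) > fft_ops(N)
```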

I was thinking of hardcoding it and building all those functions each time we load the software. What do you think?

Good idea. Then we need to call `cufftMakePlan()` in the "setup" function of the toolbox, which is called only once. Ideally the architecture generated by setup (or whatever name you fancy) should be immutable and passed by value at the call site, but that's another story.

Yeah. Let's say that we keep it in mind and first get something running. I think the main point is to write unit tests.

The idea would be to input 128 images/signals each time we process the scattering.

That makes a lot of sense. It adds another dimension to the input tensor, however. So the convolution along angles would naturally be along the fourth dimension (not the third), and so forth. My toolbox scattering.m reasons about this at runtime. We may want to have this at compile-time to improve performance, if we manage to leverage immutability. What I would like to avoid is specifying these things at code-typing-time :)

Well, I don't agree with that, in the sense that before specifying things at a high level, we need to find the correct framework. So, until we find it, it's better if everybody understands each part of the code, simultaneously for text (maybe not for text, because as far as I understand, scattering is not necessarily the correct solution there), images, sounds & video.

Right now, we need a plan to perform some collaborative work.

edouardoyallon commented 9 years ago

Fixed. The regular FFTW/CUDA functions are used, and for now, only via FFI, avoiding low-level instructions.