Batch Mode Data Ordering?

spinicist commented 2 years ago

Hello,

I'm very interested in adding VkFFT to my project here: https://github.com/spinicist/riesling as an addition alongside FFTW.

However I'm confused about the data orderings supported for batched FFTs in VkFFT. Currently my batch dimension is the innermost/fastest varying dimension for my input array, and the FFT dimensions are the outermost. You can see how I set up an FFTW plan here: https://github.com/spinicist/riesling/blob/main/src/fft/cpu.hpp#L58 - note I specify N FFTs that are spaced 1 element apart in memory.
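To make the two orderings concrete, here is a minimal library-free sketch of the index arithmetic. The `stride` and `dist` names follow the parameters of FFTW's advanced interface. `BatchLayout`, `batchInnermost` and `batchOutermost` are hypothetical helper names used only for illustration, and the FFT dimensions are flattened into a single element count.

```cpp
#include <cassert>
#include <cstddef>

// Describes where element k of transform b sits in a flat array:
//   stride = gap between consecutive elements of one transform
//   dist   = gap between the first elements of consecutive transforms
struct BatchLayout {
    std::size_t nElems;  // elements per transform (product of the FFT dims)
    std::size_t nBatch;  // number of transforms
    std::size_t stride;
    std::size_t dist;

    std::size_t index(std::size_t k, std::size_t b) const {
        assert(k < nElems && b < nBatch);
        return k * stride + b * dist;
    }
};

// Batch is the fastest-varying dimension (the layout described in the question).
inline BatchLayout batchInnermost(std::size_t nElems, std::size_t nBatch) {
    return {nElems, nBatch, /*stride=*/nBatch, /*dist=*/1};
}

// Batch is the slowest-varying dimension.
inline BatchLayout batchOutermost(std::size_t nElems, std::size_t nBatch) {
    return {nElems, nBatch, /*stride=*/1, /*dist=*/nElems};
}
```

With 8 elements per transform and 4 batches, `batchInnermost(8, 4).index(1, 0)` is 4 while `batchOutermost(8, 4).index(1, 0)` is 1; that difference in stride is what the question is about.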

Does VkFFT support the same data ordering? I have looked at the docs but confess I am confused as to how batch mode works - the way I read them, it seems the batches have to be the outermost dimension. I am hoping I misunderstood.

Thanks in advance.

DTolm commented 2 years ago

Hello,

It is correct that, as of now, batching is done over the outermost dimension (the one with all the FFT elements in between consecutive batches). There is a workaround for batching over the innermost dimension, though: initialize the FFT as multidimensional and use omitDimension on the first axis to disable it (then you can use the batch number as the first dimension). This limits the max FFT dimension to 2 and doesn't work for R2C/C2R, so I will handle these cases later.

Best regards, Dmitrii

spinicist commented 2 years ago

Thanks for getting back so quickly - it is much appreciated. Sadly I need a batched 3D transform. Is the batch-innermost ordering something you plan to support in future, and if so any idea when?

Alternative workaround ideas:

Combine your suggested workaround with a "fake" batch on the 3rd FFT dimension, and then add a second 2D FFT where the first dimension is omitted and the size of the product of the real batch size and first two FFT dimensions. Do you have any idea if this will perform well?
Transpose my data before upload to the GPU. Do you know of any trick that would allow me to transpose the data during upload? (GPUs are not my speciality).

DTolm commented 2 years ago

Is the batch-innermost ordering something you plan to support in the future, and if so any idea when?

More and more people ask for it, so I will add it. It doesn't require any new algorithms (except for strided R2C/C2R), so I think early June is an ok time to aim.

Combine your suggested workaround with a "fake" batch on the 3rd FFT dimension, and then add a second 2D FFT where the first dimension is omitted and the size of the product of the real batch size and first two FFT dimensions. Do you have any idea if this will perform well?

Creation of two plans: 1d (fake 2d) and 2d (fake 3d) will work. As this is how all multidimensional FFTs are done (as a batch of 1D FFTs), this should work well.
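The claim that a multidimensional FFT is a batch of 1D FFTs along each axis can be checked with a naive reference on batch-innermost data. This is only a sketch: it uses an O(n^2) DFT instead of an FFT, runs on the CPU, and says nothing about how VkFFT is implemented internally. The names `dftStrided`, `dftAlongAxis` and `batchedDft` are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cd = std::complex<double>;

// Naive forward DFT of n elements starting at data[offset], spaced `stride` apart.
inline void dftStrided(std::vector<cd>& data, std::size_t offset, std::size_t n, std::size_t stride) {
    const double pi = std::acos(-1.0);
    std::vector<cd> out(n);
    for (std::size_t k = 0; k < n; ++k) {
        cd sum(0.0, 0.0);
        for (std::size_t j = 0; j < n; ++j) {
            const double angle = -2.0 * pi * double((j * k) % n) / double(n);
            sum += data[offset + j * stride] * std::polar(1.0, angle);
        }
        out[k] = sum;
    }
    for (std::size_t k = 0; k < n; ++k) data[offset + k * stride] = out[k];
}

// dims are listed fastest-varying first. Transforms every 1D line along `axis`.
inline void dftAlongAxis(std::vector<cd>& data, const std::vector<std::size_t>& dims, std::size_t axis) {
    std::size_t stride = 1;
    for (std::size_t d = 0; d < axis; ++d) stride *= dims[d];
    const std::size_t n = dims[axis];
    // A line starts wherever the coordinate along `axis` is zero.
    for (std::size_t start = 0; start < data.size(); ++start) {
        if ((start / stride) % n == 0) dftStrided(data, start, n, stride);
    }
}

// Batched N-D DFT on batch-innermost data: dims[0] is the batch and is never transformed.
inline void batchedDft(std::vector<cd>& data, const std::vector<std::size_t>& dims) {
    std::size_t total = 1;
    for (std::size_t n : dims) total *= n;
    assert(data.size() == total);
    for (std::size_t axis = 1; axis < dims.size(); ++axis) dftAlongAxis(data, dims, axis);
}
```

Each call to `dftAlongAxis` plays the role of one batched 1D plan. Splitting the axes between two plans, as proposed above, only changes how the passes are grouped, not the result.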

Transpose my data before upload to the GPU. Do you know of any trick that would allow me to transpose the data during upload? (GPUs are not my speciality).

You can efficiently transpose on the GPU (https://developer.nvidia.com/blog/efficient-matrix-transpose-cuda-cc), so doing it before upload will most likely be slower (because of the lower bandwidth of system RAM). VkFFT does not perform any transpositions itself, so these additional steps will make this approach slower because of the extra memory transfers.
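For reference, the transposition in question is just the following index mapping. This plain CPU loop is a sketch to show what moves where; as noted above, a GPU kernel is likely the faster place to do it. `toBatchOutermost` is a hypothetical helper name.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reorders batch-innermost data (element k of batch b at k * nBatch + b)
// into batch-outermost data (element k of batch b at b * nElems + k).
template <typename T>
std::vector<T> toBatchOutermost(const std::vector<T>& in, std::size_t nElems, std::size_t nBatch) {
    assert(in.size() == nElems * nBatch);
    std::vector<T> out(in.size());
    for (std::size_t k = 0; k < nElems; ++k) {
        for (std::size_t b = 0; b < nBatch; ++b) {
            out[b * nElems + k] = in[k * nBatch + b];
        }
    }
    return out;
}
```

The extra read and write of every element is the additional memory traffic the reply refers to.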

spinicist commented 2 years ago

I consider early June a fast turn around, so that sounds absolutely great to me. Thanks!

spinicist commented 2 years ago

Hello again - any updates on this?

DTolm commented 2 years ago

Hi,

Sorry, I did not have much time in May, and now I am in the process of reorganizing and cleaning up the VkFFT code base (to make such modular changes easier in the future), so I decided to do this afterwards. Sorry for the inconvenience!

Best regards, Dmitrii Tolmachev

spinicist commented 2 years ago

No need to apologise - refactoring is always important. I look forward to when you have finished that!

spinicist commented 1 year ago

Hello, did you have time to implement this yet? After much distraction, I have some time to look at this in my project again, and now with the Metal backend it is even more appealing.

DTolm commented 1 year ago

Hello,

I am sorry that it is taking so long. The restructuring of the code took way more time than originally expected, but it is almost finalized. I also had other priority projects in my PhD, so I didn't get to do this yet. I really want to implement proper support for non-unit strides; there are just some algorithms that need to be adapted to it, which is why it wasn't as fast as I thought.

Best regards, Dmitrii

spinicist commented 1 year ago

No need to apologise, I totally understand. When you get to this I am sitting ready to try it and cite your paper.

DTolm commented 1 year ago

Hello,

I have implemented support for an arbitrary number of dimensions in VkFFT (so far on the develop branch). By defining VKFFT_MAX_FFT_DIMENSIONS, it is now possible to mimic the FFTW guru interface. The innermost stride is always fixed to be 1, but there can be an arbitrary number of outer strides. To achieve the innermost batching you want, initialize an (N+1)-dimensional FFT and omit the innermost dimension using omitDimension[0] = 1. You can see how the VkFFT configuration is analogous to FFTW in the sample 14 code.
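As a rough sketch of that mapping, the helper below builds the dimension list for a batch-innermost N-dimensional transform. `SketchConfig` is a hypothetical stand-in, not the real VkFFTConfiguration. Its field names mirror `omitDimension` from the description above, plus `FFTdim` and `size`, which are assumed here and should be checked against vkFFT.h and sample 14. Presumably VKFFT_MAX_FFT_DIMENSIONS must also be at least N+1.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for illustration only; NOT the real VkFFTConfiguration.
struct SketchConfig {
    std::size_t FFTdim = 0;          // total dimensions, including the omitted one
    std::vector<std::size_t> size;   // size[0] is the unit-stride (innermost) dimension
    std::vector<int> omitDimension;  // 1 = do not transform along this dimension
};

// fftDims: sizes of the N transformed dimensions, fastest-varying first.
inline SketchConfig innermostBatchConfig(std::size_t nBatch, const std::vector<std::size_t>& fftDims) {
    assert(nBatch > 0 && !fftDims.empty());
    SketchConfig cfg;
    cfg.FFTdim = fftDims.size() + 1;  // N transformed dims + 1 batch dim
    cfg.size.push_back(nBatch);       // the batch takes the unit-stride slot
    cfg.omitDimension.push_back(1);   // and is skipped by the transform
    for (std::size_t n : fftDims) {
        cfg.size.push_back(n);
        cfg.omitDimension.push_back(0);
    }
    return cfg;
}
```

For a batched 3D transform of 8 volumes whose FFT dimensions are 64, 64 and 32 (fastest first), `innermostBatchConfig(8, {64, 64, 32})` gives four dimensions with sizes 8, 64, 64, 32 and only the first one omitted.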

I have run some comparison tests and for basic functionality (without zero-padding/convolution which are less tested) it should work. If something is found to be wrong, I will gladly fix it in the future.

Best regards, Dmitrii

DTolm / VkFFT

Batch Mode Data Ordering? #71