DTolm / VkFFT

Vulkan/CUDA/HIP/OpenCL/Level Zero/Metal Fast Fourier Transform library
MIT License

device function performing FFT/DCT of data in shared memory (feature request / feasibility) #173

Open nschaeff opened 2 months ago

nschaeff commented 2 months ago

Hello,

For small (1D) transforms that are part of a more complex workflow, it would be useful to be able to call VkFFT from within a kernel on data already residing in shared memory. Would that be possible to implement? If so, it is a feature I would be interested in using in the future.

For instance, within my CUDA kernel that works on some data, I would call:

__shared__ double my_data[N];
double* aux_data_needed_by_vkfft = pointer_returned_by_vkfft_when_planning;

vkfft_perform_dft_inplace(my_data, aux_data_needed_by_vkfft);

I assume the block size and data size must both be known at the planning stage for VkFFT to generate the function. Alternatively, VkFFT could return a string containing the code of the device function, which could then be included in another CUDA program compiled at runtime with NVRTC.
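A minimal host-side sketch of that second option, assuming VkFFT could hand back the device function as a string. Everything here is hypothetical: `hypothetical_fft_device_source`, `fft_inplace_shared`, and `build_kernel_source` are made-up names, and VkFFT does not currently expose such an API. The sketch only shows how the generated source might be spliced in front of a user kernel before runtime compilation.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for a string VkFFT might return at planning time.
// VkFFT has no such API today; the name and the body are placeholders.
std::string hypothetical_fft_device_source() {
    return
        "__device__ void fft_inplace_shared(double2* data) {\n"
        "    /* generated butterflies would go here */\n"
        "}\n";
}

// Build one translation unit: the generated device function first,
// then a user kernel that calls it on data held in shared memory.
// n is the transform length, fixed when the source is assembled.
std::string build_kernel_source(const std::string& fft_source, int n) {
    assert(n > 0);
    std::string src = fft_source;
    src += "extern \"C\" __global__ void my_kernel(double2* out) {\n";
    src += "    __shared__ double2 my_data[" + std::to_string(n) + "];\n";
    src += "    /* ... fill my_data ... */\n";
    src += "    __syncthreads();\n";
    src += "    fft_inplace_shared(my_data);\n";
    src += "    /* ... use my_data, write results to out ... */\n";
    src += "}\n";
    return src;
}
```

The resulting string would then be passed to NVRTC (`nvrtcCreateProgram` followed by `nvrtcCompileProgram`) like any other runtime-compiled CUDA source.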

[For instance, this would allow filtering within a single kernel: generate some data -> FFT -> set some frequencies to zero -> IFFT -> further process the data -> write to memory.]
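To make the filtering pipeline concrete, here is a plain C++ reference of the arithmetic on the host. It is not a GPU kernel and does not use VkFFT: it swaps in a textbook iterative radix-2 FFT (power-of-two lengths only), and the names `fft_inplace` and `lowpass` are made up for illustration. In the requested feature, each of these steps would instead run on shared memory inside one kernel.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <utility>
#include <vector>

using cd = std::complex<double>;

// In-place iterative radix-2 FFT; the length must be a power of two.
// inverse=true computes the unnormalized inverse transform.
void fft_inplace(std::vector<cd>& a, bool inverse) {
    const std::size_t n = a.size();
    assert(n != 0 && (n & (n - 1)) == 0);
    // Bit-reversal permutation.
    for (std::size_t i = 1, j = 0; i < n; ++i) {
        std::size_t bit = n >> 1;
        for (; j & bit; bit >>= 1) j ^= bit;
        j ^= bit;
        if (i < j) std::swap(a[i], a[j]);
    }
    const double pi = std::acos(-1.0);
    // Butterfly stages of doubling length.
    for (std::size_t len = 2; len <= n; len <<= 1) {
        const double ang = (inverse ? 2.0 : -2.0) * pi / static_cast<double>(len);
        const cd wlen(std::cos(ang), std::sin(ang));
        for (std::size_t i = 0; i < n; i += len) {
            cd w(1.0, 0.0);
            for (std::size_t k = 0; k < len / 2; ++k) {
                const cd u = a[i + k];
                const cd v = a[i + k + len / 2] * w;
                a[i + k] = u + v;
                a[i + k + len / 2] = u - v;
                w *= wlen;
            }
        }
    }
}

// FFT -> zero every bin whose frequency index exceeds cutoff -> IFFT.
// Bin k and bin n-k hold the same frequency, so both are tested together.
std::vector<double> lowpass(const std::vector<double>& x, std::size_t cutoff) {
    const std::size_t n = x.size();
    std::vector<cd> a(x.begin(), x.end());
    fft_inplace(a, false);
    for (std::size_t k = 0; k < n; ++k) {
        if (std::min(k, n - k) > cutoff) a[k] = cd(0.0, 0.0);
    }
    fft_inplace(a, true);
    std::vector<double> y(n);
    for (std::size_t k = 0; k < n; ++k) {
        y[k] = a[k].real() / static_cast<double>(n);  // normalize the inverse
    }
    return y;
}
```

On the GPU, the point of the request is that the intermediate spectrum never leaves shared memory, so the forward transform, the masking step, and the inverse transform avoid round trips through global memory.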

DTolm commented 2 months ago

Hello,

I have been thinking for some time about how kernel inlining (and callbacks) can be handled in VkFFT, as its runtime code generation should allow this to be implemented smoothly. Mostly, what I need is to learn how separate linking is done in the different APIs and to decide what the interface will look like. On the latter, I think VkFFT can be modified to generate codelets for any sequence length with an arbitrary fixed number of threads, so this won't be a constraint, but handling the input/output format (from/to registers, from/to shared memory) can be trickier. I will try to come back to this soon, after finishing the separate complex data layout for the next release.

Best regards, Dmitrii