DTolm / VkFFT

Vulkan/CUDA/HIP/OpenCL/Level Zero/Metal Fast Fourier Transform library
MIT License

Multiple systems - one kernel - multiple outputs convolution #159

Open · CoffeeExterminator opened 7 months ago

CoffeeExterminator commented 7 months ago

Hi, I'm trying to perform a line-by-line image convolution using VkFFT: I have an image of size N × M and a kernel of size N × 1, and I'm trying to convolve each line with this same kernel. So, I'm performing an FFT of the kernel, and then trying to perform another FFT with 1D convolution for the image, setting numberBatches to M. But the convolution is only performed for the first line. I've tried playing with other parameters like coordinateFeatures and numberKernels, but haven't found a way to get the result I want. Am I missing something, or is such a convolution not currently possible?
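Roughly, the setup in question looks like the following (a minimal sketch in the style of the existing VkFFT convolution samples; backend setup, buffer allocation, launch parameters and error handling are elided with ..., and the buffer and size variables are placeholders):

    // FFT of the N x 1 kernel, to be reused by the convolution application
    VkFFTConfiguration kernel_configuration = {};
    kernel_configuration.FFTdim = 1;
    kernel_configuration.size[0] = N;               // line length
    kernel_configuration.kernelConvolution = true;  // this application prepares the convolution kernel
    kernel_configuration.coordinateFeatures = 1;
    kernel_configuration.normalize = 1;
    kernel_configuration.buffer = &kernelBuffer;
    kernel_configuration.bufferSize = &kernelBufferSize;
    ...
    VkFFTApplication app_kernel = {};
    initializeVkFFT(&app_kernel, kernel_configuration);
    VkFFTAppend(&app_kernel, -1, &launchParams);    // forward FFT of the kernel

    // FFT + convolution of the N x M image, one batch per line
    VkFFTConfiguration convolution_configuration = kernel_configuration;
    convolution_configuration.kernelConvolution = false;
    convolution_configuration.performConvolution = true;
    convolution_configuration.numberBatches = M;    // M lines, all meant to use the same kernel
    convolution_configuration.kernel = &kernelBuffer;
    convolution_configuration.kernelSize = &kernelBufferSize;
    convolution_configuration.buffer = &imageBuffer;
    convolution_configuration.bufferSize = &imageBufferSize;
    ...
    VkFFTApplication app_convolution = {};
    initializeVkFFT(&app_convolution, convolution_configuration);
    VkFFTAppend(&app_convolution, -1, &launchParams); // forward FFT, multiplied by the kernel spectrum
    VkFFTAppend(&app_convolution, 1, &launchParams);  // inverse FFT of the convolved lines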

DTolm commented 7 months ago

Hello,

I believe this is the same request as https://github.com/vincefn/pyvkfft/issues/33#issuecomment-1950252081. It shouldn't be hard to add and seems to be really useful. I will try to add this functionality in the next couple of days.

Best regards, Dmitrii

CoffeeExterminator commented 7 months ago

I believe it is indeed the same request; I hadn't seen that it was already mentioned, since that was for pyvkfft. It would be a very useful addition, looking forward to it, thanks.

DTolm commented 7 months ago

I have added a sample (53) that shows how the requested functionality can be implemented in the new develop version of VkFFT. I haven't fully tested it beyond the example, but the implementation didn't require any big modifications, so it should work. If more help is needed, I will be happy to answer any related questions.
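Judging from the configuration posted later in this thread, the key addition on the convolution side appears to be a single new option (a hedged sketch; sample 53 in the develop branch is the authoritative example):

    // On the convolution application: keep one kernel, but convolve every batch with it.
    convolution_configuration.numberBatches = M;                    // number of lines / systems
    convolution_configuration.singleKernelMultipleBatches = true;   // reuse the single kernel FFT for all batches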

CoffeeExterminator commented 7 months ago

Hi, I have tested the new functionality on the example I mentioned earlier, and the C2C multiple-batch convolution worked fine. However, I have encountered some unexpected behavior with out-of-place R2C convolution (not sure if it's related to batches), which I hope you could clarify.

  1. Not exactly related, but are there any requirements for dimension sizes when it comes to convolution? I haven't seen any mentioned, but I have encountered initialization of app_convolution failing for specific sizes, one example being
    configuration.FFTdim = 1; 
    configuration.size[0] = 449;
    configuration.size[1] = 1;
    configuration.size[2] = 1;

    This can be reproduced by modifying any of the convolution samples accordingly (I have personally tried 50, 52 and 53) and doesn't seem to depend on the number of batches or on coordinateFeatures.

  2. As said, I was trying to perform a line-by-line image convolution. So, I have a 1D kernel and a single-channel image, and the idea was to first perform a 1D R2C FFT of the kernel, then perform a 1D R2C FFT of each line of the image (setting the number of batches to the image height) with convolution (using the kernel FFT from the previous step), and finally perform an IFFT on the result.

    Now, about the complications I've faced. Here are the stage-by-stage results of an FFT-IFFT of the image without any convolution (the original image in the top left corner, its FFT below and the inverse FFT to the right). Everything is as expected.

    [Image: fft_no_convolution]

    After that I enable convolution with a kernel (it's the same as the identity kernel, except in place of 1.0 there's 0.5, so I expect to get a darkened image in the end). I expected to see the original image, a broadly similar picture for the FFT, and a darker version of the original image as the result of the IFFT. However, what I got was this:

    [Image: fft+convolution]

    Notice how the result of the forward FFT with convolution doesn't look like the result of an FFT, but instead looks like something applied to the original image. The final result is nothing like I expected either. I then tried to split the process by first performing the FFT of the image with no convolution, then performing a C2C convolution on the result (by initializing and using a different application), and then performing the IFFT. This got me the expected result at every stage.

    [Image: convolution_separated]

    So, to my understanding, either the convolution happened in the wrong place, or my configuration parameters were very, very wrong. I'm providing the simplified code with all the main steps and configurations; a CPU reference for the result I expect is sketched after the code below.
    
    uint64_t srcSize = sizeof(float) * width * height;
    uint64_t dstSize = sizeof(float) * 2 * (width / 2 + 1) * height;

    uint64_t kernelSrcSize = width * sizeof(float);
    uint64_t kernelDstSize = sizeof(float) * 2 * (width / 2 + 1);

    // kernel fft
    kernel_configuration.FFTdim = 1;
    kernel_configuration.size[0] = width;
    kernel_configuration.size[1] = 1;
    kernel_configuration.size[2] = 1;

    kernel_configuration.kernelConvolution = true;
    kernel_configuration.coordinateFeatures = 1;
    kernel_configuration.normalize = 1;

    kernel_configuration.performR2C = 1;
    kernel_configuration.isInputFormatted = 1;
    kernel_configuration.inputBuffer = &kernelSrc;
    kernel_configuration.inputBufferSize = &kernelSrcSize;
    kernel_configuration.buffer = &kernel;
    kernel_configuration.bufferSize = &kernelDstSize;
    ...
    auto resFFT = initializeVkFFT(&app_kernel, kernel_configuration);
    ...
    resFFT = VkFFTAppend(&app_kernel, -1, &launchParams);
    ...
    // image fft and convolution
    convolution_configuration = kernel_configuration;
    convolution_configuration.kernelConvolution = false;
    convolution_configuration.performConvolution = true;
    convolution_configuration.numberBatches = height;

    convolution_configuration.isInputFormatted = 1;
    convolution_configuration.inputBuffer = &src;
    convolution_configuration.buffer = &dst;
    convolution_configuration.inputBufferSize = &srcSize;
    convolution_configuration.bufferSize = &dstSize;
    convolution_configuration.performR2C = 1;
    convolution_configuration.inverseReturnToInputBuffer = 1;
    convolution_configuration.kernel = &kernel;
    convolution_configuration.singleKernelMultipleBatches = true;
    convolution_configuration.kernelSize = &kernelDstSize;
    ...
    resFFT = initializeVkFFT(&app_convolution, convolution_configuration);
    ...
    resFFT = VkFFTAppend(&app_convolution, -1, &launchParams);
    ...
    // result ifft
    resFFT = VkFFTAppend(&app_convolution, 1, &launchParams);
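    For reference, what the pipeline above should produce can also be checked on the CPU with a plain circular convolution of each row with the kernel (a minimal sketch, independent of VkFFT; width, height, image and kernel refer to the same quantities as in the code above):

    #include <cstddef>
    #include <vector>

    // Circular convolution of every image row with a kernel of length width.
    // FFT-based convolution (forward FFT, pointwise multiplication by the kernel
    // spectrum, normalized inverse FFT) should reproduce exactly this result.
    std::vector<float> referenceRowConvolution(const std::vector<float>& image,
                                               const std::vector<float>& kernel,
                                               size_t width, size_t height) {
        std::vector<float> out(width * height, 0.0f);
        for (size_t y = 0; y < height; y++) {
            for (size_t x = 0; x < width; x++) {
                float sum = 0.0f;
                for (size_t k = 0; k < width; k++) {
                    // wrap-around indexing: FFT-based convolution is circular
                    sum += image[y * width + ((x + width - k) % width)] * kernel[k];
                }
                out[y * width + x] = sum;
            }
        }
        return out;
    }

    // For the kernel described above (0.5 at index 0, zeros elsewhere) this reduces
    // to out = 0.5 * image, i.e. the expected darkened copy of the original.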

DTolm commented 7 months ago

> Not exactly related, but are there any requirements for dimension sizes when it comes to convolution?

Yes, there is an issue: the convolution optimization has not been implemented when the last axis it is merged into has to do anything other than a plain radix 2-13 C2C transform. This includes 1D R2C, R2R, Rader and Bluestein sequences. I need to fix this so it works for all cases, as not many people have used this functionality before (which is also why the documentation on convolutions is lacking). Sorry for the confusion.

CoffeeExterminator commented 7 months ago

> Yes, there is an issue: the convolution optimization has not been implemented when the last axis it is merged into has to do anything other than a plain radix 2-13 C2C transform.

Thanks for clarifying. Any hint on what the sizes should be aligned to for now, to avoid such problems?

As for the second issue, I'm starting to suspect it might have something to do with the buffer layout in my configuration, but I'm still looking into it.

DTolm commented 7 months ago

> Any hint on what the sizes should be aligned to for now, to avoid such problems?

Convolutions should be safe now for C2C sequences decomposable as a product of an arbitrary number of primes from 2 to 13.
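In practice, a size can be checked against that rule on the CPU with a few lines (a small helper sketch, not part of the VkFFT API):

    #include <cstdint>

    // Returns true if n is a product of primes between 2 and 13, i.e. a sequence
    // length for which the merged convolution path is currently expected to work
    // (plain C2C, no Rader/Bluestein decomposition needed).
    bool isRadix2to13(uint64_t n) {
        if (n == 0) return false;
        const uint64_t primes[] = {2, 3, 5, 7, 11, 13};
        for (uint64_t p : primes) {
            while (n % p == 0) n /= p;
        }
        return n == 1;
    }

    // Example: isRadix2to13(448) == true  (448 = 2^6 * 7),
    //          isRadix2to13(449) == false (449 is prime), matching the failing size above.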

But isn't the second issue related to 1D R2C convolutions that are not supported currently?

CoffeeExterminator commented 7 months ago

> Convolutions should be safe now for C2C sequences decomposable as a product of an arbitrary number of primes from 2 to 13.

Thanks again.

> But isn't the second issue related to 1D R2C convolutions that are not supported currently?

I must have missed it. So R2C convolutions aren't supported specifically for the 1D case? Because I think there's a 2D R2C convolution in sample 52.

DTolm commented 7 months ago

> I must have missed it. So R2C convolutions aren't supported specifically for the 1D case? Because I think there's a 2D R2C convolution in sample 52.

Exactly. The last axis of the transform needs to be a simple C2C transform for the convolution to be merged into it. I will try to make this more general in the future.
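Until that is generalized, one way to stay on the supported path for real data is to run the whole 1D pipeline as C2C, storing each real line in an interleaved complex buffer with zero imaginary parts (a workaround sketch, not taken from the samples; the imaginary part of the final result should come out as numerical noise):

    #include <cstddef>
    #include <vector>

    // Pack a real-valued array into interleaved complex format (re, im) with im = 0,
    // so that a plain C2C convolution application can be used on each line.
    std::vector<float> packRealAsComplex(const std::vector<float>& realData) {
        std::vector<float> complexData(2 * realData.size(), 0.0f);
        for (size_t i = 0; i < realData.size(); i++) {
            complexData[2 * i] = realData[i];  // real part; imaginary part stays 0
        }
        return complexData;
    }

    // The kernel is packed the same way. The VkFFT configuration then omits performR2C
    // and keeps size[0] = width, numberBatches = height, performConvolution = true and
    // singleKernelMultipleBatches = true as in the code above; after the inverse
    // transform, the real parts (even indices) hold the convolved lines.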

CoffeeExterminator commented 7 months ago

That explains it then, thank you.