DTolm / VkFFT

Vulkan/CUDA/HIP/OpenCL/Level Zero/Metal Fast Fourier Transform library
MIT License
1.48k stars 88 forks source link

possible bug with zero padding feature and C2R transforms of large size #180

Open nschaeff opened 3 days ago

nschaeff commented 3 days ago

In some cases, for large sizes (more below), using a R2C and C2R transform, I obtain wrong FFT results with the zeroPadding feature in frequency domain. If I remove the zeroPadding feature, everything works well. Maybe I'm not doing things correctly, the padding for R2C format is not clearly described.

On an AMD MI210, it fails for nfft > 4096 (not all values, though). nfft=6144 fails, for instance. On an nvidia A30, it fails for larger values, nfft >= 12288 (some smaller values fail too, like 11264).

Steps to reproduce:

nfft = 12288;  // other values fail too, and some smaller values also, like 4094 on MI210, but 4095 works on MI210.
kmax = 1;  // maximum frequency represented. can be almost anything, as long as zero padding is performed.
config.FFTdim = 1; //FFT dimension: 1D, but we use a second dimension to get non-unit strides.
config.size[0] = nfft;
config.isInputFormatted = 1;        // out-of-place: separate buffer for input and output
config.inverseReturnToInputBuffer = 1;
config.inputBufferStride[0] = nfft;     // spatial data layout
config.bufferStride[0] = nfft/2 + 1;    // spectral data layout
config.doublePrecision = 1;    // fails also in single precision
if (1) {   // disable this block for the FFT to work correctly again
  config.performZeropadding[0] = 1;
  config.frequencyZeroPadding = 1;
  config.fft_zeropad_left[0] = kmax + 1;            // first zero element
  config.fft_zeropad_right[0] = nfft - kmax;        // first non-zero element (I tried also with nfft/2+1, does not seem to matter)
}
config.numberBatches = 16;   // this value does not seem to matter
config.performR2C = 1;

Performing FFT in both direction gives wrong results.

nschaeff commented 2 days ago

I suspect it happens in the situation where all the following three conditions are realized together: padding + multiple uploads + R2C

DTolm commented 2 days ago

Hello,

Yes, there is an issue with multiple uploads even R2C behaving incorrectly with zeropadding when using an even decomposition algorithm that represents R2C/C2R as a C2C of half-length with some post-processing. It has issues with odd zeropadding sizes as it reads data as complex values. There is another algorithm implemented that does real transforms as internal callbacks that I enabled to substitute the aforementioned even R2C/C2R algorithm if zeropadding is requested. It uses 2x more temporary memory in multiple uploads, but the size of memory transferred and performance should be similar to the even R2C algorithm.

As for the fft_zeropad_right[0] - the complex output of R2C or input of C2R are assumed to be conjugate, so it should be <= nfft/2+1, all values above will have the same result as nfft/2+1. If chosen < nfft/2+1, the system will have this format: [1,1, 0 ... 0, 1,1] for [0, nfft/2+1) elements, and then the complex conjugate will be read in reverse as: [1, 0 ... 0, 1,1].

Best regards, Dmitrii