DTolm / VkFFT

Vulkan/CUDA/HIP/OpenCL/Level Zero/Metal Fast Fourier Transform library
MIT License

vkFFT on Android #41

Open johuether opened 2 years ago

johuether commented 2 years ago

Hi, I'm currently developing an Android App which needs to calculate FFTs.

Until now I have used OpenCV for this. In an attempt to speed up the FFT on Android devices, I was able to integrate your library so that it works on most devices, and the results are quite close to those of the OpenCV implementation.

But I noticed differences that were a little too big in my opinion. I got an avg. difference of 0.06 on a 128x128 system, which is quite a bit bigger than the results in your precision tests. I also noticed differences between mobile GPUs: a Snapdragon 855+ (Adreno 640 GPU) gave an avg. difference of 0.01, while an Exynos 990 (Mali G77) gave the 0.06 avg. difference.

Is this difference because of the devices or do you think I misconfigured something?

Setup notes: The values are uniform random numbers generated by OpenCV and then passed to a modified version of sample 0. The calculation is C2C with the imaginary part set to 0. I haven't used R2C for now because I didn't really understand how to align the data properly.
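
For reference, the comparison described above could be computed along these lines. This is a minimal sketch, not the poster's actual code: the helper names `pack_real_as_c2c` and `mean_abs_diff` are hypothetical, and the thread does not say exactly how the "avg. difference" was measured, so the metric used here (mean absolute difference over all real and imaginary components) is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Copy n real samples into an interleaved complex buffer (re, im, re, im, ...)
   with every imaginary part set to 0, as needed to feed real data to a C2C
   transform. complex_out must hold 2*n floats. */
void pack_real_as_c2c(const float *real_in, float *complex_out, size_t n)
{
    assert(real_in != NULL && complex_out != NULL);
    for (size_t i = 0; i < n; ++i) {
        complex_out[2 * i]     = real_in[i];
        complex_out[2 * i + 1] = 0.0f;
    }
}

/* Mean absolute difference over all 2*n float components of two interleaved
   complex buffers. The sum is accumulated in double so that the metric itself
   adds as little rounding error as possible. */
double mean_abs_diff(const float *a, const float *b, size_t n)
{
    assert(a != NULL && b != NULL && n > 0);
    double sum = 0.0;
    for (size_t i = 0; i < 2 * n; ++i) {
        double d = (double)a[i] - (double)b[i];
        sum += d < 0.0 ? -d : d;
    }
    return sum / (double)(2 * n);
}
```

Comparing both GPU outputs against a double-precision CPU result, rather than only against each other, would help show which device the error comes from.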

DTolm commented 2 years ago

Hello,

It's great that the code works on Android. As for the precision, some devices (namely Intel iGPUs) have really bad precision in their hardware special function units, which are used to calculate sines and cosines. I haven't personally tested mobile GPUs, but the workaround is to enable LUT usage in single precision. To do so, set useLUT=1 in the configuration and let's see if that helps.

As for the R2C layout, I am currently writing documentation covering all of this; it will be out soon.

Best regards, Dmitrii

johuether commented 2 years ago

Thanks,

Yes, that helps. Now the difference is of the same order of magnitude as your results.

But I have a second question:

I also need to compute the cross correlation of two images. So I need to calculate the cross power spectrum after I calculated the FFTs.

Is there an easy way to integrate this into one pipeline, so I don't have to copy buffers between my calculations?

DTolm commented 2 years ago

There is no callback functionality in VkFFT yet (I first need to think about how it can be done for all backends).

As for the cross-correlation, I can add it as an alternative to the convolution calculation this evening. It will work like this: you compute the FFT of a base image and use it as a kernel; then, when you compute the FFT of the second image, the last FFT step, the cross-correlation and the first iFFT step will be merged.
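
As a way to sanity-check a merged FFT pipeline like the one described above, a direct circular cross-correlation can serve as a slow CPU reference for small inputs. This is only a sketch for real-valued images, with a hypothetical helper name (`circular_xcorr_2d`). By the correlation theorem its output should agree, up to rounding and the chosen normalization, with the inverse FFT of conj(FFT(a)) * FFT(b).

```c
#include <assert.h>
#include <stddef.h>

/* out[dy*w + dx] = sum over (y, x) of a[y][x] * b[(y+dy) mod h][(x+dx) mod w].
   All buffers are row-major with h rows and w columns.
   Cost is O((w*h)^2), so this is meant for reference checks only. */
void circular_xcorr_2d(const float *a, const float *b, double *out,
                       size_t w, size_t h)
{
    assert(a != NULL && b != NULL && out != NULL && w > 0 && h > 0);
    for (size_t dy = 0; dy < h; ++dy) {
        for (size_t dx = 0; dx < w; ++dx) {
            double acc = 0.0;
            for (size_t y = 0; y < h; ++y) {
                for (size_t x = 0; x < w; ++x) {
                    size_t yy = (y + dy) % h;  /* wrap around vertically */
                    size_t xx = (x + dx) % w;  /* wrap around horizontally */
                    acc += (double)a[y * w + x] * (double)b[yy * w + xx];
                }
            }
            out[dy * w + dx] = acc;
        }
    }
}
```

The position of the largest value in `out` gives the circular shift of `b` relative to `a`.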

johuether commented 2 years ago

Yeah, that would be great.

At the moment I calculate the cross-correlation the following way:

Are the results the same?

DTolm commented 2 years ago

I have added support for cross-correlation and normalized cross-power spectrum calculation (it normalizes the kernel multiplication in the frequency domain). It merges the steps from calculating the FFT of image2 to calculating the inverse of the last step into one kernel (plus the X-axis FFTs if you do 2D FFTs). You need to configure the convolution calculation with cross-correlation enabled:

    configuration.performConvolution = 1;
    configuration.conjugateConvolution = 1; // 1 - the current FFT will be conjugated, 2 - the kernel will be conjugated
    configuration.kernel = &kernelBuffer; // specify the other image's FFT here; use kernelConvolution = 1 when making the plan for that FFT
    configuration.kernelSize = &kernelSize;

You will need two separate applications, one for simple FFT, the other for cross-correlation. Hope this helps and feel free to ask any questions.

Nullkooland commented 2 years ago

Hi, could you upload an example of how to do cross-correlation between two images? @DTolm

johuether commented 2 years ago

Thanks a lot for your effort.

I have tested the correlation now. It works like a charm if I use numberBatches = 1. When I set it to something greater than 1, I get a shader parse error:

"ERROR: 0:53: '' : syntax error, unexpected STAR\nERROR: 1 compilation errors. No code generated.\n\n"

It's thrown in VkFFTPlanAxis for the second axis. Do you have any clue what's causing this?

DTolm commented 2 years ago

@Talsoake Can you please attach the configuration you used and the generated kernels? (If you use the keepShaderCode parameter, VkFFT will print them.)

@Goose-Bomb Sure, I will add this to the documentation, which is going to be released soon.

johuether commented 2 years ago

This is the generated shader code:

 #version 450

    layout (local_size_x = 32, local_size_y = 16, local_size_z = 1) in;
 const float loc_PI = 3.1415926535897932384626433832795f;
 const float loc_SQRT1_2 = 0.70710678118654752440084436210485f;
 layout(push_constant) uniform PushConsts
    {
        uint coordinate;
    uint batchID;
    uint workGroupShiftX;
    uint workGroupShiftY;
    uint workGroupShiftZ;
 } consts;
 layout(std430, binding = 0) buffer DataIn{
        vec2 inputs[32768];
 };
 layout(std430, binding = 1) buffer DataOut{
        vec2 outputs[32768];
 };
 layout(std430, binding = 2) buffer Kernel_FFT{
        vec2 kernel_obj[32768];
 };
 layout(std430, binding = 3) readonly buffer DataLUT {
    vec2 twiddleLUT[];
 };
 uint sharedStride = 32;
 shared vec2 sdata[4096];
 void main() {
        vec2 temp_0;
    vec2 temp_1;
    vec2 temp_2;
    vec2 temp_3;
    vec2 temp_4;
    vec2 temp_5;
    vec2 temp_6;
    vec2 temp_7;
    vec2 w;
    vec2 loc_0;
    vec2 iw;
    uint stageInvocationID;
    uint blockInvocationID;
    uint sdataID;
    uint combinedID;
    uint inoutID;
    uint LUTId=0;
        if (((gl_GlobalInvocationID.x) / 128) % (1)+((gl_GlobalInvocationID.x) / 128) * (128) < 128) {
            inoutID = (1 * (gl_LocalInvocationID.y + 0) + ((gl_GlobalInvocationID.x) / 128) % (1)+((gl_GlobalInvocationID.x) / 128) * (128));
            inoutID = ((gl_GlobalInvocationID.x) % (128)) + (inoutID) * 128 +  * 16384;
            temp_0=inputs[inoutID];
        inoutID = (1 * (gl_LocalInvocationID.y + 16) + ((gl_GlobalInvocationID.x) / 128) % (1)+((gl_GlobalInvocationID.x) / 128) * (128));
            inoutID = ((gl_GlobalInvocationID.x) % (128)) + (inoutID) * 128 +  * 16384;
            temp_1=inputs[inoutID];
        inoutID = (1 * (gl_LocalInvocationID.y + 32) + ((gl_GlobalInvocationID.x) / 128) % (1)+((gl_GlobalInvocationID.x) / 128) * (128));
            inoutID = ((gl_GlobalInvocationID.x) % (128)) + (inoutID) * 128 +  * 16384;
            temp_2=inputs[inoutID];
        inoutID = (1 * (gl_LocalInvocationID.y + 48) + ((gl_GlobalInvocationID.x) / 128) % (1)+((gl_GlobalInvocationID.x) / 128) * (128));
            inoutID = ((gl_GlobalInvocationID.x) % (128)) + (inoutID) * 128 +  * 16384;
            temp_3=inputs[inoutID];
        inoutID = (1 * (gl_LocalInvocationID.y + 64) + ((gl_GlobalInvocationID.x) / 128) % (1)+((gl_GlobalInvocationID.x) / 128) * (128));
            inoutID = ((gl_GlobalInvocationID.x) % (128)) + (inoutID) * 128 +  * 16384;
            temp_4=inputs[inoutID];
        inoutID = (1 * (gl_LocalInvocationID.y + 80) + ((gl_GlobalInvocationID.x) / 128) % (1)+((gl_GlobalInvocationID.x) / 128) * (128));
            inoutID = ((gl_GlobalInvocationID.x) % (128)) + (inoutID) * 128 +  * 16384;
            temp_5=inputs[inoutID];
        inoutID = (1 * (gl_LocalInvocationID.y + 96) + ((gl_GlobalInvocationID.x) / 128) % (1)+((gl_GlobalInvocationID.x) / 128) * (128));
            inoutID = ((gl_GlobalInvocationID.x) % (128)) + (inoutID) * 128 +  * 16384;
            temp_6=inputs[inoutID];
        inoutID = (1 * (gl_LocalInvocationID.y + 112) + ((gl_GlobalInvocationID.x) / 128) % (1)+((gl_GlobalInvocationID.x) / 128) * (128));
            inoutID = ((gl_GlobalInvocationID.x) % (128)) + (inoutID) * 128 +  * 16384;
            temp_7=inputs[inoutID];
    }
            if (((gl_GlobalInvocationID.x) / 128) % (1)+((gl_GlobalInvocationID.x) / 128) * (128) < 128) {
            stageInvocationID = (gl_LocalInvocationID.y+ 0) % (1);
        LUTId = stageInvocationID + 0;
    w = twiddleLUT[LUTId];
    loc_0.x = temp_4.x * w.x - temp_4.y * w.y;
    loc_0.y = temp_4.y * w.x + temp_4.x * w.y;
    temp_4.x = temp_0.x - loc_0.x;
    temp_4.y = temp_0.y - loc_0.y;
    temp_0.x = temp_0.x + loc_0.x;
    temp_0.y = temp_0.y + loc_0.y;
    loc_0.x = temp_5.x * w.x - temp_5.y * w.y;
    loc_0.y = temp_5.y * w.x + temp_5.x * w.y;
    temp_5.x = temp_1.x - loc_0.x;
    temp_5.y = temp_1.y - loc_0.y;
    temp_1.x = temp_1.x + loc_0.x;
    temp_1.y = temp_1.y + loc_0.y;
    loc_0.x = temp_6.x * w.x - temp_6.y * w.y;
    loc_0.y = temp_6.y * w.x + temp_6.x * w.y;
    temp_6.x = temp_2.x - loc_0.x;
    temp_6.y = temp_2.y - loc_0.y;
    temp_2.x = temp_2.x + loc_0.x;
    temp_2.y = temp_2.y + loc_0.y;
    loc_0.x = temp_7.x * w.x - temp_7.y * w.y;
    loc_0.y = temp_7.y * w.x + temp_7.x * w.y;
    temp_7.x = temp_3.x - loc_0.x;
    temp_7.y = temp_3.y - loc_0.y;
    temp_3.x = temp_3.x + loc_0.x;
    temp_3.y = temp_3.y + loc_0.y;
    w=twiddleLUT[LUTId+1];
    loc_0.x = temp_2.x * w.x - temp_2.y * w.y;
    loc_0.y = temp_2.y * w.x + temp_2.x * w.y;
    temp_2.x = temp_0.x - loc_0.x;
    temp_2.y = temp_0.y - loc_0.y;
    temp_0.x = temp_0.x + loc_0.x;
    temp_0.y = temp_0.y + loc_0.y;
    loc_0.x = temp_3.x * w.x - temp_3.y * w.y;
    loc_0.y = temp_3.y * w.x + temp_3.x * w.y;
    temp_3.x = temp_1.x - loc_0.x;
    temp_3.y = temp_1.y - loc_0.y;
    temp_1.x = temp_1.x + loc_0.x;
    temp_1.y = temp_1.y + loc_0.y;
    iw.x = -w.y;
    iw.y = w.x;
    loc_0.x = temp_6.x * iw.x - temp_6.y * iw.y;
    loc_0.y = temp_6.y * iw.x + temp_6.x * iw.y;
    temp_6.x = temp_4.x - loc_0.x;
    temp_6.y = temp_4.y - loc_0.y;
    temp_4.x = temp_4.x + loc_0.x;
    temp_4.y = temp_4.y + loc_0.y;
    loc_0.x = temp_7.x * iw.x - temp_7.y * iw.y;
    loc_0.y = temp_7.y * iw.x + temp_7.x * iw.y;
    temp_7.x = temp_5.x - loc_0.x;
    temp_7.y = temp_5.y - loc_0.y;
    temp_5.x = temp_5.x + loc_0.x;
    temp_5.y = temp_5.y + loc_0.y;
    w=twiddleLUT[LUTId+2];
    loc_0.x = temp_1.x * w.x - temp_1.y * w.y;
    loc_0.y = temp_1.y * w.x + temp_1.x * w.y;
    temp_1.x = temp_0.x - loc_0.x;
    temp_1.y = temp_0.y - loc_0.y;
    temp_0.x = temp_0.x + loc_0.x;
    temp_0.y = temp_0.y + loc_0.y;
    iw.x = -w.y;
    iw.y = w.x;
    loc_0.x = temp_3.x * iw.x - temp_3.y * iw.y;
    loc_0.y = temp_3.y * iw.x + temp_3.x * iw.y;
    temp_3.x = temp_2.x - loc_0.x;
    temp_3.y = temp_2.y - loc_0.y;
    temp_2.x = temp_2.x + loc_0.x;
    temp_2.y = temp_2.y + loc_0.y;
    iw.x = w.x * loc_SQRT1_2 - w.y * loc_SQRT1_2;
    iw.y = w.y * loc_SQRT1_2 + w.x * loc_SQRT1_2;
    loc_0.x = temp_5.x * iw.x - temp_5.y * iw.y;
    loc_0.y = temp_5.y * iw.x + temp_5.x * iw.y;
    temp_5.x = temp_4.x - loc_0.x;
    temp_5.y = temp_4.y - loc_0.y;
    temp_4.x = temp_4.x + loc_0.x;
    temp_4.y = temp_4.y + loc_0.y;
    w.x = -iw.y;
    w.y = iw.x;
    loc_0.x = temp_7.x * w.x - temp_7.y * w.y;
    loc_0.y = temp_7.y * w.x + temp_7.x * w.y;
    temp_7.x = temp_6.x - loc_0.x;
    temp_7.y = temp_6.y - loc_0.y;
    temp_6.x = temp_6.x + loc_0.x;
    temp_6.y = temp_6.y + loc_0.y;
    loc_0 = temp_1;
    temp_1 = temp_4;
    temp_4 = loc_0;
    loc_0 = temp_3;
    temp_3 = temp_6;
    temp_6 = loc_0;
 }      sharedStride = 32;
    barrier();
        if (((gl_GlobalInvocationID.x) / 128) % (1)+((gl_GlobalInvocationID.x) / 128) * (128) < 128) {
        stageInvocationID = gl_LocalInvocationID.y + 0;
    blockInvocationID = stageInvocationID;
    stageInvocationID = stageInvocationID % 1;
    blockInvocationID = blockInvocationID - stageInvocationID;
    inoutID = blockInvocationID * 8;
    inoutID = inoutID + stageInvocationID;
    sdataID = inoutID + 0;
    sdataID = sharedStride * sdataID;
    sdataID = sdataID + gl_LocalInvocationID.x;
    temp_0.x = temp_0.x * 0.00781250000000000f;
    temp_0.y = temp_0.y * 0.00781250000000000f;
    sdata[sdataID] = temp_0;
    sdataID = inoutID + 1;
    sdataID = sharedStride * sdataID;
    sdataID = sdataID + gl_LocalInvocationID.x;
    temp_1.x = temp_1.x * 0.00781250000000000f;
    temp_1.y = temp_1.y * 0.00781250000000000f;
    sdata[sdataID] = temp_1;
    sdataID = inoutID + 2;
    sdataID = sharedStride * sdataID;
    sdataID = sdataID + gl_LocalInvocationID.x;
    temp_2.x = temp_2.x * 0.00781250000000000f;
    temp_2.y = temp_2.y * 0.00781250000000000f;
    sdata[sdataID] = temp_2;
    sdataID = inoutID + 3;
    sdataID = sharedStride * sdataID;
    sdataID = sdataID + gl_LocalInvocationID.x;
    temp_3.x = temp_3.x * 0.00781250000000000f;
    temp_3.y = temp_3.y * 0.00781250000000000f;
    sdata[sdataID] = temp_3;
    sdataID = inoutID + 4;
    sdataID = sharedStride * sdataID;
    sdataID = sdataID + gl_LocalInvocationID.x;
    temp_4.x = temp_4.x * 0.00781250000000000f;
    temp_4.y = temp_4.y * 0.00781250000000000f;
    sdata[sdataID] = temp_4;
    sdataID = inoutID + 5;
    sdataID = sharedStride * sdataID;
    sdataID = sdataID + gl_LocalInvocationID.x;
    temp_5.x = temp_5.x * 0.00781250000000000f;
    temp_5.y = temp_5.y * 0.00781250000000000f;
    sdata[sdataID] = temp_5;
    sdataID = inoutID + 6;
    sdataID = sharedStride * sdataID;
    sdataID = sdataID + gl_LocalInvocationID.x;
    temp_6.x = temp_6.x * 0.00781250000000000f;
    temp_6.y = temp_6.y * 0.00781250000000000f;
    sdata[sdataID] = temp_6;
    sdataID = inoutID + 7;
    sdataID = sharedStride * sdataID;
    sdataID = sdataID + gl_LocalInvocationID.x;
    temp_7.x = temp_7.x * 0.00781250000000000f;
    temp_7.y = temp_7.y * 0.00781250000000000f;
    sdata[sdataID] = temp_7;
 }  barrier();
        if (((gl_GlobalInvocationID.x) / 128) % (1)+((gl_GlobalInvocationID.x) / 128) * (128) < 128) {
            stageInvocationID = (gl_LocalInvocationID.y+ 0) % (8);
        LUTId = stageInvocationID + 3;
        temp_0 = sdata[sharedStride*(gl_LocalInvocationID.y+0)+gl_LocalInvocationID.x];
        temp_1 = sdata[sharedStride*(gl_LocalInvocationID.y+16)+gl_LocalInvocationID.x];
        temp_2 = sdata[sharedStride*(gl_LocalInvocationID.y+32)+gl_LocalInvocationID.x];
        temp_3 = sdata[sharedStride*(gl_LocalInvocationID.y+48)+gl_LocalInvocationID.x];
        temp_4 = sdata[sharedStride*(gl_LocalInvocationID.y+64)+gl_LocalInvocationID.x];
        temp_5 = sdata[sharedStride*(gl_LocalInvocationID.y+80)+gl_LocalInvocationID.x];
        temp_6 = sdata[sharedStride*(gl_LocalInvocationID.y+96)+gl_LocalInvocationID.x];
        temp_7 = sdata[sharedStride*(gl_LocalInvocationID.y+112)+gl_LocalInvocationID.x];
    w = twiddleLUT[LUTId];
    loc_0.x = temp_4.x * w.x - temp_4.y * w.y;
    loc_0.y = temp_4.y * w.x + temp_4.x * w.y;
    temp_4.x = temp_0.x - loc_0.x;
    temp_4.y = temp_0.y - loc_0.y;
    temp_0.x = temp_0.x + loc_0.x;
    temp_0.y = temp_0.y + loc_0.y;
    loc_0.x = temp_5.x * w.x - temp_5.y * w.y;
    loc_0.y = temp_5.y * w.x + temp_5.x * w.y;
    temp_5.x = temp_1.x - loc_0.x;
    temp_5.y = temp_1.y - loc_0.y;
    temp_1.x = temp_1.x + loc_0.x;
    temp_1.y = temp_1.y + loc_0.y;
    loc_0.x = temp_6.x * w.x - temp_6.y * w.y;
    loc_0.y = temp_6.y * w.x + temp_6.x * w.y;
    temp_6.x = temp_2.x - loc_0.x;
    temp_6.y = temp_2.y - loc_0.y;
    temp_2.x = temp_2.x + loc_0.x;
    temp_2.y = temp_2.y + loc_0.y;
    loc_0.x = temp_7.x * w.x - temp_7.y * w.y;
    loc_0.y = temp_7.y * w.x + temp_7.x * w.y;
    temp_7.x = temp_3.x - loc_0.x;
    temp_7.y = temp_3.y - loc_0.y;
    temp_3.x = temp_3.x + loc_0.x;
    temp_3.y = temp_3.y + loc_0.y;
    w=twiddleLUT[LUTId+8];
    loc_0.x = temp_2.x * w.x - temp_2.y * w.y;
    loc_0.y = temp_2.y * w.x + temp_2.x * w.y;
    temp_2.x = temp_0.x - loc_0.x;
    temp_2.y = temp_0.y - loc_0.y;
    temp_0.x = temp_0.x + loc_0.x;
    temp_0.y = temp_0.y + loc_0.y;
    loc_0.x = temp_3.x * w.x - temp_3.y * w.y;
    loc_0.y = temp_3.y * w.x + temp_3.x * w.y;
    temp_3.x = temp_1.x - loc_0.x;
    temp_3.y = temp_1.y - loc_0.y;
    temp_1.x = temp_1.x + loc_0.x;
    temp_1.y = temp_1.y + loc_0.y;
    iw.x = -w.y;
    iw.y = w.x;
    loc_0.x = temp_6.x * iw.x - temp_6.y * iw.y;
    loc_0.y = temp_6.y * iw.x + temp_6.x * iw.y;
    temp_6.x = temp_4.x - loc_0.x;
    temp_6.y = temp_4.y - loc_0.y;
    temp_4.x = temp_4.x + loc_0.x;
    temp_4.y = temp_4.y + loc_0.y;
    loc_0.x = temp_7.x * iw.x - temp_7.y * iw.y;
    loc_0.y = temp_7.y * iw.x + temp_7.x * iw.y;
    temp_7.x = temp_5.x - loc_0.x;
    temp_7.y = temp_5.y - loc_0.y;
    temp_5.x = temp_5.x + loc_0.x;
    temp_5.y = temp_5.y + loc_0.y;
    w=twiddleLUT[LUTId+16];
    loc_0.x = temp_1.x * w.x - temp_1.y * w.y;
    loc_0.y = temp_1.y * w.x + temp_1.x * w.y;
    temp_1.x = temp_0.x - loc_0.x;
    temp_1.y = temp_0.y - loc_0.y;
    temp_0.x = temp_0.x + loc_0.x;
    temp_0.y = temp_0.y + loc_0.y;
    iw.x = -w.y;
    iw.y = w.x;
    loc_0.x = temp_3.x * iw.x - temp_3.y * iw.y;
    loc_0.y = temp_3.y * iw.x + temp_3.x * iw.y;
    temp_3.x = temp_2.x - loc_0.x;
    temp_3.y = temp_2.y - loc_0.y;
    temp_2.x = temp_2.x + loc_0.x;
    temp_2.y = temp_2.y + loc_0.y;
    iw.x = w.x * loc_SQRT1_2 - w.y * loc_SQRT1_2;
    iw.y = w.y * loc_SQRT1_2 + w.x * loc_SQRT1_2;
    loc_0.x = temp_5.x * iw.x - temp_5.y * iw.y;
    loc_0.y = temp_5.y * iw.x + temp_5.x * iw.y;
    temp_5.x = temp_4.x - loc_0.x;
    temp_5.y = temp_4.y - loc_0.y;
    temp_4.x = temp_4.x + loc_0.x;
    temp_4.y = temp_4.y + loc_0.y;
    w.x = -iw.y;
    w.y = iw.x;
    loc_0.x = temp_7.x * w.x - temp_7.y * w.y;
    loc_0.y = temp_7.y * w.x + temp_7.x * w.y;
    temp_7.x = temp_6.x - loc_0.x;
    temp_7.y = temp_6.y - loc_0.y;
    temp_6.x = temp_6.x + loc_0.x;
    temp_6.y = temp_6.y + loc_0.y;
    loc_0 = temp_1;
    temp_1 = temp_4;
    temp_4 = loc_0;
    loc_0 = temp_3;
    temp_3 = temp_6;
    temp_6 = loc_0;
 }  barrier();
        if (((gl_GlobalInvocationID.x) / 128) % (1)+((gl_GlobalInvocationID.x) / 128) * (128) < 128) {
        stageInvocationID = gl_LocalInvocationID.y + 0;
    blockInvocationID = stageInvocationID;
    stageInvocationID = stageInvocationID % 8;
    blockInvocationID = blockInvocationID - stageInvocationID;
    inoutID = blockInvocationID * 8;
    inoutID = inoutID + stageInvocationID;
    sdataID = inoutID + 0;
    sdataID = sharedStride * sdataID;
    sdataID = sdataID + gl_LocalInvocationID.x;
    sdata[sdataID] = temp_0;
    sdataID = inoutID + 8;
    sdataID = sharedStride * sdataID;
    sdataID = sdataID + gl_LocalInvocationID.x;
    sdata[sdataID] = temp_1;
    sdataID = inoutID + 16;
    sdataID = sharedStride * sdataID;
    sdataID = sdataID + gl_LocalInvocationID.x;
    sdata[sdataID] = temp_2;
    sdataID = inoutID + 24;
    sdataID = sharedStride * sdataID;
    sdataID = sdataID + gl_LocalInvocationID.x;
    sdata[sdataID] = temp_3;
    sdataID = inoutID + 32;
    sdataID = sharedStride * sdataID;
    sdataID = sdataID + gl_LocalInvocationID.x;
    temp_0.x = temp_0.x + loc_0.x;
        temp_6 = sdata[sharedStride*(gl_LocalInvocationID.y+96)+gl_LocalInvocationID.x];
    temp_7.x = temp_3.x - loc_0.x;
        temp_real0 = 0;
        temp_imag0 += kernel_obj[inoutID+0].y * temp_2.x - kernel_obj[inoutID+0].x * temp_2.y ;
        temp_imag0 = 0;
        temp_real0 += kernel_obj[inoutID+0].x * temp_5.x + kernel_obj[inoutID+0].y * temp_5.y;
        temp_real0 = 0;
            inoutID = ((gl_GlobalInvocationID.x) % (128)) + ((gl_LocalInvocationID.y+112)+((gl_GlobalInvocationID.x)/128)%(1)+((gl_GlobalInvocationID.x)/128)*(128)) * 128 +  * 16384;
    w=twiddleLUT[LUTId+2];
    sdataID = sdataID + gl_LocalInvocationID.x;
    sdataID = sdataID + gl_LocalInvocationID.x;
    temp_4.y = temp_4.y * 0.00781250000000000f;
    sdataID = inoutID + 6;
    temp_1.y = temp_1.y + loc_0.y;
    temp_2.y = temp_2.y + loc_0.y;
    loc_0.y = temp_5.y * iw.x + temp_5.x * iw.y;
    temp_6.y = temp_6.y + loc_0.y;
    sdataID = sdataID + gl_LocalInvocationID.x;
    sdataID = sdataID + gl_LocalInvocationID.x;
    loc_0.x = temp_4.x * w.x - temp_4.y * w.y;
    loc_0.x = temp_5.x * w.x - temp_5.y * w.y;
    temp_2.y = temp_2.y + loc_0.y;
 }      sharedStride = 32;
            outputs[inoutID] = temp_5;

My config is:

FFTSize is 128

    configuration.FFTdim = 2;
    configuration.size[0] = FFTSize;
    configuration.size[1] = FFTSize;
    configuration.size[2] = 1;
    configuration.normalize = 1;
    configuration.useLUT = 1;
    configuration.performConvolution = 1;
    configuration.conjugateConvolution = 1;
    configuration.numberBatches = windowCount;
    configuration.device = &vkGPU->device;
    configuration.queue = &vkGPU->queue; //to allocate memory for LUT, we have to pass a queue, vkGPU->fence, commandPool and physicalDevice pointers
    configuration.fence = &vkGPU->fence;
    configuration.commandPool = &vkGPU->commandPool;
    configuration.physicalDevice = &vkGPU->physicalDevice;
    configuration.isCompilerInitialized = 1; //compiler can be initialized before VkFFT plan creation. if not, VkFFT will create and destroy one after initialization
    VkFFTResult resFFT = VKFFT_SUCCESS;
    bufferSize =
            sizeof(float) * 2 * configuration.size[0] * configuration.size[1] * configuration.size[2] *
            configuration.numberBatches;
    VkDeviceMemory bufferDeviceMemory = {};
    resFFT = allocateBuffer(vkGPU, &bufferSecondImages, &bufferDeviceMemory,
                            VK_BUFFER_USAGE_STORAGE_BUFFER_BIT | VK_BUFFER_USAGE_TRANSFER_SRC_BIT |
                            VK_BUFFER_USAGE_TRANSFER_DST_BIT, VK_MEMORY_HEAP_DEVICE_LOCAL_BIT,
                            bufferSize);
    configuration.buffer = &bufferSecondImages;
    configuration.bufferSize = &bufferSize;
    configuration.keepShaderCode = 1;
    configuration.kernel = &bufferFirstImages;
    configuration.kernelSize = &bufferSize;
    resFFT = initializeVkFFT(&app2, configuration);

johuether commented 2 years ago

I solved it.

As in #42, using coordinateFeatures did the trick.

Instead of configuration.numberBatches = windowCount;

I used configuration.coordinateFeatures = windowCount;
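
To make the difference between the two parameters concrete, here is a sketch of how an element's flat offset could be computed if the buffer uses a W-H-D-C-N ordering (x fastest, then y, z, coordinate, batch). That ordering, the `Layout` struct and the helper names are assumptions for illustration, not taken from this thread. The only detail grounded in the shader dumps here is the per-image stride of 128 * 128 = 16384 elements.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    size_t w, h, d;  /* extents along x, y and z */
    size_t c;        /* number of coordinates per system */
    size_t n;        /* number of batches */
} Layout;

/* Offset, in complex elements, of (x, y, z) in coordinate ci of batch ni,
   assuming x varies fastest and the batch index varies slowest. */
size_t flat_index(const Layout *l, size_t x, size_t y, size_t z,
                  size_t ci, size_t ni)
{
    assert(l != NULL);
    assert(x < l->w && y < l->h && z < l->d && ci < l->c && ni < l->n);
    return x + l->w * (y + l->h * (z + l->d * (ci + l->c * ni)));
}

/* Total buffer size in bytes for single-precision complex data. */
size_t buffer_bytes(const Layout *l)
{
    assert(l != NULL);
    return sizeof(float) * 2 * l->w * l->h * l->d * l->c * l->n;
}
```

Under these assumptions, four 128x128 images stored as four coordinates of one batch occupy the same offsets as four batches of one coordinate, which would explain why swapping the two parameters leaves the data layout usable.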

@DTolm I can upload the changes that are needed to run VkFFT on Android. If you want to, please write me a message.

DTolm commented 2 years ago

@Talsoake While this workaround works, the original issue still remains. I have fixed it in the next version (and also reworked the way dispatching of batched kernels is done, which will make it faster). I will close it with the next commit.

As for the Android changes, can you contact me by email about this (dtolm96@gmail.com)? It would be good to make the benchmark work on Android as a demonstration.

johuether commented 2 years ago

I tried your new version, but nothing changed. There is still an error when parsing the shader.

But the workaround with using coordinateFeatures instead of numberBatches still works.

DTolm commented 2 years ago

The code should not be the same as before, as I have changed how numberBatches works at its core. Your configuration also works on my machine. Can you send me the full output of keepShaderCode again? Thank you.

johuether commented 2 years ago

Of course. The new shader code:

 #version 450

    layout (local_size_x = 16, local_size_y = 8, local_size_z = 1) in;
 const float loc_PI = 3.1415926535897932384626433832795f;
 const float loc_SQRT1_2 = 0.70710678118654752440084436210485f;
 layout(push_constant) uniform PushConsts
    {
        uint workGroupShiftX;
    uint workGroupShiftY;
    uint workGroupShiftZ;
 } consts;
 layout(std430, binding = 0) buffer DataIn{
        vec2 inputs[0];
 };
 layout(std430, binding = 1) buffer DataOut{
        vec2 outputs[0];
 };
 layout(std430, binding = 2) readonly buffer DataLUT {
    vec2 twiddleLUT[];
 };
 uint sharedStride = 130;
 shared vec2 sdata[1088];
 // sharedStride - fft size,  gl_WorkGroupSize.y - grouped consecutive ffts

    void main() {
        vec2 temp_0;
    vec2 temp_1;
    vec2 temp_2;
    vec2 temp_3;
    vec2 temp_4;
    vec2 temp_5;
    vec2 temp_6;
    vec2 temp_7;
    vec2 w;
    vec2 loc_0;
    vec2 iw;
    uint stageInvocationID;
    uint blockInvocationID;
    uint sdataID;
    uint combinedID;
    uint inoutID;
    uint LUTId=0;
        { 
            combinedID = (gl_LocalInvocationID.x + 16 * gl_LocalInvocationID.y) + 0;
        inoutID = (combinedID % 128) + (combinedID / 128) * 128;
            inoutID = (inoutID) + gl_WorkGroupID.y * 1024 + (gl_GlobalInvocationID.z / 1) * 16384;
        sdata[(combinedID % 128) + (combinedID / 128) * sharedStride] = inputs[inoutID];
        combinedID = (gl_LocalInvocationID.x + 16 * gl_LocalInvocationID.y) + 128;
        inoutID = (combinedID % 128) + (combinedID / 128) * 128;
            inoutID = (inoutID) + gl_WorkGroupID.y * 1024 + (gl_GlobalInvocationID.z / 1) * 16384;
        sdata[(combinedID % 128) + (combinedID / 128) * sharedStride] = inputs[inoutID];
        combinedID = (gl_LocalInvocationID.x + 16 * gl_LocalInvocationID.y) + 256;
        inoutID = (combinedID % 128) + (combinedID / 128) * 128;
            inoutID = (inoutID) + gl_WorkGroupID.y * 1024 + (gl_GlobalInvocationID.z / 1) * 16384;
        sdata[(combinedID % 128) + (combinedID / 128) * sharedStride] = inputs[inoutID];
        combinedID = (gl_LocalInvocationID.x + 16 * gl_LocalInvocationID.y) + 384;
        inoutID = (combinedID % 128) + (combinedID / 128) * 128;
            inoutID = (inoutID) + gl_WorkGroupID.y * 1024 + (gl_GlobalInvocationID.z / 1) * 16384;
        sdata[(combinedID % 128) + (combinedID / 128) * sharedStride] = inputs[inoutID];
        combinedID = (gl_LocalInvocationID.x + 16 * gl_LocalInvocationID.y) + 512;
        inoutID = (combinedID % 128) + (combinedID / 128) * 128;
            inoutID = (inoutID) + gl_WorkGroupID.y * 1024 + (gl_GlobalInvocationID.z / 1) * 16384;
        sdata[(combinedID % 128) + (combinedID / 128) * sharedStride] = inputs[inoutID];
        combinedID = (gl_LocalInvocationID.x + 16 * gl_LocalInvocationID.y) + 640;
        inoutID = (combinedID % 128) + (combinedID / 128) * 128;
            inoutID = (inoutID) + gl_WorkGroupID.y * 1024 + (gl_GlobalInvocationID.z / 1) * 16384;
        sdata[(combinedID % 128) + (combinedID / 128) * sharedStride] = inputs[inoutID];
        combinedID = (gl_LocalInvocationID.x + 16 * gl_LocalInvocationID.y) + 768;
        inoutID = (combinedID % 128) + (combinedID / 128) * 128;
            inoutID = (inoutID) + gl_WorkGroupID.y * 1024 + (gl_GlobalInvocationID.z / 1) * 16384;
        sdata[(combinedID % 128) + (combinedID / 128) * sharedStride] = inputs[inoutID];
        combinedID = (gl_LocalInvocationID.x + 16 * gl_LocalInvocationID.y) + 896;
        inoutID = (combinedID % 128) + (combinedID / 128) * 128;
            inoutID = (inoutID) + gl_WorkGroupID.y * 1024 + (gl_GlobalInvocationID.z / 1) * 16384;
        sdata[(combinedID % 128) + (combinedID / 128) * sharedStride] = inputs[inoutID];
    }
        barrier();
        stageInvocationID = (gl_LocalInvocationID.x+ 0) % (1);
        LUTId = stageInvocationID + 0;
        sdataID = gl_LocalInvocationID.x + 0;
        sdataID = sdataID + sharedStride * gl_LocalInvocationID.y;
        temp_0 = sdata[sdataID];
        sdataID = gl_LocalInvocationID.x + 16;
        sdataID = sdataID + sharedStride * gl_LocalInvocationID.y;
        temp_1 = sdata[sdataID];
        sdataID = gl_LocalInvocationID.x + 32;
        sdataID = sdataID + sharedStride * gl_LocalInvocationID.y;
        temp_2 = sdata[sdataID];
        sdataID = gl_LocalInvocationID.x + 48;
        sdataID = sdataID + sharedStride * gl_LocalInvocationID.y;
        temp_3 = sdata[sdataID];
        sdataID = gl_LocalInvocationID.x + 64;
        sdataID = sdataID + sharedStride * gl_LocalInvocationID.y;
        temp_4 = sdata[sdataID];
        sdataID = gl_LocalInvocationID.x + 80;
        sdataID = sdataID + sharedStride * gl_LocalInvocationID.y;
        temp_5 = sdata[sdataID];
        sdataID = gl_LocalInvocationID.x + 96;
        sdataID = sdataID + sharedStride * gl_LocalInvocationID.y;
        temp_6 = sdata[sdataID];
        sdataID = gl_LocalInvocationID.x + 112;
        sdataID = sdataID + sharedStride * gl_LocalInvocationID.y;
        temp_7 = sdata[sdataID];
    w = twiddleLUT[LUTId];
    loc_0.x = temp_4.x * w.x - temp_4.y * w.y;
    loc_0.y = temp_4.y * w.x + temp_4.x * w.y;
    temp_4.x = temp_0.x - loc_0.x;
    temp_4.y = temp_0.y - loc_0.y;
    temp_0.x = temp_0.x + loc_0.x;
    temp_0.y = temp_0.y + loc_0.y;
    loc_0.x = temp_5.x * w.x - temp_5.y * w.y;
    loc_0.y = temp_5.y * w.x + temp_5.x * w.y;
    temp_5.x = temp_1.x - loc_0.x;
    temp_5.y = temp_1.y - loc_0.y;
    temp_1.x = temp_1.x + loc_0.x;
    temp_1.y = temp_1.y + loc_0.y;
    loc_0.x = temp_6.x * w.x - temp_6.y * w.y;
    loc_0.y = temp_6.y * w.x + temp_6.x * w.y;
    temp_6.x = temp_2.x - loc_0.x;
    temp_6.y = temp_2.y - loc_0.y;
    temp_2.x = temp_2.x + loc_0.x;
    temp_2.y = temp_2.y + loc_0.y;
    loc_0.x = temp_7.x * w.x - temp_7.y * w.y;
    loc_0.y = temp_7.y * w.x + temp_7.x * w.y;
    temp_7.x = temp_3.x - loc_0.x;
    temp_7.y = temp_3.y - loc_0.y;
    temp_3.x = temp_3.x + loc_0.x;
    temp_3.y = temp_3.y + loc_0.y;
    w=twiddleLUT[LUTId+1];
    loc_0.x = temp_2.x * w.x - temp_2.y * w.y;
    loc_0.y = temp_2.y * w.x + temp_2.x * w.y;
    temp_2.x = temp_0.x - loc_0.x;
    temp_2.y = temp_0.y - loc_0.y;
    temp_0.x = temp_0.x + loc_0.x;
    temp_0.y = temp_0.y + loc_0.y;
    loc_0.x = temp_3.x * w.x - temp_3.y * w.y;
    loc_0.y = temp_3.y * w.x + temp_3.x * w.y;
    temp_3.x = temp_1.x - loc_0.x;
    temp_3.y = temp_1.y - loc_0.y;
    temp_1.x = temp_1.x + loc_0.x;
    temp_1.y = temp_1.y + loc_0.y;
    iw.x = -w.y;
    iw.y = w.x;
    loc_0.x = temp_6.x * iw.x - temp_6.y * iw.y;
    loc_0.y = temp_6.y * iw.x + temp_6.x * iw.y;
    temp_6.x = temp_4.x - loc_0.x;
    temp_6.y = temp_4.y - loc_0.y;
    temp_4.x = temp_4.x + loc_0.x;
    temp_4.y = temp_4.y + loc_0.y;
    loc_0.x = temp_7.x * iw.x - temp_7.y * iw.y;
    loc_0.y = temp_7.y * iw.x + temp_7.x * iw.y;
    temp_7.x = temp_5.x - loc_0.x;
    temp_7.y = temp_5.y - loc_0.y;
    temp_5.x = temp_5.x + loc_0.x;
    temp_5.y = temp_5.y + loc_0.y;
    w=twiddleLUT[LUTId+2];
    loc_0.x = temp_1.x * w.x - temp_1.y * w.y;
    loc_0.y = temp_1.y * w.x + temp_1.x * w.y;
    temp_1.x = temp_0.x - loc_0.x;
    temp_1.y = temp_0.y - loc_0.y;
    temp_0.x = temp_0.x + loc_0.x;
    temp_0.y = temp_0.y + loc_0.y;
    iw.x = -w.y;
    iw.y = w.x;
    loc_0.x = temp_3.x * iw.x - temp_3.y * iw.y;
    loc_0.y = temp_3.y * iw.x + temp_3.x * iw.y;
    temp_3.x = temp_2.x - loc_0.x;
    temp_3.y = temp_2.y - loc_0.y;
    temp_2.x = temp_2.x + loc_0.x;
    temp_2.y = temp_2.y + loc_0.y;
    iw.x = w.x * loc_SQRT1_2 - w.y * loc_SQRT1_2;
    iw.y = w.y * loc_SQRT1_2 + w.x * loc_SQRT1_2;
    loc_0.x = temp_5.x * iw.x - temp_5.y * iw.y;
    loc_0.y = temp_5.y * iw.x + temp_5.x * iw.y;
    temp_5.x = temp_4.x - loc_0.x;
    temp_5.y = temp_4.y - loc_0.y;
    temp_4.x = temp_4.x + loc_0.x;
    temp_4.y = temp_4.y + loc_0.y;
    w.x = -iw.y;
    w.y = iw.x;
    loc_0.x = temp_7.x * w.x - temp_7.y * w.y;
    loc_0.y = temp_7.y * w.x + temp_7.x * w.y;
    temp_7.x = temp_6.x - loc_0.x;
    temp_7.y = temp_6.y - loc_0.y;
    temp_6.x = temp_6.x + loc_0.x;
    temp_6.y = temp_6.y + loc_0.y;
    loc_0 = temp_1;
    temp_1 = temp_4;
    temp_4 = loc_0;
    loc_0 = temp_3;
    temp_3 = temp_6;
    temp_6 = loc_0;
    barrier();
    stageInvocationID = gl_LocalInvocationID.x + 0;
    blockInvocationID = stageInvocationID;
    stageInvocationID = stageInvocationID % 1;
    blockInvocationID = blockInvocationID - stageInvocationID;
    inoutID = blockInvocationID * 8;
    inoutID = inoutID + stageInvocationID;
    sdataID = inoutID + 0;
    sharedStride = 136;
    sdataID = (sdataID / 16) * 17 + sdataID % 16;
    combinedID = gl_LocalInvocationID.y * sharedStride;
    sdataID = sdataID + combinedID;
    temp_0.x = temp_0.x * 0.00781250000000000f;
    temp_0.y = temp_0.y * 0.00781250000000000f;
    sdata[sdataID] = temp_0;
    sdataID = inoutID + 1;
    sdataID = (sdataID / 16) * 17 + sdataID % 16;
    combinedID = gl_LocalInvocationID.y * sharedStride;
    sdataID = sdataID + combinedID;
    temp_1.x = temp_1.x * 0.00781250000000000f;
    temp_1.y = temp_1.y * 0.00781250000000000f;
    sdata[sdataID] = temp_1;
    sdataID = inoutID + 2;
    sdataID = (sdataID / 16) * 17 + sdataID % 16;
    combinedID = gl_LocalInvocationID.y * sharedStride;
    sdataID = sdataID + combinedID;
    temp_2.x = temp_2.x * 0.00781250000000000f;
    temp_2.y = temp_2.y * 0.00781250000000000f;
    sdata[sdataID] = temp_2;
    sdataID = inoutID + 3;
    sdataID = (sdataID / 16) * 17 + sdataID % 16;
    combinedID = gl_LocalInvocationID.y * sharedStride;
    sdataID = sdataID + combinedID;
    temp_3.x = temp_3.x * 0.00781250000000000f;
    temp_3.y = temp_3.y * 0.00781250000000000f;
    sdata[sdataID] = temp_3;
    sdataID = inoutID + 4;
    sdataID = (sdataID / 16) * 17 + sdataID % 16;
    combinedID = gl_LocalInvocationID.y * sharedStride;
    sdataID = sdataID + combinedID;
    temp_4.x = temp_4.x * 0.00781250000000000f;
    temp_4.y = temp_4.y * 0.00781250000000000f;
    sdata[sdataID] = temp_4;
    sdataID = inoutID + 5;
    sdataID = (sdataID / 16) * 17 + sdataID % 16;
    combinedID = gl_LocalInvocationID.y * sharedStride;
    sdataID = sdataID + combinedID;
    temp_5.x = temp_5.x * 0.00781250000000000f;
    temp_5.y = temp_5.y * 0.00781250000000000f;
    sdata[sdataID] = temp_5;
    sdataID = inoutID + 6;
    sdataID = (sdataID / 16) * 17 + sdataID % 16;
    combinedID = gl_LocalInvocationID.y * sharedStride;
    sdataID = sdataID + combinedID;
    temp_6.x = temp_6.x * 0.00781250000000000f;
    temp_6.y = temp_6.y * 0.00781250000000000f;
    sdata[sdataID] = temp_6;
    sdataID = inoutID + 7;
    sdataID = (sdataID / 16) * 17 + sdataID % 16;
    combinedID = gl_LocalInvocationID.y * sharedStride;
    sdataID = sdataID + combinedID;
    temp_7.x = temp_7.x * 0.00781250000000000f;
    temp_7.y = temp_7.y * 0.00781250000000000f;
    sdata[sdataID] = temp_7;
    barrier();
        stageInvocationID = (gl_LocalInvocationID.x+ 0) % (8);
        LUTId = stageInvocationID + 3;
        sdataID = gl_LocalInvocationID.x + 0;
    sdataID = (sdataID / 16) * 17 + sdataID % 16;
        sdataID = sdataID + sharedStride * gl_LocalInvocationID.y;
        temp_0 = sdata[sdataID];
        sdataID = gl_LocalInvocationID.x + 16;
    sdataID = (sdataID / 16) * 17 + sdataID % 16;
        sdataID = sdataID + sharedStride * gl_LocalInvocationID.y;
        temp_1 = sdata[sdataID];
        sdataID = gl_LocalInvocationID.x + 32;
    sdataID = (sdataID / 16) * 17 + sdataID % 16;
        sdataID = sdataID + sharedStride * gl_LocalInvocationID.y;
        temp_2 = sdata[sdataID];
        sdataID = gl_LocalInvocationID.x + 48;
    sdataID = (sdataID / 16) * 17 + sdataID % 16;
        sdataID = sdataID + sharedStride * gl_LocalInvocationID.y;
        temp_3 = sdata[sdataID];
        sdataID = gl_LocalInvocationID.x + 64;
    sdataID = (sdataID / 16) * 17 + sdataID % 16;
        sdataID = sdataID + sharedStride * gl_LocalInvocationID.y;
        temp_4 = sdata[sdataID];
        sdataID = gl_LocalInvocationID.x + 80;
    sdataID = (sdataID / 16) * 17 + sdataID % 16;
        sdataID = sdataID + sharedStride * gl_LocalInvocationID.y;
        temp_5 = sdata[sdataID];
        sdataID = gl_LocalInvocationID.x + 96;
    sdataID = (sdataID / 16) * 17 + sdataID % 16;
        sdataID = sdataID + sharedStride * gl_LocalInvocationID.y;
        temp_6 = sdata[sdataID];
        sdataID = gl_LocalInvocationID.x + 112;
    sdataID = (sdataID / 16) * 17 + sdataID % 16;
        sdataID = sdataID + sharedStride * gl_LocalInvocationID.y;
        temp_7 = sdata[sdataID];
    w = twiddleLUT[LUTId];
    loc_0.x = temp_4.x * w.x - temp_4.y * w.y;
    loc_0.y = temp_4.y * w.x + temp_4.x * w.y;
    temp_4.x = temp_0.x - loc_0.x;
    temp_4.y = temp_0.y - loc_0.y;
    temp_0.x = temp_0.x + loc_0.x;
    temp_0.y = temp_0.y + loc_0.y;
    loc_0.x = temp_5.x * w.x - temp_5.y * w.y;
    loc_0.y = temp_5.y * w.x + temp_5.x * w.y;
    temp_5.x = temp_1.x - loc_0.x;
    temp_5.y = temp_1.y - loc_0.y;
    temp_1.x = temp_1.x + loc_0.x;
    temp_1.y = temp_1.y + loc_0.y;
    loc_0.x = temp_6.x * w.x - temp_6.y * w.y;
    loc_0.y = temp_6.y * w.x + temp_6.x * w.y;
    temp_6.x = temp_2.x - loc_0.x;
    temp_6.y = temp_2.y - loc_0.y;
    temp_2.x = temp_2.x + loc_0.x;
    temp_2.y = temp_2.y + loc_0.y;
    loc_0.x = temp_7.x * w.x - temp_7.y * w.y;
    loc_0.y = temp_7.y * w.x + temp_7.x * w.y;
    temp_7.x = temp_3.x - loc_0.x;
    temp_7.y = temp_3.y - loc_0.y;
    temp_3.x = temp_3.x + loc_0.x;
    temp_3.y = temp_3.y + loc_0.y;
    w=twiddleLUT[LUTId+8];
    loc_0.x = temp_2.x * w.x - temp_2.y * w.y;
    loc_0.y = temp_2.y * w.x + temp_2.x * w.y;
    temp_2.x = temp_0.x - loc_0.x;
    temp_2.y = temp_0.y - loc_0.y;
    temp_0.x = temp_0.x + loc_0.x;
    temp_0.y = temp_0.y + loc_0.y;
    loc_0.x = temp_3.x * w.x - temp_3.y * w.y;
    loc_0.y = temp_3.y * w.x + temp_3.x * w.y;
    temp_3.x = temp_1.x - loc_0.x;
    temp_3.y = temp_1.y - loc_0.y;
    temp_1.x = temp_1.x + loc_0.x;
    temp_1.y = temp_1.y + loc_0.y;
    iw.x = -w.y;
    iw.y = w.x;
    loc_0.x = temp_6.x * iw.x - temp_6.y * iw.y;
    loc_0.y = temp_6.y * iw.x + temp_6.x * iw.y;
    temp_6.x = temp_4.x - loc_0.x;
    temp_6.y = temp_4.y - loc_0.y;
    temp_4.x = temp_4.x + loc_0.x;
    temp_4.y = temp_4.y + loc_0.y;
    loc_0.x = temp_7.x * iw.x - temp_7.y * iw.y;
    loc_0.y = temp_7.y * iw.x + temp_7.x * iw.y;
    temp_7.x = temp_5.x - loc_0.x;
    temp_7.y = temp_5.y - loc_0.y;
    temp_5.x = temp_5.x + loc_0.x;
    temp_5.y = temp_5.y + loc_0.y;
    w=twiddleLUT[LUTId+16];
    loc_0.x = temp_1.x * w.x - temp_1.y * w.y;
    loc_0.y = temp_1.y * w.x + temp_1.x * w.y;
    temp_1.x = temp_0.x - loc_0.x;
    temp_1.y = temp_0.y - loc_0.y;
    temp_0.x = temp_0.x + loc_0.x;
    temp_0.y = temp_0.y + loc_0.y;
    iw.x = -w.y;
    iw.y = w.x;
    loc_0.x = temp_3.x * iw.x - temp_3.y * iw.y;
    loc_0.y = temp_3.y * iw.x + temp_3.x * iw.y;
    temp_3.x = temp_2.x - loc_0.x;
    temp_3.y = temp_2.y - loc_0.y;
    temp_2.x = temp_2.x + loc_0.x;
    temp_2.y = temp_2.y + loc_0.y;
    iw.x = w.x * loc_SQRT1_2 - w.y * loc_SQRT1_2;
    iw.y = w.y * loc_SQRT1_2 + w.x * loc_SQRT1_2;
    loc_0.x = temp_5.x * iw.x - temp_5.y * iw.y;
    loc_0.y = temp_5.y * iw.x + temp_5.x * iw.y;
    loc_0.y = temp_7.y * w.x + temp_7.x * w.y;
    temp_7.y = temp_6.y - loc_0.y;
    temp_4 = loc_0;
    sdataID = sdataID + combinedID;
    sdata[sdataID] = temp_5;
        sdataID = gl_LocalInvocationID.x + 64;
        sdataID = sdataID + sharedStride * gl_LocalInvocationID.y;
    loc_0.x = temp_4.x * w.x - temp_4.y * w.y;
        temp_2 = sdata[sdataID];
    w = twiddleLUT[LUTId];
    temp_2.x = temp_2.x + loc_0.x;
    loc_0.y = temp_7.y * w.x + temp_7.x * w.y;
    temp_7.y = temp_3.y - loc_0.y;
    inoutID = blockInvocationID * 2;
    sdataID = sdataID + combinedID;
    sdata[sdataID] = temp_5;
    sdata[sdataID] = temp_6;
    blockInvocationID = stageInvocationID;
    sdataID = sdataID + combinedID;
DTolm commented 2 years ago

The issue seems to be that the value of bufferSize, which is passed by pointer, is zero. VkFFT checks that the pointer is non-null, but it does not check that the value it points to is non-zero. I will add that check in the next update. Could you double-check the bufferSize value? I am still not able to reproduce the error.
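
To illustrate the difference between the two checks, here is a minimal C sketch. It is not VkFFT's actual code: `check_buffer_size` and the `CheckResult` codes are hypothetical names made up for this example, and the only assumption taken from the thread is that the size is handed over through a pointer.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result codes for this sketch; not VkFFT's error enum. */
typedef enum {
    CHECK_OK = 0,
    CHECK_NULL_POINTER,
    CHECK_ZERO_SIZE
} CheckResult;

/* Validate a buffer size that is passed by pointer.
 * Checking only the pointer lets a zero size slip through,
 * so the pointed-to value is checked as well. */
static CheckResult check_buffer_size(const uint64_t *bufferSize) {
    if (bufferSize == NULL) {
        return CHECK_NULL_POINTER;   /* no size was provided at all */
    }
    if (*bufferSize == 0) {
        return CHECK_ZERO_SIZE;      /* a size was provided, but it is zero */
    }
    return CHECK_OK;
}
```

With only the first check in place, a valid pointer to a zero value would pass validation and the failure would show up later, which matches the behaviour described above.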

johuether commented 2 years ago

Yes, you're right,

it was a problem with bufferSize, but it was my fault. After I updated my source with your changes, I was still computing bufferSize from coordinateFeatures instead of numberBatches.

So now everything is working fine with numberBatches.
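
For anyone hitting the same mismatch, a rough C sketch of the size calculation follows. The `FftLayout` struct and `c2c_buffer_size_bytes` are hypothetical helpers, not part of VkFFT; the field names only mirror the configuration fields mentioned in this thread. It assumes single-precision C2C data (two floats per element) with no padding.

```c
#include <stdint.h>

/* Hypothetical mirror of the relevant configuration fields (assumption). */
typedef struct {
    uint64_t size[3];            /* FFT dimensions; unused ones set to 1 */
    uint64_t coordinateFeatures; /* 1 if unused */
    uint64_t numberBatches;      /* 1 if unused */
} FftLayout;

/* Bytes needed for an unpadded single-precision complex buffer.
 * Multiplying by both counts keeps the size correct no matter
 * which of the two fields carries the number of transforms. */
static uint64_t c2c_buffer_size_bytes(const FftLayout *l) {
    uint64_t elements = l->size[0] * l->size[1] * l->size[2];
    elements *= l->coordinateFeatures * l->numberBatches;
    return elements * 2u * (uint64_t)sizeof(float);
}
```

If the batch count is moved from one field to the other but the size formula still reads the old field, the buffer ends up too small (or zero), which is the kind of error described above.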

Thanks

DTolm commented 2 years ago

This is good to hear!

In case you encounter other issues - feel free to share.

einthusan commented 2 years ago

So can the README be updated to mention that Android support is available? I plan to use this for a game targeting a mid-spec PowerVR GPU. Thanks for writing this library.

DTolm commented 2 years ago

@einthusan

I have not tested on Android myself yet; this is mentioned in the future plans section. That said, the code should work on any Vulkan-capable platform.

gsgou commented 10 months ago

@Talsoake do you have a sample you could share showing how you integrated VkFFT on Android?