Open johuether opened 2 years ago
Hello,
It's great to hear that the code works on Android. As for the precision, some devices (notably Intel iGPUs) have really bad precision in the hardware special function units used to calculate sines and cosines. I haven't personally tested mobile GPUs, but the workaround is to enable LUT usage in single precision. To do so, set useLUT = 1 in the configuration and let's see if that helps.
As for the R2C layout, I am currently writing documentation that covers all of this; it will be out soon.
Best regards, Dmitrii
Thanks,
Yes, that helps. Now the difference is of the same order of magnitude as your results.
But I have a second question:
I also need to compute the cross correlation of two images. So I need to calculate the cross power spectrum after I calculated the FFTs.
Is there an easy way to integrate this into one pipeline, so I don't have to copy buffers between my calculations?
There is no callback functionality in VkFFT yet (I need to think about how it can be done for all backends first).
As for the cross-correlation, I can add it as an alternative to the convolution calculation this evening. It will work like this: you compute the FFT of a base image and use it as a kernel; then, when you compute the FFT of the second image, the last FFT step, the cross-correlation and the first iFFT step will be merged.
Yeah, that would be great.
At the moment I calculate the cross-correlation the following way:
Are the results the same?
I have added support for cross-correlation and normalized cross-power spectrum calculation (the latter normalizes the kernel multiplication in the frequency domain). It merges the last step of the FFT of image2, the multiplication and the first step of the inverse FFT into one kernel (plus the X axis FFTs if you do 2D FFTs). You need to configure the convolution calculation with cross-correlation enabled:
configuration.performConvolution = 1;
configuration.conjugateConvolution = 1; // 1 - the current FFT will be conjugated, 2 - the kernel will be conjugated
configuration.kernel = &kernelBuffer; // specify the other image's FFT here; use kernelConvolution = 1 when making the plan for that FFT
configuration.kernelSize = &kernelSize;
You will need two separate applications, one for simple FFT, the other for cross-correlation. Hope this helps and feel free to ask any questions.
Hi, could you upload an example of how to do cross-correlation between two images? @DTolm
Thanks a lot for your effort.
I have tested the correlation now. It works like a charm if I use numberBatches = 1. When I set it to something greater than 1, I get a shader parse error:
"ERROR: 0:53: '' : syntax error, unexpected STAR\nERROR: 1 compilation errors. No code generated.\n\n"
It's thrown in VkFFTPlanAxis for the second axis. Do you have any clue what's causing this?
@Talsoake can you please attach the configuration you used and the generated kernels? (If you use the keepShaderCode parameter, VkFFT will print them.)
@Goose-Bomb Sure, I will add this to the documentation, which is going to be released soon.
This is the generated shader code:
#version 450
layout (local_size_x = 32, local_size_y = 16, local_size_z = 1) in;
const float loc_PI = 3.1415926535897932384626433832795f;
const float loc_SQRT1_2 = 0.70710678118654752440084436210485f;
layout(push_constant) uniform PushConsts
{
uint coordinate;
uint batchID;
uint workGroupShiftX;
uint workGroupShiftY;
uint workGroupShiftZ;
} consts;
layout(std430, binding = 0) buffer DataIn{
vec2 inputs[32768];
};
layout(std430, binding = 1) buffer DataOut{
vec2 outputs[32768];
};
layout(std430, binding = 2) buffer Kernel_FFT{
vec2 kernel_obj[32768];
};
layout(std430, binding = 3) readonly buffer DataLUT {
vec2 twiddleLUT[];
};
uint sharedStride = 32;
shared vec2 sdata[4096];
void main() {
vec2 temp_0;
vec2 temp_1;
vec2 temp_2;
vec2 temp_3;
vec2 temp_4;
vec2 temp_5;
vec2 temp_6;
vec2 temp_7;
vec2 w;
vec2 loc_0;
vec2 iw;
uint stageInvocationID;
uint blockInvocationID;
uint sdataID;
uint combinedID;
uint inoutID;
uint LUTId=0;
if (((gl_GlobalInvocationID.x) / 128) % (1)+((gl_GlobalInvocationID.x) / 128) * (128) < 128) {
inoutID = (1 * (gl_LocalInvocationID.y + 0) + ((gl_GlobalInvocationID.x) / 128) % (1)+((gl_GlobalInvocationID.x) / 128) * (128));
inoutID = ((gl_GlobalInvocationID.x) % (128)) + (inoutID) * 128 + * 16384;
temp_0=inputs[inoutID];
inoutID = (1 * (gl_LocalInvocationID.y + 16) + ((gl_GlobalInvocationID.x) / 128) % (1)+((gl_GlobalInvocationID.x) / 128) * (128));
inoutID = ((gl_GlobalInvocationID.x) % (128)) + (inoutID) * 128 + * 16384;
temp_1=inputs[inoutID];
inoutID = (1 * (gl_LocalInvocationID.y + 32) + ((gl_GlobalInvocationID.x) / 128) % (1)+((gl_GlobalInvocationID.x) / 128) * (128));
inoutID = ((gl_GlobalInvocationID.x) % (128)) + (inoutID) * 128 + * 16384;
temp_2=inputs[inoutID];
inoutID = (1 * (gl_LocalInvocationID.y + 48) + ((gl_GlobalInvocationID.x) / 128) % (1)+((gl_GlobalInvocationID.x) / 128) * (128));
inoutID = ((gl_GlobalInvocationID.x) % (128)) + (inoutID) * 128 + * 16384;
temp_3=inputs[inoutID];
inoutID = (1 * (gl_LocalInvocationID.y + 64) + ((gl_GlobalInvocationID.x) / 128) % (1)+((gl_GlobalInvocationID.x) / 128) * (128));
inoutID = ((gl_GlobalInvocationID.x) % (128)) + (inoutID) * 128 + * 16384;
temp_4=inputs[inoutID];
inoutID = (1 * (gl_LocalInvocationID.y + 80) + ((gl_GlobalInvocationID.x) / 128) % (1)+((gl_GlobalInvocationID.x) / 128) * (128));
inoutID = ((gl_GlobalInvocationID.x) % (128)) + (inoutID) * 128 + * 16384;
temp_5=inputs[inoutID];
inoutID = (1 * (gl_LocalInvocationID.y + 96) + ((gl_GlobalInvocationID.x) / 128) % (1)+((gl_GlobalInvocationID.x) / 128) * (128));
inoutID = ((gl_GlobalInvocationID.x) % (128)) + (inoutID) * 128 + * 16384;
temp_6=inputs[inoutID];
inoutID = (1 * (gl_LocalInvocationID.y + 112) + ((gl_GlobalInvocationID.x) / 128) % (1)+((gl_GlobalInvocationID.x) / 128) * (128));
inoutID = ((gl_GlobalInvocationID.x) % (128)) + (inoutID) * 128 + * 16384;
temp_7=inputs[inoutID];
}
if (((gl_GlobalInvocationID.x) / 128) % (1)+((gl_GlobalInvocationID.x) / 128) * (128) < 128) {
stageInvocationID = (gl_LocalInvocationID.y+ 0) % (1);
LUTId = stageInvocationID + 0;
w = twiddleLUT[LUTId];
loc_0.x = temp_4.x * w.x - temp_4.y * w.y;
loc_0.y = temp_4.y * w.x + temp_4.x * w.y;
temp_4.x = temp_0.x - loc_0.x;
temp_4.y = temp_0.y - loc_0.y;
temp_0.x = temp_0.x + loc_0.x;
temp_0.y = temp_0.y + loc_0.y;
loc_0.x = temp_5.x * w.x - temp_5.y * w.y;
loc_0.y = temp_5.y * w.x + temp_5.x * w.y;
temp_5.x = temp_1.x - loc_0.x;
temp_5.y = temp_1.y - loc_0.y;
temp_1.x = temp_1.x + loc_0.x;
temp_1.y = temp_1.y + loc_0.y;
loc_0.x = temp_6.x * w.x - temp_6.y * w.y;
loc_0.y = temp_6.y * w.x + temp_6.x * w.y;
temp_6.x = temp_2.x - loc_0.x;
temp_6.y = temp_2.y - loc_0.y;
temp_2.x = temp_2.x + loc_0.x;
temp_2.y = temp_2.y + loc_0.y;
loc_0.x = temp_7.x * w.x - temp_7.y * w.y;
loc_0.y = temp_7.y * w.x + temp_7.x * w.y;
temp_7.x = temp_3.x - loc_0.x;
temp_7.y = temp_3.y - loc_0.y;
temp_3.x = temp_3.x + loc_0.x;
temp_3.y = temp_3.y + loc_0.y;
w=twiddleLUT[LUTId+1];
loc_0.x = temp_2.x * w.x - temp_2.y * w.y;
loc_0.y = temp_2.y * w.x + temp_2.x * w.y;
temp_2.x = temp_0.x - loc_0.x;
temp_2.y = temp_0.y - loc_0.y;
temp_0.x = temp_0.x + loc_0.x;
temp_0.y = temp_0.y + loc_0.y;
loc_0.x = temp_3.x * w.x - temp_3.y * w.y;
loc_0.y = temp_3.y * w.x + temp_3.x * w.y;
temp_3.x = temp_1.x - loc_0.x;
temp_3.y = temp_1.y - loc_0.y;
temp_1.x = temp_1.x + loc_0.x;
temp_1.y = temp_1.y + loc_0.y;
iw.x = -w.y;
iw.y = w.x;
loc_0.x = temp_6.x * iw.x - temp_6.y * iw.y;
loc_0.y = temp_6.y * iw.x + temp_6.x * iw.y;
temp_6.x = temp_4.x - loc_0.x;
temp_6.y = temp_4.y - loc_0.y;
temp_4.x = temp_4.x + loc_0.x;
temp_4.y = temp_4.y + loc_0.y;
loc_0.x = temp_7.x * iw.x - temp_7.y * iw.y;
loc_0.y = temp_7.y * iw.x + temp_7.x * iw.y;
temp_7.x = temp_5.x - loc_0.x;
temp_7.y = temp_5.y - loc_0.y;
temp_5.x = temp_5.x + loc_0.x;
temp_5.y = temp_5.y + loc_0.y;
w=twiddleLUT[LUTId+2];
loc_0.x = temp_1.x * w.x - temp_1.y * w.y;
loc_0.y = temp_1.y * w.x + temp_1.x * w.y;
temp_1.x = temp_0.x - loc_0.x;
temp_1.y = temp_0.y - loc_0.y;
temp_0.x = temp_0.x + loc_0.x;
temp_0.y = temp_0.y + loc_0.y;
iw.x = -w.y;
iw.y = w.x;
loc_0.x = temp_3.x * iw.x - temp_3.y * iw.y;
loc_0.y = temp_3.y * iw.x + temp_3.x * iw.y;
temp_3.x = temp_2.x - loc_0.x;
temp_3.y = temp_2.y - loc_0.y;
temp_2.x = temp_2.x + loc_0.x;
temp_2.y = temp_2.y + loc_0.y;
iw.x = w.x * loc_SQRT1_2 - w.y * loc_SQRT1_2;
iw.y = w.y * loc_SQRT1_2 + w.x * loc_SQRT1_2;
loc_0.x = temp_5.x * iw.x - temp_5.y * iw.y;
loc_0.y = temp_5.y * iw.x + temp_5.x * iw.y;
temp_5.x = temp_4.x - loc_0.x;
temp_5.y = temp_4.y - loc_0.y;
temp_4.x = temp_4.x + loc_0.x;
temp_4.y = temp_4.y + loc_0.y;
w.x = -iw.y;
w.y = iw.x;
loc_0.x = temp_7.x * w.x - temp_7.y * w.y;
loc_0.y = temp_7.y * w.x + temp_7.x * w.y;
temp_7.x = temp_6.x - loc_0.x;
temp_7.y = temp_6.y - loc_0.y;
temp_6.x = temp_6.x + loc_0.x;
temp_6.y = temp_6.y + loc_0.y;
loc_0 = temp_1;
temp_1 = temp_4;
temp_4 = loc_0;
loc_0 = temp_3;
temp_3 = temp_6;
temp_6 = loc_0;
} sharedStride = 32;
barrier();
if (((gl_GlobalInvocationID.x) / 128) % (1)+((gl_GlobalInvocationID.x) / 128) * (128) < 128) {
stageInvocationID = gl_LocalInvocationID.y + 0;
blockInvocationID = stageInvocationID;
stageInvocationID = stageInvocationID % 1;
blockInvocationID = blockInvocationID - stageInvocationID;
inoutID = blockInvocationID * 8;
inoutID = inoutID + stageInvocationID;
sdataID = inoutID + 0;
sdataID = sharedStride * sdataID;
sdataID = sdataID + gl_LocalInvocationID.x;
temp_0.x = temp_0.x * 0.00781250000000000f;
temp_0.y = temp_0.y * 0.00781250000000000f;
sdata[sdataID] = temp_0;
sdataID = inoutID + 1;
sdataID = sharedStride * sdataID;
sdataID = sdataID + gl_LocalInvocationID.x;
temp_1.x = temp_1.x * 0.00781250000000000f;
temp_1.y = temp_1.y * 0.00781250000000000f;
sdata[sdataID] = temp_1;
sdataID = inoutID + 2;
sdataID = sharedStride * sdataID;
sdataID = sdataID + gl_LocalInvocationID.x;
temp_2.x = temp_2.x * 0.00781250000000000f;
temp_2.y = temp_2.y * 0.00781250000000000f;
sdata[sdataID] = temp_2;
sdataID = inoutID + 3;
sdataID = sharedStride * sdataID;
sdataID = sdataID + gl_LocalInvocationID.x;
temp_3.x = temp_3.x * 0.00781250000000000f;
temp_3.y = temp_3.y * 0.00781250000000000f;
sdata[sdataID] = temp_3;
sdataID = inoutID + 4;
sdataID = sharedStride * sdataID;
sdataID = sdataID + gl_LocalInvocationID.x;
temp_4.x = temp_4.x * 0.00781250000000000f;
temp_4.y = temp_4.y * 0.00781250000000000f;
sdata[sdataID] = temp_4;
sdataID = inoutID + 5;
sdataID = sharedStride * sdataID;
sdataID = sdataID + gl_LocalInvocationID.x;
temp_5.x = temp_5.x * 0.00781250000000000f;
temp_5.y = temp_5.y * 0.00781250000000000f;
sdata[sdataID] = temp_5;
sdataID = inoutID + 6;
sdataID = sharedStride * sdataID;
sdataID = sdataID + gl_LocalInvocationID.x;
temp_6.x = temp_6.x * 0.00781250000000000f;
temp_6.y = temp_6.y * 0.00781250000000000f;
sdata[sdataID] = temp_6;
sdataID = inoutID + 7;
sdataID = sharedStride * sdataID;
sdataID = sdataID + gl_LocalInvocationID.x;
temp_7.x = temp_7.x * 0.00781250000000000f;
temp_7.y = temp_7.y * 0.00781250000000000f;
sdata[sdataID] = temp_7;
} barrier();
if (((gl_GlobalInvocationID.x) / 128) % (1)+((gl_GlobalInvocationID.x) / 128) * (128) < 128) {
stageInvocationID = (gl_LocalInvocationID.y+ 0) % (8);
LUTId = stageInvocationID + 3;
temp_0 = sdata[sharedStride*(gl_LocalInvocationID.y+0)+gl_LocalInvocationID.x];
temp_1 = sdata[sharedStride*(gl_LocalInvocationID.y+16)+gl_LocalInvocationID.x];
temp_2 = sdata[sharedStride*(gl_LocalInvocationID.y+32)+gl_LocalInvocationID.x];
temp_3 = sdata[sharedStride*(gl_LocalInvocationID.y+48)+gl_LocalInvocationID.x];
temp_4 = sdata[sharedStride*(gl_LocalInvocationID.y+64)+gl_LocalInvocationID.x];
temp_5 = sdata[sharedStride*(gl_LocalInvocationID.y+80)+gl_LocalInvocationID.x];
temp_6 = sdata[sharedStride*(gl_LocalInvocationID.y+96)+gl_LocalInvocationID.x];
temp_7 = sdata[sharedStride*(gl_LocalInvocationID.y+112)+gl_LocalInvocationID.x];
w = twiddleLUT[LUTId];
loc_0.x = temp_4.x * w.x - temp_4.y * w.y;
loc_0.y = temp_4.y * w.x + temp_4.x * w.y;
temp_4.x = temp_0.x - loc_0.x;
temp_4.y = temp_0.y - loc_0.y;
temp_0.x = temp_0.x + loc_0.x;
temp_0.y = temp_0.y + loc_0.y;
loc_0.x = temp_5.x * w.x - temp_5.y * w.y;
loc_0.y = temp_5.y * w.x + temp_5.x * w.y;
temp_5.x = temp_1.x - loc_0.x;
temp_5.y = temp_1.y - loc_0.y;
temp_1.x = temp_1.x + loc_0.x;
temp_1.y = temp_1.y + loc_0.y;
loc_0.x = temp_6.x * w.x - temp_6.y * w.y;
loc_0.y = temp_6.y * w.x + temp_6.x * w.y;
temp_6.x = temp_2.x - loc_0.x;
temp_6.y = temp_2.y - loc_0.y;
temp_2.x = temp_2.x + loc_0.x;
temp_2.y = temp_2.y + loc_0.y;
loc_0.x = temp_7.x * w.x - temp_7.y * w.y;
loc_0.y = temp_7.y * w.x + temp_7.x * w.y;
temp_7.x = temp_3.x - loc_0.x;
temp_7.y = temp_3.y - loc_0.y;
temp_3.x = temp_3.x + loc_0.x;
temp_3.y = temp_3.y + loc_0.y;
w=twiddleLUT[LUTId+8];
loc_0.x = temp_2.x * w.x - temp_2.y * w.y;
loc_0.y = temp_2.y * w.x + temp_2.x * w.y;
temp_2.x = temp_0.x - loc_0.x;
temp_2.y = temp_0.y - loc_0.y;
temp_0.x = temp_0.x + loc_0.x;
temp_0.y = temp_0.y + loc_0.y;
loc_0.x = temp_3.x * w.x - temp_3.y * w.y;
loc_0.y = temp_3.y * w.x + temp_3.x * w.y;
temp_3.x = temp_1.x - loc_0.x;
temp_3.y = temp_1.y - loc_0.y;
temp_1.x = temp_1.x + loc_0.x;
temp_1.y = temp_1.y + loc_0.y;
iw.x = -w.y;
iw.y = w.x;
loc_0.x = temp_6.x * iw.x - temp_6.y * iw.y;
loc_0.y = temp_6.y * iw.x + temp_6.x * iw.y;
temp_6.x = temp_4.x - loc_0.x;
temp_6.y = temp_4.y - loc_0.y;
temp_4.x = temp_4.x + loc_0.x;
temp_4.y = temp_4.y + loc_0.y;
loc_0.x = temp_7.x * iw.x - temp_7.y * iw.y;
loc_0.y = temp_7.y * iw.x + temp_7.x * iw.y;
temp_7.x = temp_5.x - loc_0.x;
temp_7.y = temp_5.y - loc_0.y;
temp_5.x = temp_5.x + loc_0.x;
temp_5.y = temp_5.y + loc_0.y;
w=twiddleLUT[LUTId+16];
loc_0.x = temp_1.x * w.x - temp_1.y * w.y;
loc_0.y = temp_1.y * w.x + temp_1.x * w.y;
temp_1.x = temp_0.x - loc_0.x;
temp_1.y = temp_0.y - loc_0.y;
temp_0.x = temp_0.x + loc_0.x;
temp_0.y = temp_0.y + loc_0.y;
iw.x = -w.y;
iw.y = w.x;
loc_0.x = temp_3.x * iw.x - temp_3.y * iw.y;
loc_0.y = temp_3.y * iw.x + temp_3.x * iw.y;
temp_3.x = temp_2.x - loc_0.x;
temp_3.y = temp_2.y - loc_0.y;
temp_2.x = temp_2.x + loc_0.x;
temp_2.y = temp_2.y + loc_0.y;
iw.x = w.x * loc_SQRT1_2 - w.y * loc_SQRT1_2;
iw.y = w.y * loc_SQRT1_2 + w.x * loc_SQRT1_2;
loc_0.x = temp_5.x * iw.x - temp_5.y * iw.y;
loc_0.y = temp_5.y * iw.x + temp_5.x * iw.y;
temp_5.x = temp_4.x - loc_0.x;
temp_5.y = temp_4.y - loc_0.y;
temp_4.x = temp_4.x + loc_0.x;
temp_4.y = temp_4.y + loc_0.y;
w.x = -iw.y;
w.y = iw.x;
loc_0.x = temp_7.x * w.x - temp_7.y * w.y;
loc_0.y = temp_7.y * w.x + temp_7.x * w.y;
temp_7.x = temp_6.x - loc_0.x;
temp_7.y = temp_6.y - loc_0.y;
temp_6.x = temp_6.x + loc_0.x;
temp_6.y = temp_6.y + loc_0.y;
loc_0 = temp_1;
temp_1 = temp_4;
temp_4 = loc_0;
loc_0 = temp_3;
temp_3 = temp_6;
temp_6 = loc_0;
} barrier();
if (((gl_GlobalInvocationID.x) / 128) % (1)+((gl_GlobalInvocationID.x) / 128) * (128) < 128) {
stageInvocationID = gl_LocalInvocationID.y + 0;
blockInvocationID = stageInvocationID;
stageInvocationID = stageInvocationID % 8;
blockInvocationID = blockInvocationID - stageInvocationID;
inoutID = blockInvocationID * 8;
inoutID = inoutID + stageInvocationID;
sdataID = inoutID + 0;
sdataID = sharedStride * sdataID;
sdataID = sdataID + gl_LocalInvocationID.x;
sdata[sdataID] = temp_0;
sdataID = inoutID + 8;
sdataID = sharedStride * sdataID;
sdataID = sdataID + gl_LocalInvocationID.x;
sdata[sdataID] = temp_1;
sdataID = inoutID + 16;
sdataID = sharedStride * sdataID;
sdataID = sdataID + gl_LocalInvocationID.x;
sdata[sdataID] = temp_2;
sdataID = inoutID + 24;
sdataID = sharedStride * sdataID;
sdataID = sdataID + gl_LocalInvocationID.x;
sdata[sdataID] = temp_3;
sdataID = inoutID + 32;
sdataID = sharedStride * sdataID;
sdataID = sdataID + gl_LocalInvocationID.x;
temp_0.x = temp_0.x + loc_0.x;
temp_6 = sdata[sharedStride*(gl_LocalInvocationID.y+96)+gl_LocalInvocationID.x];
temp_7.x = temp_3.x - loc_0.x;
temp_real0 = 0;
temp_imag0 += kernel_obj[inoutID+0].y * temp_2.x - kernel_obj[inoutID+0].x * temp_2.y ;
temp_imag0 = 0;
temp_real0 += kernel_obj[inoutID+0].x * temp_5.x + kernel_obj[inoutID+0].y * temp_5.y;
temp_real0 = 0;
inoutID = ((gl_GlobalInvocationID.x) % (128)) + ((gl_LocalInvocationID.y+112)+((gl_GlobalInvocationID.x)/128)%(1)+((gl_GlobalInvocationID.x)/128)*(128)) * 128 + * 16384;
w=twiddleLUT[LUTId+2];
sdataID = sdataID + gl_LocalInvocationID.x;
sdataID = sdataID + gl_LocalInvocationID.x;
temp_4.y = temp_4.y * 0.00781250000000000f;
sdataID = inoutID + 6;
temp_1.y = temp_1.y + loc_0.y;
temp_2.y = temp_2.y + loc_0.y;
loc_0.y = temp_5.y * iw.x + temp_5.x * iw.y;
temp_6.y = temp_6.y + loc_0.y;
sdataID = sdataID + gl_LocalInvocationID.x;
sdataID = sdataID + gl_LocalInvocationID.x;
loc_0.x = temp_4.x * w.x - temp_4.y * w.y;
loc_0.x = temp_5.x * w.x - temp_5.y * w.y;
temp_2.y = temp_2.y + loc_0.y;
} sharedStride = 32;
outputs[inoutID] = temp_5;
My config is:
FFTSize is 128
configuration.FFTdim = 2;
configuration.size[0] = FFTSize;
configuration.size[1] = FFTSize;
configuration.size[2] = 1;
configuration.normalize = 1;
configuration.useLUT = 1;
configuration.performConvolution = 1;
configuration.conjugateConvolution = 1;
configuration.numberBatches = windowCount;
configuration.device = &vkGPU->device;
configuration.queue = &vkGPU->queue; //to allocate memory for LUT, we have to pass a queue, vkGPU->fence, commandPool and physicalDevice pointers
configuration.fence = &vkGPU->fence;
configuration.commandPool = &vkGPU->commandPool;
configuration.physicalDevice = &vkGPU->physicalDevice;
configuration.isCompilerInitialized = 1;//compiler can be initialized before VkFFT plan creation. if not, VkFFT will create and destroy one after initialization
VkFFTResult resFFT = VKFFT_SUCCESS;
bufferSize =
sizeof(float) * 2 * configuration.size[0] * configuration.size[1] * configuration.size[2] *
configuration.numberBatches;
VkDeviceMemory bufferDeviceMemory = {};
resFFT = allocateBuffer(vkGPU, &bufferSecondImages, &bufferDeviceMemory,
VK_BUFFER_USAGE_STORAGE_BUFFER_BIT | VK_BUFFER_USAGE_TRANSFER_SRC_BIT |
VK_BUFFER_USAGE_TRANSFER_DST_BIT, VK_MEMORY_HEAP_DEVICE_LOCAL_BIT,
bufferSize);
configuration.buffer = &bufferSecondImages;
configuration.bufferSize = &bufferSize;
configuration.keepShaderCode = 1;
configuration.kernel = &bufferFirstImages;
configuration.kernelSize = &bufferSize;
resFFT = initializeVkFFT(&app2, configuration);
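The bufferSize expression in the configuration above can be isolated into a small helper, which makes it easy to check the value before it is handed to the library. This is a sketch under the assumption of single-precision C2C data (two floats per element); the function name `c2c_buffer_size` is hypothetical.

```c
#include <stdint.h>

/* Bytes needed for a single-precision C2C buffer: two floats per complex
   element, times every dimension, times the number of batches.
   The result is 0 if any factor is 0, so callers can test for that case. */
static uint64_t c2c_buffer_size(uint64_t size0, uint64_t size1, uint64_t size2,
                                uint64_t number_batches)
{
    return (uint64_t)sizeof(float) * 2u * size0 * size1 * size2 * number_batches;
}
```

For the 128x128 case, one batch is 128 * 128 = 16384 complex elements (the stride that appears as `* 16384` in the generated shader), or 131072 bytes; four batches need 524288 bytes.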
I solved it.
As in #42, using coordinateFeatures did the trick.
instead of
configuration.numberBatches = windowCount;
I used
configuration.coordinateFeatures = windowCount;
@DTolm I can upload the changes that are needed to run VkFFT on Android. If you want to, please write me a message.
@Talsoake While this workaround works, the original issue still remains. I have fixed it in the next version (and also reworked the way dispatching of batched kernels is done, which will make it faster). I will close it with the next commit.
As for the Android changes, can you contact me by email about this (dtolm96@gmail.com)? It would be good to make the benchmark work on Android as a demonstration.
I tried your new version, but nothing changed. There is still an error when parsing the shader.
But the workaround with using coordinateFeatures instead of numberBatches still works.
The code should not be the same as before, as I have changed how numberBatches works at its core. Your configuration also works on my machine. Can you send me the full output of keepShaderCode again? Thank you.
Of course, here is the new shader code:
#version 450
layout (local_size_x = 16, local_size_y = 8, local_size_z = 1) in;
const float loc_PI = 3.1415926535897932384626433832795f;
const float loc_SQRT1_2 = 0.70710678118654752440084436210485f;
layout(push_constant) uniform PushConsts
{
uint workGroupShiftX;
uint workGroupShiftY;
uint workGroupShiftZ;
} consts;
layout(std430, binding = 0) buffer DataIn{
vec2 inputs[0];
};
layout(std430, binding = 1) buffer DataOut{
vec2 outputs[0];
};
layout(std430, binding = 2) readonly buffer DataLUT {
vec2 twiddleLUT[];
};
uint sharedStride = 130;
shared vec2 sdata[1088];
// sharedStride - fft size, gl_WorkGroupSize.y - grouped consecutive ffts
void main() {
vec2 temp_0;
vec2 temp_1;
vec2 temp_2;
vec2 temp_3;
vec2 temp_4;
vec2 temp_5;
vec2 temp_6;
vec2 temp_7;
vec2 w;
vec2 loc_0;
vec2 iw;
uint stageInvocationID;
uint blockInvocationID;
uint sdataID;
uint combinedID;
uint inoutID;
uint LUTId=0;
{
combinedID = (gl_LocalInvocationID.x + 16 * gl_LocalInvocationID.y) + 0;
inoutID = (combinedID % 128) + (combinedID / 128) * 128;
inoutID = (inoutID) + gl_WorkGroupID.y * 1024 + (gl_GlobalInvocationID.z / 1) * 16384;
sdata[(combinedID % 128) + (combinedID / 128) * sharedStride] = inputs[inoutID];
combinedID = (gl_LocalInvocationID.x + 16 * gl_LocalInvocationID.y) + 128;
inoutID = (combinedID % 128) + (combinedID / 128) * 128;
inoutID = (inoutID) + gl_WorkGroupID.y * 1024 + (gl_GlobalInvocationID.z / 1) * 16384;
sdata[(combinedID % 128) + (combinedID / 128) * sharedStride] = inputs[inoutID];
combinedID = (gl_LocalInvocationID.x + 16 * gl_LocalInvocationID.y) + 256;
inoutID = (combinedID % 128) + (combinedID / 128) * 128;
inoutID = (inoutID) + gl_WorkGroupID.y * 1024 + (gl_GlobalInvocationID.z / 1) * 16384;
sdata[(combinedID % 128) + (combinedID / 128) * sharedStride] = inputs[inoutID];
combinedID = (gl_LocalInvocationID.x + 16 * gl_LocalInvocationID.y) + 384;
inoutID = (combinedID % 128) + (combinedID / 128) * 128;
inoutID = (inoutID) + gl_WorkGroupID.y * 1024 + (gl_GlobalInvocationID.z / 1) * 16384;
sdata[(combinedID % 128) + (combinedID / 128) * sharedStride] = inputs[inoutID];
combinedID = (gl_LocalInvocationID.x + 16 * gl_LocalInvocationID.y) + 512;
inoutID = (combinedID % 128) + (combinedID / 128) * 128;
inoutID = (inoutID) + gl_WorkGroupID.y * 1024 + (gl_GlobalInvocationID.z / 1) * 16384;
sdata[(combinedID % 128) + (combinedID / 128) * sharedStride] = inputs[inoutID];
combinedID = (gl_LocalInvocationID.x + 16 * gl_LocalInvocationID.y) + 640;
inoutID = (combinedID % 128) + (combinedID / 128) * 128;
inoutID = (inoutID) + gl_WorkGroupID.y * 1024 + (gl_GlobalInvocationID.z / 1) * 16384;
sdata[(combinedID % 128) + (combinedID / 128) * sharedStride] = inputs[inoutID];
combinedID = (gl_LocalInvocationID.x + 16 * gl_LocalInvocationID.y) + 768;
inoutID = (combinedID % 128) + (combinedID / 128) * 128;
inoutID = (inoutID) + gl_WorkGroupID.y * 1024 + (gl_GlobalInvocationID.z / 1) * 16384;
sdata[(combinedID % 128) + (combinedID / 128) * sharedStride] = inputs[inoutID];
combinedID = (gl_LocalInvocationID.x + 16 * gl_LocalInvocationID.y) + 896;
inoutID = (combinedID % 128) + (combinedID / 128) * 128;
inoutID = (inoutID) + gl_WorkGroupID.y * 1024 + (gl_GlobalInvocationID.z / 1) * 16384;
sdata[(combinedID % 128) + (combinedID / 128) * sharedStride] = inputs[inoutID];
}
barrier();
stageInvocationID = (gl_LocalInvocationID.x+ 0) % (1);
LUTId = stageInvocationID + 0;
sdataID = gl_LocalInvocationID.x + 0;
sdataID = sdataID + sharedStride * gl_LocalInvocationID.y;
temp_0 = sdata[sdataID];
sdataID = gl_LocalInvocationID.x + 16;
sdataID = sdataID + sharedStride * gl_LocalInvocationID.y;
temp_1 = sdata[sdataID];
sdataID = gl_LocalInvocationID.x + 32;
sdataID = sdataID + sharedStride * gl_LocalInvocationID.y;
temp_2 = sdata[sdataID];
sdataID = gl_LocalInvocationID.x + 48;
sdataID = sdataID + sharedStride * gl_LocalInvocationID.y;
temp_3 = sdata[sdataID];
sdataID = gl_LocalInvocationID.x + 64;
sdataID = sdataID + sharedStride * gl_LocalInvocationID.y;
temp_4 = sdata[sdataID];
sdataID = gl_LocalInvocationID.x + 80;
sdataID = sdataID + sharedStride * gl_LocalInvocationID.y;
temp_5 = sdata[sdataID];
sdataID = gl_LocalInvocationID.x + 96;
sdataID = sdataID + sharedStride * gl_LocalInvocationID.y;
temp_6 = sdata[sdataID];
sdataID = gl_LocalInvocationID.x + 112;
sdataID = sdataID + sharedStride * gl_LocalInvocationID.y;
temp_7 = sdata[sdataID];
w = twiddleLUT[LUTId];
loc_0.x = temp_4.x * w.x - temp_4.y * w.y;
loc_0.y = temp_4.y * w.x + temp_4.x * w.y;
temp_4.x = temp_0.x - loc_0.x;
temp_4.y = temp_0.y - loc_0.y;
temp_0.x = temp_0.x + loc_0.x;
temp_0.y = temp_0.y + loc_0.y;
loc_0.x = temp_5.x * w.x - temp_5.y * w.y;
loc_0.y = temp_5.y * w.x + temp_5.x * w.y;
temp_5.x = temp_1.x - loc_0.x;
temp_5.y = temp_1.y - loc_0.y;
temp_1.x = temp_1.x + loc_0.x;
temp_1.y = temp_1.y + loc_0.y;
loc_0.x = temp_6.x * w.x - temp_6.y * w.y;
loc_0.y = temp_6.y * w.x + temp_6.x * w.y;
temp_6.x = temp_2.x - loc_0.x;
temp_6.y = temp_2.y - loc_0.y;
temp_2.x = temp_2.x + loc_0.x;
temp_2.y = temp_2.y + loc_0.y;
loc_0.x = temp_7.x * w.x - temp_7.y * w.y;
loc_0.y = temp_7.y * w.x + temp_7.x * w.y;
temp_7.x = temp_3.x - loc_0.x;
temp_7.y = temp_3.y - loc_0.y;
temp_3.x = temp_3.x + loc_0.x;
temp_3.y = temp_3.y + loc_0.y;
w=twiddleLUT[LUTId+1];
loc_0.x = temp_2.x * w.x - temp_2.y * w.y;
loc_0.y = temp_2.y * w.x + temp_2.x * w.y;
temp_2.x = temp_0.x - loc_0.x;
temp_2.y = temp_0.y - loc_0.y;
temp_0.x = temp_0.x + loc_0.x;
temp_0.y = temp_0.y + loc_0.y;
loc_0.x = temp_3.x * w.x - temp_3.y * w.y;
loc_0.y = temp_3.y * w.x + temp_3.x * w.y;
temp_3.x = temp_1.x - loc_0.x;
temp_3.y = temp_1.y - loc_0.y;
temp_1.x = temp_1.x + loc_0.x;
temp_1.y = temp_1.y + loc_0.y;
iw.x = -w.y;
iw.y = w.x;
loc_0.x = temp_6.x * iw.x - temp_6.y * iw.y;
loc_0.y = temp_6.y * iw.x + temp_6.x * iw.y;
temp_6.x = temp_4.x - loc_0.x;
temp_6.y = temp_4.y - loc_0.y;
temp_4.x = temp_4.x + loc_0.x;
temp_4.y = temp_4.y + loc_0.y;
loc_0.x = temp_7.x * iw.x - temp_7.y * iw.y;
loc_0.y = temp_7.y * iw.x + temp_7.x * iw.y;
temp_7.x = temp_5.x - loc_0.x;
temp_7.y = temp_5.y - loc_0.y;
temp_5.x = temp_5.x + loc_0.x;
temp_5.y = temp_5.y + loc_0.y;
w=twiddleLUT[LUTId+2];
loc_0.x = temp_1.x * w.x - temp_1.y * w.y;
loc_0.y = temp_1.y * w.x + temp_1.x * w.y;
temp_1.x = temp_0.x - loc_0.x;
temp_1.y = temp_0.y - loc_0.y;
temp_0.x = temp_0.x + loc_0.x;
temp_0.y = temp_0.y + loc_0.y;
iw.x = -w.y;
iw.y = w.x;
loc_0.x = temp_3.x * iw.x - temp_3.y * iw.y;
loc_0.y = temp_3.y * iw.x + temp_3.x * iw.y;
temp_3.x = temp_2.x - loc_0.x;
temp_3.y = temp_2.y - loc_0.y;
temp_2.x = temp_2.x + loc_0.x;
temp_2.y = temp_2.y + loc_0.y;
iw.x = w.x * loc_SQRT1_2 - w.y * loc_SQRT1_2;
iw.y = w.y * loc_SQRT1_2 + w.x * loc_SQRT1_2;
loc_0.x = temp_5.x * iw.x - temp_5.y * iw.y;
loc_0.y = temp_5.y * iw.x + temp_5.x * iw.y;
temp_5.x = temp_4.x - loc_0.x;
temp_5.y = temp_4.y - loc_0.y;
temp_4.x = temp_4.x + loc_0.x;
temp_4.y = temp_4.y + loc_0.y;
w.x = -iw.y;
w.y = iw.x;
loc_0.x = temp_7.x * w.x - temp_7.y * w.y;
loc_0.y = temp_7.y * w.x + temp_7.x * w.y;
temp_7.x = temp_6.x - loc_0.x;
temp_7.y = temp_6.y - loc_0.y;
temp_6.x = temp_6.x + loc_0.x;
temp_6.y = temp_6.y + loc_0.y;
loc_0 = temp_1;
temp_1 = temp_4;
temp_4 = loc_0;
loc_0 = temp_3;
temp_3 = temp_6;
temp_6 = loc_0;
barrier();
stageInvocationID = gl_LocalInvocationID.x + 0;
blockInvocationID = stageInvocationID;
stageInvocationID = stageInvocationID % 1;
blockInvocationID = blockInvocationID - stageInvocationID;
inoutID = blockInvocationID * 8;
inoutID = inoutID + stageInvocationID;
sdataID = inoutID + 0;
sharedStride = 136;
sdataID = (sdataID / 16) * 17 + sdataID % 16;
combinedID = gl_LocalInvocationID.y * sharedStride;
sdataID = sdataID + combinedID;
temp_0.x = temp_0.x * 0.00781250000000000f;
temp_0.y = temp_0.y * 0.00781250000000000f;
sdata[sdataID] = temp_0;
sdataID = inoutID + 1;
sdataID = (sdataID / 16) * 17 + sdataID % 16;
combinedID = gl_LocalInvocationID.y * sharedStride;
sdataID = sdataID + combinedID;
temp_1.x = temp_1.x * 0.00781250000000000f;
temp_1.y = temp_1.y * 0.00781250000000000f;
sdata[sdataID] = temp_1;
sdataID = inoutID + 2;
sdataID = (sdataID / 16) * 17 + sdataID % 16;
combinedID = gl_LocalInvocationID.y * sharedStride;
sdataID = sdataID + combinedID;
temp_2.x = temp_2.x * 0.00781250000000000f;
temp_2.y = temp_2.y * 0.00781250000000000f;
sdata[sdataID] = temp_2;
sdataID = inoutID + 3;
sdataID = (sdataID / 16) * 17 + sdataID % 16;
combinedID = gl_LocalInvocationID.y * sharedStride;
sdataID = sdataID + combinedID;
temp_3.x = temp_3.x * 0.00781250000000000f;
temp_3.y = temp_3.y * 0.00781250000000000f;
sdata[sdataID] = temp_3;
sdataID = inoutID + 4;
sdataID = (sdataID / 16) * 17 + sdataID % 16;
combinedID = gl_LocalInvocationID.y * sharedStride;
sdataID = sdataID + combinedID;
temp_4.x = temp_4.x * 0.00781250000000000f;
temp_4.y = temp_4.y * 0.00781250000000000f;
sdata[sdataID] = temp_4;
sdataID = inoutID + 5;
sdataID = (sdataID / 16) * 17 + sdataID % 16;
combinedID = gl_LocalInvocationID.y * sharedStride;
sdataID = sdataID + combinedID;
temp_5.x = temp_5.x * 0.00781250000000000f;
temp_5.y = temp_5.y * 0.00781250000000000f;
sdata[sdataID] = temp_5;
sdataID = inoutID + 6;
sdataID = (sdataID / 16) * 17 + sdataID % 16;
combinedID = gl_LocalInvocationID.y * sharedStride;
sdataID = sdataID + combinedID;
temp_6.x = temp_6.x * 0.00781250000000000f;
temp_6.y = temp_6.y * 0.00781250000000000f;
sdata[sdataID] = temp_6;
sdataID = inoutID + 7;
sdataID = (sdataID / 16) * 17 + sdataID % 16;
combinedID = gl_LocalInvocationID.y * sharedStride;
sdataID = sdataID + combinedID;
temp_7.x = temp_7.x * 0.00781250000000000f;
temp_7.y = temp_7.y * 0.00781250000000000f;
sdata[sdataID] = temp_7;
barrier();
stageInvocationID = (gl_LocalInvocationID.x+ 0) % (8);
LUTId = stageInvocationID + 3;
sdataID = gl_LocalInvocationID.x + 0;
sdataID = (sdataID / 16) * 17 + sdataID % 16;
sdataID = sdataID + sharedStride * gl_LocalInvocationID.y;
temp_0 = sdata[sdataID];
sdataID = gl_LocalInvocationID.x + 16;
sdataID = (sdataID / 16) * 17 + sdataID % 16;
sdataID = sdataID + sharedStride * gl_LocalInvocationID.y;
temp_1 = sdata[sdataID];
sdataID = gl_LocalInvocationID.x + 32;
sdataID = (sdataID / 16) * 17 + sdataID % 16;
sdataID = sdataID + sharedStride * gl_LocalInvocationID.y;
temp_2 = sdata[sdataID];
sdataID = gl_LocalInvocationID.x + 48;
sdataID = (sdataID / 16) * 17 + sdataID % 16;
sdataID = sdataID + sharedStride * gl_LocalInvocationID.y;
temp_3 = sdata[sdataID];
sdataID = gl_LocalInvocationID.x + 64;
sdataID = (sdataID / 16) * 17 + sdataID % 16;
sdataID = sdataID + sharedStride * gl_LocalInvocationID.y;
temp_4 = sdata[sdataID];
sdataID = gl_LocalInvocationID.x + 80;
sdataID = (sdataID / 16) * 17 + sdataID % 16;
sdataID = sdataID + sharedStride * gl_LocalInvocationID.y;
temp_5 = sdata[sdataID];
sdataID = gl_LocalInvocationID.x + 96;
sdataID = (sdataID / 16) * 17 + sdataID % 16;
sdataID = sdataID + sharedStride * gl_LocalInvocationID.y;
temp_6 = sdata[sdataID];
sdataID = gl_LocalInvocationID.x + 112;
sdataID = (sdataID / 16) * 17 + sdataID % 16;
sdataID = sdataID + sharedStride * gl_LocalInvocationID.y;
temp_7 = sdata[sdataID];
w = twiddleLUT[LUTId];
loc_0.x = temp_4.x * w.x - temp_4.y * w.y;
loc_0.y = temp_4.y * w.x + temp_4.x * w.y;
temp_4.x = temp_0.x - loc_0.x;
temp_4.y = temp_0.y - loc_0.y;
temp_0.x = temp_0.x + loc_0.x;
temp_0.y = temp_0.y + loc_0.y;
loc_0.x = temp_5.x * w.x - temp_5.y * w.y;
loc_0.y = temp_5.y * w.x + temp_5.x * w.y;
temp_5.x = temp_1.x - loc_0.x;
temp_5.y = temp_1.y - loc_0.y;
temp_1.x = temp_1.x + loc_0.x;
temp_1.y = temp_1.y + loc_0.y;
loc_0.x = temp_6.x * w.x - temp_6.y * w.y;
loc_0.y = temp_6.y * w.x + temp_6.x * w.y;
temp_6.x = temp_2.x - loc_0.x;
temp_6.y = temp_2.y - loc_0.y;
temp_2.x = temp_2.x + loc_0.x;
temp_2.y = temp_2.y + loc_0.y;
loc_0.x = temp_7.x * w.x - temp_7.y * w.y;
loc_0.y = temp_7.y * w.x + temp_7.x * w.y;
temp_7.x = temp_3.x - loc_0.x;
temp_7.y = temp_3.y - loc_0.y;
temp_3.x = temp_3.x + loc_0.x;
temp_3.y = temp_3.y + loc_0.y;
w=twiddleLUT[LUTId+8];
loc_0.x = temp_2.x * w.x - temp_2.y * w.y;
loc_0.y = temp_2.y * w.x + temp_2.x * w.y;
temp_2.x = temp_0.x - loc_0.x;
temp_2.y = temp_0.y - loc_0.y;
temp_0.x = temp_0.x + loc_0.x;
temp_0.y = temp_0.y + loc_0.y;
loc_0.x = temp_3.x * w.x - temp_3.y * w.y;
loc_0.y = temp_3.y * w.x + temp_3.x * w.y;
temp_3.x = temp_1.x - loc_0.x;
temp_3.y = temp_1.y - loc_0.y;
temp_1.x = temp_1.x + loc_0.x;
temp_1.y = temp_1.y + loc_0.y;
iw.x = -w.y;
iw.y = w.x;
loc_0.x = temp_6.x * iw.x - temp_6.y * iw.y;
loc_0.y = temp_6.y * iw.x + temp_6.x * iw.y;
temp_6.x = temp_4.x - loc_0.x;
temp_6.y = temp_4.y - loc_0.y;
temp_4.x = temp_4.x + loc_0.x;
temp_4.y = temp_4.y + loc_0.y;
loc_0.x = temp_7.x * iw.x - temp_7.y * iw.y;
loc_0.y = temp_7.y * iw.x + temp_7.x * iw.y;
temp_7.x = temp_5.x - loc_0.x;
temp_7.y = temp_5.y - loc_0.y;
temp_5.x = temp_5.x + loc_0.x;
temp_5.y = temp_5.y + loc_0.y;
w=twiddleLUT[LUTId+16];
loc_0.x = temp_1.x * w.x - temp_1.y * w.y;
loc_0.y = temp_1.y * w.x + temp_1.x * w.y;
temp_1.x = temp_0.x - loc_0.x;
temp_1.y = temp_0.y - loc_0.y;
temp_0.x = temp_0.x + loc_0.x;
temp_0.y = temp_0.y + loc_0.y;
iw.x = -w.y;
iw.y = w.x;
loc_0.x = temp_3.x * iw.x - temp_3.y * iw.y;
loc_0.y = temp_3.y * iw.x + temp_3.x * iw.y;
temp_3.x = temp_2.x - loc_0.x;
temp_3.y = temp_2.y - loc_0.y;
temp_2.x = temp_2.x + loc_0.x;
temp_2.y = temp_2.y + loc_0.y;
iw.x = w.x * loc_SQRT1_2 - w.y * loc_SQRT1_2;
iw.y = w.y * loc_SQRT1_2 + w.x * loc_SQRT1_2;
loc_0.x = temp_5.x * iw.x - temp_5.y * iw.y;
loc_0.y = temp_5.y * iw.x + temp_5.x * iw.y;
loc_0.y = temp_7.y * w.x + temp_7.x * w.y;
temp_7.y = temp_6.y - loc_0.y;
temp_4 = loc_0;
sdataID = sdataID + combinedID;
sdata[sdataID] = temp_5;
sdataID = gl_LocalInvocationID.x + 64;
sdataID = sdataID + sharedStride * gl_LocalInvocationID.y;
loc_0.x = temp_4.x * w.x - temp_4.y * w.y;
temp_2 = sdata[sdataID];
w = twiddleLUT[LUTId];
temp_2.x = temp_2.x + loc_0.x;
loc_0.y = temp_7.y * w.x + temp_7.x * w.y;
temp_7.y = temp_3.y - loc_0.y;
inoutID = blockInvocationID * 2;
sdataID = sdataID + combinedID;
sdata[sdataID] = temp_5;
sdata[sdataID] = temp_6;
blockInvocationID = stageInvocationID;
sdataID = sdataID + combinedID;
The issue seems to be that the value of bufferSize, which is passed by pointer, is zero. While VkFFT checks that the pointer is non-null, it does not check that the value it points to is non-zero. I will add that check in the next update. May I ask you to double-check the bufferSize value, as I am still not able to reproduce the error?
Yes, you're right,
it was a problem with bufferSize, but it was my fault. After I updated my source with your changes, I still computed bufferSize from coordinateFeatures instead of numberBatches.
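The kind of check described here can be sketched as follows: validate the pointer and the value behind it separately, so that a zero size is reported instead of silently producing zero-length arrays such as `vec2 inputs[0]` in the generated shader. The enum and function names are hypothetical and are not VkFFT's actual result codes or API.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result codes for this sketch; they are not VkFFT's codes. */
typedef enum {
    SIZE_OK = 0,
    SIZE_ERROR_NULL_POINTER,
    SIZE_ERROR_ZERO_VALUE
} size_check_result;

/* Reject both a missing pointer and a pointer to a zero size. */
static size_check_result check_buffer_size(const uint64_t *buffer_size)
{
    if (buffer_size == NULL)
        return SIZE_ERROR_NULL_POINTER;
    if (*buffer_size == 0)
        return SIZE_ERROR_ZERO_VALUE;
    return SIZE_OK;
}
```

Running such a check in application code before plan creation would have flagged the zero bufferSize in this thread early.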
So now everything is working fine with numberBatches.
Thanks
This is good to hear!
In case you encounter other issues - feel free to share.
So can the readme file be updated to mention Android support is available? I plan to use this for a game with a mid-spec PowerVR GPU. Thanks for writing this library.
@einthusan
I have not tested on Android myself yet; this is mentioned in the future plans section. That said, the code should work on any Vulkan-capable platform.
@Talsoake any sample on how you integrated vkFFT on Android to share?
Hi, I'm currently developing an Android App which needs to calculate FFTs.
Until now I used OpenCV for this. In an attempt to speed up the FFT on Android devices, I was able to integrate your library so that it works on most devices, and the results are quite close to those of the implementation provided by OpenCV.
But I noticed differences which were a little too big in my opinion. I got an avg. difference of 0.06 on a 128x128 system, which is quite a bit bigger than the results in your precision tests. I also noticed differences between mobile GPUs: a Snapdragon 855+ (Adreno 640 GPU) gave an avg. difference of 0.01, while an Exynos 990 (Mali G77) gave the 0.06 avg. difference.
Is this difference because of the devices or do you think I misconfigured something?
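The post does not say how the average difference was computed. One plausible definition, shown here purely as an assumption, is the mean absolute difference over all interleaved real and imaginary values of the two outputs; the function name `mean_abs_diff` is hypothetical.

```c
#include <stddef.h>

/* Mean absolute difference between two arrays of n floats. For complex data,
   pass the interleaved re/im values and n = 2 * number of elements.
   The sum is accumulated in double so the metric itself adds little
   rounding error. Returns 0.0 for n == 0. */
static double mean_abs_diff(const float *a, const float *b, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i) {
        double d = (double)a[i] - (double)b[i];
        sum += d < 0.0 ? -d : d;
    }
    return n ? sum / (double)n : 0.0;
}
```

An absolute metric like this grows with the magnitude of the FFT output, so comparing it against published precision figures only makes sense if both use the same input range and normalization.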
Setup notes: the values are uniform random numbers generated by OpenCV and then passed to a modified version of sample 0. The calculation is C2C with the imaginary part set to 0; I haven't used R2C for now because I didn't really understand how to align the data properly.
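The C2C setup described in these notes amounts to interleaving each real sample with a zero imaginary part. A minimal sketch, assuming the usual re/im interleaved float layout; the function name `pack_real_as_c2c` is hypothetical.

```c
#include <stddef.h>

/* Copy n real samples into an interleaved complex buffer of 2 * n floats:
   out[2*i] holds the real part and out[2*i + 1] the (zero) imaginary part. */
static void pack_real_as_c2c(const float *real, float *out, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        out[2 * i] = real[i];
        out[2 * i + 1] = 0.0f;
    }
}
```

This doubles the memory footprint compared with a real-to-complex transform, which is the cost of sidestepping the R2C data layout.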