fftw_plan_guru64_split_dft() fails sporadically with imported wisdom

qnxor commented 7 years ago

EDIT: The behaviour actually exists on both Windows and Linux, but not always. It works fine for some data lengths, and it fails for others. See my 3rd post below. EDIT 2: After more testing, it seems that the separation between the real and imaginary pointers matters, see my 4th post below. Is the wisdom tied to the separation between the real and imaginary pointers?

I'm using 3.3.6-pl1 compiled with mingw64 and gcc 5.3.0, but I tested 3.3.4, 3.3.5 and several gcc versions. I'm using fftw_plan_guru64_split_dft(..., FFTW_MEASURE) and virtually everything runs fine, output is correct, and I can also export the wisdom and then import it successfully (import returns 1), but after the successful wisdom import:

if I run fftw_plan_guru64_split_dft(..., FFTW_WISDOM_ONLY | FFTW_MEASURE) then the planner returns NULL, failing to create a plan, but it doesn't do this all the time; it works for some data size and for others it fails. See my posts below.
if I run fftw_plan_guru64_split_dft(..., FFTW_MEASURE) then the planner ignores the imported wisdom, which can be clearly detected as it takes just as much time as the first time.

I have no such issues when using the basic interface instead of the guru interface (but in that case the real and imaginary pointers are always 1 apart). Is this a known issue or does it ring any bells? If not I'll try to provide test code to reproduce it (the current code is big)

qnxor commented 7 years ago

Test code to reproduce it. Run first ./wisdomtest.exe to export wisdom then run ./wisdomtest.exe import to import the previously computed wisdom and use it in the planner. The planner fails here (EDIT: not always! see post below). Removing FFTW_WISDOM_ONLY just causes the planner to ignore the imported wisdom.

wisdomtest.cpp (default data size is 2048-by-2048):

#include "fftw3.h"
#include <stdlib.h>
#include <string.h>
#include <omp.h>

#define die(cond,...) {if(cond){printf(__VA_ARGS__);exit(1);}}

double* init(long n) {
    return (double*)calloc(n,sizeof(double));  // zeros, replacing with fftw_malloc() doesn't help
}

int main(int argc, char* argv[]) {
    long n = 2048;
    unsigned wisdom = FFTW_MEASURE;
    //die(!fftw_init_threads(), "Could not initialize FFTW multi-threading.");
    //fftw_plan_with_nthreads(omp_get_max_threads());
    for (int k=1; k<argc; ++k)
        if (!strcasecmp("import", argv[k])) {
            die(!fftw_import_wisdom_from_filename("wisdom.txt"), "Invalid wisdom file.\n");
            wisdom |= FFTW_WISDOM_ONLY;
        }
        else if (long _n = atol(argv[k]))
            n = _n;
    double* xre = init(n*n);
    double* xim = init(n*n);
    double* yre = init(n*n);
    double* yim = init(n*n);
    int rank = 2;
    fftw_iodim64* dims = (fftw_iodim64*)malloc(rank*sizeof(fftw_iodim64));
    dims[0].n = n;
    dims[0].is = dims[0].os = 1;
    dims[1].n = n;
    dims[1].is = dims[1].os = n;
    int howmany = 0;
    fftw_iodim64* howmany_dims = NULL;
    //int howmany = 1;
    //fftw_iodim64* howmany_dims = (fftw_iodim64*)malloc(rank*sizeof(fftw_iodim64));
    //howmany_dims[0].n = 1;
    //howmany_dims[0].is = howmany_dims[0].os = 1;
    printf("n=%ld, wisdom=%u, %s: ", n, wisdom, (wisdom & FFTW_WISDOM_ONLY) ? "imported" : "measure");
    fftw_plan p = fftw_plan_guru64_split_dft(rank, dims, howmany,
        howmany_dims, xre, xim, yre, yim, wisdom);
    printf("%s\n", p ? "ok" : "plan failed!");
    if (!p) exit(1);
    fftw_execute_split_dft(p, xre, xim, yre, yim);
    fftw_export_wisdom_to_filename("wisdom.txt");
    fftw_destroy_plan(p);
    //fftw_cleanup_threads();
    return 0;
}

Compiled with static libs as g++ -static-libgcc -static-libstdc++ wisdomtest.cpp -o wisdomtest.exe libfftw3.a. I tested the above in win8.1 x64 on an intel i7-3770k with mingw64 x86_64-5.3.0-release-posix-seh-rt_v4-rev0. libfftw3 was compiled with ./configure --enable-sse2 --enable-avx --enable-threads --enable-openmp --with-pic --with-our-malloc (the --with-our-malloc was needed as compilation of kalloc.c complains without it).

qnxor commented 7 years ago

I did more tests on both WIndows and Linux. On Windows I also tested the DLL downloaded from the website.

What I discovered is that it works for some data sizes n but not for others. I updated the sample code above to specify the data size from the command line. I tested n=4,8,16,32,64,128,256,512,1024,2048. Execute first wisdomtest <n> (to plan and export wisdom) then wisdom <n> import.

Without multithreading, in Windows it works for n=4,64,128 but fails for the rest. In Linux it works for all except n=8,16.

With multithreading (uncomment the top two and bottom commented lines and compile with -fopenmp -lgomp) it works for n=16,32,64,128 and fails for the rest.

qnxor commented 7 years ago

I have a horrible feeling that it has to do with the separation between the real and imaginary pointers. I'm aware of the SIMD restrictions and fftw_malloc, but is the computed wisdom tied to the separation between the real and imaginary pointers? Because if I make the imaginary array follow immediately the real array then it always works! i.e. if I replace the non-consecutive allocation of real & imag pointers:

    double* xre = init(n*n);
    double* xim = init(n*n);
    double* yre = init(n*n);
    double* yim = init(n*n);

with a consecutive allocation of real & imag pointers (note xim and yim):

    double *xre = init(2*n*n);
    double *xim = xre + n*n;
    double *yre = init(2*n*n);
    double *yim = yre + n*n;

... then it always works!

Isn't this a bug? Is the wisdom really tied to original separation between real+imag arrays?

The alignment of the 4 individual arrays is not an issue here. I tried fftw_malloc() instead of malloc() and the issue is still there. I also checked fftw_alignment_of() for each of the 4 arrays which returns 0 for all 4 of them all the time, suggesting that they are all aligned in the same way.

I can also confirm that the planner fails sporadically if either the input or the output pairs have non-consecutive real+imag arrays (both the input and the output need to have exactly consecutive real+imag arrays to always work).

My use case is such that I can't control the separation between the real and imag pointers :( ... I would need to create copies for both input and output arrays, which is costy (I deal with very large data).

I did read this doc page about reusing an already computed plan, which requires the distance between the real and imag pointers to stay the same, but it doesn't say anything about whether using previously computed wisdom also requires the same condition.

Please be a bug ...

qnxor commented 7 years ago

Looking at the exported wisdom for the consecutive vs non-consecutive memory allocation cases (see my post above) for two runs of the test code with the same parameters, I see major differences in the non-consecutive case of the wisdom. See here

Two runs of non-consecutive allocation: https://www.diffchecker.com/SbNhlKej

Two runs of consecutive allocation: https://www.diffchecker.com/5QW1APxU

stevengj commented 7 years ago

The wisdom, like the plan, is specific to the memory offset between the real and imaginary parts: http://fftw.org/doc/New_002darray-Execute-Functions.html#New_002darray-Execute-Functions

It should be possible to relax this somewhat.

qnxor commented 7 years ago

I feared that would be the case (hadn't yet looked at the fftw code so I was still hopeful). I had read that page, which says nothing about the wisdom but only about reusing the plan.

I think it should be releaxed because otherwise importing wisdom is of limited use for the split case.

stevengj commented 7 years ago

One way to work around this is to allocate your real and imaginary parts, for an array of length N, like this:

double *xre = fftw_malloc(sizeof(double) * N * 2);
double *xim = xre + N;

This way, the offset between the real and imaginary parts will be the same in every run and stored wisdom will be applicable.

If N is even, both xre and xim will also be 16-byte aligned for SIMD. If you need odd N as well, then you could do:

double *xre = fftw_malloc(sizeof(double) * (N * 2 + 1));
double *xim = xre + N + (N % 2); /* add N%2 to ensure 16-byte alignment */

qnxor commented 7 years ago

@stevengj thanks for replying. I think I had tried that and it was still prone to the same issue, but I don't recall 100% (the discussion above is 8+ months old). The best thing would be to fix this in the fftw library itself, as you suggested above, by relaxing the constraints of the wisdom.

stevengj commented 7 years ago

(You also have to replace calloc with fftw_malloc for consistent alignment.)

FFTW / fftw3

fftw_plan_guru64_split_dft() fails sporadically with imported wisdom #88