Difference between OpenMP and sequential

apraga commented 7 years ago

Hi,

Disclaimer: I've already sent a mail at fftw (at) fftw.org , but I figured it would be more useful here. Also, I've modified a bit the code (the plan is now private). So I apologize for double posting.

I'm comparing the real DFT of a 3D array in Fortran, in OpenMP and sequential. In some cases (1 over 1000), the results are different. Below is the parallel version :

  subroutine test_real_3D_threads(input, output)
    double precision ::  input(N, N, N)
    double complex :: output(N, N, N)
    integer*8 :: plan
    integer :: iret, i
    integer :: id, nb_threads

    call dfftw_plan_with_nthreads(omp_get_max_threads())
!$OMP PARALLEL SHARED(input, output) PRIVATE(plan)
!$OMP CRITICAL
    call dfftw_plan_dft_r2c_3d(plan, N,N,N, input, output, FFTW_MEASURE)
!$OMP END CRITICAL

    call init(input)
    ! Don't forget to initialize
    output(:,:,:) = 0

    call dfftw_execute_dft_r2c(plan, input, output)

    !print *, "fft", output(1:2,1:2, 1:2)
!$OMP CRITICAL
    call dfftw_destroy_plan(plan)
!$OMP END CRITICAL
!$OMP END PARALLEL

end subroutine

I'm using FFTW 3.3.6 compiled with --enable-openmp and the program is compiled and linked with:

gfortran  -g -fbacktrace -fopenmp fftw3_fortran.f90 -c
gfortran -g -fbacktrace -fopenmp  fftw3_fortran.o -o fftw3_fortran -lfftw3 -lfftw3_omp -lm

Attached is the whole source code for a reproductible example (rename as .txt) comparing the sequential and parallel version. fftw3_fortran.txt. The tests are run 1000 times with the bash loop:

 nb_threads=8; for i in `seq 1 1000`; do OMP_NUM_THREADS=$nb_threads ./fftw3_fortran; if [ $? -ne 0 ]; then print "Error"; break; fi; done

Thanks for your help !

dfarns commented 7 years ago

FFTW_MEASURE does not guarantee bit reproducibility

apraga commented 7 years ago

Thank you for your quick answer. In my case, the relative error can climb to 290% (!) between the parallel and sequential version, even with FFTW_ESTIMATE. It would be surprising the difference came from bit reproducibility in my opinion.

matteo-frigo commented 7 years ago

What do you mean by "relative error"?

You cannot compute the relative errors of individual output elements. The error bound of FFT algorithms is only in the total norm of the output, something like ||OUT1-OUT0||/||OUT0|| < ERROR_BOUND, where ||.|| is the L2 norm (square root of the sum of the squares). There is no bound on ||OUT1[i]-OUT0[i]||/||OUT0[i]||.

stevengj commented 7 years ago

To expand on Matteo's comment, suppose that the exact output of the FFT is supposed to be (1,0,0). Because of roundoff errors, however, we might get (1-2e-15,1e-15,-3e-15). Another FFT algorithm, e.g. a different plan or a parallel plan, might get (1+1e-15,2e-15,1e-15). If you compare individual elements, e.g. -3e-15 to 1e-15, you might conclude that there is a huge error (factor of 3). But if you look at the root-mean-square (L2 norm) difference, then in both cases it is quite small compared to the root-mean-square of the expected vector (1,0,0).

apraga commented 7 years ago

You're totally right, sorry about the silly mistake.

However, computing the L2 norm sometimes results in very large norm. For example, with an array of 9x9x9, the L2 norm can be 3000. Maybe I'm still missing something... I've included the updated reproducible example here: fftw3_fortran.txt.

matteo-frigo commented 7 years ago

Your program works for me. What exactly did you do and what did you expect to happen?

apraga commented 7 years ago

I expect the program to return a L2 norm between the sequential and parallel version less than 1e-11 (computated are in double precision). In the actual code, thousands of FFT are computed. To replicate this situation, I used the small example above to ensure the parallel version always give the appropriate result. So i ran the program a thousand times (thus computing the same DFT a thousand times) with:

 for i in `seq 1 1000`; do OMP_NUM_THREADS=4 ./fftw3_fortran; if [ $? -ne 0 ]; then print "error"; break; fi; done

In these runs, there is at last one occurence of a L2 norm on the order of hundreds or even thousands. I have tested it on Ubuntu (8 threads, 1 per core) and Archlinux (4 cores, 1 thread per core).

apraga commented 6 years ago

Hi, I'm bumping this issue a bit as I've found no solution. At the moment, I'm not sure FFTW is "safe" to use with OpenMP as this tests gives different results in sequential and in parallel. It is most likely an issue with the test but I don't see it. @matteo-frigo What did you mean by "it works for you" ? Thanks.

FFTW / fftw3

Difference between OpenMP and sequential #97