ffts_real only works for single-precision

danra commented 7 years ago

Should also support double-precision

mcourteaux commented 7 years ago

I also get a segfault when trying to use the:

p = ffts_init_2d_real(n, n, sign);

plan, no matter how big I allocate these buffers.

mcourteaux commented 7 years ago

I modified the test.c file to use the plan ffts_init_2d_real.

martijn@Martijns-MacBook-Pro ~/ffts/build (master)$ lldb -o r -- ./ffts_test 512 -1
(lldb) target create "./ffts_test"
Current executable set to './ffts_test' (x86_64).
(lldb) settings set -- target.run-args  "512" "-1"
(lldb) r
ffts_test was compiled with optimization - stepping may behave oddly; variables may not be available.
Process 56122 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=EXC_I386_GPFLT)
    frame #0: 0x0000000100002765 ffts_test`ffts_execute_1d_real(p=<unavailable>, input=<unavailable>, output=0x0000000101d00808) at ffts_real.c:190 [opt]
   187              __m128 t3 = _mm_load_ps(A + i);
   188              __m128 t4 = _mm_load_ps(B + i);
   189
-> 190              _mm_store_ps(out + i, _mm_add_ps(_mm_addsub_ps(
   191                  _mm_mul_ps(t1, _mm_moveldup_ps(t3)),
   192                  _mm_mul_ps(_mm_shuffle_ps(t1, t1, _MM_SHUFFLE(2,3,0,1)),
   193                  _mm_movehdup_ps(t3))), _mm_addsub_ps(

Process 56122 launched: './ffts_test' (x86_64)
bind: Invalid command `rl_complete'.
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=EXC_I386_GPFLT)
  * frame #0: 0x0000000100002765 ffts_test`ffts_execute_1d_real(p=<unavailable>, input=<unavailable>, output=0x0000000101d00808) at ffts_real.c:190 [opt]
    frame #1: 0x0000000100003000 ffts_test`ffts_execute_nd_real(p=0x0000000100600140, in=0x0000000101900800, out=0x0000000101b00000) at ffts_real_nd.c:96 [opt]
    frame #2: 0x0000000100000d3d ffts_test`main(argc=<unavailable>, argv=<unavailable>) at test.c:162 [opt]
    frame #3: 0x0000000100113235 libdyld.dylib`start + 1
(lldb)

linkotec commented 7 years ago

Try replacing _mm_store_ps with _mm_storeu_ps as there is no guarantee that the memory is correctly aligned.

mcourteaux commented 7 years ago

Check the pointers: in=0x0000000101900800, out=0x0000000101b00000. Looks fairly aligned to me. Are you sure your comment still holds?

mcourteaux commented 7 years ago

However, I tried what you said, and now it crashes here:

* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=EXC_I386_GPFLT)
    frame #0: 0x000000010000349e ffts_test`ffts_transpose(in=<unavailable>, out=0x0000000101200000, w=257, h=<unavailable>) at ffts_transpose.c:75 [opt]
   72                      op[x*TSIZE] = ip[x];
   73                   */
   74                   __m128d q0 = _mm_load_pd((double*)(ip0 + 0*w));
-> 75                   __m128d q1 = _mm_load_pd((double*)(ip0 + 1*w));
   76                   __m128d q2 = _mm_load_pd((double*)(ip0 + 2*w));
   77                   __m128d q3 = _mm_load_pd((double*)(ip0 + 3*w));
   78                   __m128d q4 = _mm_load_pd((double*)(ip0 + 4*w));

Process 56308 launched: './ffts_test' (x86_64)
bind: Invalid command `rl_complete'.
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=EXC_I386_GPFLT)
  * frame #0: 0x000000010000349e ffts_test`ffts_transpose(in=<unavailable>, out=0x0000000101200000, w=257, h=<unavailable>) at ffts_transpose.c:75 [opt]
    frame #1: 0x0000000100003032 ffts_test`ffts_execute_nd_real(p=0x0000000100502b20, in=<unavailable>, out=0x0000000101200000) at ffts_real_nd.c:99 [opt]
    frame #2: 0x0000000100000d3d ffts_test`main(argc=<unavailable>, argv=<unavailable>) at test.c:162 [opt]
    frame #3: 0x0000000100113235 libdyld.dylib`start + 1
(lldb)

The value of w seems to be 257 (which looks like 2^8+1), and looks indeed something that screws with the alignment required.

mcourteaux commented 7 years ago

Also, the hardcoded double datatype cast for a "single-precision" compiled library looks odd. I am a complete newbie to this library tho.

linkotec commented 7 years ago

I have done very little testing with 2D or higher ranking transforms. But most of problems are related with alignment, as intrinsics used require strict 128 bit align.

Double datatype casting is used here to simplify following _mm_shuffle_pd instrinsics. We could have used _mm_load_ps (or _mm_loadu_ps) and then cast using _mm_castps_pd to _m128d datatype,

You could test with my benchFFTS, https://github.com/linkotec/benchFFTS. It supports tests for 2D, for example "bench_ffts -s ocb512x128"

mcourteaux commented 7 years ago

Some thing of course: (note the orb vs ocb)

martijn@Martijns-MacBook-Pro ~/benchFFTS/build (master)$ ./bench_ffts -s ocb512x128
Problem: ocb512x128, setup: 317.00 us, time: 410.53 us, ``mflops'': 12771
martijn@Martijns-MacBook-Pro ~/benchFFTS/build (master)$ ./bench_ffts -s orb512x128
[1]    93048 segmentation fault  ./bench_ffts -s orb512x128
martijn@Martijns-MacBook-Pro ~/benchFFTS/build (master)$

LLDB gives the same as previously:

martijn@Martijns-MacBook-Pro ~/benchFFTS/build (master)$ lldb -o run -- ./bench_ffts -s orb512x128                                                                  [ruby-2.0.0p648]
(lldb) target create "./bench_ffts"
Current executable set to './bench_ffts' (x86_64).
(lldb) settings set -- target.run-args  "-s" "orb512x128"
(lldb) run
bench_ffts was compiled with optimization - stepping may behave oddly; variables may not be available.
Process 93163 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=EXC_I386_GPFLT)
    frame #0: 0x00000001000036ce bench_ffts`ffts_transpose(in=<unavailable>, out=0x0000000101042000, w=65, h=<unavailable>) at ffts_transpose.c:75 [opt]
   72                      op[x*TSIZE] = ip[x];
   73                   */
   74                   __m128d q0 = _mm_load_pd((double*)(ip0 + 0*w));
-> 75                   __m128d q1 = _mm_load_pd((double*)(ip0 + 1*w));
   76                   __m128d q2 = _mm_load_pd((double*)(ip0 + 2*w));
   77                   __m128d q3 = _mm_load_pd((double*)(ip0 + 3*w));
   78                   __m128d q4 = _mm_load_pd((double*)(ip0 + 4*w));

Process 93163 launched: './bench_ffts' (x86_64)
(lldb)

mcourteaux commented 7 years ago

It looks like the problem is due to the "half spectrum" thing, which causes you to have N/2 + 1 coefficients. The constant coefficient is the one causing the +1 I believe. This is why the w is 64+1 = 65, and the transpose failing on this misaligned data. Note that I did not investigate this, but this is just what pops into my head, with my knowledge of fourier transforms.

jtojanen commented 7 years ago

I think you have found the problem. At the moment I don't have the time to fix it. One option is that you use 2D complex transform with imaginary part set to zero, so all input value are real numbers. It is slower but at least it works.

mcourteaux commented 7 years ago

Nice! 😄 Well, speed is a very important aspect of my application. So for now, I will use the complex transform. Once fixed, I might change it back to the real_to_complex_2d transform. However, if you believe the fix is fairly easy and can outline it, I might consider trying to fix this (with PR off course).

emmenlau commented 2 years ago

Hi, did anyone work on this? I'd like to use the 2d real transform too, would be great if it works...

anthonix / ffts

ffts_real only works for single-precision #59