Open danra opened 7 years ago
I also get a segfault when trying to use the:
p = ffts_init_2d_real(n, n, sign);
plan, no matter how big I allocate these buffers.
I modified the test.c file to use the plan ffts_init_2d_real
.
martijn@Martijns-MacBook-Pro ~/ffts/build (master)$ lldb -o r -- ./ffts_test 512 -1
(lldb) target create "./ffts_test"
Current executable set to './ffts_test' (x86_64).
(lldb) settings set -- target.run-args "512" "-1"
(lldb) r
ffts_test was compiled with optimization - stepping may behave oddly; variables may not be available.
Process 56122 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=EXC_I386_GPFLT)
frame #0: 0x0000000100002765 ffts_test`ffts_execute_1d_real(p=<unavailable>, input=<unavailable>, output=0x0000000101d00808) at ffts_real.c:190 [opt]
187 __m128 t3 = _mm_load_ps(A + i);
188 __m128 t4 = _mm_load_ps(B + i);
189
-> 190 _mm_store_ps(out + i, _mm_add_ps(_mm_addsub_ps(
191 _mm_mul_ps(t1, _mm_moveldup_ps(t3)),
192 _mm_mul_ps(_mm_shuffle_ps(t1, t1, _MM_SHUFFLE(2,3,0,1)),
193 _mm_movehdup_ps(t3))), _mm_addsub_ps(
Process 56122 launched: './ffts_test' (x86_64)
bind: Invalid command `rl_complete'.
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=EXC_I386_GPFLT)
* frame #0: 0x0000000100002765 ffts_test`ffts_execute_1d_real(p=<unavailable>, input=<unavailable>, output=0x0000000101d00808) at ffts_real.c:190 [opt]
frame #1: 0x0000000100003000 ffts_test`ffts_execute_nd_real(p=0x0000000100600140, in=0x0000000101900800, out=0x0000000101b00000) at ffts_real_nd.c:96 [opt]
frame #2: 0x0000000100000d3d ffts_test`main(argc=<unavailable>, argv=<unavailable>) at test.c:162 [opt]
frame #3: 0x0000000100113235 libdyld.dylib`start + 1
(lldb)
Try replacing _mm_store_ps with _mm_storeu_ps as there is no guarantee that the memory is correctly aligned.
Check the pointers: in=0x0000000101900800, out=0x0000000101b00000
. Looks fairly aligned to me. Are you sure your comment still holds?
However, I tried what you said, and now it crashes here:
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=EXC_I386_GPFLT)
frame #0: 0x000000010000349e ffts_test`ffts_transpose(in=<unavailable>, out=0x0000000101200000, w=257, h=<unavailable>) at ffts_transpose.c:75 [opt]
72 op[x*TSIZE] = ip[x];
73 */
74 __m128d q0 = _mm_load_pd((double*)(ip0 + 0*w));
-> 75 __m128d q1 = _mm_load_pd((double*)(ip0 + 1*w));
76 __m128d q2 = _mm_load_pd((double*)(ip0 + 2*w));
77 __m128d q3 = _mm_load_pd((double*)(ip0 + 3*w));
78 __m128d q4 = _mm_load_pd((double*)(ip0 + 4*w));
Process 56308 launched: './ffts_test' (x86_64)
bind: Invalid command `rl_complete'.
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=EXC_I386_GPFLT)
* frame #0: 0x000000010000349e ffts_test`ffts_transpose(in=<unavailable>, out=0x0000000101200000, w=257, h=<unavailable>) at ffts_transpose.c:75 [opt]
frame #1: 0x0000000100003032 ffts_test`ffts_execute_nd_real(p=0x0000000100502b20, in=<unavailable>, out=0x0000000101200000) at ffts_real_nd.c:99 [opt]
frame #2: 0x0000000100000d3d ffts_test`main(argc=<unavailable>, argv=<unavailable>) at test.c:162 [opt]
frame #3: 0x0000000100113235 libdyld.dylib`start + 1
(lldb)
The value of w
seems to be 257
(which looks like 2^8+1), and looks indeed something that screws with the alignment required.
Also, the hardcoded double
datatype cast for a "single-precision" compiled library looks odd. I am a complete newbie to this library tho.
I have done very little testing with 2D or higher ranking transforms. But most of problems are related with alignment, as intrinsics used require strict 128 bit align.
Double datatype casting is used here to simplify following _mm_shuffle_pd instrinsics. We could have used _mm_load_ps (or _mm_loadu_ps) and then cast using _mm_castps_pd to _m128d datatype,
You could test with my benchFFTS, https://github.com/linkotec/benchFFTS. It supports tests for 2D, for example "bench_ffts -s ocb512x128"
Some thing of course: (note the orb
vs ocb
)
martijn@Martijns-MacBook-Pro ~/benchFFTS/build (master)$ ./bench_ffts -s ocb512x128
Problem: ocb512x128, setup: 317.00 us, time: 410.53 us, ``mflops'': 12771
martijn@Martijns-MacBook-Pro ~/benchFFTS/build (master)$ ./bench_ffts -s orb512x128
[1] 93048 segmentation fault ./bench_ffts -s orb512x128
martijn@Martijns-MacBook-Pro ~/benchFFTS/build (master)$
LLDB gives the same as previously:
martijn@Martijns-MacBook-Pro ~/benchFFTS/build (master)$ lldb -o run -- ./bench_ffts -s orb512x128 [ruby-2.0.0p648]
(lldb) target create "./bench_ffts"
Current executable set to './bench_ffts' (x86_64).
(lldb) settings set -- target.run-args "-s" "orb512x128"
(lldb) run
bench_ffts was compiled with optimization - stepping may behave oddly; variables may not be available.
Process 93163 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=EXC_I386_GPFLT)
frame #0: 0x00000001000036ce bench_ffts`ffts_transpose(in=<unavailable>, out=0x0000000101042000, w=65, h=<unavailable>) at ffts_transpose.c:75 [opt]
72 op[x*TSIZE] = ip[x];
73 */
74 __m128d q0 = _mm_load_pd((double*)(ip0 + 0*w));
-> 75 __m128d q1 = _mm_load_pd((double*)(ip0 + 1*w));
76 __m128d q2 = _mm_load_pd((double*)(ip0 + 2*w));
77 __m128d q3 = _mm_load_pd((double*)(ip0 + 3*w));
78 __m128d q4 = _mm_load_pd((double*)(ip0 + 4*w));
Process 93163 launched: './bench_ffts' (x86_64)
(lldb)
It looks like the problem is due to the "half spectrum" thing, which causes you to have N/2 + 1 coefficients. The constant coefficient is the one causing the +1 I believe. This is why the w
is 64+1 = 65
, and the transpose
failing on this misaligned data. Note that I did not investigate this, but this is just what pops into my head, with my knowledge of fourier transforms.
I think you have found the problem. At the moment I don't have the time to fix it. One option is that you use 2D complex transform with imaginary part set to zero, so all input value are real numbers. It is slower but at least it works.
Nice! 😄 Well, speed is a very important aspect of my application. So for now, I will use the complex transform. Once fixed, I might change it back to the real_to_complex_2d
transform. However, if you believe the fix is fairly easy and can outline it, I might consider trying to fix this (with PR off course).
Hi, did anyone work on this? I'd like to use the 2d real transform too, would be great if it works...
Should also support double-precision