flatironinstitute / finufft

Non-uniform fast Fourier transform library of types 1,2,3 in dimensions 1,2,3

Seemingly random memory-related errors while running FINUFFT inside of OpenMP parallel section #346

Closed silkskier closed 11 months ago

silkskier commented 1 year ago

I've experienced some weird threading-related issues while trying to use a FINUFFT function inside an OpenMP parallel region.

The application consists of three function 'layers' called from the parallel region. My main function runs a wrapper of an NFFT-based periodogram in a parallel loop:

    finufft_opts* opts = new finufft_opts;
    finufftf_default_opts(opts);
    opts->nthreads = 1;
    opts->modeord = 1;

    …

    #pragma omp parallel for
    for (unsigned int i = 0; i < file_count; i++) {
        auto [frequency, amplitude, max_power] = periodogram(grid, nfft_grid, files[i], opts);

        if (filter(frequency, max_power, min_power, filter_range, amplitude, min_amplitude, max_amplitude)) {
            #pragma omp critical
            { // enter critical section to write to the file
                out.print(fmt::format("{:}\t{:.9f}\t{:.4f}\t{:.3f}\t{:.3f}\n", files[i], frequency, 1/frequency, amplitude, max_power));
                out.flush();
            }
        }

        #pragma omp critical
        { filesComputed += 1; }
    }

Then there is the wrapper function that selects the periodogram and processes the results. The part passing the arguments to the deepest function looks like this:

    nfftls_b(data, nfftgrid, grid, opts);

Within the last function, the NUFFT is called through an inline function, twice in succession:

    inline void fastsum_b(float *x, std::complex<float> *y, float *f, std::complex<float> *out, unsigned int n, unsigned int nk, int dk, int shift, finufft_opts* opts){
        int ier = finufftf1d1(n, &x[0], &y[0], +1, 1e-4, 2 * dk, &out[0], opts);
    }
…
    fastsum_b(ts, wy, nfftgrid.freq.data(), Sh_Ch, n, nk, dk, shift, opts);
    for (unsigned int i = 0; i < n; ++i) { ts[i] *= 2; }
    fastsum_b(ts, w, nfftgrid.freq.data(), S2_C2, n, nk, dk, shift, opts);

With '#pragma omp parallel' commented out, the code seems to work just fine; with it enabled, however, and despite the library being compiled with -DFFTW_PLAN_SAFE, it induces random memory-related crashes. Four of them happen regularly: 'malloc(): unaligned tcache chunk detected', 'terminated by signal SIGSEGV (Address boundary error)', 'malloc(): unsorted double linked list corrupted' and 'double free or corruption (fasttop)/(!prev)'.

The loop does sometimes complete a few iterations (up to 10) before crashing, so the issue doesn't always manifest instantly.

EDIT: I've tried changing the number of worker threads, and the number of iterations before a crash seems to be strongly correlated with the thread count: with 2 threads it usually crashes around the 100th iteration, while with 12 it rarely finishes even one.

Putting the finufftf1d1 library calls inside an OpenMP critical or master section doesn't change anything either; in that case, though, the code simply stops executing (without crashing, while the CPU load drops to almost 0%), so the application has to be terminated manually.

GDB also doesn't give any meaningful information about the crashes themselves:

Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff35ff6c0 (LWP 825196)]
[New Thread 0x7ffff2dfe6c0 (LWP 825197)]
[New Thread 0x7ffff25fd6c0 (LWP 825198)]
[New Thread 0x7ffff1dfc6c0 (LWP 825199)]
[New Thread 0x7ffff15fb6c0 (LWP 825200)]
[New Thread 0x7ffff0dfa6c0 (LWP 825201)]
[New Thread 0x7ffff05f96c0 (LWP 825202)]
[New Thread 0x7fffefdf86c0 (LWP 825203)]
[New Thread 0x7fffef5f76c0 (LWP 825204)]
[New Thread 0x7fffeedf66c0 (LWP 825205)]
[New Thread 0x7fffee5f56c0 (LWP 825206)]
[New Thread 0x7fffeddf46c0 (LWP 825207)]
corrupted size vs. prev_size

Thread 1 "glspeaks" received signal SIGABRT, Aborted.
__pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at ./nptl/pthread_kill.c:44 

I've tried passing everything both by value and by reference, but that doesn't seem to change anything either.

I'm not sure if I'm doing anything wrong, but the way I'm calling the functions doesn't seem any different from the thread-safe examples, so I suspect this may be caused by some hidden bug in the library itself and not by a fault in my code (though I'm still not sure of that).

silkskier commented 1 year ago

Update:

By writing the body of 'fastsum_b' directly in place (instead of calling it through the inline function) and reducing the optimization level from -Ofast to -O3, I've managed to almost make it work with 2 threads.

AddressSanitizer has also shown that all the issues are caused by a double free; here is one of the logs:

==836931==ERROR: AddressSanitizer: attempting double-free on 0x6040000d3a10 in thread T4:
0.03% complete, 339.07 seconds left.    #0 0x5624f25a8128 in __interceptor_free.part.0 (/home/kacper/Pulpit/glspeaks/glspeaks+0xcc128) (BuildId: 8b0c390af31e6e0029fe2cf778dd44182516b745)
    #1 0x7fc077ce7277 in htab_insert (/lib/x86_64-linux-gnu/libfftw3f.so.3+0x16277) (BuildId: 25899f30cd71dcdcce36b4e8763c2fe7f8e0b147)
    #2 0x7fc077ce8494 in mkplan (/lib/x86_64-linux-gnu/libfftw3f.so.3+0x17494) (BuildId: 25899f30cd71dcdcce36b4e8763c2fe7f8e0b147)
    #3 0x7fc077ce866e in fftwf_mkplan_d (/lib/x86_64-linux-gnu/libfftw3f.so.3+0x1766e) (BuildId: 25899f30cd71dcdcce36b4e8763c2fe7f8e0b147)
    #4 0x7fc077cf310b in mkplan (/lib/x86_64-linux-gnu/libfftw3f.so.3+0x2210b) (BuildId: 25899f30cd71dcdcce36b4e8763c2fe7f8e0b147)
    #5 0x7fc077ce7d3d in search0 (/lib/x86_64-linux-gnu/libfftw3f.so.3+0x16d3d) (BuildId: 25899f30cd71dcdcce36b4e8763c2fe7f8e0b147)
    #6 0x7fc077ce7fbd in mkplan (/lib/x86_64-linux-gnu/libfftw3f.so.3+0x16fbd) (BuildId: 25899f30cd71dcdcce36b4e8763c2fe7f8e0b147)
    #7 0x7fc077ce866e in fftwf_mkplan_d (/lib/x86_64-linux-gnu/libfftw3f.so.3+0x1766e) (BuildId: 25899f30cd71dcdcce36b4e8763c2fe7f8e0b147)
    #8 0x7fc077cefa72 in mkplan (/lib/x86_64-linux-gnu/libfftw3f.so.3+0x1ea72) (BuildId: 25899f30cd71dcdcce36b4e8763c2fe7f8e0b147)
    #9 0x7fc077ce7d3d in search0 (/lib/x86_64-linux-gnu/libfftw3f.so.3+0x16d3d) (BuildId: 25899f30cd71dcdcce36b4e8763c2fe7f8e0b147)
    #10 0x7fc077ce7fbd in mkplan (/lib/x86_64-linux-gnu/libfftw3f.so.3+0x16fbd) (BuildId: 25899f30cd71dcdcce36b4e8763c2fe7f8e0b147)
    #11 0x7fc077dd73ec in fftwf_mkapiplan (/lib/x86_64-linux-gnu/libfftw3f.so.3+0x1063ec) (BuildId: 25899f30cd71dcdcce36b4e8763c2fe7f8e0b147)
    #12 0x7fc077de15eb in fftwf_plan_many_dft (/lib/x86_64-linux-gnu/libfftw3f.so.3+0x1105eb) (BuildId: 25899f30cd71dcdcce36b4e8763c2fe7f8e0b147)
    #13 0x5624f26b5c0a in finufftf_makeplan (/home/kacper/Pulpit/glspeaks/glspeaks+0x1d9c0a) (BuildId: 8b0c390af31e6e0029fe2cf778dd44182516b745)
    #14 0x5624f26baa1c in finufft::common::invokeGuruInterface(int, int, int, long, float*, float*, float*, std::complex<float>*, int, float, long*, long, float*, float*, float*, std::complex<float>*, finufft_opts*) (/home/kacper/Pulpit/glspeaks/glspeaks+0x1dea1c) (BuildId: 8b0c390af31e6e0029fe2cf778dd44182516b745)
    #15 0x5624f26bac18 in finufftf1d1 (/home/kacper/Pulpit/glspeaks/glspeaks+0x1dec18) (BuildId: 8b0c390af31e6e0029fe2cf778dd44182516b745)
    #16 0x5624f26125e7 in nfftls_b(star const&, Grid, Grid const&, finufft_opts*&) /home/kacper/Pulpit/glspeaks/source/batch_cpu/GLS/NFFTLS.hpp:75
    #17 0x5624f261e121 in periodogram(Grid const&, Grid&, std::filesystem::__cxx11::path, finufft_opts*&) /home/kacper/Pulpit/glspeaks/source/batch_cpu/periodogram_b.hpp:29
    #18 0x5624f262aec0 in main_batch(int, char**) [clone ._omp_fn.0] /home/kacper/Pulpit/glspeaks/source/batch_cpu/main_b.hpp:130
    #19 0x7fc07851836d  (/lib/x86_64-linux-gnu/libgomp.so.1+0x1f36d) (BuildId: 3fd8675529e36fe0ed668405353f3a19627bde6c)
    #20 0x7fc0766a63eb in start_thread nptl/pthread_create.c:444
    #21 0x7fc076726a1b in clone3 ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

0x6040000d3a10 is located 0 bytes inside of 48-byte region [0x6040000d3a10,0x6040000d3a40)
freed by thread T0 here:
    #0 0x5624f25a8128 in __interceptor_free.part.0 (/home/kacper/Pulpit/glspeaks/glspeaks+0xcc128) (BuildId: 8b0c390af31e6e0029fe2cf778dd44182516b745)
    #1 0x7fc077ce7277 in htab_insert (/lib/x86_64-linux-gnu/libfftw3f.so.3+0x16277) (BuildId: 25899f30cd71dcdcce36b4e8763c2fe7f8e0b147)
    #2 0x7ce62d02a5e18f  (<unknown module>)

previously allocated by thread T0 here:
    #0 0x5624f25a945f in malloc (/home/kacper/Pulpit/glspeaks/glspeaks+0xcd45f) (BuildId: 8b0c390af31e6e0029fe2cf778dd44182516b745)
    #1 0x7fc077ce5054 in fftwf_malloc_plain (/lib/x86_64-linux-gnu/libfftw3f.so.3+0x14054) (BuildId: 25899f30cd71dcdcce36b4e8763c2fe7f8e0b147)

Thread T4 created by T0 here:
    #0 0x5624f2518b16 in pthread_create (/home/kacper/Pulpit/glspeaks/glspeaks+0x3cb16) (BuildId: 8b0c390af31e6e0029fe2cf778dd44182516b745)
    #1 0x7fc07851896e  (/lib/x86_64-linux-gnu/libgomp.so.1+0x1f96e) (BuildId: 3fd8675529e36fe0ed668405353f3a19627bde6c)
    #2 0x7fc07850eaa0 in GOMP_parallel (/lib/x86_64-linux-gnu/libgomp.so.1+0x15aa0) (BuildId: 3fd8675529e36fe0ed668405353f3a19627bde6c)

SUMMARY: AddressSanitizer: double-free (/home/kacper/Pulpit/glspeaks/glspeaks+0xcc128) (BuildId: 8b0c390af31e6e0029fe2cf778dd44182516b745) in __interceptor_free.part.0
==836931==ABORTING

I'll try compiling the library manually using make instead of the CPM package manager; I guess that may solve the issue.

blackwer commented 1 year ago

Try writing this in such a way that each OpenMP thread gets its own plan. There are internal arrays inside the plan that are not thread safe, so this will never work reliably with only one plan.
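Schematically, something like the following. This is a rough sketch only: it assumes the guru interface, a fixed mode count 2*dk, and borrows your variable names (dk, n, x, file_count); as it turns out later in this thread, plan creation and destruction may still need serializing, since FFTW's planner is shared global state.

    #include <complex>
    #include <vector>
    #include <finufft.h>

    #pragma omp parallel
    {
        finufft_opts opts;
        finufftf_default_opts(&opts);
        opts.nthreads = 1;                       // one single-threaded transform per OMP thread
        opts.modeord = 1;

        int64_t nmodes[3] = {2 * dk, 1, 1};      // fixed mode count, as in your fastsum_b
        finufftf_plan plan;
        #pragma omp critical                     // FFTW's planner mutates shared global state
        finufftf_makeplan(1, 1, nmodes, +1, 1, 1e-4f, &plan, &opts);

        std::vector<std::complex<float>> c(n), F(2 * dk);   // per-thread buffers

        #pragma omp for
        for (unsigned int i = 0; i < file_count; i++) {
            // load file i's NU points into x and strengths into c, then reuse this thread's plan
            finufftf_setpts(plan, n, x, nullptr, nullptr, 0, nullptr, nullptr, nullptr);
            finufftf_execute(plan, c.data(), F.data());
            // ... filtering / output as before ...
        }

        #pragma omp critical                     // plan destruction touches the planner too
        finufftf_destroy(plan);
    }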

silkskier commented 1 year ago

Try writing this in such a way that each OpenMP thread gets its own plan. There are internal arrays inside the plan that are not thread safe, so this will never work reliably with only one plan.

Thanks for the reply. I've also tried something like that, splitting the parallel and for pragmas like this:

    #pragma omp parallel
    {
        finufft_opts* opts = new finufft_opts;
        finufftf_default_opts(opts);
        opts->nthreads = 1;
        opts->modeord = 1;

        #pragma omp for
        for (unsigned int i = 0; i < file_count; i++) {
            auto [frequency, amplitude, max_power] = periodogram(grid, nfft_grid, files[i], opts);

            if (filter(frequency, max_power, min_power, filter_range, amplitude, min_amplitude, max_amplitude)) {
                #pragma omp critical
                { // enter critical section to write to the file
                    out.print(fmt::format("{:}\t{:.9f}\t{:.4f}\t{:.3f}\t{:.3f}\n", files[i], frequency, 1/frequency, amplitude, max_power));
                    out.flush();
                }
            }

            #pragma omp critical
            { filesComputed += 1; }
        }
    }

It also didn't help. Now the application just stops executing at 0% with the sanitizer off, and returns something like this with the sanitizer on:

==13730==ERROR: AddressSanitizer: FPE on unknown address 0x7fc5af42738d (pc 0x7fc5af42738d bp 0x000000000000 sp 0x7fc5a8539fe8 T3)
0.03% complete, 340.61 seconds left.    #0 0x7fc5af42738d in htab_lookup (/usr/local/lib/libfftw3f.so.3+0x2738d) (BuildId: 1b17b32ceac6864458a4eb7600afb7e066a0b966)
    #1 0x7fc5af42817e in mkplan (/usr/local/lib/libfftw3f.so.3+0x2817e) (BuildId: 1b17b32ceac6864458a4eb7600afb7e066a0b966)
    #2 0x7fc5af4286be in fftwf_mkplan_d (/usr/local/lib/libfftw3f.so.3+0x286be) (BuildId: 1b17b32ceac6864458a4eb7600afb7e066a0b966)
    #3 0x7fc5af42fc9e in mkplan (/usr/local/lib/libfftw3f.so.3+0x2fc9e) (BuildId: 1b17b32ceac6864458a4eb7600afb7e066a0b966)
    #4 0x7fc5af427d8d in search0 (/usr/local/lib/libfftw3f.so.3+0x27d8d) (BuildId: 1b17b32ceac6864458a4eb7600afb7e066a0b966)
    #5 0x7fc5af42800d in mkplan (/usr/local/lib/libfftw3f.so.3+0x2800d) (BuildId: 1b17b32ceac6864458a4eb7600afb7e066a0b966)
    #6 0x7fc5af4286be in fftwf_mkplan_d (/usr/local/lib/libfftw3f.so.3+0x286be) (BuildId: 1b17b32ceac6864458a4eb7600afb7e066a0b966)
    #7 0x7fc5af42fb42 in mkplan (/usr/local/lib/libfftw3f.so.3+0x2fb42) (BuildId: 1b17b32ceac6864458a4eb7600afb7e066a0b966)
    #8 0x7fc5af427d8d in search0 (/usr/local/lib/libfftw3f.so.3+0x27d8d) (BuildId: 1b17b32ceac6864458a4eb7600afb7e066a0b966)
    #9 0x7fc5af42800d in mkplan (/usr/local/lib/libfftw3f.so.3+0x2800d) (BuildId: 1b17b32ceac6864458a4eb7600afb7e066a0b966)
    #10 0x7fc5af51798c in fftwf_mkapiplan (/usr/local/lib/libfftw3f.so.3+0x11798c) (BuildId: 1b17b32ceac6864458a4eb7600afb7e066a0b966)
    #11 0x7fc5af521db7 in fftwf_plan_many_dft (/usr/local/lib/libfftw3f.so.3+0x121db7) (BuildId: 1b17b32ceac6864458a4eb7600afb7e066a0b966)
    #12 0x55c0f4f47a01 in finufftf_makeplan (/home/kacper/Pulpit/glspeaks/glspeaks+0x213a01) (BuildId: 5ee7c7e6f6d572edb905b4f57b4f719075cf44bb)
    #13 0x55c0f4f49985 in finufft::common::invokeGuruInterface(int, int, int, long, float*, float*, float*, std::complex<float>*, int, float, long*, long, float*, float*, float*, std::complex<float>*, finufft_opts*) (/home/kacper/Pulpit/glspeaks/glspeaks+0x215985) (BuildId: 5ee7c7e6f6d572edb905b4f57b4f719075cf44bb)
    #14 0x55c0f4f49b2e in finufftf1d1 (/home/kacper/Pulpit/glspeaks/glspeaks+0x215b2e) (BuildId: 5ee7c7e6f6d572edb905b4f57b4f719075cf44bb)
    #15 0x55c0f4e5bd44 in nfftls_b(star const&, Grid, Grid const&, finufft_opts*) (/home/kacper/Pulpit/glspeaks/glspeaks+0x127d44) (BuildId: 5ee7c7e6f6d572edb905b4f57b4f719075cf44bb)
    #16 0x55c0f4e61d20 in periodogram(Grid const&, Grid&, std::filesystem::__cxx11::path, finufft_opts*) (/home/kacper/Pulpit/glspeaks/glspeaks+0x12dd20) (BuildId: 5ee7c7e6f6d572edb905b4f57b4f719075cf44bb)
    #17 0x55c0f4e8d0f0 in main_batch(int, char**) [clone ._omp_fn.0] (/home/kacper/Pulpit/glspeaks/glspeaks+0x1590f0) (BuildId: 5ee7c7e6f6d572edb905b4f57b4f719075cf44bb)
    #18 0x7fc5b10c836d  (/lib/x86_64-linux-gnu/libgomp.so.1+0x1f36d) (BuildId: 3fd8675529e36fe0ed668405353f3a19627bde6c)
    #19 0x7fc5aeea63eb in start_thread nptl/pthread_create.c:444
    #20 0x7fc5aef26a1b in clone3 ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: FPE (/usr/local/lib/libfftw3f.so.3+0x2738d) (BuildId: 1b17b32ceac6864458a4eb7600afb7e066a0b966) in htab_lookup
Thread T3 created by T0 here:
    #0 0x55c0f4d763d6 in pthread_create (/home/kacper/Pulpit/glspeaks/glspeaks+0x423d6) (BuildId: 5ee7c7e6f6d572edb905b4f57b4f719075cf44bb)
    #1 0x7fc5b10c896e  (/lib/x86_64-linux-gnu/libgomp.so.1+0x1f96e) (BuildId: 3fd8675529e36fe0ed668405353f3a19627bde6c)
    #2 0x7fc5b10beaa0 in GOMP_parallel (/lib/x86_64-linux-gnu/libgomp.so.1+0x15aa0) (BuildId: 3fd8675529e36fe0ed668405353f3a19627bde6c)

==13730==ABORTING

Also, it somehow messes up the loop vectorization, and now the code runs over 4x slower while the NFFT is not performed, but that's not directly related to the library itself.

I've also tried using the old makefile, and it didn't change anything.

Anyway, thanks. If it's a nontrivial issue and I'm the only one affected, I'm not sure it's worth fixing. I wanted to use FINUFFT mostly to check how fast it can get at calculating Lomb-Scargle periodograms with the guru interface in parallel.

In my use case (typically ~80 thousand frequencies per NFFT), planning time significantly outweighed computation, which isn't surprising given that FINUFFT is based on the exponential-of-semicircle kernel.

I guess I'll just have to implement a pruned type-1 real-to-complex thread-safe NFFT based on a low-order approximation, instead of relying on a library optimized for much larger transforms.

I will still try to somehow solve this, but I'm starting to run out of ideas about what can be done. Also, my compiled FINUFFT doesn't have any issues running the thread-safe examples, so I guess it just doesn't like my code for some reason.

ahbarnett commented 1 year ago

Dear SilkSkier,

Your goal is reasonable.

It looks like you read the docs re thread-safe FFTW: https://finufft.readthedocs.io/en/latest/install.html#compilation-flags-and-make-inc-settings Does examples/threadsafe1d1 run without segfault for you? To prevent a segfault on my laptop I have to use FFTWOMPSUFFIX = threads, which switches the FFTW multithread library from OMP to pthreads. Did you try this?
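For reference, per those docs, that's a make.inc setting along these lines:

    # make.inc: link FFTW's pthreads helper library instead of the OpenMP one
    FFTWOMPSUFFIX = threads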

For small transforms (<1e5 in 1D is small), single-threaded FINUFFT might be faster. Then you can use it inside OMP.

Did you use gdb to see if the crash is in FFTW? https://finufft.readthedocs.io/en/latest/trouble.html#crash-segfault-issues-and-advice

You may want to ask @lgarrison about his FINUFFT-based L-S periodogram code.

Finally, I suggest you refactor the code to use FINUFFT's vectorized interface (sketched below). I imagine you have the same set of NU pts for many xforms? If so, do them all at once: the sorting is done only once, a batched FFT is done, and the result is a 2x-10x speedup. There would then be no calling of finufft inside an OMP parallel block. Rather, you set up a batch of (say, #cores) xform data vectors, request a guru plan for that batch size, send them all to execute together, then process the outputs however you want via OMP. Then repeat everything except the plan in a serial loop (reusing this plan and the same data IO pointers) until you chunk through all your data. Is that clear?
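In code, a minimal sketch of that batching idea, using the float guru interface; names like n, dk, and ts are borrowed from your snippets, and you should check examples/guru1d1.cpp for the exact signatures:

    #include <complex>
    #include <vector>
    #include <finufft.h>

    // one plan for a whole batch of transforms sharing the same NU points
    int ntrans = 16;                        // e.g. one transform per core
    int64_t nmodes[3] = {2 * dk, 1, 1};     // 1D type 1, 2*dk output modes
    finufft_opts opts;
    finufftf_default_opts(&opts);
    opts.modeord = 1;

    finufftf_plan plan;
    int ier = finufftf_makeplan(1, 1, nmodes, +1, ntrans, 1e-4f, &plan, &opts);

    // the NU-point sort happens once here, not once per transform
    ier = finufftf_setpts(plan, n, ts, nullptr, nullptr, 0, nullptr, nullptr, nullptr);

    // c stacks ntrans strength vectors; F stacks ntrans output vectors
    std::vector<std::complex<float>> c(ntrans * n), F(ntrans * 2 * dk);
    ier = finufftf_execute(plan, c.data(), F.data());   // batched FFT inside

    // process the ntrans outputs via OMP, refill c, and call execute again,
    // reusing the same plan, until all data is chunked through
    finufftf_destroy(plan);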

Just to whet your appetite on a 1e5-sized 1d problem, run this:

(base) alex@ross /home/alex/numerics/finufft> test/finufft1dmany_test 100 1e5 1e5 1e-6 1
test 1d1 many vs repeated single: ------------------------------------
[finufft_makeplan] new plan: FINUFFT version 2.2.0.dev0 .................
[finufft_makeplan] 1d1: (ms,mt,mu)=(100000,1,1) (nf1,nf2,nf3)=(200000,1,1)
               ntrans=100 nthr=16 batchSize=15  spread_thread=2
[finufft_makeplan] kernel fser (ns=7):      0.000749 s
[finufft_makeplan] fwBatch 0.05GB alloc:    8e-06 s
[finufft_makeplan] FFTW plan (mode 64, nthr=16):    0.0062 s
[finufft_setpts] sort (didSort=1):      0.0045 s
[finufft_execute] start ntrans=100 (7 batches, bsize=15)...
[finufft_execute] done. tot spread:     0.12 s
               tot FFT:             0.113 s
               tot deconvolve:          0.0184 s
ntr=100: 100000 NU pts to 100000 modes in 0.268 s   3.74e+07 NU pts/s
    one mode: rel err in F[37000] of trans#99 is 3.56e-07
100 of: 100000 NU pts to 100000 modes in 1.65 s     6.06e+06 NU pts/s
            speedup      T_FINUFFT1D1 / T_finufft1d1many = 6.16
    consistency check: sup ( ||f_many-f||_2 / ||f||_2  ) =  0

I.e., 6x faster. Is that good motivation? Does your machine give a similar speedup? (What NU pts/sec are you getting?)

That test code uses a single call to finufft1d1many, but you could easily use guru instead, following examples/guru*. Let me know if I should add an example doing what I suggest above.

Also, don't bother with pruned, etc.

ahbarnett commented 1 year ago

Oh, I forgot to say: if you can make a minimally-complete example which crashes, in the style of examples/guru1d1.cpp or something, we can help more easily. But I encourage you to actually use batching, and then finufft will never be called within an OMP parallel region.

silkskier commented 1 year ago

Both thread-safe examples worked just fine in my case; however, I haven't tried FFTWOMPSUFFIX = threads yet. I'll try it later and check whether it fixes the issue.

About the speed of transforms, I've gotten these results (on a laptop with a Ryzen 4600H):

test 1d1 many vs repeated single: ------------------------------------
[finufftf_makeplan] new plan: FINUFFT version 2.2.0.dev0 .................
[finufftf_makeplan] 1d1: (ms,mt,mu)=(100000,1,1) (nf1,nf2,nf3)=(200000,1,1)
               ntrans=100 nthr=12 batchSize=12  spread_thread=2
[finufftf_makeplan] kernel fser (ns=7):         0.00737 s
[finufftf_makeplan] fwBatch 0.02GB alloc:       1.5e-05 s
[finufftf_makeplan] FFTW plan (mode 64, nthr=12):       0.00254 s
[finufftf_setpts] sort (didSort=1):             6.6e-05 s
[finufftf_execute] start ntrans=100 (9 batches, bsize=12)...
[finufftf_execute] done. tot spread:            0.0516 s
               tot FFT:                         0.0619 s
               tot deconvolve:                  0.0558 s
ntr=100: 1000 NU pts to 100000 modes in 0.181 s         5.52e+05 NU pts/s
        one mode: rel err in F[37000] of trans#99 is 0.00274
100 of: 1000 NU pts to 100000 modes in 0.776 s  1.29e+05 NU pts/s
                        speedup          T_FINUFFT1D1 / T_finufft1d1many = 4.29
        consistency check: sup ( ||f_many-f||_2 / ||f||_2  ) =  1.23e-07
test 1d2 many vs repeated single: ------------------------------------
[finufftf_makeplan] new plan: FINUFFT version 2.2.0.dev0 .................
[finufftf_makeplan] 1d2: (ms,mt,mu)=(100000,1,1) (nf1,nf2,nf3)=(200000,1,1)
               ntrans=100 nthr=12 batchSize=12  spread_thread=2
[finufftf_makeplan] kernel fser (ns=7):         0.000613 s
[finufftf_makeplan] fwBatch 0.02GB alloc:       1.9e-05 s
[finufftf_makeplan] FFTW plan (mode 64, nthr=12):       0.00182 s
[finufftf_setpts] sort (didSort=0):             1.2e-05 s
[finufftf_execute] start ntrans=100 (9 batches, bsize=12)...
[finufftf_execute] done. tot deconvolve:                0.0253 s
               tot FFT:                         0.0602 s
               tot interp:                      0.00218 s
ntr=100: 100000 modes to 1000 NU pts in 0.0911 s        1.1e+06 NU pts/s
        one targ: rel err in c[500] of trans#99 is 0.000431
100 of: 100000 modes to 1000 NU pts in 0.902 s  1.11e+05 NU pts/s
                        speedup          T_FINUFFT1D2 / T_finufft1d2many = 9.9
        consistency check: sup ( ||c_many-c||_2 / ||c||_2 ) =  2.08e-07
test 1d3 many vs repeated single: ------------------------------------
[finufftf_makeplan] new plan: FINUFFT version 2.2.0.dev0 .................
[finufftf_makeplan] 1d3: ntrans=100
        M=1000 N=100000
        X1=3.14 C1=2 S1=5e+04 D1=8.5e+04 gam1=1.00001 nf1=125000
[finufftf_setpts t3] widcen, batch 0.01GB alloc:        0.00033 s
[finufftf_setpts t3] phase & deconv factors:    0.0014 s
[finufftf_setpts t3] sort (didSort=1):          1.8e-05 s
[finufftf_setpts t3] inner t2 plan & setpts:    0.00338 s
[finufftf_execute t3] start ntrans=100 (9 batches, bsize=12)...
[finufftf_execute t3] done. tot prephase:               0.000177 s
                  tot spread:                   0.0264 s
                  tot type 2:                   0.155 s
                  tot deconvolve:               0.00757 s
ntr=100: 1000 NU to 100000 NU in 0.196 s        5.16e+07 tot NU pts/s
        one targ: rel err in F[25000] of trans#99 is 0.00337
100 of: 1000 NU to 100000 NU in 1.64 s          6.15e+06 tot NU pts/s
                        speedup          T_FINUFFT1D3 / T_finufft1d3many = 8.4
        consistency check: sup ( ||f_many-f||_2 / ||f||_2 ) =  1.77e-05

The results here do seem to almost match the ones I've been getting with my single-threaded Lomb-Scargle test with a vectorized interface.

The non-FFT-based recursive vectorized Lomb-Scargle on my PC was able to consistently reach ~4.4e7 frequencies per second (over 80x speed-up in comparison to the 'naive' implementation) in the multithreaded test.

The NFFT-based implementation needs to calculate 2 transforms, each twice the length of the frequency array, which means FINUFFT would need to deliver ~1.8e8 points per second to match (4.4e7 frequencies/s x 2 transforms x 2x length). That is achievable only for extremely long transforms (typically over 1 million frequencies, equivalent to a 500k-element frequency array). Arrays of that length aren't used in almost any application; there is also the slightly higher accuracy of the recursive Lomb-Scargle (being mathematically equivalent to the 'naive' implementation, while the NFFT one is only a reasonably accurate approximation).

I do not know how much potential speed-up a pruned FFT would give, but I would expect at least something close to 15%.

As for the low-order approximation kernel (instead of exponential of semicircle), it usually gives ~10x speed-up in planning while making the FFT calculation phase about 2x longer. I guess that with all additional micro-optimizations it could catch up to the recursive implementation with grids as short as ~40 thousand target frequencies, so it should be a viable alternative for at least some short-period pulsator surveys (however, I still doubt that would be enough to outperform recursive LS in most real-life scenarios).

ahbarnett commented 1 year ago

Hi SilkSkier,

On Mon, Sep 18, 2023 at 5:35 PM silkskier @.***> wrote:

Both thread-safe examples worked just fine in my case; however, I haven't tried FFTWOMPSUFFIX = threads yet. I'll try it later and check whether it fixes the issue.

About the speed of transforms, I've gotten these results (on a laptop with a Ryzen 4600H):

Similar to my laptop.

[quoted: the 1d1 benchmark output from the previous comment]

You omitted the calling command here, but are you requesting only eps=1e-3?

A more basic question: you only want 1e3 NU pts onto a 1e5 uniform grid? That is low density (M/N ~ 1e-2), which means out-of-order RAM (or cache) access on the big modes array is the limit. The # NU pts/sec will not be a fair metric.

Try upsampfac = 1.25 and it is 2x faster.

How many such xforms do you actually want to do with the same NU pts (my ntrans=100 was just an example)?

The results here do seem to almost match the ones I've been getting with my single-threaded Lomb-Scargle test with a vectorized interface.

Non-FFT-based recursive vectorized Lomb-Scargle on my PC was able to consistently reach ~ 4.4e7 frequencies per second (over 80x speed-up in comparison to the 'naive' implementation) in the multithreaded test.

you have shifted the metric to N (# freqs) per sec.

In that case, note that in the above test/finufft1d1many_test you are getting (when you do upsampfac=1.25) 0.1 sec for 100 vectors of size 1e5, which equals 1e8 freq pts/sec. Is that adequate? It seems to beat your "recursive vectorized" L-S (whatever that means).

NFFT-based implementation does need to calculate 2 transforms each twice the length of the frequency array, which makes the FINUFFT require ~1.8e8 points per second, which is achievable only for extremely long transforms

Note your shift in metric from NU pts/sec to freqs/sec. No one can get 2e8 NU pts/sec since M/N is so small (<< 1). But freq pts/sec is close; see above. To talk coherently we'll need to fix a metric.

(typically over 1 million frequencies long, which corresponds to 500k elements long frequencies array). Arrays of that length aren't typically used in almost any applications, also the slightly higher accuracy of recursive Lomb-Scargle (being mathematically equivalent to the 'naive' implementation, while NFFT one is only a reasonably accurate approximation).

I don't understand your use of the word "reasonably": you can dial in any tolerance eps you want at minimal extra cost; that's a feature of a well-implemented NUFFT.

I do not know how much potential speed-up would pruned FFT give, but I would expect at least something close to 15%.

As I said, don't bother. There are much bigger factors on the table :)

As for the low order approximation kernel (instead of exponential of semicircle), it usually gives ~10x speed-up in planning, while making the FFT calculation phase about 2x longer.

Not sure whose code you're talking about here...

I guess with all additional micro-optimizations it could catch up to the recursive implementation with grids as short as ~40 thousand target frequencies, so it should make it a viable alternative for at least some short-period pulsator surveys (however I still doubt that would be enough to outperform recursive LS in most real-life scenarios).

Lehman Garrison I Cc'ed before might be able to give you benchmarks or understand the astro-specifics here.

Another suggestion: use our GPU code. You should be able to get > 1e9 "pts"/sec, where "pts" is M+N or some sensible metric.

silkskier commented 1 year ago

Thank you for the last reply; I guess I've misinterpreted the 'pts' metric. By the recursive implementation, I mean the one proposed by Kurtz in 1985 (https://academic.oup.com/mnras/article/213/4/773/951672), after some additional modifications (including enabling SIMD instructions and rewriting the code in C/C++ instead of Fortran). I set eps=1e-3 just because I wanted to check how it affects the performance; however, I do still understand the basic principles of how FFT/NFFT works (or at least I think so; I'm not so sure anymore).

As for the number of xforms calculated with a single pointer, I'm not really sure; probably it will usually be a few thousand of them.

As for the approximation performance (the word 'reasonably'), I meant mostly the slight shifts in peak location due to approximations caused by grid conversion during transform preparation.

I'll try to make a somewhat more performant periodogram implementation using this library as soon as I'm able to make the threading work. However, I guess I'll still benchmark the low-rank approximation sometime in the future, out of sheer curiosity.

EDIT:

Not sure whose code you're talking about here...

I am not aware of any NFFT libraries that utilize the LRA kernel in a satisfying way; however, there are two available which are based on it: the nufft2 command in the FastTransforms.jl library, and NUFFTW. Neither of them meets my demands, but the type-1 transform should be easy to implement on top of Intel's MKL, which should also be 'fast enough' for most use cases (or rather, as fast as it gets without making it too complicated).

silkskier commented 1 year ago

Update:

Neither changing the FFTW multithreading preset to pthreads nor compiling single-threaded FFTW instead of the multithreaded one fixed the issue. GDB also gives no additional clues about what is causing it.

A minimum working example would be quite long, since I have no idea what exactly causes the issue (it may be a lot of things, including the hypervisor thread being created with std::thread, or many other things), so I guess I'll close the issue as not planned for now. Thanks for your help.

About using the vectorized interface and sending files to be analyzed in batches: it's not something I'm willing to do unless it's absolutely necessary. I'm planning to write an application with relatively fast implementations of the relevant (for now) periodograms that performs well on noisy data. Batching introduces quite a lot of overhead for the periodograms not using external libraries (~2% overhead) and makes the progress estimation much less precise. I could write a separate batching interface especially for NFFTLS, but I'm willing to do that only if I can't find a more practical solution for some time.

silkskier commented 12 months ago

@ahbarnett Sorry for not specifying what I meant by NFFT-based Lomb-Scargle being only a rough approximation. By that, I meant mostly all the mess that happens significantly above the Nyquist frequency. Here are two Lomb-Scargle power spectra for OGLE BLAP-002, with an oscillation frequency of ~61.8 per day. Pure Lomb-Scargle, as well as phase folding, has no issues with it; NFFTLS often ends up messing everything up. Sadly, short-period variables (such as BLAPs) are also the only ones that tend to benefit significantly from the NFFT; everything staying below the Nyquist frequency, or relatively close to it, tends to be limited by readout speed even when the recursive LS is applied. Here the NFFTLS is calculated with eps=1e-4, but changing the value (even for double precision) doesn't seem to change anything. Only increasing the oversampling factor helps, but in that case the NFFT becomes even slower than the non-NFFT implementation.

EDIT: The first one is the correct LS power spectrum. The second one shows strong peaks at the harmonics of the correct frequency, with no real peak discernible. Also, the background noise starts to increase rapidly close to ~10x the 'Nyquist' frequency of the uniform grid.

[attached images: LS_rec and NFFTLS power spectra]

As for the working example, I'll probably send the segfaulting one tomorrow; I have the pruned code almost ready.

silkskier commented 12 months ago

Here is the minimum working example I've been able to make, including everything that may be the cause of the issue. To run the application on the test data, just execute './example ../test_data' from the build directory after compilation. It can take quite a while to compile (~10 seconds) and requires the fmt library (libfmt-dev) to be installed.

Other than that, all required headers (which shouldn't be able to cause any issues) are included in the /include subdirectory.

nfftls_mem_issues_example.tar.gz

In my case, it reproduces exactly the same 'random' memory issues, most commonly a double free.

blackwer commented 11 months ago

Thanks for the example. I'm able to reproduce it. There are a few issues coming into play, and I have a decent workaround implemented, though it hasn't fixed everything. Still looking into it, but here's what I have so far.

See: http://www.fftw.org/doc/Thread-safety.html

1. your CPMAddPackage invocation should say CMAKE_ARGS -DCMAKE_CXX_FLAGS=-DFFTW_PLAN_SAFE
2. the FFTW thread-safety stuff doesn't work if FFTW uses OpenMP as its threading substrate anyway
3. the issue is that nearly all FFTW calls are not thread safe:

    The upshot is that the only thread-safe routine in FFTW is fftw_execute (and the new-array variants thereof). All other routines (e.g. the planner) should only be called from one thread at a time.

This means that any fix will have to ensure that all FFTW_ calls in finufft are mutex/critical guarded. Allocs/frees are NOT thread safe. I'll post a PR with tentative fixes soon.
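The guarding pattern is just this, as an illustrative sketch (not the actual PR code; function names here are made up):

    #include <mutex>
    #include <fftw3.h>

    static std::mutex fftw_mtx;   // one global lock for all non-execute FFTW calls

    fftwf_plan make_plan_locked(int n, fftwf_complex* in, fftwf_complex* out) {
        std::lock_guard<std::mutex> lk(fftw_mtx);   // planner mutates shared hash tables
        return fftwf_plan_dft_1d(n, in, out, FFTW_FORWARD, FFTW_ESTIMATE);
    }

    void destroy_plan_locked(fftwf_plan p) {
        std::lock_guard<std::mutex> lk(fftw_mtx);   // destruction touches the same tables
        fftwf_destroy_plan(p);
    }

    // fftwf_execute(p) is the one routine that may run concurrently without the lock.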

blackwer commented 11 months ago

@silkskier can you try with the fix at https://github.com/flatironinstitute/finufft/pull/352 ? You shouldn't need the -DFFTW_PLAN_SAFE bit, though it also shouldn't hurt.

cd build
cmake ..
make -j
./example ../test_data

should work if you can get your cmake pointing at the right finufft commit

ahbarnett commented 11 months ago

I just brought it into master, so you can try that. Thanks, Alex

silkskier commented 11 months ago

Thank you for the help; now everything seems to work just fine. I still have an issue compiling the library itself with icpx, but I know it's not one of the supported compilers, so that's not a problem.

blackwer commented 11 months ago

I still have an issue with compiling the library itself with the icpx, but I know it's not one of the supported compilers, so that's not an issue.

Please open another issue, but note that icpx enables fast-math by default, so it won't work correctly. I had to run

cmake .. -DCMAKE_BUILD_TYPE=relwithdebinfo  -DCMAKE_CXX_COMPILER=icpx -DCMAKE_C_COMPILER=icx -DFINUFFT_BUILD_TESTS=on -DCMAKE_CXX_FLAGS=-fp-model=strict
make -j
ctest

to get this to work

ahbarnett commented 11 months ago

Thanks! Could you add the icpx flags to a new entry in CMakePresets.json?

silkskier commented 11 months ago

I still have an issue with compiling the library itself with the icpx, but I know it's not one of the supported compilers, so that's not an issue.

Please open another issue, but note that icpx enables fast-math by default, so it won't work correctly. I had to run

cmake .. -DCMAKE_BUILD_TYPE=relwithdebinfo  -DCMAKE_CXX_COMPILER=icpx -DCMAKE_C_COMPILER=icx -DFINUFFT_BUILD_TESTS=on -DCMAKE_CXX_FLAGS=-fp-model=strict
make -j
ctest

to get this to work

I've had a different issue with icpx 2023.2.1 and Debian Testing: in my case, even with fast math disabled, the library didn't compile correctly (the linker doesn't know what it's doing). The library itself compiles with GNU-like flags (as icpx is compatible with both GNU and icpc compiler flags); however, it always results in a series of linker errors when trying to use the compiled library afterwards:

/usr/bin/ld: warning: libimf.so, needed by ../libfinufft.so, not found (try using -rpath or -rpath-link)
/usr/bin/ld: warning: libsvml.so, needed by ../libfinufft.so, not found (try using -rpath or -rpath-link)
/usr/bin/ld: warning: libirng.so, needed by ../libfinufft.so, not found (try using -rpath or -rpath-link)
/usr/bin/ld: warning: libintlc.so.5, needed by ../libfinufft.so, not found (try using -rpath or -rpath-link)
/usr/bin/ld: ../libfinufft.so: undefined reference to `__kmpc_end_single'
/usr/bin/ld: ../libfinufft.so: undefined reference to `__kmpc_for_static_init_8'
/usr/bin/ld: ../libfinufft.so: undefined reference to `__kmpc_barrier'
/usr/bin/ld: ../libfinufft.so: undefined reference to `__kmpc_dispatch_next_4'
/usr/bin/ld: ../libfinufft.so: undefined reference to `__kmpc_critical'
/usr/bin/ld: ../libfinufft.so: undefined reference to `__kmpc_end_critical'
/usr/bin/ld: ../libfinufft.so: undefined reference to `__kmpc_push_num_threads'
/usr/bin/ld: ../libfinufft.so: undefined reference to `__kmpc_dispatch_init_8u'
/usr/bin/ld: ../libfinufft.so: undefined reference to `__kmpc_for_static_init_4'
/usr/bin/ld: ../libfinufft.so: undefined reference to `__kmpc_single'
/usr/bin/ld: ../libfinufft.so: undefined reference to `__kmpc_dispatch_next_8u'
/usr/bin/ld: ../libfinufft.so: undefined reference to `__kmpc_dispatch_init_4'
/usr/bin/ld: ../libfinufft.so: undefined reference to `__kmpc_global_thread_num'
/usr/bin/ld: ../libfinufft.so: undefined reference to `__kmpc_fork_call'
/usr/bin/ld: ../libfinufft.so: undefined reference to `__kmpc_for_static_fini'

None of the answers commonly given online worked.

I've been able to track the issue down to one of the source files (sadly, I forgot to write down which one is responsible). I'll try to fix it myself and create a pull request, however, I doubt it will be done soon (as for now I have severe internet connection issues due to infrastructure damage from a nearby fire).

As for icpx itself, I don't think it's an important issue for now. I'll probably need it for my project in order to enable a 'true' pruned FFT to speed up the calculation (sadly, FFTW3 still doesn't support real pruned transforms: it calculates the whole transform and saves just half of the computed result). OneMKL (which works significantly better with icpx than with either g++ or clang), unlike FFTW3, allows real pruned transforms, which typically leads to a significant speed-up (2x or above) at least for 32-bit accuracy. For 64-bit transforms the gains are often negligible due to memory transfer speed limitations.

blackwer commented 11 months ago

I've been able to track the issue down to one of the source files (sadly, I forgot to write down which one is responsible). I'll try to fix it myself and create a pull request, however, I doubt it will be done soon (as for now I have severe internet connection issues due to infrastructure damage from a nearby fire).

This is a pathing issue, not a source issue: cmake isn't passing the location of the library files to the linker (ld), and it's not properly linking in OpenMP. Did you run this in a fresh build directory? A stale cmake cache will cause problems like this. Also make sure you are calling the proper icpx setvars.sh magic command.

I'll probably need it for my project in order to enable a 'true' pruned FFT to speed up the calculation (sadly, FFTW3 still doesn't support real pruned transforms: it calculates the whole transform and saves just half of the computed result).

Pruning has shown lackluster gains in our experience. There is a more efficient optimization we are looking into adding directly, at least for 2d and 3d transforms (#304). Also note that for many workflows in the CPU version of finufft, the FFT is only about 1/3 of the work, so Amdahl's law dictates that you can get at most a 33% speedup, even if you somehow reduce the FFT cost to zero. 2x for the transform part seems optimistic, but even then it'd be a ~15% reduction in runtime: nothing to sniff at, but not game-changing either. Of course, if you have very few NU points and are spreading onto a very large grid, this could change.
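(To spell out that arithmetic: if the FFT is 1/3 of the runtime and a pruned transform halves it, the new runtime is 2/3 + (1/3)/2 = 5/6 of the original, i.e. roughly the ~15% reduction quoted above; driving the FFT cost all the way to zero caps the gain at 1/3.)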

OneMKL (which works significantly better with icpx than either g++ or clang)

I think what you might be seeing are issues related to using multiple implementations of OpenMP. If you link MKL into a program compiled with gcc+OpenMP, things get a bit wonky, as there are two different threading libraries operating in tandem. A lot of the performance issues can often be resolved by exporting the environment variable MKL_THREADING_LAYER=GNU when running OpenMP code compiled with gcc. I don't experiment with clang much, so I'm not sure what the optimal solution is there.
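For example, in a bash-like shell (binary name and arguments illustrative):

    export MKL_THREADING_LAYER=GNU   # make MKL cooperate with the GNU OpenMP runtime
    ./glspeaks ../test_data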

That said, we have discussed making the fft implementation easily swappable, and it's one of the more likely things to be implemented in the near future.

silkskier commented 11 months ago

Thanks for replying again; I'll try to look more deeply into that. As for the execution time, this might look quite different for Lomb-Scargle periodograms, which are a very specific use case of the NFFT, so I want to mess with the files a little bit more.

As for the performance gains in the FFT itself, some benchmarks seem to match my results, for example FFTBench.

Also, NFFT-based Lomb-Scargle often requires quite extreme oversampling factors (up to ~ 8) in order to reach the desired performance.

The postprocessing in FINUFFT seems decently fast even for this specific use case; however, the precomputation time can be significantly reduced (to less than 10% of the original) by replacing the exponential of semicircle with a low-rank approximation kernel based on Chebyshev polynomials, which would force the library to perform more than one FFT (typically between 10 and 16, depending on the desired precision), essentially making the FFT calculation itself the main cost. I've not tested any of these heavy modifications yet, but they may be able to significantly reduce computation time for relatively small datasets.

I'm not sure if I'll be able to achieve a significant speed-up by applying all of these modifications, but I'm sure that I won't know unless I try.

ahbarnett commented 11 months ago

I'll second Robert's comments about performance.

The ratio "spread/interp time divided by FFT time" grows with the density M/N. Maybe L-S is low-density (lots of modes N but few NU pts M)?

Also, NFFT-based Lomb-Scargle often requires quite extreme oversampling factors (up to ~ 8) in order to reach the desired performance.

you must be referring to some other algorithm; that is beyond our expertise, so we can't comment.

Our discussion is confined to 1) you being able to run FINUFFT successfully (I think mostly addressed), and 2) ideas to speed up our library (which we should move to Discussions, ideally).

We don't know about L-S or astronomy (well, we do, but it's too confusing to get into here).

Oversampling (in our context) is entirely hidden from the NUFFT user, as it should be :) The user calls a transform, and it's no concern of theirs how we compute it.

The postprocessing of FINUFFT seems to be decently fast even for that specific use case, however, precomputation time can be significantly reduced (to less than 10% of the original time) by replacing exponential semicircle with low rank approximation kernel based on Chebyshev polynomials (what would force the library to perform more than one FFT (typically between 10 and 16 depending on the desired precision), essentially making the FFT calculation itself the main cost). I've not tested any of the heavy modifications yet, but they may be able to significantly reduce computation time for relatively small datasets.

Although you don't explain this, I guess you are referring to Ruiz-Antolin and Townsend's work from 2016? While a cute idea, it does not compare to our performance even in 1d, and for >1d it will die.

As a NUFFT user you are free to try whatever other libraries/ideas you like if you don't believe me.

I'm not sure if I'll be able to achieve a significant speed-up by applying all of these modifications, but I'm sure that I won't know unless I try.

You are welcome to fork FINUFFT and make a PR if you create a performance speedup in some range of test cases (verified by test/finufft1d_test, etc). Have at it!

Best, Alex

blackwer commented 11 months ago

Thanks! Could you add the icpx flags to a new entry in CMakePresets.json?

addressed in 7a3ede19