flatironinstitute / finufft

Non-uniform fast Fourier transform library of types 1,2,3 in dimensions 1,2,3
Other
291 stars 73 forks source link

For SAR data of the same size, the time consumed by FINUFFT is much more than that consumed by sinc interpolation #434

Open zengletian1491 opened 5 months ago

zengletian1491 commented 5 months ago

When I simulated synthetic aperture radar (SAR) imaging algorithms,such as Omega-K algorithm (or range migration algorithm), I used FINUFFT (i.e. finufft1d1()) to replace "Stolt interpolation and FFT"(implemented by sinc interpolation).

However, for the SAR data of the same size (6144×1200), the time consumed by FINUFFT is 54.004043 seconds while the time consumed by "Stolt interpolation and FFT"(implemented by sinc interpolation) is only 2.972154 seconds. This makes me confused! In theory, FINUFFT is much faster than "Stolt interpolation and FFT"(implemented by sinc interpolation). Moreover, I change the parameters within FINUFFT, the time consumed is about 50 seconds, which is still much more than the time consumed by "Stolt interpolation and FFT"(implemented by sinc interpolation).

I really don't know why. Is there anyone who can give me some suggestions? Thanks very much.

ahbarnett commented 5 months ago

Dear Zengletian (sorry I don't know your first name), You will have to provide a MCE (minimally complete example code), eg, with simple random data, of the FINUFFT call you are doing. Anyone would need to know all parameters (dimension, M, N's, tol, etc) to help you. Also, I don't know the Stolt algorithm, and would like not to have to learn it, so maybe keep that out of your MCE ? Best, Alex

ahbarnett commented 5 months ago

And, in terms of others to help, Dylan Green at Dartmouth College uses FINUFFT for 2D and 3D SAR. I do not promise he has time to help :)

zengletian1491 commented 5 months ago

Dear Zengletian (sorry I don't know your first name), You will have to provide a MCE (minimally complete example code), eg, with simple random data, of the FINUFFT call you are doing. Anyone would need to know all parameters (dimension, M, N's, tol, etc) to help you. Also, I don't know the Stolt algorithm, and would like not to have to learn it, so maybe keep that out of your MCE ? Best, Alex

Dear Alex Barnett:

I’m sorry for the late reply. My first name is Letian and my last name is Zeng. I will provide a MCE (minimally complete example code) as follows written by MATLAB code: kr = ([-3072 : 3071].'/6144 1.2 + 33.33) (4.0 pi); // 6144×1 column vector kx = (33.33 + [-600: 599]./1200 4.0) (2.0 pi); // 1×1200 row vector c = zeros(6144,1200); for k = 1 : 1200 c(:,k) = finufft1d1((sqrt(kr.^2 - kx(k).^2) - 362.27)/(15.98)pi, data_in(:,k),-1,1e-12,6144); end Note that data_in is a 6144×1200 matrix of double-precision type (such as SAR data) and you don’t need to run the above code. In the above code, the term “sqrt(kr.^2 - kx(k).^2) - 362.27)/(15.98)pi” varies with “k”. I have to call finufft1d1() over and over again, which is so time-consuming and much slower than sinc interpolation. Moreover, many algorithms in SAR imaging can encounter the similar situation, i.e., the nonuniform points in different echoes are different, but in one frame of radar data, we have to deal with these multiple echoes. If I call finufft1d1() once for one echo, it would be so time-consuming.

So I would like to know if there are any improved methods to accumulate the speed of finufft1d1() in this situation. Thanks very much!

lu1and10 commented 5 months ago

Dear Zengletian (sorry I don't know your first name), You will have to provide a MCE (minimally complete example code), eg, with simple random data, of the FINUFFT call you are doing. Anyone would need to know all parameters (dimension, M, N's, tol, etc) to help you. Also, I don't know the Stolt algorithm, and would like not to have to learn it, so maybe keep that out of your MCE ? Best, Alex

Dear Alex Barnett:

I’m sorry for the late reply. My first name is Letian and my last name is Zeng. I will provide a MCE (minimally complete example code) as follows written by MATLAB code:

kr = ([-3072 : 3071].'/6144 1.2 + 33.33) (4.0 * pi); // 6144×1 column vector

kx = (33.33 + [-600: 599]./1200 4.0) (2.0 * pi); // 1×1200 row vector

c = zeros(6144,1200);

for k = 1 : 1200

     c(:,k) = finufft1d1((sqrt(kr.^2 - kx(k).^2) - 362.27)/(15.98)*pi, data_in(:,k),-1,1e-12,6144); 

end 

Note that data_in is a 6144×1200 matrix of double-precision type (such as SAR data) and you don’t need to run the above code. In the above code, the term “sqrt(kr.^2 - kx(k).^2) - 362.27)/(15.98)*pi” varies with “k”. I have to call finufft1d1() over and over again, which is so time-consuming and much slower than sinc interpolation. Moreover, many algorithms in SAR imaging can encounter the similar situation, i.e., the nonuniform points in different echoes are different, but in one frame of radar data, we have to deal with these multiple echoes. If I call finufft1d1() once for one echo, it would be so time-consuming.

So I would like to know if there are any improved methods to accumulate the speed of finufft1d1() in this situation. Thanks very much!

Hi Letian, you may try the The vectorized (many vector) interface, ie ntrans>1, it can be much faster. See the full matlab docs, https://finufft.readthedocs.io/en/latest/matlab.html#full-documentation

lu1and10 commented 5 months ago

Never mind, I guess many vector interface does not work for you. your source nupts are different each transfer.

zengletian1491 commented 5 months ago

Never mind, I guess many vector interface does not work for you. your source nupts are different each transfer.

Yes, just like this. Nonuniform points are different each transfer, so "many vector interface" does not work. We have to call finufft1d1() for each transfer at the moment.

I would like the developers of this project to add a code implementation for this case, which is very common in SAR imaging.

ahbarnett commented 5 months ago

Dear Letian, One key point of a MCE is to run it (the other is to read it). So, I made it run, as below:

clear
maxNumCompThreads(1)   % note, better because because 1d1 transforms are small
kr = 4*pi*([-3072 : 3071].'/6144 * 1.2 + 33.33);
kx = 2*pi*(33.33 + [-600: 599]/1200*4);
data_in = rand(6144,1200) + 1i*rand(6144,1200);   % complex
c = 0*data_in;    % complex
tic
for k = 1 : 1200
   c(:,k) = finufft1d1((sqrt(kr.^2 - kx(k).^2) - 362.27)/(15.98)*pi, data_in(:,k),-1,1e-12,6144);
end
toc

It runs in 0.9 seconds on my laptop, ie, processing about 8e6 pts/sec. Is that too slow for you? (what is the time taken for your sinc version? You mention it was about 3 s, so, I think we win, right?). I don't understand how you got 50 sec run-time - was it because of a huge # threads? (I hope you read this, right?)

Plus: If you followed the idea of each thread doing its own 1d1 as in examples/threadsafe1d1.cpp you would get roughly a speed-up equal to the # CPUs, too. Would that not be enough?

I can tell because of the sqrt(kr.^2 - kx(k)^2) that really you are doing a partial transform from a set of 2d k-space points. Are you sure that doing full 2d1 transform is not what you actually want? :)

Best, Alex

zengletian1491 commented 5 months ago

Dear Letian, One key point of a MCE is to run it (the other is to read it). So, I made it run, as below:

clear
maxNumCompThreads(1)   % note, better because because 1d1 transforms are small
kr = 4*pi*([-3072 : 3071].'/6144 * 1.2 + 33.33);
kx = 2*pi*(33.33 + [-600: 599]/1200*4);
data_in = rand(6144,1200) + 1i*rand(6144,1200);   % complex
c = 0*data_in;    % complex
tic
for k = 1 : 1200
   c(:,k) = finufft1d1((sqrt(kr.^2 - kx(k).^2) - 362.27)/(15.98)*pi, data_in(:,k),-1,1e-12,6144);
end
toc

It runs in 0.9 seconds on my laptop, ie, processing about 8e6 pts/sec. Is that too slow for you? (what is the time taken for your sinc version? You mention it was about 3 s, so, I think we win, right?). I don't understand how you got 50 sec run-time - was it because of a huge # threads? (I hope you read this, right?)

Plus: If you followed the idea of each thread doing its own 1d1 as in examples/threadsafe1d1.cpp you would get roughly a speed-up equal to the # CPUs, too. Would that not be enough?

I can tell because of the sqrt(kr.^2 - kx(k)^2) that really you are doing a partial transform from a set of 2d k-space points. Are you sure that doing full 2d1 transform is not what you actually want? :)

Best, Alex

Dear Alex, Thanks very much for your help! I run your code on my laptop and it costs 117.026671 seconds. I think the configuration of my laptop is enough: th Gen Inter(R) Core(TM) i7-1260P (16 CPUs), ~2.1GHz, 16384MB RAM.

addpath('F:/finufft-2.2.0/finufft-2.2.0/matlab') clear maxNumCompThreads(1) ; % note, better because because 1d1 transforms are small kr = 4pi([-3072 : 3071].'/6144 1.2 + 33.33); kx = 2pi(33.33 + [-600: 599]/12004); data_in = rand(6144,1200) + 1irand(6144,1200); % complex c = 0data_in; % complex tic for k = 1 : 1200 c(:,k) = finufft1d1((sqrt(kr.^2 - kx(k).^2) - 362.27)/(15.98)*pi, data_in(:,k),-1,1e-12,6144); end toc 历时 117.026671 秒。 2024-04-30_085836

I download finufft-2.2.0 on the Win11 operating system, but I do not compile it.May this be the reason that finufft runs slow on MATLAB R2023a on my laptop? Moreover, nufft can be directly called on MATLAB R2023a. But nufft also runs very slow on my laptop.

The sqrt(kr.^2 - kx(k)^2) is 2d k-space, whose column is nonuniform for the reason that different columns has different k (k = 1,2,...,1200) values. For each column data of “data_in”, i.e, data_in(:,k), they corresponds to different nonuniform points sqrt(kr.^2 - kx(k)^2). So doing full 2d1 transform is not what I actually want. If we supposed kx(k) be a constant when k=1,2,...,1200, sqrt(kr.^2 - kx(k)^2) is still nonuniform points but does not differ with k, at this moment we can call nufft1d1many() to efficiently implement FINUFFT.

Best, Letian

lu1and10 commented 5 months ago

If you downloaded the release mex binary for windows, it may not be compiled with best cpu instruction set flag for your laptop. if you are aiming for performance, it will be better that you compiled the C++ code and matlab mex natively on your laptop with march native and O3 on. Hopefully we will have cpu dispatch code soon.

ahbarnett commented 5 months ago

Hi Letian, The runtime is independent of the strength data (of course), hence I don't need to rerun it; the time will be the same. There is something very wrong with your distribution, because you are running 200x slower! My laptop is similar to yours; all laptops are similar within a factor of 2 per core. I suggest you try the mex release for windows (even avx missing will only be a factor of 2), or compiling from source, as Libin says. Best, Alex

On Mon, Apr 29, 2024 at 10:27 PM Libin Lu @.***> wrote:

If you downloaded the release mex binary for windows, it may not be compiled with best cpu instruction set flag for your laptop. if you are aiming for performance, it will be better that you compiled the C++ code and matlab mex natively on your laptop with march native and O3 on. Hopefully we will have cpu dispatch code soon.

— Reply to this email directly, view it on GitHub https://github.com/flatironinstitute/finufft/issues/434#issuecomment-2084227512, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACNZRSQJSD6AWW4CYPSVHGLY736QPAVCNFSM6AAAAABGZ5B3RKVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDAOBUGIZDONJRGI . You are receiving this because you commented.Message ID: @.***>

-- *-------------------------------------------------------------------~^`^~._.~' |\ Alex Barnett Center for Computational Mathematics, Flatiron Institute | \ http://users.flatironinstitute.org/~ahb 646-876-5942

zengletian1491 commented 5 months ago

If you downloaded the release mex binary for windows, it may not be compiled with best cpu instruction set flag for your laptop. if you are aiming for performance, it will be better that you compiled the C++ code and matlab mex natively on your laptop with march native and O3 on. Hopefully we will have cpu dispatch code soon.

Dear Libin, when I use CMake 3.29.2 to compile finufft 2.2.0 to generate .mex file of MATLAB (with march native on), errors occur as follows: 2024-05-01_172415

The errors occur in running the following code of CMakeLists.txt: if (FINUFFT_BUILD_MATLAB)

When building for matlab, we will fetch the OpenMP library used by matlab

# instead of system default for compatibility.
find_package(Matlab REQUIRED)
find_library(matlab_iomp5_lib NAMES iomp5 HINTS ${Matlab_ROOT_DIR}/sys/os/ PATH_SUFFIXES glnxa64 maci64)
find_library(pthreads_lib NAMES pthread CMAKE_FIND_ROOT_PATH_BOTH)

# Create a "fake" imported library pointing to the matlab openmp implementation
add_library(OpenMP::OpenMP_CXX SHARED IMPORTED)
set_target_properties(OpenMP::OpenMP_CXX PROPERTIES IMPORTED_LOCATION ${matlab_iomp5_lib})
target_link_libraries(OpenMP::OpenMP_CXX INTERFACE ${pthreads_lib})
# Set the OpenMP flag.
if (CMAKE_CXX_COMPILER_ID STREQUAL "AppleClang")
    target_compile_options(OpenMP::OpenMP_CXX INTERFACE -Xclang -fopenmp)
else ()
    target_compile_options(OpenMP::OpenMP_CXX INTERFACE -fopenmp)
endif ()

else ()

For non-matlab builds, find system OpenMP

if (FINUFFT_USE_OPENMP)
    find_package(OpenMP REQUIRED)
endif ()

endif ()

Could you please give me some suggestions?Or can you generate a windows mex file for me? My cpu is Intel(R) Core(TM) i5-10400F CPU @ 2.9GHz (12CPUs), ~2.9GHz and matlab 2017b.Thanks. Best, Letian.

lu1and10 commented 5 months ago

Hi Letian,

Current cmake for matlab mex is tested for macOS, never tested on windows with visual studio. We should improve cmake for matlab on windows. Currently I use mingw compiler for matlab mex on windows using make. See tips https://finufft.readthedocs.io/en/latest/install.html#windows-tips-for-compiling

zengletian1491 commented 5 months ago

Hi Letian,

Current cmake for matlab mex is tested for macOS, never tested on windows with visual studio. We should improve cmake for matlab on windows. Currently I use mingw compiler for matlab mex on windows using make. See tips https://finufft.readthedocs.io/en/latest/install.html#windows-tips-for-compiling

Dear Libin, I compiled the finufft on Ubuntu 20.04 with gcc 7.5.5. I typed in "make test -j", "make examples", "make fortran" in the listed order and the codes ran successfully. However, when I typed in "make matlab", .mexw file could not be generated correctly, and errors occurred as follows:

微信图片_20240504215745

I used MATLAB R2021a. I tried some solutions but still failed.

Could you give me some suggestions? Thanks.

Best, Letian.

lu1and10 commented 5 months ago

seems the Matlab mex is not in your environment PATH, you could find the Matlab install directory by typing matlabroot in Matlab, and the mex probably under bin directory of the Matlab root directory. You can export the mex directory to PATH or you can specify explicitly where you mex is https://github.com/flatironinstitute/finufft/blob/cc8629f78d8aa764b32d15a6d5fd40901e408e4d/makefile#L45

ahbarnett commented 3 months ago

Hi Letian, did this work out for you? I'd like to close the issue.