flatironinstitute / cufinufft

Nonuniform fast Fourier transforms of types 1 and 2, in 1D, 2D, and 3D, on the GPU

Add type 3 for 3D #120

Open AdhocMan opened 2 years ago

AdhocMan commented 2 years ago

For a project, we require a type 3 transform on the GPU. Since this feature is still missing, I've implemented it for the 3D case, mainly by porting the FINUFFT implementation of type 3. The plan setup should also work in 2D; it's just the execution that's missing there.
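For reference, the type 3 transform being added here evaluates, for nonuniform points $x_j$ and nonuniform frequencies $s_k$ (written generically; the PR targets the 3D case, where $s_k \cdot x_j$ is a dot product):

```latex
f_k = \sum_{j=1}^{M} c_j \, e^{\pm i \, s_k \cdot x_j}, \qquad k = 1, \dots, N
```

Unlike types 1 and 2, both the points and the frequencies are arbitrary. FINUFFT handles this by spreading onto a fine grid, running an inner type 2 transform, and deconvolving, which is exactly the stage breakdown (Spread, inner type 2 execution, Deconvolve) visible in the timing output later in this thread.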

This might not be quite ready to merge, since some small changes (such as timers) are still missing. However, it would be helpful to get some feedback from you, to know what's still required to get this merged.

AdhocMan commented 2 years ago

I've now modified the Python interface so that the dimension can be passed instead of the number of modes for type 3 transforms (similar to how it works in FINUFFT). The Python interface also now works with CuPy in addition to PyCUDA.

The tests seem to pass, so this is now ready for review.
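For anyone wanting to sanity-check a type 3 result outside the test suite, the output can be compared against a direct O(NM) sum. A minimal 1D sketch (the function name is hypothetical, not code from this PR):

```cpp
#include <cmath>
#include <complex>
#include <vector>

using cplx = std::complex<double>;

// Direct O(N*M) evaluation of a 1D type 3 NUFFT with +i sign:
//   f[k] = sum_j c[j] * exp(+i * s[k] * x[j])
// Useful as a brute-force reference when validating a fast implementation.
std::vector<cplx> type3_direct(const std::vector<double>& x,
                               const std::vector<cplx>& c,
                               const std::vector<double>& s) {
    std::vector<cplx> f(s.size());
    for (std::size_t k = 0; k < s.size(); ++k) {
        cplx acc(0.0, 0.0);
        for (std::size_t j = 0; j < x.size(); ++j)
            acc += c[j] * std::exp(cplx(0.0, s[k] * x[j]));
        f[k] = acc;
    }
    return f;
}
```

Since evaluating all N targets directly is too slow for large problems, the cufinufft testers compare only a single output entry against such a sum, which is what the "rel err in F[500000]" line in the timing output below reports.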

AdhocMan commented 2 years ago

It's been a while now, so I'm wondering if anyone would be willing to provide some feedback?

janden commented 2 years ago

Apologies for the delay. I'm just getting back from holiday. Hope to look at this later in the week.

ahbarnett commented 2 years ago

Dear Simon, This is an impressive and ambitious feature addition! This required understanding the inner workings of FINUFFT type 3 - not easy - and mashing them up with cuFINUFFT. We will be delighted to bring it in... given time, since it's nontrivial. Could I ask what your application is?

I have pulled and tested your C++ tester on an A100, and find that: 1) the math seems to check out to the required accuracy (congratulations, this is not easy!); 2) performance in larger problems is sadly entirely dominated by the trig function evaluations in setpts. E.g.:

(base) ahb@workergpu035 /mnt/home/ahb/numerics/cufinufft> bin/cufinufft3d3_test 1 1e6 1e6 1e-6
[time  ] dummy warmup call to CUFFT  0.0176 s
[time  ] cufinufft plan:         2.65e-05 s
[time  ]    kernel fser (ns=7):  0.000108 s
[time  ]    Allocate GPU memory plan 2.61e-05 s
[time  ]    Copy fwkerhalf1,2 HtoD   2.66e-05 s
[time  ]    CUFFT Plan       0.0139 s
[time  ]    Allocate GPU memory NUpts0.000238 s
[time  ]    Setup Subprob properties 0.000648 s
[time  ]    Allocate GPU memory NUpts0.000263 s
[time  ]    Setup Subprob properties 0.000784 s
[time  ] cufinufft setNUpts:         0.67 s
[time  ]    Initialize fw        6.14e-06 s
[time  ]    Spread (1)       0.0474 s
[time  ]    Amplify & Copy fktofw    1.33e-05 s
[time  ]    CUFFT Exec       4.4e-05 s
[time  ]    Unspread (1)         0.00809 s
[time  ]    Inner type 2 execution (1)       0.00819 s
[time  ]    Deconvolve       3.89e-05 s
[time  ] cufinufft exec:         0.0557 s
[time  ]    Free gpu memory      0.00417 s
[time  ]    Free gpu memory      0.0054 s
[time  ] cufinufft destroy:      0.00542 s
[Method 1] 1000000 U pts to 1000000 NU pts in 0.731 s:  1.37e+06 NU pts/s
                    (exec-only throughput: 1.8e+07 NU pts/s)
[gpu   ] one targ: rel err in F[500000] is 4.17e-07

Even though exec is good (20M pts/s), the overall speed is slower than the CPU code on my Ryzen laptop running

test/finufft3d_test 1e2 1e2 1e2 1e6 1e-6

:( But this is entirely due to setNUpts. The solution is to have the GPU do the trig functions, i.e., reimplement onedim_nuft_kernel on the GPU, shared across its threads. @MelodyShih was due to do this for the CPU function onedim_fseries_kernel, which is otherwise an unnecessary bottleneck in 1d1 and 1d2. I encourage her to try that at a down-moment during thesis write-up :) The two kernels would look very similar.
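To make the setpts bottleneck concrete: onedim_nuft_kernel approximates the Fourier transform of the spreading kernel at each nonuniform frequency by numerical quadrature, which is where the trig calls pile up. A rough CPU sketch of the idea, using the exp-of-semicircle kernel with a plain composite-Simpson rule for brevity (FINUFFT's actual quadrature nodes and parameters differ; beta and n here are illustrative):

```cpp
#include <cmath>

// Exp-of-semicircle ("ES") spreading kernel on [-1, 1], as used by FINUFFT.
// beta controls the width/accuracy tradeoff; the value below is illustrative.
double es_kernel(double z, double beta) {
    if (std::abs(z) >= 1.0) return 0.0;
    return std::exp(beta * (std::sqrt(1.0 - z * z) - 1.0));
}

// Approximate the kernel's Fourier transform at frequency s:
//   phihat(s) = integral_{-1}^{1} phi(z) * cos(s * z) dz   (phi is even)
// via composite Simpson (n even). Each target frequency costs n+1 trig
// evaluations; doing this loop on the CPU for every NU frequency is the
// setpts bottleneck, and it is what would move into a CUDA kernel.
double kernel_ft(double s, double beta = 16.0, int n = 2000) {
    const double h = 2.0 / n;
    double sum = 0.0;
    for (int i = 0; i <= n; ++i) {
        double z = -1.0 + i * h;
        double w = (i == 0 || i == n) ? 1.0 : (i % 2 ? 4.0 : 2.0);
        sum += w * es_kernel(z, beta) * std::cos(s * z);
    }
    return sum * h / 3.0;
}
```

On the GPU, each thread (or block) would evaluate one or more target frequencies, with the fixed quadrature nodes and kernel values shared across threads.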

Other things to keep in mind:

Ok, that's enough for now! Maybe the rest of the devs have other comments? Thanks, Alex

AdhocMan commented 2 years ago

Thank you both for reviewing this. I'm happy to hear that you could successfully run the tests on your system. The application I'm working on is an imaging algorithm for radio astronomy. For us, performance of the execution is more important than setpts, since we execute multiple times with the same points. So I did not focus much on accelerating the setup, but it's good to hear that some work is already being done to help there.

I've rebased my branch on the latest master, so it should now be up to date (including the Makefile, where I also added a single precision test now). The changes required for cupy have been reverted and can be discussed separately.

janden commented 2 years ago

@AdhocMan Thanks for addressing the changes. There's just one thing left regarding the templates.

@ahbarnett @MelodyShih Are we good otherwise?

AdhocMan commented 2 years ago

@janden Thanks for taking another look at this. Regarding templates, I'd argue that they are far more maintainable, since the compiler automatically takes care of each typed instance. It also looks like the only place such macros are currently defined is in the API header files, so adding this implementation detail there would just clutter the interface. Any chance you could warm up to the idea of using templates? While there may be no instances of defining new templated classes so far, there are certainly many cases of template usage through standard library and Thrust calls.

janden commented 2 years ago

@AdhocMan Sorry for dropping the ball on this.

I would have to look into this further to see whether templates would make sense for us. If we do decide to go down that route, however, it would be for all the different types (and dimensions). Having different approaches in different parts of the codebase would make things more difficult to maintain. I suggest you open an issue with some examples of how things would look, and we can continue the discussion there.

For now, would you be willing to convert this to use macros? Otherwise I can take a stab.

AdhocMan commented 2 years ago

I likely won't have the time to look into the conversion, but I'd say templates are certainly easier to maintain, since the compiler does all the work of creating individual symbol names for each type. However, feel free to change the templated code to macros here.
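For readers following the templates-vs-macros discussion, the tradeoff can be shown with a toy example (hypothetical names, not code from this PR or cufinufft). The macro route stamps out one hand-suffixed symbol per precision, while the template route lets the compiler instantiate each typed version:

```cpp
// Macro approach: one definition expanded per precision, with the
// type suffix managed by hand in the macro argument list.
#define DEFINE_DECONV(T, SUFFIX) \
    T deconv_scale_##SUFFIX(T val, T ker) { return val / ker; }
DEFINE_DECONV(float, f)
DEFINE_DECONV(double, d)

// Template approach: a single definition; the compiler generates the
// float and double instances (and their mangled names) automatically.
template <typename T>
T deconv_scale(T val, T ker) { return val / ker; }
```

Both produce equivalent machine code; the disagreement in this thread is purely about which is easier to read, debug, and keep consistent across the codebase.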

matthieumeo commented 2 years ago

Hello,

Thank you very much for the nice work. I am developing a Python computational imaging library and I would like to offer support for NUFFTs of type 3 on GPUs (for MRI and radio astronomy imaging). I am therefore very much looking forward to this PR being merged. @janden et al., do you have a tentative timeline? When can we expect this work to be available as part of cuFINUFFT? For information, I am already relying on the FINUFFT library for CPU NUFFTs, so cuFINUFFT is a natural choice for me.

Thanks a lot for your time and work.

janden commented 2 years ago

> I'd say templates are certainly easier to maintain, since the compiler does all the work of creating individual symbol names for each type.

Ok we'll take this in with templates and look into converting t1 and t2 to templates as well for consistency.

janden commented 2 years ago

> @janden et al., do you guys have a tentative timeline?

Currently we're waiting for #142 to be merged and will then try to merge this as soon as possible. I'm not sure what the timeline is for #142 but my guess is that it'll be merged in the next few weeks.

matthieumeo commented 2 years ago

Ok thanks, looking forward!