AdhocMan opened 2 years ago
I've now modified the python interface so that the dimension can be used instead of the number of modes for type 3 transforms (similarly to how it works in FINUFFT). The python interface now also works with cupy in addition to pycuda.
The tests seem to pass, so this is now ready for review.
It's been a while now, so I'm wondering if anyone would be willing to provide some feedback?
Apologies for the delay. I'm just getting back from holiday. Hope to look at this later in the week.
Dear Simon, This is an impressive and ambitious feature addition! This required understanding the inner workings of FINUFFT type 3 - not easy - and mashing them up with cuFINUFFT. We will be delighted to bring it in... given time, since it's nontrivial. Could I ask what your application is?
I have pulled and tested your C++ tester on an A100, and find that: 1) math seems to check out to required accuracy - congratulations - this is not easy! 2) performance in larger problems is sadly entirely dominated by the trig function evals in setpts. Eg:
(base) ahb@workergpu035 /mnt/home/ahb/numerics/cufinufft> bin/cufinufft3d3_test 1 1e6 1e6 1e-6
[time ] dummy warmup call to CUFFT 0.0176 s
[time ] cufinufft plan: 2.65e-05 s
[time ] kernel fser (ns=7): 0.000108 s
[time ] Allocate GPU memory plan 2.61e-05 s
[time ] Copy fwkerhalf1,2 HtoD 2.66e-05 s
[time ] CUFFT Plan 0.0139 s
[time ] Allocate GPU memory NUpts0.000238 s
[time ] Setup Subprob properties 0.000648 s
[time ] Allocate GPU memory NUpts0.000263 s
[time ] Setup Subprob properties 0.000784 s
[time ] cufinufft setNUpts: 0.67 s
[time ] Initialize fw 6.14e-06 s
[time ] Spread (1) 0.0474 s
[time ] Amplify & Copy fktofw 1.33e-05 s
[time ] CUFFT Exec 4.4e-05 s
[time ] Unspread (1) 0.00809 s
[time ] Inner type 2 execution (1) 0.00819 s
[time ] Deconvolve 3.89e-05 s
[time ] cufinufft exec: 0.0557 s
[time ] Free gpu memory 0.00417 s
[time ] Free gpu memory 0.0054 s
[time ] cufinufft destroy: 0.00542 s
[Method 1] 1000000 U pts to 1000000 NU pts in 0.731 s: 1.37e+06 NU pts/s
(exec-only thoughput: 1.8e+07 NU pts/s)
[gpu ] one targ: rel err in F[500000] is 4.17e-07
Even though exec is good (20M pts/sec), overall speed is slower than the CPU code on my Ryzen laptop running
test/finufft3d_test 1e2 1e2 1e2 1e6 1e-6
:( But this is entirely due to setNUpts. The solution is to have the GPU do the trig funcs, i.e. reimplement onedim_nuft_kernel on the GPU with the work shared across its threads. @MelodyShih was due to do this for the CPU function onedim_fseries_kernel, which is otherwise an unnecessary bottleneck in 1d1, 1d2. I encourage her to try that at a down-moment during thesis write-up :) The two codes would look very similar.
Other things to keep in mind:
If you look in my CPU code type-3 tester https://github.com/flatironinstitute/finufft/blob/69aff52bd5eebb28ea0521bcc303cab1e201715b/test/finufft3d_test.cpp#L153 you'll see that as the problem size grows I grow the space-bandwidth product by making the box in which the (s,t,u) freq targs live grow in proportion to N1,N2,N3, which in this case should each be around N^{1/3}. You'll want to add something like this, since in your tester the FFT is always of size O(1), which is hardly realistic! Note cufft is ridiculously fast in the above timings :) Feel free to bring in that CPU tester. We somehow want to minimize new code (not sure how, other than a joint tester..).
Since you started, Melody has implemented 1d1 and 1d2, and I cleaned up the Makefile test tasks to match. You may find it easier to base your Makefile off of that.
I don't know cupy, but is that something you can apply to the rest of the project or just 3d3?
Ok, that's enough for now! Maybe the rest of the devs have other comments? Thanks, Alex
Thank you both for reviewing this.
I'm happy to hear that you could successfully run the tests on your system.
The application I'm working on is an imaging algorithm for radio astronomy. For us, performance of the execution is more important than setpts, since we execute multiple times with the same points. So I did not focus much on accelerating the setup, but it's good to hear that some work is already being done to help there.
I've rebased my branch on the latest master, so it should now be up to date (including the Makefile, where I also added a single precision test now). The changes required for cupy have been reverted and can be discussed separately.
@AdhocMan Thanks for addressing the changes. There's just one thing left regarding the templates.
@ahbarnett @MelodyShih Are we good otherwise?
@janden Thanks for taking another look at this. Regarding templates, I'd argue that they are far more maintainable, since the compiler automatically takes care of each typed instance. And it looks like the only place where such macros are currently defined is in API header files, so adding this implementation detail there would just clutter the interface. Any chance you could warm up to the idea of using templates? While there may not be instances of defining new templated classes so far, there are certainly many cases of template usage through standard library and thrust calls.
@AdhocMan Sorry for dropping the ball on this.
I would have to look into this further to see whether templates would make sense for us. If we do decide to go down that route, however, it would be for all the different types (and dimensions). Having different approaches in different parts of the codebase would make things more difficult to maintain. I suggest you open an issue with some examples of how things would look and we can continue the discussion there.
For now, would you be willing to convert this to use macros? Otherwise I can take a stab.
I likely won't have the time to look into the conversion, but I'd say templates are certainly easier to maintain, since the compiler does all the work of creating individual symbol names for each type. However, feel free to change the templated code to macros here.
Hello,
Thank you very much for the nice work. I am developing a Python computational imaging library and I would like to offer support for NUFFTs of type 3 on GPUs (for MRI and radio astronomy imaging). I am therefore very much looking forward to this PR being merged. @janden et al., do you guys have a tentative timeline? When can we expect this work to be available as part of cuFINUFFT? For information, I am already relying on the FINUFFT library for CPU NUFFTs, so cuFINUFFT is a natural choice for me.
Thanks a lot for your time and work.
I'd say templates are certainly easier to maintain, since the compiler does all the work of creating individual symbol names for each type.
Ok we'll take this in with templates and look into converting t1 and t2 to templates as well for consistency.
@janden et al., do you guys have a tentative timeline?
Currently we're waiting for #142 to be merged and will then try to merge this as soon as possible. I'm not sure what the timeline is for #142 but my guess is that it'll be merged in the next few weeks.
Ok thanks, looking forward !
For a project, we require a type 3 transform on GPU. Since this feature is still missing, I've implemented it for the 3D case. This was mainly done by porting the FINUFFT implementation for type 3. The plan setup should also work for 2D; it's just the execution that's missing here.
This might not be quite ready to merge, since some small changes (such as timers) are still missing. However, it would be helpful to get some feedback from you to know what's still required to get this merged.