GPU Implementation of 2nd half of onedim_fseries_kernel()

MelodyShih commented 2 years ago

This pull request adds a GPU implementation of the second half of the function onedim_fseries_kernel(), along with the related test code and scripts (see fseries_kernel_test.cu and fseriesperf.sh).
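For readers unfamiliar with the routine, here is a rough host-side sketch of the kind of computation the second half performs, assuming it is the accumulation loop over output coefficients as in the FINUFFT CPU routine. The function name `fseries_second_half` and the arrays `f` and `theta` are illustrative assumptions, not the identifiers used in this pull request. Each output index is independent of the others, which is what makes the loop a natural fit for one GPU thread per coefficient.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Sketch only: names and data layout are assumptions, not the cufinufft source.
// f[n]     : quadrature weight times kernel value at node n (from the first half)
// theta[n] : phase angle per output index for node n
// Returns out[j] = sum_n 2 * f[n] * cos(j * theta[n]) for j = 0 .. nout-1.
std::vector<double> fseries_second_half(const std::vector<double>& f,
                                        const std::vector<double>& theta,
                                        std::size_t nout)
{
    assert(f.size() == theta.size());
    std::vector<double> out(nout, 0.0);
    // Iterations over j do not depend on each other, so a GPU port
    // could assign one thread to each j.
    for (std::size_t j = 0; j < nout; ++j) {
        double acc = 0.0;
        for (std::size_t n = 0; n < f.size(); ++n) {
            acc += 2.0 * f[n] * std::cos(static_cast<double>(j) * theta[n]);
        }
        out[j] = acc;
    }
    return out;
}
```

The direct cosine evaluation is chosen here for clarity; a CPU loop can instead advance a complex phase by repeated multiplication, which carries a dependency from one j to the next.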

The timing results (tol=1e-6) on a V100 GPU and an Intel Xeon Platinum 8268 CPU show a speedup ranging from 0.8x to 27.3x.

Based on these timings, I added a heuristic in src/cufinufft.cu that switches between the CPU version and the GPU version depending on nf1, nf2 and nf3.
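As an illustration of what a size-based switch of this kind could look like, here is a minimal sketch. The helper name `use_gpu_fseries`, the use of the summed grid sizes as the measure of work, and the `min_gpu_size` parameter are all assumptions made for the example; the actual rule and thresholds in src/cufinufft.cu may differ.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper, not the code in src/cufinufft.cu.
// Returns true when the GPU path is expected to pay off.
// min_gpu_size is an assumed tuning parameter; it should come from
// timings on the target hardware, not from this sketch.
bool use_gpu_fseries(std::int64_t nf1, std::int64_t nf2, std::int64_t nf3,
                     std::int64_t min_gpu_size)
{
    assert(nf1 >= 1 && nf2 >= 1 && nf3 >= 1 && min_gpu_size >= 1);
    // Assumption: the work grows with the total number of coefficients,
    // so the summed grid sizes serve as a simple proxy for it.
    // A dimension that is not in use can be passed as 1.
    const std::int64_t total = nf1 + nf2 + nf3;
    return total >= min_gpu_size;
}
```

For example, with a hypothetical threshold of 4096, `use_gpu_fseries(64, 64, 1, 4096)` would select the CPU path and `use_gpu_fseries(4096, 4096, 4096, 4096)` the GPU path.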

P.S. This pull request also includes minor updates to the print statements of the interpolation kernels.

MelodyShih commented 2 years ago

Hi @janden, thank you for reviewing the code and for the helpful suggestions. I have incorporated them in the latest commit. I removed the full CPU version of the code; I agree that keeping a single version (the CPU/GPU hybrid) is cleaner. Also, in the cases where the CPU/GPU hybrid version is slower (small nf), the fseries computation is not the bottleneck of the NUFFT.

If anything else requires changes, please let me know. Thanks.

janden commented 2 years ago

Great! Sorry I dropped the ball on this. Will merge now.

MelodyShih commented 2 years ago

Thanks for the review!

flatironinstitute / cufinufft

GPU Implementation of 2nd half of onedim_fseries_kernel() #132