JuliaMath / NFFT.jl

Julia implementation of the Non-equidistant Fast Fourier Transform (NFFT)

Draft: first demo of ducc library interface #107

Closed mreineck closed 2 years ago

mreineck commented 2 years ago

This shows how the ducc NUFFT library could be accessed from NFFT.jl.

ducc0_jll was merged into Yggdrasil today; it is still at a very early stage. I thought it would be good to demonstrate the package as soon as possible to get helpful feedback ... I'm still extremely new to Julia and will make lots of beginner mistakes :-)

Please let me know if you consider this kind of interface useful in principle, and if so, how it could be improved.

mreineck commented 2 years ago

Unfortunately I have no idea how to add the required dependency to Project.toml. Any hint is welcome!

tknopp commented 2 years ago

Looks good. From this point the wrappers will be straightforward; I will have a look into this.

We need a way to set the kernel parameters directly in order to make meaningful comparisons.

tknopp commented 2 years ago

Unfortunately I have no idea how to add the required dependency to Project.toml. Any hint is welcome!

see https://github.com/JuliaMath/NFFT.jl/commit/5032cb2d5204c74d1cc4dc0472cce326a486aece
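For reference, the general pattern for a test-only dependency in Project.toml is an `[extras]` entry plus a `[targets]` entry. This is only an illustrative sketch: the UUID below is a placeholder, and the real one for ducc0_jll is the hash listed in the General registry's Registry.toml (see the next comment).

```toml
# Illustrative Project.toml fragment -- the UUID is a placeholder,
# not the real ducc0_jll UUID from the General registry.
[extras]
ducc0_jll = "00000000-0000-0000-0000-000000000000"

[targets]
test = ["ducc0_jll"]
```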

mreineck commented 2 years ago

see https://github.com/JuliaMath/NFFT.jl/commit/5032cb2d5204c74d1cc4dc0472cce326a486aece

Thanks! My problem was that I could not figure out which hash needs to be given in the extras section. I realize now that it is listed in the Registry.toml file.

mreineck commented 2 years ago

We need a way to set the kernel parameters directly in order to make meaningful comparisons.

Sorry, but there is no way to request a specific kernel support (or any other kernel parameter for that matter). Kernels and oversampling factor are selected depending on the requested accuracy and the given grid size and number of nonuniform points. The user shouldn't have to worry about these technicalities.

I can add the limits for the minimum and maximum oversampling factor to the interface, but that still only provides indirect control over the kernel support.

I'll try to add the "planned" interface tomorrow, but I need a bit of help there, because I do not know how to make sure the plan's destructor is called when appropriate.

codecov[bot] commented 2 years ago

Codecov Report

Base: 88.62% // Head: 88.62% // No change to project coverage :thumbsup:

Coverage data is based on head (fc800a1) compared to base (bbda1e3). Patch has no changes to coverable lines.

Additional details and impacted files:

```diff
@@           Coverage Diff            @@
##           master     #107   +/-   ##
=======================================
  Coverage   88.62%   88.62%
=======================================
  Files           9        9
  Lines        1090     1090
=======================================
  Hits          966      966
  Misses        124      124
```


tknopp commented 2 years ago

I already found the min/max oversampling factor and added it to the wrapper. It seems to work, but unfortunately I get an assertion that no kernel is available for high accuracy. I can only run the benchmarks partially for this reason (limiting m to 6).

What's interesting is that the 1D accuracy is shifted compared to 2D/3D.

(plot: performanceVsAccuracy)

tknopp commented 2 years ago

I'll try to add the "planned" interface tomorrow, but I need a bit of help there, because I do not know how to make sure the plan's destructor is called when appropriate.

I added a stub where the destructor needs to be called:

https://github.com/JuliaMath/NFFT.jl/blob/ducc0/Wrappers/DUCC0.jl#L62

Of course one needs some pointer to the plan that is created before and stored in the DUCC0Plan. See https://github.com/NFFT/NFFT3.jl/blob/main/src/NFFT.jl#L64 for an example.
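The usual Julia idiom for this is to attach a finalizer in the inner constructor, so the C destructor runs when the wrapper object is garbage collected (or when `finalize` is called explicitly). A minimal self-contained sketch, where `destroy_plan!` stands in for the real `ccall` into the ducc destructor (everything here is illustrative, not the actual DUCC0.jl wrapper):

```julia
# Finalizer idiom for wrapping a C-allocated plan.
const DESTROYED = Ref(false)

# Stand-in for the real destructor call, e.g. ccall into the ducc library.
destroy_plan!(ptr::Ptr{Cvoid}) = (DESTROYED[] = true; nothing)

mutable struct DUCC0PlanSketch
    ptr::Ptr{Cvoid}
    function DUCC0PlanSketch(ptr::Ptr{Cvoid})
        p = new(ptr)
        finalizer(p) do q          # runs when `q` is garbage collected
            q.ptr == C_NULL || destroy_plan!(q.ptr)
            q.ptr = C_NULL         # guard against double-free
        end
        return p
    end
end

p = DUCC0PlanSketch(Ptr{Cvoid}(1))
finalize(p)   # force the finalizer now; normally the GC calls it
```

Finalizers require a `mutable struct`, and nulling the pointer inside the finalizer makes an explicit `finalize` followed by a later GC pass safe.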

mreineck commented 2 years ago

I have merged your changes to my branch and added the changes to switch to the planned interface. Things seem to work, including the destructor calls.

Concerning accuracy, my observation is that it is impossible to reach an accuracy below roughly 4e-13 with double precision, no matter which kernel is used. If you fix the oversampling factor to roughly 2, this limit will most likely be even higher.

One thing that surprises me is the weird timing results reported by ducc. When switching verbose output on, a typical output during runtests.jl is

```
U2nu:
  nthreads=1, grid=(14x12x10), oversampled grid=(28x24x20), supp=11, eps=1e-09
  npoints=1680
  memory overhead: 6.25849e-06GB (index) + 0.000200272GB (oversampled grid)

Total wall clock time for u2nu: 0.2934s
|
+- nu2u proper          :  0.38% (0.0011s)
|  |
|  +- spreading            : 89.07% (0.0010s)
|  +- FFT                  :  7.94% (0.0001s)
|  +- zeroing grid         :  1.71% (0.0000s)
|  +- grid correction      :  0.59% (0.0000s)
|  +- allocating grid      :  0.30% (0.0000s)
|  +- <unaccounted>        :  0.38% (0.0000s)
|
+- u2nu proper          :  0.32% (0.0009s)
|  |
|  +- interpolation        : 87.89% (0.0008s)
|  +- FFT                  :  8.59% (0.0001s)
|  +- zeroing grid         :  1.90% (0.0000s)
|  +- grid correction      :  0.81% (0.0000s)
|  +- allocating grid      :  0.36% (0.0000s)
|  +- <unaccounted>        :  0.46% (0.0000s)
|
+- building index       :  0.02% (0.0001s)
+- parameter calculation:  0.00% (0.0000s)
+- correction factors   :  0.00% (0.0000s)
+- sorting coords       :  0.00% (0.0000s)
+- <unaccounted>        : 99.28% (0.2913s)
```

The timings for the individual sections seem realistic, but when called from C++ or Python, I never see the last unaccounted line go above one percent! So it seems that the C++ code is interrupted at some point and the CPU is doing something completely different for over 99 percent of the time. Could be garbage collection, for example. Is this a known issue when calling from Julia into compiled code?

mreineck commented 2 years ago

Sorry, please ignore the comments about timing for the moment! I think I know what might be going wrong...

tknopp commented 2 years ago

Is this a known issue when calling from Julia into compiled code?

ccall has zero overhead and GC will not run during a call.
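To illustrate the point: a `ccall` lowers to a plain native call instruction with no marshalling layer, and the GC cannot interrupt the foreign code while it runs. A direct call into libc, for example:

```julia
# `ccall` into libc's strlen: this compiles to a direct native call,
# with no wrapper overhead and no possibility of a GC pause mid-call.
n = ccall(:strlen, Csize_t, (Cstring,), "hello")   # n == 5
```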

tknopp commented 2 years ago

I rebased your changes onto my branch since I made some follow-up commits.

tknopp commented 2 years ago

I kept the nodes in the plan for the moment. We already have a nodes! setter and I will likely add a nodes getter. And unless I can get the nodes via a ccall it is better to store the array, which is actually just a pointer (I don't know if you copy the nodes internally). m and σ need to be kept for the moment. In the future we might generalize this, but right now this is the easiest way to unify different NFFT libraries (which is the purpose of AbstractNFFTs).

mreineck commented 2 years ago

I kept the nodes in the plan for the moment. We already have a nodes! setter and I will likely add a nodes getter.

OK, then we need to make sure that the internal plan is destroyed and rebuilt whenever the nodes are set.

And unless I can get the nodes via a ccall it is better to store the array, which is actually just a pointer (I don't know if you copy the nodes internally).

I'm storing a genuine copy of the nodes, but it is already re-arranged in the order in which it will be needed during the transforms (this gives a nice speedup); so I cannot really pass it back via a ccall.
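Since the library keeps its own re-ordered copy of the coordinates, a `nodes!` setter has to tear down the old internal plan and build a fresh one. A self-contained sketch of that pattern, where `make_plan` and `destroy_plan!` stand in for the real ducc plan creation/destruction calls (they are assumptions here, not the actual API):

```julia
# Illustrative sketch: `make_plan` / `destroy_plan!` are dummies standing
# in for the real library calls; LIVE_PLANS just tracks plan lifetimes.
const LIVE_PLANS = Ref(0)
make_plan(nodes) = (LIVE_PLANS[] += 1; Ptr{Cvoid}(LIVE_PLANS[]))
destroy_plan!(ptr::Ptr{Cvoid}) = (LIVE_PLANS[] -= 1; nothing)

mutable struct PlanSketch
    ptr::Ptr{Cvoid}
    nodes::Matrix{Float64}
end
PlanSketch(nodes) = PlanSketch(make_plan(nodes), copy(nodes))

# Setting new nodes destroys the old internal plan and rebuilds it,
# because the library stores its own (re-ordered) copy of the coordinates.
function nodes!(p::PlanSketch, nodes::Matrix{Float64})
    destroy_plan!(p.ptr)
    p.ptr = make_plan(nodes)
    p.nodes = copy(nodes)
    return p
end

p = PlanSketch(rand(3, 10))
nodes!(p, rand(3, 20))   # old plan destroyed, new one created
```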

m and σ need to be kept for the moment. In the future we might generalize this, but right now this is the easiest way to unify different NFFT libraries (which is the purpose of AbstractNFFTs).

Fine!

One thing I noticed: your types for f and fHat are AbstractArray and StridedArray, whereas the C code currently expects something that is stored in memory like a DenseArray. I can extend the code to deal with StridedArray objects, but passing a pointer to an AbstractArray to the C code sounds very dangerous to me. We may have to constrain the types a bit more.

tknopp commented 2 years ago

One thing I noticed: your types for f and fHat are AbstractArray and StridedArray, whereas the C code currently expects something that is stored in memory like a DenseArray. I can extend the code to deal with StridedArray objects, but passing a pointer to an AbstractArray to the C code sounds very dangerous to me. We may have to constrain the types a bit more.

Yes indeed, will change before merging.
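The distinction is easy to check directly: a `StridedArray` guarantees valid `pointer`/`strides` access, while a general `AbstractArray` may have no contiguous memory at all. A small sketch (the `mul!` signature in the comment is only a suggested constraint, not the current NFFT.jl definition):

```julia
A = rand(ComplexF64, 4, 4)

v1 = view(A, :, 2)        # range-based view: strided, safe to pass by pointer
v2 = view(A, [1, 3], :)   # index-vector view: NOT strided, no pointer access

s1 = v1 isa StridedArray  # true  -> pointer(v1) / strides(v1) are valid
s2 = v2 isa StridedArray  # false -> passing a raw pointer would be wrong

# A safer wrapper signature would therefore constrain the types, e.g.:
# function mul!(fHat::StridedVector{ComplexF64}, p, f::StridedArray{ComplexF64})
```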

tknopp commented 2 years ago

Have merged my branch. Thanks for the contribution!

mreineck commented 2 years ago

Here is a plot of a local benchmark run: (plot: performanceVsAccuracy)

This is on x86_64.

tknopp commented 2 years ago

Which CPU do you have?

On M1Max it looks like this:

(plot: performanceVsAccuracy)

On the reference machine (AMD EPYC 7702 @ 2.0 GHz) its somewhere in-between.

mreineck commented 2 years ago

Which CPU do you have?

This is a laptop with a Ryzen 7 4800H CPU.

The ducc curves in your plot almost look like there's some constant overhead, but I have no idea where it could come from. You mentioned at some point that there might be CPU emulation going on; could that be the reason?

mreineck commented 2 years ago

It might be interesting to look at the verbose output from one of your 1D runs, just to see where the time is spent.