flatironinstitute / cufinufft

Nonuniform fast Fourier transforms of types 1 and 2, in 1D, 2D, and 3D, on the GPU

Idea for SM in double-precision 3D #90

Open janden opened 3 years ago

janden commented 3 years ago

So if I've understood correctly, the reason we can't perform SM in 3D with doubles (at high precision) is that the 3D box won't fit. In other words, for 12 digits we get w = 12 + 1 = 13, which means that after padding, a 1×1×1 bin becomes 14×14×14 and takes up 16 * 14 ** 3 = 43904 bytes (a complex double is 16 bytes), so it only barely fits in shared memory.

What if we ditch interleaved complex numbers (I'm fairly sure cuFFT allows you to output the real and imaginary parts separately)? We'd only need 8 bytes per voxel, which means we could fit a padded bin of size 19×19×17 (49096 bytes), corresponding to a 6×6×4 unpadded bin. Is this still too small, or could it be workable? The other question is of course how much we could expect to improve over GM-sort, given the magnitude of the changes needed here.
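To make the arithmetic concrete, here is a quick sketch (the 48 kB per-block shared-memory limit is an assumption about the target GPU; everything else just restates the numbers above):

```python
# Back-of-the-envelope check of the bin sizes discussed above.
SMEM_BYTES = 48 * 1024        # assumed per-block shared-memory limit (49152 B)

w = 13                        # kernel width for ~12 digits (w = 12 + 1)

# Interleaved complex doubles: 16 bytes per voxel.
interleaved = 16 * (1 + w) ** 3          # 1x1x1 bin padded to 14x14x14
print(interleaved, interleaved <= SMEM_BYTES)   # 43904 True (barely fits)

# Split real/imaginary planes: 8 bytes per voxel.
split = 8 * 19 * 19 * 17                 # padded 19x19x17 bin
print(split, split <= SMEM_BYTES)        # 49096 True
# (19 - w) x (19 - w) x (17 - w) = 6 x 6 x 4 unpadded voxels per bin
```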

ahbarnett commented 3 years ago

Good future idea!

Whether it's workable depends on the "efficiency" (eff), which I defined in a little MATLAB code in the comments of main.tex; take a look near the Remark. It also depends on the ratio of SM to GM atomic writes. It would require testing. I suggest we let Melody work on other stuff for a bit; it is the 12-digit folks who have the incentive here.

Of course, another approach (which Melody may have tried) is not to pad, but rather to have SM handle larger unpadded boxes (closer to pure output-driven, yet still using Msub per subproblem and GM atomic adds for output). Each box then has to look through all neighboring NU pts too, so the kernels get eval'ed more than once for each NU pt, but parallelism may win in the end...
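For a rough sense of that duplication, here is an illustrative worst-case count of how many unpadded boxes a single NU point's kernel footprint can touch (the footprint being w consecutive grid points per dimension, and the box sizes, are assumptions for illustration only):

```python
# Worst-case number of unpadded boxes overlapped by one NU point's kernel
# footprint, i.e. how many times its kernel would be re-evaluated in a
# no-padding, output-driven scheme. Box sizes below are assumptions.
import math

def boxes_touched(w, box):
    # A run of w consecutive grid points spans at most
    # 1 + ceil((w - 1) / m) boxes of side m along each dimension.
    return math.prod(1 + math.ceil((w - 1) / m) for m in box)

w = 13
print(boxes_touched(w, (6, 6, 4)))      # 36: small boxes, heavy duplication
print(boxes_touched(w, (14, 14, 14)))   # 8: larger unpadded boxes that still
                                        # fit 16-byte voxels in 48 kB of SM
```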


janden commented 3 years ago

OK, got it. So according to that calculation, with m = [6, 6, 4] and w = 13, we get an efficiency of 0.0234, which is not great. The unpadded approach might be more workable, but I'm not sure I understand it completely. So points will be “included” in a box if a kernel placed there overlaps with that box? Effectively, then, we have multiple boxes per point. I'm not too concerned about the additional kernel evaluations, but we'll have more GM writes (since we have more boxes), which might cost us.
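For the record, that 0.0234 comes from reading eff as the unpadded-to-padded volume ratio (my assumption about the definition in main.tex):

```python
# Efficiency as the fraction of the padded box that is actual output,
# assuming eff = prod(m) / prod(m + w) (my reading of the main.tex comment).
import math

m = (6, 6, 4)
w = 13
eff = math.prod(m) / math.prod(mi + w for mi in m)
print(eff)   # 0.02346..., i.e. the ~0.0234 quoted above
```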

At any rate, it's worth looking into at some later point. (Also once we see if there are more potential 1e-12 users.)