Draft/request for discussion: optional ducc0-based backend for SHTs

mreineck commented 3 years ago

This is the result of a small exercise I made to verify the flexibility of my new Python SHT interface in the ducc0 package.

The pull request is analogous to https://github.com/SHTOOLS/SHTOOLS/pull/298 (prompted by https://github.com/SHTOOLS/SHTOOLS/issues/258) and adds an optional back-end for carrying out SHTs and flm rotations, which improves performance in most situations. If this is functionality you'd like to see in pyssht, please let me know, otherwise feel free to close the PR.

Currently implemented:

pyssht.forward and pyssht.inverse for all pixelizations except MW_pole. The adjoint transforms and MW_pole support can be added easily if you like the PR.
pyssht.rotate_flms
a small accuracy and performance test program (src/pyssht/ducctest.py)

Beneficial features of the ducc0 algorithms:

higher performance in most situations
comparable accuracy for MW/MWSS transforms; better accuracy for DH/GL transforms
unlimited L and much lower memory overhead for rotate_flms.
optional multithreading support

The code can be tested by installing this branch, doing a pip install ducc0 and finally python ducctest.py.

Compared to my earlier PR for intergrating libsharp support (https://github.com/astro-informatics/ssht-archive/pull/5) this PR is much less invasive and has no impact at all if the ducc0 package is not present on the user machine.

Please let me know what you think!

mreineck commented 3 years ago

Demo output of ducctest.py on a single core of an AMD Ryzen 7 4800H CPU.

martin@marvin:~/codes/ssht$ python3 src/pyssht/ducctest.py 
flm rotation tests:
L=  32:  L2 error=7.787688e-15, speedup factor=2.350531
L=  64:  L2 error=1.828852e-14, speedup factor=5.006331
L= 128:  L2 error=4.888588e-14, speedup factor=9.991688
L= 256:  L2 error=1.428483e-13, speedup factor=17.211348
L= 512:  L2 error=4.099982e-13, speedup factor=26.752970
L=1024:  L2 error=1.119593e-12, speedup factor=33.509578
SHT tests:
MW  , L=  32, Reality=False, Spin=0:  L2 error=1.069220e-14, speedup factor=45.675511
MW  , L=  32, Reality=False, Spin=1:  L2 error=1.131743e-14, speedup factor=0.972574
MW  , L=  32, Reality=True , Spin=0:  L2 error=1.105658e-14, speedup factor=86.940989
MW  , L=  64, Reality=False, Spin=0:  L2 error=1.764786e-14, speedup factor=11.038193
MW  , L=  64, Reality=False, Spin=1:  L2 error=1.926463e-14, speedup factor=1.380736
MW  , L=  64, Reality=True , Spin=0:  L2 error=1.639798e-14, speedup factor=59.431823
MW  , L= 128, Reality=False, Spin=0:  L2 error=5.021204e-14, speedup factor=61.757347
MW  , L= 128, Reality=False, Spin=1:  L2 error=6.262586e-14, speedup factor=4.728689
MW  , L= 128, Reality=True , Spin=0:  L2 error=4.317444e-14, speedup factor=14.079249
MW  , L= 256, Reality=False, Spin=0:  L2 error=1.011778e-13, speedup factor=18.465855
MW  , L= 256, Reality=False, Spin=1:  L2 error=1.236980e-13, speedup factor=5.040028
MW  , L= 256, Reality=True , Spin=0:  L2 error=1.058034e-13, speedup factor=19.899911
MW  , L= 512, Reality=False, Spin=0:  L2 error=3.113635e-13, speedup factor=18.181369
MW  , L= 512, Reality=False, Spin=1:  L2 error=2.290599e-13, speedup factor=11.047243
MW  , L= 512, Reality=True , Spin=0:  L2 error=2.281300e-13, speedup factor=40.174040
MW  , L=1024, Reality=False, Spin=0:  L2 error=5.232240e-13, speedup factor=21.449102
MW  , L=1024, Reality=False, Spin=1:  L2 error=5.493335e-13, speedup factor=14.823629
MW  , L=1024, Reality=True , Spin=0:  L2 error=4.972997e-13, speedup factor=40.481276
MW  , L=2048, Reality=False, Spin=0:  L2 error=3.060674e-12, speedup factor=87.109723
MW  , L=2048, Reality=False, Spin=1:  L2 error=1.876659e-12, speedup factor=36.758097
MW  , L=2048, Reality=True , Spin=0:  L2 error=2.114590e-12, speedup factor=47.269750
MWSS, L=  32, Reality=False, Spin=0:  L2 error=7.584225e-15, speedup factor=52.670301
MWSS, L=  32, Reality=False, Spin=1:  L2 error=8.188209e-15, speedup factor=1.330590
MWSS, L=  32, Reality=True , Spin=0:  L2 error=7.136339e-15, speedup factor=45.936444
MWSS, L=  64, Reality=False, Spin=0:  L2 error=1.826959e-14, speedup factor=58.559479
MWSS, L=  64, Reality=False, Spin=1:  L2 error=2.488170e-14, speedup factor=3.621226
MWSS, L=  64, Reality=True , Spin=0:  L2 error=1.595614e-14, speedup factor=100.420469
MWSS, L= 128, Reality=False, Spin=0:  L2 error=3.753735e-14, speedup factor=8.173109
MWSS, L= 128, Reality=False, Spin=1:  L2 error=3.884384e-14, speedup factor=6.621827
MWSS, L= 128, Reality=True , Spin=0:  L2 error=3.491795e-14, speedup factor=55.736038
MWSS, L= 256, Reality=False, Spin=0:  L2 error=7.448911e-14, speedup factor=16.896459
MWSS, L= 256, Reality=False, Spin=1:  L2 error=9.002071e-14, speedup factor=9.921884
MWSS, L= 256, Reality=True , Spin=0:  L2 error=7.279597e-14, speedup factor=37.560571
MWSS, L= 512, Reality=False, Spin=0:  L2 error=1.774534e-13, speedup factor=23.261393
MWSS, L= 512, Reality=False, Spin=1:  L2 error=2.403929e-13, speedup factor=17.595352
MWSS, L= 512, Reality=True , Spin=0:  L2 error=1.829328e-13, speedup factor=40.459189
MWSS, L=1024, Reality=False, Spin=0:  L2 error=5.854770e-13, speedup factor=36.144677
MWSS, L=1024, Reality=False, Spin=1:  L2 error=4.506434e-13, speedup factor=24.387793
MWSS, L=1024, Reality=True , Spin=0:  L2 error=4.687843e-13, speedup factor=57.871766
MWSS, L=2048, Reality=False, Spin=0:  L2 error=2.298029e-12, speedup factor=71.017774
MWSS, L=2048, Reality=False, Spin=1:  L2 error=1.127773e-12, speedup factor=79.374561
MWSS, L=2048, Reality=True , Spin=0:  L2 error=1.792836e-12, speedup factor=224.463276
GL  , L=  32, Reality=False, Spin=0:  L2 error=1.840703e-11, speedup factor=29.883576
GL  , L=  32, Reality=False, Spin=1:  L2 error=1.883529e-11, speedup factor=1.666013
GL  , L=  32, Reality=True , Spin=0:  L2 error=1.862580e-11, speedup factor=47.404505
GL  , L=  64, Reality=False, Spin=0:  L2 error=2.006364e-11, speedup factor=25.033709
GL  , L=  64, Reality=False, Spin=1:  L2 error=2.003057e-11, speedup factor=1.718936
GL  , L=  64, Reality=True , Spin=0:  L2 error=2.057525e-11, speedup factor=2.316247
GL  , L= 128, Reality=False, Spin=0:  L2 error=4.496383e-11, speedup factor=12.278235
GL  , L= 128, Reality=False, Spin=1:  L2 error=4.434178e-11, speedup factor=9.359636
GL  , L= 128, Reality=True , Spin=0:  L2 error=4.500517e-11, speedup factor=20.092019
GL  , L= 256, Reality=False, Spin=0:  L2 error=9.362785e-11, speedup factor=21.250935
GL  , L= 256, Reality=False, Spin=1:  L2 error=9.367738e-11, speedup factor=10.492432
GL  , L= 256, Reality=True , Spin=0:  L2 error=9.360139e-11, speedup factor=14.862463
GL  , L= 512, Reality=False, Spin=0:  L2 error=1.449155e-10, speedup factor=29.151748
GL  , L= 512, Reality=False, Spin=1:  L2 error=1.447672e-10, speedup factor=20.533196
GL  , L= 512, Reality=True , Spin=0:  L2 error=1.450029e-10, speedup factor=35.325782
GL  , L=1024, Reality=False, Spin=0:  L2 error=1.882280e-10, speedup factor=38.456142
GL  , L=1024, Reality=False, Spin=1:  L2 error=1.881660e-10, speedup factor=26.067922
GL  , L=1024, Reality=True , Spin=0:  L2 error=1.881754e-10, speedup factor=42.859991
GL  , L=2048, Reality=False, Spin=0:  L2 error=1.312866e-02, speedup factor=51.301227
GL  , L=2048, Reality=False, Spin=1:  L2 error=1.307702e-02, speedup factor=32.884071
GL  , L=2048, Reality=True , Spin=0:  L2 error=1.296155e-02, speedup factor=58.415200
DH  , L=  32, Reality=False, Spin=0:  L2 error=1.761299e-11, speedup factor=2.191109
DH  , L=  32, Reality=False, Spin=1:  L2 error=1.834118e-11, speedup factor=2.579484
DH  , L=  32, Reality=True , Spin=0:  L2 error=1.776948e-11, speedup factor=2.405983
DH  , L=  64, Reality=False, Spin=0:  L2 error=2.001454e-11, speedup factor=1.932899
DH  , L=  64, Reality=False, Spin=1:  L2 error=2.011957e-11, speedup factor=1.856050
DH  , L=  64, Reality=True , Spin=0:  L2 error=1.992613e-11, speedup factor=2.354581
DH  , L= 128, Reality=False, Spin=0:  L2 error=4.485695e-11, speedup factor=10.512614
DH  , L= 128, Reality=False, Spin=1:  L2 error=4.478997e-11, speedup factor=10.532030
DH  , L= 128, Reality=True , Spin=0:  L2 error=4.466421e-11, speedup factor=15.247348
DH  , L= 256, Reality=False, Spin=0:  L2 error=9.337148e-11, speedup factor=12.130759
DH  , L= 256, Reality=False, Spin=1:  L2 error=9.362291e-11, speedup factor=10.544331
DH  , L= 256, Reality=True , Spin=0:  L2 error=9.357068e-11, speedup factor=16.058156
DH  , L= 512, Reality=False, Spin=0:  L2 error=1.449665e-10, speedup factor=26.412729
DH  , L= 512, Reality=False, Spin=1:  L2 error=1.447460e-10, speedup factor=20.015510
DH  , L= 512, Reality=True , Spin=0:  L2 error=1.447082e-10, speedup factor=30.894329
DH  , L=1024, Reality=False, Spin=0:  L2 error=1.882401e-10, speedup factor=39.573818
DH  , L=1024, Reality=False, Spin=1:  L2 error=1.881496e-10, speedup factor=27.040965
DH  , L=1024, Reality=True , Spin=0:  L2 error=1.882652e-10, speedup factor=46.202585
DH  , L=2048, Reality=False, Spin=0:  L2 error=1.318579e-02, speedup factor=55.183716
DH  , L=2048, Reality=False, Spin=1:  L2 error=1.276778e-02, speedup factor=34.222753
DH  , L=2048, Reality=True , Spin=0:  L2 error=1.289799e-02, speedup factor=61.578388

jasonmcewen commented 3 years ago

Thanks @mreineck, very interesting. We'll take a look at this in more detail and follow up with any questoins shortly.

@CosmoMatt Can you take a look into this?

mdavezac commented 3 years ago

@mreineck, I've made a few changes to make the PR a bit more inline with the rest of the code. Mostly, it comes down to:

split the ducc0 stuff into a file.pyx of its own
add tests to ensure the interface works in the future
run the tests in the CI
get rid of the globals and the select_backend functions in favor of an extra keyword. I have a personal aversions to globals and state 🤕 😀
some words about ducc in the docs

Do let me know if I've bungled anything.

mreineck commented 3 years ago

Hi @mdavezac, I'm perfectly happy with all the changes, perhaps with a small question mark for

get rid of the globals and the select_backend functions in favor of an extra keyword. I have a personal aversions to globals and state

I totally understand (and share!) your aversion against global state in a library. But on the other hand requiring the explicit setting of a key word in every transform call puts quite a heavy burden on the users (especially those updating an already existing code) , so in that particular case I would personally have erred on the side of convenience :-) Still it's of course entirely your call.

Would you be interested in the adjoint transforms as well? I should be able to produce at least the adjoint synthesis fairly quickly; adjoint analysis might take a bit longer.

jasonmcewen commented 3 years ago

Thanks for your comments @mreineck. I'll leave it to @mdavezac to comment on the issues of globals vs convenience.

On the adjoints side, it would indeed be very useful to support that also. Great if you're able to add that!

mreineck commented 3 years ago

I added inverse_adjoint, which should be the one that is used much more frequently. I also tweaked the documentation a little bit, so that users don't accidentally precompute the dl_array when rotating spherical harmonics with ducc; that would be quite a waste :-)

One remaining question: (how) do we advertise the nthreads parameter? It could be added to the documentation of the respective functions, but I don't know your preferences. Overall I would recommend to mention this parameter, since the scaling is quite good, even though it's not really linear.

Sorry if I messed up the code formatting again; if you have an auto-formatting command and let me know, I'm happy to use it.

mdavezac commented 3 years ago

We are using black for formatting. Otherwise, sounds good, I'll take a look ASAP.

mreineck commented 3 years ago

Not sure if the CI failure is related to my changes ... at first glance it seems to be caused by added checks in test_pyssht.py but not actually originate in the new backend.

mdavezac commented 3 years ago

@mreineck, I've added something in the docs about nthreads and cleaned up the code some more. I think the failing test was just some numerical noise flakiness. As for globals, I'd rather inconvenience users than risk annoying mistakes because of internal state in ssht 😉. But I do get your point!

I'll ask for any other input from the team, but otherwise I think we are good to go.

mreineck commented 3 years ago

Perfect, thanks a lot for the polishing work! Everything's fine from my side.

jasonmcewen commented 3 years ago

@mreineck @mdavezac Thanks both for your efforts on this!

I just wanted to check what the situation is re testing.

Do we have unit tests that compare the ducc0 harmonic coefficients with those computed by ssht? Or are we just comparing consistency of each approach?

Similarly for the inverse_adjoint, how is this tested?

I haven't had a chance to look at the new code so apologies for these simple questions but thought it would be quickest to just ask you guys the situation.

mreineck commented 3 years ago

If I read the tests correctly, they directly compare the results of transforms done via ssht and ducc0 (including inverse_adjoint), so that should be safe.

mdavezac commented 3 years ago

@jasonmcewen, yes we are testing ducc0 vs ssht results for the forward, inverse, and adjoint_inverse transforms as well as the rotations in harmonic space. Comparisons are to a relative tolerance of 1e-6 and an absolute tolerance of 1e-12. We check MW, MWSS, DH, GL, spins 0,1,2, real and complex variations.

jasonmcewen commented 3 years ago

@mdavezac @mreineck Great! Shall we merge?

mreineck commented 3 years ago

Ready when you are!

mreineck commented 3 years ago

One thing that might be useful to mention is that the ducc SHTs are accurate way beyond the limits imposed by some of the ssht transforms (L \approx 1800 for GL and DH, I think), but I didn't find this in he documentation.

mdavezac commented 3 years ago

@mreineck, Thanks for everything. We've made a new release. It should come up on pypi soon.

For completeness, would it be difficult to add the forward adjoint as well?

mreineck commented 3 years ago

For completeness, would it be difficult to add the forward adjoint as well?

Not really difficult, but some decent amount of work, I expect. For "classic" grids this would just be "run a synthesis and then apply the quadrature weights", but for MW and MWSS it is a bit trickier: I'd have to "transpose" my complete analysis machinery, which for these kinds of grid contains a lot of tweaks, shortcuts etc. If this has a practical use somewhere I can try and implement this, but so far I have not encountered one.

jasonmcewen commented 3 years ago

Thanks for your comments @mreineck.

There are practical applications of the forward adjoints since we're using them in solving convex optimization problems on the sphere. For example, see this paper and this paper. We're applying these techniques for a variety of applications, for example for mass mapping (see here) and also some upcoming works on seismic imaging. I'm very interested to see if we can use ducc0 for these problems since they become quite computationally demanding (particularly when going to the ball).

But I appreciate there is quite a lot of work involved to support this. Perhaps we can add it to the wish list for someday? And when these other projects progress further perhaps we could also get a student or postdoc involved to help with the implementation?

mreineck commented 3 years ago

Thanks for the pointers, @jasonmcewen! So far I have only ever seen optimization schemes using the inverse/adjoint_inverse pair. I'll have a go at the implementation over the next few weeks and keep you updated!

jasonmcewen commented 3 years ago

That would be amazing if you were able to look into these implementations at some point over the coming weeks @mreineck! Let us know if we can help at all.

And thanks for all your efforts updating ssht to support ducc0 here too. This will make it very easy for us to pick up ducc0 now for various applications. We hope to give this a try soon and will let you know how it goes.

mreineck commented 3 years ago

OK, the easy part of this seems to be really easy: I think I have a properly working code for Gauss-Legendre grids and Driscoll-Healy grids (unfortunately "my" DH grids seems to be different from "yours"; what ssht defines as "DH" is "Fejer's second rule" in ducc terminology).

So I should be able to provide a prototype very quickly, but ssht will only be able to use it for Gauss-Legendre grids initially. Do you see any benefit in having this particular functionality early, e.g. for testing purposes?

And thanks for all your efforts updating ssht to support ducc0 here too. This will make it very easy for us to pick up ducc0 now for various applications. We hope to give this a try soon and will let you know how it goes.

I'm very much looking forward to hear your experiences with this! In principle, if you work at high band limits and make use of multithreading, you should see speedups way beyond a factor 100.

jasonmcewen commented 3 years ago

Great you've managed to start taking a look at the forward adjoint already! Tbh, the DH and GL cases were just included initially for some simple comparisions but we've never really used those. I don't think ssht even implements adjoints for those cases. So I don't think we're really be able to do much testing there. The cases of most interest are the MW and MWSS sampling.

mreineck commented 3 years ago

I think I discovered an interesting problem ...

While the inverse and inverse_adjoint operators are pretty unambiguous and demonstrably do the same thing in ssht and ducc0, the same is not generally true for forward and forward_adjoint. As a concrete example, it is not really specified how the forward operator sould behave when it encounters a map that does not have constant values at the polar rings (assuming the spin is zero). And I have the impression that our implementations differ in that respect. So far the unit tests don't show this because the forward operator only encounters sane maps there, i.e. ones with a uniform value in the polar rings.

Any suggestions how to break this degeneracy?

mreineck commented 3 years ago

One option might be to design forward_adjoint in such a way that it only ever generates uniform values on polar rings (for spin=0, higher moments are OK for higher spins), and then derive the exact structure of forward from forward_adjoint. As far as I can tell, the ssht incarnation of forward_adjoint currently doesn't have this property.

jasonmcewen commented 3 years ago

Thanks @mreineck. I'm not sure I completely follow. For a spin=0 signal the ring of pixels at the pole(s) should contain identical values. It's only for spin<>0 signals that the values on the polar ring(s) would differ (due to the spin rotation factor). So I'm not sure it's meaningful to consider maps that for have different values on the polar ring(s). It seems like that might be what you're suggesting in your latest message?

mreineck commented 3 years ago

For a spin=0 signal the ring of pixels at the pole(s) should contain identical values.

If the map is band limited, that will indeed be the case. But even if you run ssht's forward_adjointon a random set of spherical harmonic coefficients, you normally end up with some variation in these pixels. For an example, try:

import pyssht as ssht
import numpy as np

def complex_coeffs(order, rng):
    coeffs = rng.random((order * order, 2), dtype="float64")
    return coeffs @ np.array([1, 1j])

def real_coeffs(order, complex_coeffs):
    pos_index = [ssht.elm2ind(el, m) for el in range(order) for m in range(1, el + 1)]
    neg_index = [ssht.elm2ind(el, -m) for el in range(order) for m in range(1, el + 1)]
    complex_coeffs[neg_index] = np.conjugate(complex_coeffs[pos_index])
    return complex_coeffs

L=5
coeffs=real_coeffs(L,complex_coeffs(L, np.random.default_rng()))
image = ssht.forward_adjoint(coeffs, L, Reality=True, Method="MWSS", Spin=0)

import matplotlib.pyplot as plt
plt.imshow(image)
plt.show()

While the variation in the polar rings is smaller than elsewhere, it is far from being numerical noise. ducc0's forward_adjoint operator produces a similar, but not identical image which also has variations on the polar rings.

What I'm trying to say is: if the forward operator is only defined by the requirement that it should accurately recover the input spherical harmonic coefficients from band limited maps, then this definition is not unique. And if ssht and ducc0 choose two different forward operators (which both fulfill the requirement), then the adjoint counterparts will also be different, and running both on the same set of flm will produce different results.

So, what should we compare?

jasonmcewen commented 3 years ago

Thanks for the clarification @mreineck. Ok, I'm with you now.

Perhaps we can therefore avoid comparing ssht and ducc0 and therefore just check each is self-consistent with the dot product test?

mreineck commented 3 years ago

Perhaps we can therefore avoid comparing ssht and ducc0 and therefore just check each is self-consistent with the dot product test?

Absolutely! This is what I currently do internally, and I'll also enhance the ducc0 tests in ssht accordingly. Unfortunately even with the dot product test I'm not successful when testing forward and forward_adjoint directly against each other; and the same seems to be true for ssht, since you take a detour and actually test inverse against forward_adjoint: https://github.com/astro-informatics/ssht/blob/126e26ee026c7f6871828a3d2cd6d5836626f446/tests/test_pyssht.py#L87-L96. My linear algebra is too rusty to understand why this is necesary ... the inverse/inverse_adjoint pair doesn't require this sort of complication: https://github.com/astro-informatics/ssht/blob/126e26ee026c7f6871828a3d2cd6d5836626f446/tests/test_pyssht.py#L73-L82 Here it is possible to draw a completely random map (which is certainly not band limited and won't have uniform values at the poles), and the test still works fine.

jasonmcewen commented 3 years ago

All of these operations should really consider bandlimited signals. So it makes sense that the forward adjoint test makes a bandlimited test signal f for test purposes. You could take this f and use ssht.forward to compute an equivalent flm but that will be the same as the one at hand and so is not necessary.

As you point out, a bandlimited test signal is not constructed for f_prime for the inverse adjoint test. If that were done, the test would obviously still pass. I think the reason the invese adjoint test still works in this setting is because the inverse adjoint annihilates the non-bandlimited component. Consider an inverse transform that can operate with signals at a higher bandlimit that considered simply by ignoring the higher ell terms. Then the adjoint will simply fill high ell terms with zeros, effectively annihilating any signal content above the bandlimit of interest. Effectively this follows since application of the inverse transform (defined at L) to a non-bandlimited signal will bandlimit the signal. That is not the case for the forward transform. I'd need to give this some further thought to specify this more precisely but I think that is essentially where the difference is coming from.

mreineck commented 3 years ago

I still think that (in principle!) the dot product test should work for any input on both sides, and that's simply because both forward and forward_adjoint can (again very much in principle ;-) be written as matrices ... and if they are adjoint to each other, the test will succeed, no matter what the input is (ignoring numerical inaccuracies for simplicity's sake).

Perhaps the crucial point is that the forward operation for MW and MWSS does not in fact act on a "normal" spherical map but rather on its extension to the Double Fourier Sphere,and that we need to take this into account in some way. A similar problem happened to me when I first implemented the dot product test for the inverse/inverse_adjoint pair: since ducc0 only stores flm with nonnegative m when working on real-valued maps, I never managed to get the test to work until I took the redundant coefficients with negative m into account.

Anyway ... this is something I will investigate on the side over the next months or so. I'll let you know when I have new insights!

jasonmcewen commented 3 years ago

Hmm, yes that is a good point. I will need to give this some further thought.

astro-informatics / ssht

Draft/request for discussion: optional ducc0-based backend for SHTs #55