AdriJD / beamconv

Code library to simulate CMB detector data with realistic optics-related contributions
MIT License

Question on cost of spherical harmonic transforms #7

Closed · mreineck closed this issue 4 years ago

mreineck commented 4 years ago

Hi,

I'm currently working on a code very similar to beamconv for the LiteBIRD simulation pipeline and would like to thank you for the insights I got from studying both your paper and the beamconv sources!

There is one single aspect I don't really understand though, and that's the running time you are measuring for the spherical harmonic transforms. In the paper you report CPU times for the SHTs that reach over 1000 seconds for lmax approx. 3000, Nside = 2048 and smax = 8. On my laptop, on a single core of a Core i5-5300U CPU @ 2.30GHz, I see roughly 6 seconds for a spin-0 SHT with these parameters, and around 21 seconds for spin != 0. So in total I would expect a CPU time between 100 and 200 seconds, instead of more than 1000.
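In case it is useful, here is a minimal healpy sketch of the kind of timing I mean (the flat input spectrum is just a placeholder, and OMP_NUM_THREADS has to be set before healpy is imported to stay on a single core):

```python
import os
os.environ["OMP_NUM_THREADS"] = "1"  # single core; set before importing healpy

import time
import numpy as np
import healpy as hp

nside, lmax = 2048, 3000
alm = hp.synalm(np.ones(lmax + 1), lmax=lmax)  # random alm, flat placeholder spectrum

t0 = time.time()
hp.alm2map(alm, nside, lmax=lmax)              # spin-0 transform
print(f"spin-0: {time.time() - t0:.1f} s")

t0 = time.time()
hp.alm2map_spin([alm, alm], nside, 2, lmax)    # spin-2 transform (pair of maps)
print(f"spin-2: {time.time() - t0:.1f} s")
```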

Is there some peculiarity of the setup I'm missing? I'm assuming that a single SHT for every spin in [0; smax] is required.

AdriJD commented 4 years ago

That's an interesting question!

The number of SWSH transforms that the code does is a bit higher than what you assumed. For the total intensity beam and sky you indeed perform smax + 1 transforms: one s = 0 transform and smax spin-weighted transforms. However, for the linearly polarized beams and sky the code performs 2 * smax + 1 transforms, for spins [-smax, ..., 0, ..., smax]. I think the reason for this is as follows:

For the linearly polarized case, we need to transform the following spin-weighted coefficients: s_f_lm, which are expressed in terms of the harmonic coefficients of the polarized sky (+2_a_lm) and those of the polarized beam (-2_b_ls) as s_f_lm = +2_a_lm -2_b_ls; see eq. 10 in the paper. Normally we assume that the coefficients of spin-weighted fields obey the "reality condition" (s_f_lm)^* = -s_f_l,-m (-1)^(s+m). If this is satisfied, we can use this symmetry relation to compute the negative spin transforms for free from the s >= 0 transforms (which is also what healpix assumes). However, for our version of s_f_lm the reality condition is not satisfied: (s_f_lm)^* = -2_a_l,-m +2_b_l,-s (-1)^(m+s) != -s_f_l,-m (-1)^(s+m). So we need to compute the negative spin modes as well. This is at least what I convinced myself of when I was coding this up :) Hope this makes sense; I can try to clarify it otherwise.

So in the end, the code computes 3 * smax + 2 transforms. That brings us closer to the values in the plot, but not quite all the way; I don't really understand the remaining discrepancy.
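As a back-of-envelope check, plugging your per-transform timings into this count (just the 6 s / 21 s figures from your comment, nothing measured here):

```python
# Back-of-envelope estimate using the per-transform timings quoted above
# (6 s for spin 0, 21 s for spin != 0, at lmax ~ 3000 and nside = 2048).
smax = 8
n_intensity = smax + 1       # one s = 0 plus smax spin-weighted transforms
n_polarized = 2 * smax + 1   # spins -smax, ..., 0, ..., smax
n_total = n_intensity + n_polarized        # = 3 * smax + 2 = 26
t_total = 2 * 6.0 + (n_total - 2) * 21.0   # two of the 26 are spin 0
print(n_total, t_total)  # 26 transforms, ~516 s: closer, but still < 1000 s
```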

I redid some of these benchmarks on a single core of my current laptop (i5-8279U, 2.40GHz). I ran the init_spinmaps function for smax = 5, lmax = 1500 and nside = 1024. It takes 45 sec. A single call (with OMP_NUM_THREADS=1) to hp.alm2map_spin with those parameters takes 2.4 seconds on my machine, and 2.4 * (3 * 5 + 2) = 40.8 is relatively close to the init_spinmaps call. I don't have access to the cluster I ran the benchmarks on for the paper. Perhaps something went wrong there; I'll try to look into it. Thanks for pointing this out.

mreineck commented 4 years ago

Thanks for looking into this! The additional SHTs for the polarization components explain a factor of almost 2, and I also realized that there are two other factors, related to recent improvements in libsharp, that may have resulted in slower transforms when you ran the original benchmarks.

Concerning the additional SHTs for the linearly polarized beams: it is true that you don't have any convenient symmetry between m and -m when using spin spherical harmonics. However, things may be easier when combining the spin s and -s coefficients into "E" and "B" coefficients (or, perhaps more appropriately named, "Grad" and "Curl" coefficients), as described in section 2.4 of https://arxiv.org/pdf/1010.2084, for example. At least for the Planck mission, it was sufficient to build the full convolution data cube with 3*smax SWSHTs and 3 spin-0 SHTs per detector. I'm not absolutely sure yet whether the same holds when there is a HWP, but I think it should still work; it's just no longer possible to add the T, E, and B contributions together to save space.
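For concreteness, this is the kind of combination I have in mind, written out for spin 2 (a sketch in one common CMB convention; sign and normalization conventions differ between codes, so treat it as illustrative only):

```python
import numpy as np

# Sketch: combining spin +/-2 coefficients into "Grad"/"Curl" (E/B)
# coefficients and back, in one common CMB spin-2 convention.
def eb_from_spin2(alm_p2, alm_m2):
    almE = -0.5 * (alm_p2 + alm_m2)
    almB = 0.5j * (alm_p2 - alm_m2)
    return almE, almB

def spin2_from_eb(almE, almB):
    alm_p2 = -(almE + 1j * almB)
    alm_m2 = -(almE - 1j * almB)
    return alm_p2, alm_m2

# Round trip on random coefficients:
rng = np.random.default_rng(0)
a_p2 = rng.normal(size=10) + 1j * rng.normal(size=10)
a_m2 = rng.normal(size=10) + 1j * rng.normal(size=10)
assert np.allclose(spin2_from_eb(*eb_from_spin2(a_p2, a_m2)), (a_p2, a_m2))
```

As far as I understand, this is also why a single pair of maps from hp.alm2map_spin, fed with such an (E, B)-like pair, already carries the information of both the +s and -s components.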

AdriJD commented 4 years ago

Thanks! I had not appreciated those recent improvements to libsharp.

I remember concluding that the approach using E and B would be equivalent to the spin-±2 approach. However, it is possible that you're right and that it is more efficient to do it that way. But if you say that you were using 3 * smax SWSHTs and 3 spin-0 SHTs per detector, it seems to me that we are computing the same number of transforms. I think your intuition about the HWP is correct: I can only add the contributions together after applying the HWP modulation.

mreineck commented 4 years ago

> But if you say that you were using 3 * smax SWSHTs and 3 spin-0 SHTs per detector, it seems to me that we are computing the same number of transforms.

Ahhhh... yes, of course, you are right! I'm doing E and B separately, and you are getting them both in one set of transforms, which, however, is twice as large. Sorry I'm realizing this only now; spin spherical harmonics can do that to your brain :P

Just in case you are interested, the current state of my code is available at https://gitlab.mpcdf.mpg.de/mtr/ducc (or alternatively on GitHub: https://github.com/litebird/ducc); the relevant subpackage is ducc0.totalconvolve. There are a few simple demos but no real application yet. As you already noted in the paper, polynomial interpolation in (theta, phi) does not really do a good job of improving accuracy, so I'm now using an adaptation of the gridding/degridding algorithms from radio interferometry and non-equidistant FFTs. This works surprisingly well and can reproduce the analytical result to within 1e-12 error at reasonable cost. However, your approach of just looking up the closest Healpix pixel is much, much faster and will be accurate enough for most applications, I expect.
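For reference, by "looking up the closest Healpix pixel" I mean essentially this (a minimal sketch; the map and the pointing arrays are placeholders):

```python
import numpy as np
import healpy as hp

# Minimal sketch of nearest-pixel sampling: read a (pre-convolved) HEALPix
# map along a pointing timestream by taking the value of the closest pixel,
# with no interpolation.
nside = 1024
convolved_map = np.zeros(hp.nside2npix(nside))       # stand-in for a real map
theta = np.random.uniform(0.0, np.pi, size=10_000)   # hypothetical pointing
phi = np.random.uniform(0.0, 2.0 * np.pi, size=10_000)
tod = convolved_map[hp.ang2pix(nside, theta, phi)]   # one lookup per sample
```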