galsci / pysm

PySM 3: Sky emission simulations for Cosmic Microwave Background experiments
https://pysm3.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

libsharp discontinued, alternatives for MPI smoothing? #92

Closed · zonca closed this issue 2 years ago

zonca commented 2 years ago

PySM relies on libsharp to smooth maps over MPI.

Libsharp is not maintained anymore: https://gitlab.mpcdf.mpg.de/mtr/libsharp

Is anyone using PySM over MPI? Should we simplify the code and remove MPI support, or find an alternative to libsharp (@mreineck, any suggestions)?

@NicolettaK @giuspugl @bthorne93

mreineck commented 2 years ago

If you are currently using libsharp and it is doing what you need, I don't think there is a reason to worry; if any bugs are found in the code, I'm still happy to fix them. My main reason to archive the repo was that I don't want people to start new projects using this code. That said, if you have a genuine need for SHTs with MPI, I'd be very interested to hear about the use case. My impression currently is that single compute nodes are strong enough to compute SHTs of any "typical" size very quickly, and that using MPI within a single node is a waste of resources (at least for SHTs). I might of course be wrong, and if so, I'd like to hear about it :-)

zonca commented 2 years ago

We are doing foreground modelling with PySM at N_side 8192. In double precision, a single-component map is 6.5 GB in memory; the most demanding model has 6 layers, each with 1 IQU map and 2 auxiliary maps, so the inputs alone are roughly 200 GB. And that is a single model: a complete simulation would have 4 galactic models, 3 extragalactic ones, and 2 or 3 CMB components.

It's not doable yet on standard compute nodes, and we might need N_side 16384.
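
A rough back-of-the-envelope check of those numbers, assuming float64 HEALPix maps (the 6-layer, 1 IQU + 2 auxiliary breakdown is the one quoted above):

import numpy as np

nside = 8192
bytes_per_pixel = 8                       # float64
npix = 12 * nside**2                      # HEALPix pixels per map
gb_per_map = npix * bytes_per_pixel / 1e9
print(f"one map at nside={nside}: {gb_per_map:.1f} GB")    # ~6.4 GB

# most demanding model: 6 layers, each with 1 IQU map (3 maps) + 2 auxiliary maps
maps_per_layer = 3 + 2
total_gb = 6 * maps_per_layer * gb_per_map
print(f"inputs for that model: ~{total_gb:.0f} GB")        # ~190 GB, i.e. roughly 200 GB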

mreineck commented 2 years ago

Thanks for the data, that is very helpful! The approach I would suggest (for optimizing SHT performance) in this situation is to

This should be the fastest way of carrying out this operation. Of course the additional communication makes things more complicated, so I can perfectly understand if this is not your preferred solution. However, doing an SHT, even at nside=8192, lmax=16384, with hybrid MPI/OpenMP (or, even worse, pure MPI) parallelization over >100 threads is quite inefficient. If you have the chance to do several SHTs simultaneously on fewer threads each, this would be preferable; a sketch of that idea follows.
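
A minimal sketch of that last suggestion, assuming the maps to be smoothed are independent of each other: distribute whole maps across MPI ranks (here with mpi4py) and let each rank run its own local SHT. healpy.smoothing is used purely as an illustration of "one full transform per rank"; it is not PySM's actual code path.

# Sketch: one independent smoothing job per MPI rank, each using only local resources.
# Assumes mpi4py and healpy are available; the maps here are synthetic stand-ins.
import healpy as hp
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

nside = 512                 # small for the example; the real case would be 8192
n_maps = 8                  # e.g. one per component/layer
fwhm = np.radians(1.0)

# each rank handles a round-robin subset of the maps
for i in range(rank, n_maps, size):
    m = np.random.default_rng(i).standard_normal(hp.nside2npix(nside))
    smoothed = hp.smoothing(m, fwhm=fwhm)   # full SHT done locally on this rank
    # ... write `smoothed` to disk or hand it to the next pipeline stage

comm.Barrier()
if rank == 0:
    print("all maps smoothed independently")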

mreineck commented 2 years ago

PS. Out of curiosity, if you go to map space for intermediate computations (not for the final data product), have you considered using grids other than Healpix, e.g. Gauss-Legendre? Depending on the operations you have to carry out, this could be advantageous, since the SHTs are potentially significantly faster and also exact. If you require equal-area properties, this won't work of course.

zonca commented 2 years ago

@mreineck I don't know much about Gauss-Legendre, do you have a reference I could look into?

mreineck commented 2 years ago

If your band limit is lmax, the minimal corresponding Gauss-Legendre grid has (lmax+1) nodes in theta (non-equidistant) times (2*lmax+1) nodes in phi (equidistant). SHTs are exact in both directions as long as the band limit holds. This grid has the fastest SHTs you can get, roughly twice as fast as using a Healpix grid with 2*nside=lmax. The basics are briefly described in, e.g., https://hal-insu.archives-ouvertes.fr/file/index/docid/762867/filename/sht.pdf. libsharp and my other SHT libraries support this grid out of the box.

[Edit: fixed the nside expression]
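
To illustrate the grid geometry, the ring colatitudes and quadrature weights of such a minimal Gauss-Legendre grid can be written down with plain NumPy (a sketch of the grid only; libsharp/ducc set this up internally):

import numpy as np

lmax = 2047
nlat = lmax + 1          # Gauss-Legendre rings in theta (non-equidistant)
nlon = 2 * lmax + 1      # minimal number of equidistant points in phi

# Gauss-Legendre nodes/weights on [-1, 1]; theta = arccos(x)
x, w = np.polynomial.legendre.leggauss(nlat)
theta = np.arccos(x)     # ring colatitudes in radians
phi = 2 * np.pi * np.arange(nlon) / nlon

print(nlat, "rings x", nlon, "pixels per ring =", nlat * nlon, "grid points")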

zonca commented 2 years ago

thanks @mreineck, do you have an example of using:

https://mtr.pages.mpcdf.de/ducc/sht.html#ducc0.sht.experimental.synthesis

to transform a set of IQU Alms into a map in GL pixels and then use analysis to go back to Alms?

It's really hard to understand how to use it from the API reference.

mreineck commented 2 years ago

The good thing about the GL grid is that it is "regular", i.e. every ring has the same number of pixels, so a map can be stored as a plain 2D (ntheta, nphi) array.

Its usage is demonstrated, e.g., in https://gitlab.mpcdf.mpg.de/mtr/ducc/-/blob/ducc0/python/demos/sht_demo.py. That demo covers only the unpolarised case; I'll add polarisation and post the modified code here soon!

mreineck commented 2 years ago

Here it is. (Sorry, Github doesn't appear to let me attach it.)

import ducc0
import numpy as np
from time import time

rng = np.random.default_rng(48)

def nalm(lmax, mmax):
    return ((mmax+1)*(mmax+2))//2 + (mmax+1)*(lmax-mmax)

def random_alm(lmax, mmax, spin):
    spin = list(spin)
    ncomp = len(spin)
    res = rng.uniform(-1., 1., (ncomp, nalm(lmax, mmax))) \
     + 1j*rng.uniform(-1., 1., (ncomp, nalm(lmax, mmax)))
    # make a_lm with m==0 real-valued
    res[:, 0:lmax+1].imag = 0.
    # zero a few other values dependent on spin
    for i in range(ncomp):
        ofs=0
        for s in range(spin[i]):
            res[i, ofs:ofs+spin[i]-s] = 0.
            ofs += lmax+1-s
    return res

# just run on one thread
nthreads = 1

# set maximum multipole moment
lmax = 2047
# maximum m.
mmax = lmax

# Number of pixels per ring. Must be >=2*lmax+1, but I'm choosing a larger
# number for which the FFT is faster.
nlon = 2*lmax+2

alm = random_alm(lmax, mmax, [0, 2, 2])
print("testing Gauss-Legendre grid with lmax+1 rings")

# Number of iso-latitude rings required for Gauss-Legendre grid
nlat = lmax+1

# go from a_lm to map
t0 = time()
map = np.empty((3, nlat, nlon))
# unpolarised component
ducc0.sht.experimental.synthesis_2d(
    alm=alm[0:1], ntheta=nlat, nphi=nlon, lmax=lmax, mmax=mmax, spin=0,
    geometry="GL", nthreads=nthreads, map=map[0:1])
# polarised component
ducc0.sht.experimental.synthesis_2d(
    alm=alm[1:3], ntheta=nlat, nphi=nlon, lmax=lmax, mmax=mmax, spin=2,
    geometry="GL", nthreads=nthreads, map=map[1:3])
print("time for map synthesis: {}s".format(time()-t0))

# transform back to a_lm

t0 = time()
alm2 = np.empty_like(alm)
# unpolarised component
ducc0.sht.experimental.analysis_2d(
    map=map[0:1], lmax=lmax, mmax=mmax, spin=0, geometry="GL", nthreads=nthreads, alm=alm2[0:1])
# polarised component
ducc0.sht.experimental.analysis_2d(
    map=map[1:3], lmax=lmax, mmax=mmax, spin=2, geometry="GL", nthreads=nthreads, alm=alm2[1:3])
print("time for map analysis: {}s".format(time()-t0))

# make sure input was recovered accurately
print("L2 error: ", ducc0.misc.l2error(alm, alm2))

zonca commented 2 years ago

The discussion about libsharp is complete; the discussion about Gauss-Legendre pixelization continues in #91.