If you are currently using libsharp and it is doing what you need, I don't think there is a reason to worry; if any bugs are found in the code, I'm still happy to fix them. My main reason for archiving the repo was that I don't want people to start new projects using this code.
That said, if you have a genuine need for SHTs with MPI, I'd be very interested to hear about the use case. My impression currently is that single compute nodes are strong enough to compute SHTs of any "typical" size very quickly, and that using MPI within a single node is a waste of resources (at least for SHTs). I might of course be wrong, and if so, I'd like to hear about it :-)
We are doing foreground modelling with PySM at N_side 8192. Each map component in memory in double precision is 6.5 GB; the most demanding model has 6 layers, each with 1 IQU map and 2 auxiliary maps, so the inputs alone are about 200 GB. That is a single model; a complete simulation would have 4 galactic models, 3 extragalactic ones, and 2 or 3 CMB components.
It's not doable yet on standard compute nodes, and we might need N_side 16384.
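A quick back-of-the-envelope cross-check of these figures, in plain Python (assuming 8 bytes per pixel and counting each layer as 3 IQU components plus 2 auxiliary scalar maps, as described above):

# Rough memory estimate for the N_side 8192 inputs quoted above.
nside = 8192
npix = 12 * nside**2                    # HEALPix pixel count: 805,306,368
gb_per_component = npix * 8 / 1e9       # one scalar map in double precision
layers = 6
total_gb = layers * 5 * gb_per_component
print(f"{gb_per_component:.2f} GB per component")            # ~6.4 GB
print(f"{total_gb:.0f} GB of inputs for the 6-layer model")  # ~193 GB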
Thanks for the data, that is very helpful! The approach I would suggest (for optimizing SHT performance) in this situation is to
This should be the fastest way of carrying out this operation. Of course the additional communication makes things more complicated, so I can perfectly understand if this is not your preferred solution. However, doing an SHT, even if it is nside=8192, lmax=16384, with hybrid MPI/OpenMP (or even worse with pure MPI) parallelization over >100 threads is quite inefficient. If you have the chance to do several SHTs simultaneously on fewer threads each, this would be preferable.
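To illustrate the "several SHTs simultaneously on fewer threads each" idea, here is a hedged sketch (an illustration only, with toy sizes and placeholder a_lm; it assumes ducc0 is available and that its SHT routines release the GIL, so a thread pool provides real concurrency):

# Sketch: run one spin-0 SHT per sky component concurrently, each on a small
# number of threads, instead of one large transform on >100 threads.
from concurrent.futures import ThreadPoolExecutor
import numpy as np
import ducc0

lmax = 255                                   # toy band limit
n_alm = (lmax + 1) * (lmax + 2) // 2         # number of a_lm for mmax == lmax
ncomp = 8                                    # e.g. 8 independent components
rng = np.random.default_rng(0)
alms = rng.standard_normal((ncomp, 1, n_alm)) \
    + 1j * rng.standard_normal((ncomp, 1, n_alm))
alms[:, :, :lmax + 1] = alms[:, :, :lmax + 1].real   # m == 0 must be real

def one_sht(alm, nthreads):
    out = np.empty((1, lmax + 1, 2 * lmax + 2))      # minimal GL grid
    ducc0.sht.experimental.synthesis_2d(
        alm=alm, map=out, ntheta=lmax + 1, nphi=2 * lmax + 2,
        lmax=lmax, mmax=lmax, spin=0, geometry="GL", nthreads=nthreads)
    return out

# 8 transforms in flight, 4 threads each, rather than 1 transform on all cores
with ThreadPoolExecutor(max_workers=ncomp) as pool:
    maps = list(pool.map(lambda a: one_sht(a, 4), alms))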
PS. Out of curiosity, if you go to map space for intermediate computations (not for the final data product), have you considered using grids other than Healpix, e.g. Gauss-Legendre? Depending on the operations you have to carry out, this could be advantageous, since the SHTs are potentially significantly faster and also exact. If you require equal-area properties, this won't work of course.
@mreineck I don't know much about Gauss-Legendre, do you have a reference I could look into?
If your band limit is lmax, the minimal corresponding Gauss-Legendre grid has (lmax+1) nodes in theta (non-equidistant) times (2*lmax+1) nodes in phi (equidistant). SHTs are exact in both directions as long as the band limit holds. This grid has the fastest SHTs you can get, roughly twice as fast as using a Healpix grid with 2*nside=lmax. Basics are briefly mentioned in, e.g., https://hal-insu.archives-ouvertes.fr/file/index/docid/762867/filename/sht.pdf.
libsharp and my other SHT libraries support this out of the box.
[Edit: fixed the nside expression]
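To put rough numbers on this for the resolution discussed above, a small comparison of pixel counts (the roughly-2x speed figure is quoted from the comment above, not measured here):

# Minimal Gauss-Legendre grid vs. HEALPix map size at the same band limit.
nside = 8192
lmax = 2 * nside                         # band limit, using 2*nside = lmax
npix_gl = (lmax + 1) * (2 * lmax + 1)    # (lmax+1) rings x (2*lmax+1) pixels/ring
npix_hp = 12 * nside**2                  # HEALPix pixel count
print(f"GL grid: {npix_gl:,} pixels")    # ~537 million
print(f"HEALPix: {npix_hp:,} pixels")    # ~805 million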
Thanks @mreineck, do you have an example of using https://mtr.pages.mpcdf.de/ducc/sht.html#ducc0.sht.experimental.synthesis to transform a set of IQU Alms into a map in GL pixels and then use analysis to go back to Alms?
It's really hard to understand how to use it from the API reference.
The good thing about the GL grid is that it is "regular", i.e. every ring has the same number of pixels, so a map can be stored as a plain 2D (ntheta x nphi) array, which is what the synthesis_2d/analysis_2d functions below operate on.
Its usage is demonstrated, e.g., in https://gitlab.mpcdf.mpg.de/mtr/ducc/-/blob/ducc0/python/demos/sht_demo.py. That demo covers only the unpolarised case; I'll add polarisation and post the modified code here soon!
Here it is. (Sorry, Github doesn't appear to let me attach it.)
import ducc0
import numpy as np
from time import time

rng = np.random.default_rng(48)


def nalm(lmax, mmax):
    return ((mmax+1)*(mmax+2))//2 + (mmax+1)*(lmax-mmax)


def random_alm(lmax, mmax, spin):
    spin = list(spin)
    ncomp = len(spin)
    res = rng.uniform(-1., 1., (ncomp, nalm(lmax, mmax))) \
        + 1j*rng.uniform(-1., 1., (ncomp, nalm(lmax, mmax)))
    # make a_lm with m==0 real-valued
    res[:, 0:lmax+1].imag = 0.
    # zero a few other values dependent on spin
    for i in range(ncomp):
        ofs = 0
        for s in range(spin[i]):
            res[i, ofs:ofs+spin[i]-s] = 0.
            ofs += lmax+1-s
    return res


# just run on one thread
nthreads = 1

# set maximum multipole moment
lmax = 2047
# maximum m.
mmax = lmax

# Number of pixels per ring. Must be >=2*lmax+1, but I'm choosing a larger
# number for which the FFT is faster.
nlon = 2*lmax+2

alm = random_alm(lmax, mmax, [0, 2, 2])

print("testing Gauss-Legendre grid with lmax+1 rings")

# Number of iso-latitude rings required for Gauss-Legendre grid
nlat = lmax+1

# go from a_lm to map
t0 = time()
map = np.empty((3, nlat, nlon))
# unpolarised component
ducc0.sht.experimental.synthesis_2d(
    alm=alm[0:1], ntheta=nlat, nphi=nlon, lmax=lmax, mmax=mmax, spin=0,
    geometry="GL", nthreads=nthreads, map=map[0:1])
# polarised component
ducc0.sht.experimental.synthesis_2d(
    alm=alm[1:3], ntheta=nlat, nphi=nlon, lmax=lmax, mmax=mmax, spin=2,
    geometry="GL", nthreads=nthreads, map=map[1:3])
print("time for map synthesis: {}s".format(time()-t0))

# transform back to a_lm
t0 = time()
alm2 = np.empty_like(alm)
# unpolarised component
ducc0.sht.experimental.analysis_2d(
    map=map[0:1], lmax=lmax, mmax=mmax, spin=0, geometry="GL",
    nthreads=nthreads, alm=alm2[0:1])
# polarised component
ducc0.sht.experimental.analysis_2d(
    map=map[1:3], lmax=lmax, mmax=mmax, spin=2, geometry="GL",
    nthreads=nthreads, alm=alm2[1:3])
print("time for map analysis: {}s".format(time()-t0))

# make sure input was recovered accurately
print("L2 error: ", ducc0.misc.l2error(alm, alm2))
The discussion about libsharp is complete; the discussion about the Gauss-Legendre pixelization continues in #91.
PySM relies on libsharp to smooth maps over MPI. Libsharp is not maintained anymore: https://gitlab.mpcdf.mpg.de/mtr/libsharp
Is anyone using PySM over MPI? Should we simplify the code and remove MPI support, or find an alternative to libsharp (@mreineck, suggestions?)? @NicolettaK @giuspugl @bthorne93
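For illustration only, and not necessarily what PySM ended up doing: the smoothing itself can be done on a single node with healpy, which is the kind of non-MPI alternative discussed above. A minimal sketch with toy values, assuming healpy is installed:

# Toy single-node smoothing of an IQU map with healpy instead of libsharp+MPI.
import numpy as np
import healpy as hp

nside = 128                                  # toy resolution
npix = hp.nside2npix(nside)
iqu = np.random.default_rng(0).standard_normal((3, npix))    # placeholder map
smoothed = hp.smoothing(iqu, fwhm=np.radians(1.0), pol=True)  # 1-degree beam
print(np.shape(smoothed))                    # (3, npix)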