Closed mweastwood closed 4 years ago
Hi Michael,
I'll start by explaining the current "minimal working API" with the understanding that it can be improved and/or modified to suit your needs.
The philosophy in this repository is to separate a full spherical harmonic synthesis/analysis into phases we can understand well:
Once spherical harmonic expansions are transformed to bivariate Fourier series, conceptually many other transforms are easier. In particular, synthesis and analysis are done via DCTs and DSTs. In this sense, yes, it should be possible to have a different number of pixels on each ring. This would just require zero-padding to the maximal ring sampling.
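Concretely, the zero-padding might look like the following (a hypothetical sketch assuming a centered mode ordering; the actual FFT layout in FastTransforms.jl may differ):

```julia
# Hypothetical sketch: pad one ring's Fourier coefficients (stored in
# centered order, modes -K..K, so 2K+1 values) up to the maximal ring
# length Nmax by surrounding them with zeros.
function pad_to_common_length(ring_coeffs::Vector{Float64}, Nmax::Int)
    N = length(ring_coeffs)
    padded = zeros(Nmax)
    offset = (Nmax - N) ÷ 2        # keep the modes centered
    padded[offset+1:offset+N] = ring_coeffs
    padded
end

padded = pad_to_common_length([1.0, 2.0, 3.0], 7)
# padded == [0.0, 0.0, 1.0, 2.0, 3.0, 0.0, 0.0]
```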
Storage: a `SphArray` type could wrap a `Vector` representing the spherical harmonic coefficients. However, the transformation to bivariate Fourier series is mathematically rectangular, so storing the extra zeros in the spherical harmonic array costs only a factor of 2 (which is made up for by the transform being real-to-real).

To familiarize yourself, check out the synthesis/analysis test file. If I change the declaration to `F = zeros(Float64, n, 2n-1); F[1] = sqrt(4π);` and remove the line `normalizecolumns!(F)`, then the output is an array of ones, which is consistent with your documentation example.
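The reason this yields an array of ones: with orthonormalized harmonics, the constant mode is Y_0^0 = 1/√(4π), so the coefficient √(4π) synthesizes

```latex
\sqrt{4\pi}\, Y_0^0(\theta,\varphi) = \sqrt{4\pi} \cdot \frac{1}{\sqrt{4\pi}} = 1 .
```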
Warnings: the fast transforms require pre-computation (the `ThinSphericalHarmonicPlan` in the code). They are well-suited for time evolution of PDEs, where the pre-computation is done once and the transform is executed many times. The "slow" transforms are in fact faster than the "fast" transforms below roughly one million degrees of freedom (with some platform dependence too, I guess).

Awesome, thanks for the information! I'll check out the synthesis/analysis test file and let you know if I run into any snags.
P.S. here's a PDF of a presentation I gave this summer https://home.cc.umanitoba.ca/~slevinrm/PDF/SphericalHarmonics2017.pdf (and for more technical details there's the pre-print on the arXiv https://arxiv.org/abs/1705.05448)
P.P.S. I use the Condon–Shortley phase convention because I first encountered spherical harmonics in electronic structure calculations, but it's just a normalization.
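For reference, in the Condon–Shortley convention the orthonormalized harmonics carry a factor of (−1)^m on the associated Legendre part:

```latex
Y_\ell^m(\theta,\varphi) = (-1)^m \sqrt{\frac{2\ell+1}{4\pi}\,\frac{(\ell-m)!}{(\ell+m)!}}\; P_\ell^m(\cos\theta)\, e^{im\varphi} .
```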
Apologies for the long delay. I've finally gotten around to this.
For map synthesis, FastTransforms.jl and LibHealpix.jl agree very closely. I think the dominant source of error is in interpolating the two representations onto a common grid. For small L, FastTransforms.jl is faster, but above L~1000 (just before the transition between your slow plan and thin plan), LibHealpix.jl is faster (not counting any of the pre-computation).
For analysis, FastTransforms.jl is much more accurate than LibHealpix.jl. So that's really awesome!
A couple of comments:

Is there any way to reduce the pre-computation time for `sph2fourier`/`fourier2sph` with the thin plan? It seems to dominate the cost of the FFTW synthesis and analysis steps.

Oops, I take back all my comments about performance. LibHealpix was running with 16 threads. If I restrict it to one thread, FastTransforms.jl blows it out of the water. Great stuff!
I wonder what it would take to make FastTransforms.jl multi-threaded. I suppose it would be best done in HierarchicalMatrices.jl.
We could allow an even number of pixels along each ring, but we'd have to choose a convention: for three points, we return an expansion in the basis {1, sin(θ), cos(θ)}, but for four points should the expansion be {1, sin(θ), cos(θ), sin(2θ)}, {1, sin(θ), cos(θ), cos(2θ)} or a weighted average of the two? I think ApproxFun uses sin(2θ) in the above.
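To see why a convention is needed at all, note that with the four equispaced nodes θ_j = 2πj/4 starting at zero, sin(2θ) vanishes at every node, so only the cos(2θ) choice gives a nonsingular interpolation problem on those nodes; with half-shifted nodes the roles swap. A quick numerical check (a standalone sketch, not package code):

```julia
using LinearAlgebra   # for det (on Julia 0.6 this lived in Base)

θ = [2π*j/4 for j in 0:3]                     # nodes 0, π/2, π, 3π/2
A_sin = [ones(4) sin.(θ) cos.(θ) sin.(2θ)]    # basis {1, sinθ, cosθ, sin2θ}
A_cos = [ones(4) sin.(θ) cos.(θ) cos.(2θ)]    # basis {1, sinθ, cosθ, cos2θ}

abs(det(A_sin))   # ≈ 0: the sin(2θ) column is (numerically) zero at all nodes
abs(det(A_cos))   # ≈ 8: this basis interpolates uniquely on these nodes
```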
The bulk of the work in `sph2fourier` is the high-order to low-order conversion, not synthesis/analysis and not the Legendre-to-Chebyshev transform. Finding the right transition point between the `SlowPlan` and the `ThinPlan` is platform-dependent: on my computer it was about 1024. If you're finding something different, you can always construct the plan you want by circumventing (or modifying!) the wrapper `plan_sph2fourier`.
I have no experience with multithreading in Julia, but to thread the slow plan, I think we'd need to speed up the following MWE:
Richards-MBP-2:~ Mikael$ export JULIA_NUM_THREADS=8
Richards-MBP-2:~ Mikael$ julia
               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: https://docs.julialang.org
   _ _   _| |_  __ _   |  Type "?help" for help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.6.1-pre.0 (2017-06-19 13:06 UTC)
 _/ |\__'_|_|_|\__'_|  |  Commit dcf39a1dda* (145 days old release-0.6)
|__/                   |  x86_64-apple-darwin16.6.0
julia> A = [rand(1000) for i = 1:1000];
julia> G = [givens(sqrt(2)/2, sqrt(2)/2, i, i+1)[1] for i = 1:999];
julia> @time for i = 1:1000
           for j = 1:999
               A_mul_B!(G[j], A[i])
           end
       end
0.080361 seconds (1.98 M allocations: 60.631 MiB, 8.38% gc time)
julia> @time Threads.@threads for i = 1:1000
           for j = 1:999
               A_mul_B!(G[j], A[i])
           end
       end
0.071057 seconds (612.59 k allocations: 22.379 MiB, 15.06% gc time)
A tedious but promising strategy is to carefully consider the order of the Givens rotations and use vectorized kernels (`dlartv.f` in LAPACK vs. `drot.f` in BLAS). This strategy has been used for symmetric band reduction algorithms via Givens rotations:
L. Kaufman. Band reduction algorithms revisited. ACM Trans. Math. Software, 26:551–567, 2000.
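For concreteness, here's a minimal sketch of the `dlartv`-style idea (my own illustration, not the package's kernel): apply one rotation across all columns of a row pair at once, giving one long vectorizable inner loop instead of many short ones:

```julia
# Apply the Givens rotation G = [c s; -s c] to rows i and i+1 of A,
# sweeping across all columns in place (dlartv-style).
function rowrot!(A::Matrix{Float64}, c::Float64, s::Float64, i::Int)
    @inbounds for j in 1:size(A, 2)
        a, b = A[i, j], A[i+1, j]
        A[i, j]   =  c*a + s*b
        A[i+1, j] = -s*a + c*b
    end
    A
end

A = [1.0 2.0; 3.0 4.0]
rowrot!(A, sqrt(2)/2, sqrt(2)/2, 1)
# first row is now (1+3, 2+4)/√2; second row is (3-1, 4-2)/√2
```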
We might also consider throwing away the `RotationPlan` and computing the sines and cosines on the fly; this might be faster than the memory accesses.
@dlfivefifty, I think hierarchical matrices might already be multithreaded for `BlasFloat` types if BLAS is running with multiple threads.
So if I monitor CPU usage, it's definitely making use of all the available BLAS threads, but for some reason it's not actually completing any faster (even with `OMP_NUM_THREADS=16` vs `OMP_NUM_THREADS=1`). Maybe that could use some benchmarking.
I've attached below my first stab at a slightly more convenient interface which I'm using to convert and compare with LibHealpix.jl.
struct SHT
    sph2fourier_plan
    synthesis_plan
    analysis_plan
    lmax :: Int
    mmax :: Int
    size :: Tuple{Int, Int}
end

function plan_sht(lmax, mmax, size)
    alm = zeros(lmax+1, 2mmax+1)
    map = zeros(size)
    sph2fourier_plan = plan_sph2fourier(alm)
    synthesis_plan = FastTransforms.plan_synthesis(map)
    analysis_plan = FastTransforms.plan_analysis(map)
    SHT(sph2fourier_plan, synthesis_plan, analysis_plan, lmax, mmax, size)
end

struct Alm
    lmax :: Int
    mmax :: Int
    matrix :: Matrix{Float64}
end

function Alm(lmax, mmax)
    matrix = zeros(lmax+1, 2mmax+1)
    Alm(lmax, mmax, matrix)
end

function Base.getindex(alm::Alm, l, m)
    idx = l - m + 1
    jdx = 2m + 1
    if m == 0
        return complex(alm.matrix[idx, jdx])
    else
        return complex(alm.matrix[idx, jdx], alm.matrix[idx, jdx-1]) / √2
    end
end

function Base.setindex!(alm::Alm, value, l, m)
    idx = l - m + 1
    jdx = 2m + 1
    if m == 0
        alm.matrix[idx, jdx] = real(value)
        return complex(real(value))
    else
        sqrt2 = √2
        alm.matrix[idx, jdx] = real(value) * sqrt2
        alm.matrix[idx, jdx-1] = imag(value) * sqrt2
        return value
    end
end

struct Map <: AbstractMatrix{Float64}
    matrix :: Matrix{Float64}
end

Base.getindex(map::Map, idx...) = map.matrix[idx...]
Base.setindex!(map::Map, value, idx...) = map.matrix[idx...] = value
Base.size(map::Map) = size(map.matrix)

function alm2map(sht, alm)
    # convert to bivariate Fourier series
    @time fourier = sht.sph2fourier_plan*alm.matrix
    # pad the Fourier series to the desired map size
    padded_fourier = zeros(eltype(fourier), sht.size)
    padded_fourier[1:sht.lmax+1, 1:2sht.mmax+1] = fourier
    # synthesize the map
    @time output = A_mul_B!(zero(padded_fourier), sht.synthesis_plan, padded_fourier)
    Map(output)
end

function map2alm(sht, map)
    # analyze the map
    @time fourier = A_mul_B!(zero(map.matrix), sht.analysis_plan, map.matrix)
    # cut the Fourier series down to the right size (for the desired lmax, mmax)
    cut_fourier = fourier[1:sht.lmax+1, 1:2sht.mmax+1]
    # convert to spherical harmonic coefficients
    @time output = sht.sph2fourier_plan\cut_fourier
    Alm(sht.lmax, sht.mmax, output)
end

Base.:*(sht::SHT, map::Map) = map2alm(sht, map)
Base.:\(sht::SHT, alm::Alm) = alm2map(sht, alm)
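As a quick sanity check on the packing convention used by `Alm`: the √2 factors in `setindex!` and `getindex` cancel, so get/set round-trips exactly. (The relevant definitions are repeated below so the snippet runs standalone.)

```julia
# Repeating the Alm definitions from above so this snippet is self-contained.
struct Alm
    lmax :: Int
    mmax :: Int
    matrix :: Matrix{Float64}
end
Alm(lmax, mmax) = Alm(lmax, mmax, zeros(lmax+1, 2mmax+1))

function Base.getindex(alm::Alm, l, m)
    idx, jdx = l - m + 1, 2m + 1
    if m == 0
        return complex(alm.matrix[idx, jdx])
    else
        return complex(alm.matrix[idx, jdx], alm.matrix[idx, jdx-1]) / √2
    end
end

function Base.setindex!(alm::Alm, value, l, m)
    idx, jdx = l - m + 1, 2m + 1
    if m == 0
        alm.matrix[idx, jdx] = real(value)
    else
        alm.matrix[idx, jdx]   = real(value) * √2
        alm.matrix[idx, jdx-1] = imag(value) * √2
    end
    value
end

alm = Alm(4, 4)
alm[3, 2] = 1.0 + 2.0im   # m ≠ 0: stored as a √2-scaled real/imag pair
alm[2, 0] = 5.0 + 0.0im   # m = 0: only the real part is stored
# alm[3, 2] ≈ 1.0 + 2.0im and alm[2, 0] == 5.0 + 0.0im: the √2 factors cancel
```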
I would suggest adding the `sketch = :none` option to your plan:
function plan_sht(lmax, mmax, size)
    alm = zeros(lmax+1, 2mmax+1)
    map = zeros(size)
    sph2fourier_plan = plan_sph2fourier(alm; sketch = :none)
    synthesis_plan = FastTransforms.plan_synthesis(map)
    analysis_plan = FastTransforms.plan_analysis(map)
    SHT(sph2fourier_plan, synthesis_plan, analysis_plan, lmax, mmax, size)
end
This should halve the pre-computation time, independent of threading.
Hi Michael,
It's now possible to create a fast transform with superoptimal pre-computation and storage complexity:
https://arxiv.org/abs/1711.07866
(though I don't know how much this would help on the low-degree end vs. a direct method with threading etc...).
It turns out that OpenMP makes multithreading really easy in C/C++/Fortran once the parallelism is identified. On the branch feat-triangular-harmonics, I added a package build to dynamically link two functions in C that add multithreading to the O(n³) part of the spherical harmonic transform. Using this gist to compare this branch to master, my quad-core hyperthreaded Intel i7 processor gets a measurable speedup. For instance, for spherical harmonic degree n, the average times in seconds of three forward and backward transforms are:
Degree n | one thread | eight threads
---|---|---
128 | 0.00469 | 0.00244
256 | 0.0241 | 0.0101
512 | 0.203 | 0.0727
1024 | 1.47 | 0.432
2048 | 13.9 | 3.28
4096 | 129 | 26.6
Note that the Chebyshev--Legendre transform uses its own version of threading via BLAS, but may not be as efficient as a direct matrix-vector product for such small degrees, i.e. we may be hitting Amdahl's law here.
For comparison, pure Julia implementations of multithreading add the following columns. `Threads.@threads` uses Base's chunked distribution of the work, while `FastTransforms.@stepthreads` distributes the indices of a range in strides equal to the number of threads, which better balances the load of the transform.
Degree n | one thread | eight threads (C) | Threads.@threads | FastTransforms.@stepthreads
---|---|---|---|---
128 | 0.00469 | 0.00244 | 0.00306 | 0.00256
256 | 0.0241 | 0.0101 | 0.0219 | 0.0149
512 | 0.203 | 0.0727 | 0.112 | 0.102
1024 | 1.47 | 0.432 | 0.710 | 0.631
2048 | 13.9 | 3.28 | 6.24 | 4.41
4096 | 129 | 26.6 | 44.4 | 34.4
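The scheduling difference can be illustrated with plain index arithmetic (the ranges below are my own sketch of the two distributions, not the actual macro internals). For a triangular workload where iteration i costs about n − i, contiguous chunks pile the expensive early iterations onto the first thread, while strides spread them evenly:

```julia
# How n = 12 iterations are split over nt = 4 "threads":
nt, n = 4, 12
chunked(t) = ((t - 1)*cld(n, nt) + 1):min(t*cld(n, nt), n)  # Threads.@threads-style blocks
strided(t) = t:nt:n                                          # @stepthreads-style strides

# Cost model: iteration i costs n - i (e.g. shorter columns at higher order).
cost(r) = sum(n - i for i in r)
[cost(chunked(t)) for t in 1:nt]   # [30, 21, 12, 3]  — badly imbalanced
[cost(strided(t)) for t in 1:nt]   # [21, 18, 15, 12] — much more even
```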
An advantage of a pure Julia implementation is that it requires no package builds or tinkering with cross-platform makefiles, etc. (though it's clear even from these micro-benchmarks that in the long run it's worth experimenting with implementations in other languages).
I am working on a project that may make use of spherical harmonic transforms on Healpix grids, which is why I am interested in this conversation.
I have a beginner question: the 2D Fourier series I usually see looks like [equation not shown]. I have a hard time making sense of [equation not shown]. What is the connection between the f and the g? Do you have any suggested references for understanding the g?
Hi @pochoi, there are at least two ways to do this.
If you already have the FFT(W) routines to sample your function F(u,v) on an equiangular grid on S², it may be easier for you to go to the grid and come back using the methods here (see also the tests here).
Of course, this transformation can be done purely by reorganizing the arrays of coefficients and using identities such as e^{ix} = cos(x) + i sin(x) and 2cos(x) = e^{ix} + e^{-ix}, which shouldn't be too hard.
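For a single mode m, that reorganization amounts to the following (a hypothetical helper of my own, not an API of this package): given the coefficients (c₊, c₋) of e^{imφ} and e^{-imφ}, the cos/sin coefficients are a = c₊ + c₋ and b = i(c₊ − c₋).

```julia
# c₊ e^{imφ} + c₋ e^{-imφ} = a cos(mφ) + b sin(mφ)
expo2trig(cp, cm) = (cp + cm, im*(cp - cm))

cp = 0.3 + 0.4im
cm = conj(cp)            # conjugate-symmetric pair ⇒ real-valued function
a, b = expo2trig(cp, cm)
# a = 0.6 + 0.0im, b = -0.8 + 0.0im (both real, as expected)
```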
Probably any integration with LibHealpix.jl would happen outside this repository.
Hi Mikael,
I am trying to understand how difficult it would be to use FastTransforms.jl to compute spherical harmonic transforms of healpix maps. Healpix maps have pixels arranged along rings of constant latitude, which the C++ Healpix library uses to give me "fast" spherical harmonic transforms. I would really like to compare your spherical harmonic transform against the existing one, because I anticipate yours is probably much faster.
Unfortunately I'm having a hard time understanding what format your routines need. Are there any plans to expand and document the synthesis and analysis code? Can it handle the case where there is a different number of pixels on each ring?
Github: https://github.com/mweastwood/LibHealpix.jl
Docs: http://mweastwood.info/LibHealpix.jl/stable/