JuliaApproximation / FastTransforms.jl

:rocket: Julia package for orthogonal polynomial transforms :snowboarder:

integration with LibHealpix.jl #28

Closed mweastwood closed 4 years ago

mweastwood commented 7 years ago

Hi Mikael,

I am trying to understand how difficult it would be to use FastTransforms.jl to compute spherical harmonic transforms of healpix maps. Healpix maps have pixels arranged along rings of constant latitude, which the C++ Healpix library uses to give me "fast" spherical harmonic transforms. I would really like to compare your spherical harmonic transform against the existing one, because I anticipate yours is probably much faster.

Unfortunately I'm having a hard time understanding what format your routines need. Are there any plans to expand and document the synthesis and analysis code? Can it handle the case where there are a different number of pixels on each ring?

Github: https://github.com/mweastwood/LibHealpix.jl

Docs: http://mweastwood.info/LibHealpix.jl/stable/

MikaelSlevinsky commented 7 years ago

Hi Michael,

I'll start by explaining the current "minimal working API" with the understanding that it can be improved and/or modified to suit your needs.

The philosophy in this repository is to separate a full spherical harmonic synthesis/analysis into phases we can understand well.

Once spherical harmonic expansions are transformed to bivariate Fourier series, many other transforms become conceptually easier. In particular, synthesis and analysis are done via DCTs and DSTs. In this sense, yes, it should be possible to have a different number of pixels on each ring; it would just require zero-padding each ring up to the maximal ring sampling.
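As a rough sketch of that zero-padding idea (hypothetical helper, not FastTransforms.jl API), each shorter ring would be padded with zeros up to the maximal ring length so that every ring shares one transform size:

```julia
# Hypothetical helper (not FastTransforms.jl API): zero-pad each ring's
# samples up to the maximal ring length so all rings share one transform size.
function pad_ring(ring::Vector{Float64}, nmax::Int)
    padded = zeros(Float64, nmax)
    padded[1:length(ring)] .= ring
    return padded
end

rings = [rand(4), rand(8), rand(6)]          # rings with different pixel counts
nmax = maximum(length.(rings))               # here 8
padded = [pad_ring(r, nmax) for r in rings]  # now a uniform set of length-8 rings
```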

Storage:

To familiarize yourself, check out the synthesis/analysis test file. If I change the declaration to `F = zeros(Float64, n, 2n-1); F[1] = sqrt(4π);` and remove the line `normalizecolumns!(F)`, then the output is an array of ones, which is consistent with your documentation example.
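To see why that modification should yield an array of ones: since Y_0^0 = 1/√(4π), the single coefficient √(4π) on the constant mode synthesizes the constant function 1. A minimal sketch of the setup (sizes are illustrative):

```julia
# n plays the role of the band limit in the test file; 4 is illustrative.
n = 4
F = zeros(Float64, n, 2n - 1)   # spherical harmonic coefficient array
F[1] = sqrt(4π)                 # coefficient of Y_0^0 only
# Y_0^0 = 1/sqrt(4π), so this coefficient times the basis function is 1:
constant_value = F[1] * (1 / sqrt(4π))
```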

Warnings:

mweastwood commented 7 years ago

Awesome, thanks for the information! I'll check out the synthesis/analysis test file and let you know if I run into any snags.

MikaelSlevinsky commented 7 years ago

P.S. here's a PDF of a presentation I gave this summer https://home.cc.umanitoba.ca/~slevinrm/PDF/SphericalHarmonics2017.pdf (and for more technical details there's the pre-print on the arXiv https://arxiv.org/abs/1705.05448)

P.P.S. I use the Condon–Shortley phase convention because I first encountered spherical harmonics in electronic structure calculations, but it's just a normalization.

mweastwood commented 6 years ago

Apologies for the long delay. I've finally gotten around to this.

For map synthesis, FastTransforms.jl and LibHealpix.jl agree very closely. I think the dominant source of error is in interpolating the two representations onto a common grid. For small L, FastTransforms.jl is faster, but above L~1000 (just before the transition between your slow plan and thin plan), LibHealpix.jl is faster (not counting any of the pre-computation).

For analysis, FastTransforms.jl is much much more accurate than LibHealpix.jl. So that's really awesome!

A couple comments:

  1. It looks like FastTransforms.jl assumes that the map has an odd number of pixels along each ring; could that restriction be relaxed?
  2. Do you have any idea what the bottleneck is in sph2fourier/fourier2sph with the thin plan? That seems to dominate over the cost of the FFTW synthesis and analysis steps.
mweastwood commented 6 years ago

Oops, I take back all my comments about performance. LibHealpix was running with 16 threads. If I restrict it to one thread FastTransforms.jl blows it out of the water. Great stuff!

dlfivefifty commented 6 years ago

I wonder what it would take to make FastTransforms.jl multi-threaded. I suppose it would be best done in HierarchicalMatrices.jl.

MikaelSlevinsky commented 6 years ago

We could allow an even number of pixels along each ring, but we'd have to choose a convention: for three points, we return an expansion in the basis {1, sin(θ), cos(θ)}, but for four points should the expansion be {1, sin(θ), cos(θ), sin(2θ)}, {1, sin(θ), cos(θ), cos(2θ)} or a weighted average of the two? I think ApproxFun uses sin(2θ) in the above.
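One concrete reason a convention is needed (my own illustration, not FastTransforms.jl code): on the 4-point equispaced grid that contains θ = 0, the samples of sin(2θ) vanish identically, so only the cos(2θ) choice is invertible there, while on a half-sample-shifted grid the roles swap:

```julia
# Two grids of four equispaced points on [0, 2π):
θ0 = [π*k/2 for k in 0:3]        # grid containing θ = 0
θ1 = [π*(2k+1)/4 for k in 0:3]   # half-sample-shifted grid
sin.(2θ0)   # the sin(2θ) column is identically zero on θ0
cos.(2θ0)   # while cos(2θ) alternates ±1 on θ0
sin.(2θ1)   # on θ1, sin(2θ) alternates ±1 instead
cos.(2θ1)   # and cos(2θ) vanishes
```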

The bulk of the work in sph2fourier is the high-order to low-order conversion, not synthesis/analysis and not Legendre-to-Chebyshev. Finding the right transition point between the SlowPlan and the ThinPlan would be platform dependent: for my computer it was about 1024. If you're finding something different, you can always construct the plan you want by circumventing the wrapper plan_sph2fourier (or modifying it!).

I have no experience multithreading in Julia, but to thread the slow plan, I think we'd need to speed up the following MWE:

Richards-MBP-2:~ Mikael$ export JULIA_NUM_THREADS=8
Richards-MBP-2:~ Mikael$ julia
               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: https://docs.julialang.org
   _ _   _| |_  __ _   |  Type "?help" for help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.6.1-pre.0 (2017-06-19 13:06 UTC)
 _/ |\__'_|_|_|\__'_|  |  Commit dcf39a1dda* (145 days old release-0.6)
|__/                   |  x86_64-apple-darwin16.6.0

julia> A = [rand(1000) for i = 1:1000];

julia> G = [givens(sqrt(2)/2, sqrt(2)/2, i, i+1)[1] for i = 1:999];

julia> @time for i = 1:1000
           for j = 1:999
               A_mul_B!(G[j], A[i])
           end
       end
  0.080361 seconds (1.98 M allocations: 60.631 MiB, 8.38% gc time)

julia> @time Threads.@threads for i = 1:1000
           for j = 1:999
               A_mul_B!(G[j], A[i])
           end
       end
  0.071057 seconds (612.59 k allocations: 22.379 MiB, 15.06% gc time)

A tedious but promising strategy is to carefully consider the order of Givens rotations and use vectorized kernels (dlartv.f in LAPACK vs drot.f in BLAS). This strategy has been used for symmetric band reduction algorithms via Givens rotations.

L. Kaufman. Band reduction algorithms revisited. ACM Trans. Math. Software, 26:551–567, 2000.
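A minimal sketch of what such a fused kernel might look like (my own illustration of the dlartv-style idea, not code from LAPACK or this package): apply a whole sequence of rotations, each with its own cosine–sine pair, to adjacent elements in one loop, instead of one BLAS call per rotation:

```julia
# Apply rotation j, with cosine c[j] and sine s[j], to the pair (x[j], x[j+1]),
# mirroring the givens(..., i, i+1) pattern in the MWE above.
function apply_rotations!(x::Vector{Float64}, c::Vector{Float64}, s::Vector{Float64})
    @inbounds for j in eachindex(c)
        xj, xj1 = x[j], x[j+1]
        x[j]   =  c[j]*xj + s[j]*xj1
        x[j+1] = -s[j]*xj + c[j]*xj1
    end
    return x
end

x = [1.0, 0.0]
apply_rotations!(x, [sqrt(2)/2], [sqrt(2)/2])  # one 45° rotation
```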

We might also consider throwing away RotationPlan and computing the sines and the cosines on-the-fly. This might be faster than the memory accesses.

@dlfivefifty, I think hierarchical matrices might already be multithreaded for BlasFloat types if BLAS is running with multiple threads.

mweastwood commented 6 years ago

So if I monitor CPU usage, it's definitely making use of all the available BLAS threads, but for some reason it's not actually completing any faster (even with OMP_NUM_THREADS=16 vs OMP_NUM_THREADS=1). Maybe that could use some benchmarking.

I've attached below my first stab at a slightly more convenient interface which I'm using to convert and compare with LibHealpix.jl.

struct SHT
    sph2fourier_plan
    synthesis_plan
    analysis_plan
    lmax :: Int
    mmax :: Int
    size :: Tuple{Int, Int}
end 

function plan_sht(lmax, mmax, size)
    alm = zeros(lmax+1, 2mmax+1)
    map = zeros(size)
    sph2fourier_plan = plan_sph2fourier(alm)
    synthesis_plan = FastTransforms.plan_synthesis(map)
    analysis_plan = FastTransforms.plan_analysis(map)
    SHT(sph2fourier_plan, synthesis_plan, analysis_plan, lmax, mmax, size)
end 

struct Alm
    lmax :: Int
    mmax :: Int
    matrix :: Matrix{Float64}
end 

function Alm(lmax, mmax)
    matrix = zeros(lmax+1, 2mmax+1)
    Alm(lmax, mmax, matrix)
end 

function Base.getindex(alm::Alm, l, m)
    idx = l - m + 1
    jdx = 2m + 1
    if m == 0
        return complex(alm.matrix[idx, jdx])
    else
        return complex(alm.matrix[idx, jdx], alm.matrix[idx, jdx-1]) / √2
    end
end

function Base.setindex!(alm::Alm, value, l, m)
    idx = l - m + 1
    jdx = 2m + 1
    if m == 0
        alm.matrix[idx, jdx] = real(value)
        return complex(real(value))
    else
        sqrt2 = √2
        alm.matrix[idx, jdx]   = real(value) * sqrt2
        alm.matrix[idx, jdx-1] = imag(value) * sqrt2
        return value
    end
end

struct Map <: AbstractMatrix{Float64}
    matrix :: Matrix{Float64}
end

Base.getindex(map::Map, idx...) = map.matrix[idx...]
Base.setindex!(map::Map, value, idx...) = map.matrix[idx...] = value
Base.size(map::Map) = size(map.matrix)

function alm2map(sht, alm)
    # convert to bivariate Fourier series
    @time fourier = sht.sph2fourier_plan*alm.matrix

    # pad the Fourier series to the desired map size
    padded_fourier = zeros(eltype(fourier), sht.size)
    padded_fourier[1:sht.lmax+1, 1:2sht.mmax+1] = fourier

    # synthesize the map
    @time output = A_mul_B!(zero(padded_fourier), sht.synthesis_plan, padded_fourier)
    Map(output)
end

function map2alm(sht, map)
    # analyze the map
    @time fourier = A_mul_B!(zero(map.matrix), sht.analysis_plan, map.matrix)

    # cut the Fourier series down to the right size (for the desired lmax, mmax)
    cut_fourier = fourier[1:sht.lmax+1, 1:2sht.mmax+1]

    # convert to spherical harmonic coefficients
    @time output = sht.sph2fourier_plan\cut_fourier
    Alm(sht.lmax, sht.mmax, output)
end

Base.:*(sht::SHT, map::Map) = map2alm(sht, map)
Base.:\(sht::SHT, alm::Alm) = alm2map(sht, alm)
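As a standalone sanity check of the √2 packing used in `Alm` above (runnable without FastTransforms.jl): writing (re·√2, im·√2) into the two real slots and reading back with `complex(...)/√2` recovers the coefficient:

```julia
value = 1.0 + 2.0im
slot_re = real(value) * √2                   # what setindex! stores in column 2m+1
slot_im = imag(value) * √2                   # what setindex! stores in column 2m
recovered = complex(slot_re, slot_im) / √2   # what getindex returns
```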
MikaelSlevinsky commented 6 years ago

I would suggest adding the `sketch = :none` option to your plan:

function plan_sht(lmax, mmax, size)
    alm = zeros(lmax+1, 2mmax+1)
    map = zeros(size)
    sph2fourier_plan = plan_sph2fourier(alm; sketch = :none)
    synthesis_plan = FastTransforms.plan_synthesis(map)
    analysis_plan = FastTransforms.plan_analysis(map)
    SHT(sph2fourier_plan, synthesis_plan, analysis_plan, lmax, mmax, size)
end 

This should halve the pre-computation time, independent of threading.

MikaelSlevinsky commented 6 years ago

Hi Michael,

it's now possible to create a fast transform with superoptimal complexity pre-computation and storage:

https://arxiv.org/abs/1711.07866

(though I don't know how much this would help on the low-degree end vs. a direct method with threading etc...).

MikaelSlevinsky commented 6 years ago

It turns out that OpenMP makes multithreading really easy in C/C++/Fortran once the parallelism is identified. On the branch `feat-triangular-harmonics`, I added a package build to dynamically link two functions in C that add multithreading to the O(n³) part of the spherical harmonic transform. Using this gist to compare this branch to master, my quad-core hyperthreaded Intel i7 processor gets a measurable speedup. For instance, for spherical harmonic degree n, the average times in seconds of three forward and backward transforms are:

| Degree n | one thread | eight threads |
| --- | --- | --- |
| 128 | 0.00469 | 0.00244 |
| 256 | 0.0241 | 0.0101 |
| 512 | 0.203 | 0.0727 |
| 1024 | 1.47 | 0.432 |
| 2048 | 13.9 | 3.28 |
| 4096 | 129 | 26.6 |

Note that the Chebyshev--Legendre transform uses its own version of threading via BLAS, but may not be as efficient as a direct matrix-vector product for such small degrees, i.e. we may be hitting Amdahl's law here.

MikaelSlevinsky commented 6 years ago

For comparison, pure Julia implementations of multithreading add the following columns. `Threads.@threads` uses Base's strategy of distributing the work (chunking), while `FastTransforms.@stepthreads` distributes the iterations of a range in strides equal to the number of threads, which better balances the load of the transform.

| Degree n | one thread | eight threads (C) | Threads.@threads | FastTransforms.@stepthreads |
| --- | --- | --- | --- | --- |
| 128 | 0.00469 | 0.00244 | 0.00306 | 0.00256 |
| 256 | 0.0241 | 0.0101 | 0.0219 | 0.0149 |
| 512 | 0.203 | 0.0727 | 0.112 | 0.102 |
| 1024 | 1.47 | 0.432 | 0.710 | 0.631 |
| 2048 | 13.9 | 3.28 | 6.24 | 4.41 |
| 4096 | 129 | 26.6 | 44.4 | 34.4 |

An advantage of a pure Julia implementation is not requiring package builds or tinkering with cross-platform makefiles (though it's clear even from these micro-benchmarks that, in the long run, it's worth experimenting with implementations in other languages).
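The two distributions can be illustrated for 8 iterations on 2 threads (the chunking reflects `Threads.@threads` semantics as I understand them; the strided sets are my own illustration of the `@stepthreads` idea):

```julia
nthreads, n = 2, 8
# Chunking: thread t gets one contiguous block of iterations.
chunked = [[i for i in 1:n if cld(i, cld(n, nthreads)) == t] for t in 1:nthreads]
# Striding: thread t gets every nthreads-th iteration starting at t,
# which mixes cheap and expensive iterations across threads.
strided = [collect(t:nthreads:n) for t in 1:nthreads]
```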

pochoi commented 5 years ago

I am working on a project that may make use of the spherical harmonic transform on Healpix maps, which is why I am interested in this conversation.

I have a beginner question: the 2D Fourier series I usually see looks like this:

[screenshot: the usual 2D Fourier series, denoted f]

I have a hard time making sense of

[screenshot: the bivariate Fourier series used here, denoted g]

What is the connection between f and g? Do you have any suggested reference for an easier understanding of g?

MikaelSlevinsky commented 5 years ago

Hi @pochoi, there are at least two ways to do this.

If you already have the FFT(W) routine to sample your function F(u,v) on an equiangular grid on S², it may be easier for you to go to the grid and come back using the methods here (see also the tests here).

Of course, this transformation can be done purely by reorganizing the arrays of coefficients and using identities such as e^{ix} = cos(x) + i sin(x) and 2cos(x) = e^{ix} + e^{-ix}, which shouldn't be too hard.
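For the lowest azimuthal modes, that reorganization is just a 2×2 change of basis. A hedged sketch (my notation, not the package's internal layout): convert the coefficients of c₋₁e^{−ix} + c₀ + c₁e^{ix} into those of a₀ + a₁cos(x) + b₁sin(x):

```julia
# From e^{±ix} = cos(x) ± i*sin(x):
#   c₋₁e^{-ix} + c₀ + c₁e^{ix} = c₀ + (c₁ + c₋₁)cos(x) + i(c₁ - c₋₁)sin(x)
function exp_to_trig(cm1::Complex, c0::Complex, c1::Complex)
    a0 = c0
    a1 = c1 + cm1          # coefficient of cos(x)
    b1 = im * (c1 - cm1)   # coefficient of sin(x)
    return a0, a1, b1
end

# e^{ix} alone (c₁ = 1) should give cos(x) + i*sin(x):
a0, a1, b1 = exp_to_trig(0.0im, 0.0im, 1.0 + 0.0im)
```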

MikaelSlevinsky commented 4 years ago

Probably any integration with LibHealpix.jl would happen outside this repository.