Open AshtonSBradley opened 7 years ago
A fast Hermite transform is definitely in the scope of this package.
I just finished reading the paper and I have a couple of comments, chiefly about the degree k of the interpolation used (N.B. it's well-known that Newton–Cotes of high degree is also unstable, so it's not as simple as just cranking up the order), and questions:
One option is to write a wrapper in FastTransforms.jl for the C code on Rockmore's website. But sometimes getting a Pkg.build to work across all platforms is more hassle than it's worth. Another option is to develop a better Hermite transform. @ajt60gaibb, is there any low-rank magic here?
Hi Mikael, thanks for your comments.
I do need a fast FHT for degrees 30–300, as it is the main bottleneck of a time-evolution algorithm that my group uses quite a bit. In practice we do separate HTs over x, y, z using separability. However, I appreciate that there may not be anything faster than O(N^2) for these values of N. Precomputation is not an issue, as we do the same transform many times during time evolution. I have my own version of basically ApproxFun's hermitetransform.jl that gets past overflow in precomputation by using BigFloat to build the transform matrix. This allows me to push the degree up to about 350.
Mapped Chebyshev expansion sounds very interesting. Does this come down to a Clenshaw–Curtis quadrature to evaluate on a larger grid than the Gauss points?
Since Hermite functions are scaled parabolic cylinder functions (see http://dlmf.nist.gov/18.11#E3) and parabolic cylinder functions have a known trigonometric asymptotic expansion (see http://dlmf.nist.gov/12.10#E18), there should be a fast Hermite transform. That is, in the "large parameter" region of the Hermite transform matrix, the matrix can be well-approximated by a sum of nonuniform DCTs and nonuniform DSTs (it is low-rank in frequency space). With a few weeks of painful work, this could probably be turned into a fast Hermite transform. (Lots of details to work out, though.) Of course, to do this one needs to scale the Hermite polynomials appropriately to avoid overflow entirely, and to hard-threshold to 0 the entries that underflow. I think an O(n^{3/2}) transform may come from hard thresholding entries to 0 that underflow...
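For reference, my reading of the DLMF relation invoked here (with U the parabolic cylinder function in DLMF's notation) is

$$
U\!\left(-n-\tfrac{1}{2},\,\sqrt{2}\,x\right) \;=\; 2^{-n/2}\,e^{-x^2/2}\,H_n(x),
$$

so the entries of the weighted Hermite transform matrix are parabolic cylinder values, and DLMF 12.10.18 then gives their trigonometric behaviour in the oscillatory region, which is what makes a nonuniform DCT/DST approximation plausible.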
@MikaelSlevinsky, I imagine the transform matrix is butterflyable too.
In short, a fast Hermite transform seems possible to me.
On and off I discuss a fast Hermite transform with @daanhb, but there was never enough enthusiasm for it for me to put in the effort required.
Unfortunately, there seem to be some claims in the paper https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2630232/#!po=88.5000 that are unsubstantiated.
I think Hermite transforms are useless. But what you actually want is weighted Hermite transforms, that is, GaussWeight(Hermite()) in ApproxFun.
The observed stability issues should disappear when you work with weighted Hermite, though you still have to deal with underflow.
Hermite transforms are almost certainly nonsense when you think about n → ∞, as the tail will be wrong unless your function is a polynomial; and even if it is a polynomial, round-off error prevents getting the tail exactly right.
This probably also answers the question about Laguerre transforms.
I guess you are thinking about the interpolation/approximation problem. I was thinking the transform is: "Given a Hermite expansion, evaluate it at some points"...
Ah good point. I suspect that's also useless though.
👍
Ok interesting, I see I had missed some important parts of ApproxFun. Thanks all for your input.
Regarding comments above, my function is a sum of Hermite polynomials times exp(-x.^2/2). I have tried using GaussWeight() and it doesn't seem to lead to any extension of the highest order for Hermite, nor remove the need to use BigFloat.
For Laguerre, I can change the call to WeightedLaguerre, but the instability at N=27 seems to persist.
@AshtonSBradley,
Mapped Chebyshev expansion sounds very interesting. Does this come down to a Clenshaw–Curtis quadrature to evaluate on a larger grid than the Gauss points?
By this, I mean using Chebyshev polynomials and an affine variable transformation to work on a sufficiently large interval [a, b].
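A minimal self-contained sketch of that idea in base Julia (no ApproxFun; the interval [-10, 10], the degree 6, and N = 120 are arbitrary illustrative choices): interpolate the weighted Hermite function e^{-x²/2} H_6(x) by Chebyshev polynomials on an affinely mapped interval, and evaluate the result with Clenshaw's recurrence.

```julia
# H_n(x) via the three-term recurrence H_{n+1} = 2x H_n - 2n H_{n-1}
function hermiteH(n, x)
    n == 0 && return one(x)
    p0, p1 = one(x), 2x
    for k in 1:n-1
        p0, p1 = p1, 2x*p1 - 2k*p0
    end
    return p1
end

# Chebyshev coefficients of f on [a, b] from N+1 Chebyshev points (2nd kind);
# O(N^2) here for clarity (a DCT would make it O(N log N))
function chebcoeffs(f, a, b, N)
    x = [(a + b)/2 + (b - a)/2*cos(pi*j/N) for j in 0:N]
    fv = f.(x)
    c = zeros(N + 1)
    for k in 0:N
        s = 0.0
        for j in 0:N
            w = (j == 0 || j == N) ? 0.5 : 1.0   # halve endpoint terms
            s += w * fv[j+1] * cos(pi*j*k/N)
        end
        c[k+1] = 2s/N
    end
    c[1] /= 2; c[end] /= 2
    return c
end

# Clenshaw evaluation of sum_k c[k+1] T_k(t), with x affinely mapped to t
function chebeval(c, a, b, x)
    t = (2x - a - b)/(b - a)
    b1, b2 = 0.0, 0.0
    for k in length(c):-1:2
        b1, b2 = 2t*b1 - b2 + c[k], b1
    end
    return t*b1 - b2 + c[1]
end

f(x) = exp(-x^2/2) * hermiteH(6, x)   # weighted Hermite function
c = chebcoeffs(f, -10.0, 10.0, 120)
err = abs(chebeval(c, -10.0, 10.0, 0.7) - f(0.7))   # should be tiny
```

Since the weighted function decays like a Gaussian, truncating to a finite interval costs only the (exponentially small) tail beyond [a, b].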
Unfortunately, I left my laptop at work, but on Monday I will try something along the lines of:
using ApproxFun, FastTransforms
import FastTransforms: Butterfly
P = # Get the actual matrix from the plan in ApproxFun.plan_transform(::Hermite(),rand(256)), say.
BF = Butterfly(P, floor(Int, log2(size(P, 1))-6); sketch = :none, atol = 1e-32);
r = FastTransforms.allranks(BF) # just diagnostic info.
mean(r), std(r)
x = rand(256);
y = zero(x);
z = zero(x);
@time A_mul_B!(y, P, x);
@time A_mul_B!(z, BF, x);
norm(y-z) # Did it work?
I think an O(n^{3/2}) transform may come from hard thresholding entries to 0 that underflow...
this would probably be a decent intermediate solution, as it would come down to calling gemv! for an n × O(n^{1/2}) matrix.
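To illustrate the shape of that intermediate solution (this is not the actual Hermite matrix, just a synthetic matrix with Gaussian off-diagonal decay, chosen so that roughly O(n^{1/2}) entries per row survive a hard threshold):

```julia
# Hard-threshold a matrix with Gaussian row decay and apply it as a banded
# gemv: only the entries above tol in each row are touched.
n = 1024
σ = sqrt(n)/4
A = [exp(-(i - j)^2/(2σ^2)) for i in 1:n, j in 1:n]
tol = 1e-14                      # stand-in for the underflow threshold

lo = [findfirst(v -> abs(v) > tol, A[i, :]) for i in 1:n]  # first surviving column
hi = [findlast(v -> abs(v) > tol, A[i, :]) for i in 1:n]   # last surviving column

function banded_mul(A, lo, hi, x)
    y = zeros(size(A, 1))
    for i in 1:size(A, 1)
        s = 0.0
        for j in lo[i]:hi[i]     # ~O(sqrt(n)) work per row instead of O(n)
            s += A[i, j]*x[j]
        end
        y[i] = s
    end
    return y
end

x = rand(n)
maxwidth = maximum(hi .- lo .+ 1)                      # a small multiple of sqrt(n)
err = maximum(abs.(A*x .- banded_mul(A, lo, hi, x)))   # bounded by ~n*tol
```

In a real implementation the surviving band would be copied into a contiguous banded/packed storage so a single gemv! (or gbmv!) call does the work.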
@MikaelSlevinsky , here is a result (seems to work!)
using ApproxFun, FastTransforms, BenchmarkTools
import FastTransforms: Butterfly
N=256
# Get the actual matrix from the plan in ApproxFun.plan_transform(::Hermite(),rand(256)), say.
PlanHerm = ApproxFun.plan_transform(Hermite(),rand(N))
Px,Py=PlanHerm.plan
P=Px*Py'
BF = Butterfly(P, floor(Int, log2(size(P, 1))-6); sketch = :none, atol = 1e-32);
r = FastTransforms.allranks(BF) # just diagnostic info.
4×3 Array{Int64,2}:
0 1 1
1 1 1
1 1 1
0 1 1
mean(r), std(r)
(0.8333333333333334,0.38924947208076144)
x = rand(N);
y = zero(x);
z = zero(x);
#already ran a warmup
@time A_mul_B!(y, P, x)
0.000147 seconds (4 allocations: 160 bytes)
@time A_mul_B!(z, BF, x)
0.000016 seconds (4 allocations: 160 bytes)
norm(y-z) # Did it work?
2.3778381601473506e-14
is that the actual matrix, though? Seems to me it should have the Vandermonde part to be mathematically full rank:
PlanHerm = ApproxFun.plan_transform(Hermite(),rand(N));
x,w = PlanHerm.plan;
V=ApproxFun.hermitep(0:N-1,x)';
nrm=(V.^2)*w;
P = Diagonal(inv.(nrm))*V*Diagonal(w)
ok, it overflows for N=256 (nrm explodes for Hermite). For N=128:
r = FastTransforms.allranks(BF) # just diagnostic info.
2×2 Array{Int64,2}:
15 20
15 0
mean(r), std(r)
(12.5,8.660254037844387)
@time A_mul_B!(y, P, x)
0.000076 seconds (4 allocations: 160 bytes)
@time A_mul_B!(z, BF, x)
0.000017 seconds (4 allocations: 160 bytes)
norm(y-z) # Did it work?
4.626203755857375e-16
My end goal with this is to apply this optimization to the action of three matrices (one for each spatial direction). I guess I should flatten the 3 matrix product on a 3D array into a single matrix multiply on a vector, and then apply this procedure to it?
Great! So if everything's computed stably, the butterfly algorithm should take O(mean(r)·N²) time to pre-compute, and O(mean(r)·N log N) to execute.
I wouldn't be surprised if the butterfly pre-computation would do just as well as an O(n^{3/2}) algorithm (for intermediate n) resulting from a lot of tedious analysis, since that's mainly built from hard thresholding which could be accounted for in the butterfly.
awesome! any scope for optimising parameters in Butterfly?
they're almost all passed through to LowRankApprox. So if you look at https://github.com/klho/LowRankApprox.jl/blob/master/src/LowRankApprox.jl#L58-#L101, and the repo's readme, you might be able to decipher what they all mean 😉.
By the way, this is why I implemented the butterfly algorithm https://arxiv.org/abs/1705.05448
thanks for the references! I noticed that if I try to act with BF on a complex vector, I get a method error. Should I be handling complex variables in a special way, or just transform real and imaginary parts separately?
complex methods haven't been implemented yet, so separate transformation is required at the moment.
I doubt you can do better than applying the transform to real and imaginary parts separately.
Is there a way in Julia to get a view of the real/imag part of a complex number? I think it should be possible by playing around with the stride.
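In current Julia this can be done with reinterpret plus a strided view (a sketch; whether it actually beats two copies would need benchmarking):

```julia
cv = [1.0 + 2.0im, 3.0 + 4.0im, 5.0 + 6.0im]
fv = reinterpret(Float64, cv)   # re, im, re, im, ... (no copy)
rev = @view fv[1:2:end]         # strided view of the real parts
imv = @view fv[2:2:end]         # strided view of the imaginary parts
rev[1] = 10.0                   # writes through to cv[1]
# cv[1] is now 10.0 + 2.0im
```

So one could in principle transform rev and imv separately without copying, although BLAS-backed products generally want unit stride, which is presumably why copying the parts out is the usual approach.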
My end goal with this is to apply this optimization to the action of three matrices (one for each spatial direction). I guess I should flatten the 3 matrix product on a 3D array into a single matrix multiply on a vector, and then apply this procedure to it?
FastTransforms has an unexported method A_mul_B_col_J!, https://github.com/MikaelSlevinsky/FastTransforms.jl/blob/master/src/SphericalHarmonics/Butterfly.jl#L187, which applies the matrix A to the Jth column of matrix B and stores the result in the provided output matrix. Looping over all columns of a matrix, this could accelerate the matrix-matrix product. It would be straightforward to write a method A_mul_B_col_J_slice_K! to accelerate applying the matrix A to every column in a tensor B.
Since Julia uses column-major ordering, it should be just as efficient to then "rotate" the data (in place) twice, each time applying the transform along the other faces of the cube of data. The alternative is to write A_mul_B!-like methods that run through rows and tubes of the data, but this involves accessing data at larger and larger strides, which would be inefficient (or just as costly as rotating the data) and, I think, unnecessarily complicated.
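The rotate-and-apply strategy can be sketched in a few lines of base Julia (P here is just a placeholder dense transform; permutedims allocates, and an in-place variant would use permutedims! with a preallocated buffer):

```julia
# Apply an N×N transform P along each of the three dimensions of an N×N×N
# array, always multiplying along the first (contiguous) dimension and
# cyclically rotating the data in between.
function transform3d(P, B)
    N = size(P, 1)
    for _ in 1:3
        B = reshape(P * reshape(B, N, N*N), N, N, N)  # transform current dim 1
        B = permutedims(B, (2, 3, 1))                 # next dim becomes dim 1
    end
    return B    # three rotations restore the original orientation
end

N = 4
P = rand(N, N)
B = rand(N, N, N)
C = transform3d(P, B)   # C[i,j,k] = Σ_{a,b,c} P[i,a] P[j,b] P[k,c] B[a,b,c]
```

Because each multiply runs along the first dimension, every gemm call sees contiguous columns, which is exactly the column-major argument made above.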
I have competing constraints - the advantage is only worthwhile if the transform size is large. For a large basis, I need to do recurrence on the orthonormal eigenfunctions, not the Hermite polynomials themselves, in order to get there with numerical stability. There is still a two-term recurrence, suggesting there would be something to gain, but when I butterfly the matrix, there is very little gain (less than a factor of 2, even for N=512). Is this what you would expect? Does the transform need to be in strict Vandermonde form (maybe it still is, but I haven't looked hard at this) or is it sufficient to have entries related by two-term recurrence?
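For what it's worth, the orthonormal-eigenfunction recurrence mentioned above, in a minimal base-Julia sketch (standard normalization ψ_n = H_n e^{-x²/2} / sqrt(2^n n! √π); it stays in Float64 range where H_n itself overflows, though it still underflows far outside the turning points):

```julia
# Orthonormal Hermite functions via the stable three-term recurrence
#   ψ_{n+1}(x) = sqrt(2/(n+1)) x ψ_n(x) - sqrt(n/(n+1)) ψ_{n-1}(x)
function hermitefunctions(nmax, x)
    ψ = zeros(nmax + 1)
    ψ[1] = pi^(-0.25) * exp(-x^2/2)            # ψ_0
    nmax ≥ 1 && (ψ[2] = sqrt(2.0) * x * ψ[1])  # ψ_1
    for n in 1:nmax-1
        ψ[n+2] = sqrt(2/(n+1))*x*ψ[n+1] - sqrt(n/(n+1))*ψ[n]
    end
    return ψ
end

ψ = hermitefunctions(500, 3.0)   # finite in Float64, where H_500(3) would overflow
```

Building the transform matrix from these ψ_n (with correspondingly modified quadrature weights) is what avoids the exploding nrm seen above, without resorting to BigFloat.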
It's because the butterfly is an asymptotically fast algorithm. The original authors found a break-even point at a square matrix of around 256 × 256, and it appears fully quasi-linear only beyond about 20,000 × 20,000.
It would probably require a specialized algorithm (and a superb implementation) for significant improvement for dimensions up to 512.
Cheers,
Mikael
ok interesting. For N=2048 it becomes a factor of 5. For 2D this is now a significant gain.
P.S. I've made some minor changes (on the master branch) that seem to speed up the butterfly matrix-vector product. My commitment is to make it faster over time!
@dlfivefifty added a medium-fast Hermite transform in https://github.com/JuliaApproximation/FastTransforms.jl/pull/74.
I was talking to @dlfivefifty over at ApproxFun.jl, where they have an implementation of the Hermite transform using Gauss–Hermite quadrature. While this is clearly a reliable choice, is there any scope for implementing an O(N log N) (i.e. faster than O(N^2)) method as described in https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2630232/#!po=88.5000 ? For band-limited functions there is a factorization using the recursion relation, akin to FFT vs basic FT. I don't know it well enough to know how practical it is to implement. On the surface, it appears an O(N log N) method may be possible for any orthogonal polynomials with a two-term recursion... EDIT: it seems that the exact two-term recursion factorisation is numerically unstable, and in practice the transform is implemented as an approximate Hermite–Newton–Cotes transform on a linear grid. Probably my question should be closed here.