Cholesky Jacobi matrices are insanely slow

dlfivefifty commented 9 months ago

The culprit is that UX is a lazy multiplication so the following line doesn't do anything:

https://github.com/JuliaApproximation/ClassicalOrthogonalPolynomials.jl/blob/fb28b6fcdc4d2ce06a59dff95f190b0957f49227/src/choleskyQR.jl#L104

Here is the slow code:

julia> r = 0.5
0.5

julia> lmin, lmax = (1-sqrt(r))^2,  (1+sqrt(r))^2
(0.08578643762690492, 2.914213562373095)

julia> U = chebyshevu(lmin..lmax);

julia> Q = OrthogonalPolynomial(x -> 1/(2π) * sqrt((lmax-x)*(x-lmin))/(x*r), U);

julia> J = jacobimatrix(Q);

julia> @time J[1:1000,1:1000]
 36.732166 seconds (178.69 k allocations: 1.624 GiB, 0.29% gc time)
1000×1000 BandedMatrix{Float64} with bandwidths (1, 1):
 1.0       0.707107   ⋅         ⋅         ⋅         ⋅         ⋅         ⋅    ⋅    ⋅    ⋅    ⋅   …   ⋅    ⋅    ⋅    ⋅    ⋅    ⋅    ⋅    ⋅         ⋅         ⋅         ⋅         ⋅         ⋅ 
 0.707107  1.5       0.707107   ⋅         ⋅         ⋅         ⋅         ⋅    ⋅    ⋅    ⋅    ⋅       ⋅    ⋅    ⋅    ⋅    ⋅    ⋅    ⋅    ⋅         ⋅         ⋅         ⋅         ⋅         ⋅ 
  ⋅        0.707107  1.5       0.707107   ⋅         ⋅         ⋅         ⋅    ⋅    ⋅    ⋅    ⋅       ⋅    ⋅    ⋅    ⋅    ⋅    ⋅    ⋅    ⋅         ⋅         ⋅         ⋅         ⋅         ⋅ 
  ⋅         ⋅        0.707107  1.5       0.707107   ⋅         ⋅         ⋅    ⋅    ⋅    ⋅    ⋅       ⋅    ⋅    ⋅    ⋅    ⋅    ⋅    ⋅    ⋅         ⋅         ⋅         ⋅         ⋅         ⋅ 
  ⋅         ⋅         ⋅        0.707107  1.5       0.707107   ⋅         ⋅    ⋅    ⋅    ⋅    ⋅       ⋅    ⋅    ⋅    ⋅    ⋅    ⋅    ⋅    ⋅         ⋅         ⋅         ⋅         ⋅         ⋅ 
  ⋅         ⋅         ⋅         ⋅        0.707107  1.5       0.707107   ⋅    ⋅    ⋅    ⋅    ⋅   …   ⋅    ⋅    ⋅    ⋅    ⋅    ⋅    ⋅    ⋅         ⋅         ⋅         ⋅         ⋅         ⋅ 
 ⋮                                                 ⋮                                  ⋮         ⋱                 ⋮                             ⋮                                       
  ⋅         ⋅         ⋅         ⋅         ⋅         ⋅         ⋅         ⋅    ⋅    ⋅    ⋅    ⋅   …   ⋅    ⋅    ⋅    ⋅    ⋅    ⋅    ⋅   0.707107  1.5       0.707107   ⋅         ⋅         ⋅ 
  ⋅         ⋅         ⋅         ⋅         ⋅         ⋅         ⋅         ⋅    ⋅    ⋅    ⋅    ⋅       ⋅    ⋅    ⋅    ⋅    ⋅    ⋅    ⋅    ⋅        0.707107  1.5       0.707107   ⋅         ⋅ 
  ⋅         ⋅         ⋅         ⋅         ⋅         ⋅         ⋅         ⋅    ⋅    ⋅    ⋅    ⋅       ⋅    ⋅    ⋅    ⋅    ⋅    ⋅    ⋅    ⋅         ⋅        0.707107  1.5       0.707107   ⋅ 
  ⋅         ⋅         ⋅         ⋅         ⋅         ⋅         ⋅         ⋅    ⋅    ⋅    ⋅    ⋅       ⋅    ⋅    ⋅    ⋅    ⋅    ⋅    ⋅    ⋅         ⋅         ⋅        0.707107  1.5       0.707107
  ⋅         ⋅         ⋅         ⋅         ⋅         ⋅         ⋅         ⋅    ⋅    ⋅    ⋅    ⋅       ⋅    ⋅    ⋅    ⋅    ⋅    ⋅    ⋅    ⋅         ⋅         ⋅         ⋅        0.707107  1.5

TSGut commented 9 months ago

Huh, UX should be a cached array. I'll fix it.

TSGut commented 9 months ago

The line

Q = OrthogonalPolynomial(x -> 1/(2π) * sqrt((lmax-x)*(x-lmin))/(x*r), U);

seems like it isn't supported anywhere (i.e. won't run for me), are you implementing new functionality or did you do that in a branch somewhere else?

In any case, I can see that the modification is basically 1/(x*r) and this thus falls under a rational case (I have reproduced the above by calling cholesky_jacobimatrix directly)? If you do this with Cholesky directly it will be doing the Cholesky decomposition of a Clenshaw multiplication operator with >100 degree making everything almost dense. For a significant speedup of this use case we should be using the QL or reverse Cholesky approach (which while adaptive QL is implemented, ql_jacobimatrix is not -- but I can do that if there is now actually a use case for it).

I will still go in and see if I can make the UX data cached because that is an oversight but for non-polynomial weight mods requiring large degrees to approximate I don't expect cholesky_jacobimatrix to be the way to go.

TSGut commented 9 months ago

Actually it looks like the primary culprit is just the resizing bug again? Because it's substantially faster if I just run J[1000,1000] first.

julia> r = 0.5
0.5

julia> lmin, lmax = (1-sqrt(r))^2,  (1+sqrt(r))^2
(0.08578643762690492, 2.914213562373095)

julia> U = Normalized(chebyshevu(lmin..lmax));

julia> w = x -> 1/(x*r)
#7 (generic function with 1 method)

julia> W = U \ (w.(axes(U,1)) .* U);

julia> W = Symmetric(W);

julia> J = cholesky_jacobimatrix(W, U);

julia> @time J[1000,1000]
  5.755462 seconds (256.60 k allocations: 33.267 MiB, 0.25% gc time, 2.16% compilation time)
1.5000000000000002

julia> @time J[1:1000,1:1000]
  0.053048 seconds (124.77 k allocations: 8.092 MiB, 99.20% compilation time)
1000×1000 BandedMatrix{Float64} with bandwidths (1, 1):
 1.0       0.707107   ⋅         ⋅         ⋅         ⋅        …   ⋅         ⋅         ⋅         ⋅         ⋅ 
 0.707107  1.5       0.707107   ⋅         ⋅         ⋅            ⋅         ⋅         ⋅         ⋅         ⋅
  ⋅        0.707107  1.5       0.707107   ⋅         ⋅            ⋅         ⋅         ⋅         ⋅         ⋅
  ⋅         ⋅        0.707107  1.5       0.707107   ⋅            ⋅         ⋅         ⋅         ⋅         ⋅
  ⋅         ⋅         ⋅        0.707107  1.5       0.707107      ⋅         ⋅         ⋅         ⋅         ⋅
  ⋅         ⋅         ⋅         ⋅        0.707107  1.5       …   ⋅         ⋅         ⋅         ⋅         ⋅
 ⋮                                                 ⋮         ⋱  ⋮
  ⋅         ⋅         ⋅         ⋅         ⋅         ⋅        …  1.5       0.707107   ⋅         ⋅         ⋅ 
  ⋅         ⋅         ⋅         ⋅         ⋅         ⋅           0.707107  1.5       0.707107   ⋅         ⋅
  ⋅         ⋅         ⋅         ⋅         ⋅         ⋅            ⋅        0.707107  1.5       0.707107   ⋅
  ⋅         ⋅         ⋅         ⋅         ⋅         ⋅            ⋅         ⋅        0.707107  1.5       0.707107   
  ⋅         ⋅         ⋅         ⋅         ⋅         ⋅            ⋅         ⋅         ⋅        0.707107  1.5

I thought we fixed this but maybe something happens in the resizing here that doesn't trigger the cache resizing fix.

EDIT: Yeah, this breaks out of our resizing fix because of that change we made to have the data stored as bands. That requires some custom resizing which means calling intervals like 1:1000 does 1000 resizes instead of 1. I will think about whether there is a simple fix

dlfivefifty commented 9 months ago

This is on the branch dl/choleskyinterface

In your example J[1000,1000] is ridiculously slow. And even 0.05s is slow for an O(n) algorithm. It should be something like 0.0001s

TSGut commented 9 months ago

Well it's more like $O(nm)$ where $m$ is the bandwidth of the weight modification which is quite high in this example but it's still much slower than it should be, there are way too many allocations. I remember it being much faster in the timings I recorded, I wonder if something in another package caused the allocations to skyrocket. In any case, fix coming soon.

TSGut commented 9 months ago

Multiple things must have broken elsewhere that caused collateral here because some of this makes no sense to me.

Look at these timings for a very simple multiplication matrix, no Cholesky involved:

julia> using BenchmarkTools

julia> W = P \ ((2 .+ axes(P,1).^2) .* P);

julia> @btime W[1:1000,1:1000];
  240.000 μs (12 allocations: 141.50 KiB)

julia> S = Symmetric(W);

julia> @btime S[1:1000,1:1000];
  14.858 ms (24973 allocations: 2.43 MiB)

For high degrees this causes a major bottleneck that the Cholesky stuff runs into. Either something broke elsewhere or I guess we never tested it for very high degree problems because we were intending to use this for a step by step ladder in SemiclassicalOPs.

This simply cannot be correct behavior?

julia> using ClassicalOrthogonalPolynomials, LinearAlgebra

julia> r = 0.5
0.5

julia> lmin, lmax = (1-sqrt(r))^2,  (1+sqrt(r))^2
(0.08578643762690492, 2.914213562373095)

julia> U = Normalized(chebyshevu(lmin..lmax));

julia> w = x -> 1/(x*r)
#11 (generic function with 1 method)

julia> W = U \ (w.(axes(U,1)) .* U);

julia> S = Symmetric(W);

julia> @time W[1:50,1:50];
  0.331621 seconds (9 allocations: 384.359 KiB)

julia> @time S[1:50,1:50];
  8.342962 seconds (12.50 k allocations: 11.112 MiB)

If it is correct behavior then the only way to do your example without ridiculous computational cost will be to implement the rational case.

dlfivefifty commented 9 months ago

use the debugger? Obviously the symmetric getindex is calling getindex on W which is slow.

TSGut commented 9 months ago

Ah, nevermind, the reason I was confused is Clenshaw uses _BandedMatrix(::ClenshawLayout, V::SubArray{<:Any,2}) for its double range indexing. This should be easy to use for a Symmetric overwrite.

TSGut commented 9 months ago

Ok, implemented a getindex for Symmetric Clenshaw so that now

julia> W = P \ (wf.(axes(P,1)) .* P);

julia> @time K1 = W[1:100,1:100];
  0.503210 seconds (9 allocations: 660.547 KiB)

julia> @time K2 = Symmetric(W)[1:100,1:100];
  0.477266 seconds (18 allocations: 1.178 MiB)

But just because Symmetric Clenshaw and Clenshaw are matched now doesn't mean it is fast enough. Non-symmetrized still very slow for large blocks:

julia> W = P \ (wf.(x) .* P);

julia> @time M = W[1:1000,1:1000];
  3.662437 seconds (10 allocations: 5.072 MiB)

This of course slows down adaptive Cholesky:

julia> W = P \ (wf.(x) .* P);

julia> S = Symmetric(W);

julia> U = cholesky(S).U;

julia> @time U[1000,1000];
  3.936038 seconds (46 allocations: 9.704 MiB)

before we get to anything relating to cholesky_jacobimatrix. At the moment this is the main bottleneck that remains before we can look at optimization rather than bug fixes.

At least these changes make the code runnable again so that it just works with [1:1000,1:1000]

julia> r = 0.5
0.5

julia> lmin, lmax = (1-sqrt(r))^2,  (1+sqrt(r))^2
(0.08578643762690492, 2.914213562373095)

julia> P = Normalized(chebyshevu(lmin..lmax))
Normalized(ChebyshevU() affine mapped to 0.08578643762690492 .. 2.914213562373095)

julia> x = axes(P,1)
Inclusion(0.08578643762690492 .. 2.914213562373095)

julia> J = jacobimatrix(P)
ℵ₀×ℵ₀ LazyBandedMatrices.SymTridiagonal{Float64, Fill{Float64, 1, Tuple{InfiniteArrays.OneToInf{Int64}}}, Fill{Float64, 1, Tuple{InfiniteArrays.OneToInf{Int64}}}} with indices OneToInf()×OneToInf():
 1.5       0.707107   ⋅         ⋅         ⋅        …  
 0.707107  1.5       0.707107   ⋅         ⋅
  ⋅        0.707107  1.5       0.707107   ⋅
  ⋅         ⋅        0.707107  1.5       0.707107
  ⋅         ⋅         ⋅        0.707107  1.5
  ⋅         ⋅         ⋅         ⋅        0.707107  …
  ⋅         ⋅         ⋅         ⋅         ⋅
 ⋮                                                 ⋱

julia> wf(x) = 1/(x*r)
wf (generic function with 1 method)

julia> Jchol = cholesky_jacobimatrix(wf, P);

julia> @time Jchol[1:1000,1:1000]
  5.523250 seconds (354.67 k allocations: 36.380 MiB, 0.23% gc time, 4.33% compilation time)
1000×1000 view(::LazyBandedMatrices.SymTridiagonal{Float64, SubArray{Float64, 1, ClassicalOrthogonalPolynomials.CholeskyJacobiData{Float64}, Tuple{Base.Slice{InfiniteArrays.OneToInf{Int64}}, Int64}, false}, SubArray{Float64, 1, ClassicalOrthogonalPolynomials.CholeskyJacobiData{Float64}, Tuple{Base.Slice{InfiniteArrays.OneToInf{Int64}}, Int64}, false}}, 1:1000, 1:1000) with eltype Float64:
 1.0       0.707107   ⋅         ⋅         ⋅         ⋅        …   ⋅         ⋅         ⋅         ⋅         ⋅
 0.707107  1.5       0.707107   ⋅         ⋅         ⋅            ⋅         ⋅         ⋅         ⋅         ⋅
  ⋅        0.707107  1.5       0.707107   ⋅         ⋅            ⋅         ⋅         ⋅         ⋅         ⋅
  ⋅         ⋅        0.707107  1.5       0.707107   ⋅            ⋅         ⋅         ⋅         ⋅         ⋅
  ⋅         ⋅         ⋅        0.707107  1.5       0.707107      ⋅         ⋅         ⋅         ⋅         ⋅
  ⋅         ⋅         ⋅         ⋅        0.707107  1.5       …   ⋅         ⋅         ⋅         ⋅         ⋅
  ⋅         ⋅         ⋅         ⋅         ⋅        0.707107      ⋅         ⋅         ⋅         ⋅         ⋅
 ⋮                                                 ⋮         ⋱  ⋮
  ⋅         ⋅         ⋅         ⋅         ⋅         ⋅            ⋅         ⋅         ⋅         ⋅         ⋅
  ⋅         ⋅         ⋅         ⋅         ⋅         ⋅           0.707107   ⋅         ⋅         ⋅         ⋅
  ⋅         ⋅         ⋅         ⋅         ⋅         ⋅        …  1.5       0.707107   ⋅         ⋅         ⋅ 
  ⋅         ⋅         ⋅         ⋅         ⋅         ⋅           0.707107  1.5       0.707107   ⋅         ⋅
  ⋅         ⋅         ⋅         ⋅         ⋅         ⋅            ⋅        0.707107  1.5       0.707107   ⋅
  ⋅         ⋅         ⋅         ⋅         ⋅         ⋅            ⋅         ⋅        0.707107  1.5       0.707107   
  ⋅         ⋅         ⋅         ⋅         ⋅         ⋅            ⋅         ⋅         ⋅        0.707107  1.5

JuliaApproximation / ClassicalOrthogonalPolynomials.jl

Cholesky Jacobi matrices are insanely slow #167