JuliaApproximation / SemiclassicalOrthogonalPolynomials.jl

A Julia repository for semiclassical orthogonal polynomials
MIT License

Use contiguous recurrence for c-related sum computations #90

Closed TSGut closed 11 months ago

TSGut commented 1 year ago

@ioannisPApapadopoulos @dlfivefifty

Basically just porting over @MikaelSlevinsky's contiguous recurrence implementation in a way that should be reasonably natural for this package. This should be both much faster and hopefully much more accurate at high c.

Previously we had

julia> Wcomp = SemiclassicalJacobiWeight.(1.1,2:2,3:3,3:100);

julia> @time scomp = sum.(Wcomp);
  0.112850 seconds (336.34 k allocations: 14.352 MiB)

Now we get

julia> W = SemiclassicalJacobiWeight.(1.1,2,3,3:100);

julia> @time s = sum.(W);
  0.000048 seconds (28 allocations: 2.156 KiB)

At no cost to accuracy (the old method uses BigFloat internally, the new one just Float64):

julia> s - scomp
98-element Vector{Float64}:
 -2.6020852139652106e-18
 -2.6020852139652106e-18
 -1.734723475976807e-18
 -1.3010426069826053e-18
 -8.673617379884035e-19
  0.0
 -4.336808689942018e-19
 -2.168404344971009e-19
 -2.168404344971009e-19
  ⋮
  5.898059818321144e-17
  5.898059818321144e-17
  6.591949208711867e-17
  7.28583859910259e-17
  7.632783294297951e-17
  7.979727989493313e-17
  8.673617379884035e-17
  9.367506770274758e-17
  1.0408340855860843e-16

It's implemented to work whenever sum is broadcast over a c-range vector of weights, as seen above. As it stands, it won't be used if you just call sum on a single weight (mainly because doing that in a clean way would involve a lot of clauses about a, b and c and other contiguous recurrences); maybe we implement that at some point in the future. If you must use it on a single weight, just do e.g. sum.(SemiclassicalJacobiWeight.(1.1,2,3,3:3)).

Still need to add and run tests.
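
For context, the quantity being summed is sum(SemiclassicalJacobiWeight(t,a,b,c)) = ∫₀¹ xᵃ(1-x)ᵇ(t-x)ᶜ dx, which has the hypergeometric closed form quoted further down this thread. Here is a hedged cross-check of that closed form (a stdlib-Python sketch with illustrative names like `weight_sum`, not package code), comparing it against a term-wise binomial expansion of (t-x)ᶜ for integer c:

```python
# Hedged sketch (illustrative names, not package code): the sums above are
#   sum(SemiclassicalJacobiWeight(t,a,b,c)) = integral over [0,1] of x^a (1-x)^b (t-x)^c,
# with closed form t^c * B(1+a,1+b) * 2F1(1+a, -c; 2+a+b; 1/t) for integer c >= 0.
from math import comb, gamma

def beta_fn(x, y):
    return gamma(x) * gamma(y) / gamma(x + y)

def hyp2f1_poly(a, n, c, z):
    # terminating series 2F1(a, -n; c; z) for a nonnegative integer n
    term, total = 1.0, 1.0
    for k in range(n):
        term *= (a + k) * (k - n) / ((c + k) * (k + 1)) * z
        total += term
    return total

def weight_sum(t, a, b, c):
    # the hypergeometric closed form for the weight sum
    return t**c * beta_fn(1 + a, 1 + b) * hyp2f1_poly(1 + a, c, 2 + a + b, 1 / t)

def weight_sum_binomial(t, a, b, c):
    # expand (t-x)^c = sum_j C(c,j) t^(c-j) (-x)^j and integrate term-wise:
    # each term gives C(c,j) t^(c-j) (-1)^j B(a+1+j, b+1)
    return sum(comb(c, j) * t**(c - j) * (-1)**j * beta_fn(a + 1 + j, b + 1)
               for j in range(c + 1))

t, a, b = 1.1, 2, 3
for c in range(0, 8):
    s1, s2 = weight_sum(t, a, b, c), weight_sum_binomial(t, a, b, c)
    assert abs(s1 - s2) < 1e-12 * abs(s2)
```

The binomial route is only viable for small integer c (it suffers cancellation as c grows), which is exactly why a contiguous recurrence in c is the better tool here.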

TSGut commented 1 year ago

Now also featuring the weighted conversion speedup promised to @ioannisPApapadopoulos. Hopefully this getting faster and faster eventually reaches a point where it is fast enough (I suspect any further gains would now come from code optimization rather than better formulas, though).

Here is the example from https://github.com/JuliaApproximation/SemiclassicalOrthogonalPolynomials.jl/issues/78 rerun with the current implementation:

julia> using SemiclassicalOrthogonalPolynomials, ClassicalOrthogonalPolynomials

julia> t = 1.1
1.1

julia> Q₀₀ = SemiclassicalJacobi.(t, 0, 0, 0:∞);

julia> Q₁₁ = SemiclassicalJacobi.(t, 1, 1, 0:∞);

julia> L  = (Weighted.(Q₀₀) .\ Weighted.(Q₁₁));

julia> @time L[20]; # before: 0.183541 seconds
  0.060309 seconds (191.13 k allocations: 8.075 MiB)

julia> @time L[30]; # before: 0.396986 seconds
  0.100929 seconds (402.88 k allocations: 15.625 MiB)

julia> @time L[40]; # before: 0.861748 seconds
  0.212241 seconds (758.47 k allocations: 27.565 MiB)

julia> @time L[50]; # before: 1.565395 seconds
  0.291458 seconds (1.26 M allocations: 43.963 MiB)

julia> @time L[60]; # before: 6.235148 seconds (with compile), 2.743544 seconds (without compile)
  0.441863 seconds (1.95 M allocations: 64.999 MiB)

and we also get

julia> @time L[250]
 37.124558 seconds (142.60 M allocations: 3.696 GiB, 2.71% gc time)
(((ℵ₀×ℵ₀ LinearAlgebra.LowerTriangular{Float64, LinearAlgebra.Adjoint{Float64, InfiniteLinearAlgebra.AdaptiveCholeskyFactors{Float64, BandedMatrices.BandedMatrix{Float64, Matrix{Float64}, Base.OneTo{Int64}}, LazyBandedMatrices.SymTridiagonal{Float64, LazyArrays.BroadcastVector{Float64, typeof(-), Tuple{Bool, SubArray{Float64, 1, ClassicalOrthogonalPolynomials.QRJacobiData{:Q, Float64}, Tuple{Base.Slice{InfiniteArrays.OneToInf{Int64}}, Int64}, false}}}, LazyArrays.BroadcastVector{Float64, typeof(-), Tuple{SubArray{Float64, 1, ClassicalOrthogonalPolynomials.QRJacobiData{:Q, Float64}, Tuple{Base.Slice{InfiniteArrays.OneToInf{Int64}}, Int64}, false}}}}}}} with indices OneToInf()×OneToInf()) * (ℵ₀×ℵ₀ LinearAlgebra.Diagonal{Float64, FillArrays.Fill{Float64, 1, Tuple{InfiniteArrays.OneToInf{Int64}}}} with indices OneToInf()×OneToInf()) with indices OneToInf()×OneToInf()) * ((ℵ₀×ℵ₀ LinearAlgebra.LowerTriangular{Float64, LinearAlgebra.Adjoint{Float64, InfiniteLinearAlgebra.AdaptiveCholeskyFactors{Float64, BandedMatrices.BandedMatrix{Float64, Matrix{Float64}, Base.OneTo{Int64}}, LazyBandedMatrices.SymTridiagonal{Float64, SubArray{Float64, 1, ClassicalOrthogonalPolynomials.QRJacobiData{:Q, Float64}, Tuple{Base.Slice{InfiniteArrays.OneToInf{Int64}}, Int64}, false}, SubArray{Float64, 1, ClassicalOrthogonalPolynomials.QRJacobiData{:Q, Float64}, Tuple{Base.Slice{InfiniteArrays.OneToInf{Int64}}, Int64}, false}}}}} with indices OneToInf()×OneToInf()) * (ℵ₀×ℵ₀ LinearAlgebra.Diagonal{Float64, FillArrays.Fill{Float64, 1, Tuple{InfiniteArrays.OneToInf{Int64}}}} with indices OneToInf()×OneToInf()) with indices OneToInf()×OneToInf()) with indices OneToInf()×OneToInf()) * (ℵ₀×ℵ₀ LinearAlgebra.Diagonal{Float64, FillArrays.Fill{Float64, 1, Tuple{InfiniteArrays.OneToInf{Int64}}}} with indices OneToInf()×OneToInf()) with indices OneToInf()×OneToInf():
  0.00434421    ⋅            ⋅             ⋅           …  
  0.00428913   0.00608072    ⋅             ⋅
 -3.76572e-5   0.0059776    0.00737181     ⋅
   ⋅          -7.88228e-5   0.00721544    0.00842674
   ⋅            ⋅          -0.000127024   0.00821237
   ⋅            ⋅            ⋅           -0.000180951  …
   ⋅            ⋅            ⋅             ⋅
   ⋅            ⋅            ⋅             ⋅
   ⋅            ⋅            ⋅             ⋅
  ⋮                                                    ⋱

Granted, that is not exactly fast, but there are a LOT of Cholesky decompositions that need to be done to construct this. I am sure there are plenty of code-level optimizations left here, but let's see if this is good enough first.

codecov[bot] commented 1 year ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Comparison is base (82bac69) 91.66% compared to head (a26d581) 92.40%.

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##           master      #90      +/-   ##
==========================================
+ Coverage   91.66%   92.40%   +0.74%     
==========================================
  Files           3        3              
  Lines         480      527      +47     
==========================================
+ Hits          440      487      +47     
  Misses         40       40              
```

| [Files](https://app.codecov.io/gh/JuliaApproximation/SemiclassicalOrthogonalPolynomials.jl/pull/90?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=JuliaApproximation) | Coverage Δ | |
|---|---|---|
| [src/SemiclassicalOrthogonalPolynomials.jl](https://app.codecov.io/gh/JuliaApproximation/SemiclassicalOrthogonalPolynomials.jl/pull/90?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=JuliaApproximation#diff-c3JjL1NlbWljbGFzc2ljYWxPcnRob2dvbmFsUG9seW5vbWlhbHMuamw=) | `88.33% <100.00%> (+1.85%)` | :arrow_up: |


TSGut commented 1 year ago

Ok I think this does what I wanted it to now, can be merged.

@ioannisPApapadopoulos give it a try and see if this fixes the issue you saw with the normalization. You should precompute the normalization constants if possible so it can all be done in one go of the recurrence.

ioannisPApapadopoulos commented 1 year ago

Brilliant, thanks @TSGut. I'll give it a spin and report back.

dlfivefifty commented 1 year ago

I don't really understand what L[250] is computing. Why is it quadratic in cost? I would think computing L[m][1:n,1:n] would be O(m*n) operations

TSGut commented 1 year ago

> I don't really understand what L[250] is computing. Why is it quadratic in cost? I would think computing L[m][1:n,1:n] would be O(m*n) operations

A fair question. In theory, this is computed by chaining lots of QR or Cholesky decompositions to ladder up in the parameters. But the cost is not straightforward, because this laddering involves some cached arrays, meaning that

julia> @time L[250];
 32.742888 seconds (140.99 M allocations: 3.648 GiB, 2.50% gc time)

julia> @time L[250];
 16.266950 seconds (72.01 M allocations: 1.871 GiB, 2.54% gc time)

So at least half of the computing time here is spent filling in some cached Jacobi matrices to then perform Cholesky decompositions on.

TSGut commented 1 year ago

I think I know where most of the computation time is lost, possibly what causes the N^2 cost. I will see if I can fix it.

TSGut commented 1 year ago

Hm, ok, it is faster now, especially after caching, which is nice, but the complexity still has not changed.

julia> @time LL[250];
 25.778077 seconds (106.66 M allocations: 2.762 GiB, 2.96% gc time, 0.09% compilation time)

julia> @time LL[250];
  8.797171 seconds (36.06 M allocations: 959.496 MiB, 2.45% gc time)

TSGut commented 1 year ago

Ok this is more like it...

julia> @time L[250];
 16.329566 seconds (70.79 M allocations: 1.829 GiB, 2.53% gc time)

julia> @time L[250];
  0.000901 seconds (835 allocations: 39.078 KiB)

(No, I didn't just cache this object; it now only works with objects that are cached instead of erroneously regenerating them over and over.) Importantly, it precomputes all the stuff needed for the other conversions too on the way, so all the complexity left is filling the cached Jacobi matrices up to the required order; it is otherwise blazingly fast.

julia> @time L[250];
  0.000896 seconds (835 allocations: 39.078 KiB)

julia> @time L[249];
  0.074969 seconds (309.57 k allocations: 7.869 MiB)

julia> @time L[249];
  0.000951 seconds (835 allocations: 39.078 KiB)

@ioannisPApapadopoulos you should keep this in mind when using this. Probably don't call L[1:250] directly; call L[250] on its own first, then call L[1:250], otherwise Julia's evaluation order may not make use of this and you will waste time.
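
The access-order effect can be modelled with a toy cache (a hedged Python sketch with a hypothetical `CachedLadder` class, not the package's actual cached-array machinery): a contiguously filled cache pays the expensive per-level work once, up to the highest index requested, after which all smaller indices are cheap lookups.

```python
# Hedged toy model of the cached-hierarchy behaviour described above:
# the cache fills contiguously up to the highest index requested, so asking
# for the largest index first makes every later (smaller) access a lookup.
class CachedLadder:
    def __init__(self, step):
        self.step = step   # expensive per-level computation (here a stand-in)
        self.data = []     # filled contiguously, never regenerated
        self.work = 0      # counts expensive steps actually performed

    def __getitem__(self, n):
        while len(self.data) <= n:   # extend the cache up to index n
            self.data.append(self.step(len(self.data)))
            self.work += 1
        return self.data[n]

L_toy = CachedLadder(lambda k: k * k)
L_toy[250]                  # pays for levels 0..250 once
w = L_toy.work
L_toy[249]; L_toy[250]      # now free: pure cache lookups
assert L_toy.work == w == 251
```

Requesting indices in increasing order one at a time gives the same total work, but interleaving large requests after many small ones wastes nothing only if the cache is truly contiguous, which is the point of the fix above.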

dlfivefifty commented 1 year ago

If it’s AbstractCachedVector then L[1:250] should be fine

TSGut commented 1 year ago

Retagged as WIP for now because of a potential forward recurrence instability; see the relevant issue in the draft repo.

TSGut commented 1 year ago

Now includes sumquotient(), used in weighted conversions to avoid overflow of the t^c term for high c in computations of sum(wP)/sum(wQ), by cancelling those factors before applying the recurrence.
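
The idea can be sketched as follows (a hedged Python toy with hypothetical names `stripped_sum` and `sum_quotient_sketch`, not the package's implementation): both sums carry the common factor t^c, which overflows Float64 for t > 1 once c is large enough, but in the quotient it cancels analytically, so the ratio can be formed from the t^c-stripped integrands alone.

```python
# Hedged toy of the cancellation behind sumquotient() (hypothetical helpers,
# not package code): sum(w_{a,b,c}) = t^c * integral of x^a (1-x)^b (1-x/t)^c,
# and the t^c factor cancels in sum(wP)/sum(wQ) before anything can overflow.
def simpson(f, n=2000):
    # composite Simpson's rule on [0, 1]
    h = 1.0 / n
    s = f(0.0) + f(1.0) + sum((4 if k % 2 else 2) * f(k * h) for k in range(1, n))
    return s * h / 3

def stripped_sum(t, a, b, c):
    # the weight sum with t^c divided out
    return simpson(lambda x: x**a * (1 - x)**b * (1 - x / t)**c)

def sum_quotient_sketch(t, a1, b1, a2, b2, c):
    # t^c never appears, so large c cannot overflow the quotient
    return stripped_sum(t, a1, b1, c) / stripped_sum(t, a2, b2, c)

# at moderate c the cancelled quotient agrees with the ratio of the full sums
t, c = 1.1, 50
full = lambda a, b: t**c * stripped_sum(t, a, b, c)
assert abs(sum_quotient_sketch(t, 1, 1, 2, 2, c) - full(1, 1) / full(2, 2)) < 1e-10
```

At much larger c the un-cancelled t^c would overflow to Inf while the stripped quotient stays finite, which is exactly the failure mode this commit guards against.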

TSGut commented 1 year ago

While there are some drawbacks here that we discussed, I think we probably still merge this in its current form (after resolving conflicts)? The previous implementation is pretty much strictly worse, so we can use the recurrence until we revamp the decompositions machinery to keep track of the normalization directly. We would also still want the recurrence anyway for cases where we don't want to generate the hierarchy of polynomials.

ioannisPApapadopoulos commented 12 months ago

I think there is a bug when c=0. e.g.

using SemiclassicalOrthogonalPolynomials
import HypergeometricFunctions: _₂F₁general2
using SpecialFunctions

t, a, b, c = 1.1, 1, 1, 0
sum.(SemiclassicalJacobiWeight.(t,a,b,c:c)) # returns 1.0
t^c * beta(1+a,1+b) * _₂F₁general2(1+a,-c,2+a+b,1/t) # returns 0.16666666666666669
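
For reference, the second value is the correct one: at c = 0 the ₂F₁ factor equals 1 and t^c = 1, so the sum reduces to B(1+a, 1+b) = ∫₀¹ xᵃ(1-x)ᵇ dx, which for a = b = 1 is 1/6 ≈ 0.1667. A hedged stdlib-Python check (not package code):

```python
# Hedged check that 0.1666... is the correct c = 0 value: the weight reduces
# to x^a (1-x)^b, whose integral over [0,1] is the Beta function B(1+a, 1+b).
from math import gamma

def simpson(f, n=1000):
    # composite Simpson's rule on [0, 1]; exact here since the integrand is quadratic
    h = 1.0 / n
    s = f(0.0) + f(1.0) + sum((4 if k % 2 else 2) * f(k * h) for k in range(1, n))
    return s * h / 3

t, a, b, c = 1.1, 1, 1, 0
direct = simpson(lambda x: x**a * (1 - x)**b * (t - x)**c)
expected = gamma(a + 1) * gamma(b + 1) / gamma(a + b + 2)   # B(2, 2) = 1/6
assert abs(direct - expected) < 1e-10   # so 1.0 from sum.(...) is indeed a bug
```
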

TSGut commented 12 months ago

Yep, that's definitely a bug. Will fix.

ioannisPApapadopoulos commented 11 months ago

@TSGut have you made a start? or should I have a go fixing it next week?

TSGut commented 11 months ago

Ah, sorry, I just didn't push. It was just a bug in the special casing for the c = 0 case.

TSGut commented 11 months ago

@dlfivefifty ready for review or merge

ioannisPApapadopoulos commented 11 months ago

Thank you!

dlfivefifty commented 11 months ago

Generally you would resolve the conflicts and have tests passing before asking someone to review or merge

TSGut commented 11 months ago

> Generally you would resolve the conflicts and have tests passing before asking someone to review or merge

Of course. I just missed that there was a conflict, sorry about that. Should be fixed.

TSGut commented 11 months ago

@dlfivefifty Ok, all passing and conflicts and review resolved.

TSGut commented 11 months ago

Ok, other than the recurrence array change I have addressed the review comments. I will check how much work it would be to convert this all to use RecurrenceArrays in the coming few days.