SciML / LinearSolve.jl

LinearSolve.jl: High-Performance Unified Interface for Linear Solvers in Julia. Easily switch between factorization and Krylov methods and add preconditioners, all in one interface.
https://docs.sciml.ai/LinearSolve/stable/
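
For readers landing here from outside: the unified interface under discussion looks roughly like this (a minimal sketch using only calls that appear later in the thread):

using LinearSolve

A = rand(4, 4)
b = rand(4)

prob = LinearProblem(A, b)  # wrap the linear system A x = b
sol = solve(prob)           # let LinearSolve pick a default algorithm
sol.u                       # the computed solution vector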

Solving tall, sparse linear systems is insanely slow #283

Closed: ojwoodford closed this issue 1 year ago

ojwoodford commented 1 year ago

Solving the normal equations is much faster for tall linear systems:

using LinearSolve, LinearAlgebra, BenchmarkTools
A = randn(1000, 100)
b = randn(1000)
@assert isapprox(solve(LinearProblem(A, b)).u, Symmetric(A' * A) \ (A' * b))
@btime solve(LinearProblem($A, $b)).u;
@btime Symmetric(($A)' * $A) \ (($A)' * $b);

2.471 ms (32 allocations: 1.66 MiB)
179.958 μs (9 allocations: 209.09 KiB)

Is there a reason why LinearSolve's smart algorithm selection doesn't choose to do this?
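
Editor's note: the trick benchmarked here is the classical normal-equations reduction. For tall $A \in \mathbb{R}^{m \times n}$ with $m \gg n$, the least-squares minimizer satisfies

$$\min_x \|Ax - b\|_2^2 \quad\Longleftrightarrow\quad A^\top A\,x = A^\top b,$$

which swaps a factorization of the m × n matrix for one of the much smaller n × n matrix AᵀA, at the cost of conditioning (see YingboMa's comment below).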

ojwoodford commented 1 year ago

It's far worse for sparse arrays:

using LinearSolve, SparseArrays, BenchmarkTools, LDLFactorizations
A = sprandn(5000, 100, 0.1)
b = randn(5000)
@assert isapprox(solve(LinearProblem(A, b)).u, ldl(A' * A) \ (A' * b))
@btime solve(LinearProblem($A, $b)).u;
@btime ldl(($A)' * $A) \ (($A)' * $b);

22.392 s (54 allocations: 20.41 MiB)
1.220 ms (47 allocations: 1.37 MiB)

LinearSolve is approximately 20,000x slower here!

ojwoodford commented 1 year ago

I should add I'm on an Apple silicon Mac. Not sure how much difference that will make.

YingboMa commented 1 year ago

That's an OK way to solve least squares, but only when we know a priori that A has a low condition number. It's not enough to test linear algebra algorithms using random matrices. Consider:

julia> using LinearSolve, LinearAlgebra

julia> vandermonde(xs, basis, m) = [basis(x, i) for x in xs,
                                                 i in 1:m]
vandermonde (generic function with 2 methods)

julia> V = vandermonde(range(0, 1, length=128), (x, i)->i==1 ? one(x) : x^(i-1), 20);

julia> b = rand(128);

julia> x_normal = (V'V) \ (V'b);

julia> x = solve(LinearProblem(V, b)).u;

julia> norm(V * x - b)
2.935841792541393

julia> norm(V * x_normal - b)
3.141337584566186

julia> norm(x - x_normal)
3.881181867110321e12
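
Editor's note: the failure mode is that forming AᵀA squares the condition number, κ(AᵀA) = κ(A)², so the normal equations lose roughly twice as many digits as a QR-based solve. A quick check against the session above:

using LinearAlgebra
cond(V)      # already large for a degree-19 polynomial basis
cond(V' * V) # ≈ cond(V)^2, which is what wrecks the normal-equations solve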

chriselrod commented 1 year ago

On my M1 mac:

julia> BLAS.set_num_threads(1);

julia> @btime qr($A)\$b;
  1.365 ms (9 allocations: 847.33 KiB)

julia> @btime solve(LinearProblem($A, $b)).u;
  2.607 ms (29 allocations: 1.66 MiB)

julia> BLAS.set_num_threads(4);

julia> @btime qr($A)\$b;
  1.142 ms (9 allocations: 847.33 KiB)

julia> @btime solve(LinearProblem($A, $b)).u;
  2.238 ms (29 allocations: 1.66 MiB)

So qr(A)\b seems like a pretty big improvement over what solve(LinearProblem(A, b)).u is doing?

ChrisRackauckas commented 1 year ago

It should be using a QR in this instance, but it has to prove the non-square assumption at runtime. With the assumption supplied, it's likely fine.
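
Concretely (an editor's sketch using the OperatorAssumptions API that appears later in this thread; the false declares that A is not square):

using LinearSolve
prob = LinearProblem(A, b)
# Supply the shape assumption up front instead of having solve() establish it:
sol = solve(prob; assumptions = LinearSolve.OperatorAssumptions(false))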

chriselrod commented 1 year ago

2x overhead sounds like a hefty performance bug.

ChrisRackauckas commented 1 year ago

There's a way to reduce it, but it would mangle the code a bit and I haven't gotten around to it. In general, though, I'd hope that cases that want performance set the assumptions.

ojwoodford commented 1 year ago

In general, though, I'd hope that cases that want performance set the assumptions.

@ChrisRackauckas How can I tell LinearSolve that A is a priori well conditioned?
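
Editor's note: the API that eventually answers this lands in PR #290 at the bottom of this thread:

solve(LinearProblem(A, b),
      assumptions = OperatorAssumptions(false; condition = OperatorCondition.WellConditioned)).u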

oscardssmith commented 1 year ago

You don't need to know that it's well conditioned to use QR. You just need to know that the matrix is tall. That said, IMO LinearSolve should really check the shape and use QR if appropriate.
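
The shape check itself is cheap. A minimal sketch of the suggested dispatch (an editor's hypothetical helper, not LinearSolve internals):

using LinearAlgebra

# Hypothetical: choose a dense factorization from the shape alone.
function solve_by_shape(A::AbstractMatrix, b::AbstractVector)
    m, n = size(A)
    return m == n ? lu(A) \ b : qr(A) \ b  # square: LU; tall: QR least squares
end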

chriselrod commented 1 year ago

You don't need to know that it's well conditioned to use QR.

He wants to assert that it is well conditioned in order to use Symmetric(A' * A) \ (A' * b). I suggested qr because it doesn't require A to be well conditioned.

Personally, I think if someone has that kind of information about the problem (i.e. how well conditioned it is), they should just choose the appropriate way to solve the problem themselves.

Conditioning is expensive to calculate (generally more expensive than the solve itself; it normally requires an SVD, which could itself be used to solve the system).
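
For concreteness (an editor's sketch for the dense case): cond is itself SVD-based, and the same SVD also solves the system, which is the point being made.

using LinearAlgebra
F = svd(A)                       # one SVD...
κ = maximum(F.S) / minimum(F.S)  # ...yields the 2-norm condition number...
x = F \ b                        # ...and also solves the least-squares problem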

ChrisRackauckas commented 1 year ago

IMO LinearSolve should really check the shape and use QR if appropriate.

That's what it's doing, and that's exactly the problem.

He wants to assert that it is well conditioned in order to use Symmetric(A' * A) \ (A' * b).

We should add an OperatorAssumption for conditioning.

chriselrod commented 1 year ago

That's what it's doing, and that's exactly the problem.

How can a shape check take over 1 ms?

chriselrod commented 1 year ago

I think it is literally calling qr! twice: first in init_cacheval, and then again under a sub-solve call.

ChrisRackauckas commented 1 year ago

Oh yeah we need to improve the cache init since there's no ArrayInterface.qr_instance(A) implemented.
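
For reference, the *_instance pattern builds a factorization object of the right type without full-size work. A rough editor's sketch of the idea (not the actual ArrayInterface implementation):

using LinearAlgebra

# Hypothetical: factor an empty matrix of the same type, so the cache gets a
# correctly typed QR object without paying for a real factorization.
qr_instance(A::AbstractMatrix) = qr(similar(A, 0, 0))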

ojwoodford commented 1 year ago

I suggested qr because it doesn't require A to be well conditioned.

I actually care about the case where A is well conditioned and sparse, and could be solved 4 orders of magnitude faster using normal equations.

Personally, I think if someone has that kind of information about the problem (i.e. how well conditioned it is), they should just choose the appropriate way to solve the problem themselves.

Currently that's exactly what I'm doing. It's just that LinearSolve claims to:

Easily switch between factorization and Krylov methods and add preconditioners, all in one interface.

and I feel my experience shows it doesn't really deliver on that. The method should be at least as fast as QR for tall, sparse matrices. A normal-equations solver could be added, as could a way to tell the system that A is well conditioned. I appreciate it's a work in progress, so please take this as constructive feedback rather than criticism.

oscardssmith commented 1 year ago

I actually care about the case where A is well conditioned and sparse, and could be solved 4 orders of magnitude faster using normal equations.

This isn't true. QR will be fairly similar in speed (roughly 2x slower for dense, up to 10x for sparse) and a lot more accurate.

For the sparse example you gave, my laptop gives

julia> @btime qr(A) \ b;
  11.515 ms (167 allocations: 23.39 MiB)
julia> @btime  Symmetric(A' * A) \ (A' * b);
  2.184 ms (79 allocations: 1.70 MiB)

ojwoodford commented 1 year ago

@oscardssmith Fair point. I hadn't understood that QR was much faster for the sparse case. Full timings for me:

julia> @btime solve(LinearProblem($A, $b)).u;
  22.228 s (54 allocations: 20.41 MiB)

julia> @btime qr($A) \ $b;
  8.853 ms (159 allocations: 23.40 MiB)

julia> @btime Symmetric(($A)' * $A) \ (($A)' * $b);
  1.288 ms (75 allocations: 1.70 MiB)

julia> @btime ldl(($A)' * $A) \ (($A)' * $b);
  1.206 ms (47 allocations: 1.37 MiB)

So QR is already 2500x faster than what LinearSolve is doing, and accurate with poor conditioning (will test this on @YingboMa's example). This must surely be a bug in LinearSolve.

However, I do care about the further ~8x speedup I get using my approach, and would prefer to use it from LinearSolve, with all the benefits LinearSolve claims to offer.

ChrisRackauckas commented 1 year ago

So QR is already 2500x faster than what LinearSolve is doing, and accurate with poor conditioning (will test this on @YingboMa's example). This must surely be a bug in LinearSolve.

Yes, I've said multiple times here that in this case it's just doing a QR. So there is something odd here. In fact, the profile looks like it's some kind of weird inference bug or something.

[Screenshot: profiler flame graph, 2023-03-24]

That must be the oddest profile I've seen. It spends <1% of the time running code. The other 99% of the time is 🤷. Haven't seen this before, can someone double check the profile?

chriselrod commented 1 year ago

Haven't seen this before, can someone double check the profile?

I've seen things like this before. The fix is to start Julia with a single thread (e.g. julia --threads=1). I profiled it earlier and saw qr! calls from two different places within solve.

ChrisRackauckas commented 1 year ago

Interesting. I found the issue. In https://github.com/SciML/LinearSolve.jl/pull/188 there was https://github.com/SciML/LinearSolve.jl/pull/188/commits/fd05fb1adf22b1dde8f0b892ac736fe0d4ba0134 which versioned qr! to only be used for (at the time) Julia master on sparse arrays. And if you check:

using LinearSolve, SparseArrays, BenchmarkTools
A = sprandn(5000, 100, 0.1)
b = randn(5000)
prob = LinearProblem(A, b)
assumptions = LinearSolve.OperatorAssumptions(false)
@time solve(prob; assumptions)

using LinearAlgebra
@time qr!(A)

The new qr! on SparseArrays is absurdly slow. So the fix is just to use qr instead.

@Wimmerer is this qr! issue in SparseArrays known? I couldn't find an issue on it.
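
Editor's sketch to isolate the regression outside LinearSolve (the setup gives the mutating call a fresh copy each sample; timings are machine-dependent):

using SparseArrays, LinearAlgebra, BenchmarkTools
A = sprandn(5000, 100, 0.1)
@btime qr($A);                               # non-mutating sparse QR (SPQR)
@btime qr!(B) setup=(B = copy($A)) evals=1;  # mutating path reported as absurdly slow above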

ChrisRackauckas commented 1 year ago

I profiled it earlier and saw qr! calls from two different places within solve.

That was already fixed.

ChrisRackauckas commented 1 year ago

using LinearSolve, SparseArrays, BenchmarkTools, LDLFactorizations, LinearAlgebra
A = sprandn(5000, 100, 0.1)
b = randn(5000)
@assert isapprox(solve(LinearProblem(A, b)).u, ldl(A' * A) \ (A' * b))
@btime solve(LinearProblem($A, $b)).u; # 16.984 ms (326 allocations: 24.16 MiB)
@btime qr($A) \ $b # 17.135 ms (159 allocations: 23.35 MiB)
@btime ldl(($A)' * $A) \ (($A)' * $b); # 1.186 ms (47 allocations: 1.37 MiB)

ChrisRackauckas commented 1 year ago

https://github.com/SciML/LinearSolve.jl/pull/290 adds the new algorithms for well-conditioned matrices and adds them to the default algorithm. Results:

using LinearSolve, LinearAlgebra, BenchmarkTools
A = randn(1000, 100)
b = randn(1000)
@assert isapprox(solve(LinearProblem(A, b)).u, Symmetric(A' * A) \ (A' * b))
@btime solve(LinearProblem($A, $b)).u; # 3.140 ms (26 allocations: 857.53 KiB)
@btime solve(LinearProblem($A, $b), $(NormalCholeskyFactorization())).u; # 411.000 μs (27 allocations: 163.36 KiB)
@btime solve(LinearProblem($A, $b), $(NormalBunchKaufmanFactorization())).u; # 400.500 μs (27 allocations: 210.09 KiB)
@btime solve(LinearProblem($A, $b), assumptions = $(OperatorAssumptions(false; condition = OperatorCondition.WellConditioned))).u; # 401.000 μs (18 allocations: 162.89 KiB)
@btime Symmetric(($A)' * $A) \ (($A)' * $b); # 400.000 μs (9 allocations: 209.09 KiB)
@btime factorize(Symmetric(($A)' * $A)) \ (($A)' * $b); # 404.300 μs (12 allocations: 209.19 KiB)

using LinearSolve, SparseArrays, BenchmarkTools, LDLFactorizations, LinearAlgebra
A = sprandn(5000, 100, 0.1)
b = randn(5000)
@assert isapprox(solve(LinearProblem(A, b)).u, ldl(A' * A) \ (A' * b))
@btime solve(LinearProblem($A, $b)).u; # 16.984 ms (326 allocations: 24.16 MiB)
@btime solve(LinearProblem($A, $b), $(NormalCholeskyFactorization())).u; # 1.581 ms (130 allocations: 1.70 MiB)
@btime solve(LinearProblem($A, $b), assumptions = $(OperatorAssumptions(false; condition = OperatorCondition.WellConditioned))).u; # 1.535 ms (124 allocations: 1.70 MiB)
@btime qr($A) \ $b # 17.135 ms (159 allocations: 23.35 MiB)
@btime factorize(Symmetric(($A)' * $A)) \ (($A)' * $b); # 1.186 ms (47 allocations: 1.37 MiB)
@btime cholesky(Symmetric(($A)' * $A)) \ (($A)' * $b); # 1.186 ms (47 allocations: 1.37 MiB)
@btime ldlt(($A)' * $A) \ (($A)' * $b); # 1.186 ms (47 allocations: 1.37 MiB)
@btime ldl(($A)' * $A) \ (($A)' * $b); # 1.186 ms (47 allocations: 1.37 MiB)

I'm not entirely sure where the overhead on the sparse case is coming from, but at least it's rather close.