JuliaLinearAlgebra / IterativeSolvers.jl

Iterative algorithms for solving linear systems, eigensystems, and singular value problems
MIT License

LOBPCG stability: Suggestion of improvements #246

Open · mfherbst opened this issue 5 years ago

mfherbst commented 5 years ago

We currently employ the lobpcg solver from this Julia package in our DFTK.jl electronic structure theory code. We have observed that the current Cholesky-QR-based LOBPCG implementation can become numerically unstable and sometimes even produces spurious eigenvalues.

As an illustrative example, run the following Julia code:

using LinearAlgebra, IterativeSolvers

A = [1.25, 1.5, 1.5, 1.25, 1.5, 1.25, 1.5, 0, 1.13, 1.13, 1.5, 1.13, 1.5, 1.5, 1.13]
res = lobpcg(Diagonal(A), false, 5)

In my experiments with this simple test problem, it fails about every second time with a PosDefException from the Cholesky factorization. In the remaining cases, it returns two approximations to the zero eigenvalue at res.λ[1] and res.λ[2]. In all such instances I further observed the first two eigenvectors to be non-orthogonal. That is to say,

dot(res.X[:, 1], res.X[:, 2])

returns a numerically non-zero answer, on the order of 1e-4.

A couple of strategies to improve the numerical stability of LOBPCG have been discussed in the literature, e.g. in https://arxiv.org/abs/1704.07458. As far as I can judge from reading the code, the suggested basis truncation and selection strategies are not yet present, and it might be advantageous to look into implementing them.

mfherbst commented 5 years ago

Note: The original scipy implementation does not suffer from this problem. For example

import numpy as np
from scipy.linalg import orth
from scipy.sparse.linalg import lobpcg

A = np.array([1.25, 1.5, 1.5, 1.25, 1.5, 1.25, 1.5, 0, 1.13, 1.13, 1.5, 1.13, 1.5, 1.5, 1.13])
lobpcg(np.diag(A), orth(np.random.randn(len(A), 5)), largest=False)

returns the correct result each time I tried it.

mohamed82008 commented 5 years ago

Seems like a duplicate of #223.

mohamed82008 commented 5 years ago

I am traveling this week, but I will give it a look next week. Thanks for the issue.

mfherbst commented 5 years ago

Thanks for looking into this. Indeed, this shows some similarity with #223; e.g., block size 5 is a magic number here, while 4 or 6 work much better with the A defined above.

I can assure you, however, that we see similarly frequent problems in our application code, where the matrices are much larger relative to the block size than in the example shown. I'll try to come up with a reproducible example for you.

mfherbst commented 5 years ago

I came up with a larger example that still illustrates the PosDefException problem; see this gist.

On my machine this has a success rate of about 97%. That is not good enough for our needs, since we run on the order of hundreds, potentially even thousands, of such eigensolves.

mohamed82008 commented 5 years ago

Ok this is interesting. I managed to reproduce this with a specific seed. I will look into it.

lobpcg commented 5 years ago

@mohamed82008 see https://github.com/JuliaMath/IterativeSolvers.jl/issues/246#issuecomment-494429278; the scipy version also uses Cholesky, so there's nothing wrong with Cholesky in this example. It looks like there's just a bug in your Julia code, unrelated to your QR "fix". You may want to compare your code against the latest scipy version rather than adding new QR code.

mohamed82008 commented 5 years ago

It is definitely possible that it is a bug in my code, but given the number of passing test cases, I am very curious where I might have gone wrong. I can do a close comparison when I get some time.

lobpcg commented 5 years ago

@mohamed82008 this may be a bug in the core LOBPCG algorithm after all, also found by @mfherbst in scipy, see https://github.com/scipy/scipy/issues/10258#issuecomment-498929598

Let me think how to fix it easily...

mfherbst commented 5 years ago

@mohamed82008 (related to scipy/scipy#10258) I added a Julia version to the example repository mfherbst/issue_scipy_lobpcg_guess, for convenience and out of curiosity. As it turns out, and as suspected by @lobpcg, the Julia implementation of LOBPCG also fails on this example, in fact both with and without the guess vectors.

mfherbst commented 5 years ago

The naive Julia implementation referenced in scipy/scipy#10258 actually does pretty well on the example of mfherbst/issue_scipy_lobpcg_guess. Even converging down to a 1e-14 tolerance in a reasonable number of iterations (some 250 without a preconditioner and 100 with one) is possible.

mohamed82008 commented 5 years ago

@mfherbst I think full orthogonalization of the basis with QR is an almost sure-fire way to make LOBPCG as stable as the QR algorithm (there is a loophole when using constraints). The only problem is the complexity the QR approach introduces in the form of additional matrix-matrix multiplications, e.g. https://gist.github.com/antoine-levitt/f8ac3637c5d5d778a448780172e63e28#file-lobpcg-jl-L30, which is why I suggested it as a last resort in #247 if nothing else works. The QR "fix" in #247 is actually a less extreme measure than full orthogonalization, as it only does matrix-matrix multiplications with P and/or R but not with X, for example.

lobpcg commented 5 years ago

QR is also very bad for parallel execution - the actual reason I avoid it in the original LOBPCG algorithm, which is also implemented in many parallel libraries.

mohamed82008 commented 5 years ago

I think the following approach has potential. Let U be the basis [X P R].

  1. Remove the components of U along the constraints matrix C.
  2. Form the Gram matrix gramB = U' * B * U.
  3. Find the "compact" eigenvalue decomposition of gramB, eliminating the nullspace basis: gramB = V_c * L_c * V_c'.
  4. Update U using U = U * V_c * inv(sqrt.(L_c)), i.e. scale by the inverse square roots of the retained eigenvalues so that U' * B * U == I.
  5. Efficiently update A*U and B*U by right-multiplying them by the same V_c * inv(sqrt.(L_c)).

Note that the number of columns of U, A*U and B*U can decrease in steps 4 and 5. The above guarantees that U' * B * U == I without introducing new basis vectors that may conflict with the constraints matrix C.

Similarly, if using QR, I think we need to make sure not to include any additional basis vectors from Q that are not spanned by the matrix we are trying to orthogonalize, i.e. to only take columns whose corresponding diagonal entry in R is non-zero. These additional basis vectors can conflict with the constraint matrix C, which will backfire in the end.

The eigenvalue decomposition approach above avoids the need for additional matrix-matrix multiplications involving the matrices A or B and is more parallelizable, so it should be more efficient without sacrificing stability. A sketch of the core step follows.
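A minimal Julia sketch of steps 2–5 (the function name and drop tolerance are illustrative, dense matrices assumed):

using LinearAlgebra

# Eigendecomposition-based B-orthonormalization (an SVQB-style step).
# Illustrative sketch only: borthonormalize and droptol are not package code.
function borthonormalize(U, B; droptol=1e-14)
    gramB = Hermitian(U' * (B * U))     # small k-by-k Gram matrix
    L, V = eigen(gramB)                 # real eigenvalues, ascending order
    keep = L .> droptol * maximum(L)    # drop the (near-)nullspace basis
    T = V[:, keep] * Diagonal(1 ./ sqrt.(L[keep]))
    return U * T, T                     # T also updates cached A*U and B*U
end

U, B = randn(100, 6), I
Uo, T = borthonormalize(U, B)
@assert opnorm(Uo' * (B * Uo) - I) < 1e-8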

Comments? @lobpcg @antoine-levitt

antoine-levitt commented 5 years ago

As soon as you discard information in this way, you will slow down convergence (essentially, downgrading to LOBPG, i.e. without the P) for precisions smaller than ~1e-8 (try it, you'll see), which are important for a number of applications.

mohamed82008 commented 5 years ago

Well, if the whole P matrix disappears due to zero eigenvalues, LOBPCG basically turns into a preconditioned gradient descent algorithm, as opposed to a preconditioned conjugate gradient algorithm, so convergence can take a hit. But increasing the subspace dimension is not computationally trivial, so perhaps the cost of the additional iterations of LOBPG (without the P matrix, i.e. with the "conjugate" dropped) can be compensated for by the savings from not expanding the basis using QR, for example in the unconstrained case. Besides, if a good preconditioner is used, the difference in convergence speed in terms of the number of iterations may be even less serious. I think it is worth a try.

mohamed82008 commented 5 years ago

Perhaps both options can be provided once the constraint issue for QR is figured out; not that I am volunteering to implement both! If this turns out to be too much work, I may not have the time to do it any time soon; this is a somewhat busy period for me. But let's see.

mohamed82008 commented 5 years ago

The constrained case for QR seems simple to handle. We just need to check that any additional column of Q we include in the basis, i.e. one whose corresponding diagonal element of R is 0, has a component orthogonal to C, and C-orthogonalize it before adding it; see the sketch below.
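A rough sketch of that check in Julia, assuming B = I and that C has orthonormal columns (the function name and tolerance are illustrative):

using LinearAlgebra

# Constrained QR: keep columns of Q backed by a nonzero R diagonal; extra
# columns are C-orthogonalized first and dropped if nothing remains.
# Illustrative sketch only; assumes C' * C == I.
function constrained_qr_basis(U, C; tol=1e-12)
    F = qr(U)
    Q, d = Matrix(F.Q), abs.(diag(F.R))
    cols = Vector{eltype(Q)}[]
    for j in axes(Q, 2)
        q = Q[:, j]
        if d[j] <= tol * maximum(d)     # extra basis vector not spanned by U
            q -= C * (C' * q)           # remove components along C
            norm(q) <= tol && continue  # no component orthogonal to C: drop
            q /= norm(q)
        end
        push!(cols, q)
    end
    return reduce(hcat, cols)
end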

mohamed82008 commented 5 years ago

So it seems a method similar to the one proposed above was also proposed in https://arxiv.org/pdf/1704.07458.pdf under the name svqbDrop. I haven't read the paper in detail, but it seems to have interesting ideas.

lobpcg commented 5 years ago

> So it seems a method similar to the one proposed above was also proposed in https://arxiv.org/pdf/1704.07458.pdf under the name svqbDrop. I haven't read the paper in detail, but it seems to have interesting ideas.

We discussed this paper already, e.g., https://github.com/JuliaMath/IterativeSolvers.jl/pull/247#issuecomment-498047061

lobpcg commented 5 years ago

> Well, if the whole P matrix disappears due to zero eigenvalues, LOBPCG basically turns into a preconditioned gradient descent algorithm, as opposed to a preconditioned conjugate gradient algorithm, so convergence can take a hit. But increasing the subspace dimension is not computationally trivial, so perhaps the cost of the additional iterations of LOBPG (without the P matrix, i.e. with the "conjugate" dropped) can be compensated for by the savings from not expanding the basis using QR, for example in the unconstrained case. Besides, if a good preconditioner is used, the difference in convergence speed in terms of the number of iterations may be even less serious. I think it is worth a try.

Dropping P for all remaining iterations is extreme (I do it in MATLAB under some conditions) and of course results in a dramatic slowdown, as already discussed in https://github.com/JuliaMath/IterativeSolvers.jl/pull/247#issuecomment-498314583

To fix the current test case, you need to drop P only on iterations where P is linearly dependent, as I now do in https://github.com/lobpcg/blopex/blob/master/blopex_tools/matlab/lobpcg/lobpcg.m , but not yet in scipy. Please see the updated scipy/scipy#10258 (comment)

lobpcg commented 5 years ago

Let me ping an expert, @joseeroman, in case he has some advice.

lobpcg commented 5 years ago

@rc @mfherbst @joseeroman @antoine-levitt @mohamed82008 please see https://github.com/scipy/scipy/issues/10258#issuecomment-500145533

mohamed82008 commented 5 years ago

I have my doubts about this "fix", since it doesn't really change the overall basis. I would be very surprised if it actually fixed all of @mfherbst's test cases.

lobpcg commented 5 years ago

I have updated my comment: to fix the current test case, you need to drop P on iterations where P is linearly dependent, as I now do in https://github.com/lobpcg/blopex/blob/master/blopex_tools/matlab/lobpcg/lobpcg.m , but not yet in scipy. Please see scipy/scipy#10258 (comment)

stevengj commented 3 years ago

When I was implementing a related algorithm many years ago (a cruder predecessor to LOBPCG, since at that time I didn't realize you could perform the line minimizations by solving a Ritz problem; see https://www.osapublishing.org/oe/fulltext.cfm?uri=oe-8-3-173&id=63584), I found that it was sufficient to re-orthogonalize occasionally (not every step). Maybe you could estimate the condition number from the Cholesky factors (which can be done cheaply, I think?) in order to decide whether to re-orthogonalize?

In that work I used the polar decomposition A = QP. This can be computed relatively straightforwardly in parallel, since P = sqrt(A'A) is a small matrix: you just need a parallel matrix transpose and a reduction for the Gram matrix A'A. Maybe this is related to @mohamed82008's suggestion above.
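For reference, a minimal sketch of that polar-decomposition orthonormalization (the function name is illustrative):

using LinearAlgebra

# Polar-decomposition orthonormalization: for a tall X, the orthogonal
# factor is Q = X * (X' * X)^(-1/2). Only the small k-by-k Gram matrix
# needs a (serial) square root; the rest is parallel-friendly matmuls.
function polar_orthonormalize(X)
    G = Hermitian(X' * X)   # Gram matrix: one parallel reduction
    return X / sqrt(G)      # Q = X * inv(sqrt(G)), so Q' * Q ≈ I
end

Q = polar_orthonormalize(randn(1000, 8))
@assert opnorm(Q' * Q - I) < 1e-10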

antoine-levitt commented 3 years ago

Oh wow, that's an old thread. Since then, we ended up writing our own implementation in https://github.com/JuliaMolSim/DFTK.jl/blob/master/src/eigen/lobpcg_hyper_impl.jl. It uses a number of tricks for maximal stability; we haven't been able to make it break yet (and not for lack of trying :-)), and it keeps the full convergence rate even when converging close to machine precision, which I have never been able to do with any other LOBPCG implementation. It also fixes a very tricky issue whereby locking degraded convergence, which took me a good while to figure out; I think no other implementation has that fix. I wanted to add some other tricks (like avoiding a full diagonalization of X'AX using perturbation theory when close to convergence, which gives quite a nice speedup in certain circumstances) and possibly make a paper out of it, but other things got in the way. If somebody is interested in picking this up again, I'd be happy to share.

The choice there was to only use Cholesky for orthogonalization, because it's the fastest in a parallel setting (it's like the polar decomposition, but Choleskys are faster than matrix square roots). It's also very unstable, so we do two things: 1) we add something to the diagonal when the Cholesky fails, and 2) we re-orthogonalize when needed, indeed using the condition number of the Cholesky factors as you suggest @stevengj (see https://github.com/JuliaMolSim/DFTK.jl/blob/master/src/eigen/lobpcg_hyper_impl.jl#L77)
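A hedged sketch of those two safeguards (the function name, shift growth, and maxcond threshold are assumptions for illustration, not DFTK's actual code):

using LinearAlgebra

# Cholesky-based orthonormalization with the two safeguards described
# above. Illustrative only: shift growth and maxcond are ad hoc choices.
function chol_orthonormalize(X; maxcond=1e4)
    G = Hermitian(X' * X)
    shift = zero(real(eltype(G)))
    F = cholesky(G; check=false)
    while !issuccess(F)
        # Safeguard 1: grow a diagonal shift until the Cholesky succeeds.
        shift = max(2shift, eps(real(eltype(G))) * opnorm(G))
        F = cholesky(G + shift * I; check=false)
    end
    Q = X / F.U             # Q = X * inv(R), so Q' * Q ≈ I
    # Safeguard 2: an ill-conditioned R means Q' * Q is far from I; redo.
    return cond(Matrix(F.U)) > maxcond ? chol_orthonormalize(Q; maxcond) : Q
end

Q = chol_orthonormalize(randn(500, 10))
@assert opnorm(Q' * Q - I) < 1e-10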

ViralBShah commented 3 years ago

Would it be useful for others to bring your implementation into this package?

mfherbst commented 3 years ago

As I said on the Discourse thread, our implementation at its current stage is not a complete drop-in replacement for the lobpcg of this package (e.g. generalised eigensolves are not tested very thoroughly, only the smallest eigenvalues are implemented, etc.). So getting it to fully replace the existing implementation would be a bit of work. Coexistence might be a little easier short-term, I suppose.

Other than that, it would lower the maintenance burden on our side, so I would not mind helping to get it integrated elsewhere :smile:. What do you think @antoine-levitt?

lobpcg commented 3 years ago

Since my last comment in this thread, I updated https://github.com/scipy/scipy/blob/v1.5.4/scipy/sparse/linalg/eigen/lobpcg/lobpcg.py a year ago to make it more stable; e.g., it now also runs in single precision to full attainable accuracy. My tricks are different from those @antoine-levitt describes above, so stability and performance should also differ. It would be interesting for someone to compare, and maybe combine, some of the tricks...

antoine-levitt commented 3 years ago

I already put it up at https://github.com/antoine-levitt/LOBPCG last year; it's the version that DFTK's is based on. Anybody is welcome to do what they want with it. It's definitely more stable than the one in IterativeSolvers, and therefore I would recommend it as a first attempt. Generalized eigenvalue problems are OK (definitely more stable than the implementation in IterativeSolvers) as long as B is not horribly badly conditioned; there's an alternative tweak that doesn't assume B to be relatively well conditioned, but it needs additional B multiplications for stability, so I didn't put it in by default.

I did not benchmark extensively outside of DFTK's operating point, so I can't say for sure that the implementation is unambiguously better than the one in IterativeSolvers regarding performance: there might be more matmuls or memory usage in our version; to be checked.

LOBPCG, like most Krylov methods, is a very delicate algorithm that is fundamentally unstable, and implementations have to walk a fine line between stability and efficiency, so I think it's fine to have several implementations coexist. If someone does the work of comparing implementations and finds that one (possibly after tweaks) is unambiguously better than the others, then it might make sense to have just one, or possibly two Pareto-optimal implementations: one geared more towards stability and one more towards performance. I can help, but I don't have time to actually do that non-trivial amount of work.

mohamed82008 commented 3 years ago

Let's get a GSoC on this.

mohamed82008 commented 3 years ago

If there were a write-up somewhere describing the algorithm to be implemented, it would make it easier for the student. Otherwise, we would need a more mathematically adept student than the average GSoC participant to make sense of the arguments and scripts posted here. If a paper can come out of it, perhaps a graduate student might also be interested in volunteering and doing all the comparisons and benchmarks.

lobpcg commented 3 years ago

The problem is that there are several papers describing different competing tricks, and correspondingly several different actual implementations of LOBPCG in various packages; see https://en.wikipedia.org/wiki/LOBPCG . In exact arithmetic, different implementations should give exactly the same results, although with quite different computational costs and memory requirements. In double precision, and especially in single precision, implementations vary greatly in numerical stability and may simply fail depending on the input. There is no universally accepted "best" implementation. Arguably, the Python version https://github.com/scipy/scipy/blob/v1.5.4/scipy/sparse/linalg/eigen/lobpcg/lobpcg.py is the most used and the most tested by various users. If you just want to modify the existing Julia version to literally follow every detail of this Python version, it is a pure coding exercise, easy for a CS student, basically just translating Python into Julia. If the goal is first to determine the "best" version to implement in Julia, it is a difficult, possibly open-ended, PhD-level math/CS research project.

mohamed82008 commented 3 years ago

Either way, we can put it on the GSoC project list and depending on the level of the student, we can do good old scope creep.

antoine-levitt commented 3 years ago

Agreed with Andrew here: the pure coding exercise is not very interesting, as one can just use PyCall to call scipy's version from Julia. The interesting part requires serious numerical skills.

mfherbst commented 3 years ago

A GSoC student for this would be pretty cool. Let me know if there is something I can do to help when it gets to it!

ViralBShah commented 3 years ago

I think the main thing to do is to list the project here and add yourself as a mentor (with any other potential co-mentors):

https://julialang.org/jsoc/projects/

I think students will soon start visiting this list, so it's best to put it up immediately.