JuliaLinearAlgebra / IterativeSolvers.jl

Iterative algorithms for solving linear systems, eigensystems, and singular value problems
MIT License

[Do not merge] Fix 246 - QR ortho fallback for LOBPCG #247

Open mohamed82008 opened 5 years ago

mohamed82008 commented 5 years ago

Fixes #246 by implementing QR orthogonalization as a fallback when CholQR fails.

CC: @mfherbst, @haampie, @lobpcg

mohamed82008 commented 5 years ago

@mfherbst could you try this PR in your real application and see whether it does the trick? With this, I am getting a 100% success rate on your toy problem.

mohamed82008 commented 5 years ago

This method seems to produce unexpected results elsewhere. I will implement the QR orthogonalization.

mfherbst commented 5 years ago

Thanks for looking into this! This time I am travelling for the week :smile:. I'll try to get to it asap.

mohamed82008 commented 5 years ago

@mschauer if tests pass, please give this a second review.

codecov-io commented 5 years ago

Codecov Report

Merging #247 into master will decrease coverage by 1.23%. The diff coverage is 40%.

@@            Coverage Diff             @@
##           master     #247      +/-   ##
==========================================
- Coverage   90.52%   89.29%   -1.24%     
==========================================
  Files          17       17              
  Lines        1077     1093      +16     
==========================================
+ Hits          975      976       +1     
- Misses        102      117      +15
Impacted Files Coverage Δ
src/lobpcg.jl 81.33% <40%> (-4.49%) :arrow_down:

mschauer commented 5 years ago

Can you briefly say in words what you are doing?

mohamed82008 commented 5 years ago

I added some comments to the PR. In short, the role of orthogonalization (technically orthonormalization) in LOBPCG is to make sure that the matrix X is B-orthonormal. This necessarily means that X should have full column rank, which nothing in the block version of the algorithm guarantees. Additionally, if the tolerance is very tight, X gets closer and closer to rank deficiency as the algorithm converges. During orthonormalization, we also keep track of the matrices A * X and B * X. The current way of achieving this B-orthonormality is to take the Cholesky factorization of X' B X and do X = X * inv(R), where R is the upper triangular Cholesky factor. A * X and B * X are also optionally updated if the update flags are true.

However, when X is nearly rank-deficient, the Cholesky fails. So as a fallback, I am using a QR decomposition to get a set of orthonormal vectors spanning a superset of the span of the columns of X. This is the compact Q, i.e. the first m columns of Q, where m = size(X, 2). After that, for the generalized eigenvalue case, I go ahead and B-normalize the compact Q using the Cholesky approach above. When B = I, this step can be skipped. Finally, A * X and B * X are optionally updated.
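A rough sketch of those two paths in plain Julia (illustrative only, not the PR's actual code, which also carries A * X and B * X along via the update flags described above):

```julia
using LinearAlgebra

# B-orthonormalize the columns of X: CholQR first, QR as the fallback.
function b_ortho(X, B)
    G = Symmetric(X' * (B * X))          # small m-by-m Gram matrix
    F = cholesky(G; check = false)
    if issuccess(F)
        return X / F.U                   # X * inv(R), R upper triangular
    end
    # Fallback: orthonormal columns spanning a superset of span(X)
    Q = Matrix(qr(X).Q)                  # compact Q, the first m columns
    if !(B isa UniformScaling)           # generalized case: B-normalize Q
        Q = Q / cholesky(Symmetric(Q' * (B * Q))).U
    end
    return Q
end
```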

mfherbst commented 5 years ago

@mohamed82008 I've done a little testing in our application (DFTK.jl). The solver now runs a lot more stably, but I still seem to get some spurious (zero) eigenvalues. I'll take another, closer look.

lobpcg commented 5 years ago

> I added some comments to the PR. In short, the role of orthogonalization (technically orthonormalization) in LOBPCG is to make sure that the matrix X is B-orthonormal. […]

@mohamed82008 For possible future improvement: the columns of X in LOBPCG are often badly scaled, with norms ranging from 1e1 to 1e-16, which makes the matrix X' B X extremely ill-conditioned. The original LOBPCG implementation relies on the fact that many Cholesky codes (e.g., in LAPACK) are numerically invariant to scaling, so re-scaling of X has no effect. However, many home-made Cholesky implementations are not scale-invariant and fail; this is the most common cause of LOBPCG errors. In the latter case, it helps to first try making Q equal to X but with normalized columns, which is cheaper than QR. If this still fails, then the QR fallback implemented in this PR is needed.

lobpcg commented 5 years ago

> @mohamed82008 I've done a little testing in our application (DFTK.jl). The solver now runs a lot more stably, but I still seem to get some spurious (zero) eigenvalues. I'll take another, closer look.

If you run LOBPCG with full internal output, printing the current approximate eigenvalues and residual norms, and post it here, I may be able to offer some advice. I have never seen LOBPCG produce spurious (zero) eigenvalues.

mohamed82008 commented 5 years ago

@lobpcg thanks for your advice. I will try the column normalization method. The nice thing about this approach is that no additional matrix multiplications would be required, since A * X and B * X can be cheaply updated. Perhaps I can also always do the normalization step before performing the Cholesky. The normalization factors can be taken to be the square roots of the diagonal elements of the Gram matrix X' B X. The updated Gram matrix will therefore have a diagonal of 1s.

If done right, this method will only add a tiny overhead, since we can symmetrically normalize the tiny Gram matrix using inv(S)' X' B X inv(S), where S = sqrt.(Diagonal(X' B X)). The rows of inv(R) can then be scaled by inv(S) before multiplying X, i.e. X = X * (inv(S) * inv(R)). R is also a tiny matrix.
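A sketch of that scaled variant (illustrative; in the real code A * X and B * X would be updated alongside X):

```julia
using LinearAlgebra

# CholQR with symmetric diagonal scaling of the Gram matrix.
function scaled_cholqr(X, B)
    G  = X' * (B * X)                    # m-by-m Gram matrix
    s  = sqrt.(diag(G))                  # column norms in the B inner product
    Gn = Symmetric(G ./ (s * s'))        # inv(S) * G * inv(S), unit diagonal
    R  = cholesky(Gn).U                  # Gn = R' * R
    return X / (R * Diagonal(s))         # X * inv(S) * inv(R)
end
```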

mohamed82008 commented 5 years ago

@mfherbst When you say you get spurious zero eigenvalues, why are they spurious? If the residual norms result.residual_norms are below the tolerance, i.e. result.converged is true, then maybe your matrix A is singular? result.trace should give you the convergence history that @lobpcg requested.
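For reference, a minimal usage sketch showing where that information lives; the call signature here is from memory and may differ between package versions, but the result fields are the ones named above:

```julia
using LinearAlgebra, IterativeSolvers

M = rand(100, 100)
A = Symmetric(M * M' + 100I)     # a well-conditioned SPD test matrix
result = lobpcg(A, false, 10)    # smallest 10 eigenpairs

result.converged          # true if all requested eigenpairs met the tolerance
result.residual_norms     # final residual norm of each eigenpair
result.trace              # convergence history, as mentioned above
```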

lobpcg commented 5 years ago

> @lobpcg thanks for your advice. I will try the column normalization method. […]

If the Cholesky implementation is scale-invariant, re-scaling is just a waste, so you do not want to do it all the time. If Julia uses a single implementation of Cholesky that cannot be modified by users, you can simply test yourself once whether it is scale-invariant and decide accordingly. E.g., MATLAB and Octave normally use LAPACK, where Cholesky is scale-invariant (last time I checked, a decade ago...), so there is no re-scaling in the LOBPCG implementations in MATLAB and Octave.
If Julia allows user-defined libraries, you can add the Cholesky scale-invariance check in the LOBPCG initialization and set the re-scaling flag accordingly.
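A possible one-off probe along those lines, as a sketch; the power-of-two scaling is chosen so that the rescaling itself introduces no rounding:

```julia
using LinearAlgebra

# Does this Cholesky cope with a severely column-scaled SPD matrix?
function cholesky_scale_invariant(T = Float64; m = 4)
    X = rand(T, 10m, m)
    G = Symmetric(X' * X)                                  # well-conditioned SPD matrix
    S = Diagonal(T(2) .^ round.(Int, range(-30, 30; length = m)))  # power-of-two scaling
    F = cholesky(Symmetric(S * G * S); check = false)
    issuccess(F) || return false
    # In exact arithmetic chol(S*G*S) == chol(G)*S; compare after undoing the scaling.
    return F.U / S ≈ cholesky(G).U
end
```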

mohamed82008 commented 5 years ago

Julia uses LAPACK by default for the supported number types. Any funky number type, e.g. a high-precision one, will dispatch to a generic fallback implementation in pure Julia. Both seem scale-invariant.

lobpcg commented 5 years ago

Since users can link to external libraries, I recommend adding the Cholesky scale-invariance check in the LOBPCG initialization - that would likely save you from most troubles in the future. We have not done it in https://bitbucket.org/joseroman/blopex/src/master/ to my recollection and I regret it.

antoine-levitt commented 5 years ago

That strategy (Cholesky, then QR if Cholesky fails) sounds reasonable, at least when B is relatively well-conditioned. https://arxiv.org/pdf/1704.07458.pdf recommends the SVQB algorithm, but that's probably more expensive, at least for a relatively large number of eigenpairs.

Dividing by R as you do seems dangerous, though. This might be the cause of @mfherbst's zero eigenvalues. A possibility is to drop columns of R (possibly with pivoting in the QR factorization) to ensure good conditioning.

Another alternative is to try to salvage the Cholesky factorization by simply discarding everything after the point where it went wrong. I played with that some time ago, but don't remember whether it was a good idea or not...
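A sketch of the column-dropping idea using a pivoted QR (illustrative only; the rank tolerance is ad hoc):

```julia
using LinearAlgebra

# Rank-revealing orthonormalization: pivoted QR, dropping columns whose
# R diagonal is negligible.
function ortho_drop(X; rtol = size(X, 2) * eps(real(eltype(X))))
    F = qr(X, ColumnNorm())            # column-pivoted QR (qr(X, Val(true)) on older Julia)
    d = abs.(diag(F.R))                # decreasing for pivoted QR
    k = count(>(rtol * d[1]), d)       # numerical rank estimate
    return Matrix(F.Q)[:, 1:k]         # orthonormal basis for the well-conditioned part
end
```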

lobpcg commented 5 years ago

Spurious eigenvalues are impossible if the output is separately verified for accuracy, which I believe is the case in this code.

The present Julia implementation of LOBPCG, if I understand correctly, basically follows my https://www.mathworks.com/matlabcentral/fileexchange/48-lobpcg-m, probably without the "restarts" where P is simply dropped. Instabilities are rare in practice and usually result from users doing something adventurous. https://arxiv.org/pdf/1704.07458.pdf aims at very large block sizes >400, which brings multiple difficulties that are still not perfectly addressed.

antoine-levitt commented 5 years ago

> https://arxiv.org/pdf/1704.07458.pdf aims at very large block sizes >400, which brings multiple difficulties that are still not perfectly addressed

Yet this is the regime in which a number of codes work. In particular some applications in electronic structure need quite stringent tolerances, where it's important to get stability right.

lobpcg commented 5 years ago

The LOBPCG code gives an option to calculate, e.g., 1000 eigenstates in smaller blocks, say of size 100, in the orthogonal complement of the previously found eigenstates. Not only is this more stable, but more importantly it controls the cubic (in the block size) growth of the cost. Increasing the block size improves LOBPCG efficiency, but only up to a certain point.
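The projection step of that strategy, sketched with illustrative names (Y holds the previously found eigenvectors, assumed B-orthonormal):

```julia
using LinearAlgebra

# Project a fresh block X0 into the B-orthogonal complement of the
# already-converged eigenvectors Y, i.e. X0 <- (I - Y * Y' * B) * X0.
function deflate!(X0, Y, B)
    X0 .-= Y * ((B * Y)' * X0)
    return X0
end
```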

antoine-levitt commented 5 years ago

It does cut the Rayleigh-Ritz costs, but not really the orthogonalizations. Also, it's trickier in a massively parallel setting. LOBPCG, like pretty much all iterative eigensolvers, is intrinsically an unstable algorithm for what is ultimately a well-conditioned problem (i.e., assuming a reasonable eigenvalue separation and spectral radius, the output is well-defined and not susceptible to roundoff errors); any high-quality general-purpose implementation should implement countermeasures to remedy this (a library cannot rely on users not "doing something adventurous"). In this case, I'd be very much surprised if dividing by R didn't cause some trouble down the line.

mfherbst commented 5 years ago

As requested by @lobpcg and @mohamed82008, I came up with another test case taken from our DFTK code as an example; see this gist. I just ran it on my machine using the implementation of this PR and got the following output:

Size of matrix:  144
Reference evals: [0.214573, 0.214573, 0.214573, 0.327013, 0.476932, 0.589372, 0.701812, 0.701812, 0.851732, 0.851732]
Success rate: 94.8%

A failing trace:
Iteration    Maximum residual norm 
---------    ---------------------
       1      1.390425e+00
       2      1.014511e+00
       3      6.710388e-01
       4      5.918896e-01
       5      4.456354e-01
       6      3.319617e-01
       7      2.810245e-01
       8      2.517689e-01
       9      2.611551e-01
      10      1.412077e-01
      11      8.710556e-02
      12      6.381408e-02
      13      3.660325e-02
      14      2.460445e-02
      15      1.704506e-02
      16      1.175045e-02
      17      7.822337e-03
      18      6.189870e-03
      19      4.623745e-03
      20      2.951122e-03
      21      2.294375e-03
      22      1.845383e-03
      23      1.382024e-03
      24      1.003886e-03
      25      8.110146e-04
      26      5.001395e-04
      27      3.185128e-04
      28      2.137832e-04
      29      1.600569e-04
      30      1.298272e-04
      31      4.300181e-01
      32      8.826261e-06

The interesting step in the trace is 30 -> 31, where convergence worsens for one iteration, only for the algorithm to terminate at the following step. This behaviour is reproduced in every failing trace with the example problem. If I inspect the Ritz values from the iteration states, everything still looks good and on track for convergence at iteration 30, but in iteration 31 a spurious value appears at the lower end of the spectrum, which is then "converged" to a numerical zero in iteration 32.

mfherbst commented 5 years ago

I amended the gist to allow for requesting a varying number of eigenvalues.

For nev==9, where the number of requested eigenpairs cuts through the two-fold degenerate eigenvalue 0.851732, the success rate is much higher (almost 100%). In fact, in my experiments, whenever a complete eigenspace is requested, i.e. there is a non-zero spectral gap between the largest requested eigenvalue and the next one, the current implementation has trouble.

mohamed82008 commented 5 years ago

Ok, I found the source of the "instability". The problem is that I am performing inner orthogonalization of R and P, but not of the full matrix [X P R] from which the Gram matrices are computed for the Rayleigh-Ritz (RR) procedure. This leads to a rank-deficient [X P R]' A [X P R] even when A is full rank, which leads to spurious 0 eigenvalues. In the maximization case, it should lead to infinite eigenvalues instead, since [X P R]' B [X P R] will be rank-deficient.

The solution to this is simple: detect the rank deficiency of the Gram matrix [X P R]' B [X P R] and perform total orthogonalization of the basis matrix [X P R] when needed. This should, in my opinion, make the LOBPCG algorithm as stable as the orthogonalization method used, e.g. QR. The implementation will be a bit more involved than the changes in this PR, as we need to allocate a matrix to host [X P R]. X, P and R can then be made views into this matrix, which avoids the need for double allocation.
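A sketch of the intended memory layout, with hypothetical sizes:

```julia
n, m = 1000, 10                        # hypothetical problem size and block size
XPR = Matrix{Float64}(undef, n, 3m)    # single buffer hosting [X P R]
X = view(XPR, :, 1:m)
P = view(XPR, :, m+1:2m)
R = view(XPR, :, 2m+1:3m)
# The iteration fills X, P and R through these views; when the Gram matrix
# XPR' * B * XPR is detected to be rank-deficient, the whole basis XPR can be
# re-orthogonalized in one shot (e.g. by QR) without copying the blocks out.
```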

mohamed82008 commented 5 years ago

Note that it is possible to prove that X[:,i], P[:,i] and R[:,i] will be linearly independent for each i in exact arithmetic unless exact convergence happens, i.e. X[:,i] does not change at all between two iterations. However, I couldn't prove the internal independence among the columns of X, P, or R, which is the cause of the initial failure, or the cross-independence between, for example, X[:,i] and P[:,r] where r = setdiff(1:size(P, 2), i), which is a potential cause of the second failure. Another possible cause of the second failure is that after the internal QR orthogonalization of P or R, the independence between X[:,i], P[:,i] and R[:,i] may no longer hold.

While these may seem like limitations of the "basic" algorithm, assuming my understanding is correct, I think they are all completely avoidable using careful checks to prepare for the worst case when it happens. DFTK seems like a great test case for this algorithm, so thanks @mfherbst for sharing your pleasantly reproducible examples!

lobpcg commented 5 years ago

@mohamed82008 how has the spurious output passed the tolerance check at the end?

This is a strange crash that I do not recollect normally seeing in my LOBPCG codes: https://www.mathworks.com/matlabcentral/fileexchange/48-lobpcg-m https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.linalg.lobpcg.html https://bitbucket.org/joseroman/blopex/src/master/

I am curious to see how this example runs in my codes, e.g. the MATLAB one, but I have no time to check, sorry.

It appears that your code follows https://github.com/scipy/scipy/blob/master/scipy/sparse/linalg/eigen/lobpcg/lobpcg.py. If so, please check the latest https://github.com/scipy/scipy/pull/9650, which fixes several silly bugs in the logic of the code that made it unnecessarily slow and possibly less stable. Also, you may want to look at my original MATLAB/Octave code https://www.mathworks.com/matlabcentral/fileexchange/48-lobpcg-m, which implements two extra stability improvements not coded in scipy: 1) monitoring the condition numbers of the Gram matrices and, accordingly, 2) dropping P altogether in case of trouble (called a "restart") instead of the true orthogonalization of [X R P] that you are thinking of implementing. The try/catch that you use was not available in MATLAB back then, so I needed to guess a bit in advance, using (1), whether the code was about to crash in order to do (2).

lobpcg commented 5 years ago

> I amended the gist to allow for requesting a varying number of eigenvalues. […]

@mfherbst Thanks for documenting! This surely is an odd behavior. There is something funny going on.

P.S. Just noticed #246 (comment) by @mfherbst, stating that our scipy version runs fine. But our scipy version of LOBPCG uses Cholesky, so there's nothing wrong with Cholesky and dividing by R in this example. It looks like there's just some bug in the Julia code, unrelated to Cholesky or this proposed QR fix.

mohamed82008 commented 5 years ago

> @mohamed82008 how has the spurious output passed the tolerance check at the end?

@lobpcg X is updated using X = [X P R] * V, where V holds the eigenvectors of the reduced RR subproblem. Due to the linear dependence within [X P R], some of the columns of X are not B-normal but have a near-zero norm. Similarly, A*X is updated using A*X = [A*X A*P A*R] * V, and the same for B*X. The residual A*X - B*X*Lambda is small for the spurious vectors because the corresponding columns of A*X and B*X, and the corresponding entries of Lambda, are tiny. So the resulting spurious "eigenvectors" have near-zero norms and near-zero eigenvalues, with residual norms well under the tolerance.
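A toy illustration of that failure mode (not taken from the package):

```julia
using LinearAlgebra

A = Diagonal([1.0, 2.0, 3.0])
B = I
x = 1e-10 * rand(3)              # spurious "eigenvector" with near-zero norm
λ = 0.0                          # spurious near-zero "eigenvalue"
norm(A * x - λ * (B * x))        # ≈ 1e-10, well under a typical tolerance
```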

I don't have much time during the week to work on this. But I will try to give it some time this weekend, hopefully! Thanks @lobpcg for the links, I need to take a closer look when I get some time.

mohamed82008 commented 5 years ago

@lobpcg your MATLAB code is LGPL-licensed. The code here is MIT-licensed, so I was careful not to adapt your MATLAB version of the code, for legal reasons. I didn't even read it in depth, so as not to be influenced by it while writing my code. If you change that code's license to BSD like the Python version, or to MIT like IterativeSolvers.jl, then I can study and adapt it more closely. Alternatively, you can give me a license/permission here to look at it and adapt it in IterativeSolvers.jl. The email I get from GitHub will be my "license".

antoine-levitt commented 5 years ago

> dropping P altogether in case of trouble (called a "restart") instead of the true orthogonalization of [X R P] that you are thinking of implementing

FWIW I've seen this slow down convergence significantly (esp. in the high-accuracy regime), so it'd be better to avoid it if possible. I dug up convergence curves from a few years ago in ABINIT (which implemented a LOBPCG with restarts based on your MATLAB version). The first curve is with the default version, the second with additional orthogonalizations (I don't remember which ones, sorry) to avoid the restarts. (Never mind the y axis; the residuals in ABINIT were squared for some reason.) lobpcg_100.pdf lobpcg_orthofix.pdf

lobpcg commented 5 years ago

> @lobpcg your MATLAB code is LGPL-licensed. The code here is MIT-licensed, so I was careful not to adapt your MATLAB version of the code, for legal reasons.

I have changed the license to MIT at https://github.com/lobpcg/blopex, which includes the MATLAB file.

lobpcg commented 5 years ago

> FWIW I've seen this slow down convergence significantly (esp. in the high-accuracy regime), so it'd be better to avoid it if possible. […]

Of course. The main issue is that the decision there to remove P is based not on an actual try/catch error, but on guessing, using the condition numbers of the Gram matrices and a threshold, which may be overly conservative. I do not recollect doing additional orthogonalizations in ABINIT, and I am unsure where your figures may come from, but this work was a decade ago...

antoine-levitt commented 5 years ago

> I do not recollect doing additional orthogonalizations in ABINIT, and I am unsure where your figures may come from, but this work was a decade ago...

I added those (half a decade ago). Indeed, I just checked, and the restarts were based on estimating the conditioning. I haven't tried restarting based on failed factorizations; it'd be interesting to see if that fixes the stability issues without degrading performance too much.

lobpcg commented 5 years ago

> I added those (half a decade ago). Indeed, I just checked, and the restarts were based on estimating the conditioning. […]

@antoine-levitt sorry, until your last message I did not realize that I was actually familiar with your Chebyshev filter contribution to ABINIT. I did not know that our original LOBPCG code in ABINIT had been modified with extra orthogonalizations to improve stability.

lobpcg commented 5 years ago

Let me ping an expert, @joseeroman, in case he has some advice.

lobpcg commented 5 years ago

@mohamed82008 You need to add orthogonalization of R with respect to X prior to making R orthonormal, and drop P on iterations where P is linearly dependent, as I do in https://github.com/lobpcg/blopex/blob/master/blopex_tools/matlab/lobpcg/lobpcg.m but not yet in scipy. That is cheap and easy, and it should remove most stability issues. It is not the "ultimately stable" fix, of course, but it should be enough in practice.

Please see https://github.com/scipy/scipy/issues/10258#issuecomment-500145533.
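A sketch of that two-part suggestion with placeholder names (not the linked MATLAB code):

```julia
using LinearAlgebra

# (1) B-orthogonalize the residual block R against X (assumed B-orthonormal)
#     before R itself is orthonormalized;
# (2) drop P for this iteration ("restart") if including it makes the basis dependent.
function safeguard_basis(X, R, P, B)
    R = R - X * ((B * X)' * R)              # (1): R <- (I - X * X' * B) * R
    S = [X R P]
    G = Symmetric(S' * (B * S))
    if !issuccess(cholesky(G; check = false))
        S = [X R]                           # (2): restart without P
    end
    return S
end
```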

stevengj commented 3 years ago

> It does cut the Rayleigh-Ritz costs, but not really the orthogonalizations.

I did some arithmetic-count analysis of this a few years back, and IIRC, while using a smaller block size doesn't change the complexity, it does still improve the constant factor. Using a larger block size has other advantages for cache and parallel communications, but at the time and for my particular problem (this was 20 years ago), the optimal block size seemed to be ≈ 11, FWIW.