markdewing opened this issue 2 years ago
Why are we not using a generalized nonsymmetric eigensolver instead of inverting the overlap matrix and calling `dgeev`?
I think using a generalized nonsymmetric solver is slower. There is code in QMCFixedSamplingLinearOptimize{Batched} to do so: the three-parameter version of `getLowestEigenvector` uses `dggev` instead. Plus, there are even fewer implementations of the generalized version.
ScaLAPACK complications: To get the eigenvalues, the following routines get called:

- `dgehrd`: reduces a general matrix to upper Hessenberg form (Hessenberg form is upper triangular + one sub-diagonal)
- `dhseqr`: computes the eigenvalues of a Hessenberg matrix

Versions of both of these exist in ScaLAPACK. Getting the eigenvalues is straightforward.

The eigenvectors are where it gets more complicated.

- `dgehrd`: also stores the transformations from a general matrix to Hessenberg form (H = Q^T A Q). The matrix Q is compactly stored in the lower part of H (which would normally be all zeros) plus another vector of values (tau). (Note that H here refers to a matrix in Hessenberg form, not the Hamiltonian matrix.)
- `dorghr` / `dorgqr`: generates Q from its compact storage form (a product of reflectors)
- `dhseqr`: in addition to the eigenvalues, it can multiply an input matrix (Q) by the Z matrix from the Schur factorization of the original matrix: A = (QZ) T (QZ)^T
- `dtrevc3`: computes the eigenvectors of a quasi-upper-triangular matrix (T from the previous step). If computing all the eigenvectors, it can multiply by the QZ matrix from the previous step to get the eigenvectors of the original matrix.

There is no version of `dorghr` in ScaLAPACK, but there is a version of `dorgqr`.

There is no version of `dtrevc3` in ScaLAPACK, but there are complex versions (`ztrevc` in LAPACK; `pctrevc`/`pztrevc` in ScaLAPACK). (The routines `dtrevc3` and `dtrevc` solve the same problem, just with different algorithms.)

A couple of ideas:

- Use the complex ScaLAPACK routine `pztrevc`, treating the real problem as complex.
- `dtrevc3` can select a subset of eigenvectors (we only need one). In this case, the multiplication by QZ is not done by the routine. Maybe the selection and computation of the eigenvector could be done on one rank, and then the matrix-vector multiply by QZ could be done in parallel.

Other notes:

- `dgeev` is issue 1 in the ScaLAPACK GitHub repo: https://github.com/Reference-ScaLAPACK/scalapack/issues/1
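The `dgehrd` → `dhseqr` → `dtrevc3` pipeline described above can be checked numerically through SciPy's higher-level wrappers (`scipy.linalg.hessenberg` calls `dgehrd`/`dorghr`, and `scipy.linalg.schur` produces the real Schur form via `dhseqr`). A small sketch on a random matrix:

```python
import numpy as np
from scipy.linalg import hessenberg, schur, eig

rng = np.random.default_rng(1)
n = 30
A = rng.standard_normal((n, n))

# Step 1 (dgehrd + dorghr): A = Q H Q^T with H upper Hessenberg
H, Q = hessenberg(A, calc_q=True)
print(np.allclose(A, Q @ H @ Q.T))

# Step 2 (dhseqr): real Schur form A = Z T Z^T, T quasi-upper-triangular
# (1x1 and 2x2 diagonal blocks carry the real and complex eigenvalues)
T, Z = schur(A, output='real')
print(np.allclose(A, Z @ T @ Z.T))

# Step 3 (dtrevc3): eigenvectors of T, back-multiplied by Z, give
# eigenvectors of A; here we just verify one eigenpair from eig()
w, v = eig(A)
print(np.allclose(A @ v[:, 0], w[0] * v[:, 0]))
```

The missing ScaLAPACK piece is exactly step 3: distributing the eigenvector computation for T and the back-transformation by Z.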
What's the status of the LMY engine? It occurs to me that there's a chance that Eric might have implemented this stuff already somewhere.
Also, how crazy would it be to think about implementing one of these by hand?
https://journals.aps.org/prb/abstract/10.1103/PhysRevB.85.045103
https://aip.scitation.org/doi/full/10.1063/1.5125803
LMY engine has been untouched for a long time, but a discussion with Eric would be productive. In particular the "optimize subsets of parameters" approach is implemented there.
I recall that the outcome of our previous discussions was that we should get a distributed-memory ScaLAPACK solver working, since that will work everywhere and will greatly relieve memory pressure. @markdewing ?
It would be nice to do a proper study of optimizer performance (convergence, robustness, speed) when all the derivatives are consistently implemented.
The LMY engine code does not work with the batched driver. At the very least, `QMCCostFunctionBatched::engine_checkConfigurations` is not implemented.
If the solver in those papers is the SPAM (Subspace Projected Approximation Matrix) method, then it appears to be implemented in the `formic` code, though it is not enabled in the LMY engine constructor from QMCPACK.
There's also the blocked linear method, which solves for a subset of parameters. I think that method is connected to QMCPACK.
I have written a miniapp/kernel for the linear method solver ( https://github.com/markdewing/qmc_kernels/tree/master/kernels/generalized_eigensolve ) that is useful for exploring the solver portion in isolation.
Painful as it is to note, the LMY engine comes with a lot of baggage along with its better algorithms. I think a clean implementation of any new algorithms without LMY is by far the preferred option. There is much to be gained simply by having fewer lines of code for optimization and in the repo overall.
The linear method only needs one eigenvector. Often there are a number of extremely low eigenvalues that are clearly not desirable, so the method has a step that looks through the eigenvalues for the one closest to E_0 (see #4917 for more discussion on this step). It would be useful to take advantage of this rather than solve for all the eigenvalues and eigenvectors. ARPACK ( https://github.com/opencollab/arpack-ng ) is an iterative eigensolver which has a shift-and-invert mode that finds the eigenvalues closest to a target value. This mode can be used in the linear method.
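SciPy exposes ARPACK's shift-and-invert mode through `scipy.sparse.linalg.eigs` (the `sigma` argument plays the role of the target energy). A sketch on a random nonsymmetric stand-in matrix, with the target value chosen arbitrarily for illustration:

```python
import numpy as np
from scipy.linalg import eig
from scipy.sparse.linalg import eigs

rng = np.random.default_rng(3)
n = 200
A = rng.standard_normal((n, n))

# Dense reference: all n eigenvalues (the O(N^3) dgeev path)
w_all = eig(A, right=False)

# ARPACK shift-and-invert: compute only the eigenvalue closest to `target`
target = 0.5                              # stands in for the estimate of E_0
w_one, v_one = eigs(A, k=1, sigma=target)

# The iterative result should be as close to the target as the nearest
# eigenvalue from the dense solve (conjugate pairs tie in distance)
nearest = w_all[np.argmin(np.abs(w_all - target))]
print(np.isclose(abs(w_one[0] - target), abs(nearest - target)))

# ...and it should be a genuine eigenpair of A
print(np.allclose(A @ v_one[:, 0], w_one[0] * v_one[:, 0]))
```

Internally, shift-and-invert replaces matrix-vector products with solves of (A - sigma*I)x = b, so the dominant cost becomes one LU factorization plus a handful of triangular solves instead of a full eigendecomposition.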
The Python script in `qmc_kernels` has been updated with an ARPACK version (`qmcpack_linear_method_solver.py`), and there is a C++ version in the `arpack` subdirectory.
Some timings (CPU, 28-core Broadwell):

| Matrix size | LAPACK time (s) | ARPACK time (s) |
|---|---|---|
| 3055 | 17.4 | 1.5 |
| 6855 | 124.8 | 7.5 |
The `rot_det_speedup` branch contains a prototype implementation of the linear method using ARPACK.
This issue is an exploration into accelerating the eigensolver part of the linear method. See #3633 for the umbrella issue on scaling the linear method with respect to the number of parameters.
The nonsymmetric generalized eigenvalue problem scales as N_p^3, so it will eventually dominate the time in each linear method step as the number of parameters increases.
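A quick, machine-dependent illustration of the cubic cost, timing SciPy's dense nonsymmetric eigensolve (`dgeev` underneath) at two sizes; the sizes here are arbitrary:

```python
import time
import numpy as np
from scipy.linalg import eig

rng = np.random.default_rng(4)

def solve_time(n):
    """Time one full nonsymmetric eigensolve of a random n x n matrix."""
    A = rng.standard_normal((n, n))
    t0 = time.perf_counter()
    eig(A, right=False)
    return time.perf_counter() - t0

t1, t2 = solve_time(500), solve_time(1000)
# Doubling n should cost roughly 2^3 = 8x (threading and BLAS tuning
# will move the measured ratio around)
print(f"{t2 / t1:.1f}x slower for 2x the matrix size")
```

This is the term that eventually dominates each linear-method step, which is why trimming it (ARPACK) or distributing it (ScaLAPACK) matters.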
This issue is to record some of the investigations and observations.
The problem is that there are multiple orthogonal directions for acceleration: multicore, GPUs, and distributed memory. As far as I know, there are no good solutions that combine all of these.
Options
The next step is to implement a ScaLAPACK version of the generalized eigenvalue kernel and explore the scaling.