JeffersonLab / chroma

The Chroma Software System for Lattice QCD
http://jeffersonlab.github.io/chroma

Richardson CG #29

Open martin-ueding opened 7 years ago

martin-ueding commented 7 years ago

Exposing the QPhiX Richardson Solver with CG in Chroma

Rationale

Chroma already has the QPhiX Richardson solver exposed, but only with BiCGStab as the underlying solver. For our Wilson clover simulation, BiCGStab seems to be really unstable. Bartek has extended QPhiX such that one can use CG in the Richardson solver. We want to have this exposed in Chroma in order to use it for our HMC simulation.

Current Implementation

HMC and Propagator Solver

With a git grep QPHIX_CLOVER_ITER_REFINE_BICGSTAB_INVERTER one can find where the Richardson solver is defined within Chroma:

In the linop file, there is a comment stating:

Solve a MdagM*psi=chi linear system by CG2

Whereas in the mdagm file, there is a comment stating:

Solve a MdagM*psi=chi linear system by CG2

The corresponding header files enlighten with the following comment:

Solve a M*psi=chi linear system by BiCGStab

At this point, I am completely lost. I guess that one inverts M†M whereas the other only inverts M. My bet is that mdagm solves M†M and linop solves, well, just the linear operator M.

Indeed, there is one call to mixed_solver in the linop version and two invocations in the mdagm variant:

(*mixed_solver)(psi_s[1], chi_s[1], toDouble(invParam.RsdTarget),
                res.n_count, rsd_final, site_flops, mv_apps, my_isign,
                invParam.VerboseP);

Versus:

(*mixed_solver)(tmp_qphix, chi_qphix, toDouble(invParam.RsdTarget),
                n_count1, rsd_final, site_flops1, mv_apps1, -1,
                invParam.VerboseP);
// ...
(*mixed_solver)(psi_qphix, tmp_qphix, toDouble(invParam.RsdTarget),
                n_count2, rsd_final, site_flops2, mv_apps2, +1,
                invParam.VerboseP);

So these two are indeed the solvers for the propagators (M) and the HMC (M†M).
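(For reference, this looks like the usual two-step trick for inverting M†M with a solver that only handles M: first solve M† t = χ, which I read as the isign = -1 call, then solve M ψ = t with the isign = +1 call. A CG on M†M, by contrast, only needs a single call.)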

Despite the comments in the code, both cases use BiCGStab as the inner solver. The outer solver has always been BiCGStab as well; after Bartek's changes it is merely the default.

“Regular” QPhiX Solver

For the regular QPhiX solver, there is an additional XML tag that can be used to select the CG or BiCGStab solver. It would be nice if something similar could be implemented for the Richardson solver such that the user can specify whether CG or BiCGStab is to be used.

Analogously to the Richardson solver, the regular solver is defined in an HMC and a propagator variant:

In the constructor of the parameter struct, it reads the solver type into an enum variable:

read(paramtop, "SolverType", SolverType);

And then later on there is a switch statement, using dynamic polymorphism, to create the correct solver object.

switch (invParam.SolverType) {
    case CG: {
        QDPIO::cout << "Creating the CG Solver" << std::endl;
        cg_solver = new QPhiX::InvCG<REALT, VecTraits<REALT>::Vec,
                                     VecTraits<REALT>::Soa,
                                     VecTraits<REALT>::compress12>(
            (*M), invParam.MaxIter);
    } break;

    case BICGSTAB: {
        QDPIO::cout << "Creating the BiCGStab Solver" << std::endl;
        bicgstab_solver =
            new QPhiX::InvBiCGStab<REALT, VecTraits<REALT>::Vec,
                                   VecTraits<REALT>::Soa,
                                   VecTraits<REALT>::compress12>(
                (*M), invParam.MaxIter);
    } break;
    default:
        QDPIO::cerr << "UNKNOWN Solver Type" << std::endl;
        QDP_abort(1);
}

For the inversion of M, the actual inversions happen in these two functions:

SystemSolverResults_t cgnrSolve(T &psi, const T &chi) const;
SystemSolverResults_t biCGStabSolve(T &psi, const T &chi) const;

This uses the “conjugate gradient normal relation” to invert M using CG, whereas for BiCGStab it is just a plain application of the solver to M. For the M†M case in the other file, there are these methods:
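(My understanding: since CG needs a Hermitian positive-definite matrix, cgnrSolve presumably solves the normal equations M†M ψ = M† χ, whose solution also solves M ψ = χ, while BiCGStab can be applied to M ψ = χ directly.)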

SystemSolverResults_t biCGStabSolve(
    T &psi, const T &chi, AbsChronologicalPredictor4D<T> &predictor) const;
SystemSolverResults_t cgSolve(
    T &psi, const T &chi, AbsChronologicalPredictor4D<T> &predictor) const;

There is a switch statement in operator() which chooses the right function to call based on the runtime parameter.

switch (invParam.SolverType) {
    case CG: {
        res = cgSolve(psi, chi, predictor);
    } break;
    case BICGSTAB: {
        res = biCGStabSolve(psi, chi, predictor);
    } break;
    default:
        QDPIO::cout << "Unknown Solver " << std::endl;
        break;
}

Planning

The regular QPhiX solvers (meaning the inverters for M and M†M) have a switch that allows both CG and BiCGStab. There is no dynamic polymorphism involved, because CG might need the normal relation in the M case and BiCGStab needs two applications in the M†M case. Therefore the same approach should be applicable to the Richardson solver as well.

For now we only need the Richardson solver in the HMC as we do our propagators with the sLapH method using the peram_gen code based on tmLQCD and QUDA. Therefore I will start with the M†M implementation and do the M implementation later on.

Implementation

The new feature needs to be developed against Bartek's QPhiX branch. This has to be compiled and installed locally, and Chroma then has to be built against it. This worked right away, which was a nice surprise.

The regular and Richardson solvers share the SysSolverQPhiXCloverParams struct, which means it also has the SolverType field. Therefore the XML parsing does not need to be changed; the option just needs to be honored. In syssolver_mdagm_clover_qphix_iter_refine_w.h, I add the same switch statement that already exists in syssolver_mdagm_clover_qphix_w.h.

The regular QPhiX solver has the member functions cgSolve and biCGStabSolve. Similar functions are going to be implemented in the Richardson case as well: there, the existing operator() is renamed to biCGStabSolve, and cgSolve is added analogously to the regular CG case. This will be easier to implement because I just need to call CG once and I have solved M†M.
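A minimal sketch of what this cgSolve could look like, mirroring the biCGStabSolve path quoted above. The conversions between Chroma and QPhiX spinors and the chronological-predictor handling are elided, and the isign value and bookkeeping are assumptions rather than the actual code:

// Sketch only: with CG the mixed-precision solver inverts M†M directly,
// so a single call replaces the two isign = -1 / +1 BiCGStab solves.
SystemSolverResults_t cgSolve(T& psi, const T& chi,
                              AbsChronologicalPredictor4D<T>& predictor) const
{
    SystemSolverResults_t res;
    double rsd_final = 0.0;
    unsigned long site_flops = 0;
    unsigned long mv_apps = 0;

    // ... convert psi/chi into the QPhiX spinors psi_qphix/chi_qphix,
    //     exactly as biCGStabSolve does ...

    (*mixed_solver)(psi_qphix, chi_qphix, toDouble(invParam.RsdTarget),
                    res.n_count, rsd_final, site_flops, mv_apps, 1,
                    invParam.VerboseP);

    // ... convert psi_qphix back into psi and fill in the remaining fields
    //     of res, again as in biCGStabSolve ...
    return res;
}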

Testing

Short HMC on my Laptop

I have compiled Chroma using Bartek's QPhiX branch on my AVX laptop and tested the HMC with Nf = 2 on an 8³×8 lattice from a random starting configuration. I let it run for two trajectories with just a single time step on the outermost time scale. I used two MPI ranks with four threads each (effectively oversubscribing my laptop).

Comparing CG (-) with Richardson-CG (+), one can see that there are differences, but they are not very large. I would conclude that the Richardson-CG did not break anything.

-              <PE_old>-16985.682690378</PE_old>
+              <PE_old>-16985.6827849023</PE_old>

-              <KE_new>33974.0571754877</KE_new>
+              <KE_new>33974.0638706824</KE_new>

-              <PE_new>-50450.2118875695</PE_new>
+              <PE_new>-50450.2159562068</PE_new>

-            <deltaKE>34020.4653290306</deltaKE>
+            <deltaKE>34020.4720242253</deltaKE>

-            <deltaPE>-33464.5291971914</deltaPE>
+            <deltaPE>-33464.5331713045</deltaPE>

-            <deltaH>555.93613183916</deltaH>
+            <deltaH>555.938852920808</deltaH>

-            <AccProb>3.63082779652542e-242</AccProb>
+            <AccProb>3.62096144730027e-242</AccProb>


You can have a look at the input file hmc-in.xml.txt, the output from plain CG dat-cg.xml.txt, and the output from Richardson-CG dat-richardson-cg.xml.txt.

The overall time to solution seems to be the same; a difference can probably not be seen on my laptop with a non-thermalized random configuration anyway.

kostrzewa commented 7 years ago

Excellent work Martin. I will pull-request on QPhiX soon such that someone else can test this. As for the time to solution: did you actually do mixed precision or did you have a double-double solver? Also, you oversubscribed, so any benefit might be masked by that...

martin-ueding commented 7 years ago

Thank you for the kind words!

It seems I did a single-single solve. My QDP++ is configured with the default precision (which seems to default to single) and Chroma is configured with the default inner precision (which also defaults to single). Therefore I have an inner and outer veclen of 8 on my 256-bit machine, so no conversion was done.
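(For reference: the vector length is the SIMD register width divided by the floating-point width, so on a 256-bit AVX machine it is 256/32 = 8 for single precision and 256/64 = 4 for double. An inner and outer veclen of 8 therefore means single precision throughout.)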

I will recompile QDP++ and Chroma using double-precision on my machine, then I can test it again and also I will use 1 MPI rank and 2×2 threads.

martin-ueding commented 7 years ago

In the case of differing inner and outer precisions, it does not compile any more:

In file included from /home/mu/Projekte/chroma/lib/actions/ferm/invert/qphix/syssolver_mdagm_clover_qphix_iter_refine_w.cc:8:0:
/home/mu/Projekte/chroma/lib/actions/ferm/invert/qphix/syssolver_mdagm_clover_qphix_iter_refine_w.h: In instantiation of 'Chroma::MdagMSysSolverQPhiXCloverIterRefine<T, U>::MdagMSysSolverQPhiXCloverIterRefine(Chroma::Handle<Chroma::LinearOperator<T> >, Chroma::Handle<Chroma::FermState<T, QDP::multi1d<U>, QDP::multi1d<U> > >, const Chroma::SysSolverQPhiXCloverParams&) [with T = QDP::OLattice<QDP::PSpinVector<QDP::PColorVector<QDP::RComplex<double>, 3>, 4> >; U = QDP::OLattice<QDP::PScalar<QDP::PColorMatrix<QDP::RComplex<double>, 3> > >]':
/home/mu/Projekte/chroma/lib/actions/ferm/invert/qphix/syssolver_mdagm_clover_qphix_iter_refine_w.cc:39:139:   required from here
/home/mu/Projekte/chroma/lib/actions/ferm/invert/qphix/syssolver_mdagm_clover_qphix_iter_refine_w.h:294:13: error: no matching function for call to 'QPhiX::InvCG<double, 4, 4, true, QPhiX::EvenOddLinearOperator<double, 4, 4, true> >::InvCG(QPhiX::EvenOddCloverOperator<float, 8, 4, true>&, const int&)'
             new QPhiX::InvCG<REALT,
             ^~~~~~~~~~~~~~~~~~~~~~~
                              VecTraits<REALT>::Vec,
                              ~~~~~~~~~~~~~~~~~~~~~~
                              VecTraits<REALT>::Soa,
                              ~~~~~~~~~~~~~~~~~~~~~~
                              VecTraits<REALT>::compress12>(*M_inner, invParam.MaxIter);
                              ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from /home/mu/Projekte/chroma/lib/actions/ferm/invert/qphix/syssolver_mdagm_clover_qphix_iter_refine_w.h:34:0,
                 from /home/mu/Projekte/chroma/lib/actions/ferm/invert/qphix/syssolver_mdagm_clover_qphix_iter_refine_w.cc:8:

So this needs a bit more work to be useful; I am working on it.

martin-ueding commented 7 years ago

Seemed to be an easy fix. Now it works with double-single on my laptop, though the speed-up is not significant. I'll set it up on Hazel Hen after getting ice cream.
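A minimal sketch of what such a fix looks like, based on the error above: the inner CG has to be instantiated on the inner precision so that its template parameters match the single-precision operator *M_inner. The names InnerReal and cg_solver_inner are placeholders, not the actual identifiers from the file:

// Sketch: build the inner CG on the inner precision instead of REALT,
// so that it accepts the single-precision operator *M_inner.
cg_solver_inner =
    new QPhiX::InvCG<InnerReal,
                     VecTraits<InnerReal>::Vec,
                     VecTraits<InnerReal>::Soa,
                     VecTraits<InnerReal>::compress12>(*M_inner, invParam.MaxIter);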

kostrzewa commented 7 years ago

@bjoo you mentioned that you had observed some bicgstab convergence issues. If this was "physics", these changes may help. Using Richardson-CG for the most chiral determinant ratio monomial may solve your stability problems.

kostrzewa commented 7 years ago

@martin-ueding How come there is a conflict? Does that just mean that a merge must take place?

martin-ueding commented 7 years ago

I have no idea, but it seemed strange to me as well. I probably did not do a git pull on devel before I started working on this feature. I'll look into it, right after I compile Chroma again on Hazel Hen, this time with the right branches. Their no-outgoing-connections policy led me to compile it with Chroma devel and QPhiX devel yesterday, so I'll have to recompile it again. What a :anger: waste of time.

martin-ueding commented 7 years ago

The conflict is gone now; it was a simple issue: Bálint had added a std:: in a line which I touched as well.

I just noticed that I had added a using namespace std; because I did not want to deal with all those issues. This has to be removed before the merge.

martin-ueding commented 7 years ago

Now it is working on Hazel Hen; this is the performance on the thermalized L = 24 lattice. One can see that Richardson-CG has an advantage over regular CG, though BiCGStab is still faster. But in our situation, where BiCGStab is not stable, this is already a nice win:

(time-to-solution plot)

The details follow; they are for the first solve that is done. The other monomials probably have a different speed-up, but I did not look into it further because it seems to work and it seems faster.

Regular CG (double):

QPHIX_CLOVER_CG_SOLVER: 436 iters,  rsd_sq_final=9.97351928069349e-10
QPHIX_CLOVER_CG_SOLVER: || r || / || b || = 9.97351928069349e-10
QPHIX_CLOVER_CG_SOLVER: Solver Time=3.25231695175171 (sec)  Performance=1616.54930373261 GFLOPS
QPHIX_MDAGM_SOLVER: total time: 3.306395 (sec)

Richardson-BiCGStab (double-float):

Starting Y solve
RICHARDSON: Initial || r ||^2 =   3.12362151e+08   Target || r ||^2 =   3.12362151e-10
RICHARDSON: Solver converged at iter 3 || r || =   7.65889119e-10
RICHARDSON: Inner MV Apps=247 Outer MV Apps=5 Inner Site Flops=117336 Outer Site Flops=477
QPHIX_CLOVER_ITER_REFINE_BICGSTAB_SOLVER: 3 iters,  rsd_sq_final=5.86586142643085e-19 ||r||/||b|| (acc) = 7.65889119026432e-10
QPHIX_CLOVER_ITER_REFINE_BICGSTAB_SOLVER: Solver Time=0.526260137557983 (sec)  Performance=3135.819446211 GFLOPS
Starting X solve
RICHARDSON: Initial || r ||^2 =   1.83896867e+07   Target || r ||^2 =   1.83896867e-11
RICHARDSON: Solver converged at iter 3 || r || =   9.78313482e-10
RICHARDSON: Inner MV Apps=317 Outer MV Apps=5 Inner Site Flops=150936 Outer Site Flops=477
QPHIX_CLOVER_ITER_REFINE_BICGSTAB_SOLVER: 3 iters,  rsd_sq_final=9.57097268641972e-19 ||r||/||b|| (acc) = 9.78313481784838e-10
QPHIX_CLOVER_ITER_REFINE_BICGSTAB_SOLVER: Solver Time=0.657563924789429 (sec)  Performance=3208.86882332496 GFLOPS
QPHIX_CLOVER_ITER_REFINE_BICGSTAB_SOLVER: total_iters=6 || r || / || b || = 8.20486128341607e-10
QPHIX_MDAGM_SOLVER: total time: 1.241386 (sec)

Richardson-CG (double-float)

RICHARDSON: Solver converged at iter 3 || r || =   9.53805296e-10
RICHARDSON: Inner MV Apps=882 Outer MV Apps=10 Inner Site Flops=105336 Outer Site Flops=477
QPHIX_CLOVER_CG_SOLVER: 3 iters,  rsd_sq_final=9.53805295227943e-10
QPHIX_CLOVER_CG_SOLVER: || r || / || b || = 9.53805295227943e-10
QPHIX_CLOVER_CG_SOLVER: Solver Time=1.95444297790527 (sec)  Performance=2738.32385277162 GFLOPS
QPHIX_MDAGM_ITER_REFINE_SOLVER: total time: 2.017046 (sec)

kostrzewa commented 7 years ago

Excellent, looks good. We'll try Richardson-BiCGStab for all monomials to begin with, and if this proves unstable for the lightest determinant ratio, we'll go with Richardson-CG there. The more iterations, the closer the advantage of mixed precision gets to the theoretical factor of 2.
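(Presumably because single precision halves both the memory traffic and the arithmetic cost, so once almost all iterations run in the single-precision inner solver, the best possible gain over a pure double solve approaches a factor of 2.)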

martin-ueding commented 7 years ago

I have now removed the using namespace std; again, so this can be merged, I'd think.

kostrzewa commented 7 years ago

Could you adjust the output to read:

QPHIX_CLOVER_ITER_REFINE_CG_SOLVER

for completeness?

martin-ueding commented 7 years ago

Yes, that can be done. While we are at it, the XML name should also be changed in order to remove the BICGSTAB. However, that will break existing restart files. People running them will get an error message with a list of available XML entities, so they could in principle update it.

An alternative would be to register the solver in the factory twice, with both the new and the old name. Existing configuration files would only get the correct behavior if there is no SolverType field in the XML file. Speaking of which, I am not sure whether the selected Richardson solver (CG or BiCGStab) will be properly written out into the restart file, so this has to be looked into.
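For illustration, double registration might look roughly like this; the factory, the creator function, and the new XML name are placeholders following the usual Chroma registration pattern, not the actual identifiers:

// Sketch: register the same creator under the old name and a possible new one.
bool registerAll()
{
    bool success = true;
    success &= Chroma::TheMdagMFermSystemSolverFactory::Instance().registerObject(
        std::string("QPHIX_CLOVER_ITER_REFINE_BICGSTAB_INVERTER"), createFerm);
    success &= Chroma::TheMdagMFermSystemSolverFactory::Instance().registerObject(
        std::string("QPHIX_CLOVER_ITER_REFINE_INVERTER"), createFerm);
    return success;
}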

Regarding breaking backwards compatibility versus having two factory entries: what would be the preferred option? Or should we just leave the name as it is and hope that users find out from the C++ code that CG is supported now?

martin-ueding commented 6 years ago

I have changed the name in standard output from QPHIX_CLOVER_CG_SOLVER to QPHIX_CLOVER_ITER_REFINE_CG_SOLVER. That was what you wanted, right?

kostrzewa commented 6 years ago

Yes, that looks good. Currently testing this (without the new name) on Hazel Hen. CG double-single iterative refinement in all light determinant and determinant ratio terms gives a speed-up of roughly 1.8 over CG in double (speed-up measured on total trajectory time). Not bad, I'd say!

I will prepare a pull-request for my QPhiX branch (need to check some things for completeness before proceeding) and then this can be tested by someone else.

A point which might be interesting for backwards-compatibility reasons is to use BICGSTAB by default, so that input files which don't specify the solver type will continue to function unmodified.
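A minimal sketch of what this could look like in the parameter constructor, assuming the usual XMLReader::count pattern and the field names from the read() call quoted earlier:

// Sketch: fall back to BiCGStab when the tag is absent, so that existing
// input files keep their old behaviour.
if (paramtop.count("SolverType") > 0) {
    read(paramtop, "SolverType", SolverType);
} else {
    SolverType = BICGSTAB;
}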

martin-ueding commented 6 years ago

I just had a quick look at the code and I don't see where the SolverType XML tag would be optional, so it seems one has to supply a solver type for all QPhiX solvers. The option was just ignored previously, so people specifying CG got BiCGStab.

This is probably not hard to fix; I just did not see it right away. I'll be out of the loop for some weeks now, so please just go ahead and change it.

kostrzewa commented 6 years ago

Okay, thanks. I'll take over from here then.