Using the QUDA MG in the HMC, the solver begins to break down after about 0.2 MD units #494

kostrzewa commented 2 years ago

When evolving the MG setup using updateMultigridQuda in an HMC run, the setup begins to break down after about 0.2 MD units.

This happens regardless of whether the setup is used just for a single or multiple monomials.

It does not seem like this can be cured by using a smaller integrator step size (which would also be rather expensive), so perhaps the best solution is to regenerate the setup regularly (say, 5-6 times per MD unit).

Probably this should be another configurable parameter.

kostrzewa commented 2 years ago

It should be possible to integrate this into the already established mechanisms in quda_types.h by storing at what gauge_id the setup was generated initially and making sure that the current gauge_id and the point of generation don't differ by more than some configurable threshold.
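A minimal sketch of that check, assuming gauge_id is a floating-point value that advances with MD time. The struct, its fields and both function names are hypothetical and are not the actual definitions in quda_types.h:

```c
/* Hypothetical bookkeeping for one MG setup; illustrative only,
 * not the real tmLQCD types. */
typedef struct {
  int initialised;        /* 0 until a setup has been generated */
  double setup_gauge_id;  /* gauge_id at which the setup was generated */
} mg_setup_record_t;

/* Record that the setup has just been (re)generated at gauge_id. */
static void mg_setup_mark_generated(mg_setup_record_t *rec, double gauge_id) {
  rec->initialised = 1;
  rec->setup_gauge_id = gauge_id;
}

/* Return 1 if the setup should be regenerated: either none exists yet,
 * or the gauge field has moved further than reset_threshold (in units
 * of gauge_id) from the point where the setup was generated. */
static int mg_setup_needs_reset(const mg_setup_record_t *rec,
                                double current_gauge_id,
                                double reset_threshold) {
  double distance;
  if (!rec->initialised) {
    return 1;
  }
  distance = current_gauge_id - rec->setup_gauge_id;
  if (distance < 0.0) {
    distance = -distance;
  }
  return distance > reset_threshold;
}
```

With a threshold of about 0.2, this would regenerate the setup roughly five times per MD unit, in line with the estimate above.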

kostrzewa commented 2 years ago

Alternatively (likely better, but much harder to implement), one could monitor the number of solver iterations and reset the setup adaptively.
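One way this adaptive variant could look, as a rough sketch only: remember the iteration count of the first solve after a (re)generation and request a reset once a later solve needs more than some multiple of it. The type, the function and the growth factor are all assumptions made for illustration and exist neither in tmLQCD nor in QUDA:

```c
/* Hypothetical iteration monitor; illustrative only. */
typedef struct {
  int baseline_iters;  /* iterations of the first solve after a (re)generation; 0 = not recorded yet */
  double max_growth;   /* tolerated ratio iters / baseline_iters before a reset is requested */
} mg_iter_monitor_t;

/* Feed in the iteration count of the most recent solve.
 * Returns 1 if the setup should be reset before the next solve, else 0.
 * When a reset is requested the baseline is cleared, so the first solve
 * with the fresh setup defines the new baseline. */
static int mg_iter_monitor_update(mg_iter_monitor_t *m, int iters) {
  if (m->baseline_iters <= 0) {
    m->baseline_iters = iters;
    return 0;
  }
  if ((double)iters > m->max_growth * (double)m->baseline_iters) {
    m->baseline_iters = 0;
    return 1;
  }
  return 0;
}
```

The growth factor would still need tuning, but unlike a fixed MD-time threshold it reacts to how quickly the setup actually degrades.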

kostrzewa commented 2 years ago

Easy "solution": https://github.com/Marcogarofalo/tmLQCD/pull/13

A better solution would be to do something like this: https://github.com/JeffersonLab/chroma/blob/8af1b2d48836724ca2601dbb0c8428c1dc190737/lib/actions/ferm/invert/quda_solvers/syssolver_mdagm_clover_quda_multigrid_w.h#L681

That way, one would not have to guess a good value for MGResetSetupThreshold.

kostrzewa commented 2 years ago

Actually, I think a solid solution will be the following:

from quda.h:

typedef struct QudaMultigridParam_s {
    [...]
    int setup_maxiter_refresh[QUDA_MAX_MG_LEVEL];

    [...]
} QudaMultigridParam;

One would need to add one multi-level parameter to tmLQCD's QUDA input parser, for example something like:

MGSetupRefreshMaxSolverIterations = 250, 350

(with iteration limits likely requiring some tuning to get the balance right) and another threshold

MGSetupRefreshThreshold = 0.126

as a companion to the MGResetSetupThreshold.
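For illustration, reading a per-level value such as "250, 350" into an array could be sketched as follows. This is a standalone helper with a hypothetical name, and it is not how tmLQCD's input parser is actually implemented:

```c
#include <stdlib.h>

/* Hypothetical helper: parse a comma-separated list of integers
 * (e.g. "250, 350") into out[0..max_levels-1].
 * Returns the number of values stored. Parsing stops at the first
 * token that is not an integer or once max_levels values were read. */
static int parse_level_list(const char *s, int *out, int max_levels) {
  int n = 0;
  while (*s != '\0' && n < max_levels) {
    char *end;
    long v = strtol(s, &end, 10); /* strtol skips leading whitespace */
    if (end == s) {
      break; /* no digits found */
    }
    out[n++] = (int)v;
    s = end;
    /* skip separators between values */
    while (*s == ',' || *s == ' ' || *s == '\t') {
      s++;
    }
  }
  return n;
}
```

The returned count could then be compared against the number of MG levels minus one to catch an incomplete list early.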

Then one would modify the logic currently used for resetting the setup to instead refresh it under the right conditions. The recently added 'init_gauge_id' logic can now probably be removed.

In the process, I would propose that another QUDA_MG_SETUP state (quda_types.h) should be added: TM_QUDA_MG_SETUP_REFRESH.

Then, when the solver is set up and we check if we can reuse it or if we need to update or reset it in _setOneFlavourSolverParam, we add another option to refresh:

else if( check_quda_mg_setup_state(...) == TM_QUDA_MG_SETUP_REFRESH ){
  // set refresh iterations
  for(int level = 0; level < (mg_param->n_level-1); level++){
    quda_mg_param.setup_maxiter_refresh[level] = quda_input.setup_maxiter_refresh[level];
  }

  // update the parameters AND refresh the setup
  updateMultigridQuda(quda_mg_preconditioner, &quda_mg_param);

  // reset refresh iterations to zero such that the next call
  // to updateMultigridQuda only updates parameters and coarse
  // operator(s)
  for(int level = 0; level < (mg_param->n_level-1); level++){
    quda_mg_param.setup_maxiter_refresh[level] = 0;
  }
}
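The decision that selects this branch could be sketched as below. The enum values, struct and function are hypothetical stand-ins (prefixed with EX_/ex_ to make that clear) and do not reproduce the actual check_quda_mg_setup_state or the existing states in quda_types.h; thresholds are taken to be in the same units as gauge_id:

```c
/* Hypothetical setup states; illustrative only. */
typedef enum {
  EX_MG_SETUP_REUSE,   /* gauge field unchanged: use the setup as is */
  EX_MG_SETUP_UPDATE,  /* update parameters and coarse operator(s) only */
  EX_MG_SETUP_REFRESH, /* additionally refresh the null vectors */
  EX_MG_SETUP_RESET    /* throw the setup away and regenerate it */
} ex_mg_setup_state_t;

/* Hypothetical bookkeeping of where the setup was last touched. */
typedef struct {
  double generated_at;      /* gauge_id of the last full generation */
  double refreshed_at;      /* gauge_id of the last refresh (or generation) */
  double updated_at;        /* gauge_id of the last update of any kind */
  double reset_threshold;   /* cf. MGResetSetupThreshold */
  double refresh_threshold; /* cf. MGSetupRefreshThreshold */
} ex_mg_setup_info_t;

static double ex_abs(double x) { return x < 0.0 ? -x : x; }

/* The most drastic action is checked first, so that a setup which is
 * far from its point of generation is reset rather than refreshed. */
static ex_mg_setup_state_t ex_check_mg_setup_state(const ex_mg_setup_info_t *info,
                                                   double gauge_id) {
  if (ex_abs(gauge_id - info->generated_at) > info->reset_threshold) {
    return EX_MG_SETUP_RESET;
  }
  if (ex_abs(gauge_id - info->refreshed_at) > info->refresh_threshold) {
    return EX_MG_SETUP_REFRESH;
  }
  if (gauge_id != info->updated_at) {
    return EX_MG_SETUP_UPDATE;
  }
  return EX_MG_SETUP_REUSE;
}
```

After each action the caller would have to store the current gauge_id in the matching field(s), so that the next refresh is measured from the last refresh rather than from the original generation.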

kostrzewa commented 2 years ago

In the process one might want to also rename the thresholds to make them refer to MD units, although it's not critically important.

kostrzewa commented 2 years ago

Working on this.

kostrzewa commented 2 years ago

Unfortunately this is trickier than I had assumed as there is some QUDA-internal logic in the way. While the refresh works, it somehow causes the solver to enter an invalid state:

# TM_QUDA: Refreshing MG Preconditioner Setup for gauge 0.031250
# TM_QUDA: MG Preconditioner Setup Refresh took 1.398 seconds
# TM_QUDA: time spent in reorder_spinor_eo_toQuda: 0.002559 secs
ERROR: Unsupported preconditioner 14
 (rank 1, host cassiopeia, inv_gcr_quda.cpp:192 in GCR())
       last kernel called was (name=N4quda4blas5Norm2IddEE,volume=8x16x16x16,aux=GPU-offline,vol=32768,stride=32768,precision=8,order=2,Ns=4,Nc=3,TwistFlavour=1,nParity=1)
ERROR: Unsupported preconditioner 14
 (rank 0, host cassiopeia, inv_gcr_quda.cpp:192 in GCR())
       last kernel called was (name=N4quda4blas5Norm2IddEE,volume=8x16x16x16,aux=GPU-offline,vol=32768,stride=32768,precision=8,order=2,Ns=4,Nc=3,TwistFlavour=1,nParity=1)
Saving 948 sets of cached parameters to /home/bartek/build/tmLQCD.quda_work_hmc.quda-ndeg_twisted_clover/temp/nf211_tmclover/tunecache_error.tsv
--------------------------------------------------------------------------

kostrzewa commented 2 years ago

The PR is in #500; it still has the problem described above.

kostrzewa commented 2 years ago

QUDA issue asking for help with this problem: https://github.com/lattice/quda/issues/1170