Closed kostrzewa closed 2 years ago
It should be possible to integrate this into the already established mechanisms in quda_types.h by storing at what gauge_id the setup was generated initially and making sure that the current gauge_id and the point of generation don't differ by more than some configurable threshold.
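A minimal sketch of such a check, assuming gauge_id is a double that advances with MD time. The struct and function names (mg_setup_tracker_t, mg_setup_needs_reset) are hypothetical and do not reflect the actual layout of quda_types.h:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical bookkeeping, not the actual contents of quda_types.h. */
typedef struct {
  double setup_gauge_id;  /* gauge_id at which the MG setup was generated */
  double reset_threshold; /* maximum tolerated distance in gauge_id */
} mg_setup_tracker_t;

/* Returns 1 if the setup is too far from the current gauge field, 0 otherwise. */
static int mg_setup_needs_reset(const mg_setup_tracker_t *t, double current_gauge_id) {
  assert(t != NULL);
  double d = current_gauge_id - t->setup_gauge_id;
  if (d < 0.0) d = -d; /* absolute distance, works for reversed trajectories too */
  return d > t->reset_threshold;
}
```

After a reset, the caller would store the current gauge_id in setup_gauge_id so that the distance is measured from the new setup.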
Alternatively (and likely better, but much harder to implement), one could monitor the number of solver iterations and reset adaptively.
Easy "solution": https://github.com/Marcogarofalo/tmLQCD/pull/13
A better solution would be to do something like this: https://github.com/JeffersonLab/chroma/blob/8af1b2d48836724ca2601dbb0c8428c1dc190737/lib/actions/ferm/invert/quda_solvers/syssolver_mdagm_clover_quda_multigrid_w.h#L681
That way, one would not have to guess a good value for MGResetSetupThreshold.
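An iteration-based trigger could be sketched as follows. This is not a transcription of the Chroma code linked above, only an illustration of the idea: remember the iteration count of the first solve after the setup was (re)generated and request a refresh once a later solve exceeds it by a tolerated factor. All names here (mg_iter_monitor_t, mg_iter_monitor_update, max_growth) are hypothetical:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical monitor, not part of tmLQCD, QUDA or Chroma. */
typedef struct {
  int baseline_iters; /* iterations of the first solve after (re)generation, 0 = not yet recorded */
  double max_growth;  /* tolerated growth factor relative to the baseline, e.g. 1.5 */
} mg_iter_monitor_t;

/* Feed the iteration count of the latest solve.
 * Returns 1 if the setup should be refreshed, 0 otherwise. */
static int mg_iter_monitor_update(mg_iter_monitor_t *m, int iters) {
  assert(m != NULL && iters > 0);
  if (m->baseline_iters == 0) {
    m->baseline_iters = iters; /* first solve with a fresh setup defines the baseline */
    return 0;
  }
  if ((double)iters > m->max_growth * (double)m->baseline_iters) {
    m->baseline_iters = 0; /* the next solve will define a new baseline */
    return 1;
  }
  return 0;
}
```

The growth factor still needs tuning, but it is expressed relative to the observed solver performance rather than in MD time.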
Actually, I think a solid solution will be the following:
from quda.h:
typedef struct QudaMultigridParam_s {
  [...]
  int setup_maxiter_refresh[QUDA_MAX_MG_LEVEL];
  [...]
} QudaMultigridParam;
One would need to add one multi-level parameter to tmLQCD's QUDA input parser, for example something like:
MGSetupRefreshMaxSolverIterations = 250, 350
(with iteration limits likely requiring some tuning to get the balance right) and another threshold
MGSetupRefreshThreshold = 0.126
as a companion to the MGResetSetupThreshold.
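To illustrate the intended semantics of a multi-level parameter such as "250, 350", here is a standalone sketch that splits a comma-separated list into one integer per MG level. It is not how tmLQCD's input parser is actually implemented; parse_int_list and SKETCH_MAX_MG_LEVEL are hypothetical names, the latter standing in for QUDA_MAX_MG_LEVEL:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define SKETCH_MAX_MG_LEVEL 5 /* stand-in for QUDA_MAX_MG_LEVEL, value assumed */

/* Parse a comma-separated list of integers into out[].
 * Returns the number of values read, or -1 on malformed input
 * or if more than max_n values are given. */
static int parse_int_list(const char *s, int *out, int max_n) {
  assert(s != NULL && out != NULL);
  int n = 0;
  while (*s != '\0') {
    char *end;
    long v = strtol(s, &end, 10); /* skips leading whitespace itself */
    if (end == s) return -1;      /* no digits found */
    if (n == max_n) return -1;    /* more values than levels */
    out[n++] = (int)v;
    s = end;
    while (*s == ' ' || *s == '\t') s++;
    if (*s == ',') s++;                 /* separator, continue with next value */
    else if (*s != '\0') return -1;     /* anything else is an error */
  }
  return n;
}
```

The parsed values would then be copied level by level into the input structure from which setup_maxiter_refresh is filled.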
Then one would modify the logic currently used for resetting the setup to instead refresh it under the right conditions. The recently added 'init_gauge_id' logic can now probably be removed.
In the process, I would propose that another QUDA_MG_SETUP state (quda_types.h) should be added: TM_QUDA_MG_SETUP_REFRESH.
Then, when the solver is set up and we check if we can reuse it or if we need to update or reset it in _setOneFlavourSolverParam, we add another option to refresh:
else if( check_quda_mg_setup_state(...) == TM_QUDA_MG_SETUP_REFRESH ){
  // set refresh iterations
  for(int level = 0; level < (mg_param->n_level-1); level++){
    quda_mg_param.setup_maxiter_refresh[level] =
      quda_input.setup_maxiter_refresh[level];
  }
  // update the parameters AND refresh the setup
  updateMultigridQuda(quda_mg_preconditioner, &quda_mg_param);
  // reset refresh iterations to zero such that the next call
  // to updateMultigridQuda only updates parameters and coarse
  // operator(s)
  for(int level = 0; level < (mg_param->n_level-1); level++){
    quda_mg_param.setup_maxiter_refresh[level] = 0;
  }
}
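How the state check might distinguish the four cases can be sketched as below. Only the REFRESH state is proposed in this thread; the other state names, the struct and the function are hypothetical stand-ins and do not reproduce check_quda_mg_setup_state. The sketch assumes that the reset threshold is larger than the refresh threshold and that a refresh does not move the point of generation:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical state names for illustration only. */
typedef enum {
  SKETCH_MG_SETUP_REUSE,
  SKETCH_MG_SETUP_UPDATE,
  SKETCH_MG_SETUP_REFRESH,
  SKETCH_MG_SETUP_RESET
} sketch_mg_setup_state_t;

typedef struct {
  double generated_at;      /* gauge_id of the last full setup generation */
  double refreshed_at;      /* gauge_id of the last refresh (or generation) */
  double updated_at;        /* gauge_id seen by the last update */
  double reset_threshold;   /* companion of MGResetSetupThreshold */
  double refresh_threshold; /* companion of MGSetupRefreshThreshold */
} sketch_mg_setup_t;

static double absdiff(double a, double b) { return a > b ? a - b : b - a; }

/* Most drastic action first: reset, then refresh, then plain update. */
static sketch_mg_setup_state_t sketch_check_mg_setup_state(const sketch_mg_setup_t *s,
                                                           double gauge_id) {
  assert(s != NULL);
  if (absdiff(gauge_id, s->generated_at) > s->reset_threshold)
    return SKETCH_MG_SETUP_RESET;
  if (absdiff(gauge_id, s->refreshed_at) > s->refresh_threshold)
    return SKETCH_MG_SETUP_REFRESH;
  if (absdiff(gauge_id, s->updated_at) > 1e-12) /* gauge field has changed at all */
    return SKETCH_MG_SETUP_UPDATE;
  return SKETCH_MG_SETUP_REUSE;
}
```

The caller would update generated_at, refreshed_at or updated_at after acting on the returned state.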
In the process one might want to also rename the thresholds to make them refer to MD units, although it's not critically important.
working on this
Unfortunately this is trickier than I had assumed as there is some QUDA-internal logic in the way. While the refresh works, it somehow causes the solver to enter an invalid state:
# TM_QUDA: Refreshing MG Preconditioner Setup for gauge 0.031250
# TM_QUDA: MG Preconditioner Setup Refresh took 1.398 seconds
# TM_QUDA: time spent in reorder_spinor_eo_toQuda: 0.002559 secs
ERROR: Unsupported preconditioner 14
(rank 1, host cassiopeia, inv_gcr_quda.cpp:192 in GCR())
last kernel called was (name=N4quda4blas5Norm2IddEE,volume=8x16x16x16,aux=GPU-offline,vol=32768,stride=32768,precision=8,order=2,Ns=4,Nc=3,TwistFlavour=1,nParity=1)
ERROR: Unsupported preconditioner 14
(rank 0, host cassiopeia, inv_gcr_quda.cpp:192 in GCR())
last kernel called was (name=N4quda4blas5Norm2IddEE,volume=8x16x16x16,aux=GPU-offline,vol=32768,stride=32768,precision=8,order=2,Ns=4,Nc=3,TwistFlavour=1,nParity=1)
Saving 948 sets of cached parameters to /home/bartek/build/tmLQCD.quda_work_hmc.quda-ndeg_twisted_clover/temp/nf211_tmclover/tunecache_error.tsv
--------------------------------------------------------------------------
PR in #500 with the problem above
QUDA issue asking for help with this problem https://github.com/lattice/quda/issues/1170
When evolving the MG setup using updateMultigridQuda in an HMC run, the setup begins to break down after about 0.2 MD units. This happens regardless of whether the setup is used for a single monomial or for multiple monomials.
It does not seem that this can be cured by using a smaller integrator step size (which would also be rather expensive), so perhaps the best solution is to regenerate the setup regularly (5-6 times per MD unit, say).
Probably this should be another configurable parameter.