Closed: aniketsen closed this 9 months ago
I tested this on a 32c64 lattice and it seems to work correctly as far as I can tell. However, I think the time to solution for the lowest eigenvalue can be reduced significantly. On the machine in question (a single quad-A100 node), the solver takes around 200 seconds for the lowest eigenvalue and around 10 seconds for the largest one (these numbers were too imprecise; in total it's actually over 450 seconds). The CPU solver, on just 2x16 AMD EPYC2 cores, takes "only" 1000 seconds by comparison. I thus suspect that the calculation of the lowest eigenvalue can be sped up by at least another factor of 10. If you have the time and the inclination, @aniketsen, it would be great if you could explore polynomial acceleration a little. The documentation of the eigensolver on the QUDA wiki is not bad as far as I can tell.
My guess is that appropriate settings for `a_min` and `a_max` (depending on whether one is looking for the smallest or the largest eigenvalue) will speed this up massively. I'm not sure about the order of the polynomial that one should use (`poly_deg`).
# TM_QUDA: mu = 0.138390000000, epsilon = 0.146560000000 kappa = 0.140005000000, csw = 1.711200000000
# QUDA: Orthonormalising initial guess
# QUDA: ********************************
# QUDA: **** START QUDA EIGENSOLVER ****
# QUDA: ********************************
# QUDA: TRLM computed the requested 1 vectors in 412 restart steps and 19824 OP*x operations.
# QUDA: Eval[0000] = (+1.1364231284241534e-04,-5.9654390341346649e-22) ||+1.1364231284241534e-04|| Residual = +1.3907589993294887e-14
# QUDA: ********************************
# QUDA: ***** END QUDA EIGENSOLVER *****
# QUDA: ********************************
# TM_QUDA: Time for eigsolveQuda 4.568580e+02 s level: 2 proc_id: 0 /HMC/ndcloverrat1:ndrat_heatbath/eigsolveQuda
# TM_QUDA: Using single prec. as sloppy!
# TM_QUDA: Using double-half refinement in mshift-solver!
# TM_QUDA: Called _loadGaugeQuda for gauge_id: 0.000000
# TM_QUDA: Theta boundary conditions will be applied to gauge field
# TM_QUDA: Using mixed precision CG!
# TM_QUDA: Using EO preconditioning!
# TM_QUDA: Time for loadCloverQuda 1.272554e-02 s level: 3 proc_id: 0 /HMC/ndcloverrat1:ndrat_heatbath/eigsolveQuda/loadCloverQuda
# TM_QUDA: mu = 0.138390000000, epsilon = 0.146560000000 kappa = 0.140005000000, csw = 1.711200000000
# QUDA: Orthonormalising initial guess
# QUDA: ********************************
# QUDA: **** START QUDA EIGENSOLVER ****
# QUDA: ********************************
# QUDA: TRLM computed the requested 1 vectors in 4 restart steps and 240 OP*x operations.
# QUDA: Eval[0000] = (+3.0261608378473412e+00,-6.2242957455787945e-17) ||+3.0261608378473412e+00|| Residual = +1.0169756438652394e-14
# QUDA: ********************************
# QUDA: ***** END QUDA EIGENSOLVER *****
# QUDA: ********************************
# TM_QUDA: Time for eigsolveQuda 5.180968e+00 s level: 2 proc_id: 0 /HMC/ndcloverrat1:ndrat_heatbath/eigsolveQuda
smallest: 457 seconds, largest: 5 seconds
TOTAL: 462 seconds (compared to over 1000 seconds on the CPU for this particular ensemble and machine)
(note that poly acceleration exchanges SR <-> LR)
# QUDA: ********************************
# QUDA: **** START QUDA EIGENSOLVER ****
# QUDA: ********************************
# QUDA: spectrum LR
# QUDA: tol 1.0000e-14
# QUDA: n_conv 1
# QUDA: n_ev 1
# QUDA: n_kr 96
# QUDA: polyDeg 128
# QUDA: a-min 0.001000
# QUDA: a-max 4.000000
# QUDA: Resizing kSpace to 144 vectors
# QUDA: 0000 converged eigenvalues at restart iter 0001
# QUDA: 0000 converged eigenvalues at restart iter 0002
# QUDA: Resizing kSpace to 145 vectors
# QUDA: 0001 converged eigenvalues at restart iter 0003
# QUDA: TRLM computed the requested 1 vectors in 3 restart steps and 192 OP*x operations.
# QUDA: RitzValue[0000]: (+7.9066095102425749e-01, +0.0000000000000000e+00) residual 4.0309354968575832e-17
# QUDA: Eval[0000] = (+1.1364231284240960e-04,+1.4837671236894835e-21) ||+1.1364231284240960e-04|| Residual = +1.7886132427934613e-15
# QUDA: ********************************
# QUDA: ***** END QUDA EIGENSOLVER *****
# QUDA: ********************************
# TM_QUDA: Time for eigsolveQuda 7.718927e+01 s level: 2 proc_id: 0 /HMC/ndcloverrat1:ndrat_heatbath/eigsolveQuda
[...]
# QUDA: ********************************
# QUDA: **** START QUDA EIGENSOLVER ****
# QUDA: ********************************
# QUDA: spectrum SR
# QUDA: tol 1.0000e-14
# QUDA: n_conv 1
# QUDA: n_ev 1
# QUDA: n_kr 96
# QUDA: polyDeg 128
# QUDA: a-min 0.001000
# QUDA: a-max 4.000000
# QUDA: Resizing kSpace to 144 vectors
# QUDA: 0000 converged eigenvalues at restart iter 0001
# QUDA: 0000 converged eigenvalues at restart iter 0002
# QUDA: Resizing kSpace to 145 vectors
# QUDA: 0001 converged eigenvalues at restart iter 0003
# QUDA: TRLM computed the requested 1 vectors in 3 restart steps and 192 OP*x operations.
# QUDA: RitzValue[0000]: (+7.9066095102425749e-01, +0.0000000000000000e+00) residual 4.0309354968575832e-17
# QUDA: Eval[0000] = (+1.1364231284240960e-04,+1.4837671236894835e-21) ||+1.1364231284240960e-04|| Residual = +1.7886132427934613e-15
# QUDA: ********************************
# QUDA: ***** END QUDA EIGENSOLVER *****
# QUDA: ********************************
# TM_QUDA: Time for eigsolveQuda 7.709653e+01 s level: 2 proc_id: 0 /HMC/ndcloverrat1:ndrat_heatbath/eigsolveQuda
It seems that with the chosen settings
eig_param.use_poly_acc = QUDA_BOOLEAN_TRUE;
eig_param.poly_deg = 128;
eig_param.a_min = 1e-3;
eig_param.a_max = 4;
the solve for the largest eigenvalue does not work; both runs return the smallest one. Likely one needs to disable polynomial acceleration for the largest eigenvalue. (Since that solve is super short anyway, this is perfectly fine.)
The timing reflects this fact:
smallest: 77 seconds, "largest": 77 seconds
TOTAL: 154 seconds
double eigsolveQuda(..){
  [...]
  eig_param.use_poly_acc = maxmin == 1 ? QUDA_BOOLEAN_FALSE : QUDA_BOOLEAN_TRUE;
  eig_param.poly_deg = 128;
  eig_param.a_min = 1e-3;
  eig_param.a_max = 4;
  [...]
and we might need some new parameters to make the degree, `a_min`, and `a_max` user-configurable.
# QUDA: Orthonormalising initial guess
# QUDA: ********************************
# QUDA: **** START QUDA EIGENSOLVER ****
# QUDA: ********************************
# QUDA: TRLM computed the requested 1 vectors in 3 restart steps and 192 OP*x operations.
# QUDA: RitzValue[0000]: (+7.9066095102425749e-01, +0.0000000000000000e+00) residual 4.0309354968575832e-17
# QUDA: Eval[0000] = (+1.1364231284240960e-04,+1.4837671236894835e-21) ||+1.1364231284240960e-04|| Residual = +1.7886132427934613e-15
# QUDA: ********************************
# QUDA: ***** END QUDA EIGENSOLVER *****
# QUDA: ********************************
# TM_QUDA: Time for eigsolveQuda 7.718751e+01 s level: 2 proc_id: 0 /HMC/ndcloverrat1:ndrat_heatbath/eigsolveQuda
[...]
# QUDA: Orthonormalising initial guess
# QUDA: ********************************
# QUDA: **** START QUDA EIGENSOLVER ****
# QUDA: ********************************
# QUDA: TRLM computed the requested 1 vectors in 4 restart steps and 240 OP*x operations.
# QUDA: Eval[0000] = (+3.0261608378473412e+00,-6.2242957455787945e-17) ||+3.0261608378473412e+00|| Residual = +1.0169756438652394e-14
# QUDA: ********************************
# QUDA: ***** END QUDA EIGENSOLVER *****
# QUDA: ********************************
# TM_QUDA: Time for eigsolveQuda 5.185351e+00 s level: 2 proc_id: 0 /HMC/ndcloverrat1:ndrat_heatbath/eigsolveQuda
smallest: 77 seconds, largest: 5 seconds
TOTAL: 82 seconds
I'm pretty sure that this can be improved further by lowering the polynomial degree and optimising `a_min`, as described here: https://github.com/lattice/quda/wiki/QUDA%27s-eigensolvers#using-chebyshev-in-your-computations
If you have some time, maybe you can take a look at / merge #564 and try to make the parameters user-configurable, and/or test on some production ensembles at various lattice spacings whether the hard-coded defaults are okay for now. We can certainly live with 80-ish seconds for the eigensolver in the HMC.
@aniketsen slowly getting back into finalizing this and other things
As far as I can tell there is currently no way to actually build `test_eigsolveQuda` (I don't see any changes in the build system). In `test_eigensolveQuda.c` I'm also a bit confused by the set of monomials listed, as `ComputeEVFreq` can only be set for a subset of these (from `read_input.l`):
2582 <NDPOLYMONOMIAL,CLPOLYMONOMIAL,NDRATMONOMIAL,NDRATCORMONOMIAL,NDCLRATMONOMIAL,NDCLRATCORMONOMIAL,RATMONOMIAL,RATCORMONOMIAL,CLRATMONOMIAL,CLRATCORMONOMIAL>
[...]
2597 {SPC}*ComputeEVFreq{EQL}{DIGIT}+ {
2598 sscanf(yytext, " %[a-zA-Z] = %d", name, &a);
2599 mnl->rec_ev = a;
2600 if(myverbose!=0) printf(" Frequency for computing EV's set to %d in line %d monomial %d\n", mnl->rec_ev, line_of_file, current_monomial);
2601 }
If you want we can just leave `test_eigsolveQuda` out of this PR, because it doesn't really offer the kind of stand-alone eigensolver interface that I was hoping for; rather, it focuses just on the min/max eigenvalues, and I'm not sure whether it actually works correctly. When you cycle through the monomials:
361 for(j = 0; j < no_monomials; j++) {
362 if( (monomial_list[j].type == NDPOLY) || (monomial_list[j].type == NDDETRATIO)
363 || (monomial_list[j].type == NDCLOVER) || (monomial_list[j].type == NDRAT)
364 || (monomial_list[j].type == NDCLOVERRAT) || (monomial_list[j].type == NDRATCOR)
365 || (monomial_list[j].type == NDCLOVERRATCOR) || (monomial_list[j].type == NDCLOVERDETRATIO) ) {
366 if( (monomial_list[j].rec_ev != 0) ) {
367 monomial * mnl = &monomial_list[j];
you don't actually set any of the global parameters based on the monomial and instead immediately call `eigsolveQuda`. This means that things like `g_kappa` and `g_mu` will simply be unset, right?
If you look at something like `ndrat_monomial.c`, this is what happens before the eigensolver driver is called:
205 void ndrat_heatbath(const int id, hamiltonian_field_t * const hf) {
206 monomial * mnl = &monomial_list[id];
207 tm_stopwatch_push(&g_timers, __func__, mnl->name);
208 nd_set_global_parameter(mnl);
209 mnl->iter1 = 0;
210 if(mnl->type == NDCLOVERRAT) {
211 init_sw_fields();
212 sw_term((const su3**)hf->gaugefield, mnl->kappa, mnl->c_sw);
213 sw_invert_nd(mnl->mubar*mnl->mubar - mnl->epsbar*mnl->epsbar);
214 copy_32_sw_fields();
215 }
216 // we measure before the trajectory!
217 if((mnl->rec_ev != 0) && (hf->traj_counter%mnl->rec_ev == 0)) {
218 if(mnl->type != NDCLOVERRAT) phmc_compute_ev(hf->traj_counter-1, id, &Qtm_pm_ndbipsi);
219 else phmc_compute_ev(hf->traj_counter-1, id, &Qsw_pm_ndbipsi);
220 }
where `nd_set_global_parameter` is:
52 void nd_set_global_parameter(monomial * const mnl) {
53
54 g_mubar = mnl->mubar;
55 g_epsbar = mnl->epsbar;
56 g_kappa = mnl->kappa;
57 g_c_sw = mnl->c_sw;
58 boundary(g_kappa);
59 phmc_cheb_evmin = mnl->EVMin;
60 phmc_invmaxev = mnl->EVMaxInv;
61 phmc_cheb_evmax = mnl->EVMax;
62 phmc_Cpol = 1.;
63 // used for preconditioning in cloverdetrat
64 g_mu3 = 0.;
65
66 return;
67 }
> As far as I can tell there is currently no way to actually build `test_eigsolveQuda` (I don't see any changes in the build system). In `test_eigensolveQuda.c` I'm also a bit confused by the set of monomials listed, as `ComputeEVFreq` can only be set for a subset of these (from `read_input.l`):
Yes, I did not push the changes to the build system. I had added all the monomials for a more general approach, but you're right that `ComputeEVFreq` is not defined for all of them. In those cases, however, the variable `rec_ev` is set to zero by default, and the eigensolver is simply not called.
> If you want we can just leave `test_eigsolveQuda` out of this PR because it doesn't really offer the kind of stand-alone eigensolver interface that I was hoping for; rather, it focuses just on the min/max evals and I'm not sure whether it actually works correctly. When you cycle through the monomials:
Yes, this is a mistake. For some reason I thought `init_monomials` also initialized the global parameters. But of course one needs to call `nd_set_global_parameter` if one wants to cycle through all the different monomials.
You can just leave this script out. I agree that it does not perform a very useful function at the moment, other than just calling the eigensolver for the different monomials. But you can also easily get the same job done by setting `ComputeEVFreq` to 1 in the standard `hmc.c`.
`eigsolveQuda` acts as the interface for calling the eigensolver in QUDA. In `phmc.c`, this interface is invoked based on the input parameter `UseExternalEigSolver`, which is stored in the `monomial` parameter `external_eigsolver`. The interface also initializes QUDA and the two-flavour solver.