etmc / tmLQCD

tmLQCD is a freely available software suite providing a set of tools to be used in lattice QCD simulations. It consists mainly of an HMC implementation (including PHMC and RHMC) for Wilson, Wilson Clover and Wilson twisted mass fermions, together with inverters for different versions of the Dirac operator. The code is fully parallelised and ships with optimisations for various modern architectures, such as commodity PC clusters and the Blue Gene family.
http://www.itkp.uni-bonn.de/~urbach/software.html
GNU General Public License v3.0

Interface for eigensolver on QUDA #561

Closed: aniketsen closed this 9 months ago

aniketsen commented 1 year ago

eigsolveQuda acts as an interface for calling the eigensolver in QUDA. In phmc.c, this interface is called based on the input parameter UseExternalEigSolver, which is stored in the monomial parameter external_eigsolver. The interface also initializes QUDA and the two-flavour solver.
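
For illustration, a simplified sketch of the dispatch at the call site (the EXTERNAL_EIGSOLVER_QUDA constant and the eigsolveQuda argument list are placeholders here, not the actual names in the PR):

  /* simplified sketch of the call site, names and argument list are placeholders */
  if( mnl->external_eigsolver == EXTERNAL_EIGSOLVER_QUDA ) {
    /* eigsolveQuda initialises QUDA and the two-flavour solver internally;
     * maxmin selects whether the largest (1) or the smallest (0) eigenvalue
     * of the operator is computed */
    const double max_ev = eigsolveQuda(mnl, /* maxmin */ 1);
    const double min_ev = eigsolveQuda(mnl, /* maxmin */ 0);
    /* min_ev and max_ev are then recorded / used as before */
  } else {
    /* fall back to the existing CPU eigensolver path */
  }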

kostrzewa commented 1 year ago

I tested this on a 32c64 lattice and it seems to work correctly as far as I can tell. However, I think the time to solution for the lowest eigenvalue can be reduced significantly. On the machine in question (a single quad-A100 node), the solver takes around 200 seconds for the lowest eigenvalue and around 10 seconds for the largest one (the solver precision in that run was too low; at the proper tolerance it's actually over 450 seconds in total). The CPU solver, on just 2x16 AMD EPYC2 cores, takes "only" 1000 seconds by comparison. I thus suspect that one can speed up the calculation of the lowest eigenvalue by at least another factor of 10 or so. If you have the time and the inclination, @aniketsen, it would be great if you could explore the polynomial acceleration a little. The documentation of the eigensolver on the QUDA Wiki is not bad as far as I can tell.

My guess is that appropriate settings for a_min and a_max (depending on whether one is looking for the smallest or the largest eigenvalue) will speed this up massively. I'm not sure about the order of the polynomial that one should use (poly_deg).
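
For what it's worth, my rough understanding of the Chebyshev acceleration (following the QUDA Wiki; the exact normalisation QUDA uses may differ): the Lanczos process is run on a Chebyshev polynomial of the operator M rather than on M itself,

  p_n(M) = T_n\!\left(\frac{2M - (a_{\max} + a_{\min})}{a_{\max} - a_{\min}}\right)

Eigenvalues of M inside [a_min, a_max] are mapped into [-1, 1], where |T_n| is bounded by 1, while eigenvalues below a_min land outside that interval, where T_n grows very rapidly with the degree. The smallest eigenvalues of M thus become the dominant eigenvalues of p_n(M), which is why the acceleration exchanges SR and LR, and why a_min should sit just above the eigenvalue(s) one is after while a_max should bound the spectrum from above.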

kostrzewa commented 1 year ago

No poly accel:

# TM_QUDA: mu = 0.138390000000, epsilon = 0.146560000000 kappa = 0.140005000000, csw = 1.711200000000
# QUDA: Orthonormalising initial guess
# QUDA: ********************************
# QUDA: **** START QUDA EIGENSOLVER ****
# QUDA: ********************************
# QUDA: TRLM computed the requested 1 vectors in 412 restart steps and 19824 OP*x operations.
# QUDA: Eval[0000] = (+1.1364231284241534e-04,-5.9654390341346649e-22) ||+1.1364231284241534e-04|| Residual = +1.3907589993294887e-14
# QUDA: ********************************
# QUDA: ***** END QUDA EIGENSOLVER *****
# QUDA: ********************************
# TM_QUDA: Time for eigsolveQuda 4.568580e+02 s level: 2 proc_id: 0 /HMC/ndcloverrat1:ndrat_heatbath/eigsolveQuda
# TM_QUDA: Using single prec. as sloppy!
# TM_QUDA: Using double-half refinement in mshift-solver!
# TM_QUDA: Called _loadGaugeQuda for gauge_id: 0.000000
# TM_QUDA: Theta boundary conditions will be applied to gauge field
# TM_QUDA: Using mixed precision CG!
# TM_QUDA: Using EO preconditioning!
# TM_QUDA: Time for loadCloverQuda 1.272554e-02 s level: 3 proc_id: 0 /HMC/ndcloverrat1:ndrat_heatbath/eigsolveQuda/loadCloverQuda
# TM_QUDA: mu = 0.138390000000, epsilon = 0.146560000000 kappa = 0.140005000000, csw = 1.711200000000
# QUDA: Orthonormalising initial guess
# QUDA: ********************************
# QUDA: **** START QUDA EIGENSOLVER ****
# QUDA: ********************************
# QUDA: TRLM computed the requested 1 vectors in 4 restart steps and 240 OP*x operations.
# QUDA: Eval[0000] = (+3.0261608378473412e+00,-6.2242957455787945e-17) ||+3.0261608378473412e+00|| Residual = +1.0169756438652394e-14
# QUDA: ********************************
# QUDA: ***** END QUDA EIGENSOLVER *****
# QUDA: ********************************
# TM_QUDA: Time for eigsolveQuda 5.180968e+00 s level: 2 proc_id: 0 /HMC/ndcloverrat1:ndrat_heatbath/eigsolveQuda

smallest: 457 seconds, largest: 5 seconds

TOTAL: 462 seconds (compared to over 1000 seconds on the CPU for this particular ensemble and machine)

untuned poly accel:

(note that poly acceleration exchanges SR <-> LR)

# QUDA: ********************************
# QUDA: **** START QUDA EIGENSOLVER ****
# QUDA: ********************************
# QUDA: spectrum LR
# QUDA: tol 1.0000e-14
# QUDA: n_conv 1
# QUDA: n_ev 1
# QUDA: n_kr 96
# QUDA: polyDeg 128
# QUDA: a-min 0.001000
# QUDA: a-max 4.000000
# QUDA: Resizing kSpace to 144 vectors
# QUDA: 0000 converged eigenvalues at restart iter 0001
# QUDA: 0000 converged eigenvalues at restart iter 0002
# QUDA: Resizing kSpace to 145 vectors
# QUDA: 0001 converged eigenvalues at restart iter 0003
# QUDA: TRLM computed the requested 1 vectors in 3 restart steps and 192 OP*x operations.
# QUDA: RitzValue[0000]: (+7.9066095102425749e-01, +0.0000000000000000e+00) residual 4.0309354968575832e-17
# QUDA: Eval[0000] = (+1.1364231284240960e-04,+1.4837671236894835e-21) ||+1.1364231284240960e-04|| Residual = +1.7886132427934613e-15
# QUDA: ********************************
# QUDA: ***** END QUDA EIGENSOLVER *****
# QUDA: ********************************
# TM_QUDA: Time for eigsolveQuda 7.718927e+01 s level: 2 proc_id: 0 /HMC/ndcloverrat1:ndrat_heatbath/eigsolveQuda
[...]
# QUDA: ********************************
# QUDA: **** START QUDA EIGENSOLVER ****
# QUDA: ********************************
# QUDA: spectrum SR
# QUDA: tol 1.0000e-14
# QUDA: n_conv 1
# QUDA: n_ev 1
# QUDA: n_kr 96
# QUDA: polyDeg 128
# QUDA: a-min 0.001000
# QUDA: a-max 4.000000
# QUDA: Resizing kSpace to 144 vectors
# QUDA: 0000 converged eigenvalues at restart iter 0001
# QUDA: 0000 converged eigenvalues at restart iter 0002
# QUDA: Resizing kSpace to 145 vectors
# QUDA: 0001 converged eigenvalues at restart iter 0003
# QUDA: TRLM computed the requested 1 vectors in 3 restart steps and 192 OP*x operations.
# QUDA: RitzValue[0000]: (+7.9066095102425749e-01, +0.0000000000000000e+00) residual 4.0309354968575832e-17
# QUDA: Eval[0000] = (+1.1364231284240960e-04,+1.4837671236894835e-21) ||+1.1364231284240960e-04|| Residual = +1.7886132427934613e-15
# QUDA: ********************************
# QUDA: ***** END QUDA EIGENSOLVER *****
# QUDA: ********************************
# TM_QUDA: Time for eigsolveQuda 7.709653e+01 s level: 2 proc_id: 0 /HMC/ndcloverrat1:ndrat_heatbath/eigsolveQuda

It seems that with the chosen settings

  eig_param.use_poly_acc = QUDA_BOOLEAN_TRUE;
  eig_param.poly_deg = 128;
  eig_param.a_min = 1e-3;
  eig_param.a_max = 4;

the solve for the largest eigenvalue does not work. Likely one needs to disable poly accel for that (since the solve time is super short, that's perfectly fine).

00000000 2.41792e-05 2.41792e-05 8.00000e-06 1.00000e+00

The timing reflects this fact:

smallest: 77 seconds, "largest": 77 seconds

TOTAL: 154 seconds

Proposal for default

double eigsolveQuda(...) {
[...]
  eig_param.use_poly_acc = maxmin == 1 ? QUDA_BOOLEAN_FALSE : QUDA_BOOLEAN_TRUE;
  eig_param.poly_deg = 128;
  eig_param.a_min = 1e-3;
  eig_param.a_max = 4;
[...]
}

and we might need some new parameters to make the degree, a_min and a_max user-configurable.
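
As a rough sketch of how the fall-back could look inside eigsolveQuda, assuming hypothetical new monomial members eig_polydeg, eig_amin and eig_amax get passed down (these names do not exist in tmLQCD yet and are purely illustrative):

  /* sketch only: use user-supplied values where set, otherwise the
   * hard-coded defaults proposed above (member names hypothetical) */
  eig_param.poly_deg = (mnl->eig_polydeg > 0)   ? mnl->eig_polydeg : 128;
  eig_param.a_min    = (mnl->eig_amin    > 0.0) ? mnl->eig_amin    : 1e-3;
  eig_param.a_max    = (mnl->eig_amax    > 0.0) ? mnl->eig_amax    : 4.0;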

# QUDA: Orthonormalising initial guess
# QUDA: ********************************
# QUDA: **** START QUDA EIGENSOLVER ****
# QUDA: ********************************
# QUDA: TRLM computed the requested 1 vectors in 3 restart steps and 192 OP*x operations.
# QUDA: RitzValue[0000]: (+7.9066095102425749e-01, +0.0000000000000000e+00) residual 4.0309354968575832e-17
# QUDA: Eval[0000] = (+1.1364231284240960e-04,+1.4837671236894835e-21) ||+1.1364231284240960e-04|| Residual = +1.7886132427934613e-15
# QUDA: ********************************
# QUDA: ***** END QUDA EIGENSOLVER *****
# QUDA: ********************************
# TM_QUDA: Time for eigsolveQuda 7.718751e+01 s level: 2 proc_id: 0 /HMC/ndcloverrat1:ndrat_heatbath/eigsolveQuda
[...]
# QUDA: Orthonormalising initial guess
# QUDA: ********************************
# QUDA: **** START QUDA EIGENSOLVER ****
# QUDA: ********************************
# QUDA: TRLM computed the requested 1 vectors in 4 restart steps and 240 OP*x operations.
# QUDA: Eval[0000] = (+3.0261608378473412e+00,-6.2242957455787945e-17) ||+3.0261608378473412e+00|| Residual = +1.0169756438652394e-14
# QUDA: ********************************
# QUDA: ***** END QUDA EIGENSOLVER *****
# QUDA: ********************************
# TM_QUDA: Time for eigsolveQuda 5.185351e+00 s level: 2 proc_id: 0 /HMC/ndcloverrat1:ndrat_heatbath/eigsolveQuda

smallest: 77 seconds, largest: 5 seconds

TOTAL: 82 seconds

I'm pretty sure that this can be improved further by lowering the polynomial degree and optimising a_min, as described here: https://github.com/lattice/quda/wiki/QUDA%27s-eigensolvers#using-chebyshev-in-your-computations

kostrzewa commented 1 year ago

If you have some time, maybe you can take a look at / merge #564 and try to make the parameters user-configurable, and/or test with some production ensembles at various lattice spacings whether the hard-coded defaults are okay for now. We can certainly live with 80ish seconds for the eigensolver in the HMC.

kostrzewa commented 9 months ago

@aniketsen slowly getting back into finalizing this and other things

As far as I can tell there is currently no way to actually build test_eigsolveQuda (I don't see any changes in the build system). In test_eigensolveQuda.c I'm also a bit confused by the set of monomials listed, as ComputeEVFreq can only be set for a subset of these (from read_input.l):

2582 <NDPOLYMONOMIAL,CLPOLYMONOMIAL,NDRATMONOMIAL,NDRATCORMONOMIAL,NDCLRATMONOMIAL,NDCLRATCORMONOMIAL,RATMONOMIAL,RATCORMONOMIAL,CLRATMONOMIAL,CLRATCORMONOMIAL>
[...]
2597   {SPC}*ComputeEVFreq{EQL}{DIGIT}+ {                                                                                                  
2598     sscanf(yytext, " %[a-zA-Z] = %d", name, &a); 
2599     mnl->rec_ev = a;
2600     if(myverbose!=0) printf("  Frequency for computing EV's set to %d in line %d monomial %d\n", mnl->rec_ev, line_of_file, current_monomial);
2601   }

If you want, we can just leave test_eigsolveQuda out of this PR: it doesn't really offer the kind of stand-alone eigensolver interface that I was hoping for; rather, it focuses just on the min/max evals, and I'm not sure whether it actually works correctly. When you cycle through the monomials:

361   for(j = 0; j < no_monomials; j++) {
362     if( (monomial_list[j].type == NDPOLY) || (monomial_list[j].type == NDDETRATIO)
363     || (monomial_list[j].type == NDCLOVER) || (monomial_list[j].type == NDRAT)
364     || (monomial_list[j].type == NDCLOVERRAT) || (monomial_list[j].type == NDRATCOR)
365     || (monomial_list[j].type == NDCLOVERRATCOR) || (monomial_list[j].type == NDCLOVERDETRATIO) ) {
366       if( (monomial_list[j].rec_ev != 0) ) {
367         monomial * mnl = &monomial_list[j];

you don't actually set any of the global parameters based on the monomial and instead immediately call eigsolveQuda. This means that things like g_kappa and g_mu will simply be unset, right?

If you look at something like ndrat_monomial.c, this is what happens before the eigensolver driver is called:

205 void ndrat_heatbath(const int id, hamiltonian_field_t * const hf) {
206   monomial * mnl = &monomial_list[id];
207   tm_stopwatch_push(&g_timers, __func__, mnl->name);
208   nd_set_global_parameter(mnl);
209   mnl->iter1 = 0;
210   if(mnl->type == NDCLOVERRAT) {
211     init_sw_fields();
212     sw_term((const su3**)hf->gaugefield, mnl->kappa, mnl->c_sw);
213     sw_invert_nd(mnl->mubar*mnl->mubar - mnl->epsbar*mnl->epsbar);
214     copy_32_sw_fields();
215   }
216   // we measure before the trajectory! 
217   if((mnl->rec_ev != 0) && (hf->traj_counter%mnl->rec_ev == 0)) {                                                                      
218     if(mnl->type != NDCLOVERRAT) phmc_compute_ev(hf->traj_counter-1, id, &Qtm_pm_ndbipsi);
219     else phmc_compute_ev(hf->traj_counter-1, id, &Qsw_pm_ndbipsi);
220   }

where nd_set_global_parameter is:

 52 void nd_set_global_parameter(monomial * const mnl) {
 53 
 54   g_mubar = mnl->mubar;
 55   g_epsbar = mnl->epsbar;
 56   g_kappa = mnl->kappa;
 57   g_c_sw = mnl->c_sw;
 58   boundary(g_kappa);
 59   phmc_cheb_evmin = mnl->EVMin;
 60   phmc_invmaxev = mnl->EVMaxInv;
 61   phmc_cheb_evmax = mnl->EVMax;
 62   phmc_Cpol = 1.;
 63   // used for preconditioning in cloverdetrat
 64   g_mu3 = 0.;
 65 
 66   return;
 67 }

aniketsen commented 9 months ago

As far as I can tell there is currently no way to actually build test_eigsolveQuda (I don't see any changes in the build system). In test_eigensolveQuda.c I'm also a bit confused by the set of monomials listed as ComputeEVFreq can only be set for a subset of these (from read_input.l):

Yes, I did not push the changes to the build system. I had added all the monomials for a more general approach, but yes, ComputeEVFreq is not defined for all of them. In those cases, however, the variable rec_ev is set to zero by default and the eigensolver is simply not called.

If you want we can just leave test_eigsolveQuda out of this PR because it doesn't really offer the kind of stand-alone eigensolver interface that I was hoping for, rather, it focuses just on the min/max evals and I'm not sure whether it actually works correctly. When you cycle through the monomials:

yes, this is a mistake. For some reason I thought init_monomials also initialized the global parameters. But of course one needs to call nd_set_global_parameter if one wants to cycle through all the different monomials.
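
A minimal sketch of the fix, based on the loop and nd_set_global_parameter quoted above (the eigsolveQuda call itself is elided here, since its argument list is not shown in this thread):

  for(j = 0; j < no_monomials; j++) {
    monomial * mnl = &monomial_list[j];
    if( (mnl->type == NDCLOVERRAT) || (mnl->type == NDRAT) /* plus the other ND-type monomials listed above */ ) {
      if( mnl->rec_ev != 0 ) {
        /* set g_kappa, g_mubar, g_epsbar, g_c_sw and the boundary phases
         * from this monomial before calling the eigensolver */
        nd_set_global_parameter(mnl);
        /* for clover-type monomials the clover term would also have to be
         * (re)built here, as in ndrat_heatbath above */
        /* ... then call eigsolveQuda for this monomial as before ... */
      }
    }
  }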

You can just leave this script out. I agree that this does not really perform a very useful function at the moment, other than just calling the eigensolver for different monomials. But you can also easily get the same job done by setting ComputeEVFreq to 1 in the standard hmc.c script.
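
For reference, a hypothetical monomial block in the input file that would trigger this (ComputeEVFreq matches the lexer rule quoted above; other monomial parameters are elided, and the exact value syntax for UseExternalEigSolver is an assumption on my part):

  BeginMonomial NDCLOVERRAT
    ComputeEVFreq = 1
    UseExternalEigSolver = quda
  EndMonomial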