etmc / tmLQCD

tmLQCD is a freely available software suite providing a set of tools to be used in lattice QCD simulations. This is mainly a HMC implementation (including PHMC and RHMC) for Wilson, Wilson Clover and Wilson twisted mass fermions and inverter for different versions of the Dirac operator. The code is fully parallelised and ships with optimisations for various modern architectures, such as commodity PC clusters and the Blue Gene family.
http://www.itkp.uni-bonn.de/~urbach/software.html
GNU General Public License v3.0

Hb solver #546

Closed kostrzewa closed 1 year ago

kostrzewa commented 1 year ago

@simone-romiti opened a pull request to review this, please see the inline comments

Marcogarofalo commented 1 year ago

regarding the performance:

There is a slight improvement overall, but mainly we are moving the time for MG_Preconditioner_Setup between monomials.

kostrzewa commented 1 year ago

Can you provide some more details on where these numbers were obtained and which ensemble was used?

kostrzewa commented 1 year ago

The only problematic case is the heatbath of the cloverdetratio2light monomial (at least with our current choice of mass preconditioning). You should be looking at the solve time rather than setup+solve as the setup has to be done anyway (as you say).

The problem is most acute on M100, where the heatbath solve in the cloverdetratio2light takes several hundred seconds (much less on other machines). Also don't forget that the setup is only run once per job (when acceptance is sufficiently high).

See the cA211.08.64 run:

19:51 bkostrze@login02 /m100_work/INF22_lqcd123_0/romiti/runs/cA211.08.64_therm/jobscript/logs 
 $ grep "Time for cloverdetratio_heatbath" log_cA211.08.64_therm_7288769_4294967294.out | grep ratio2
# : Time for cloverdetratio_heatbath 1.062495e+03 s level: 1 proc_id: 0 /HMC/cloverdetratio2light:cloverdetratio_heatbath
# : Time for cloverdetratio_heatbath 6.671943e+02 s level: 1 proc_id: 0 /HMC/cloverdetratio2light:cloverdetratio_heatbath

The first call is with the setup, the second call without. On a 32c64 test lattice (or another test lattice), the problem will not be this massive, of course. Similarly, on Juwels Booster the MG performs much better, so even if it does 300-400 iterations, the impact on runtime is not too bad (using CG would still be faster though).
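A minimal C sketch of how the two timings above could be extracted and compared. The helper names `parse_heatbath_time` and `estimate_setup_time` are hypothetical (they are not part of tmLQCD), and reading the difference between the two calls as the setup cost assumes that the solve itself takes a similar time in both calls.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not part of tmLQCD): extract the time in seconds from
 * a line of the form
 *   "# : Time for cloverdetratio_heatbath <t> s level: ..."
 * Returns 1 on success, 0 if the line does not match. */
int parse_heatbath_time(const char *line, double *seconds)
{
  static const char key[] = "Time for cloverdetratio_heatbath";
  assert(line != NULL && seconds != NULL);
  const char *p = strstr(line, key);
  if (p == NULL) {
    return 0;
  }
  /* skip the key itself; sscanf skips the whitespace before the number */
  return sscanf(p + sizeof(key) - 1, "%lf", seconds) == 1;
}

/* Assumption: (first call, with MG setup) - (second call, no setup)
 * approximates the setup cost. Returns 1 on success, 0 on parse failure. */
int estimate_setup_time(const char *first_call, const char *second_call,
                        double *setup_seconds)
{
  double t_first = 0.0;
  double t_second = 0.0;
  assert(setup_seconds != NULL);
  if (!parse_heatbath_time(first_call, &t_first) ||
      !parse_heatbath_time(second_call, &t_second)) {
    return 0;
  }
  *setup_seconds = t_first - t_second;
  return 1;
}
```

Fed the two log lines from the cA211.08.64 run, this gives roughly 395 s for the difference, with about 667 s left for the heatbath solve alone.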

Marcogarofalo commented 1 year ago

Sorry, I meant that the test was on the same ensemble as in the previous comment:

/qbigwork/garofalo/test_tmLQCD_QUDA/cA211.08.32

There is a subdirectory add_actions, which is my reference version with MG HB; please look at the last log file.

For the CG HB there is a directory HB_solver. I was quoting the second-to-last log file (CG HB only in cloverdet2light) and the last log file, which has CG HB also in cloverdet3light.

kostrzewa commented 1 year ago

Sorry, I meant that the test was on the same ensemble as in the previous comment

This is really only for correctness testing. On a single node and with the small lattice, the effect is not so visible. You should still see a difference of about a factor of 5 or so in pure solve time for the cloverdetratio2light heatbath solve, however. Of course, here, this factor of 5 will be a tiny fraction of the total time (something like 30 seconds -> 6 seconds or so).

kostrzewa commented 1 year ago

@simone-romiti Did you have a chance to test how this behaves on M100?

kostrzewa commented 1 year ago

I've just brought this up to date with quda_work.

kostrzewa commented 1 year ago

TODO

kostrzewa commented 1 year ago

Beyond the runs on Meluxina I've verified that this works nicely also in runs on QBIG (with a tenfold improvement in solve time for the HB step of cloverdetratio2light and, of course, with a shift of the MG setup time to cloverdetratio3light). Once documented this should be merged.

kostrzewa commented 1 year ago

@simone-romiti could you please finish the documentation for this so we can merge it?

kostrzewa commented 1 year ago

Please don't forget to also document the other HB_ parameters beyond just HB_solver.

kostrzewa commented 1 year ago

Thanks. Can you also check what's going on with the integration tests?

simone-romiti commented 1 year ago

Thanks. Can you also check what's going on with the integration tests?

On Meluxina I did a test here: /mnt/tier1/project/p200094/romiti/tmLQCD_runs/iwa_1.745-csw1.7112-L32/test_HB_solver. I get the following by grepping "Time for cloverdetratio_heatbath" and piping to grep cloverdetratio2light:

yes_HB: using the HB solver for cloverdetratio2light:

# : Time for cloverdetratio_heatbath 1.379339e+00 s level: 1 proc_id: 0 /HMC/cloverdetratio2light:cloverdetratio_heatbath
# : Time for cloverdetratio_heatbath 1.339430e+00 s level: 1 proc_id: 0 /HMC/cloverdetratio2light:cloverdetratio_heatbath
# : Time for cloverdetratio_heatbath 1.339117e+00 s level: 1 proc_id: 0 /HMC/cloverdetratio2light:cloverdetratio_heatbath
# : Time for cloverdetratio_heatbath 1.339778e+00 s level: 1 proc_id: 0 /HMC/cloverdetratio2light:cloverdetratio_heatbath
# : Time for cloverdetratio_heatbath 1.341887e+00 s level: 1 proc_id: 0 /HMC/cloverdetratio2light:cloverdetratio_heatbath

no_HB: not using the HB solver for cloverdetratio2light:

# : Time for cloverdetratio_heatbath 4.108312e+01 s level: 1 proc_id: 0 /HMC/cloverdetratio2light:cloverdetratio_heatbath
# : Time for cloverdetratio_heatbath 1.089902e+01 s level: 1 proc_id: 0 /HMC/cloverdetratio2light:cloverdetratio_heatbath
# : Time for cloverdetratio_heatbath 9.869443e+00 s level: 1 proc_id: 0 /HMC/cloverdetratio2light:cloverdetratio_heatbath
# : Time for cloverdetratio_heatbath 1.000580e+01 s level: 1 proc_id: 0 /HMC/cloverdetratio2light:cloverdetratio_heatbath
# : Time for cloverdetratio_heatbath 1.104594e+01 s level: 1 proc_id: 0 /HMC/cloverdetratio2light:cloverdetratio_heatbath

The HB of the 1st trajectory takes ~40 s due to the MG setup time. Since the setup has to be done anyway at some point, I compare only the last 4 trajectories: there, the HB_solver with CG needs ~13% of the time that the MG needs in the heatbath.
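The ~13% figure can be reproduced from the timings above with a few lines of C. This is only a sketch: the array and function names are made up for illustration, and the first trajectory is dropped from both runs as described.

```c
#include <assert.h>
#include <stddef.h>

/* Arithmetic mean of n timings (in seconds). */
double mean_time(const double *t, size_t n)
{
  assert(t != NULL && n > 0);
  double sum = 0.0;
  for (size_t i = 0; i < n; ++i) {
    sum += t[i];
  }
  return sum / (double)n;
}

/* cloverdetratio2light heatbath times, trajectories 2-5, copied from the
 * logs quoted above (first trajectory excluded because of the MG setup). */
static const double yes_hb[4] = {1.339430e+00, 1.339117e+00,
                                 1.339778e+00, 1.341887e+00};
static const double no_hb[4] = {1.089902e+01, 9.869443e+00,
                                1.000580e+01, 1.104594e+01};

/* Ratio of mean heatbath time with the CG HB_solver to the mean without it. */
double hb_time_ratio(void)
{
  return mean_time(yes_hb, 4) / mean_time(no_hb, 4);
}
```

The means are about 1.34 s and 10.46 s, so `hb_time_ratio()` comes out at roughly 0.13.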

kostrzewa commented 1 year ago

The problem with the github action was due to the github runner moving to Ubuntu 22.04 and there, the MPI libs are distributed separately from the compiler wrapper and utilities.

There's still another problem with the DDalphaAMG build because DDalphaAMG has issues with GCC 11.3 and how it organizes linking. I don't have the time to fix that problem right now.

I'll try to check the documentation tomorrow (not sure if I'll manage) and will pull this in then.

kostrzewa commented 1 year ago

I've improved the github actions through "artifact upload" (https://docs.github.com/en/actions/using-workflows/storing-workflow-data-as-artifacts) which makes the various config.log (and output.data) accessible through the web interface post-build.

kostrzewa commented 1 year ago

alright, finally fixed our CI ...

kostrzewa commented 1 year ago

Damn, there's an issue with the initialisation of HB_maxiter for the "default" case (no HB solver defined).

With

BeginMonomial CLOVERDETRATIO
  Timescale = 2
  kappa =    0.1400086
  2KappaMu = 0.000215613244
  rho = 0.0
  rho2 = 0.0015
  CSW = 1.7112
  AcceptancePrecision =  1.e-20
  ForcePrecision = 1.e-18
  Name = cloverdetratio3light
  solver = cg
  MaxSolverIterations = 100000
  UseExternalInverter = quda
  UseSloppyPrecision = single
EndMonomial

which should thus allow for 100k iterations in the HB, acceptance and derivative steps, I get:

# TM_QUDA: Using single prec. as sloppy!
# TM_QUDA: Called _loadGaugeQuda for gauge_id: 0.000000
# TM_QUDA: Theta boundary conditions will be applied to gauge field
# TM_QUDA: Time for reorder_gauge_toQuda 1.359407e-02 s level: 4 proc_id: 0 /HMC/cloverdetratio3light:cloverdetratio_heatbath/solve_degenerate/invert_eo_degenerate_quda/reorder_gauge_toQuda
# TM_QUDA: Time for loadGaugeQuda 2.442663e-01 s level: 4 proc_id: 0 /HMC/cloverdetratio3light:cloverdetratio_heatbath/solve_degenerate/invert_eo_degenerate_quda/loadGaugeQuda
# TM_QUDA: Using mixed precision CG!
# TM_QUDA: Using EO preconditioning!
# TM_QUDA: Time for loadCloverQuda 1.201410e-01 s level: 4 proc_id: 0 /HMC/cloverdetratio3light:cloverdetratio_heatbath/solve_degenerate/invert_eo_degenerate_quda/loadCloverQuda
# TM_QUDA: mu = 0.000770000000, kappa = 0.140008600000, csw = 1.711200000000
# TM_QUDA: Time for reorder_spinor_eo_toQuda 4.586891e-03 s level: 4 proc_id: 0 /HMC/cloverdetratio3light:cloverdetratio_heatbath/solve_degenerate/invert_eo_degenerate_quda/reorder_spinor_eo_toQuda
# QUDA: WARNING: Exceeded maximum iterations 5000
# QUDA: CG: Convergence at 5000 iterations, L2 relative residual: iterated = 2.350305e-09, true = 2.342661e-09 (requested = 1.000000e-10)

kostrzewa commented 1 year ago

@simone-romiti can you find the logic bug? It seems to me that it should work correctly:

  1. during read_input, add_monomial is called and sets the default params
  2. in the remaining read_input for this monomial, MaxSolverIterations (in other words mnl->maxiter) is set
  3. during init_monomials, for DETRATIO and CLOVERDETRATIO, the parameters from the "default" solver are taken over for the HB_solver. In particular, this should include mnl->HB_maxiter = mnl->maxiter

simone-romiti commented 1 year ago

The flow I see is the following; let me know if you agree:

  1. As you said, for each monomial in the input file, at line 2147 of read_input.l, add_monomial is called and sets the default params (definition at line 59 of monomial.c).
  2. After that, the "flex blocks" with the name aliases of the monomial start. At lines 2174/2175 of read_input.l I see that CLOVERDETRATIO and DETRATIO correspond respectively to CLDETRATMONOMIAL and DETRATMONOMIAL.
  3. To me it seems MaxSolverIterations is initialized for CLDETRATMONOMIAL but not for DETRATMONOMIAL. So I think we should add the latter to line 2403 of read_input.l.
  4. Still, we are using CLOVERDETRATIO here. I think the problem may be at lines 257 and 292 of monomial.c, where it should be monomial_list[i].HB_maxiter = monomial_list[i].maxiter;

kostrzewa commented 1 year ago

Thanks for checking, I agree with your conclusions. When I reviewed monomial.c, I missed that no_monomials was used instead of i for setting HB_maxiter.

I also agree that DETRATMONOMIAL should be added to line 2403 in read_input.l.
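For reference, the indexing bug can be reproduced in a stripped-down C sketch. Everything here is illustrative: `monomial_sketch` keeps only the two fields discussed, the function names are invented, the exact form of the faulty statement in monomial.c is assumed, and the default of 5000 is inferred from the QUDA warning in the log rather than taken from the code.

```c
#include <assert.h>

/* Illustrative constants: the array size is arbitrary, 5000 is assumed to be
 * the default iteration limit (inferred from the QUDA warning above). */
enum { SKETCH_MAX_MONOMIALS = 4, SKETCH_DEFAULT_MAXITER = 5000 };

/* Simplified stand-in for the tmLQCD monomial struct. */
typedef struct {
  int maxiter;    /* set from MaxSolverIterations in the input file */
  int HB_maxiter; /* iteration limit used by the heatbath solver */
} monomial_sketch;

/* Mimics add_monomial setting default parameters for every slot. */
void set_defaults(monomial_sketch *list)
{
  for (int i = 0; i < SKETCH_MAX_MONOMIALS; ++i) {
    list[i].maxiter = SKETCH_DEFAULT_MAXITER;
    list[i].HB_maxiter = SKETCH_DEFAULT_MAXITER;
  }
}

/* Assumed form of the bug: the monomial count is used as the index, so the
 * slot one past the last defined monomial is updated on every iteration and
 * the real monomials keep their default HB_maxiter. */
void init_hb_buggy(monomial_sketch *list, int no_monomials)
{
  assert(no_monomials >= 0 && no_monomials < SKETCH_MAX_MONOMIALS);
  for (int i = 0; i < no_monomials; ++i) {
    list[no_monomials].HB_maxiter = list[no_monomials].maxiter;
  }
}

/* Fix as proposed above: index with the loop variable. */
void init_hb_fixed(monomial_sketch *list, int no_monomials)
{
  assert(no_monomials >= 0 && no_monomials <= SKETCH_MAX_MONOMIALS);
  for (int i = 0; i < no_monomials; ++i) {
    list[i].HB_maxiter = list[i].maxiter;
  }
}
```

With one monomial whose `maxiter` was set to 100000, the buggy variant leaves `HB_maxiter` at the default, which matches the heatbath solve stopping at 5000 iterations; the fixed variant copies 100000 over.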