Closed kostrzewa closed 1 year ago
regarding the performance:
mg heatbath
With the cg in the cloverdetratio2light heatbath, the time to set up the mg moves to cloverdetratio3light
With the cg in cloverdetratio2light and cloverdetratio3light heatbath
There is a slight improvement overall, but mainly we are moving the time for MG_Preconditioner_Setup between monomials.
Can you provide some more details on where these numbers were obtained and what is the ensemble in question?
The only problematic case is the heatbath of the cloverdetratio2light monomial (at least with our current choice of mass preconditioning). You should be looking at the solve time rather than setup+solve as the setup has to be done anyway (as you say).
The problem is most acute on M100, where the heatbath solve in the cloverdetratio2light takes several hundred seconds (much less on other machines). Also don't forget that the setup is only run once per job (when acceptance is sufficiently high).
See the cA211.08.64 run:
19:51 bkostrze@login02 /m100_work/INF22_lqcd123_0/romiti/runs/cA211.08.64_therm/jobscript/logs
$ grep "Time for cloverdetratio_heatbath" log_cA211.08.64_therm_7288769_4294967294.out | grep ratio2
# : Time for cloverdetratio_heatbath 1.062495e+03 s level: 1 proc_id: 0 /HMC/cloverdetratio2light:cloverdetratio_heatbath
# : Time for cloverdetratio_heatbath 6.671943e+02 s level: 1 proc_id: 0 /HMC/cloverdetratio2light:cloverdetratio_heatbath
The first call is with the setup, the second call without. On a 32c64 test lattice (or another test lattice), the problem will not be this massive, of course. Similarly, on Juwels Booster the MG performs much better, so even if it does 300-400 iterations, the impact on runtime is not too bad (using CG would still be faster though).
Sorry, I meant that the test was on the same ensemble as in the previous comment:
/qbigwork/garofalo/test_tmLQCD_QUDA/cA211.08.32
There is a subdirectory add_actions which is my reference version with mg hb, please look at the last log file.
For the cg hb there is a directory HB_solver, and I was quoting the second-to-last log file (cg hb only in cloverdet2light) and the last log file (cg hb also in cloverdet3light).
This is really only for correctness testing. On a single node and with the small lattice, the effect is not so visible. You should still see a difference of about a factor of 5 or so in pure solve time for the cloverdetratio2light heatbath solve, however. Of course, here, this factor of 5 will be a tiny fraction of the total time (something like 30 seconds -> 6 seconds or so).
@simone-romiti Did you have a chance to test how this behaves on M100?
I've just brought this up to date with quda_work.
TODO
Beyond the runs on Meluxina I've verified that this works nicely also in runs on QBIG (with a tenfold improvement in solve time for the HB step of cloverdetratio2light and, of course, with a shift of the MG setup time to cloverdetratio3light). Once documented this should be merged.
@simone-romiti could you please finish the documentation for this so we can merge it?
Please don't forget to also document the other HB_ parameters beyond just HB_solver.
Thanks. Can you also check what's going on with the integration tests?
On Meluxina I did a test here: /mnt/tier1/project/p200094/romiti/tmLQCD_runs/iwa_1.745-csw1.7112-L32/test_HB_solver.
I get the following by grepping "Time for cloverdetratio_heatbath" and piping to grep cloverdetratio2light:
yes_HB: using the HB solver for cloverdetratio2light:
# : Time for cloverdetratio_heatbath 1.379339e+00 s level: 1 proc_id: 0 /HMC/cloverdetratio2light:cloverdetratio_heatbath
# : Time for cloverdetratio_heatbath 1.339430e+00 s level: 1 proc_id: 0 /HMC/cloverdetratio2light:cloverdetratio_heatbath
# : Time for cloverdetratio_heatbath 1.339117e+00 s level: 1 proc_id: 0 /HMC/cloverdetratio2light:cloverdetratio_heatbath
# : Time for cloverdetratio_heatbath 1.339778e+00 s level: 1 proc_id: 0 /HMC/cloverdetratio2light:cloverdetratio_heatbath
# : Time for cloverdetratio_heatbath 1.341887e+00 s level: 1 proc_id: 0 /HMC/cloverdetratio2light:cloverdetratio_heatbath
no_HB: not using the HB solver for cloverdetratio2light:
# : Time for cloverdetratio_heatbath 4.108312e+01 s level: 1 proc_id: 0 /HMC/cloverdetratio2light:cloverdetratio_heatbath
# : Time for cloverdetratio_heatbath 1.089902e+01 s level: 1 proc_id: 0 /HMC/cloverdetratio2light:cloverdetratio_heatbath
# : Time for cloverdetratio_heatbath 9.869443e+00 s level: 1 proc_id: 0 /HMC/cloverdetratio2light:cloverdetratio_heatbath
# : Time for cloverdetratio_heatbath 1.000580e+01 s level: 1 proc_id: 0 /HMC/cloverdetratio2light:cloverdetratio_heatbath
# : Time for cloverdetratio_heatbath 1.104594e+01 s level: 1 proc_id: 0 /HMC/cloverdetratio2light:cloverdetratio_heatbath
In the no_HB case, the HB of the 1st trajectory takes ~40s due to the MG setup time. Since the setup has to be done anyway at some point, the fair comparison uses the last 4 trajectories: there, the HB_solver with CG takes ~13% of the time that the MG needs in the heatbath.
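For reference, the ~13% can be reproduced from the log lines quoted above with a short script. This is only a sketch: the regular expression and the helper names heatbath_times and mean_without_first are made up for illustration and are not part of tmLQCD.

```python
import re

# Matches timing lines such as
# "# : Time for cloverdetratio_heatbath 1.379339e+00 s level: 1 ..."
TIMING_RE = re.compile(r"Time for cloverdetratio_heatbath\s+([0-9.eE+-]+)\s+s\b")


def heatbath_times(log_text, monomial="cloverdetratio2light"):
    """Return the heatbath times (in seconds) logged for one monomial."""
    times = []
    for line in log_text.splitlines():
        if monomial not in line:
            continue
        match = TIMING_RE.search(line)
        if match:
            times.append(float(match.group(1)))
    return times


def mean_without_first(times):
    """Average over all calls except the first, which may include the MG setup."""
    rest = times[1:]
    return sum(rest) / len(rest)


SUFFIX = " s level: 1 proc_id: 0 /HMC/cloverdetratio2light:cloverdetratio_heatbath"
PREFIX = "# : Time for cloverdetratio_heatbath "

# Times copied from the yes_HB (CG in the heatbath) and no_HB (MG) logs above.
yes_hb_log = "\n".join(
    PREFIX + t + SUFFIX
    for t in ["1.379339e+00", "1.339430e+00", "1.339117e+00", "1.339778e+00", "1.341887e+00"]
)
no_hb_log = "\n".join(
    PREFIX + t + SUFFIX
    for t in ["4.108312e+01", "1.089902e+01", "9.869443e+00", "1.000580e+01", "1.104594e+01"]
)

cg_mean = mean_without_first(heatbath_times(yes_hb_log))
mg_mean = mean_without_first(heatbath_times(no_hb_log))
ratio = cg_mean / mg_mean
print(f"CG: {cg_mean:.3f} s, MG: {mg_mean:.3f} s, ratio: {ratio:.1%}")
```

Dropping the first call from both samples keeps the comparison symmetric, even though only the MG run pays a setup cost there.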
The problem with the github action was due to the github runner moving to Ubuntu 22.04 and there, the MPI libs are distributed separately from the compiler wrapper and utilities.
There's still another problem with the DDalphaAMG build because DDalphaAMG has issues with GCC 11.3 and how it organizes linking. I don't have the time to fix that problem right now.
I'll try to check the documentation tomorrow (not sure if I'll manage) and will pull this in then.
I've improved the github actions through "artifact upload" (https://docs.github.com/en/actions/using-workflows/storing-workflow-data-as-artifacts) which makes the various config.log (and output.data) accessible through the web interface post-build.
alright, finally fixed our CI ...
Damn, there's an issue with the initialisation of HB_maxiter for the "default" case (no HB solver defined).
With
BeginMonomial CLOVERDETRATIO
Timescale = 2
kappa = 0.1400086
2KappaMu = 0.000215613244
rho = 0.0
rho2 = 0.0015
CSW = 1.7112
AcceptancePrecision = 1.e-20
ForcePrecision = 1.e-18
Name = cloverdetratio3light
solver = cg
MaxSolverIterations = 100000
UseExternalInverter = quda
UseSloppyPrecision = single
EndMonomial
which should thus allow for 100k iterations in the HB, acceptance and derivative steps, I get:
# TM_QUDA: Using single prec. as sloppy!
# TM_QUDA: Called _loadGaugeQuda for gauge_id: 0.000000
# TM_QUDA: Theta boundary conditions will be applied to gauge field
# TM_QUDA: Time for reorder_gauge_toQuda 1.359407e-02 s level: 4 proc_id: 0 /HMC/cloverdetratio3light:cloverdetratio_heatbath/solve_degenerate/invert_eo_degenerate_quda/reorder_gauge_toQuda
# TM_QUDA: Time for loadGaugeQuda 2.442663e-01 s level: 4 proc_id: 0 /HMC/cloverdetratio3light:cloverdetratio_heatbath/solve_degenerate/invert_eo_degenerate_quda/loadGaugeQuda
# TM_QUDA: Using mixed precision CG!
# TM_QUDA: Using EO preconditioning!
# TM_QUDA: Time for loadCloverQuda 1.201410e-01 s level: 4 proc_id: 0 /HMC/cloverdetratio3light:cloverdetratio_heatbath/solve_degenerate/invert_eo_degenerate_quda/loadCloverQuda
# TM_QUDA: mu = 0.000770000000, kappa = 0.140008600000, csw = 1.711200000000
# TM_QUDA: Time for reorder_spinor_eo_toQuda 4.586891e-03 s level: 4 proc_id: 0 /HMC/cloverdetratio3light:cloverdetratio_heatbath/solve_degenerate/invert_eo_degenerate_quda/reorder_spinor_eo_toQuda
# QUDA: WARNING: Exceeded maximum iterations 5000
# QUDA: CG: Convergence at 5000 iterations, L2 relative residual: iterated = 2.350305e-09, true = 2.342661e-09 (requested = 1.000000e-10)
@simone-romiti can you find the logic bug? It seems to me that it should work correctly:
- in read_input, add_monomial is called and sets the default params
- in read_input, for this monomial, MaxSolverIterations (in other words mnl->maxiter) is set
- in init_monomials, for DETRATIO and CLOVERDETRATIO, the parameters from the "default" solver are taken over for the HB_solver. In particular, this should include mnl->HB_maxiter = mnl->maxiter
The flow I see is the following, let me know if you agree on that:
- At line 2147 of read_input.l, add_monomial is called and sets the default params (definition at line 59 of monomial.c).
- At lines 2174/2175 of read_input.l I see that CLOVERDETRATIO and DETRATIO correspond respectively to CLDETRATMONOMIAL and DETRATMONOMIAL.
- MaxSolverIterations is initialized for CLDETRATMONOMIAL but not if we have DETRATMONOMIAL. So I think we should add the latter to line 2403 of read_input.l.
- We have CLOVERDETRATIO here. I think the problem may be at lines 257 and 292 of monomial.c, where it should be monomial_list[i].HB_maxiter = monomial_list[i].maxiter;
Thanks for checking, I agree with your conclusions. When I reviewed monomial.c, I missed that no_monomials was used instead of i for setting HB_maxiter.
I also agree that DETRATMONOMIAL should be added to line 2403 in read_input.l.
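To make the indexing problem concrete, here is a toy model of it in Python rather than the actual C of monomial.c. Everything except the names monomial_list, no_monomials, maxiter and HB_maxiter is hypothetical: the dict-based slots, the list capacity, the function names, and the assumption that the default iteration limit is the 5000 seen in the QUDA warning above. How exactly the original lines used no_monomials is also an assumption.

```python
MAX_NO_MONOMIALS = 4     # hypothetical capacity of the statically sized list
DEFAULT_MAXITER = 5000   # assumed default, chosen to match the 5000-iteration cap in the log


def new_slot():
    """A monomial slot holding only the two fields relevant here."""
    return {"maxiter": DEFAULT_MAXITER, "HB_maxiter": DEFAULT_MAXITER}


def init_monomials(monomial_list, no_monomials, fixed):
    """Copy maxiter into HB_maxiter when no separate HB solver is configured."""
    for i in range(no_monomials):
        # Buggy variant: indexes with the monomial count, so every pass
        # touches the first unused slot and never monomial i itself.
        idx = i if fixed else no_monomials
        monomial_list[idx]["HB_maxiter"] = monomial_list[idx]["maxiter"]


def run(fixed):
    monomial_list = [new_slot() for _ in range(MAX_NO_MONOMIALS)]
    no_monomials = 2
    # Mimics MaxSolverIterations = 100000 for the second monomial.
    monomial_list[1]["maxiter"] = 100000
    init_monomials(monomial_list, no_monomials, fixed)
    return [m["HB_maxiter"] for m in monomial_list[:no_monomials]]


print("buggy:", run(fixed=False))
print("fixed:", run(fixed=True))
```

In the buggy variant the second monomial keeps the default HB_maxiter even though its maxiter was raised, which is the behaviour the heatbath log shows; indexing with i propagates the 100000.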
@simone-romiti opened a pull request to review this, please see the inline comments