etmc / tmLQCD

tmLQCD is a freely available software suite providing a set of tools to be used in lattice QCD simulations. It is mainly an HMC implementation (including PHMC and RHMC) for Wilson, Wilson Clover and Wilson twisted mass fermions, together with inverters for different versions of the Dirac operator. The code is fully parallelised and ships with optimisations for various modern architectures, such as commodity PC clusters and the Blue Gene family.
http://www.itkp.uni-bonn.de/~urbach/software.html
GNU General Public License v3.0

change logic for setting MG param coarse_grid_solution_type #544

Closed kostrzewa closed 2 years ago

kostrzewa commented 2 years ago

While adding the HMC additions, I accidentally messed up the logic for coarse_grid_solution_type, so that a full field was injected on all intermediate and coarse levels :facepalm:
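For readers unfamiliar with the parameter, here is a minimal sketch of the kind of rule involved, written in Python purely for illustration. It assumes the convention used in QUDA's own test setup: with an even-odd preconditioned outer solve (`QUDA_DIRECT_PC_SOLVE`), single-parity fields are injected (`QUDA_MATPC_SOLUTION`), otherwise full fields (`QUDA_MAT_SOLUTION`). The helper name and the string stand-ins for the QUDA enums are hypothetical, and the actual change in this PR may differ in detail.

```python
# Illustrative model only: strings stand in for the QUDA enum values and
# coarse_grid_solution_types() is a hypothetical helper, not tmLQCD code.
MAT_SOLUTION = "QUDA_MAT_SOLUTION"      # inject a full field
MATPC_SOLUTION = "QUDA_MATPC_SOLUTION"  # inject a single-parity field


def coarse_grid_solution_types(outer_solve_type, n_levels):
    """Return the assumed coarse_grid_solution_type for each MG level."""
    if outer_solve_type == "QUDA_DIRECT_PC_SOLVE":
        # Even-odd preconditioned outer solve: single-parity injection.
        injected = MATPC_SOLUTION
    else:
        # Unpreconditioned outer solve: full-field injection.
        injected = MAT_SOLUTION
    return [injected] * n_levels


# Assumed three-level setup with an even-odd preconditioned outer solve.
print(coarse_grid_solution_types("QUDA_DIRECT_PC_SOLVE", 3))
```

In this model, the bug described above corresponds to every level ending up with `QUDA_MAT_SOLUTION` even though the outer solve is even-odd preconditioned.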

This PR significantly improves MG performance (4 Booster nodes, 64c128 lattice, physical point) and will also restore analysis performance to previous levels:

old

GCR: Convergence at 53 iterations, L2 relative residual: iterated = 7.315033e-10, true = 7.315033e-10 (requested = 1.000000e-09)
# TM_QUDA: Time for invertQuda 1.596819e+01 s level: 3 proc_id: 0 /HMC/correlators_measurement/invert_eo_quda/invertQuda
# TM_QUDA: Done: 53 iter / 15.7261 secs = 25234.6 Gflops
# TM_QUDA: Time for reorder_spinor_fromQuda 6.752734e-02 s level: 3 proc_id: 0 /HMC/correlators_measurement/invert_eo_quda/reorder_spinor_fromQuda
# TM_QUDA: Time for invert_eo_quda 2.134693e+01 s level: 2 proc_id: 0 /HMC/correlators_measurement/invert_eo_quda
# Inversion done in 53 iterations, squared residue = 9.589332e-13!

new

# TM_QUDA: Time for invertQuda 3.727005e+00 s level: 3 proc_id: 0 /HMC/correlators_measurement/invert_eo_quda/invertQuda
# TM_QUDA: Done: 33 iter / 3.47978 secs = 24770.2 Gflops
# TM_QUDA: Time for reorder_spinor_fromQuda 7.029620e-02 s level: 3 proc_id: 0 /HMC/correlators_measurement/invert_eo_quda/reorder_spinor_fromQuda
# TM_QUDA: Time for invert_eo_quda 3.914286e+00 s level: 2 proc_id: 0 /HMC/correlators_measurement/invert_eo_quda
# Inversion done in 33 iterations, squared residue = 1.686466e-12!

Just need to make sure that it doesn't break anything, but this will help us quite a bit...
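To put numbers on the improvement, the short Python sketch below computes the speedup from the timings in the old and new logs above. The values are copied from those logs; the dictionary layout and the `speedup` helper are just illustrative choices.

```python
# Timings in seconds and iteration counts, copied from the old/new logs above.
old = {"invertQuda": 1.596819e+01, "invert_eo_quda": 2.134693e+01, "iterations": 53}
new = {"invertQuda": 3.727005e+00, "invert_eo_quda": 3.914286e+00, "iterations": 33}


def speedup(before, after):
    """Ratio of the old time to the new time (larger is better)."""
    return before / after


for key in ("invertQuda", "invert_eo_quda"):
    print(f"{key}: {speedup(old[key], new[key]):.2f}x faster")
print(f"iterations: {old['iterations']} -> {new['iterations']}")
```

This works out to a bit more than a factor of 4 for invertQuda alone and a bit more than a factor of 5 for the whole invert_eo_quda call.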

kostrzewa commented 2 years ago

One problem that this still has (and in fact exacerbates) is an excessive iteration count for the heatbath of determinant ratios with larger mass shifts. This is of course completely unreasonable and will have to be fixed somehow... The code with full field injection behaves more benignly in this regard with the iteration count only going up to about 60 or so instead of over 350...

cloverdetratio3light

# TM_QUDA: Reusing MG Preconditioner Setup for gauge_id: 4.000000
# TM_QUDA: Time for reorder_spinor_eo_toQuda 2.656285e-02 s level: 4 proc_id: 0 /HMC/cloverdetratio3light:cloverdetratio_heatbath/solve_degenerate/invert_eo_degenerate_quda/reorder_spinor_eo_toQuda
GCR: Convergence at 33 iterations, L2 relative residual: iterated = 2.850317e-11, true = 2.850317e-11 (requested = 3.162278e-11)
# TM_QUDA: Time for invertQuda 3.041126e+00 s level: 4 proc_id: 0 /HMC/cloverdetratio3light:cloverdetratio_heatbath/solve_degenerate/invert_eo_degenerate_quda/invertQuda
# TM_QUDA: Time for reorder_spinor_eo_fromQuda 3.097843e-02 s level: 4 proc_id: 0 /HMC/cloverdetratio3light:cloverdetratio_heatbath/solve_degenerate/invert_eo_degenerate_quda/reorder_spinor_eo_fromQuda
# TM_QUDA: QpQm solve done: 33 iter / 2.90188 secs = 23224.8 Gflops
# TM_QUDA: Time for invert_eo_degenerate_quda 3.098766e+00 s level: 3 proc_id: 0 /HMC/cloverdetratio3light:cloverdetratio_heatbath/solve_degenerate/invert_eo_degenerate_quda
# : Time for solve_degenerate 3.113027e+00 s level: 2 proc_id: 0 /HMC/cloverdetratio3light:cloverdetratio_heatbath/solve_degenerate
# : Time for cloverdetratio_heatbath 6.547380e+00 s level: 1 proc_id: 0 /HMC/cloverdetratio3light:cloverdetratio_heatbath

cloverdetratio2light

# TM_QUDA: Time for MG_Preconditioner_Setup_Update 5.104773e+00 s level: 4 proc_id: 0 /HMC/cloverdetratio2light:cloverdetratio_heatbath/solve_degenerate/invert_eo_degenerate_quda/MG_Preconditioner_Setup_Update
# TM_QUDA: Time for reorder_spinor_eo_toQuda 2.470451e-02 s level: 4 proc_id: 0 /HMC/cloverdetratio2light:cloverdetratio_heatbath/solve_degenerate/invert_eo_degenerate_quda/reorder_spinor_eo_toQuda
GCR: Convergence at 358 iterations, L2 relative residual: iterated = 3.082474e-11, true = 3.082474e-11 (requested = 3.162278e-11)
# TM_QUDA: Time for invertQuda 7.751718e+01 s level: 4 proc_id: 0 /HMC/cloverdetratio2light:cloverdetratio_heatbath/solve_degenerate/invert_eo_degenerate_quda/invertQuda
# TM_QUDA: Time for reorder_spinor_eo_fromQuda 3.110367e-02 s level: 4 proc_id: 0 /HMC/cloverdetratio2light:cloverdetratio_heatbath/solve_degenerate/invert_eo_degenerate_quda/reorder_spinor_eo_fromQuda
# TM_QUDA: QpQm solve done: 358 iter / 77.3822 secs = 23896.5 Gflops
# TM_QUDA: Time for invert_eo_degenerate_quda 8.297669e+01 s level: 3 proc_id: 0 /HMC/cloverdetratio2light:cloverdetratio_heatbath/solve_degenerate/invert_eo_degenerate_quda
# : Time for solve_degenerate 8.299070e+01 s level: 2 proc_id: 0 /HMC/cloverdetratio2light:cloverdetratio_heatbath/solve_degenerate
# : Time for cloverdetratio_heatbath 8.642201e+01 s level: 1 proc_id: 0 /HMC/cloverdetratio2light:cloverdetratio_heatbath

This increase to over 80 seconds spent in the heatbath of this heavier monomial does not negate the gain from the derivative, however. Still, fixing this will be a nice optimisation!
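For comparing the two monomials, a small Python sketch like the following can pull the iteration counts and solve times out of the "QpQm solve done" summary lines. The two log lines are copied from the output above; the regular expression and the `parse_solves` helper are illustrative assumptions about the log format, not part of tmLQCD.

```python
import re

# Summary lines copied verbatim from the two heatbath logs above.
LOG = """\
# TM_QUDA: QpQm solve done: 33 iter / 2.90188 secs = 23224.8 Gflops
# TM_QUDA: QpQm solve done: 358 iter / 77.3822 secs = 23896.5 Gflops
"""

# Assumed format: "solve done: <iterations> iter / <seconds> secs".
PATTERN = re.compile(r"solve done: (\d+) iter / ([0-9.eE+-]+) secs")


def parse_solves(text):
    """Return (iterations, seconds) for every 'solve done' line in text."""
    return [(int(m.group(1)), float(m.group(2))) for m in PATTERN.finditer(text)]


monomials = ("cloverdetratio3light", "cloverdetratio2light")
for name, (iters, secs) in zip(monomials, parse_solves(LOG)):
    print(f"{name}: {iters} iter, {secs:.2f} s, {secs / iters:.3f} s/iter")
```

Besides the iteration count, this also prints the time per iteration, which the logs do not report directly.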

kostrzewa commented 2 years ago

@marcuspetschlies if you remember, when you were testing analysis workloads with the quda_work_add_actions branch you found the MG to be crazy slow compared to the quda_work branch. We suspected an issue in QUDA (and there were indeed a couple of issues there), but THIS was the actual and main reason...

Finkenrath commented 2 years ago

Ok, that looks good. I will try to check it (need to re-compile QUDA for that). The heatbath problem seems to indicate that the rho-shifts are not linear. However, in the case of DDalphaAMG one also loses performance (the iteration count increases) if the shifts are too large, but only much later, at rho ~ 0.1.

kostrzewa commented 2 years ago

There's another problem though: somewhere between QUDA commit a1121d4597e60183021c70fc678c5bdfa1c0db8c and 9bae409b742765563b28939e93fa5c43b03da20b the cost for the setup update has increased by a factor of 10! This almost negates all gains from this PR (at least in the HMC).

kostrzewa commented 2 years ago

@Finkenrath when you test, use QUDA commit 227ff8c8bdeec565aa82ba307d1a2539c8bb8664. It does not suffer from the performance regression in the setup update (and refresh) noted in the previous comment, yet it contains everything needed to be compatible with this branch of tmLQCD.

kostrzewa commented 2 years ago

See https://github.com/lattice/quda/issues/1287 for updates on the updateMultigridQuda regression.