Closed kostrzewa closed 2 years ago
One problem that this still has (and in fact exacerbates) is an excessive iteration count for the heatbath of determinant ratios with larger mass shifts. This is of course completely unreasonable and will have to be fixed somehow... The code with full field injection behaves more benignly in this regard with the iteration count only going up to about 60 or so instead of over 350...
```
# TM_QUDA: Reusing MG Preconditioner Setup for gauge_id: 4.000000
# TM_QUDA: Time for reorder_spinor_eo_toQuda 2.656285e-02 s level: 4 proc_id: 0 /HMC/cloverdetratio3light:cloverdetratio_heatbath/solve_degenerate/invert_eo_degenerate_quda/reorder_spinor_eo_toQuda
GCR: Convergence at 33 iterations, L2 relative residual: iterated = 2.850317e-11, true = 2.850317e-11 (requested = 3.162278e-11)
# TM_QUDA: Time for invertQuda 3.041126e+00 s level: 4 proc_id: 0 /HMC/cloverdetratio3light:cloverdetratio_heatbath/solve_degenerate/invert_eo_degenerate_quda/invertQuda
# TM_QUDA: Time for reorder_spinor_eo_fromQuda 3.097843e-02 s level: 4 proc_id: 0 /HMC/cloverdetratio3light:cloverdetratio_heatbath/solve_degenerate/invert_eo_degenerate_quda/reorder_spinor_eo_fromQuda
# TM_QUDA: QpQm solve done: 33 iter / 2.90188 secs = 23224.8 Gflops
# TM_QUDA: Time for invert_eo_degenerate_quda 3.098766e+00 s level: 3 proc_id: 0 /HMC/cloverdetratio3light:cloverdetratio_heatbath/solve_degenerate/invert_eo_degenerate_quda
# : Time for solve_degenerate 3.113027e+00 s level: 2 proc_id: 0 /HMC/cloverdetratio3light:cloverdetratio_heatbath/solve_degenerate
# : Time for cloverdetratio_heatbath 6.547380e+00 s level: 1 proc_id: 0 /HMC/cloverdetratio3light:cloverdetratio_heatbath
# TM_QUDA: Time for MG_Preconditioner_Setup_Update 5.104773e+00 s level: 4 proc_id: 0 /HMC/cloverdetratio2light:cloverdetratio_heatbath/solve_degenerate/invert_eo_degenerate_quda/MG_Preconditioner_Setup_Update
# TM_QUDA: Time for reorder_spinor_eo_toQuda 2.470451e-02 s level: 4 proc_id: 0 /HMC/cloverdetratio2light:cloverdetratio_heatbath/solve_degenerate/invert_eo_degenerate_quda/reorder_spinor_eo_toQuda
GCR: Convergence at 358 iterations, L2 relative residual: iterated = 3.082474e-11, true = 3.082474e-11 (requested = 3.162278e-11)
# TM_QUDA: Time for invertQuda 7.751718e+01 s level: 4 proc_id: 0 /HMC/cloverdetratio2light:cloverdetratio_heatbath/solve_degenerate/invert_eo_degenerate_quda/invertQuda
# TM_QUDA: Time for reorder_spinor_eo_fromQuda 3.110367e-02 s level: 4 proc_id: 0 /HMC/cloverdetratio2light:cloverdetratio_heatbath/solve_degenerate/invert_eo_degenerate_quda/reorder_spinor_eo_fromQuda
# TM_QUDA: QpQm solve done: 358 iter / 77.3822 secs = 23896.5 Gflops
# TM_QUDA: Time for invert_eo_degenerate_quda 8.297669e+01 s level: 3 proc_id: 0 /HMC/cloverdetratio2light:cloverdetratio_heatbath/solve_degenerate/invert_eo_degenerate_quda
# : Time for solve_degenerate 8.299070e+01 s level: 2 proc_id: 0 /HMC/cloverdetratio2light:cloverdetratio_heatbath/solve_degenerate
# : Time for cloverdetratio_heatbath 8.642201e+01 s level: 1 proc_id: 0 /HMC/cloverdetratio2light:cloverdetratio_heatbath
```
This increase to over 80 seconds spent in the heatbath of this heavier monomial does not negate the gain from the derivative, however. Still, fixing this will be a nice optimisation!
@marcuspetschlies if you remember, when you were testing analysis workloads with the `quda_work_add_actions` branch you were finding crazy slow behaviour of the MG compared to the `quda_work` branch. We suspected an issue in QUDA (and there were indeed a couple of issues there), but THIS was the actual and main reason...
Ok, that looks good. I will try to check it (I need to re-compile QUDA for that). The heatbath problem seems to indicate that the rho shifts are no longer in the linear regime. With DDalphaAMG one also loses performance (an increased iteration count) when the shifts are too large, but only much later, at around rho ~ 0.1.
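As a reminder of where the shifts enter, the standard Hasenbusch-style splitting reads (schematic only; conventions simplified, not copied from tmLQCD):

```latex
% Schematic Hasenbusch splitting: the determinant is traded for a
% shifted determinant times a ratio monomial.
\det M \;=\; \det\left( M + \rho \right)\,
\det\!\left[ \frac{M}{M + \rho} \right]
```

The heatbath (and force) of the ratio monomial involves inversions of the shifted operator M + rho, so the larger rho is, the further that operator presumably sits from the one the MG setup was generated for, which would be consistent with the iteration-count blow-up seen above.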
There's another problem though: somewhere between QUDA commit a1121d4597e60183021c70fc678c5bdfa1c0db8c and 9bae409b742765563b28939e93fa5c43b03da20b the cost for the setup update has increased by a factor of 10! This almost negates all gains from this PR (at least in the HMC).
@Finkenrath when you test, use QUDA commit 227ff8c8bdeec565aa82ba307d1a2539c8bb8664: it does not suffer from the performance regression in the setup update (and refresh) noted in my previous comment, yet it contains everything needed to be compatible with this branch of tmLQCD.
See https://github.com/lattice/quda/issues/1287 for updates on the updateMultigridQuda regression.
When I was adding the HMC additions, I accidentally messed up the logic for `coarse_grid_solution_type` and was injecting a full field on all intermediate and coarse levels :facepalm:

This PR significantly improves MG performance (4 Booster nodes, 64c128 lattice, physical point) and will also restore analysis performance to previous levels:
old: [timing plot]

new: [timing plot]
Just need to make sure that it doesn't break anything, but this will help us quite a bit...