Closed kostrzewa closed 1 year ago
This damn thing has broken because of an Ubuntu problem on the runner...
`quda_work_hmc_refresh_setup` and `quda_work_hmc` should be reviewed and merged here before any more work is done.
The next step here is to check that the new changes for the HMC have not affected the existing functionality for using tmLQCD as an interface to QUDA (as done, for example, by CVC).
The merge with the current GK branch seems to have introduced a correctness regression in our twisted-clover HMC. Investigating...
I think something got messed up before (possibly on the tmLQCD side) as I seem to be able to reproduce this also with ndeg-twisted-clover before the last merge with GK.
Seems like I introduced some issue with multiple clover determinant ratios (probably related to `tm_rho`). Hopefully I can figure this out next week...
Issue with `tm_rho` fixed in be4f1de. I had forgotten to make sure to reset it appropriately...
I've also introduced a stack-based timer which, given some more work to replace some of the existing measurements, should allow for nested profiling.
Alright, got 3-level MG working with the generic kernel branch (it runs, but I'm not sure that it's completely correct; waiting for Kate to confirm in the GK PR):
```diff
diff --git a/lib/coarse_op.cuh b/lib/coarse_op.cuh
index e6eee80e8..161768ded 100644
--- a/lib/coarse_op.cuh
+++ b/lib/coarse_op.cuh
@@ -421,7 +421,7 @@ namespace quda {
         errorQuda("add_coarse_staggered_mass not enabled for non-staggered coarsenings");
 #endif
       } else if (type == COMPUTE_TMDIAGONAL) {
-#if defined(WILSONCOARSE)
+#if defined(WILSONCOARSE) || defined(COARSECOARSE)
         launch_device<add_coarse_tm>(tp, stream, arg);
 #else
         errorQuda("add_coarse_tm not enabled for non-wilson coarsenings");
```
I've pushed to feature/ndeg-twisted-clover anyway so that we can test some realistic runs with small quark mass.
I've pushed in many more uses of the stack-based timers. The result is far from perfect (or complete) and I'm not quite happy with the readability. See, for example, a call of `ndrat_derivative` from `update_momenta`.
# solve_mms_nd: Time for gamma5 2.498160e-03 s level: 3 proc_id: 0
# TM_QUDA: Using half prec. as sloppy!
# TM_QUDA: Using mixed precision CG!
# TM_QUDA: Using EO preconditioning!
# TM_QUDA: mu = 0.306748466258, epsilon = 0.153374233129 kappa = 0.163000000000, csw = -1.000000000000
# TM_QUDA: Time for reorder_spinor_eo_toQuda 4.381409e-03 s level: 4 proc_id: 0
# QUDA: MultiShift CG: Converged after 42 iterations
# QUDA: shift=0, 42 iterations, relative residual: iterated = 1.069589e-05
# QUDA: shift=1, 42 iterations, relative residual: iterated = 4.737972e-07
# QUDA: shift=2, 31 iterations, relative residual: iterated = 3.226969e-07
# QUDA: Refining shift 0: L2 residual inf / 3.162278e-08, heavy quark 0.000000e+00 / 0.000000e+00 (actual / requested)
# QUDA: CG: Convergence at 57 iterations, L2 relative residual: iterated = 2.270846e-11, true = 2.270846e-11 (requested = 2.632426e-11)
# QUDA: Refining shift 1: L2 residual inf / 3.162278e-08, heavy quark 0.000000e+00 / 0.000000e+00 (actual / requested)
# QUDA: CG: Convergence at 41 iterations, L2 relative residual: iterated = 2.351888e-11, true = 2.351888e-11 (requested = 2.632426e-11)
# QUDA: Refining shift 2: L2 residual inf / 3.162278e-08, heavy quark 0.000000e+00 / 0.000000e+00 (actual / requested)
# QUDA: CG: Convergence at 30 iterations, L2 relative residual: iterated = 2.480254e-11, true = 2.480254e-11 (requested = 2.632426e-11)
# TM_QUDA: Time for invertMultiShiftQuda 9.982428e-01 s level: 4 proc_id: 0
# TM_QUDA: Time for reorder_spinor_eo_fromQuda 5.027221e-03 s level: 5 proc_id: 0
# TM_QUDA: Time for reorder_spinor_eo_fromQuda 4.961261e-03 s level: 5 proc_id: 0
# TM_QUDA: Time for reorder_spinor_eo_fromQuda 4.694299e-03 s level: 5 proc_id: 0
# TM_QUDA: Time for multishift_output_overhead 2.302024e-02 s level: 4 proc_id: 0
# TM_QUDA: QpQm solve done: 170 iter / 0.870356 secs = 173.208 Gflops
# TM_QUDA: Time for invert_eo_quda_twoflavour_mshift 1.031182e+00 s level: 3 proc_id: 0
# solve_mms_nd: Time for mshift_mul_r_gamm5 3.330184e-03 s level: 3 proc_id: 0
# : Time for solve_mms_nd 1.037577e+00 s level: 2 proc_id: 0
# ndrat_derivative: Time for Q_tau1_sub_const_ndpsi 4.423422e-02 s level: 2 proc_id: 0
# ndrat_derivative: Time for H_eo_tm_ndpsi 1.989946e-02 s level: 2 proc_id: 0
# : Time for deriv_Sb 1.680145e-02 s level: 2 proc_id: 0
# : Time for deriv_Sb 1.661034e-02 s level: 2 proc_id: 0
# ndrat_derivative: Time for H_eo_[sw,tm]_ndpsi 1.957391e-02 s level: 2 proc_id: 0
# : Time for deriv_Sb 1.741884e-02 s level: 2 proc_id: 0
# : Time for deriv_Sb 1.733338e-02 s level: 2 proc_id: 0
# ndrat_derivative: Time for Q_tau1_sub_const_ndpsi 4.365956e-02 s level: 2 proc_id: 0
# ndrat_derivative: Time for H_eo_tm_ndpsi 1.996137e-02 s level: 2 proc_id: 0
# : Time for deriv_Sb 1.685861e-02 s level: 2 proc_id: 0
# : Time for deriv_Sb 1.674535e-02 s level: 2 proc_id: 0
# ndrat_derivative: Time for H_eo_[sw,tm]_ndpsi 2.059529e-02 s level: 2 proc_id: 0
# : Time for deriv_Sb 2.021802e-02 s level: 2 proc_id: 0
# : Time for deriv_Sb 1.726222e-02 s level: 2 proc_id: 0
# ndrat_derivative: Time for Q_tau1_sub_const_ndpsi 4.357695e-02 s level: 2 proc_id: 0
# ndrat_derivative: Time for H_eo_tm_ndpsi 2.010726e-02 s level: 2 proc_id: 0
# : Time for deriv_Sb 1.702856e-02 s level: 2 proc_id: 0
# : Time for deriv_Sb 1.672618e-02 s level: 2 proc_id: 0
# ndrat_derivative: Time for H_eo_[sw,tm]_ndpsi 1.969201e-02 s level: 2 proc_id: 0
# : Time for deriv_Sb 1.680372e-02 s level: 2 proc_id: 0
# : Time for deriv_Sb 1.661798e-02 s level: 2 proc_id: 0
# ndrat_5_7: Time for ndrat_derivative 1.495478e+00 s level: 1 proc_id: 0
# : Time for update_momenta 1.506387e+00 s level: 0 proc_id: 0
Things become even worse when one attempts to keep track of what happens in the FG integrator...
In particular, I think the partial context provided in a line like
# ndrat_derivative: Time for Q_tau1_sub_const_ndpsi 4.423422e-02 s level: 2 proc_id: 0
which is at the same level as the call to `deriv_Sb` in the same function
# : Time for deriv_Sb 1.702856e-02 s level: 2 proc_id: 0
is confusing. Providing context to the latter involves changing a lot of stuff (or instrumenting all calls to `deriv_Sb` explicitly, which I don't really want to do...). I might thus just remove the context at the beginning of the line (i.e. `ndrat_derivative:`) and only keep it in places where the context is user-configured, such as in:
# ndrat_5_7: Time for ndrat_derivative 1.495478e+00 s level: 1 proc_id: 0
I've removed the context in 413ce1e and it can easily be reintroduced. To me this looks cleaner:
# TM_QUDA: Using half prec. as sloppy!
# TM_QUDA: Using mixed precision CG!
# TM_QUDA: Using EO preconditioning!
# TM_QUDA: Clover field and inverse already loaded for gauge 0.000000
# TM_QUDA: mu = 0.306748466258, epsilon = 0.153374233129 kappa = 0.163000000000, csw = 1.000000000000
# TM_QUDA: Time for reorder_spinor_eo_toQuda 4.183512e-03 s level: 4 proc_id: 0
# QUDA: MultiShift CG: Converged after 75 iterations
# QUDA: shift=0, 75 iterations, relative residual: iterated = 1.390547e-06
# QUDA: shift=1, 75 iterations, relative residual: iterated = 4.269281e-07
# QUDA: Refining shift 0: L2 residual inf / 3.162278e-08, heavy quark 0.000000e+00 / 0.000000e+00 (actual / requested)
# QUDA: CG: Convergence at 70 iterations, L2 relative residual: iterated = 2.355300e-11, true = 2.355300e-11 (requested = 2.531149e-11)
# QUDA: Refining shift 1: L2 residual inf / 3.162278e-08, heavy quark 0.000000e+00 / 0.000000e+00 (actual / requested)
# QUDA: CG: Convergence at 73 iterations, L2 relative residual: iterated = 2.167418e-11, true = 2.167418e-11 (requested = 2.531149e-11)
# TM_QUDA: Time for invertMultiShiftQuda 1.404832e+00 s level: 4 proc_id: 0
# TM_QUDA: Time for reorder_spinor_eo_fromQuda 4.730562e-03 s level: 5 proc_id: 0
# TM_QUDA: Time for reorder_spinor_eo_fromQuda 5.002763e-03 s level: 5 proc_id: 0
# TM_QUDA: Time for multishift_output_overhead 1.539710e-02 s level: 4 proc_id: 0
# TM_QUDA: QpQm solve done: 218 iter / 1.31038 secs = 184.99 Gflops
# TM_QUDA: Time for invert_eo_quda_twoflavour_mshift 1.429501e+00 s level: 3 proc_id: 0
# solve_mms_nd: Time for mshift_mul_r_gamm5 2.134861e-03 s level: 3 proc_id: 0
# solve_mms_nd residual check: shift 0 (1.105573e-04), res. 8.658786e-16
# solve_mms_nd residual check: shift 1 (1.116214e-03), res. 7.332464e-16
# : Time for solve_mms_nd 1.658463e+00 s level: 2 proc_id: 0
# : Time for Qsw_tau1_sub_const_ndpsi 6.060014e-02 s level: 2 proc_id: 0
# : Time for H_eo_sw_ndpsi 2.850301e-02 s level: 2 proc_id: 0
# : Time for deriv_Sb 1.946674e-02 s level: 2 proc_id: 0
# : Time for deriv_Sb 1.998732e-02 s level: 2 proc_id: 0
# : Time for H_eo_sw_ndpsi 2.692045e-02 s level: 2 proc_id: 0
# : Time for deriv_Sb 1.901408e-02 s level: 2 proc_id: 0
# : Time for deriv_Sb 1.859284e-02 s level: 2 proc_id: 0
# : Time for sw_spinor_eo 6.559874e-03 s level: 2 proc_id: 0
# : Time for sw_spinor_eo 6.857285e-03 s level: 2 proc_id: 0
# : Time for sw_spinor_eo 6.974024e-03 s level: 2 proc_id: 0
# : Time for sw_spinor_eo 6.872104e-03 s level: 2 proc_id: 0
# : Time for Qsw_tau1_sub_const_ndpsi 5.505562e-02 s level: 2 proc_id: 0
# : Time for H_eo_sw_ndpsi 2.738234e-02 s level: 2 proc_id: 0
# : Time for deriv_Sb 1.760590e-02 s level: 2 proc_id: 0
# : Time for deriv_Sb 1.742311e-02 s level: 2 proc_id: 0
# : Time for H_eo_sw_ndpsi 2.712331e-02 s level: 2 proc_id: 0
# : Time for deriv_Sb 1.807833e-02 s level: 2 proc_id: 0
# : Time for deriv_Sb 1.818175e-02 s level: 2 proc_id: 0
# : Time for sw_spinor_eo 7.193194e-03 s level: 2 proc_id: 0
# : Time for sw_spinor_eo 6.880055e-03 s level: 2 proc_id: 0
# : Time for sw_spinor_eo 6.813154e-03 s level: 2 proc_id: 0
# : Time for sw_spinor_eo 6.952154e-03 s level: 2 proc_id: 0
# : Time for sw_all 3.189702e-01 s level: 2 proc_id: 0
# ndrat_8_9: Time for ndrat_derivative 2.583201e+00 s level: 1 proc_id: 0
In a run on Booster, this is what the overheads look like in the derivative of an ndcloverrat monomial: the derivative takes 8 seconds, as discussed on Monday, while the solver itself accounts for only 0.56 seconds. The call to `invertMultiShiftQuda` takes about 1.2 seconds in total (so this is dominated by host-device transfers of the output fields, I guess), plus another 0.2 seconds go into reordering all the fields on the CPU side. All the remainder is spent in various CPU functions, all of which will need to run on the GPU in the future.
# : Time for su3_zero 2.838294e-02 s level: 2 proc_id: 0
# : Time for sw_term 4.119006e-01 s level: 2 proc_id: 0
# : Time for sw_invert_nd 1.720084e-01 s level: 2 proc_id: 0
# solve_mms_nd: Time for gamma5 9.351093e-03 s level: 3 proc_id: 0
# TM_QUDA: Using single prec. as sloppy!
# TM_QUDA: Using mixed precision CG!
# TM_QUDA: Using EO preconditioning!
# TM_QUDA: Clover field and inverse already loaded for gauge 0.009375
# TM_QUDA: mu = 0.124686399987, epsilon = 0.131505200008 kappa = 0.139426700000, csw = 1.690000000000
# TM_QUDA: Time for reorder_spinor_eo_toQuda 2.321822e-02 s level: 4 proc_id: 0
MultiShift CG: Converged after 53 iterations
shift=0, 53 iterations, relative residual: iterated = 4.002840e-05
shift=1, 53 iterations, relative residual: iterated = 2.889575e-08
shift=2, 28 iterations, relative residual: iterated = 2.094232e-08
shift=3, 14 iterations, relative residual: iterated = 2.250499e-08
shift=4, 7 iterations, relative residual: iterated = 1.034281e-08
Refining shift 0: L2 residual inf / 1.000000e-08, heavy quark 0.000000e+00 / 0.000000e+00 (actual / requested)
CG: Convergence at 58 iterations, L2 relative residual: iterated = 9.644183e-09, true = 9.644183e-09 (requested = 1.000000e-08)
Refining shift 1: L2 residual inf / 1.000000e-08, heavy quark 0.000000e+00 / 0.000000e+00 (actual / requested)
CG: Convergence at 8 iterations, L2 relative residual: iterated = 8.038671e-09, true = 8.038671e-09 (requested = 1.000000e-08)
Refining shift 2: L2 residual inf / 1.000000e-08, heavy quark 0.000000e+00 / 0.000000e+00 (actual / requested)
CG: Convergence at 4 iterations, L2 relative residual: iterated = 8.880854e-09, true = 8.880854e-09 (requested = 1.000000e-08)
Refining shift 3: L2 residual inf / 1.000000e-08, heavy quark 0.000000e+00 / 0.000000e+00 (actual / requested)
CG: Convergence at 2 iterations, L2 relative residual: iterated = 6.822438e-09, true = 6.822438e-09 (requested = 1.000000e-08)
Refining shift 4: L2 residual inf / 1.000000e-08, heavy quark 0.000000e+00 / 0.000000e+00 (actual / requested)
CG: Convergence at 1 iterations, L2 relative residual: iterated = 2.585498e-09, true = 2.585498e-09 (requested = 1.000000e-08)
# TM_QUDA: Time for invertMultiShiftQuda 1.190052e+00 s level: 4 proc_id: 0
# TM_QUDA: Time for reorder_spinor_eo_fromQuda 2.411113e-02 s level: 5 proc_id: 0
# TM_QUDA: Time for reorder_spinor_eo_fromQuda 2.346242e-02 s level: 5 proc_id: 0
# TM_QUDA: Time for reorder_spinor_eo_fromQuda 2.339547e-02 s level: 5 proc_id: 0
# TM_QUDA: Time for reorder_spinor_eo_fromQuda 2.328414e-02 s level: 5 proc_id: 0
# TM_QUDA: Time for reorder_spinor_eo_fromQuda 2.339242e-02 s level: 5 proc_id: 0
# TM_QUDA: Time for multishift_output_overhead 1.785818e-01 s level: 4 proc_id: 0
# TM_QUDA: QpQm solve done: 126 iter / 0.561773 secs = 28021 Gflops
# TM_QUDA: Time for invert_eo_quda_twoflavour_mshift 1.415733e+00 s level: 3 proc_id: 0
# solve_mms_nd: Time for mshift_mul_r_gamm5 4.290332e-02 s level: 3 proc_id: 0
# : Time for solve_mms_nd 1.468456e+00 s level: 2 proc_id: 0
# : Time for Qsw_tau1_sub_const_ndpsi 2.259685e-01 s level: 2 proc_id: 0
# : Time for H_eo_sw_ndpsi 1.120288e-01 s level: 2 proc_id: 0
# : Time for deriv_Sb 7.598881e-02 s level: 2 proc_id: 0
# : Time for deriv_Sb 7.210845e-02 s level: 2 proc_id: 0
# : Time for H_eo_sw_ndpsi 1.095201e-01 s level: 2 proc_id: 0
# : Time for deriv_Sb 7.198362e-02 s level: 2 proc_id: 0
# : Time for deriv_Sb 7.272405e-02 s level: 2 proc_id: 0
# : Time for sw_spinor_eo 2.712106e-02 s level: 2 proc_id: 0
# : Time for sw_spinor_eo 2.744416e-02 s level: 2 proc_id: 0
# : Time for sw_spinor_eo 2.745592e-02 s level: 2 proc_id: 0
# : Time for sw_spinor_eo 2.745045e-02 s level: 2 proc_id: 0
# : Time for Qsw_tau1_sub_const_ndpsi 2.244489e-01 s level: 2 proc_id: 0
# : Time for H_eo_sw_ndpsi 1.092240e-01 s level: 2 proc_id: 0
# : Time for deriv_Sb 7.553661e-02 s level: 2 proc_id: 0
# : Time for deriv_Sb 7.345508e-02 s level: 2 proc_id: 0
# : Time for H_eo_sw_ndpsi 1.085700e-01 s level: 2 proc_id: 0
# : Time for deriv_Sb 7.200438e-02 s level: 2 proc_id: 0
# : Time for deriv_Sb 7.194646e-02 s level: 2 proc_id: 0
# : Time for sw_spinor_eo 2.697295e-02 s level: 2 proc_id: 0
# : Time for sw_spinor_eo 2.726760e-02 s level: 2 proc_id: 0
# : Time for sw_spinor_eo 2.743715e-02 s level: 2 proc_id: 0
# : Time for sw_spinor_eo 2.735236e-02 s level: 2 proc_id: 0
# : Time for Qsw_tau1_sub_const_ndpsi 2.241286e-01 s level: 2 proc_id: 0
# : Time for H_eo_sw_ndpsi 1.098317e-01 s level: 2 proc_id: 0
# : Time for deriv_Sb 7.524991e-02 s level: 2 proc_id: 0
# : Time for deriv_Sb 7.199216e-02 s level: 2 proc_id: 0
# : Time for H_eo_sw_ndpsi 1.098054e-01 s level: 2 proc_id: 0
# : Time for deriv_Sb 7.225873e-02 s level: 2 proc_id: 0
# : Time for deriv_Sb 7.243308e-02 s level: 2 proc_id: 0
# : Time for sw_spinor_eo 2.701527e-02 s level: 2 proc_id: 0
# : Time for sw_spinor_eo 2.743087e-02 s level: 2 proc_id: 0
# : Time for sw_spinor_eo 2.745948e-02 s level: 2 proc_id: 0
# : Time for sw_spinor_eo 2.744659e-02 s level: 2 proc_id: 0
# : Time for Qsw_tau1_sub_const_ndpsi 2.244854e-01 s level: 2 proc_id: 0
# : Time for H_eo_sw_ndpsi 1.103811e-01 s level: 2 proc_id: 0
# : Time for deriv_Sb 7.830236e-02 s level: 2 proc_id: 0
# : Time for deriv_Sb 7.174395e-02 s level: 2 proc_id: 0
# : Time for H_eo_sw_ndpsi 1.107424e-01 s level: 2 proc_id: 0
# : Time for deriv_Sb 7.157294e-02 s level: 2 proc_id: 0
# : Time for deriv_Sb 7.216093e-02 s level: 2 proc_id: 0
# : Time for sw_spinor_eo 2.704141e-02 s level: 2 proc_id: 0
# : Time for sw_spinor_eo 2.729130e-02 s level: 2 proc_id: 0
# : Time for sw_spinor_eo 2.729717e-02 s level: 2 proc_id: 0
# : Time for sw_spinor_eo 2.738000e-02 s level: 2 proc_id: 0
# : Time for Qsw_tau1_sub_const_ndpsi 2.252583e-01 s level: 2 proc_id: 0
# : Time for H_eo_sw_ndpsi 1.094722e-01 s level: 2 proc_id: 0
# : Time for deriv_Sb 7.461486e-02 s level: 2 proc_id: 0
# : Time for deriv_Sb 7.203536e-02 s level: 2 proc_id: 0
# : Time for H_eo_sw_ndpsi 1.106153e-01 s level: 2 proc_id: 0
# : Time for deriv_Sb 7.226443e-02 s level: 2 proc_id: 0
# : Time for deriv_Sb 7.231228e-02 s level: 2 proc_id: 0
# : Time for sw_spinor_eo 2.730029e-02 s level: 2 proc_id: 0
# : Time for sw_spinor_eo 2.732716e-02 s level: 2 proc_id: 0
# : Time for sw_spinor_eo 2.747000e-02 s level: 2 proc_id: 0
# : Time for sw_spinor_eo 2.737936e-02 s level: 2 proc_id: 0
# : Time for sw_deriv_nd 7.219962e-02 s level: 2 proc_id: 0
# : Time for sw_all 1.348470e+00 s level: 2 proc_id: 0
# ndcloverrat1: Time for ndrat_derivative 7.924570e+00 s level: 1 proc_id: 0
I tried to test these changes on m100 but I got:
ERROR: Gauge force has not been built (rank 0, host r246n18, gauge_force.cu:62 in gaugeForce())
last kernel called was (name=N4quda14ExtractGhostExINS_5gauge11FloatNOrderIdLi18ELi2ELi18EL20QudaStaggeredPhase_s0ELb1EL19QudaGhostExchange_sn2147483648ELb0EEEEE,volume=8x8x8x12,aux=GPU-offline,vol=6144,stride=3072,precision=8,geometry=4,Nc=3,r=0002,inject,dim3)
I activate QUDA for the gauge derivative with:
```
BeginMonomial GAUGE
  Type = Iwasaki
  beta = 1.726
  Timescale = 0
  UseExternalLibrary = quda
EndMonomial
```
Did I do something wrong, or is this still a work in progress?
> I tried to test these changes on m100 but I got:
You need to compile QUDA with the gauge force enabled (`-DQUDA_FORCE_GAUGE=ON`) and use the latest commit of https://github.com/lattice/quda/pull/1121
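For reference, a minimal build sketch with the gauge force enabled (directory layout and any extra configure options are placeholders for whatever your setup needs):

```shell
# Check out the PR branch and enable the gauge-force kernels at configure time.
git clone https://github.com/lattice/quda.git
cd quda
git fetch origin pull/1121/head:pr-1121
git checkout pr-1121
cmake -S . -B build -DQUDA_FORCE_GAUGE=ON   # plus your usual QUDA options
cmake --build build -j
```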
Should we merge this in @Marcogarofalo ? It would certainly reduce the number of branches :)
Would be good to confirm #543 first though
I'm going to merge this into quda_work now.
Three build types:
1) plain
2) with DDalphaAMG
3) with QPhiX
For QPhiX, the integration tests don't pass and should be checked manually by someone else, by running `sample-hmc-qphix-tmcloverdetratio.input` and comparing against the respective `sample-output/hmc-qphix-tmcloverdetratio/*` using the master branch of tmLQCD. There might be issues with the machine that the sample-output files were generated on that manifest only in that case. Alternatively, the changes to the `quda_work` branch might have affected the correct functioning of external solvers (in fact, I'm almost certain that's the case for DDalphaAMG, as discussed here: https://github.com/etmc/tmLQCD/pull/460#discussion_r605981344).