etmc / tmLQCD

tmLQCD is a freely available software suite providing a set of tools for lattice QCD simulations. Its core is an HMC implementation (including PHMC and RHMC) for Wilson, Wilson clover and Wilson twisted mass fermions, together with inverters for different versions of the Dirac operator. The code is fully parallelised and ships with optimisations for various modern architectures, such as commodity PC clusters and the Blue Gene family.
http://www.itkp.uni-bonn.de/~urbach/software.html
GNU General Public License v3.0

extension of the QUDA interface to use QUDA solvers in the HMC #490

Closed: kostrzewa closed this issue 1 year ago

kostrzewa commented 3 years ago

three build types:

1) plain
2) with DDalphaAMG
3) with QPhiX

For QPhiX, the integration tests don't pass and should be checked manually by someone else: run sample-hmc-qphix-tmcloverdetratio.input and compare against the respective sample-output/hmc-qphix-tmcloverdetratio/* using the master branch of tmLQCD. The sample-output files might have been generated on a machine with a problem that manifests only in this case. Alternatively, the changes on the quda_work branch might have broken the correct functioning of external solvers (in fact, I'm almost certain that's the case for DDalphaAMG, as discussed here: https://github.com/etmc/tmLQCD/pull/460#discussion_r605981344).

kostrzewa commented 2 years ago

This damn thing has broken because of an Ubuntu problem on the runner...

kostrzewa commented 2 years ago

quda_work_hmc_refresh_setup and quda_work_hmc should be reviewed and merged here before any more work is done

kostrzewa commented 2 years ago

The next step here is to check that the new changes for the HMC have not affected the existing functionality for using tmLQCD as an interface to QUDA (as done, for example, by CVC).

kostrzewa commented 2 years ago

The merge with the current GK branch seems to have introduced a correctness regression in our twisted-clover HMC. Investigating...

kostrzewa commented 2 years ago

I think something got messed up before (possibly on the tmLQCD side) as I seem to be able to reproduce this also with ndeg-twisted-clover before the last merge with GK.

kostrzewa commented 2 years ago

It seems I introduced an issue with multiple clover determinant ratios (probably related to tm_rho). Hopefully I can figure this out next week...

kostrzewa commented 2 years ago

Issue with tm_rho fixed in be4f1de. I had forgotten to reset it appropriately...
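The bug class here is the classic save/restore pattern for a global parameter. A generic sketch of the kind of fix applied (names like g_tm_rho and detratio_contribution are illustrative stand-ins, not the actual tmLQCD identifiers or the real operator):

```c
/* Generic save/restore sketch: a global parameter set by one
 * determinant-ratio monomial must be restored afterwards, or its value
 * leaks into the next monomial. All names here are illustrative. */

double g_tm_rho = 0.0; /* global parameter shared by all monomials */

double detratio_contribution(double tm_rho, double x) {
  const double saved = g_tm_rho;     /* remember the caller's value */
  g_tm_rho = tm_rho;                 /* parameter for this monomial */
  double res = x * (1.0 + g_tm_rho); /* stand-in for the actual operator */
  g_tm_rho = saved;                  /* the easily forgotten step: reset */
  return res;
}
```

Without the final reset, the second determinant ratio in a run silently uses the first one's tm_rho.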

kostrzewa commented 2 years ago

I've also introduced a stack-based timer which, given some more work to replace some of the existing measurements, should allow for nested profiling.

kostrzewa commented 2 years ago

Alright, got 3-level MG working with the generic kernel branch (it works, but I'm not sure that it's completely correct, waiting for Kate to confirm in the GK PR):

diff --git a/lib/coarse_op.cuh b/lib/coarse_op.cuh
index e6eee80e8..161768ded 100644
--- a/lib/coarse_op.cuh
+++ b/lib/coarse_op.cuh
@@ -421,7 +421,7 @@ namespace quda {
         errorQuda("add_coarse_staggered_mass not enabled for non-staggered coarsenings");
 #endif
       } else if (type == COMPUTE_TMDIAGONAL) {
-#if defined(WILSONCOARSE)
+#if defined(WILSONCOARSE) || defined(COARSECOARSE)
         launch_device<add_coarse_tm>(tp, stream, arg);
 #else
         errorQuda("add_coarse_tm not enabled for non-wilson coarsenings");

kostrzewa commented 2 years ago

I've pushed to feature/ndeg-twisted-clover anyway so that we can test some realistic runs with small quark mass.

kostrzewa commented 2 years ago

I've pushed in many more uses of the stack-based timers. The result is far from perfect (or complete) and I'm not quite happy with the readability. See, for example, a call of ndrat_derivative from update_momenta.

# solve_mms_nd: Time for gamma5 2.498160e-03 s level: 3 proc_id: 0
# TM_QUDA: Using half prec. as sloppy!
# TM_QUDA: Using mixed precision CG!
# TM_QUDA: Using EO preconditioning!
# TM_QUDA: mu = 0.306748466258, epsilon = 0.153374233129 kappa = 0.163000000000, csw = -1.000000000000
# TM_QUDA: Time for reorder_spinor_eo_toQuda 4.381409e-03 s level: 4 proc_id: 0
# QUDA: MultiShift CG: Converged after 42 iterations
# QUDA:  shift=0, 42 iterations, relative residual: iterated = 1.069589e-05
# QUDA:  shift=1, 42 iterations, relative residual: iterated = 4.737972e-07
# QUDA:  shift=2, 31 iterations, relative residual: iterated = 3.226969e-07
# QUDA: Refining shift 0: L2 residual inf / 3.162278e-08, heavy quark 0.000000e+00 / 0.000000e+00 (actual / requested)
# QUDA: CG: Convergence at 57 iterations, L2 relative residual: iterated = 2.270846e-11, true = 2.270846e-11 (requested = 2.632426e-11)
# QUDA: Refining shift 1: L2 residual inf / 3.162278e-08, heavy quark 0.000000e+00 / 0.000000e+00 (actual / requested)
# QUDA: CG: Convergence at 41 iterations, L2 relative residual: iterated = 2.351888e-11, true = 2.351888e-11 (requested = 2.632426e-11)
# QUDA: Refining shift 2: L2 residual inf / 3.162278e-08, heavy quark 0.000000e+00 / 0.000000e+00 (actual / requested)
# QUDA: CG: Convergence at 30 iterations, L2 relative residual: iterated = 2.480254e-11, true = 2.480254e-11 (requested = 2.632426e-11)
# TM_QUDA: Time for invertMultiShiftQuda 9.982428e-01 s level: 4 proc_id: 0
# TM_QUDA: Time for reorder_spinor_eo_fromQuda 5.027221e-03 s level: 5 proc_id: 0
# TM_QUDA: Time for reorder_spinor_eo_fromQuda 4.961261e-03 s level: 5 proc_id: 0
# TM_QUDA: Time for reorder_spinor_eo_fromQuda 4.694299e-03 s level: 5 proc_id: 0
# TM_QUDA: Time for multishift_output_overhead 2.302024e-02 s level: 4 proc_id: 0
# TM_QUDA: QpQm solve done: 170 iter / 0.870356 secs = 173.208 Gflops
# TM_QUDA: Time for invert_eo_quda_twoflavour_mshift 1.031182e+00 s level: 3 proc_id: 0
# solve_mms_nd: Time for mshift_mul_r_gamm5 3.330184e-03 s level: 3 proc_id: 0
# : Time for solve_mms_nd 1.037577e+00 s level: 2 proc_id: 0
# ndrat_derivative: Time for Q_tau1_sub_const_ndpsi 4.423422e-02 s level: 2 proc_id: 0
# ndrat_derivative: Time for H_eo_tm_ndpsi 1.989946e-02 s level: 2 proc_id: 0
# : Time for deriv_Sb 1.680145e-02 s level: 2 proc_id: 0
# : Time for deriv_Sb 1.661034e-02 s level: 2 proc_id: 0
# ndrat_derivative: Time for H_eo_[sw,tm]_ndpsi 1.957391e-02 s level: 2 proc_id: 0
# : Time for deriv_Sb 1.741884e-02 s level: 2 proc_id: 0
# : Time for deriv_Sb 1.733338e-02 s level: 2 proc_id: 0
# ndrat_derivative: Time for Q_tau1_sub_const_ndpsi 4.365956e-02 s level: 2 proc_id: 0
# ndrat_derivative: Time for H_eo_tm_ndpsi 1.996137e-02 s level: 2 proc_id: 0
# : Time for deriv_Sb 1.685861e-02 s level: 2 proc_id: 0
# : Time for deriv_Sb 1.674535e-02 s level: 2 proc_id: 0
# ndrat_derivative: Time for H_eo_[sw,tm]_ndpsi 2.059529e-02 s level: 2 proc_id: 0
# : Time for deriv_Sb 2.021802e-02 s level: 2 proc_id: 0
# : Time for deriv_Sb 1.726222e-02 s level: 2 proc_id: 0
# ndrat_derivative: Time for Q_tau1_sub_const_ndpsi 4.357695e-02 s level: 2 proc_id: 0
# ndrat_derivative: Time for H_eo_tm_ndpsi 2.010726e-02 s level: 2 proc_id: 0
# : Time for deriv_Sb 1.702856e-02 s level: 2 proc_id: 0
# : Time for deriv_Sb 1.672618e-02 s level: 2 proc_id: 0
# ndrat_derivative: Time for H_eo_[sw,tm]_ndpsi 1.969201e-02 s level: 2 proc_id: 0
# : Time for deriv_Sb 1.680372e-02 s level: 2 proc_id: 0
# : Time for deriv_Sb 1.661798e-02 s level: 2 proc_id: 0
# ndrat_5_7: Time for ndrat_derivative 1.495478e+00 s level: 1 proc_id: 0
# : Time for update_momenta 1.506387e+00 s level: 0 proc_id: 0

Things become even worse when one attempts to keep track of what happens in the FG integrator...

kostrzewa commented 2 years ago

In particular, I think the partial context provided in a line like

# ndrat_derivative: Time for Q_tau1_sub_const_ndpsi 4.423422e-02 s level: 2 proc_id: 0

which is at the same level as the call to deriv_Sb in the same function

# : Time for deriv_Sb 1.702856e-02 s level: 2 proc_id: 0

is confusing. Providing context to the latter involves changing a lot of stuff (or instrumenting all calls to deriv_Sb explicitly, which I don't really want to do...). I might thus just remove the context at the beginning of the line (i.e. ndrat_derivative:) and only keep it in places where the context is user-configured, such as in:

# ndrat_5_7: Time for ndrat_derivative 1.495478e+00 s level: 1 proc_id: 0

kostrzewa commented 2 years ago

I've removed the context in 413ce1e and it can easily be reintroduced. To me this looks cleaner:

# TM_QUDA: Using half prec. as sloppy!
# TM_QUDA: Using mixed precision CG!
# TM_QUDA: Using EO preconditioning!
# TM_QUDA: Clover field and inverse already loaded for gauge 0.000000
# TM_QUDA: mu = 0.306748466258, epsilon = 0.153374233129 kappa = 0.163000000000, csw = 1.000000000000
# TM_QUDA: Time for reorder_spinor_eo_toQuda 4.183512e-03 s level: 4 proc_id: 0
# QUDA: MultiShift CG: Converged after 75 iterations
# QUDA:  shift=0, 75 iterations, relative residual: iterated = 1.390547e-06
# QUDA:  shift=1, 75 iterations, relative residual: iterated = 4.269281e-07
# QUDA: Refining shift 0: L2 residual inf / 3.162278e-08, heavy quark 0.000000e+00 / 0.000000e+00 (actual / requested)
# QUDA: CG: Convergence at 70 iterations, L2 relative residual: iterated = 2.355300e-11, true = 2.355300e-11 (requested = 2.531149e-11)
# QUDA: Refining shift 1: L2 residual inf / 3.162278e-08, heavy quark 0.000000e+00 / 0.000000e+00 (actual / requested)
# QUDA: CG: Convergence at 73 iterations, L2 relative residual: iterated = 2.167418e-11, true = 2.167418e-11 (requested = 2.531149e-11)
# TM_QUDA: Time for invertMultiShiftQuda 1.404832e+00 s level: 4 proc_id: 0
# TM_QUDA: Time for reorder_spinor_eo_fromQuda 4.730562e-03 s level: 5 proc_id: 0
# TM_QUDA: Time for reorder_spinor_eo_fromQuda 5.002763e-03 s level: 5 proc_id: 0
# TM_QUDA: Time for multishift_output_overhead 1.539710e-02 s level: 4 proc_id: 0
# TM_QUDA: QpQm solve done: 218 iter / 1.31038 secs = 184.99 Gflops
# TM_QUDA: Time for invert_eo_quda_twoflavour_mshift 1.429501e+00 s level: 3 proc_id: 0
# solve_mms_nd: Time for mshift_mul_r_gamm5 2.134861e-03 s level: 3 proc_id: 0
# solve_mms_nd residual check: shift 0 (1.105573e-04), res. 8.658786e-16
# solve_mms_nd residual check: shift 1 (1.116214e-03), res. 7.332464e-16
# : Time for solve_mms_nd 1.658463e+00 s level: 2 proc_id: 0
# : Time for Qsw_tau1_sub_const_ndpsi 6.060014e-02 s level: 2 proc_id: 0
# : Time for H_eo_sw_ndpsi 2.850301e-02 s level: 2 proc_id: 0
# : Time for deriv_Sb 1.946674e-02 s level: 2 proc_id: 0
# : Time for deriv_Sb 1.998732e-02 s level: 2 proc_id: 0
# : Time for H_eo_sw_ndpsi 2.692045e-02 s level: 2 proc_id: 0
# : Time for deriv_Sb 1.901408e-02 s level: 2 proc_id: 0
# : Time for deriv_Sb 1.859284e-02 s level: 2 proc_id: 0
# : Time for sw_spinor_eo 6.559874e-03 s level: 2 proc_id: 0
# : Time for sw_spinor_eo 6.857285e-03 s level: 2 proc_id: 0
# : Time for sw_spinor_eo 6.974024e-03 s level: 2 proc_id: 0
# : Time for sw_spinor_eo 6.872104e-03 s level: 2 proc_id: 0
# : Time for Qsw_tau1_sub_const_ndpsi 5.505562e-02 s level: 2 proc_id: 0
# : Time for H_eo_sw_ndpsi 2.738234e-02 s level: 2 proc_id: 0
# : Time for deriv_Sb 1.760590e-02 s level: 2 proc_id: 0
# : Time for deriv_Sb 1.742311e-02 s level: 2 proc_id: 0
# : Time for H_eo_sw_ndpsi 2.712331e-02 s level: 2 proc_id: 0
# : Time for deriv_Sb 1.807833e-02 s level: 2 proc_id: 0
# : Time for deriv_Sb 1.818175e-02 s level: 2 proc_id: 0
# : Time for sw_spinor_eo 7.193194e-03 s level: 2 proc_id: 0
# : Time for sw_spinor_eo 6.880055e-03 s level: 2 proc_id: 0
# : Time for sw_spinor_eo 6.813154e-03 s level: 2 proc_id: 0
# : Time for sw_spinor_eo 6.952154e-03 s level: 2 proc_id: 0
# : Time for sw_all 3.189702e-01 s level: 2 proc_id: 0
# ndrat_8_9: Time for ndrat_derivative 2.583201e+00 s level: 1 proc_id: 0

kostrzewa commented 2 years ago

In a run on Booster, this is what the overheads look like for the derivative of an ndcloverrat monomial. As discussed on Monday, the derivative takes about 8 seconds in total, while the solver itself accounts for only 0.56 seconds. The call to invertMultiShiftQuda takes about 1.2 seconds overall (so it is dominated by host-device transfers of the output fields, I guess), plus another 0.2 seconds to reorder all the fields on the CPU side.

All the remainder is spent in various CPU functions, all of which will need to run on the GPU in the future.

# : Time for su3_zero 2.838294e-02 s level: 2 proc_id: 0
# : Time for sw_term 4.119006e-01 s level: 2 proc_id: 0
# : Time for sw_invert_nd 1.720084e-01 s level: 2 proc_id: 0
# solve_mms_nd: Time for gamma5 9.351093e-03 s level: 3 proc_id: 0
# TM_QUDA: Using single prec. as sloppy!
# TM_QUDA: Using mixed precision CG!
# TM_QUDA: Using EO preconditioning!
# TM_QUDA: Clover field and inverse already loaded for gauge 0.009375
# TM_QUDA: mu = 0.124686399987, epsilon = 0.131505200008 kappa = 0.139426700000, csw = 1.690000000000
# TM_QUDA: Time for reorder_spinor_eo_toQuda 2.321822e-02 s level: 4 proc_id: 0
MultiShift CG: Converged after 53 iterations
 shift=0, 53 iterations, relative residual: iterated = 4.002840e-05
 shift=1, 53 iterations, relative residual: iterated = 2.889575e-08
 shift=2, 28 iterations, relative residual: iterated = 2.094232e-08
 shift=3, 14 iterations, relative residual: iterated = 2.250499e-08
 shift=4, 7 iterations, relative residual: iterated = 1.034281e-08
Refining shift 0: L2 residual inf / 1.000000e-08, heavy quark 0.000000e+00 / 0.000000e+00 (actual / requested)
CG: Convergence at 58 iterations, L2 relative residual: iterated = 9.644183e-09, true = 9.644183e-09 (requested = 1.000000e-08)
Refining shift 1: L2 residual inf / 1.000000e-08, heavy quark 0.000000e+00 / 0.000000e+00 (actual / requested)
CG: Convergence at 8 iterations, L2 relative residual: iterated = 8.038671e-09, true = 8.038671e-09 (requested = 1.000000e-08)
Refining shift 2: L2 residual inf / 1.000000e-08, heavy quark 0.000000e+00 / 0.000000e+00 (actual / requested)
CG: Convergence at 4 iterations, L2 relative residual: iterated = 8.880854e-09, true = 8.880854e-09 (requested = 1.000000e-08)
Refining shift 3: L2 residual inf / 1.000000e-08, heavy quark 0.000000e+00 / 0.000000e+00 (actual / requested)
CG: Convergence at 2 iterations, L2 relative residual: iterated = 6.822438e-09, true = 6.822438e-09 (requested = 1.000000e-08)
Refining shift 4: L2 residual inf / 1.000000e-08, heavy quark 0.000000e+00 / 0.000000e+00 (actual / requested)
CG: Convergence at 1 iterations, L2 relative residual: iterated = 2.585498e-09, true = 2.585498e-09 (requested = 1.000000e-08)
# TM_QUDA: Time for invertMultiShiftQuda 1.190052e+00 s level: 4 proc_id: 0
# TM_QUDA: Time for reorder_spinor_eo_fromQuda 2.411113e-02 s level: 5 proc_id: 0
# TM_QUDA: Time for reorder_spinor_eo_fromQuda 2.346242e-02 s level: 5 proc_id: 0
# TM_QUDA: Time for reorder_spinor_eo_fromQuda 2.339547e-02 s level: 5 proc_id: 0
# TM_QUDA: Time for reorder_spinor_eo_fromQuda 2.328414e-02 s level: 5 proc_id: 0
# TM_QUDA: Time for reorder_spinor_eo_fromQuda 2.339242e-02 s level: 5 proc_id: 0
# TM_QUDA: Time for multishift_output_overhead 1.785818e-01 s level: 4 proc_id: 0
# TM_QUDA: QpQm solve done: 126 iter / 0.561773 secs = 28021 Gflops
# TM_QUDA: Time for invert_eo_quda_twoflavour_mshift 1.415733e+00 s level: 3 proc_id: 0
# solve_mms_nd: Time for mshift_mul_r_gamm5 4.290332e-02 s level: 3 proc_id: 0
# : Time for solve_mms_nd 1.468456e+00 s level: 2 proc_id: 0
# : Time for Qsw_tau1_sub_const_ndpsi 2.259685e-01 s level: 2 proc_id: 0
# : Time for H_eo_sw_ndpsi 1.120288e-01 s level: 2 proc_id: 0
# : Time for deriv_Sb 7.598881e-02 s level: 2 proc_id: 0
# : Time for deriv_Sb 7.210845e-02 s level: 2 proc_id: 0
# : Time for H_eo_sw_ndpsi 1.095201e-01 s level: 2 proc_id: 0
# : Time for deriv_Sb 7.198362e-02 s level: 2 proc_id: 0
# : Time for deriv_Sb 7.272405e-02 s level: 2 proc_id: 0
# : Time for sw_spinor_eo 2.712106e-02 s level: 2 proc_id: 0
# : Time for sw_spinor_eo 2.744416e-02 s level: 2 proc_id: 0
# : Time for sw_spinor_eo 2.745592e-02 s level: 2 proc_id: 0
# : Time for sw_spinor_eo 2.745045e-02 s level: 2 proc_id: 0
# : Time for Qsw_tau1_sub_const_ndpsi 2.244489e-01 s level: 2 proc_id: 0
# : Time for H_eo_sw_ndpsi 1.092240e-01 s level: 2 proc_id: 0
# : Time for deriv_Sb 7.553661e-02 s level: 2 proc_id: 0
# : Time for deriv_Sb 7.345508e-02 s level: 2 proc_id: 0
# : Time for H_eo_sw_ndpsi 1.085700e-01 s level: 2 proc_id: 0
# : Time for deriv_Sb 7.200438e-02 s level: 2 proc_id: 0
# : Time for deriv_Sb 7.194646e-02 s level: 2 proc_id: 0
# : Time for sw_spinor_eo 2.697295e-02 s level: 2 proc_id: 0
# : Time for sw_spinor_eo 2.726760e-02 s level: 2 proc_id: 0
# : Time for sw_spinor_eo 2.743715e-02 s level: 2 proc_id: 0
# : Time for sw_spinor_eo 2.735236e-02 s level: 2 proc_id: 0
# : Time for Qsw_tau1_sub_const_ndpsi 2.241286e-01 s level: 2 proc_id: 0
# : Time for H_eo_sw_ndpsi 1.098317e-01 s level: 2 proc_id: 0
# : Time for deriv_Sb 7.524991e-02 s level: 2 proc_id: 0
# : Time for deriv_Sb 7.199216e-02 s level: 2 proc_id: 0
# : Time for H_eo_sw_ndpsi 1.098054e-01 s level: 2 proc_id: 0
# : Time for deriv_Sb 7.225873e-02 s level: 2 proc_id: 0
# : Time for deriv_Sb 7.243308e-02 s level: 2 proc_id: 0
# : Time for sw_spinor_eo 2.701527e-02 s level: 2 proc_id: 0
# : Time for sw_spinor_eo 2.743087e-02 s level: 2 proc_id: 0
# : Time for sw_spinor_eo 2.745948e-02 s level: 2 proc_id: 0
# : Time for sw_spinor_eo 2.744659e-02 s level: 2 proc_id: 0
# : Time for Qsw_tau1_sub_const_ndpsi 2.244854e-01 s level: 2 proc_id: 0
# : Time for H_eo_sw_ndpsi 1.103811e-01 s level: 2 proc_id: 0
# : Time for deriv_Sb 7.830236e-02 s level: 2 proc_id: 0
# : Time for deriv_Sb 7.174395e-02 s level: 2 proc_id: 0
# : Time for H_eo_sw_ndpsi 1.107424e-01 s level: 2 proc_id: 0
# : Time for deriv_Sb 7.157294e-02 s level: 2 proc_id: 0
# : Time for deriv_Sb 7.216093e-02 s level: 2 proc_id: 0
# : Time for sw_spinor_eo 2.704141e-02 s level: 2 proc_id: 0
# : Time for sw_spinor_eo 2.729130e-02 s level: 2 proc_id: 0
# : Time for sw_spinor_eo 2.729717e-02 s level: 2 proc_id: 0
# : Time for sw_spinor_eo 2.738000e-02 s level: 2 proc_id: 0
# : Time for Qsw_tau1_sub_const_ndpsi 2.252583e-01 s level: 2 proc_id: 0
# : Time for H_eo_sw_ndpsi 1.094722e-01 s level: 2 proc_id: 0
# : Time for deriv_Sb 7.461486e-02 s level: 2 proc_id: 0
# : Time for deriv_Sb 7.203536e-02 s level: 2 proc_id: 0
# : Time for H_eo_sw_ndpsi 1.106153e-01 s level: 2 proc_id: 0
# : Time for deriv_Sb 7.226443e-02 s level: 2 proc_id: 0
# : Time for deriv_Sb 7.231228e-02 s level: 2 proc_id: 0
# : Time for sw_spinor_eo 2.730029e-02 s level: 2 proc_id: 0
# : Time for sw_spinor_eo 2.732716e-02 s level: 2 proc_id: 0
# : Time for sw_spinor_eo 2.747000e-02 s level: 2 proc_id: 0
# : Time for sw_spinor_eo 2.737936e-02 s level: 2 proc_id: 0
# : Time for sw_deriv_nd 7.219962e-02 s level: 2 proc_id: 0
# : Time for sw_all 1.348470e+00 s level: 2 proc_id: 0
# ndcloverrat1: Time for ndrat_derivative 7.924570e+00 s level: 1 proc_id: 0
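To make the budget explicit from the timings quoted above: roughly 0.63 s of the 1.19 s invertMultiShiftQuda call is spent outside the solver proper (transfers and output handling), and about 6.5 s of the 7.9 s derivative are spent outside the solve entirely. A small sketch with the quoted numbers:

```c
/* Timings copied from the log above (seconds, proc_id 0). */
static const double total_derivative  = 7.924570; /* ndrat_derivative     */
static const double solve_mms_nd_time = 1.468456; /* full solve wrapper   */
static const double invert_multishift = 1.190052; /* invertMultiShiftQuda */
static const double qpqm_solve        = 0.561773; /* solver proper        */

/* part of the invertMultiShiftQuda call spent outside the solver itself,
 * i.e. host-device transfers and output handling */
double solver_call_overhead(void) { return invert_multishift - qpqm_solve; }

/* part of the derivative spent entirely outside the solve, i.e. the CPU
 * force terms (sw_term, Qsw_tau1_sub_const_ndpsi, deriv_Sb, sw_all, ...) */
double cpu_remainder(void) { return total_derivative - solve_mms_nd_time; }
```

That puts roughly half of the solver call, and over 80% of the whole derivative, in overhead and CPU force code respectively.
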

Marcogarofalo commented 2 years ago

I tried to test these changes on m100, but I got:

ERROR: Gauge force has not been built (rank 0, host r246n18, gauge_force.cu:62 in gaugeForce())
       last kernel called was (name=N4quda14ExtractGhostExINS_5gauge11FloatNOrderIdLi18ELi2ELi18EL20QudaStaggeredPhase_s0ELb1EL19QudaGhostExchange_sn2147483648ELb0EEEEE,volume=8x8x8x12,aux=GPU-offline,vol=6144,stride=3072,precision=8,geometry=4,Nc=3,r=0002,inject,dim3)

I activated QUDA for the gauge derivative with

BeginMonomial GAUGE
  Type = Iwasaki
  beta = 1.726
  Timescale = 0
  UseExternalLibrary = quda
EndMonomial

Did I do something wrong, or is this still a work in progress?

kostrzewa commented 2 years ago

I tried to test these changes on m100, but I got:

You need to compile QUDA with the gauge force enabled (-DQUDA_FORCE_GAUGE=ON) and to use the latest commit of https://github.com/lattice/quda/pull/1121

kostrzewa commented 1 year ago

Should we merge this in @Marcogarofalo ? It would certainly reduce the number of branches :)

kostrzewa commented 1 year ago

Would be good to confirm #543 first though

kostrzewa commented 1 year ago

I'm going to merge this into quda_work now.