deriv_mg_tune: when switching the configuration, the initial MG setup is used for the first inversion

When the MG autotuner has found a good setup on the first gauge configuration:

tuning_iteration: 300/300
cur_tuning_lvl: 0
cur_tuning_dir: mg_smoother_tol
steps_done_in_cur_dir: 0

             mg_mu_factor: (1.000000, 2.250000, 90.000000) -> (1.000000, 2.250000, 90.000000)
 mg_coarse_solver_maxiter: (45, 50, 25) -> (45, 50, 25)
     mg_coarse_solver_tol: (0.100000, 0.400000, 0.100000) -> (0.100000, 0.400000, 0.100000)
               mg_nu_post: (3, 1, 2) -> (3, 1, 2)
                mg_nu_pre: (0, 0, 0) -> (0, 0, 0)
          mg_smoother_tol: (0.200000, 0.100000, 0.200000) -> (0.300000, 0.100000, 0.200000)
                 mg_omega: (0.900000, 0.850000, 0.850000) -> (0.900000, 0.850000, 0.850000)

# TM_QUDA: Time for updateMultigridQuda 3.959178e+00 s level: 4 proc_id: 0 /DERIV_MG_TUNE/cloverdetlight:cloverdet_derivative/solve_degenerate/invert_eo_degenerate_quda/updateMultigridQuda
GCR: Convergence at 80 iterations, L2 relative residual: iterated = 3.030996e-11, true = 3.030996e-11 (requested = 3.162278e-11)
# TM_QUDA: Time for invertQuda 1.192162e+01 s level: 4 proc_id: 0 /DERIV_MG_TUNE/cloverdetlight:cloverdet_derivative/solve_degenerate/invert_eo_degenerate_quda/invertQuda

QUDA-MG param tuner: BEST SET OF PARAMETERS
-------------------------------------------
             mg_mu_factor: (1.000000, 2.250000, 90.000000)
 mg_coarse_solver_maxiter: (45, 50, 25)
     mg_coarse_solver_tol: (0.100000, 0.400000, 0.100000)
               mg_nu_post: (3, 1, 2)
                mg_nu_pre: (0, 0, 0)
          mg_smoother_tol: (0.200000, 0.100000, 0.200000)
                 mg_omega: (0.900000, 0.850000, 0.850000)
Timing: 11.767856, Iters: 80
-------------------------------------------

and the configuration is switched:

# Trying to read gauge field from file conf.0240 in double precision.
# Constructing LEMON reader for file conf.0240 ...
found header xlf-info, will now read the message
found header ildg-format, will now read the message
found header ildg-binary-data, will now read the message
# Time spent reading 309 Gb was 27.9 s.
# Reading speed: 11.1 Gb/s (43.3 Mb/s per MPI process).
found header scidac-checksum, will now read the message
# Scidac checksums for gaugefield conf.0240:
#   Calculated            : A = 0x6ce26943 B = 0x57720413.
#   Read from LIME headers: A = 0x6ce26943 B = 0x57720413.
# Reading ildg-format record:
#   Precision = 64 bits (double).
#   Lattice size: LX = 128, LY = 128, LZ = 128, LT = 256.
# Input parameters:
#   Precision = 64 bits (double).
#   Lattice size: LX = 128, LY = 128, LZ = 128, LT = 256.
# Finished reading gauge field.
# Computed plaquette value: 0.583358774141.

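The first inversion on the new configuration is again done with the initial MG setup with which the tuner was started (coarse mu scaling factors 1, 1, 30 rather than the tuned 1, 2.25, 90). It ends after 350 iterations at a residual of 2.3e-04, far from the requested 3.2e-11: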

# TM_QUDA: mu = 0.000540000000, kappa = 0.137972174000, csw = 1.611200000000
# TM_QUDA: using MG solver to invert operator with 2kappamu = 0.000149009948
# TM_QUDA: MG level 0, extent of (xyzt) dim 0: 32
# TM_QUDA: MG aggregation size set to: 4
# TM_QUDA: MG level 0, extent of (xyzt) dim 1: 32
# TM_QUDA: MG aggregation size set to: 4
# TM_QUDA: MG level 0, extent of (xyzt) dim 2: 32
# TM_QUDA: MG aggregation size set to: 4
# TM_QUDA: MG level 0, extent of (xyzt) dim 3: 64
# TM_QUDA: MG aggregation size set to: 4
# TM_QUDA: MG setting coarse mu scaling factor on level 0 to 1.000000
# TM_QUDA: MG level 1, extent of (xyzt) dim 0: 8
# TM_QUDA: MG aggregation size set to: 4
# TM_QUDA: MG level 1, extent of (xyzt) dim 1: 8
# TM_QUDA: MG aggregation size set to: 4
# TM_QUDA: MG level 1, extent of (xyzt) dim 2: 8
# TM_QUDA: MG aggregation size set to: 2
# TM_QUDA: MG level 1, extent of (xyzt) dim 3: 16
# TM_QUDA: MG aggregation size set to: 2
# TM_QUDA: MG setting coarse mu scaling factor on level 1 to 1.000000
# TM_QUDA: MG setting coarse mu scaling factor on level 2 to 30.000000
# TM_QUDA: Destroying MG Preconditioner Setup
# TM_QUDA: Performing MG Preconditioner Setup for gauge_id: 3.000000
# TM_QUDA: Generating MG Setup with mu = 0.000540000000 instead of 0.000540000000
# TM_QUDA: Time for MG_Preconditioner_Setup 3.509506e+02 s level: 4 proc_id: 0 /DERIV_MG_TUNE/cloverdetlight:cloverdet_derivative/solve_degenerate/invert_eo_degenerate_quda/MG_Preconditioner_Setup
# TM_QUDA: Time for reorder_spinor_eo_toQuda 5.094417e-02 s level: 4 proc_id: 0 /DERIV_MG_TUNE/cloverdetlight:cloverdet_derivative/solve_degenerate/invert_eo_degenerate_quda/reorder_spinor_eo_toQuda
GCR: Convergence at 350 iterations, L2 relative residual: iterated = 2.314415e-04, true = 2.314415e-04 (requested = 3.162278e-11)
# TM_QUDA: Time for invertQuda 1.227894e+03 s level: 4 proc_id: 0 /DERIV_MG_TUNE/cloverdetlight:cloverdet_derivative/solve_degenerate/invert_eo_degenerate_quda/invertQuda

and only then will the tuned setup be applied:

             mg_mu_factor: (1.000000, 2.250000, 90.000000) -> (1.000000, 2.250000, 90.000000)
 mg_coarse_solver_maxiter: (45, 50, 25) -> (45, 50, 25)
     mg_coarse_solver_tol: (0.100000, 0.400000, 0.100000) -> (0.100000, 0.400000, 0.100000)
               mg_nu_post: (3, 1, 2) -> (3, 1, 2)
                mg_nu_pre: (0, 0, 0) -> (0, 0, 0)
          mg_smoother_tol: (0.200000, 0.100000, 0.200000) -> (0.200000, 0.100000, 0.200000)
                 mg_omega: (0.900000, 0.850000, 0.850000) -> (0.900000, 0.850000, 0.850000)

# TM_QUDA: Time for updateMultigridQuda 3.961939e+00 s level: 4 proc_id: 0 /DERIV_MG_TUNE/cloverdetlight:cloverdet_derivative/solve_degenerate/invert_eo_degenerate_quda/updateMultigridQuda
GCR: Convergence at 81 iterations, L2 relative residual: iterated = 2.871780e-11, true = 2.871780e-11 (requested = 3.162278e-11)
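
The GCR lines in the logs above print "Convergence" even for the solve that ended at a residual of 2.3e-04, so comparing the reported true residual with the requested one is a more reliable check. A minimal sketch of such a check, assuming the line format is exactly as it appears in these logs (the helper name `solve_converged` is hypothetical and not part of tmLQCD or QUDA):

```python
import re

# Pattern for the GCR summary line as it appears in the logs above.
GCR_LINE = re.compile(
    r"GCR: Convergence at (?P<iters>\d+) iterations, L2 relative residual: "
    r"iterated = (?P<iterated>[0-9.eE+-]+), true = (?P<true>[0-9.eE+-]+) "
    r"\(requested = (?P<requested>[0-9.eE+-]+)\)"
)


def solve_converged(line: str) -> bool:
    """Return True if the true residual reached the requested residual."""
    match = GCR_LINE.search(line)
    if match is None:
        raise ValueError("not a GCR convergence line")
    return float(match["true"]) <= float(match["requested"])


# The two GCR lines from the new configuration, copied from the logs above.
untuned = ("GCR: Convergence at 350 iterations, L2 relative residual: "
           "iterated = 2.314415e-04, true = 2.314415e-04 (requested = 3.162278e-11)")
tuned = ("GCR: Convergence at 81 iterations, L2 relative residual: "
         "iterated = 2.871780e-11, true = 2.871780e-11 (requested = 3.162278e-11)")

print(solve_converged(untuned), solve_converged(tuned))
```

Applied to the two lines from the new configuration, the check flags the first inversion as not converged and the second as converged.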

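The correct behaviour would be for the tuned setup to be used already for the very first inversion on the new configuration. The current behaviour can be extremely wasteful if the initial setup does not converge or is very slow: in the log above, the first inversion with the initial setup took about 1228 s and did not reach the requested residual, whereas an inversion with the tuned setup took about 12 s on the previous configuration.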
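
The requested behaviour can be sketched as a simple selection rule: whenever the MG setup has to be regenerated, use the tuner's best parameter set if one exists, and fall back to the initial set only if the tuner has no result yet. The following is an illustrative Python sketch only; the names `MGParams`, `TunerState` and `params_for_setup` are hypothetical and do not correspond to actual tmLQCD or QUDA code. Only `mg_mu_factor` is modelled, with the initial and tuned values taken from the logs above.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class MGParams:
    """Hypothetical container for one MG parameter set (one value per level)."""
    mg_mu_factor: Tuple[float, ...]


@dataclass
class TunerState:
    """Hypothetical tuner state: the starting set and the best set found so far."""
    initial: MGParams
    best: Optional[MGParams] = None  # None until the tuner has produced a result


def params_for_setup(state: TunerState) -> MGParams:
    """Choose the parameters for (re)generating the MG setup.

    Prefer the tuned set if it exists, so that a setup triggered by a
    configuration switch does not fall back to the initial parameters.
    """
    return state.best if state.best is not None else state.initial


# Values taken from the logs: initial factors (1, 1, 30), tuned (1, 2.25, 90).
state = TunerState(initial=MGParams((1.0, 1.0, 30.0)))
before_tuning = params_for_setup(state)   # no tuned set yet: initial set is used

state.best = MGParams((1.0, 2.25, 90.0))
after_switch = params_for_setup(state)    # new configuration: tuned set is used

print(before_tuning.mg_mu_factor, after_switch.mg_mu_factor)
```

With this rule the setup generated after the configuration switch would already use the tuned coarse mu scaling factors, so the slow first inversion would be avoided.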

etmc/tmLQCD, issue #609