etmc / tmLQCD

tmLQCD is a freely available software suite providing a set of tools for lattice QCD simulations. It is mainly an HMC implementation (including PHMC and RHMC) for Wilson, Wilson clover and Wilson twisted mass fermions, together with inverters for different versions of the Dirac operator. The code is fully parallelised and ships with optimisations for various modern architectures, such as commodity PC clusters and the Blue Gene family.
GNU General Public License v3.0

automatic tuning of (QUDA)-MG parameters [WIP, DO NOT MERGE] #537

Open kostrzewa opened 2 years ago

kostrzewa commented 2 years ago

Started work on a simple algorithm to automatically tune those (QUDA-)MG parameters that can be changed without rebuilding the setup.

kostrzewa commented 2 years ago

The preliminary idea for the input is as follows, but this will have to be fine-tuned depending on how the algorithm turns out in the end:

BeginExternalInverter QUDA
  Pipeline = 24
  gcrNkrylov = 24
  MGNumberOfLevels = 3
  MGNumberOfVectors = 24, 32
  MGSetupSolver = cg
  MGSetup2KappaMu = 0.000224102400
  MGVerbosity = summarize, silent, silent
  MGSetupSolverTolerance = 5e-7, 5e-7
  MGSetupMaxSolverIterations = 1500, 1500
  MGCoarseSolverType = gcr, gcr, cagcr
  MGSmootherType = cagcr, cagcr, cagcr
  MGBlockSizesX = 4,3
  MGBlockSizesY = 4,3
  MGBlockSizesZ = 3,2
  MGBlockSizesT = 4,2

  MGCoarseMuFactor = 1.0, 1.0, 20.0
  MGCoarseMaxSolverIterations = 50, 50, 50
  MGCoarseSolverTolerance = 0.1, 0.1, 0.1
  MGSmootherPostIterations = 2, 2, 2
  MGSmootherPreIterations = 0, 0, 0
  MGSmootherTolerance = 0.1, 0.1, 0.1
  MGOverUnderRelaxationFactor = 0.85, 0.85, 0.85


BeginTuneMGParams QUDA
  MGCoarseMuFactorSteps = 10, 10, 10
  MGCoarseMuFactorDelta = 0.1, 0.2, 5

  MGCoarseMaxSolverIterationsSteps = 10, 10, 10
  MGCoarseMaxSolverIterationsDelta = -5, -5, -5

  MGCoarseSolverToleranceSteps = 10, 10, 10
  MGCoarseSolverToleranceDelta = 0.05, 0.05, 0.05

  MGSmootherPreIterationsSteps = 4, 4, 4
  MGSmootherPreIterationsDelta = 1, 1, 1

  MGSmootherPostIterationsSteps = 4, 4, 4
  MGSmootherPostIterationsDelta = 1, 1, 1

  MGSmootherToleranceSteps = 4, 4, 4
  MGSmootherToleranceDelta = 0.1, 0.1, 0.1

  MGOverUnderRelaxationFactorSteps = 4, 4, 4
  MGOverUnderRelaxationFactorDelta = 0.05, 0.05, 0.05

  MGTuningIterations = 1000

  # when in a particular tuning step the improvement is less than 1%, we
  # move on to the next parameter to be tuned
  MGTuningTolerance = 0.99

An adaptive process may be added to dynamically reduce the search space if certain parameter changes don't affect the time to solution (tts).
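
For illustration only, the kind of search that the input above suggests could be sketched as follows. This is not the code from this PR: the function `tune`, the cost function `mock_solve_time` and the parameter name `mu_factor` are hypothetical. The acceptance rule (keep a change only if the new timing is below `tolerance` times the best timing so far, otherwise move on) is one reading of the comment next to `MGTuningTolerance`. Non-convergent solves are not handled here.

```python
from typing import Callable, Dict, List, Tuple

Params = Dict[str, List[float]]  # parameter name -> one value per MG level


def tune(
    start: Params,
    deltas: Params,
    steps: Dict[str, List[int]],
    solve_time: Callable[[Params], float],
    tolerance: float = 0.99,     # plays the role of MGTuningTolerance
    max_iterations: int = 1000,  # plays the role of MGTuningIterations
) -> Tuple[Params, float]:
    """Greedy one-parameter-at-a-time search (hypothetical sketch)."""
    best = {name: list(values) for name, values in start.items()}
    best_time = solve_time(best)
    iterations = 0
    for name in list(start):
        for level in range(len(start[name])):
            for _ in range(steps[name][level]):
                if iterations >= max_iterations:
                    return best, best_time
                iterations += 1
                # Apply one more delta to the current best value.
                trial = {k: list(v) for k, v in best.items()}
                trial[name][level] += deltas[name][level]
                t = solve_time(trial)
                if t < tolerance * best_time:
                    best, best_time = trial, t  # improvement of at least 1%
                else:
                    break  # move on to the next level / parameter
    return best, best_time


def mock_solve_time(p: Params) -> float:
    """Made-up cost with a minimum at mu_factor = (small, 40)."""
    return abs(p["mu_factor"][1] - 40.0) / 10.0 + 1.0 + p["mu_factor"][0]


best, best_time = tune(
    start={"mu_factor": [1.0, 20.0]},
    deltas={"mu_factor": [0.5, 5.0]},
    steps={"mu_factor": [3, 10]},
    solve_time=mock_solve_time,
)
print(best, best_time)
```

In a real setting `solve_time` would wrap an actual solve and return the measured time to solution; the mock cost is only there to make the sketch runnable.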

kostrzewa commented 2 years ago

I will probably change the input format such that one doesn't specify min/max and a number of steps, but rather a "delta" for each parameter and level and the number of steps for which this delta should be applied.
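
To make the delta/steps idea concrete, here is a minimal sketch of the values that a single pass would visit for one parameter on one level, assuming the delta is simply added `steps` times to the current value. The helper name `single_pass_values` is hypothetical, and whether the tuner restarts such a pass from an updated value later on is not specified here.

```python
from typing import List


def single_pass_values(current: float, delta: float, steps: int) -> List[float]:
    """Values visited when `delta` is applied `steps` times to `current`.

    Rounding to 10 decimals only suppresses floating-point noise such as
    0.15000000000000002 in the printed values.
    """
    return [round(current + k * delta, 10) for k in range(1, steps + 1)]


# Coarsest-level mu factor: current value 20.0, delta 5, 10 steps.
mu_values = single_pass_values(20.0, 5, 10)
# A tolerance-like parameter: current value 0.1, delta 0.05, 3 steps.
tol_values = single_pass_values(0.1, 0.05, 3)
print(mu_values)
print(tol_values)
```

Compared to a min/max specification, the range covered is implied by the current value, the delta and the step count, and a negative delta searches downwards.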

The current "algorithm" (I use the word very cautiously) can start with a completely useless setup which doesn't converge and finds something which does. Unfortunately, it doesn't yet find a better minimum than I can find by hand. However, I've tested this only on small lattices (16c32 and 24c48, albeit at the physical point) and I suspect that it will work better on larger lattices.

kostrzewa commented 2 years ago

Funnily enough, this actually works and seems to find parameter sets that I would have never considered. For example, on cA211.12.48, this is a parameter set that it evolves to:

             mg_mu_factor: (1.000000, 3.000000, 27.000000)
 mg_coarse_solver_maxiter: (20, 10, 50)
     mg_coarse_solver_tol: (0.200000, 0.400000, 0.200000)
               mg_nu_post: (6, 6, 8)
                mg_nu_pre: (0, 4, 2)
          mg_smoother_tol: (0.200000, 0.200000, 0.100000)
                 mg_omega: (0.950000, 1.050000, 0.850000)
Timing: 1.989135, Iters: 51

kostrzewa commented 1 year ago

First experience on a large volume (64c128) at the physical point suggests that this tuner, surprisingly, really seems to work. With the tuning input

BeginTuneMGParams QUDA
  MGCoarseMuFactorSteps = 10, 10, 11
  MGCoarseMuFactorDelta = 0.25, 0.5, 5

  MGCoarseMaxSolverIterationsSteps = 10, 10, 10
  MGCoarseMaxSolverIterationsDelta = 5, 5, 5

  MGCoarseSolverToleranceSteps = 10, 10, 10
  MGCoarseSolverToleranceDelta = 0.05, 0.05, 0.05

  MGSmootherPreIterationsSteps = 2, 2, 2
  MGSmootherPreIterationsDelta = 1, 1, 1

  MGSmootherPostIterationsSteps = 2, 2, 2
  MGSmootherPostIterationsDelta = 2, 2, 2

  MGSmootherToleranceSteps = 4, 4, 4
  MGSmootherToleranceDelta = 0.1, 0.1, 0.1

  MGOverUnderRelaxationFactorSteps = 3, 3, 3
  MGOverUnderRelaxationFactorDelta = 0.05, 0.05, 0.05

  MGTuningIterations = 1000

  # when in a particular tuning step the improvement is less than 1%, we
  # move on to the next parameter to be tuned
  MGTuningTolerance = 0.99

and starting from

BeginExternalInverter QUDA
  Pipeline = 24
  gcrNkrylov = 24
  MGNumberOfLevels = 3
  MGNumberOfVectors = 24, 32
  MGSetupSolver = cg
  MGSetup2KappaMu = 0.000215613244
  MGVerbosity = silent, silent, silent
  MGSetupSolverTolerance = 5e-7, 5e-7
  MGSetupMaxSolverIterations = 1500, 1500
  MGCoarseSolverType = gcr, gcr, cagcr
  MGSmootherType = cagcr, cagcr, cagcr
  MGBlockSizesX = 4,2
  MGBlockSizesY = 4,2
  MGBlockSizesZ = 4,2
  MGBlockSizesT = 4,2
  MGResetSetupMDUThreshold = 1.0
  MGRefreshSetupMDUThreshold = 0.0263
  MGRefreshSetupMaxSolverIterations = 30, 30

  MGCoarseMuFactor = 1.0, 1.0, 20.0
  MGCoarseMaxSolverIterations = 15, 15, 15
  MGCoarseSolverTolerance = 0.1, 0.1, 0.1
  MGSmootherPostIterations = 2, 2, 2
  MGSmootherPreIterations = 0, 0, 0
  MGSmootherTolerance = 0.1, 0.1, 0.1
  MGOverUnderRelaxationFactor = 0.90, 0.90, 0.90  

the tuner takes the solver from non-convergence through a successful solve in around 9 seconds (on Meluxina)

             mg_mu_factor: (1.000000, 1.000000, 65.000000)
 mg_coarse_solver_maxiter: (15, 15, 15)
     mg_coarse_solver_tol: (0.100000, 0.100000, 0.100000)
               mg_nu_post: (2, 2, 2)
                mg_nu_pre: (0, 0, 0)
          mg_smoother_tol: (0.100000, 0.100000, 0.100000)
                 mg_omega: (0.900000, 0.900000, 0.900000)
Timing: 8.628203, Iters: 112

down to a solve in 2.5 seconds with parameters that I would not have thought to choose by hand:

             mg_mu_factor: (1.000000, 4.000000, 120.000000)
 mg_coarse_solver_maxiter: (15, 25, 30)
     mg_coarse_solver_tol: (0.100000, 0.200000, 0.150000)
               mg_nu_post: (2, 6, 10)
                mg_nu_pre: (0, 0, 6)
          mg_smoother_tol: (0.200000, 0.200000, 0.200000)
                 mg_omega: (0.900000, 0.900000, 0.950000)
Timing: 2.501800, Iters: 64
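
As a quick check of what these two outputs imply, the numbers can be compared directly. The snippet below only uses the `Timing` and `Iters` values quoted above; the variable names are made up for illustration.

```python
# Timings (seconds) and iteration counts copied from the two tuner outputs.
first_convergent = {"time_s": 8.628203, "iters": 112}
tuned = {"time_s": 2.501800, "iters": 64}

speedup = first_convergent["time_s"] / tuned["time_s"]
iter_ratio = first_convergent["iters"] / tuned["iters"]
per_iter_before = first_convergent["time_s"] / first_convergent["iters"]
per_iter_after = tuned["time_s"] / tuned["iters"]

print(f"speed-up in time to solution: {speedup:.2f}x")
print(f"reduction in outer iterations: {iter_ratio:.2f}x")
print(f"time per iteration: {per_iter_before:.4f} s -> {per_iter_after:.4f} s")
```

The time to solution drops by a factor of roughly 3.4 while the iteration count drops by a factor of 1.75, so the time per outer iteration also goes down.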

kostrzewa commented 1 year ago

Using these parameters in practice and comparing between the "hand-tuned" setup on the left and the auto-tuned setup on the right:

MGCoarseMuFactor = 1.0, 1.0, 80.0              ->  MGCoarseMuFactor = 1.0, 4.0, 120.0                                                                  
MGCoarseMaxSolverIterations = 30, 30, 30       ->  MGCoarseMaxSolverIterations = 15, 25, 30
MGCoarseSolverTolerance = 0.3, 0.2, 0.15       ->  MGCoarseSolverTolerance = 0.1, 0.2, 0.15
MGSmootherPostIterations = 4, 4, 6             ->  MGSmootherPostIterations = 2, 6, 10
MGSmootherPreIterations = 0, 0, 1              ->  MGSmootherPreIterations = 0, 0, 6
MGSmootherTolerance = 0.2, 0.2, 0.2            ->  MGSmootherTolerance = 0.2, 0.2, 0.2 
MGOverUnderRelaxationFactor = 1.00, 0.90, 0.90 ->  MGOverUnderRelaxationFactor = 0.90, 0.90, 0.95  

I seem to obtain very stable timings so far (red is the auto-tuned MG setup):


kostrzewa commented 1 year ago

After some more runtime, extracting the time to solution of the two MG setups, I get the following histograms after resampling to get the same number of solver calls in both cases (logarithmic count axis):


kostrzewa commented 1 year ago

Doing the same on an L=48 simulation at the physical point similarly leads to a very nice improvement. Below, untuned refers to a hand-selected MG setup, mk1tuned refers to the auto-tuning result after about 100 tuning iterations, and mk2tuned refers to the setup which was reached at the end of the tuning procedure.

The two "peaks" correspond to inversions related to cloverdetratio2light (below and around 1 second in the tuned setups) and cloverdetratio3light (from 1.5 seconds and up). Timings from both the HB/ACC steps and the derivative are included in the histograms.


The final setup is:

  MGCoarseMuFactor = 1.0, 2.5, 105.0
  MGCoarseMaxSolverIterations = 15, 15, 15
  MGCoarseSolverTolerance = 0.1, 0.35, 0.25
  MGSmootherPostIterations = 2, 2, 4
  MGSmootherPreIterations = 0, 0, 1
  MGSmootherTolerance = 0.2, 0.1, 0.2
  MGOverUnderRelaxationFactor = 0.90, 0.90, 1.00  

kostrzewa commented 1 year ago

note to self from meeting just now: it should be possible to integrate this directly in the HMC