etmc / tmLQCD

tmLQCD is a freely available software suite providing a set of tools to be used in lattice QCD simulations. It is mainly an HMC implementation (including PHMC and RHMC) for Wilson, Wilson Clover and Wilson twisted mass fermions, together with inverters for different versions of the Dirac operator. The code is fully parallelised and ships with optimisations for various modern architectures, such as commodity PC clusters and the Blue Gene family.
http://www.itkp.uni-bonn.de/~urbach/software.html
GNU General Public License v3.0

Quda regenerate mg setup #536

Closed simone-romiti closed 2 years ago

simone-romiti commented 2 years ago

This is a reply to the issue https://github.com/etmc/tmLQCD/issues/527.

Edits

I edited the quda_interface.c file as follows:

Note

We could add a global parameter, specified in the input file, that limits the number of regenerations of the MG setup. At present this limit is implicitly 1.
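A minimal sketch of what such a cap could look like, written against toy stand-ins rather than the real tmLQCD/QUDA interface. All names here (`solve_with_regen`, `solve_fn`, `regen_fn`, `toy_state`, `max_regen`) are hypothetical and only illustrate the control flow; `max_regen = 1` corresponds to the current implicit behaviour.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical callback types; these are NOT part of tmLQCD or QUDA. */
typedef int (*solve_fn)(void *ctx);   /* returns iteration count, or -1 on failure */
typedef void (*regen_fn)(void *ctx);  /* regenerates the MG setup */

/* Try to solve; on failure regenerate the setup and retry,
 * at most max_regen times. Returns the iteration count of the
 * successful solve, or -1 if every attempt failed. */
int solve_with_regen(solve_fn solve, regen_fn regen, void *ctx, int max_regen) {
  assert(max_regen >= 0);
  for (int attempt = 0; attempt <= max_regen; ++attempt) {
    int iters = solve(ctx);
    if (iters >= 0) return iters;
    if (attempt < max_regen) {
      printf("# solve failed, regenerating MG setup (%d of %d)\n",
             attempt + 1, max_regen);
      regen(ctx);
    }
  }
  return -1;
}

/* Toy model used only to exercise the wrapper: the solve fails
 * until the setup has been regenerated `needed` times. */
typedef struct {
  int regens_done;
  int needed;
} toy_state;

int toy_solve(void *ctx) {
  toy_state *s = (toy_state *)ctx;
  return (s->regens_done >= s->needed) ? 30 : -1;
}

void toy_regen(void *ctx) { ((toy_state *)ctx)->regens_done++; }
```

With a `toy_state` that needs one regeneration, `solve_with_regen(toy_solve, toy_regen, &state, 1)` succeeds on the second attempt; if two regenerations are needed, the same call returns -1, which is where the real code would terminate the run.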

Marcogarofalo commented 2 years ago

Maybe you want to merge this to quda_work_add_action instead of master branch

simone-romiti commented 2 years ago

> Maybe you want to merge this to quda_work_add_action instead of master branch

Yes, thanks

kostrzewa commented 2 years ago

Thanks. I've cleaned things up a bit and added a local static variable to break out of the recursion. This seems to work as expected now. I've explicitly let an MG setup deteriorate by not evolving it with the gauge field, and as the number of iterations begins to increase, the mechanism is triggered. The first time around, it is able to save the run. At the second occurrence it fails and the program terminates:

GCR: Convergence at 48 iterations, L2 relative residual: iterated = 2.507958e-10, true = 2.507958e-10 (requested = 3.162278e-10)
GCR: Convergence at 52 iterations, L2 relative residual: iterated = 2.348466e-10, true = 2.348466e-10 (requested = 3.162278e-10)
GCR: Convergence at 52 iterations, L2 relative residual: iterated = 2.990532e-10, true = 2.990532e-10 (requested = 3.162278e-10)
GCR: Convergence at 59 iterations, L2 relative residual: iterated = 2.469045e-10, true = 2.469045e-10 (requested = 3.162278e-10)
GCR: Convergence at 67 iterations, L2 relative residual: iterated = 2.912129e-10, true = 2.912129e-10 (requested = 3.162278e-10)
GCR: Convergence at 75 iterations, L2 relative residual: iterated = 2.611657e-10, true = 2.611657e-10 (requested = 3.162278e-10)
GCR: Convergence at 96 iterations, L2 relative residual: iterated = 2.987194e-10, true = 2.987194e-10 (requested = 3.162278e-10)
GCR: Convergence at 110 iterations, L2 relative residual: iterated = 2.938478e-10, true = 2.938478e-10 (requested = 3.162278e-10)
GCR: Convergence at 147 iterations, L2 relative residual: iterated = 3.078089e-10, true = 3.078089e-10 (requested = 3.162278e-10)
GCR: Convergence at 169 iterations, L2 relative residual: iterated = 2.863158e-10, true = 2.863158e-10 (requested = 3.162278e-10)
GCR: Convergence at 126 iterations, L2 relative residual: iterated = 3.055091e-10, true = 3.055091e-10 (requested = 3.162278e-10)
GCR: Convergence at 151 iterations, L2 relative residual: iterated = 2.725082e-10, true = 2.725082e-10 (requested = 3.162278e-10)
GCR: Convergence at 161 iterations, L2 relative residual: iterated = 2.883943e-10, true = 2.883943e-10 (requested = 3.162278e-10)
GCR: Convergence at 207 iterations, L2 relative residual: iterated = 3.118657e-10, true = 3.118657e-10 (requested = 3.162278e-10)
### -> setup is regenerated here
GCR: Convergence at 34 iterations, L2 relative residual: iterated = 2.830836e-10, true = 2.830836e-10 (requested = 3.162278e-10)
GCR: Convergence at 37 iterations, L2 relative residual: iterated = 1.898408e-10, true = 1.898408e-10 (requested = 3.162278e-10)
GCR: Convergence at 28 iterations, L2 relative residual: iterated = 2.033467e-10, true = 2.033467e-10 (requested = 3.162278e-10)
GCR: Convergence at 29 iterations, L2 relative residual: iterated = 2.099869e-10, true = 2.099869e-10 (requested = 3.162278e-10)
### -> no convergence possible any more
GCR: Convergence at 300 iterations, L2 relative residual: iterated = 3.994108e-02, true = 3.994108e-02 (requested = 3.162278e-10)
GCR: Convergence at 300 iterations, L2 relative residual: iterated = 6.097005e-01, true = 6.097005e-01 (requested = 3.162278e-10)
GCR: Convergence at 300 iterations, L2 relative residual: iterated = 1.555925e-01, true = 1.555925e-01 (requested = 3.162278e-10)
GCR: Convergence at 300 iterations, L2 relative residual: iterated = 8.021428e-01, true = 8.021428e-01 (requested = 3.162278e-10)
### program terminates

The fact that the solves just after the setup regeneration work and the next four solves fail suggests to me that there is still a bit of a logic problem present. To be fair, however, this was a run with two monomials using the MG and we know that it's problematic when these have very different rho values.
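For readers unfamiliar with the pattern, the idea of using a local static variable to break out of the recursion can be sketched as follows. This is a toy illustration, not the code in quda_interface.c: `guarded_solve`, `toy_inner_solve`, `toy_regenerate` and the global flags are hypothetical names, and whether the flag is re-armed after the retry (as done here) is an assumption of this sketch.

```c
#include <assert.h>

/* Hypothetical toy state; none of this is tmLQCD or QUDA code. */
int setup_quality = 0; /* 0 = deteriorated setup, 1 = usable setup */
int regen_helps = 1;   /* whether regenerating restores the setup */
int regen_count = 0;   /* how many regenerations were performed */

/* Returns an iteration count, or -1 if the solve did not converge. */
int toy_inner_solve(void) { return setup_quality ? 30 : -1; }

void toy_regenerate(void) {
  setup_quality = regen_helps;
  ++regen_count;
}

int guarded_solve(void) {
  static int in_retry = 0; /* persists across calls, including the recursive one */
  int iters = toy_inner_solve();
  if (iters < 0 && !in_retry) {
    in_retry = 1;            /* the nested call below will not retry again */
    toy_regenerate();
    iters = guarded_solve(); /* at most one level of recursion */
    in_retry = 0;            /* re-arm for later, independent failures */
  }
  return iters;
}
```

If the regeneration helps, the nested call converges and the caller never sees the failure. If it does not, the nested call returns -1 without regenerating again, so the recursion depth stays at one and the caller can decide to abort.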

kostrzewa commented 2 years ago

Okay, I fixed the issue that showed up at the end of my test there.

I've also replaced resetting the setup with refreshing it instead, which should be sufficient to make the MG work properly when it deteriorates. We'll see. Refreshing is much faster than resetting, of course.

kostrzewa commented 2 years ago

@simone-romiti will you be able to run some tests of this? I don't know if I caught all possible ways this can go wrong...

simone-romiti commented 2 years ago

I've tested your solution and it also works as expected for me. For reference, I did it with the help of ad hoc static int variables here and there as follows: