kostrzewa closed this issue 2 years ago
found the bugger, had to do with the gamma basis used for single parity solves, which was not reset for standard solves
I'm afraid this is back. Not quite sure what's causing it now...
@simone-romiti Would you be willing to investigate this?
Please see https://github.com/etmc/tmLQCD/pull/519
@simone-romiti Were you able to confirm the issue with the residue in the meantime?
I wasn't able to reproduce that residue issue, but I have the feeling it came from multiple inconsistent definitions of c_sw in the input file.
Can you explain what you did to attempt a reproduction? In my tests the residual in an online measurement during the HMC is wrong. Are you saying that you have an output file of an HMC run where the residue for the online measurement is such that the solve appears to have converged correctly?
> I have the feeling it came from multiple inconsistent definitions of c_sw in the input file.
We've already excluded that this is a culprit as I was never doing this.
I think that this may be solved by #525. For a very small lattice
T=8
L=4
Measurements = 4
Startcondition = hot
InitialStoreCounter = 0
I compared a run on the host using CG (stored in onlinemeas.000003_ref) with a run on the device using MG:
garofalo@qbig:/qbigwork/garofalo/tmLQCD/build$ sdiff onlinemeas.000003 onlinemeas.000003_ref
1 1 0 5.566760e+01 0.000000e+00 1 1 0 5.566760e+01 0.000000e+00
1 1 1 6.254904e+00 6.067758e+00 1 1 1 6.254904e+00 6.067758e+00
1 1 2 1.112194e+00 8.851545e-01 1 1 2 1.112194e+00 8.851545e-01
1 1 3 1.655930e-01 1.519360e-01 1 1 3 1.655930e-01 1.519360e-01
1 1 4 5.293340e-02 0.000000e+00 1 1 4 5.293340e-02 0.000000e+00
2 1 0 -6.644760e-01 0.000000e+00 2 1 0 -6.644760e-01 0.000000e+00
2 1 1 2.595605e+00 -3.244143e+00 2 1 1 2.595605e+00 -3.244143e+00
2 1 2 5.945404e-01 -4.227463e-01 2 1 2 5.945404e-01 -4.227463e-01
2 1 3 7.036322e-02 -6.507322e-02 2 1 3 7.036322e-02 -6.507322e-02
2 1 4 5.109316e-03 0.000000e+00 2 1 4 5.109316e-03 0.000000e+00
6 1 0 -2.165718e+00 0.000000e+00 6 1 0 -2.165718e+00 0.000000e+00
6 1 1 5.403535e-02 -1.337601e-01 6 1 1 5.403535e-02 -1.337601e-01
6 1 2 -1.611790e-02 1.086402e-02 | 6 1 2 -1.611791e-02 1.086402e-02
6 1 3 7.483443e-03 6.764100e-04 | 6 1 3 7.483443e-03 6.764102e-04
6 1 4 -1.734317e-03 0.000000e+00 6 1 4 -1.734317e-03 0.000000e+00
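A comparison like the sdiff above can also be done with a tolerance, so that last-digit differences between the host and device runs (as in the `6 1 2` and `6 1 3` rows) are not flagged. A minimal sketch, assuming each line holds three integer labels followed by two floating-point values; the function names and the tolerance are illustrative and not part of tmLQCD:

```python
import math

def parse_onlinemeas(text):
    """Map (label1, label2, t) -> (value1, value2); the column meaning is assumed."""
    rows = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 5:
            continue  # skip blank or malformed lines
        key = tuple(int(f) for f in fields[:3])
        rows[key] = (float(fields[3]), float(fields[4]))
    return rows

def compare_onlinemeas(text_a, text_b, rel_tol=1e-5):
    """Return the keys whose values differ by more than rel_tol (relative)."""
    a, b = parse_onlinemeas(text_a), parse_onlinemeas(text_b)
    mismatches = []
    for key in sorted(set(a) | set(b)):
        if key not in a or key not in b:
            mismatches.append(key)  # row present in only one file
            continue
        for va, vb in zip(a[key], b[key]):
            if not math.isclose(va, vb, rel_tol=rel_tol, abs_tol=0.0):
                mismatches.append(key)
                break
    return mismatches

# The two rows that sdiff marked with "|" above.
device = "6 1 2 -1.611790e-02 1.086402e-02\n6 1 3 7.483443e-03 6.764100e-04\n"
host = "6 1 2 -1.611791e-02 1.086402e-02\n6 1 3 7.483443e-03 6.764102e-04\n"
print(compare_onlinemeas(device, host))
```

With a relative tolerance of 1e-5 the two marked rows count as equal, whereas a genuinely wrong solve would show up as a list of mismatching keys.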
Also, the residue looks ok to me
# TM_QUDA: Updating MG Preconditioner Setup for gauge_id: 0.044000
# TM_QUDA: Time for MG_Preconditioner_Setup_Update 2.376449e-02 s level: 3 proc_id: 0 /HMC/correlators_measurement/invert_eo_quda/MG_Preconditioner_Setup_Update
# TM_QUDA: Time for reorder_spinor_toQuda 2.507400e-05 s level: 3 proc_id: 0 /HMC/correlators_measurement/invert_eo_quda/reorder_spinor_toQuda
Source: 768
Prepared source = 673.325
Prepared solution = 0
Prepared source post mass rescale = 673.325
Creating a GCR solver
GCR: 0 iterations, <r,r> = 6.733246e+02, |r|/|b| = 1.000000e+00
GCR: 1 iterations, <r,r> = 8.976795e-01, |r|/|b| = 3.651307e-02
GCR: 2 iterations, <r,r> = 3.241824e-03, |r|/|b| = 2.194232e-03
GCR: 3 iterations, <r,r> = 1.035675e-05, |r|/|b| = 1.240222e-04
GCR: 4 iterations, <r,r> = 3.436469e-08, |r|/|b| = 7.144041e-06
GCR (restart): 1 iterations, <r,r> = 3.438203e-08, |r|/|b| = 7.145843e-06
GCR: 5 iterations, <r,r> = 9.722416e-11, |r|/|b| = 3.799923e-07
GCR: 6 iterations, <r,r> = 2.979178e-13, |r|/|b| = 2.103468e-08
GCR: 7 iterations, <r,r> = 9.627888e-16, |r|/|b| = 1.195785e-09
GCR: number of restarts = 1
GCR: Convergence at 7 iterations, L2 relative residual: iterated = 1.195750e-09, true = 1.195750e-09 (requested = 3.853787e-09)
Solution = 1781.02
Reconstructed solution: 2251.46
# TM_QUDA: Time for invertQuda 1.998980e-02 s level: 3 proc_id: 0 /HMC/correlators_measurement/invert_eo_quda/invertQuda
> Also, the residue looks ok to me
Awesome. Do you still have the next line(s) of the output, which should contain tmLQCD's residual check (rather than QUDA's residual, which always appeared to be correct)?
Maybe tmLQCD computes the squared residue
Reconstructed solution: 2251.46
# TM_QUDA: Time for invertQuda 1.998980e-02 s level: 3 proc_id: 0 /HMC/correlators_measurement/invert_eo_quda/invertQuda
# TM_QUDA: Done: 7 iter / 0.018373 secs = 78.5877 Gflops
# TM_QUDA: Time for reorder_spinor_fromQuda 2.966800e-05 s level: 3 proc_id: 0 /HMC/correlators_measurement/invert_eo_quda/reorder_spinor_fromQuda
# TM_QUDA: Time for invert_eo_quda 4.488409e-02 s level: 2 proc_id: 0 /HMC/correlators_measurement/invert_eo_quda
# Inversion done in 7 iterations, squared residue = 7.625053e-16!
# Inversion done in 4.74e-02 sec.
# : Time for correlators_measurement 4.907240e-02 s level: 1 proc_id: 0 /HMC/correlators_measurement
> Maybe tmLQCD computes the squared residue
Yes, and it's always the absolute residual (not the relative one). This looks good, thanks!
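To make the two conventions concrete: QUDA's log reports the relative residual |r|/|b|, while tmLQCD prints a squared absolute residual. A minimal sketch of the conversion using numbers from the log above. It assumes that "Prepared source = 673.325" is the squared norm <b,b> (consistent with <r,r> at iteration 0) and that SolverPrecision = 1e-14, as in the reproducer input elsewhere in this thread; the helper names are made up for illustration:

```python
import math

def squared_residual_from_relative(rel_residual, b_norm_sq):
    """<r,r> = (|r|/|b|)^2 * <b,b>"""
    return rel_residual ** 2 * b_norm_sq

def relative_target_from_squared(squared_precision, b_norm_sq):
    """Relative target |r|/|b| equivalent to an absolute squared-residual target."""
    return math.sqrt(squared_precision / b_norm_sq)

b_norm_sq = 673.325      # "Prepared source" in the QUDA log (assumed to be <b,b>)
true_rel = 1.195750e-09  # "true" L2 relative residual reported by QUDA

print(squared_residual_from_relative(true_rel, b_norm_sq))
print(relative_target_from_squared(1e-14, b_norm_sq))
```

The first value is close to the <r,r> = 9.627888e-16 of the last GCR iteration, and the second to the "requested = 3.853787e-09" in the convergence line. tmLQCD's own check prints 7.625053e-16, which is of the same order; the thread does not say which operator or normalization that check uses.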
I'm afraid this is still a problem for me (in the sense that neither MG nor CG converge in the online measurement as part of an nf=2+1+1 HMC)...
This is CG (which converges according to QUDA but seemingly to the wrong result according to the residual check):
$ tail -f log_1645123984.out | grep residue
# Inversion done in 14635 iterations, squared residue = 6.294779e+04!
# Inversion done in 10410 iterations, squared residue = 5.710302e+04!
# Inversion done in 10097 iterations, squared residue = 5.745964e+04!
# Inversion done in 8723 iterations, squared residue = 5.550602e+04!
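A check like the grep above can be automated. A minimal sketch that extracts the iteration count and squared residue from lines of the form shown and flags solves whose residue exceeds a chosen target. The regular expression is written against the log lines quoted in this thread only, and the default target of 1e-14 is taken from the SolverPrecision in the reproducer input, so both are assumptions:

```python
import re

# Matches e.g. "# Inversion done in 14635 iterations, squared residue = 6.294779e+04!"
LINE_RE = re.compile(
    r"# Inversion done in (\d+) iterations, squared residue = ([0-9.eE+-]+)!"
)

def failed_solves(log_text, target=1e-14):
    """Return (iterations, squared_residue) for every solve above target."""
    failures = []
    for match in LINE_RE.finditer(log_text):
        iterations = int(match.group(1))
        residue = float(match.group(2))
        if residue > target:
            failures.append((iterations, residue))
    return failures

log = """\
# Inversion done in 14635 iterations, squared residue = 6.294779e+04!
# Inversion done in 7 iterations, squared residue = 7.625053e-16!
"""
print(failed_solves(log))
```

Only the first line is reported here; the second, taken from the earlier MG run, is below the target.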
Alright, as discussed, here's a minimal reproducer. No MG, just CG. ~The problem appears when the NDCLOVERRAT monomial is added, so there must be some leftover parameter switch which we don't take into account.~ Nope, this is not the reason.
It's independent of the order in which the monomials are specified and also independent of use_even_odd for the CLOVER operator used in the online measurement.
T=16
L=4
Measurements = 50
Startcondition = hot
InitialStoreCounter = 0
#Startcondition = continue
#InitialStoreCounter = readin
2KappaMu = 0.0015846837
CSW = 1.76
kappa = 0.15
NSave = 1
ThetaT = 1.0
UseEvenOdd = yes
ReversibilityCheck = no
ReversibilityCheckIntervall = 100
DebugLevel = 3
ompnumthreads = 6
BeginIntegrator
Type0 = 2MN
Type1 = 2MN
IntegrationSteps0 = 1
IntegrationSteps1 = 2
tau = 0.1
Lambda0 = 0.19
Lambda1 = 0.20
NumberOfTimescales = 2
MonitorForces = no
EndIntegrator
BeginMonomial GAUGE
Type = Wilson
beta = 5.60
Timescale = 0
UseExternalLibrary = quda
EndMonomial
BeginOperator CLOVER
CSW = 1.76
kappa = 0.15
2kappamu = 0.0015846837
SolverPrecision = 1e-14
MaxSolverIterations = 10000
solver = cg
UseEvenOdd = yes
useexternalinverter = quda
usesloppyprecision = single
EndOperator
BeginMeasurement CORRELATORS
Frequency = 1
EndMeasurement
BeginMonomial CLOVERDET
Timescale = 1
kappa = 0.15
2KappaMu = 0.0015846837
CSW = 1.76
rho = 0.09353509
MaxSolverIterations = 10000
AcceptancePrecision = 1.e-19
ForcePrecision = 1.e-15
Name = cloverdetlight
solver = cg
useexternalinverter = quda
usesloppyprecision = half
EndMonomial
Evolving an HMC for 49 trajectories on a 4c16 lattice works nicely using the following integrator:
BeginIntegrator
Type0 = 2MN
Type1 = 2MN
IntegrationSteps0 = 1
IntegrationSteps1 = 2
tau = 0.1
Lambda0 = 0.19
Lambda1 = 0.20
NumberOfTimescales = 2
MonitorForces = no
EndIntegrator
Running this trajectory, once using tmLQCD to solve for the online measurement and once using QUDA, I get the following correlators at trajectory 49 (tmLQCD left, QUDA right):
1 1 0 4.905795e+01 0.000000e+00 | 1 1 0 5.030253e+01 0.000000e+00
1 1 1 8.515074e+00 8.015892e+00 | 1 1 1 1.330971e+01 8.459627e+00
1 1 2 1.783359e+00 1.638909e+00 | 1 1 2 5.356934e+00 1.913101e+00
1 1 3 7.024739e-01 3.899849e-01 | 1 1 3 2.644079e+00 5.004493e-01
1 1 4 1.835661e-01 1.451663e-01 | 1 1 4 9.142225e-01 1.763050e-01
1 1 5 3.883558e-02 4.926942e-02 | 1 1 5 3.148795e-01 7.100453e-02
1 1 6 1.413912e-02 1.563275e-02 | 1 1 6 1.260886e-01 2.194228e-02
1 1 7 7.428324e-03 5.555699e-03 | 1 1 7 5.051833e-02 1.223092e-02
1 1 8 4.436583e-03 0.000000e+00 | 1 1 8 2.127422e-02 0.000000e+00
2 1 0 6.288558e-01 0.000000e+00 | 2 1 0 -4.111925e+00 0.000000e+00
2 1 1 9.463874e-01 -1.931518e+00 | 2 1 1 -5.112575e+00 2.777394e+00
2 1 2 5.985951e-01 -3.280192e-01 | 2 1 2 -2.059360e+00 4.171696e-01
2 1 3 1.917542e-01 -1.381770e-01 | 2 1 3 -5.619905e-01 1.655978e-01
2 1 4 5.317374e-02 -4.361940e-02 | 2 1 4 -3.101735e-01 5.678142e-02
2 1 5 1.265858e-02 -1.612433e-02 | 2 1 5 -9.021375e-02 2.120072e-02
2 1 6 3.855478e-03 -3.966485e-03 | 2 1 6 -1.895364e-02 7.009182e-03
2 1 7 2.302083e-03 -7.334878e-04 | 2 1 7 -1.174924e-02 4.104665e-04
2 1 8 2.625197e-04 0.000000e+00 | 2 1 8 -6.862359e-03 0.000000e+00
6 1 0 -2.080063e+00 0.000000e+00 | 6 1 0 -4.903313e+00 0.000000e+00
6 1 1 3.507670e-01 -1.665105e-01 | 6 1 1 -3.677945e-01 -2.357396e-01
6 1 2 -1.166227e-02 3.013275e-03 | 6 1 2 1.177670e-01 -6.271045e-02
6 1 3 1.286590e-02 -4.614345e-03 | 6 1 3 5.833267e-03 -4.283882e-02
6 1 4 2.265724e-03 -9.804648e-06 | 6 1 4 4.505630e-02 5.888295e-03
6 1 5 1.200649e-03 2.680927e-03 | 6 1 5 3.778743e-03 2.597529e-03
6 1 6 -3.611065e-05 1.501504e-04 | 6 1 6 9.812108e-03 -5.020652e-04
6 1 7 6.634757e-04 1.901951e-04 | 6 1 7 -1.538161e-04 -5.964235e-04
6 1 8 1.173816e-04 0.000000e+00 | 6 1 8 1.044908e-03 0.000000e+00
The trajectories themselves, however, were reproduced exactly (note that all derivatives were still computed via QUDA in both cases).
Okay, I've found the bugger, now for real.
The issue was the following: the solver interface(s) for the monomials set inv_param.dagger = QUDA_DAG_YES when certain solvers are used (CG, for example). This also explains why your example, @Marcogarofalo, worked, while my example above (https://github.com/etmc/tmLQCD/issues/495#issuecomment-1048663230) does not: when the MG is used in the monomial, inv_param.dagger = QUDA_DAG_NO is set and this corresponds to what is required also for the operator solve for the online measurement.
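The mechanism described above is a stale-state problem, where a flag set for one kind of solve survives into the next. A toy sketch of that pattern follows; it is not tmLQCD or QUDA code. The names inv_param, dagger, QUDA_DAG_YES and QUDA_DAG_NO mirror the thread, while everything else (the 2x2 matrix, the function names, the reset_dagger switch) is made up for illustration:

```python
DAG_NO, DAG_YES = "QUDA_DAG_NO", "QUDA_DAG_YES"

# Non-symmetric toy operator, so that M and its transpose give different solves.
M = [[2.0, 1.0], [0.0, 1.0]]

# Shared parameter struct that outlives individual solves.
inv_param = {"dagger": DAG_NO}

def transpose(a):
    return [[a[0][0], a[1][0]], [a[0][1], a[1][1]]]

def solve2(a, b):
    """Solve the 2x2 system a x = b by Cramer's rule."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [(b[0] * a[1][1] - a[0][1] * b[1]) / det,
            (a[0][0] * b[1] - a[1][0] * b[0]) / det]

def squared_residual(x, b):
    """|b - M x|^2, always checked against the undaggered operator."""
    r = [b[i] - (M[i][0] * x[0] + M[i][1] * x[1]) for i in range(2)]
    return r[0] ** 2 + r[1] ** 2

def backend_solve(b):
    """Stand-in for the external inverter: honours whatever dagger is set."""
    a = transpose(M) if inv_param["dagger"] == DAG_YES else M
    return solve2(a, b)

def monomial_solve(b, solver):
    """HMC-side solve: in this toy, CG requests the daggered operator."""
    inv_param["dagger"] = DAG_YES if solver == "cg" else DAG_NO
    return backend_solve(b)

def operator_solve(b, reset_dagger):
    """Online-measurement solve, with or without resetting the flag first."""
    if reset_dagger:
        inv_param["dagger"] = DAG_NO
    return backend_solve(b)

b = [1.0, 1.0]
monomial_solve(b, "cg")
print(squared_residual(operator_solve(b, reset_dagger=False), b))  # stale flag
print(squared_residual(operator_solve(b, reset_dagger=True), b))   # flag reset
monomial_solve(b, "mg")
print(squared_residual(operator_solve(b, reset_dagger=False), b))  # MG left DAG_NO
```

In this toy model the backend reports a converged solve in every case, but only the solves that run with DAG_NO pass the residual check against M. That matches the symptoms in the thread: a CG monomial breaks the online measurement, while an MG monomial does not.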
See #528
Resolved via #528
There seems to be a bug right now which messes up the QUDA parameters when used in both the HMC for the single parity solve as well as for the online measurement. Probably just some silly oversight.