etmc / tmLQCD

tmLQCD is a freely available software suite providing a set of tools to be used in lattice QCD simulations. Its core is an HMC implementation (including PHMC and RHMC) for Wilson, Wilson clover and Wilson twisted mass fermions, together with inverters for different versions of the Dirac operator. The code is fully parallelised and ships with optimisations for various modern architectures, such as commodity PC clusters and the Blue Gene family.
http://www.itkp.uni-bonn.de/~urbach/software.html
GNU General Public License v3.0

MonitorForces not working with QUDA #517

Closed. sbacchio closed this issue 1 year ago.

sbacchio commented 2 years ago

When MonitorForces = yes, an error occurs in QUDA. More investigation is required.

The error is the following:

MG level 0 (GPU): ERROR: Spinor volume 351232 doesn't match gauge volume 0 (rank 0, host jwb0134.juwels, dirac.cpp:125 in checkParitySpinor())
MG level 0 (GPU):        last kernel called was (name=N4quda4blas7axpbyz_IfEE,volume=28x56x56x4,aux=GPU-offline,vol=351232,precision=4,order=4,Ns=4,Nc=3,TwistFlavour=1)

For more details see $SCRATCH_fssh/bacchio1/C56/logs/log_trial_4811972.out on Juwels Booster

kostrzewa commented 2 years ago

We've reproduced this, albeit with a different error: we get a precision mismatch. This indicates that one of the parameter structs is not properly initialized (or overwritten somehow).

pittlerf commented 2 years ago

Hi, yes, if I call update_tm_gauge_id(&g_gauge_state, 0.1); and update_tm_gauge_id(&g_gauge_state, -0.1); at the beginning of monitor_forces, the problem actually disappears.

pittlerf commented 2 years ago

> We've reproduced this, albeit with a different error: we get a precision mismatch. This indicates that one of the parameter structs is not properly initialized (or overwritten somehow).

I saw a similar kind of issue using a hot start:

MG level 0 (GPU): ERROR: Precisions 4 8 do not match (/cyclamen/home/fpittler/code/quda_ndeg/lib/../include/kernels/dslash_wilson.cuh:51 in WilsonArg())

sunpho84 commented 2 years ago

Hi, I'm seeing a similar error; I attach the relevant part of the valgrind inspection. My understanding is that the check which controls whether the sloppy gauge must be allocated,

https://github.com/sunpho84/quda/blob/5431b168b09343503d0d676425069dc895879c92/lib/interface_quda.cpp#L670-L674

is not working; I have to say I don't understand the logic.

Below is also the relevant part of the input file; maybe somebody can spot a parameter that is not set properly?

==93804== Invalid read of size 8
==93804==    at 0x14932B84: quda::Dirac::checkParitySpinor(quda::ColorSpinorField const&, quda::ColorSpinorField const&) const (dirac.cpp:122)
==93804==    by 0x149681B7: quda::DiracTwistedClover::checkParitySpinor(quda::ColorSpinorField const&, quda::ColorSpinorField const&) const (dirac_twisted_clover.cpp:34)
==93804==    by 0x1496861B: quda::DiracTwistedCloverPC::Dslash(quda::ColorSpinorField&, quda::ColorSpinorField const&, QudaParity_s) const (dirac_twisted_clover.cpp:219)
==93804==    by 0x14968E63: quda::DiracTwistedCloverPC::M(quda::ColorSpinorField&, quda::ColorSpinorField const&) const (dirac_twisted_clover.cpp:289)
==93804==    by 0x1485FC3F: quda::DiracM::operator()(quda::ColorSpinorField&, quda::ColorSpinorField const&, quda::ColorSpinorField&) const (dirac_quda.h:2117)
==93804==    by 0x148A576B: quda::CAGCR::operator()(quda::ColorSpinorField&, quda::ColorSpinorField&) (inv_ca_gcr.cpp:223)
==93804==    by 0x1485215B: quda::MG::operator()(quda::ColorSpinorField&, quda::ColorSpinorField&) (multigrid.cpp:1277)
==93804==    by 0x148BEC43: quda::GCR::operator()(quda::ColorSpinorField&, quda::ColorSpinorField&) (inv_gcr_quda.cpp:411)
==93804==    by 0x1490916F: invertQuda (interface_quda.cpp:3011)
==93804==    by 0x1004C08F: invert_eo_degenerate_quda (quda_interface.c:2099)
==93804==    by 0x1013FB13: solve_degenerate (monomial_solve.c:127)
==93804==    by 0x10075A77: cloverdet_derivative (cloverdet_monomial.c:100)
==93804==  Address 0x15b7deaa8 is 8 bytes inside a block of size 3,112 free'd
==93804==    at 0x4086234: free (vg_replace_malloc.c:540)
==93804==    by 0x149CF2BB: quda::host_free_(char const*, char const*, int, void*) (malloc.cpp:475)
==93804==    by 0x14949E8F: operator delete (object.h:24)
==93804==    by 0x14949E8F: quda::cudaGaugeField::~cudaGaugeField() (cuda_gauge_field.cpp:111)
==93804==    by 0x148E88BB: freeSloppyGaugeQuda() (interface_quda.cpp:1046)
==93804==    by 0x148E8C57: freeGaugeQuda (interface_quda.cpp:1104)
==93804==    by 0x100483BB: _loadGaugeQuda (quda_interface.c:587)
==93804==    by 0x1004BDB3: invert_eo_degenerate_quda (quda_interface.c:2062)
==93804==    by 0x1013FB13: solve_degenerate (monomial_solve.c:127)
==93804==    by 0x10075A77: cloverdet_derivative (cloverdet_monomial.c:100)
==93804==    by 0x1007F59B: monitor_forces (monitor_forces.c:58)
==93804==    by 0x1003701B: update_tm (update_tm.c:134)
==93804==    by 0x1000758F: main (hmc_tm.c:402)
==93804==  Block was alloc'd at
==93804==    at 0x408484C: malloc (vg_replace_malloc.c:309)
==93804==    by 0x149CFA6F: quda::safe_malloc_(char const*, char const*, int, unsigned long) (malloc.cpp:282)
==93804==    by 0x1490D45B: operator new (object.h:22)
==93804==    by 0x1490D45B: loadGaugeQuda (interface_quda.cpp:673)
==93804==    by 0x100482EB: _loadGaugeQuda (quda_interface.c:595)
==93804==    by 0x1004BDB3: invert_eo_degenerate_quda (quda_interface.c:2062)
==93804==    by 0x1013FB13: solve_degenerate (monomial_solve.c:127)
==93804==    by 0x100775A7: cloverdetratio_heatbath (cloverdetratio_monomial.c:287)
==93804==    by 0x100366A3: update_tm (update_tm.c:130)
==93804==    by 0x1000758F: main (hmc_tm.c:402)
==93804==
BeginExternalInverter QUDA
  Pipeline = 24
  gcrNkrylov = 24
  MGCoarseMuFactor = 1.0, 1.0, 50.0
  MGNumberOfLevels = 3
  MGNumberOfVectors = 24, 32
  MGSetupSolver = cg
  MGSetup2KappaMu = 0.000336154560
  MGVerbosity = silent, silent, silent
  MGSetupSolverTolerance = 5e-7, 5e-7
  MGSetupMaxSolverIterations = 1500, 1500
  MGCoarseSolverType = gcr, gcr, cagcr
  MgCoarseSolverTolerance = 0.1, 0.1, 0.1
  MGCoarseMaxSolverIterations = 15, 15, 15
  MGSmootherType = cagcr, cagcr, cagcr
  MGSmootherTolerance = 0.2, 0.2, 0.2
  MGSmootherPreIterations = 0, 0, 0
  MGSmootherPostIterations = 4, 4, 4
  MGBlockSizesX = 2,2
  MGBlockSizesY = 2,2
  MGBlockSizesZ = 2,2
  MGBlockSizesT = 2,2
  MGOverUnderRelaxationFactor = 0.90, 0.90, 0.90
  MGResetSetupMDUThreshold = 1.0
  # tau = 1.0 / 17 = 0.05882353 -> Threshold = 0.058
  MGRefreshSetupMDUThreshold = 0.058
  MGRefreshSetupMaxSolverIterations = 20, 20
EndExternalInverter

BeginOperator CLOVER
  CSW = 1.76
  kappa = 0.15
  2kappamu = 0.0015846837
  SolverPrecision = 1e-14
  MaxSolverIterations = 1000
#  solver = cg
  solver = mg
  UseEvenOdd = yes
  useexternalinverter = quda
  usesloppyprecision = single  
EndOperator

BeginMonomial CLOVERDET
  Timescale = 1
  kappa = 0.15
  2KappaMu = 0.0015846837
  CSW = 1.76
  rho = 0.09353509
  MaxSolverIterations = 1000
  AcceptancePrecision =  1.e-19
  ForcePrecision = 1.e-15
  Name = cloverdetlight
  solver = mg
  useexternalinverter = quda
  usesloppyprecision = single
EndMonomial

BeginMonomial CLOVERDETRATIO
  Timescale = 1
  kappa = 0.15
  2KappaMu = 0.0015846837
  rho = 0.01039279
  rho2 = 0.09353509
  CSW = 1.76
  MaxSolverIterations = 1000
  AcceptancePrecision =  1.e-19
  ForcePrecision = 1.e-16
  Name = cloverdetratio1light
  solver = mg
  useexternalinverter = quda
  usesloppyprecision = single
EndMonomial
sunpho84 commented 2 years ago

I'm tagging @Marcogarofalo since he is observing the same issue.

sunpho84 commented 2 years ago

I notice

  UseEvenOdd = yes

is not added to the CLOVERDET monomial, while it is used in the CLOVER operator, which is working smoothly (I believe). Is this related, in your opinion? I'll do a test...

sunpho84 commented 2 years ago

OK, I understand that

 UseEvenOdd = yes

is not needed there, and in fact is not recognized for the monomial at all.

kostrzewa commented 2 years ago

The clover monomials should always be EO (it's an unholy mess for historical reasons)...

sunpho84 commented 2 years ago

OK, the global flag is set to yes, so this is ruled out.

For some reason the heatbath part of the monomial is working, but the force calculation is not...

sunpho84 commented 2 years ago

In other words: when the monomial is created, the sloppy gauge field is initialized; then, when the force is computed, the sloppy field is freed and not recreated, but is later addressed by the solver.

sunpho84 commented 2 years ago

> Hi, yes, in case I do in the beginning of monitor_forces update_tm_gauge_id(&g_gauge_state, 0.1); and update_tm_gauge_id(&g_gauge_state, -0.1); the problem actually disappears

I only see this comment by Ferenz now. It looks to me like this might be related to PR https://github.com/etmc/tmLQCD/pull/522, where we observed another problem related to gauge_state. Possibly PR https://github.com/etmc/tmLQCD/pull/523/ might fix the issue?

pittlerf commented 2 years ago

Hi @sunpho84, I tried PR #523; however, I still get the issue when MonitorForces is turned on:

MG level 0 (GPU): ERROR: Precisions 4 8 do not match (/cyclamen/home/fpittler/code/quda_ndeg/lib/../include/kernels/dslash_wilson.cuh:51 in WilsonArg())

simone-romiti commented 2 years ago

> Hi, yes, in case I do in the beginning of monitor_forces update_tm_gauge_id(&g_gauge_state, 0.1); and update_tm_gauge_id(&g_gauge_state, -0.1); the problem actually disappears

Just for reference, I report here another workaround that makes the problem disappear. One should add:

updateMultigridQuda(quda_mg_preconditioner, &quda_mg_param);

after this line: https://github.com/etmc/tmLQCD/blob/23003f1d66d5cdde2e2c6b2c046e0c4df1d16643/quda_interface.c#L2101

kostrzewa commented 2 years ago

@sbacchio can you give #525 a try for the problem that you've encountered?

kostrzewa commented 2 years ago

@sbacchio did the changes solve the issue with MonitorForces ?

Marcogarofalo commented 2 years ago

I still see this issue in /m100_work/INF22_lqcd123_0/hmc/cA211.12.48/start_from_0186/new_3nodes/logs/log_cA211.12.48_5918403.out

kostrzewa commented 2 years ago

@Marcogarofalo in your input file, can you specify

BeginOperator CLOVER
  CSW = 1.74
  kappa = 0.140065
  2KappaMu = 0.0003361560
  solver = mg
  SolverPrecision = 1e-18
  MaxSolverIterations = 70000
  useevenodd = yes                                                                                                                         
  useexternalinverter = quda
  usesloppyprecision = single ## <-- add this
EndOperator

to see if this resolves the problem? I think there might be an issue with trying to do full double-precision MG. Doing so is not recommended anyway, but I suspect that this is the reason for what you're seeing in the online measurement.

Strictly speaking we should of course support full double-precision MG, but it's not a high priority as it will be slow.

kostrzewa commented 2 years ago

Note that you can also reduce the maximum number of iterations there to at most 500 or so.

Marcogarofalo commented 2 years ago

Yes, sorry: basically, the error I am seeing is #530. I thought that I had fixed the input. Thank you.