Closed sbacchio closed 1 year ago
We've reproduced this, albeit with a different error: we get a precision mismatch. This indicates that one of the parameter structs is not properly initialized (or overwritten somehow).
Hi, yes: if I call, at the beginning of monitor_forces,
update_tm_gauge_id(&g_gauge_state, 0.1);
and
update_tm_gauge_id(&g_gauge_state, -0.1);
the problem actually disappears.
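A toy model of why touching the gauge ID can make the problem go away, assuming the interface re-uploads the gauge field whenever the current ID differs from the ID it last uploaded. Everything prefixed with toy_ is hypothetical and only illustrates the idea; the actual bookkeeping in tmLQCD and QUDA may differ.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the host-side gauge tag (not the tmLQCD struct). */
typedef struct {
  double gauge_id; /* changes whenever the gauge field is modified */
} toy_gauge_state;

/* Hypothetical stand-in for what the device interface remembers. */
typedef struct {
  int loaded;        /* 1 once a field has been uploaded */
  double loaded_id;  /* gauge_id of the field that was uploaded */
  int upload_count;  /* number of uploads performed so far */
} toy_device_cache;

/* Shift the tag, mimicking the effect of an update_tm_gauge_id-style call. */
void toy_update_gauge_id(toy_gauge_state *s, double step) {
  assert(s != NULL);
  s->gauge_id += step;
}

/* Upload only if the tag differs from the cached one.
 * Returns 1 if an upload happened, 0 if the cached field was reused. */
int toy_load_gauge(toy_device_cache *c, const toy_gauge_state *s) {
  assert(c != NULL && s != NULL);
  double diff = c->loaded_id - s->gauge_id;
  if (diff < 0) diff = -diff;
  if (c->loaded && diff < 1e-12) return 0; /* same tag: reuse */
  c->loaded = 1;
  c->loaded_id = s->gauge_id;
  c->upload_count++;
  return 1;
}
```

In this model, shifting the ID by 0.1 and then by -0.1 forces a fresh upload after each shift, so any stale device-side field is replaced even though the ID ends up back at (approximately) its starting value.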
I saw a similar kind of issue when using a hot start, for example:
MG level 0 (GPU): ERROR: Precisions 4 8 do not match (/cyclamen/home/fpittler/code/quda_ndeg/lib/../include/kernels/dslash_wilson.cuh:51 in WilsonArg())
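For readers unfamiliar with this kind of error, here is a minimal sketch of a precision-consistency check. It assumes that the 4 and 8 in the message are the per-number byte sizes of single and double precision. The toy_ names are hypothetical and this is not QUDA's implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical precision tags, chosen to equal the bytes per real number. */
typedef enum { TOY_PREC_SINGLE = 4, TOY_PREC_DOUBLE = 8 } toy_precision;

/* Hypothetical field descriptor carrying only its precision. */
typedef struct {
  toy_precision precision;
} toy_field;

/* Compare every field against the first one.
 * Returns 0 if all precisions agree, otherwise the index of the first
 * field whose precision differs from fields[0]. */
int toy_check_precision(const toy_field *fields, size_t n) {
  assert(fields != NULL && n > 0);
  for (size_t i = 1; i < n; i++) {
    if (fields[i].precision != fields[0].precision) return (int)i;
  }
  return 0;
}
```

A kernel that receives, say, a single-precision gauge field together with a double-precision spinor would trip such a check, which is consistent with a field having been reloaded or replaced at a different precision than the solver was set up for.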
Hi, I'm seeing a similar error; I attach the relevant part of the valgrind inspection. My understanding is that the check which controls whether the sloppy gauge field must be allocated is not working, although I have to say I don't understand the logic.
Below is also the relevant part of the input file; maybe somebody can spot a parameter that is not properly set?
==93804== at 0x14932B84: quda::Dirac::checkParitySpinor(quda::ColorSpinorField const&, quda::ColorSpinorField const&) const (dirac.cpp:122)
==93804== by 0x149681B7: quda::DiracTwistedClover::checkParitySpinor(quda::ColorSpinorField const&, quda::ColorSpinorField const&) const (dirac_twisted_clover.cpp:34)
==93804== by 0x1496861B: quda::DiracTwistedCloverPC::Dslash(quda::ColorSpinorField&, quda::ColorSpinorField const&, QudaParity_s) const (dirac_twisted_clover.cpp:219)
==93804== by 0x14968E63: quda::DiracTwistedCloverPC::M(quda::ColorSpinorField&, quda::ColorSpinorField const&) const (dirac_twisted_clover.cpp:289)
==93804== by 0x1485FC3F: quda::DiracM::operator()(quda::ColorSpinorField&, quda::ColorSpinorField const&, quda::ColorSpinorField&) const (dirac_quda.h:2117)
==93803== Invalid read of size 8
==93803== at 0x14932B84: quda::Dirac::checkParitySpinor(quda::ColorSpinorField const&, quda::ColorSpinorField const&) const (dirac.cpp:122)
==93803== by 0x149681B7: quda::DiracTwistedClover::checkParitySpinor(quda::ColorSpinorField const&, quda::ColorSpinorField const&) const (dirac_twisted_clover.cpp:34)
==93803== by 0x1496861B: quda::DiracTwistedCloverPC::Dslash(quda::ColorSpinorField&, quda::ColorSpinorField const&, QudaParity_s) const (dirac_twisted_clover.cpp:219)
==93803== by 0x14968E63: quda::DiracTwistedCloverPC::M(quda::ColorSpinorField&, quda::ColorSpinorField const&) const (dirac_twisted_clover.cpp:289)
==93803== by 0x1485FC3F: quda::DiracM::operator()(quda::ColorSpinorField&, quda::ColorSpinorField const&, quda::ColorSpinorField&) const (dirac_quda.h:2117)
==93803== by 0x148A576B: quda::CAGCR::operator()(quda::ColorSpinorField&, quda::ColorSpinorField&) (inv_ca_gcr.cpp:223)
==93804== by 0x148A576B: quda::CAGCR::operator()(quda::ColorSpinorField&, quda::ColorSpinorField&) (inv_ca_gcr.cpp:223)
==93804== by 0x1485215B: quda::MG::operator()(quda::ColorSpinorField&, quda::ColorSpinorField&) (multigrid.cpp:1277)
==93804== by 0x148BEC43: quda::GCR::operator()(quda::ColorSpinorField&, quda::ColorSpinorField&) (inv_gcr_quda.cpp:411)
==93804== by 0x1490916F: invertQuda (interface_quda.cpp:3011)
==93804== by 0x1004C08F: invert_eo_degenerate_quda (quda_interface.c:2099)
==93804== by 0x1013FB13: solve_degenerate (monomial_solve.c:127)
==93804== by 0x10075A77: cloverdet_derivative (cloverdet_monomial.c:100)
==93804== Address 0x15b7deaa8 is 8 bytes inside a block of size 3,112 free'd
==93804== at 0x4086234: free (vg_replace_malloc.c:540)
==93804== by 0x149CF2BB: quda::host_free_(char const*, char const*, int, void*) (malloc.cpp:475)
==93804== by 0x14949E8F: operator delete (object.h:24)
==93804== by 0x14949E8F: quda::cudaGaugeField::~cudaGaugeField() (cuda_gauge_field.cpp:111)
==93804== by 0x148E88BB: freeSloppyGaugeQuda() (interface_quda.cpp:1046)
==93804== by 0x148E8C57: freeGaugeQuda (interface_quda.cpp:1104)
==93804== by 0x100483BB: _loadGaugeQuda (quda_interface.c:587)
==93804== by 0x1004BDB3: invert_eo_degenerate_quda (quda_interface.c:2062)
==93804== by 0x1013FB13: solve_degenerate (monomial_solve.c:127)
==93804== by 0x10075A77: cloverdet_derivative (cloverdet_monomial.c:100)
==93804== by 0x1007F59B: monitor_forces (monitor_forces.c:58)
==93804== by 0x1003701B: update_tm (update_tm.c:134)
==93804== by 0x1000758F: main (hmc_tm.c:402)
==93804== Block was alloc'd at
==93804== at 0x408484C: malloc (vg_replace_malloc.c:309)
==93803== by 0x1485215B: quda::MG::operator()(quda::ColorSpinorField&, quda::ColorSpinorField&) (multigrid.cpp:1277)
==93804== by 0x149CFA6F: quda::safe_malloc_(char const*, char const*, int, unsigned long) (malloc.cpp:282)
==93804== by 0x1490D45B: operator new (object.h:22)
==93804== by 0x1490D45B: loadGaugeQuda (interface_quda.cpp:673)
==93804== by 0x100482EB: _loadGaugeQuda (quda_interface.c:595)
==93804== by 0x1004BDB3: invert_eo_degenerate_quda (quda_interface.c:2062)
==93804== by 0x1013FB13: solve_degenerate (monomial_solve.c:127)
==93803== by 0x148BEC43: quda::GCR::operator()(quda::ColorSpinorField&, quda::ColorSpinorField&) (inv_gcr_quda.cpp:411)
==93804== by 0x100775A7: cloverdetratio_heatbath (cloverdetratio_monomial.c:287)
==93804== by 0x100366A3: update_tm (update_tm.c:130)
==93804== by 0x1000758F: main (hmc_tm.c:402)
==93804==
BeginExternalInverter QUDA
Pipeline = 24
gcrNkrylov = 24
MGCoarseMuFactor = 1.0, 1.0, 50.0
MGNumberOfLevels = 3
MGNumberOfVectors = 24, 32
MGSetupSolver = cg
MGSetup2KappaMu = 0.000336154560
MGVerbosity = silent, silent, silent
MGSetupSolverTolerance = 5e-7, 5e-7
MGSetupMaxSolverIterations = 1500, 1500
MGCoarseSolverType = gcr, gcr, cagcr
MgCoarseSolverTolerance = 0.1, 0.1, 0.1
MGCoarseMaxSolverIterations = 15, 15, 15
MGSmootherType = cagcr, cagcr, cagcr
MGSmootherTolerance = 0.2, 0.2, 0.2
MGSmootherPreIterations = 0, 0, 0
MGSmootherPostIterations = 4, 4, 4
MGBlockSizesX = 2,2
MGBlockSizesY = 2,2
MGBlockSizesZ = 2,2
MGBlockSizesT = 2,2
MGOverUnderRelaxationFactor = 0.90, 0.90, 0.90
MGResetSetupMDUThreshold = 1.0
# tau = 1.0 / 17 = 0.05882353 -> Threshold = 0.058
MGRefreshSetupMDUThreshold = 0.058
MGRefreshSetupMaxSolverIterations = 20, 20
EndExternalInverter
BeginOperator CLOVER
CSW = 1.76
kappa = 0.15
2kappamu = 0.0015846837
SolverPrecision = 1e-14
MaxSolverIterations = 1000
# solver = cg
solver = mg
UseEvenOdd = yes
useexternalinverter = quda
usesloppyprecision = single
EndOperator
BeginMonomial CLOVERDET
Timescale = 1
kappa = 0.15
2KappaMu = 0.0015846837
CSW = 1.76
rho = 0.09353509
MaxSolverIterations = 1000
AcceptancePrecision = 1.e-19
ForcePrecision = 1.e-15
Name = cloverdetlight
solver = mg
useexternalinverter = quda
usesloppyprecision = single
EndMonomial
BeginMonomial CLOVERDETRATIO
Timescale = 1
kappa = 0.15
2KappaMu = 0.0015846837
rho = 0.01039279
rho2 = 0.09353509
CSW = 1.76
MaxSolverIterations = 1000
AcceptancePrecision = 1.e-19
ForcePrecision = 1.e-16
Name = cloverdetratio1light
solver = mg
useexternalinverter = quda
usesloppyprecision = single
EndMonomial
I'm tagging @Marcogarofalo, since he is observing the same issue.
I notice that
UseEvenOdd = yes
is not added to the cloverdet monomial, while it is used in the clover operator, which is working smoothly (I believe). Is this related, in your opinion? I'll do a test...
OK, I understand that
UseEvenOdd = yes
is not needed there and is not recognized at all.
The clover monomials should always be EO (it's an unholy mess for historical reasons)...
There is a global UseEvenOdd flag which, in principle, sets all monomials to be EO-preconditioned (unless a monomial is encountered which does not support this).
There is also a UseEvenOdd parameter to control certain historically relevant cases.
It can be yes or no, and with MG for the online measurement, no makes sense (and in general for measurements).
OK, the global flag is set to yes, so this is ruled out.
For some reason the heatbath part of the monomial is working, but the force calculation is not...
In other words, when the monomial is created, the sloppy gauge field is initialized; then, when the force is computed, the sloppy field is freed and not recreated, but it is later accessed by the solver.
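The sequence described above (allocate, hand to the solver, free, use again) is a classic stale-pointer pattern. Below is a minimal sketch of one way to detect it, using a generation counter that the owner bumps on every allocation and free. All toy_ names are hypothetical and this is not how QUDA or tmLQCD track their fields.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical owner of the "sloppy" field. */
typedef struct {
  double *sloppy;      /* NULL while no field is allocated */
  unsigned generation; /* bumped on every allocation and every free */
} toy_gauge_owner;

/* Hypothetical solver that caches the pointer it was set up with. */
typedef struct {
  const double *sloppy;
  unsigned generation; /* owner generation at bind time */
} toy_solver;

void toy_alloc_sloppy(toy_gauge_owner *g, size_t n) {
  assert(g != NULL && g->sloppy == NULL && n > 0);
  g->sloppy = calloc(n, sizeof *g->sloppy);
  assert(g->sloppy != NULL);
  g->generation++;
}

void toy_free_sloppy(toy_gauge_owner *g) {
  assert(g != NULL);
  free(g->sloppy);
  g->sloppy = NULL; /* never leave a dangling pointer in the owner */
  g->generation++;
}

/* (Re)bind the solver to whatever field the owner currently holds. */
void toy_solver_bind(toy_solver *s, const toy_gauge_owner *g) {
  assert(s != NULL && g != NULL);
  s->sloppy = g->sloppy;
  s->generation = g->generation;
}

/* Returns 1 if the solver's cached pointer still refers to the live field. */
int toy_solver_is_current(const toy_solver *s, const toy_gauge_owner *g) {
  assert(s != NULL && g != NULL);
  return g->sloppy != NULL && s->generation == g->generation;
}
```

With such a check, a solver that was bound before the field was freed reports that it is out of date and can be rebound (or refreshed) before it dereferences the old pointer, instead of reading freed memory as in the valgrind trace.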
Hi, yes: if I call, at the beginning of monitor_forces,
update_tm_gauge_id(&g_gauge_state, 0.1);
and
update_tm_gauge_id(&g_gauge_state, -0.1);
the problem actually disappears.
I see only now this comment by Ferenz. It looks to me like this might be related to PR https://github.com/etmc/tmLQCD/pull/522, where we observed another problem related to gauge_state. Possibly the PR https://github.com/etmc/tmLQCD/pull/523/ might fix the issue?
Hi @sunpho84, I tried the PR #523, however I still get the issue when the monitor forces is turned on: MG level 0 (GPU): ERROR: Precisions 4 8 do not match (/cyclamen/home/fpittler/code/quda_ndeg/lib/../include/kernels/dslash_wilson.cuh:51 in WilsonArg())
Just for reference, I report here another workaround that makes the problem disappear. One should add:
updateMultigridQuda(quda_mg_preconditioner, &quda_mg_param);
after this line: https://github.com/etmc/tmLQCD/blob/23003f1d66d5cdde2e2c6b2c046e0c4df1d16643/quda_interface.c#L2101
@sbacchio can you give #525 a try for the problem that you've encountered?
@sbacchio did the changes solve the issue with MonitorForces?
I still see this issue in
/m100_work/INF22_lqcd123_0/hmc/cA211.12.48/start_from_0186/new_3nodes/logs/log_cA211.12.48_5918403.out
@Marcogarofalo in your input file, can you specify
BeginOperator CLOVER
CSW = 1.74
kappa = 0.140065
2KappaMu = 0.0003361560
solver = mg
SolverPrecision = 1e-18
MaxSolverIterations = 70000
useevenodd = yes
useexternalinverter = quda
usesloppyprecision = single ## <-- add this
EndOperator
to see if this resolves the problem? I think there might be an issue with trying to do full double-precision MG. Doing so is not recommended anyway, but I suspect that this is the reason for what you're seeing in the online measurement.
Strictly speaking we should of course support full double-precision MG, but it's not a high priority as it will be slow.
Note that you can also reduce the maximum number of iterations there to at most 500 or so.
Yes, sorry, basically the error I am seeing is #530. I thought that I had fixed the input. Thank you.
When
MonitorForces = yes
an error occurs in QUDA. More investigation required. The error is the following:
For more details see
$SCRATCH_fssh/bacchio1/C56/logs/log_trial_4811972.out
on Juwels Booster