etmc / tmLQCD

tmLQCD is a freely available software suite providing a set of tools to be used in lattice QCD simulations. It is mainly an HMC implementation (including PHMC and RHMC) for Wilson, Wilson clover and Wilson twisted mass fermions, together with an inverter for different versions of the Dirac operator. The code is fully parallelised and ships with optimisations for various modern architectures, such as commodity PC clusters and the Blue Gene family.
http://www.itkp.uni-bonn.de/~urbach/software.html
GNU General Public License v3.0

write out gauge field when the solver fails during monomial_solve #563

Closed: kostrzewa closed this 1 year ago

kostrzewa commented 1 year ago

@simone-romiti this should allow you to test with the actual failing gauge config
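
The idea in the PR title can be sketched as follows. This is only an illustrative Python stand-in (tmLQCD itself is written in C): `solve_or_dump`, `failing_solver`, `working_solver` and the pickle-based dump are hypothetical names and choices made for this example, not tmLQCD functions or its on-disk format. The one thing taken from this thread is the convention that a negative iteration count signals a failed solve.

```python
import os
import pickle
import tempfile


def solve_or_dump(solver, gauge_field, dump_path):
    """Run `solver`; if it reports a negative iteration count,
    write the gauge field to `dump_path` before raising."""
    iterations = solver(gauge_field)
    if iterations < 0:
        # Persist the offending configuration first, so the failure
        # can be reproduced offline with exactly this field.
        with open(dump_path, "wb") as fh:
            pickle.dump(gauge_field, fh)
        raise RuntimeError(
            f"solver reported {iterations} iterations; "
            f"gauge field written to {dump_path}"
        )
    return iterations


def failing_solver(_field):
    # Hypothetical solver that mimics a "-1 iterations" failure code.
    return -1


def working_solver(_field):
    # Hypothetical solver that converges in three iterations.
    return 3


def demo():
    """Show that the dump file appears only when the solve fails."""
    path = os.path.join(tempfile.mkdtemp(), "conf_fail.bin")
    solve_or_dump(working_solver, [0.0, 1.0], path)
    try:
        solve_or_dump(failing_solver, [0.0, 1.0], path)
    except RuntimeError as err:
        print(err)
    return os.path.exists(path)
```

Writing the field before raising matters: once the abort is triggered, there is no later opportunity to save the state that caused the failure.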

kostrzewa commented 1 year ago

# Starting trajectory no 0
# : Time for gauge_heatbath 3.347489e-03 s level: 1 proc_id: 0 /HMC/GAUGE:gauge_heatbath
# : Time for sw_term 2.891905e-02 s level: 2 proc_id: 0 /HMC/CLOVERTRLOG:clover_trlog_heatbath/sw_term
# : Time for sw_trace 4.135739e-03 s level: 2 proc_id: 0 /HMC/CLOVERTRLOG:clover_trlog_heatbath/sw_trace
# : Time for clover_trlog_heatbath 3.333514e-02 s level: 1 proc_id: 0 /HMC/CLOVERTRLOG:clover_trlog_heatbath
# : Time for sw_term 2.230593e-02 s level: 2 proc_id: 0 /HMC/cloverdet:cloverdet_heatbath/sw_term
# : Time for sw_invert 1.069058e-02 s level: 2 proc_id: 0 /HMC/cloverdet:cloverdet_heatbath/sw_invert
# : Time for random_energy0 1.834037e-03 s level: 2 proc_id: 0 /HMC/cloverdet:cloverdet_heatbath/random_energy0
# : Time for Qp 4.230258e-03 s level: 2 proc_id: 0 /HMC/cloverdet:cloverdet_heatbath/Qp
# : Time for cloverdet_heatbath 3.911479e-02 s level: 1 proc_id: 0 /HMC/cloverdet:cloverdet_heatbath
# : Time for sw_term 1.579158e-02 s level: 2 proc_id: 0 /HMC/cloverdetratio:cloverdetratio_heatbath/sw_term
# : Time for sw_invert 9.308949e-03 s level: 2 proc_id: 0 /HMC/cloverdetratio:cloverdetratio_heatbath/sw_invert
# : Time for random_energy0 1.842062e-03 s level: 2 proc_id: 0 /HMC/cloverdetratio:cloverdetratio_heatbath/random_energy0
# : Time for Qp_zero_pf 2.824019e-03 s level: 2 proc_id: 0 /HMC/cloverdetratio:cloverdetratio_heatbath/Qp_zero_pf
#RG_Mixed CG: N_outer: 21 
# RG_mixed CG: iter_out: 1 iter_in_sp: 2 iter_in_dp: 0
#RG_mixed CG: iter: 3 eps_sq: 1.0000e-20 t/s: 1.5636e-02
# FIXME: note the following flop counts are wrong! Consider only the time to solution!
#RG_mixed CG: flopcount (for e/o tmWilson only): t/s: 1.5636e-02 mflops_local: 3445.3 mflops: 3445.3
# : Time for solve_degenerate 1.739835e-02 s level: 2 proc_id: 0 /HMC/cloverdetratio:cloverdetratio_heatbath/solve_degenerate
# Constructing LEMON writer for file conf_monomial_solve_fail.0000.0.000000 for append = 0
# Time spent writing 2.36 Mb was 12.1 ms.
# Writing speed: 195 Mb/s (195 Mb/s per MPI process).
# Scidac checksums for gaugefield conf_monomial_solve_fail.0000.0.000000:
#   Calculated            : A = 0xe65ae1e6 B = 0x09ca5033.
FATAL ERROR
  Within solve_degenerate (reported by node 0):
    Error: solver reported -1 iterations.
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
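
When comparing runs like the one above, it can help to pull the timing lines into a structured form. The following is a minimal sketch, assuming the `# : Time for <name> <seconds> s level: <n> proc_id: <n> <path>` layout seen in this log; the regular expression and the helper names `parse_timings` and `has_fatal_error` are written for this example and are not part of tmLQCD.

```python
import re
from typing import List, NamedTuple

# Pattern inferred from the timing lines in the log above (an assumption,
# not a documented format).
TIMING_RE = re.compile(
    r"^# : Time for (?P<name>\S+) (?P<seconds>[0-9.eE+-]+) s "
    r"level: (?P<level>\d+) proc_id: (?P<proc>\d+) (?P<path>\S+)\s*$"
)


class Timing(NamedTuple):
    name: str
    seconds: float
    level: int
    proc_id: int
    path: str


def parse_timings(log_text: str) -> List[Timing]:
    """Return one Timing per matching line, in log order."""
    timings = []
    for line in log_text.splitlines():
        match = TIMING_RE.match(line)
        if match:
            timings.append(
                Timing(
                    name=match["name"],
                    seconds=float(match["seconds"]),
                    level=int(match["level"]),
                    proc_id=int(match["proc"]),
                    path=match["path"],
                )
            )
    return timings


def has_fatal_error(log_text: str) -> bool:
    """True if the log contains a line consisting of 'FATAL ERROR'."""
    return any(line.strip() == "FATAL ERROR" for line in log_text.splitlines())


# Three timing lines and the error marker copied from the log above.
SAMPLE = """\
# : Time for gauge_heatbath 3.347489e-03 s level: 1 proc_id: 0 /HMC/GAUGE:gauge_heatbath
# : Time for sw_term 2.230593e-02 s level: 2 proc_id: 0 /HMC/cloverdet:cloverdet_heatbath/sw_term
# : Time for solve_degenerate 1.739835e-02 s level: 2 proc_id: 0 /HMC/cloverdetratio:cloverdetratio_heatbath/solve_degenerate
FATAL ERROR
"""
```

For example, `parse_timings(SAMPLE)[-1].name` identifies `solve_degenerate` as the last timed region before the abort, which matches the error message in the log.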

kostrzewa commented 1 year ago

The trajectory counter written into the gauge configuration is wrong, of course, unless trajectory_counter == nstore, that is, when Nsave = 1 ...
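
One way to read the remark above is sketched below. It assumes (this is an assumption for illustration, not taken from the tmLQCD source) that `trajectory_counter` advances on every trajectory while `nstore` advances only when a configuration is saved, i.e. once every `Nsave` trajectories. The function `counters_after` is a hypothetical helper.

```python
def counters_after(n_trajectories, nsave):
    """Simulate both counters under the stated assumption and
    return (trajectory_counter, nstore)."""
    trajectory_counter = 0
    nstore = 0
    for _ in range(n_trajectories):
        trajectory_counter += 1
        # Assumed rule: a configuration is stored every `nsave` trajectories.
        if trajectory_counter % nsave == 0:
            nstore += 1
    return trajectory_counter, nstore


print(counters_after(8, 1))  # (8, 8): counters agree when Nsave = 1
print(counters_after(8, 4))  # (8, 2): counters diverge when Nsave > 1
```

Under this assumption the two counters coincide only for `Nsave = 1`, which is the case singled out in the comment.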

kostrzewa commented 1 year ago

If you find that this works as required feel free to approve and merge it.

kostrzewa commented 1 year ago

@simone-romiti does this work and can it be merged?