etmc / tmLQCD

tmLQCD is a freely available software suite providing a set of tools to be used in lattice QCD simulations. This is mainly a HMC implementation (including PHMC and RHMC) for Wilson, Wilson Clover and Wilson twisted mass fermions and inverter for different versions of the Dirac operator. The code is fully parallelised and ships with optimisations for various modern architectures, such as commodity PC clusters and the Blue Gene family.
http://www.itkp.uni-bonn.de/~urbach/software.html
GNU General Public License v3.0
32 stars 47 forks source link

Ensure gauge_id member is initialized when the gauge_status is created #522

Closed sunpho84 closed 2 years ago

sunpho84 commented 2 years ago

I'm in the process o fixing a nasty segmentation fault that arises in random places around. With the help of @Marcogarofalo I've managed to run the code under valgrind, and I'm proceeding to fix each complain in turn, this is the first one

==20258== Use of uninitialised value of size 8
==20258==    at 0xBD369EC: __printf_fp_l (in /usr/lib64/power9/libc-2.28.so)
==20258==    by 0x1FFF007AAF: ???
==20258==    by 0xBD32623: printf_positional (in /usr/lib64/power9/libc-2.28.so)
==20258==    by 0xBD33BDF: vfprintf@@GLIBC_2.17 (in /usr/lib64/power9/libc-2.28.so)
==20258==    by 0x10045397: vprintf (stdio.h:41)
==20258==    by 0x10045397: tm_debug_printf (tm_debug_printf.c:35)
==20258==    by 0x100481F7: _loadGaugeQuda (quda_interface.c:565)
==20258==    by 0x1004BDB3: invert_eo_degenerate_quda (quda_interface.c:2062)
==20258==    by 0x1013FB13: solve_degenerate (monomial_solve.c:127)
==20258==    by 0x100775A7: cloverdetratio_heatbath (cloverdetratio_monomial.c:287)
==20258==    by 0x100366A3: update_tm (update_tm.c:130)
==20258==    by 0x1000758F: main (hmc_tm.c:402)
==20258==  Uninitialised value was created by a stack allocation
==20258==    at 0x10150A18: init_global_states (init_global_states.c:25)

My understanding is that the gauge_id is not set to zero right where it should (at the creation of the state), but a complicated mechanism is in place. The gauge_id is referred in several places (some of clearly harmless, some of which possibly harmful)

kostrzewa commented 2 years ago

Thanks, I agree that this was clearly a bug.

kostrzewa commented 2 years ago

Looking back, I think the fact that gauge_id was not set was "by design". The point being that if loaded == 0, none of the other struct members may contain valid values.

kostrzewa commented 2 years ago

The same applies to tm_CloverState_t and tm_CloverInverseState_t. I would actually argue that what was pulled in here should be reverted.

kostrzewa commented 2 years ago

Or instead, the default value should be some large negative number, because it is of course correct that _loadGaugeQuda reads this value and might read it before it has been initialised.

kostrzewa commented 2 years ago

The issue (in the case of the HMC) is the fact that update_tm_gauge_state is not called after random_gauge_field or after unit_g_gauge_field. The solution would be to modify these functions in start.c to call update_tm_gauge_state.

kostrzewa commented 2 years ago

There may be other oversights in the logic and it's good that other people are looking at this now.

sunpho84 commented 2 years ago

this is another occurrence where the gauge_id is referred

# TM_QUDA: Called _loadGaugeQuda for gauge_id: 0.000000
# TM_QUDA: Theta boundary conditions will be applied to gauge field
==20258== Conditional jump or move depends on uninitialised value(s)
==20258==    at 0x1004843C: check_quda_gauge_state (quda_types.h:192)
==20258==    by 0x1004843C: _loadGaugeQuda (quda_interface.c:581)
==20258==    by 0x1004BDB3: invert_eo_degenerate_quda (quda_interface.c:2062)
==20258==    by 0x1013FB13: solve_degenerate (monomial_solve.c:127)
==20258==    by 0x100775A7: cloverdetratio_heatbath (cloverdetratio_monomial.c:287)
==20258==    by 0x100366A3: update_tm (update_tm.c:130)
==20258==    by 0x1000758F: main (hmc_tm.c:402)
==20258==  Uninitialised value was created by a stack allocation
==20258==    at 0x10150A18: init_global_states (init_global_states.c:25)

which points to

https://github.com/etmc/tmLQCD/blob/23003f1d66d5cdde2e2c6b2c046e0c4df1d16643/quda_types.h#L193

sunpho84 commented 2 years ago

of course this complaint by valgrind was occurring before the fix

kostrzewa commented 2 years ago

I think we can safely set various things to large negative numbers by default (so that they are never uninitialised) and beyond that, whenever the state is modified, we must call update_tm_*_state.

Marcogarofalo commented 2 years ago

why large negative, -1 is not enough?

since quda_gauge_state starts with gauge_id=-1 why not the same for gauge_param.gauge_id

static inline void reset_quda_gauge_state(tm_QudaGaugeState_t * const quda_gauge_state){
  quda_gauge_state->gauge_id = -1;
  quda_gauge_state->loaded = 0;
}
kostrzewa commented 2 years ago

@Marcogarofalo sure. I wanted to use a large negative number because then it's clear that the thresholds are exceeded. The whole "state tracking" is quite fragile unfortunately...

kostrzewa commented 2 years ago

Please let me know what you think of the proposal in #523

sunpho84 commented 2 years ago

Yeah it makes sense, as long as one keep memory of updating the status where needed. The large negative number can be probably better understood