Closed sunpho84 closed 2 years ago
Thanks, I agree that this was clearly a bug.
Looking back, I think the fact that gauge_id
was not set was "by design". The point being that if loaded == 0
, none of the other struct members may contain valid values.
The same applies to tm_CloverState_t
and tm_CloverInverseState_t
. I would actually argue that what was pulled in here should be reverted.
Or instead, the default value should be some large negative number, because it is of course correct that _loadGaugeQuda
reads this value and might read it before it has been initialised.
The issue (in the case of the HMC) is the fact that update_tm_gauge_state
is not called after random_gauge_field
or after unit_g_gauge_field
. The solution would be to modify these functions in start.c
to call update_tm_gauge_state
.
There may be other oversights in the logic and it's good that other people are looking at this now.
this is another occurrence where the gauge_id
is referred
# TM_QUDA: Called _loadGaugeQuda for gauge_id: 0.000000
# TM_QUDA: Theta boundary conditions will be applied to gauge field
==20258== Conditional jump or move depends on uninitialised value(s)
==20258== at 0x1004843C: check_quda_gauge_state (quda_types.h:192)
==20258== by 0x1004843C: _loadGaugeQuda (quda_interface.c:581)
==20258== by 0x1004BDB3: invert_eo_degenerate_quda (quda_interface.c:2062)
==20258== by 0x1013FB13: solve_degenerate (monomial_solve.c:127)
==20258== by 0x100775A7: cloverdetratio_heatbath (cloverdetratio_monomial.c:287)
==20258== by 0x100366A3: update_tm (update_tm.c:130)
==20258== by 0x1000758F: main (hmc_tm.c:402)
==20258== Uninitialised value was created by a stack allocation
==20258== at 0x10150A18: init_global_states (init_global_states.c:25)
which points to
https://github.com/etmc/tmLQCD/blob/23003f1d66d5cdde2e2c6b2c046e0c4df1d16643/quda_types.h#L193
of course this complaint by valgrind was occurring before the fix
I think we can safely set various things to large negative numbers by default (so that they are never uninitialised) and beyond that, whenever the state is modified, we must call update_tm_*_state
.
why large negative, -1 is not enough?
since quda_gauge_state starts with gauge_id=-1 why not the same for gauge_param.gauge_id
static inline void reset_quda_gauge_state(tm_QudaGaugeState_t * const quda_gauge_state){
quda_gauge_state->gauge_id = -1;
quda_gauge_state->loaded = 0;
}
@Marcogarofalo sure. I wanted to use a large negative number because then it's clear that the thresholds are exceeded. The whole "state tracking" is quite fragile unfortunately...
Please let me know what you think of the proposal in #523
Yeah it makes sense, as long as one keep memory of updating the status where needed. The large negative number can be probably better understood
I'm in the process o fixing a nasty segmentation fault that arises in random places around. With the help of @Marcogarofalo I've managed to run the code under valgrind, and I'm proceeding to fix each complain in turn, this is the first one
My understanding is that the
gauge_id
is not set to zero right where it should (at the creation of the state), but a complicated mechanism is in place. Thegauge_id
is referred in several places (some of clearly harmless, some of which possibly harmful)