etmc / tmLQCD

tmLQCD is a freely available software suite providing a set of tools for lattice QCD simulations. It is mainly an HMC implementation (including PHMC and RHMC) for Wilson, Wilson Clover and Wilson twisted mass fermions, together with an inverter for different versions of the Dirac operator. The code is fully parallelised and ships with optimisations for various modern architectures, such as commodity PC clusters and the Blue Gene family.
http://www.itkp.uni-bonn.de/~urbach/software.html
GNU General Public License v3.0

partial workaround for some of the precision mismatch problems in QUDA interface #525

Closed kostrzewa closed 2 years ago

kostrzewa commented 2 years ago

We were not tracking the precisions of the gauge and clover fields present on the device. This was bound to lead to mismatches when switching from one monomial, employing an operator and solver with a particular set of precisions, to another monomial employing a different set of precisions (for cuda_prec, cuda_prec_sloppy, cuda_prec_refinement_sloppy, cuda_prec_precondition and cuda_prec_eigensolver, and the corresponding ones for clover_quda_*).
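To illustrate the kind of bookkeeping meant here, a minimal sketch in plain C follows. It is not the code in this PR: `prec_t`, `prec_set_t`, `device_field_state_t` and both functions are hypothetical names made up for the example, and no QUDA calls are involved. The idea is only to remember which precisions a field was last loaded with and to ask for a reload when the next monomial wants a different set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a precision enum (NOT QUDA's own type). */
typedef enum { PREC_INVALID = 0, PREC_HALF, PREC_SINGLE, PREC_DOUBLE } prec_t;

/* The five precisions that a monomial's operator/solver asks for. */
typedef struct {
  prec_t full;
  prec_t sloppy;
  prec_t refinement_sloppy;
  prec_t precondition;
  prec_t eigensolver;
} prec_set_t;

/* What we remember about a field (gauge or clover) resident on the device. */
typedef struct {
  bool loaded;        /* false until the field has been loaded once */
  prec_set_t precs;   /* precisions used for the most recent load */
} device_field_state_t;

/* True if the device field is missing or was loaded with other precisions. */
bool field_needs_reload(const device_field_state_t *dev, const prec_set_t *wanted) {
  assert(dev != NULL && wanted != NULL);
  if (!dev->loaded) return true;
  return dev->precs.full != wanted->full ||
         dev->precs.sloppy != wanted->sloppy ||
         dev->precs.refinement_sloppy != wanted->refinement_sloppy ||
         dev->precs.precondition != wanted->precondition ||
         dev->precs.eigensolver != wanted->eigensolver;
}

/* Record the precisions after a (hypothetical) successful load. */
void record_field_loaded(device_field_state_t *dev, const prec_set_t *wanted) {
  assert(dev != NULL && wanted != NULL);
  dev->precs = *wanted;
  dev->loaded = true;
}
```

In such a scheme one state object would be kept per device field (one for the gauge field, one for the clover field) and `field_needs_reload` would be checked before each solve.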

Unfortunately, this does not fully "solve" #517, but it does raise a new type of error there which might help resolve that too in the end.

The set of changes here has a drawback of course: the gauge and clover fields are reloaded much more frequently, instead of just causing the missing precision to be instantiated from the existing double precision field on the device. I am not sure how large the additional overhead is compared to the time spent in a trajectory.

It is especially a complete waste of time to call reorder_gauge_toQuda so frequently because this should really only be called when g_gauge_field or any of the theta angles have actually changed. freeGaugeQuda() and loadGaugeQuda(..) do have to be called, however (at least for now, since that is our mechanism for ensuring that the field is up to date).
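A possible way to avoid the redundant reordering, sketched under assumptions: keep a small cache key made of a gauge field change counter and the four theta angles, and only redo the host-side reorder when that key differs. The names `reorder_cache_t`, `gauge_id`, `reorder_is_stale` and `reorder_mark_done` are hypothetical and not part of tmLQCD or QUDA; the free/load of the device field would still happen every time, as described above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical cache key for the last reordering of the host gauge field. */
typedef struct {
  bool valid;              /* false until the first reorder has been done */
  unsigned long gauge_id;  /* assumed counter, bumped on every gauge update */
  double theta[4];         /* boundary angles used in the last reorder */
} reorder_cache_t;

/* True if the reordered buffer no longer matches the gauge field or angles. */
bool reorder_is_stale(const reorder_cache_t *c, unsigned long gauge_id,
                      const double theta[4]) {
  assert(c != NULL && theta != NULL);
  if (!c->valid || c->gauge_id != gauge_id) return true;
  for (int mu = 0; mu < 4; ++mu) {
    /* Exact comparison is intended: we only detect a changed input value. */
    if (c->theta[mu] != theta[mu]) return true;
  }
  return false;
}

/* Remember what the reordered buffer was built from. */
void reorder_mark_done(reorder_cache_t *c, unsigned long gauge_id,
                       const double theta[4]) {
  assert(c != NULL && theta != NULL);
  c->valid = true;
  c->gauge_id = gauge_id;
  for (int mu = 0; mu < 4; ++mu) c->theta[mu] = theta[mu];
}
```

The caller would wrap the reordering step in `if (reorder_is_stale(...)) { /* reorder */ reorder_mark_done(...); }` and leave the device reload unconditional.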

kostrzewa commented 2 years ago

Do not merge yet. I'm afraid there are other issues left...

kostrzewa commented 2 years ago

https://github.com/etmc/tmLQCD/pull/525/commits/6c040df7fb9c312d6adcabf672db9e031150bbd8 resolves #517 and solidifies the precision mismatch fix

The problem was that the MG Setup (in particular, I guess, the coarse operators) seems to have an internal memory of the gauge field device pointers (rather than an abstract reference, which I would expect to update with the gauge field on the device).

When we call freeGaugeQuda() in the HMC, we are left with dangling pointers in the MG and this is what causes the crazy "volume mismatches". At the same time, the current gauge and clover fields must be consistent with the precisions in the MG Setup, and any inconsistency there is what leads to the precision mismatches.
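The dependency can be pictured with a small sketch, which is an illustration under assumptions rather than the code in the commit: the device gauge field carries a generation counter that is bumped on every free/load cycle, and the MG setup remembers the generation it was last built or updated against. All type and function names below are hypothetical, and `mg_setup_refresh` merely stands in for whatever MG update call the interface actually makes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical handle for the gauge field resident on the device. */
typedef struct {
  unsigned long generation;  /* bumped each time the field is freed and reloaded */
} device_gauge_t;

/* Hypothetical record of which gauge field generation the MG setup refers to. */
typedef struct {
  bool exists;
  unsigned long built_for_generation;
} mg_setup_t;

/* Call after every free + load of the device gauge field. */
void device_gauge_reloaded(device_gauge_t *g) {
  assert(g != NULL);
  g->generation++;
}

/* True if the MG setup is absent or still refers to an old (freed) field. */
bool mg_setup_is_stale(const mg_setup_t *mg, const device_gauge_t *g) {
  assert(mg != NULL && g != NULL);
  return !mg->exists || mg->built_for_generation != g->generation;
}

/* Placeholder for the real MG update; here it only records the generation. */
void mg_setup_refresh(mg_setup_t *mg, const device_gauge_t *g) {
  assert(mg != NULL && g != NULL);
  mg->exists = true;
  mg->built_for_generation = g->generation;
}
```

With this bookkeeping, every reload of the gauge field marks the MG setup as stale, which matches the observation that the fix induces many MG Setup updates.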

I'm not happy with this because it induces lots of MG Setup updates, but these are not THAT expensive. I think this is ready to test now.

Marcogarofalo commented 2 years ago

> I'm not happy with this because it induces lots of MG Setup updates, but these are not THAT expensive. I think this is ready to test now.

but it works. Also all the valgrind messages disappear. Thanks

kostrzewa commented 2 years ago

> but it works. Also all the valgrind messages disappear. Thanks

Thanks for the test! It would be interesting to see how a profile with this code compares to the profiles that you generated a while ago.

kostrzewa commented 2 years ago

Thanks for all the tests, valgrind runs and tentative workarounds @sunpho84 @simone-romiti @Marcogarofalo @pittlerf. Without these hints I would not have been able to fix this...