etmc / tmLQCD

tmLQCD is a freely available software suite providing a set of tools to be used in lattice QCD simulations. It is mainly an HMC implementation (including PHMC and RHMC) for Wilson, Wilson clover and Wilson twisted mass fermions, together with inverters for different versions of the Dirac operator. The code is fully parallelised and ships with optimisations for various modern architectures, such as commodity PC clusters and the Blue Gene family.
http://www.itkp.uni-bonn.de/~urbach/software.html
GNU General Public License v3.0

Towards a GPU version of the gauge force #497

Closed: urbach closed this issue 2 years ago

urbach commented 3 years ago

can we use this issue to organise ourselves on this topic?

I'm still lacking a bit of an overview, so how do we best proceed here?

Also involves @Marcogarofalo and @sunpho84

kostrzewa commented 3 years ago

I think

https://github.com/lattice/quda/pull/1136

makes clear what happens on the QUDA side.

The C-interface function is computeGaugeForceQuda (https://github.com/qcdcode/quda/blob/b681990fde2ea40de4e5e3637107c0c0becc1ee8/lib/interface_quda.cpp#L4142)

and one needs to study the expected format of the output momentum field, which likely needs to be reordered in the same way as the gauge field (Z -> X, X -> Z), and which might use QUDA_RECONSTRUCT_8 or QUDA_RECONSTRUCT_10 (the latter is for HISQ, I think), although it might also be stored in full as QUDA_RECONSTRUCT_18.

The input gauge field, by contrast, is easy to take care of: it just needs a call to _loadGaugeQuda from our quda_interface.c, which is a no-op if the gauge field on the device is current.

Finally, I would propose introducing a wrapper function for the gauge derivative calculation which then hands off to QUDA (or another external library); a sketch follows below. Of course, one could also hand off directly in gauge_derivative, at the cost of losing generality.
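
A minimal sketch of what such a dispatch could look like; all names apart from gauge_derivative, monomial and hamiltonian_field_t are hypothetical, purely for illustration:

```c
#include "monomial/monomial.h"    /* tmLQCD headers, paths assumed */
#include "hamiltonian_field.h"

/* Hypothetical wrapper: dispatch the gauge derivative either to the
 * existing CPU implementation or to an external library, depending on
 * a (hypothetical) per-monomial external_library setting. */
void gauge_derivative_dispatch(const int id, hamiltonian_field_t * const hf) {
  monomial * const mnl = &monomial_list[id];
  if (mnl->external_library == EXT_LIB_QUDA) {
    /* hypothetical QUDA hand-off, to live in quda_interface.c */
    compute_gauge_derivative_quda(id, hf);
  } else {
    /* existing CPU implementation in gauge_monomial.c */
    gauge_derivative(id, hf);
  }
}
```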

As discussed yesterday, one could also rename UseExternalInverter to UseExternalLibrary in the process, to arrive at more consistent parameter naming.

Then one could specify:

BeginMonomial GAUGE
  Type = Iwasaki
  Timescale = 0
  UseExternalLibrary = quda
EndMonomial

in the input file to offload the derivative to QUDA (note that the UseExternalInverter parameter is currently not parsed for GAUGEMONOMIAL).

The same game can also be played for computing the actual gauge energy, although I expect this to be a very minor part of the total run time.
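
For reference, QUDA exposes the plaquette via plaqQuda() in quda.h; a minimal sketch of how the gauge energy offload could look (the normalisation relative to tmLQCD's conventions is an assumption to be checked):

```c
#include "quda.h"

/* sketch: gauge energy via QUDA's plaquette measurement */
double gauge_energy_quda_sketch(void) {
  double plaq[3]; /* [0]: total, [1]: spatial, [2]: temporal plaquette */
  plaqQuda(plaq); /* assumes the device gauge field is current */
  /* mapping plaq[0] to tmLQCD's gauge energy normalisation is an
   * assumption that would need to be verified */
  return plaq[0];
}
```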

kostrzewa commented 3 years ago

Work should happen as a PR on top of https://github.com/etmc/tmLQCD/pull/491. Alternatively we can merge the latter in and have work take place in a PR on top of https://github.com/etmc/tmLQCD/pull/490.

kostrzewa commented 3 years ago

Also keep in mind our kanban board https://github.com/etmc/tmLQCD/projects/2

sunpho84 commented 3 years ago

The kanban is really cool! But is it enough to drag and drop stuff across columns to actually get things done?

Apart from this, I have been exploring the gauge_monomial.c file. I would say that the switch to the QUDA gauge force calculation should replace almost the whole body of the gauge_derivative routine, with just some basic link reordering and the final call to _trace_lambda_mul_add_assign. What else?

I think @sbacchio should be included in this issue since he has already done some studies.

kostrzewa commented 3 years ago

> The kanban is really cool! But is it enough to drag and drop stuff across columns to actually get things done?

no, of course not, but it helps to keep an overview of what's going on (and, ideally, of who is working on what...)

> Apart from this, I have been exploring the gauge_monomial.c file. I would say that the switch to the QUDA gauge force calculation should replace almost the whole body of the gauge_derivative routine, with just some basic link reordering and the final call to _trace_lambda_mul_add_assign. What else?

I agree. It might be that even the trace is not necessary (as it might already be done by QUDA).
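
Structurally, the offloaded routine might then reduce to something like the following sketch; compute_gauge_force_quda and reorder_mom_from_quda are hypothetical helpers, and the _loadGaugeQuda signature is assumed:

```c
#include "monomial/monomial.h"    /* tmLQCD headers, paths assumed */
#include "hamiltonian_field.h"

/* Sketch of a reduced gauge_derivative; helper names are hypothetical. */
void gauge_derivative_quda(const int id, hamiltonian_field_t * const hf) {
  monomial * const mnl = &monomial_list[id];

  /* 1) ensure the device gauge field is current (no-op if it already is) */
  _loadGaugeQuda(mnl->solver_params.compression_type); /* signature assumed */

  /* 2) let QUDA compute the plaquette force into its momentum field */
  compute_gauge_force_quda(id, hf); /* hypothetical wrapper */

  /* 3) reorder (Z <-> X) back to tmLQCD's layout and accumulate into
   *    hf->derivative, i.e. the _trace_lambda_mul_add_assign step,
   *    unless QUDA already returns the traceless anti-hermitian part */
  reorder_mom_from_quda(hf->derivative, mnl->factor); /* hypothetical */
}
```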

sunpho84 commented 3 years ago

I've dug around a bit; it looks like the top-level interface

https://github.com/lattice/quda/blob/080cb1a83b13572df321b9be1891a9ff126c4e2d/include/quda.h#L1256

even does the full update of the momenta.

If one calls the innermost routine,

https://github.com/lattice/quda/blob/080cb1a83b13572df321b9be1891a9ff126c4e2d/lib/gauge_force.cu#L229

one might avoid this. I don't think that compression "10" is the kind of projection you're aiming at, though.

kostrzewa commented 3 years ago

I didn't realize that my reply to this had not appeared. The RECONSTRUCT_10 compression for momenta seems to be 10 numbers per momentum, with the last one ignored (or used for staggered actions or anisotropy): three imaginary parts on the diagonal and three complex entries off the diagonal.

Unfortunately, this can't simply be projected to what we want with what exists in QUDA (using one of the copyGauge instances), as RECONSTRUCT_8 instead implements appendix A.2 of https://arxiv.org/pdf/0911.3191.pdf
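
To make the momentum layout concrete, here is a sketch of unpacking one RECONSTRUCT_10 element into a full anti-hermitian 3x3 matrix. The exact ordering of the 10 reals is an assumption and would have to be checked against QUDA's copy routines:

```c
#include <complex.h>

typedef struct { double _Complex m[3][3]; } su3_full;

/* in: 10 doubles per link; 9 are meaningful, the 10th is ignored
 * (or used for staggered actions / anisotropy). Assumed ordering:
 * (re,im) of A01, A02, A12, then Im A00, Im A11, Im A22. */
static su3_full unpack_mom10(const double *in) {
  su3_full A;
  A.m[0][1] = in[0] + in[1] * I;
  A.m[0][2] = in[2] + in[3] * I;
  A.m[1][2] = in[4] + in[5] * I;
  A.m[0][0] = in[6] * I;
  A.m[1][1] = in[7] * I;
  A.m[2][2] = in[8] * I;
  /* anti-hermiticity fixes the lower triangle */
  A.m[1][0] = -conj(A.m[0][1]);
  A.m[2][0] = -conj(A.m[0][2]);
  A.m[2][1] = -conj(A.m[1][2]);
  return A;
}
```

Projecting this onto tmLQCD's 8-real su3adj representation would then be the usual Gell-Mann decomposition, i.e. the _trace_lambda step.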

urbach commented 3 years ago

I'm not able to compile #490: lots of undefined global variables. Maybe because I don't have tmlqcd_config.h generated. But I don't understand why that is.

kostrzewa commented 3 years ago

> I'm not able to compile #490: lots of undefined global variables. Maybe because I don't have tmlqcd_config.h generated. But I don't understand why that is.

1. Are you working in a fresh build directory?
2. Was the source code in a new directory (and did you run autoconf to generate a new configure file)?

The transition from config.h to tmlqcd_config.h and the inclusion of the auto-generated tmlqcd_config_internal.h is unfortunately rather precarious when working in an existing build directory, as I've changed which files are auto-generated and which are used from the source directory. This leads to a dependency mismatch when working in an existing directory.

In the build directory, only include/tmlqcd_config_internal.h should exist; if there are other files there, delete them.

urbach commented 3 years ago

I had done this transition already at some point.

The source code is not in a new directory, unfortunately. I'll lose all my local branches etc. if I do so.

urbach commented 3 years ago

using a fresh build directory fixed the compile, thanks

urbach commented 3 years ago

how do we lose generality by handing off in gauge_derivative directly?

kostrzewa commented 3 years ago

> how do we lose generality by handing off in gauge_derivative directly?

In the sense that writing a general wrapper function which does the hand-off forces one to think about a clean interface, and might in the end allow this to also be used for the gradient flow. Also, one might consider writing bits and pieces of device code in simpler libraries (for architectures beyond the accelerators supported by QUDA) and employing these via the same mechanism.

The other point is that we will also need to update the momentum field from the fermionic monomials, so thinking about how to keep the momentum field on the device and the CPU side in sync is a good exercise. If one thinks about this only for the gauge monomial, one might have to duplicate a lot of code in the end.
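
As a sketch of the kind of bookkeeping this implies (all names are hypothetical, modelled on how quda_interface.c tracks the gauge field state):

```c
#include "su3adj.h"  /* tmLQCD header, path assumed */

/* hypothetical device/host momentum state tracking */
typedef enum { TM_QUDA_MOM_STALE, TM_QUDA_MOM_CURRENT } tm_quda_mom_state_t;

static tm_quda_mom_state_t mom_state = TM_QUDA_MOM_STALE;

/* any CPU-side write to the momenta must invalidate the device copy */
void tm_quda_mark_mom_stale(void) { mom_state = TM_QUDA_MOM_STALE; }

/* called by every monomial that wants to update momenta on the device */
void tm_quda_ensure_mom_on_device(su3adj ** const mom) {
  if (mom_state == TM_QUDA_MOM_CURRENT) return; /* avoid a redundant transfer */
  /* reorder to QUDA layout and upload here (e.g. via a momResidentQuda-style
   * call); the details are left open in this sketch */
  mom_state = TM_QUDA_MOM_CURRENT;
}
```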

urbach commented 3 years ago

looking at the corresponding QUDA function I'm not sure we can reuse this for the gradient flow. To me this function appears to do exactly what we do in gauge_derivative. But I'll have to check how the gradient flow is implemented in QUDA.

Keeping the momenta on the GPU and CPU in sync is important to think about. Currently, it seems to me that it is a separate issue, though.

kostrzewa commented 3 years ago

> looking at the corresponding QUDA function I'm not sure we can reuse this for the gradient flow. To me this function appears to do exactly what we do in gauge_derivative. But I'll have to check how the gradient flow is implemented in QUDA.

Sure, the function discussed above is not suitable for the gradient flow. In principle, however, once one has a clean interface for offloading the staple calculation, this can be re-used identically in our gradient flow routines, such that we don't actually have to mess around with QUDA's gradient flow and instead keep our RK integrator with just the kernels handed off to the device.
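
To illustrate: with a hypothetical device-side force kernel ext_flow_force (computing Z <- c1*eps*F(V) + c2*Z) and an exponentiation/update step apply_exp_update, the third-order Runge-Kutta step of Lüscher's scheme would keep exactly its current structure:

```c
/* Wilson flow RK3 step with only the kernels offloaded; ext_flow_force
 * and apply_exp_update are hypothetical device wrappers, and su3/su3adj
 * are tmLQCD's field types. */
void wilson_flow_step_ext(su3 ** const V, su3adj ** const Z, const double eps) {
  ext_flow_force(V, Z, 0.25 * eps, 0.0);                /* Z = 1/4 eps F(V)           */
  apply_exp_update(V, Z);                               /* V = exp(Z) V               */
  ext_flow_force(V, Z, (8.0 / 9.0) * eps, -17.0 / 9.0); /* Z = 8/9 eps F(V) - 17/9 Z  */
  apply_exp_update(V, Z);                               /* V = exp(Z) V               */
  ext_flow_force(V, Z, 0.75 * eps, -1.0);               /* Z = 3/4 eps F(V) - Z       */
  apply_exp_update(V, Z);                               /* V = exp(Z) V               */
}
```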

> Keeping the momenta on the GPU and CPU in sync is important to think about. Currently, it seems to me that it is a separate issue, though.

I'm not sure. computeGaugeForceQuda and the corresponding fermionic force routines combine what we call the derivative with what is done in update_momenta. I think we can work around that, however, by passing dt=1.0 to computeGaugeForceQuda. We will still have to project down to the 8-real representation after having done so.
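
A sketch of that call, assuming QUDA's convention that directions 0..3 are positive and 7-d denotes the negative of direction d; the path convention and the coefficient normalisation are assumptions that would need to be verified against QUDA's gauge force test:

```c
#include "quda.h"

/* sketch: Wilson plaquette force via computeGaugeForceQuda with dt = 1.0,
 * so that QUDA's momentum update reduces to mom += force and the step
 * size stays under tmLQCD's control */
void gauge_force_quda_sketch(void *mom, void *gauge, const double beta,
                             QudaGaugeParam *param) {
  const int num_paths = 6, max_length = 3;
  static int path_buf[4][6][3];
  static int *path_ptr[4][6];
  int **input_path_buf[4];
  int path_length[6] = {3, 3, 3, 3, 3, 3};
  double loop_coeff[6];

  /* the normalisation of the coefficients is an assumption */
  for (int i = 0; i < num_paths; i++) loop_coeff[i] = beta / 3.0;

  /* for each link direction mu there are 6 plaquette staples of 3 steps */
  for (int mu = 0; mu < 4; mu++) {
    int p = 0;
    for (int nu = 0; nu < 4; nu++) {
      if (nu == mu) continue;
      /* upper staple:  nu, -mu, -nu */
      path_buf[mu][p][0] = nu;     path_buf[mu][p][1] = 7 - mu;
      path_buf[mu][p][2] = 7 - nu; p++;
      /* lower staple: -nu, -mu,  nu */
      path_buf[mu][p][0] = 7 - nu; path_buf[mu][p][1] = 7 - mu;
      path_buf[mu][p][2] = nu;     p++;
    }
    for (int i = 0; i < num_paths; i++) path_ptr[mu][i] = path_buf[mu][i];
    input_path_buf[mu] = path_ptr[mu];
  }

  /* dt = 1.0: QUDA adds the bare force to mom, tmLQCD scales by the
   * integrator step size itself */
  computeGaugeForceQuda(mom, gauge, input_path_buf, path_length, loop_coeff,
                        num_paths, max_length, 1.0, param);
}
```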