etmc / tmLQCD

tmLQCD is a freely available software suite providing a set of tools to be used in lattice QCD simulations. It is mainly an HMC implementation (including PHMC and RHMC) for Wilson, Wilson clover and Wilson twisted mass fermions, together with inverters for different versions of the Dirac operator. The code is fully parallelised and ships with optimisations for various modern architectures, such as commodity PC clusters and the Blue Gene family.
http://www.itkp.uni-bonn.de/~urbach/software.html
GNU General Public License v3.0

WIP Quda work clover force #549

Closed. kostrzewa closed this pull request 5 months ago.

Marcogarofalo commented 1 year ago

I think that the call to reorder_spinor_eo_toQuda is necessary. However, the quda function reorder_gauge_toQuda returns an error.

Marcogarofalo commented 1 year ago

I asked the QUDA developers for help: https://github.com/lattice/quda/pull/1330#issue-1406536580

kostrzewa commented 1 year ago

> I think that the call to reorder_spinor_eo_toQuda is necessary. However, the quda function reorder_gauge_toQuda returns an error.

reorder_gauge_toQuda is our own function: what kind of error can it return?

It seems to me that where computeCloverForceQuda's third argument should be an array of arrays, you are instead simply passing spinorIn. As a quick workaround, I guess passing &spinorIn might work when only a single field is required for x, right?
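A minimal C sketch of this workaround (the wrapper function is illustrative only, and the computeCloverForceQuda call is left commented out since its exact argument list depends on the QUDA version in use):

/* Wrap a single spinor-field pointer into the array-of-pointers form
 * expected for x; passing &spinorIn achieves the same thing. */
static void call_clover_force_with_single_field(void *spinorIn) {
  void *x[1] = { spinorIn };   /* one-element array of field pointers */
  (void)x;                     /* silence unused-variable warnings in this sketch */
  /* computeCloverForceQuda(mom, dt, x, ...);  remaining arguments omitted */
}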

Marcogarofalo commented 1 year ago

> I think that the call to reorder_spinor_eo_toQuda is necessary. However, the quda function reorder_gauge_toQuda returns an error.

> reorder_gauge_toQuda is our own function: what kind of error can it return?

Sorry, I made a typo in the comment: the function that returns an error is computeCloverForceQuda, see https://github.com/lattice/quda/pull/1330

> It seems to me that where computeCloverForceQuda's third argument should be an array of arrays, you are instead simply passing spinorIn. As a quick workaround, I guess passing &spinorIn might work when only a single field is required for x, right?

Yes, I agree.

Marcogarofalo commented 1 year ago

@kostrzewa can I merge the other way around, i.e. quda_work into this quda_work_clover_force branch, to get the automatic integration test working again?

kostrzewa commented 1 year ago

> @kostrzewa can I merge the other way around, i.e. quda_work into this quda_work_clover_force branch, to get the automatic integration test working again?

Yes, that would be the way to go anyway, to synchronise before merging everything together.

kostrzewa commented 7 months ago

@Marcogarofalo using 020dfa3632ec214a11c72905ff15da0b23ceadde (tmLQCD) and c310d9c53431e2cf75a86e2cf0956ebcc2b9e12e (QUDA), I get the following behaviour on Juwels Booster:

00001033 0.540737337562 3039327.268655220047 0.000000e+00 101 1861 1115 9364 0 3.845400e+02 3.080185e-01
00001034 0.540737337562 3033882.783575160429 0.000000e+00 100 1850 1098 9277 0 1.325076e+02 3.080185e-01
00001035 0.540737337562 3023880.860261276364 0.000000e+00 101 1868 1111 9355 0 1.325901e+02 3.080185e-01

For comparison, a test run with the current quda_work head commit works fine:

00001033 0.539109865241 -0.139158694074 1.149306e+00 101 1864 1026 8308 1 4.022610e+02 3.051663e-01
00001034 0.537638696155 -0.114508185536 1.121322e+00 100 1860 1023 8298 1 2.138116e+02 3.026756e-01
00001035 0.536272387032 -0.084111206234 1.087750e+00 102 1870 1047 8383 1 2.137449e+02 3.004461e-01

using this input file: /p/scratch/isdlqcd/kostrzewa2/benchmarks/tmLQCD_QUDA_HMC/quda-tm_force_64c128/64c128_n8mpi4nt24.input

kostrzewa commented 6 months ago

> @Marcogarofalo using 020dfa3 (tmLQCD) and c310d9c53431e2cf75a86e2cf0956ebcc2b9e12e (QUDA) ...

@Marcogarofalo, which was the last QUDA commit that you tested with? It might be that some of the recent changes have messed with our interfacing with the code or, more likely, I did something wrong.

Marcogarofalo commented 6 months ago

c1ee1f7d0df32ad4329d6ef4531d5bc37f7b69a4

Marcogarofalo commented 6 months ago

I just checked that the last commit https://github.com/lattice/quda/pull/1338/commits/ddd0fa900b3cc614e2d0fbc048bc42d807ac9962 works for me in

/p/project/isdlqcd/garofalo1/builds/tmLQCD_tm_force

or

/qbigwork2/garofalo/builds/tmLQCD_fermionic_forces_pascal

Marcogarofalo commented 6 months ago

@kostrzewa Sorry, I wasn't seeing the issue running at a high debug level. mnl->pf is probably used in the acceptance step, so we need to preserve it. Hopefully it is fixed now.

kostrzewa commented 6 months ago

We may have a problem with the QUDA force inducing something very reminiscent of an integrator instability, for reasons which I currently don't understand. I have two nf=2+1+1 test runs on 24c48 and 48c96 lattices in which, after a few integration steps, solvers begin to diverge (in one case the MG fails to converge, in another a simple CG with a large rho fails to converge, and in yet another a multi-shift solve diverges and produces a NaN residual).

Running the same exact cases on the same machine (Bender A40 and A100 nodes, respectively) with an older QUDA version as well as tmLQCD without the offloaded fermionic force, one does not seem to encounter these instabilities. There may be numerous reasons for what I'm seeing, of course, such as some other change in QUDA leading to these issues. I'll need to test with this older version of tmLQCD and the latest QUDA commit to make sure.

I have another nf=2 test case where the force offloading does not lead to any issues with solvers but this is currently still running to compare the MD history with and without the QUDA fermion force.

Marcogarofalo commented 6 months ago

If you run at debug level 4, does it tell you that the forces are different? If so, by how much?

kostrzewa commented 5 months ago

This is a bit difficult to test at that debug level, since the comparison doesn't just turn off the force calculation on the GPU but also the inversions, so it takes a long time on a realistic lattice.

Marcogarofalo commented 5 months ago

I added the input parameter UseExternalLibrary = quda to compute the force using QUDA, independently of the library used for the inverter.

Debug level 4 will now rerun the derivative function with the same inverter but with the native implementation of the force.
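As a rough illustration (the monomial type and the elided parameters are placeholders; only the two option names come from this branch), the relevant piece of an input file might look like:

BeginMonomial CLOVERDETRATIO
  ...
  UseExternalInverter = quda
  UseExternalLibrary = quda
EndMonomial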

kostrzewa commented 5 months ago

Thanks, that's very useful. I'll see if this sheds any light on things. I've extracted compare_derivative into its own compilation unit and have made it a little more flexible, see https://github.com/etmc/tmLQCD/pull/577.

On my 24c48 test lattice I don't see differences in the derivative down to 1e-9 (fermion) and 1e-10 (gauge). Both of these differences are far beyond double precision, of course, but I still find it hard to believe that such small differences would be able to trigger an instability.
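For reference, a rough sketch of the kind of element-wise comparison compare_derivative performs, simplified here to flat double arrays (the real code in #577 operates on tmLQCD's derivative fields):

#include <math.h>
#include <stdio.h>

/* Print and return the largest absolute element-wise difference between the
 * host-computed and QUDA-computed derivative, e.g. ~1e-9 (fermion) and
 * ~1e-10 (gauge) in the test quoted above. */
static double compare_derivative_sketch(const double *host, const double *quda,
                                        size_t n, const char *name) {
  double max_diff = 0.0;
  for (size_t i = 0; i < n; ++i) {
    const double diff = fabs(host[i] - quda[i]);
    if (diff > max_diff) max_diff = diff;
  }
  printf("# %s: max |host - quda| = %.3e\n", name, max_diff);
  return max_diff;
}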

kostrzewa commented 5 months ago

I've started some more test runs using the functionality that you added in aeffd7a / 5d3caec and #577 on 48c96 lattices at two quark masses. One of the two runs has started and is not showing any differences > 1e-9 so far. The problem is that it always took a while for the instability to be triggered so we need to wait and see.

kostrzewa commented 5 months ago

@Marcogarofalo I think I might have at least figured out why one set of my test jobs was misbehaving. To be more specific, the ones that I referred to in:

> I have two nf=2+1+1 test runs on 24c48 and 48c96 lattices in which, after a few integration steps, solvers begin to diverge (in one case the MG fails to converge, in another a simple CG with a large rho fails to converge, and in yet another a multi-shift solve diverges and produces a NaN residual).

I was using fc31c7d before you had added the 07fc55e fix.

kostrzewa commented 5 months ago

FYI I have a 64c128 test job running on Booster in which there appear to be no problems.

/p/scratch/isdlqcd/kostrzewa2/benchmarks/tmLQCD_QUDA_HMC/quda-tm_force_64c128/jobscript/logs/log_64c128_tm_force_quda_full_n8_9214292.out

I will perform another, longer run (this one has artificially short trajectories), but so far it seems that all is well.

Marcogarofalo commented 5 months ago

Hi, thanks for the good news. I am also running a test on the 64x128 lattice in /qbigwork2/garofalo/builds/tmLQCD_fermionic_forces/phys_test/. So far the plaquette is OK, while $\Delta H$ has started to show differences:

::::::::::::::
device/output.data
::::::::::::::
00001033 0.539124377721 -0.181064112112 1.198492e+00 99 1850 1008 8217 1 7.052586e+02 3.051752e-01
00001034 0.537658649417 -0.136323347688 1.146052e+00 102 1869 1036 8334 1 6.922392e+02 3.026848e-01
00001035 0.536295420353 -0.072135765105 1.074801e+00 101 1888 1037 8464 1 6.917731e+02 3.004625e-01
00001036 0.535011552829 -0.135401181877 1.144996e+00 101 1875 1025 8362 1 7.069782e+02 2.983943e-01
00001037 0.533843994664 0.000084528700 9.999155e-01 101 1875 1033 8372 1 6.941697e+02 2.965856e-01
00001038 0.532752973206 -0.088634211570 1.092681e+00 100 1872 1018 8313 1 7.003413e+02 2.948908e-01
00001039 0.531704500459 -0.006030147895 1.006048e+00 101 1866 1025 8286 1 6.922550e+02 2.933619e-01
00001040 0.530735472711 -0.022664899006 1.022924e+00 101 1883 1028 8354 1 6.910189e+02 2.919589e-01
::::::::::::::
host/output.data
::::::::::::::
00001033 0.539124377721 -0.181256163865 1.198722e+00 99 1850 1010 8218 1 1.087587e+03 3.051752e-01
00001034 0.537658649417 -0.136414341629 1.146157e+00 102 1866 1037 8334 1 1.079608e+03 3.026848e-01
00001035 0.536289797778 -0.074235649779 1.077061e+00 101 1885 1037 8444 1 1.114638e+03 3.004565e-01
00001036 0.535016437923 -0.062977313995 1.065003e+00 100 1877 1019 8385 1 1.103632e+03 2.984491e-01
00001037 0.533841333976 -0.071600986645 1.074227e+00 100 1848 1009 8207 1 1.101621e+03 2.966407e-01

kostrzewa commented 5 months ago

Are you running the same number of trajectories per job? Otherwise the random numbers won't be identical and you are bound to see differences quite early on. After some number of trajectories there's of course nothing that you can do: the runs will diverge somewhere around 20-30 trajectories or even slightly earlier.

kostrzewa commented 5 months ago

In my test using a full 2+1+1 run on a 64c128 lattice things are looking good so far (the top seven trajectories are with the force offloading enabled):

# n8 with quda force full | quda_version quda-develop-25d85b | slurm_job_id 9214636
00001033 0.540752271381 0.043441347778 9.574887e-01 100 14226 175 12811 34394 0 1010 61291 1214 43192 932 3870 109 6159 4976 77081 1 4.504216e+03 3.080557e-01
00001034 0.540752271381 0.565656803548 5.679870e-01 102 14220 179 12795 35007 0 1040 61200 1267 43109 932 3848 109 5938 5168 78220 0 4.421921e+03 3.080557e-01
00001035 0.540713299781 0.471504589543 6.240626e-01 99 14225 175 12849 34024 0 1005 61374 1210 43367 933 3822 107 5853 4949 77301 1 4.452976e+03 3.080064e-01
00001036 0.540689619806 -0.210317837074 1.234070e+00 199 28422 278 25633 48012 0 1920 122603 1901 86464 975 7664 169 11686 7556 153829 1 8.718310e+03 3.079745e-01
00001037 0.540717490165 -0.093398010358 1.097899e+00 99 14214 174 12805 34093 0 1010 61197 1213 43164 934 3835 106 5917 4964 77340 1 4.464691e+03 3.080230e-01
00001038 0.540701838357 -0.041311362758 1.042177e+00 100 14174 176 12763 34198 0 1012 61049 1224 42988 932 3843 109 5865 4968 76331 1 4.421069e+03 3.080025e-01
00001039 0.540722992628 -0.098816601560 1.103864e+00 98 14308 174 12939 33863 0 998 61671 1214 43740 935 3829 108 5950 4894 76960 1 4.427378e+03 3.080321e-01
# n8 with quda no_force full | quda_version quda-develop-25d85b | slurm_job_id 9217535
00001033 0.540752271382 0.043261365965 9.576611e-01 100 14224 175 12813 34388 0 1010 61298 1214 43175 932 3874 109 6158 4974 77093 1 5.152877e+03 3.080557e-01
00001034 0.540752271382 0.565140699968 5.682802e-01 101 14221 180 12794 35034 0 1038 61215 1250 43091 933 3846 109 5942 5161 78216 0 5.070436e+03 3.080557e-01
00001035 0.540713299781 0.471584610641 6.240127e-01 99 14226 175 12850 34019 0 1006 61371 1209 43347 933 3822 107 5851 4957 77339 1 5.078761e+03 3.080064e-01
00001036 0.540689619817 -0.210258966312 1.233998e+00 199 28417 278 25631 48004 0 1922 122607 1908 86476 975 7666 169 11701 7541 153884 1 9.954171e+03 3.079745e-01
00001037 0.540717490182 -0.093846568838 1.098391e+00 99 14217 174 12807 34072 0 1007 61186 1211 43166 934 3838 107 5913 4956 77344 1 5.083100e+03 3.080230e-01
00001038 0.540701838370 -0.041504077613 1.042377e+00 100 14175 177 12764 34232 0 1013 61043 1225 43011 931 3841 108 5864 4971 76314 1 5.042457e+03 3.080025e-01
00001039 0.540722992685 -0.097678773105 1.102609e+00 98 14306 175 12936 33853 0 998 61648 1214 43753 935 3830 108 5942 4923 76966 1 5.047067e+03 3.080321e-01

I will update the above when the second set of trajectories with force offloading disabled has completed.

kostrzewa commented 5 months ago

I think whatever problems I seemed to have were related to the one oversight of employing fc31c7d before you had added the 07fc55e fix, plus, later on, perhaps some mismatch between tmLQCD and the QUDA version that I was using.

Marcogarofalo commented 5 months ago

> Are you running the same number of trajectories per job?

No, I'll redo the test then.

kostrzewa commented 5 months ago

The test above (https://github.com/etmc/tmLQCD/pull/549#issuecomment-1898375425) looks fine to me. Could you add something to the documentation about UseExternalLibrary = quda having to be set to use the offloaded fermionic force? I think we can then merge this and start using it in production. Thanks a lot!

Marcogarofalo commented 5 months ago

After redoing the test with the same random numbers in the host and device runs, I got good agreement after 30 trajectories:

==> device/output.data <==
00001033 0.539124377721 -0.181064112112 1.198492e+00 99 1850 1008 8217 1 6.779430e+02 3.051752e-01
00001034 0.537658649417 -0.136323347688 1.146052e+00 102 1869 1036 8334 1 6.657363e+02 3.026848e-01
...
00001063 0.516763973986 -0.057475058362 1.059159e+00 101 1924 867 7139 1 6.612186e+02 2.728876e-01

==> host/output.data <==
00001033 0.539124377721 -0.181048505008 1.198473e+00 99 1850 1008 8220 1 1.075596e+03 3.051752e-01
00001034 0.537658649417 -0.136431051418 1.146176e+00 102 1869 1034 8337 1 1.071386e+03 3.026848e-01
...
00001063 0.516763973986 -0.057487478480 1.059172e+00 101 1928 867 7138 1 1.080475e+03 2.728876e-01

kostrzewa commented 5 months ago

A comment on 44449ad: the DET and DETRATIO monomials do support EnableExternalInverter (or at least they should).

Marcogarofalo commented 5 months ago

we never tested:

  useexternalinverter = no
  UseExternalLibrary = quda

I don't see an application of this setup; at the current stage it is likely to crash.

kostrzewa commented 5 months ago

> we never tested:

True ... I guess the check for "UseExternalLibrary" should include a simultaneous check for "UseExternalInverter" and spit out an error if the two are not set together...
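A rough sketch of what such a check could look like, with placeholder struct and field names (not the actual tmLQCD monomial definition or input-parsing code):

#include <stdio.h>
#include <stdlib.h>

/* Placeholder for the relevant flags of a monomial as parsed from the input file. */
typedef struct {
  int use_external_inverter;   /* set by UseExternalInverter = quda */
  int use_external_library;    /* set by UseExternalLibrary = quda  */
} monomial_sketch;

/* Reject inputs where the force is offloaded to QUDA but the inverter is not. */
static void check_external_library_consistency(const monomial_sketch *mnl) {
  if (mnl->use_external_library && !mnl->use_external_inverter) {
    fprintf(stderr, "Error: UseExternalLibrary = quda requires "
                    "UseExternalInverter = quda to be set as well\n");
    exit(1);
  }
}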

kostrzewa commented 5 months ago

Note that I've merged with the current quda_work as there was a conflict in the documentation here.

kostrzewa commented 5 months ago

I'm a bit stuck testing this on LUMI-G because of https://github.com/lattice/quda/issues/1432

I can test using an unofficial stack based on ROCm 5.6.1, but I've already found that there are issues when both GDR and P2P are enabled with my current production code (relatively recent tmLQCD quda_work and quda-develop-02391b12). I see solvers diverging, so I first need to find a combination of ROCm 5.6.1 (x) GDR=[0,1] (x) P2P=[0,3] which works correctly before I can move on to testing the offloaded fermionic force.

kostrzewa commented 5 months ago

I was able to test this on LUMI-G now and it works perfectly fine: a 10% speedup overall in a large run on 128 nodes, which should translate into more than that at smaller scale.

kostrzewa commented 5 months ago

@Marcogarofalo this is really awesome, thanks!

Marcogarofalo commented 5 months ago

Thank you for the help!