etmc / tmLQCD

tmLQCD is a freely available software suite providing a set of tools for lattice QCD simulations. Its main component is an HMC implementation (including PHMC and RHMC) for Wilson, Wilson clover and Wilson twisted mass fermions, together with inverters for different versions of the Dirac operator. The code is fully parallelised and ships with optimisations for various modern architectures, such as commodity PC clusters and the Blue Gene family.
http://www.itkp.uni-bonn.de/~urbach/software.html
GNU General Public License v3.0

Quda work ndg force #612

Closed Marcogarofalo closed 1 month ago

kostrzewa commented 2 months ago

Awesome, this is working great. Here's a comparison on JUWELS Booster on 4 nodes, 64c128 at the physical point with consistent random numbers.

no force offloading (for reference)

00001054 0.545663519531 0.020930789411 9.792867e-01 81 14613 1688 261376 327 32074 36839 0 1658 10002 1160 54301 143 9753 4907 97531 1 1.469851e+04 3.131941e-01
00001055 0.545676315454 -0.208904463798 1.232327e+00 80 14626 1609 261254 323 32075 36858 0 1706 9976 1140 54258 140 9681 4871 97607 1 1.460724e+04 3.132264e-01
00001056 0.545682037202 0.138889044523 8.703246e-01 80 14594 1654 260446 324 31963 35986 0 1649 10182 1133 54098 148 9950 4767 96687 1 1.474566e+04 3.132276e-01
00001057 0.545682509451 0.226925754920 7.969800e-01 80 14507 1617 258577 320 31691 35898 0 1681 10081 1123 53740 140 9920 4733 96450 1 1.461649e+04 3.132220e-01

light force offloading only

00001401 0.545681973756 0.569591499865 5.657565e-01 80 14628 1603 259506 315 31451 35516 0 1624 10011 1109 53519 138 9568 4718 96323 1 1.242204e+04 3.131983e-01
00001402 0.545701067640 0.041082913056 9.597496e-01 81 14689 1610 261058 316 31719 35647 0 1627 10023 1111 53903 145 9794 4727 97032 1 1.396068e+04 3.132205e-01
00001403 0.545696849214 0.160429969430 8.517775e-01 79 14590 1584 259024 311 31397 35699 0 1634 10081 1099 53497 141 9808 4745 96752 1 1.282431e+04 3.132271e-01
00001404 0.545662972281 -0.045102979988 1.046136e+00 80 14479 1594 256674 314 31056 35580 0 1608 10038 1103 52961 143 9822 4731 95628 1 1.241667e+04 3.131858e-01

+ ND force offloading

(first trajectory includes tuning)

00001401 0.545681973797 0.569108584896 5.660298e-01 80 14628 1603 259486 315 31451 35526 0 1626 10007 1107 53504 137 9569 4719 96307 1 1.025544e+04 3.131983e-01
00001402 0.545701067693 0.040016509593 9.607736e-01 81 14690 1611 261098 316 31727 35650 0 1627 10010 1112 53930 136 9832 4725 97018 1 8.803802e+03 3.132205e-01
00001403 0.545696849273 0.159103434533 8.529081e-01 79 14586 1583 258986 312 31393 35698 0 1636 10063 1101 53510 139 9783 4743 96757 1 8.879981e+03 3.132271e-01
00001404 0.545662972387 -0.047294547781 1.048431e+00 80 14479 1596 256673 313 31054 35575 0 1606 10046 1102 52984 143 9825 4727 95639 1 9.201252e+03 3.131858e-01
kostrzewa commented 2 months ago

The speed-up will be even greater on a machine like Leonardo or LUMI-G. I'll let Andrey know that he can run some first tests for the finite-temperature runs. I'll put you in CC @Marcogarofalo

urbach commented 2 months ago

super!

kostrzewa commented 2 months ago

14500 -> 12500 -> 9000 !
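Those are the approximate per-trajectory times (in seconds) from the three tables above: no offloading, light force offloading only, and additionally ND force offloading. As a rough ratio:

```python
# Approximate per-trajectory times (secs) quoted above for the
# 64c128 run on 4 JUWELS Booster nodes.
no_offload = 14500.0  # no force offloading (reference)
light_only = 12500.0  # light force offloading only
plus_nd = 9000.0      # + ND force offloading

print(f"light force offloading: {no_offload / light_only:.2f}x")
print(f"+ ND force offloading:  {no_offload / plus_nd:.2f}x")
```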

kostrzewa commented 2 months ago

@Marcogarofalo There's an issue with the timing on the QUDA side. It seems like the time spent in computeTMCloverForceQuda is counted internally multiple times.

kostrzewa commented 2 months ago

What I mean is the following:

   computeTMCloverForceQuda Total time =   382.651 secs
                 download     =    95.261 secs ( 24.895%),       with     2582 calls at 3.689e+04 us per call
                   upload     =    83.468 secs ( 21.813%),       with     1033 calls at 8.080e+04 us per call
                     init     =    15.232 secs (  3.981%),       with    26165 calls at 5.821e+02 us per call
                  compute     =  7920.459 secs (2069.889%),      with   292924 calls at 2.704e+04 us per call
                    comms     =    54.129 secs ( 14.146%),       with     6426 calls at 8.423e+03 us per call
                     free     =    20.674 secs (  5.403%),       with   236599 calls at 8.738e+01 us per call
        total accounted       =  8189.223 secs (2140.127%)
        total missing         = -7806.571 secs (-2040.127%)
WARNING: Accounted time  8189.223 secs in computeTMCloverForceQuda is greater than total time   382.651 secs

This doesn't affect anything on our side but it does mess with the QUDA profile.
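The reported numbers are internally consistent with the stated total: each bucket's percentage is taken relative to the 382.651 s wall time, so once the compute bucket is inflated by multiple internal counting, the accounted sum blows far past 100%. A quick check of the arithmetic, using the figures from the profile above:

```python
# Reproduce the percentage arithmetic of the QUDA profile quoted above.
total = 382.651  # reported wall time of computeTMCloverForceQuda (secs)
buckets = {
    "download": 95.261,
    "upload": 83.468,
    "init": 15.232,
    "compute": 7920.459,  # inflated: internal calls counted multiple times
    "comms": 54.129,
    "free": 20.674,
}

accounted = sum(buckets.values())
missing = total - accounted
print(f"total accounted = {accounted:9.3f} secs ({100 * accounted / total:.3f}%)")
print(f"total missing   = {missing:9.3f} secs")
```

Removing the double counting in the compute bucket would bring the accounted total back under the wall time.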

Marcogarofalo commented 2 months ago

Here is a comparison of the data before and after the last commit; the speedup cannot be seen in such a small test:

debug level 1 rel precision + no strict checks b415eb6

00000000 0.112190826944 8544.821712773226 0.000000e+00 56 182 128 246 209 338 0 1.132254e+00 5.069101e-02
00000001 0.112190826944 10141.057194254070 0.000000e+00 56 184 127 245 206 340 0 2.598577e-01 5.069101e-02
00000002 0.112190826944 8569.440461946775 0.000000e+00 55 182 125 241 206 335 0 2.536940e-01 5.069101e-02
00000003 0.112190826944 7382.444491649191 0.000000e+00 55 181 124 239 204 334 0 2.550959e-01 5.069101e-02

debug level 1 rel precision + no strict checks, e29573f

00000000 0.112190826944 8544.821712773226 0.000000e+00 56 182 128 246 209 338 0 4.350938e+00 5.069101e-02
00000001 0.112190826944 10141.057194254070 0.000000e+00 56 184 127 245 206 340 0 2.661940e-01 5.069101e-02
00000002 0.112190826944 8569.440461946775 0.000000e+00 55 182 125 241 206 335 0 2.616898e-01 5.069101e-02
00000003 0.112190826944 7382.444491649191 0.000000e+00 55 181 124 239 204 334 0 2.592305e-01 5.069101e-02

debug level 4 rel precision + no strict checks b415eb6

00000000 0.112190826944 8544.821712773226 0.000000e+00 56 364 128 492 209 676 0 1.407556e+00 5.069101e-02
00000001 0.112190826944 10141.057194254070 0.000000e+00 56 368 127 490 206 680 0 5.201004e-01 5.069101e-02
00000002 0.112190826944 8569.440461946775 0.000000e+00 55 364 125 482 206 670 0 5.171427e-01 5.069101e-02
00000003 0.112190826944 7382.444491649191 0.000000e+00 55 362 124 478 204 668 0 5.416352e-01 5.069101e-02

debug level 4 rel precision + no strict checks, e29573f

00000000 0.112190826944 8544.821712773226 0.000000e+00 56 364 128 492 209 676 0 1.419756e+00 5.069101e-02
00000001 0.112190826944 10141.057194254070 0.000000e+00 56 368 127 490 206 680 0 4.816360e-01 5.069101e-02
00000002 0.112190826944 8569.440461946775 0.000000e+00 55 364 125 482 206 670 0 4.727600e-01 5.069101e-02
00000003 0.112190826944 7382.444491649191 0.000000e+00 55 362 124 478 204 668 0 4.695284e-01 5.069101e-02
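With consistent random numbers, the physics columns should be bitwise identical across the two commits, and indeed only the trajectory-time column changes. A quick consistency check on the debug-level-1 rows above (assuming, as in the earlier timings, that the second-to-last column is the trajectory time):

```python
# Rows copied verbatim from the two debug-level-1 tables above.
b415eb6 = """\
00000000 0.112190826944 8544.821712773226 0.000000e+00 56 182 128 246 209 338 0 1.132254e+00 5.069101e-02
00000001 0.112190826944 10141.057194254070 0.000000e+00 56 184 127 245 206 340 0 2.598577e-01 5.069101e-02
00000002 0.112190826944 8569.440461946775 0.000000e+00 55 182 125 241 206 335 0 2.536940e-01 5.069101e-02
00000003 0.112190826944 7382.444491649191 0.000000e+00 55 181 124 239 204 334 0 2.550959e-01 5.069101e-02""".splitlines()

e29573f = """\
00000000 0.112190826944 8544.821712773226 0.000000e+00 56 182 128 246 209 338 0 4.350938e+00 5.069101e-02
00000001 0.112190826944 10141.057194254070 0.000000e+00 56 184 127 245 206 340 0 2.661940e-01 5.069101e-02
00000002 0.112190826944 8569.440461946775 0.000000e+00 55 182 125 241 206 335 0 2.616898e-01 5.069101e-02
00000003 0.112190826944 7382.444491649191 0.000000e+00 55 181 124 239 204 334 0 2.592305e-01 5.069101e-02""".splitlines()

for before, after in zip(b415eb6, e29573f):
    cb, ca = before.split(), after.split()
    # every column except the trajectory time (index -2) must agree
    assert cb[:-2] == ca[:-2], "physics columns must agree"
    assert cb[-1] == ca[-1], "last column must agree"
print("only the trajectory-time column differs between the two commits")
```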