Quda work ndg force - GitHub issues

kostrzewa commented 2 months ago

Awesome, this is working great. Here's a comparison on Juwels Booster on 4 nodes, 64c128 at the physical point with consistent random numbers.

no force offloading (for reference)

00001054 0.545663519531 0.020930789411 9.792867e-01 81 14613 1688 261376 327 32074 36839 0 1658 10002 1160 54301 143 9753 4907 97531 1 1.469851e+04 3.131941e-01
00001055 0.545676315454 -0.208904463798 1.232327e+00 80 14626 1609 261254 323 32075 36858 0 1706 9976 1140 54258 140 9681 4871 97607 1 1.460724e+04 3.132264e-01
00001056 0.545682037202 0.138889044523 8.703246e-01 80 14594 1654 260446 324 31963 35986 0 1649 10182 1133 54098 148 9950 4767 96687 1 1.474566e+04 3.132276e-01
00001057 0.545682509451 0.226925754920 7.969800e-01 80 14507 1617 258577 320 31691 35898 0 1681 10081 1123 53740 140 9920 4733 96450 1 1.461649e+04 3.132220e-01

light force offloading only

00001401 0.545681973756 0.569591499865 5.657565e-01 80 14628 1603 259506 315 31451 35516 0 1624 10011 1109 53519 138 9568 4718 96323 1 1.242204e+04 3.131983e-01
00001402 0.545701067640 0.041082913056 9.597496e-01 81 14689 1610 261058 316 31719 35647 0 1627 10023 1111 53903 145 9794 4727 97032 1 1.396068e+04 3.132205e-01
00001403 0.545696849214 0.160429969430 8.517775e-01 79 14590 1584 259024 311 31397 35699 0 1634 10081 1099 53497 141 9808 4745 96752 1 1.282431e+04 3.132271e-01
00001404 0.545662972281 -0.045102979988 1.046136e+00 80 14479 1594 256674 314 31056 35580 0 1608 10038 1103 52961 143 9822 4731 95628 1 1.241667e+04 3.131858e-01

+ ND force offloading

(first trajectory includes tuning)

00001401 0.545681973797 0.569108584896 5.660298e-01 80 14628 1603 259486 315 31451 35526 0 1626 10007 1107 53504 137 9569 4719 96307 1 1.025544e+04 3.131983e-01
00001402 0.545701067693 0.040016509593 9.607736e-01 81 14690 1611 261098 316 31727 35650 0 1627 10010 1112 53930 136 9832 4725 97018 1 8.803802e+03 3.132205e-01
00001403 0.545696849273 0.159103434533 8.529081e-01 79 14586 1583 258986 312 31393 35698 0 1636 10063 1101 53510 139 9783 4743 96757 1 8.879981e+03 3.132271e-01
00001404 0.545662972387 -0.047294547781 1.048431e+00 80 14479 1596 256673 313 31054 35575 0 1606 10046 1102 52984 143 9825 4727 95639 1 9.201252e+03 3.131858e-01
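Since the two offloaded runs share trajectory numbers and random numbers, their observables can be compared line by line. The snippet below is a minimal sketch of such a check, assuming that the second and third columns of each output line are the plaquette and dH; the values are copied from the "light force offloading only" and "+ ND force offloading" tables above, and the helper name `max_abs_diff` is purely illustrative.

```python
# (plaquette, dH) for trajectories 1401-1404, copied from the tables above.
# Column meaning (2nd = plaquette, 3rd = dH) is an assumption.
light_only = {
    1401: (0.545681973756, 0.569591499865),
    1402: (0.545701067640, 0.041082913056),
    1403: (0.545696849214, 0.160429969430),
    1404: (0.545662972281, -0.045102979988),
}
with_nd = {
    1401: (0.545681973797, 0.569108584896),
    1402: (0.545701067693, 0.040016509593),
    1403: (0.545696849273, 0.159103434533),
    1404: (0.545662972387, -0.047294547781),
}


def max_abs_diff(run_a, run_b, column):
    """Largest absolute difference of one column over matching trajectories."""
    return max(abs(run_a[t][column] - run_b[t][column]) for t in run_a)


plaq_diff = max_abs_diff(light_only, with_nd, 0)
dh_diff = max_abs_diff(light_only, with_nd, 1)

print(f"max |delta plaquette| = {plaq_diff:.3e}")
print(f"max |delta dH|        = {dh_diff:.3e}")
```

The reference run without offloading uses different trajectory numbers (1054 to 1057), so it is left out of this comparison.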

kostrzewa commented 2 months ago

The speed-up will be even greater on a machine like Leonardo or LUMI-G. I'll let Andrey know that he can run some first tests for the finite-temperature runs. I'll put you in CC, @Marcogarofalo.

urbach commented 2 months ago

super!

kostrzewa commented 2 months ago

Time per trajectory: 14500 -> 12500 -> 9000!
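These three numbers can be reproduced from the tables above. The following is a rough sketch, assuming that the second-to-last column of each output line is the wall-clock time per trajectory in seconds; the first "+ ND force offloading" trajectory is dropped because it includes tuning.

```python
from statistics import mean

# Second-to-last column of each output line above, assumed to be the
# wall-clock time per trajectory in seconds.
times = {
    "no force offloading": [1.469851e+04, 1.460724e+04, 1.474566e+04, 1.461649e+04],
    "light force offloading only": [1.242204e+04, 1.396068e+04, 1.282431e+04, 1.241667e+04],
    # First trajectory (1.025544e+04) dropped because it includes tuning.
    "+ ND force offloading": [8.803802e+03, 8.879981e+03, 9.201252e+03],
}

# Average time per configuration and speed-up relative to the reference run.
averages = {label: mean(values) for label, values in times.items()}
reference = averages["no force offloading"]
speedups = {label: reference / avg for label, avg in averages.items()}

for label, avg in averages.items():
    print(f"{label:30s} {avg:10.1f}  speed-up x{speedups[label]:.2f}")
```

With only three or four trajectories per configuration, the averages are indicative rather than precise.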

kostrzewa commented 2 months ago

@Marcogarofalo There's an issue with the timing on the QUDA side. It seems like the time spent in computeTMCloverForceQuda is counted internally multiple times.

kostrzewa commented 2 months ago

What I mean is the following:

   computeTMCloverForceQuda Total time =   382.651 secs
                 download     =    95.261 secs ( 24.895%),       with     2582 calls at 3.689e+04 us per call
                   upload     =    83.468 secs ( 21.813%),       with     1033 calls at 8.080e+04 us per call
                     init     =    15.232 secs (  3.981%),       with    26165 calls at 5.821e+02 us per call
                  compute     =  7920.459 secs (2069.889%),      with   292924 calls at 2.704e+04 us per call
                    comms     =    54.129 secs ( 14.146%),       with     6426 calls at 8.423e+03 us per call
                     free     =    20.674 secs (  5.403%),       with   236599 calls at 8.738e+01 us per call
        total accounted       =  8189.223 secs (2140.127%)
        total missing         = -7806.571 secs (-2040.127%)
WARNING: Accounted time  8189.223 secs in computeTMCloverForceQuda is greater than total time   382.651 secs

This doesn't affect anything on our side but it does mess with the QUDA profile.
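The inconsistency in the profile can be checked with simple arithmetic. The sketch below only re-adds the numbers printed above (it does not call QUDA) and shows that the "compute" entry alone exceeds the reported total by a factor of about 20.

```python
# Values copied from the computeTMCloverForceQuda profile above, in seconds.
total = 382.651
components = {
    "download": 95.261,
    "upload": 83.468,
    "init": 15.232,
    "compute": 7920.459,
    "comms": 54.129,
    "free": 20.674,
}

accounted = sum(components.values())      # "total accounted" in the profile
missing = total - accounted               # "total missing" in the profile
compute_percent = 100.0 * components["compute"] / total
over_counted = accounted > total          # condition behind the WARNING line

print(f"accounted = {accounted:.3f} s, missing = {missing:.3f} s")
print(f"compute   = {compute_percent:.1f}% of total")
print(f"accounted time exceeds total time: {over_counted}")
```

Every entry except "compute" is well below the total, which points at that category as the place where time is accumulated more than once.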

Marcogarofalo commented 2 months ago

Here is a comparison of the data before and after the last commit. The speedup cannot be seen in such a small test:

debug level 1 rel precision + no strict checks, b415eb6

00000000 0.112190826944 8544.821712773226 0.000000e+00 56 182 128 246 209 338 0 1.132254e+00 5.069101e-02
00000001 0.112190826944 10141.057194254070 0.000000e+00 56 184 127 245 206 340 0 2.598577e-01 5.069101e-02
00000002 0.112190826944 8569.440461946775 0.000000e+00 55 182 125 241 206 335 0 2.536940e-01 5.069101e-02
00000003 0.112190826944 7382.444491649191 0.000000e+00 55 181 124 239 204 334 0 2.550959e-01 5.069101e-02

debug level 1 rel precision + no strict checks, e29573f

00000000 0.112190826944 8544.821712773226 0.000000e+00 56 182 128 246 209 338 0 4.350938e+00 5.069101e-02
00000001 0.112190826944 10141.057194254070 0.000000e+00 56 184 127 245 206 340 0 2.661940e-01 5.069101e-02
00000002 0.112190826944 8569.440461946775 0.000000e+00 55 182 125 241 206 335 0 2.616898e-01 5.069101e-02
00000003 0.112190826944 7382.444491649191 0.000000e+00 55 181 124 239 204 334 0 2.592305e-01 5.069101e-02

debug level 4 rel precision + no strict checks, b415eb6

00000000 0.112190826944 8544.821712773226 0.000000e+00 56 364 128 492 209 676 0 1.407556e+00 5.069101e-02
00000001 0.112190826944 10141.057194254070 0.000000e+00 56 368 127 490 206 680 0 5.201004e-01 5.069101e-02
00000002 0.112190826944 8569.440461946775 0.000000e+00 55 364 125 482 206 670 0 5.171427e-01 5.069101e-02
00000003 0.112190826944 7382.444491649191 0.000000e+00 55 362 124 478 204 668 0 5.416352e-01 5.069101e-02

debug level 4 rel precision + no strict checks, e29573f

00000000 0.112190826944 8544.821712773226 0.000000e+00 56 364 128 492 209 676 0 1.419756e+00 5.069101e-02
00000001 0.112190826944 10141.057194254070 0.000000e+00 56 368 127 490 206 680 0 4.816360e-01 5.069101e-02
00000002 0.112190826944 8569.440461946775 0.000000e+00 55 364 125 482 206 670 0 4.727600e-01 5.069101e-02
00000003 0.112190826944 7382.444491649191 0.000000e+00 55 362 124 478 204 668 0 4.695284e-01 5.069101e-02
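To put numbers on the small-test comparison, the timings of the two commits can be averaged per debug level. This is a minimal sketch, assuming that the second-to-last column is the time per trajectory in seconds; trajectory 0 is left out because its time differs strongly from the others, which is assumed here to be start-up or tuning overhead.

```python
from statistics import mean

# Time column for trajectories 1-3, copied from the four tables above.
# Trajectory 0 is excluded (assumed start-up/tuning overhead).
timings = {
    ("debug level 1", "b415eb6"): [2.598577e-01, 2.536940e-01, 2.550959e-01],
    ("debug level 1", "e29573f"): [2.661940e-01, 2.616898e-01, 2.592305e-01],
    ("debug level 4", "b415eb6"): [5.201004e-01, 5.171427e-01, 5.416352e-01],
    ("debug level 4", "e29573f"): [4.816360e-01, 4.727600e-01, 4.695284e-01],
}

# Ratio of mean trajectory times, e29573f relative to b415eb6.
ratios = {}
for level in ("debug level 1", "debug level 4"):
    t_b415 = mean(timings[(level, "b415eb6")])
    t_e295 = mean(timings[(level, "e29573f")])
    ratios[level] = t_e295 / t_b415
    print(f"{level}: b415eb6 {t_b415:.4f} s, e29573f {t_e295:.4f} s, "
          f"ratio {ratios[level]:.3f}")
```

With sub-second trajectories and three samples each, such ratios mostly help judge whether a difference stands out from run-to-run scatter.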

etmc / tmLQCD

Quda work ndg force #612
