Marcogarofalo closed this 1 month ago
The speed-up will be even greater on a machine like Leonardo or LUMI-G. I'll let Andrey know that he can run some first tests for the finite-temperature runs. I'll put you in CC @Marcogarofalo
super!
14500 -> 12500 -> 9000 !
@Marcogarofalo There's an issue with the timing on the QUDA side. It seems like the time spent in computeTMCloverForceQuda is counted internally multiple times.
What I mean is the following:
computeTMCloverForceQuda Total time = 382.651 secs
download = 95.261 secs ( 24.895%), with 2582 calls at 3.689e+04 us per call
upload = 83.468 secs ( 21.813%), with 1033 calls at 8.080e+04 us per call
init = 15.232 secs ( 3.981%), with 26165 calls at 5.821e+02 us per call
compute = 7920.459 secs (2069.889%), with 292924 calls at 2.704e+04 us per call
comms = 54.129 secs ( 14.146%), with 6426 calls at 8.423e+03 us per call
free = 20.674 secs ( 5.403%), with 236599 calls at 8.738e+01 us per call
total accounted = 8189.223 secs (2140.127%)
total missing = -7806.571 secs (-2040.127%)
WARNING: Accounted time 8189.223 secs in computeTMCloverForceQuda is greater than total time 382.651 secs
This doesn't affect anything on our side but it does mess with the QUDA profile.
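For reference, a minimal Python sketch that reproduces the inconsistency from the profile above. The parser and its regular expressions are only an assumption about the layout of the printed lines and are not part of QUDA; the sketch just sums the per-category times and compares them with the reported total.

```python
import re

# Profile lines copied from the output above.
PROFILE = """\
computeTMCloverForceQuda Total time = 382.651 secs
download = 95.261 secs ( 24.895%), with 2582 calls at 3.689e+04 us per call
upload = 83.468 secs ( 21.813%), with 1033 calls at 8.080e+04 us per call
init = 15.232 secs ( 3.981%), with 26165 calls at 5.821e+02 us per call
compute = 7920.459 secs (2069.889%), with 292924 calls at 2.704e+04 us per call
comms = 54.129 secs ( 14.146%), with 6426 calls at 8.423e+03 us per call
free = 20.674 secs ( 5.403%), with 236599 calls at 8.738e+01 us per call
"""

def parse_profile(text):
    """Return (total_seconds, {category: seconds}) for one profile block."""
    total = None
    parts = {}
    for line in text.splitlines():
        # Header line: "<name> Total time = <t> secs"
        m = re.search(r"Total time\s*=\s*([-\d.eE+]+)\s*secs", line)
        if m:
            total = float(m.group(1))
            continue
        # Category line: "<category> = <t> secs (...)"
        m = re.match(r"\s*(\w+)\s*=\s*([-\d.eE+]+)\s*secs", line)
        if m:
            parts[m.group(1)] = float(m.group(2))
    return total, parts

total, parts = parse_profile(PROFILE)
accounted = sum(parts.values())

# Flag every category that on its own already exceeds the total time.
suspicious = [name for name, secs in parts.items() if secs > total]

print(f"total = {total:.3f} s, accounted = {accounted:.3f} s")
print("categories larger than the total:", suspicious)
```

On the numbers above, `compute` is the only category that is larger than the total on its own, so that is where the time appears to be counted more than once.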
Here is a comparison of the data before and after the last commit. The speed-up cannot be seen in such a small test:
debug level 1 rel precision + no strict checks, b415eb6
00000000 0.112190826944 8544.821712773226 0.000000e+00 56 182 128 246 209 338 0 1.132254e+00 5.069101e-02
00000001 0.112190826944 10141.057194254070 0.000000e+00 56 184 127 245 206 340 0 2.598577e-01 5.069101e-02
00000002 0.112190826944 8569.440461946775 0.000000e+00 55 182 125 241 206 335 0 2.536940e-01 5.069101e-02
00000003 0.112190826944 7382.444491649191 0.000000e+00 55 181 124 239 204 334 0 2.550959e-01 5.069101e-02
debug level 1 rel precision + no strict checks, e29573f
00000000 0.112190826944 8544.821712773226 0.000000e+00 56 182 128 246 209 338 0 4.350938e+00 5.069101e-02
00000001 0.112190826944 10141.057194254070 0.000000e+00 56 184 127 245 206 340 0 2.661940e-01 5.069101e-02
00000002 0.112190826944 8569.440461946775 0.000000e+00 55 182 125 241 206 335 0 2.616898e-01 5.069101e-02
00000003 0.112190826944 7382.444491649191 0.000000e+00 55 181 124 239 204 334 0 2.592305e-01 5.069101e-02
debug level 4 rel precision + no strict checks, b415eb6
00000000 0.112190826944 8544.821712773226 0.000000e+00 56 364 128 492 209 676 0 1.407556e+00 5.069101e-02
00000001 0.112190826944 10141.057194254070 0.000000e+00 56 368 127 490 206 680 0 5.201004e-01 5.069101e-02
00000002 0.112190826944 8569.440461946775 0.000000e+00 55 364 125 482 206 670 0 5.171427e-01 5.069101e-02
00000003 0.112190826944 7382.444491649191 0.000000e+00 55 362 124 478 204 668 0 5.416352e-01 5.069101e-02
debug level 4 rel precision + no strict checks, e29573f
00000000 0.112190826944 8544.821712773226 0.000000e+00 56 364 128 492 209 676 0 1.419756e+00 5.069101e-02
00000001 0.112190826944 10141.057194254070 0.000000e+00 56 368 127 490 206 680 0 4.816360e-01 5.069101e-02
00000002 0.112190826944 8569.440461946775 0.000000e+00 55 364 125 482 206 670 0 4.727600e-01 5.069101e-02
00000003 0.112190826944 7382.444491649191 0.000000e+00 55 362 124 478 204 668 0 4.695284e-01 5.069101e-02
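The first trajectory in each block is noticeably slower than the rest, so a fairer comparison of the two commits averages the time column over trajectories 1 to 3 only. A minimal Python sketch, assuming the second-to-last column of each row is the time per trajectory in seconds; the values are copied from the rows above, and the dictionary layout and helper names are hypothetical:

```python
from statistics import mean

# Time column (assumed: second-to-last column, seconds per trajectory),
# copied from the rows above. Keys are (debug level, commit).
TIMES = {
    (1, "b415eb6"): [1.132254e+00, 2.598577e-01, 2.536940e-01, 2.550959e-01],
    (1, "e29573f"): [4.350938e+00, 2.661940e-01, 2.616898e-01, 2.592305e-01],
    (4, "b415eb6"): [1.407556e+00, 5.201004e-01, 5.171427e-01, 5.416352e-01],
    (4, "e29573f"): [1.419756e+00, 4.816360e-01, 4.727600e-01, 4.695284e-01],
}

def steady_mean(times, skip=1):
    """Mean time per trajectory, ignoring the first `skip` trajectories."""
    return mean(times[skip:])

def ratio(level, old="b415eb6", new="e29573f"):
    """Ratio new/old of the steady-state mean times at one debug level."""
    return steady_mean(TIMES[(level, new)]) / steady_mean(TIMES[(level, old)])

for level in (1, 4):
    print(f"debug level {level}: new/old = {ratio(level):.3f}")
```

With only three trajectories per block the ratios are a rough indication at best: the two commits differ by a few percent at debug level 1, and e29573f is roughly 10% faster at debug level 4.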
Awesome, this is working great. Here's a comparison on Juwels Booster on 4 nodes, 64c128 at the physical point with consistent random numbers.
no force offloading (for reference)
light force offloading only
+ ND force offloading
(first trajectory includes tuning)