clawpack / amrclaw

AMR version of Clawpack
http://www.clawpack.org
BSD 3-Clause "New" or "Revised" License
26 stars 46 forks source link

improving timing with 8 byte integers #250

Closed rjleveque closed 5 years ago

rjleveque commented 5 years ago

Several updates to how timing is done:

A corresponding PR will be submitted to GeoClaw.

mjberger commented 5 years ago
  1. I’m not sure that continuing timing upon restart is a good idea. The run might use a different number of cores for example. I usually just keep the original timing, and the new timing,

  2. I thought the problem with negative times was not due to wrapping, but this PR makes it sound like it is, if 8 byte integers fixes it. What’s the story?

Mjb

On Aug 22, 2019, at 1:27 PM, Randall J. LeVeque notifications@github.com wrote:

Several updates to how timing is done:

Wall clock timing done with calls to system_clock now pass in 8-byte integers so many integer declarations are now (kind=8). This has two advantages: Greater granularity when timing things that take only a few ms so that adding up over many calls is perhaps more accurate. More importantly, the count_max parameter is much larger so the count does not wrap (which then gives negative elapsed time when subtracting clock_finish - clock_start). This was a frequent problem on long GeoClaw runs on certain computers.

A fix was added so that wall time is properly incremented after a restart. I think that now after a restart the timing.csv file should be accurate for both wall times and CPU times. (Unless of course you restart more than once from the same checkpoint file, since valout keeps adding lines to the file).

Two examples have modified setplot.py to plot the timing stats, which requires clawpack/visclaw#247 https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_clawpack_visclaw_pull_247&d=DwMCaQ&c=slrrB7dE8n7gBJbeO0g-IQ&r=6hyZ7UiH3y7i6PzBXqqlFnKtlUKcPQdVi676_hKDaTw&m=sgjZUbiMxMo10PRV9RXNZMR_wJe_N6iDO7XQCO3b2wI&s=T-Yns6yqvmemC5fOVAVq3JiYWFrcEHew21f0Ap1zO9k&e= to work properly (but should fail gracefully otherwise). Sample output from these examples can be viewed at http://staff.washington.edu/rjl/misc/timing_plots/ https://urldefense.proofpoint.com/v2/url?u=http-3A__staff.washington.edu_rjl_misc_timing-5Fplots_&d=DwMCaQ&c=slrrB7dE8n7gBJbeO0g-IQ&r=6hyZ7UiH3y7i6PzBXqqlFnKtlUKcPQdVi676_hKDaTw&m=sgjZUbiMxMo10PRV9RXNZMR_wJe_N6iDO7XQCO3b2wI&s=CxQaeiBofO_WyrIZWD3EUY0glGoNg5Bz7DF1uQsWuPk&e= A corresponding PR will be submitted to GeoClaw.

You can view, comment on, or merge this pull request online at:

https://github.com/clawpack/amrclaw/pull/250 https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_clawpack_amrclaw_pull_250&d=DwMCaQ&c=slrrB7dE8n7gBJbeO0g-IQ&r=6hyZ7UiH3y7i6PzBXqqlFnKtlUKcPQdVi676_hKDaTw&m=sgjZUbiMxMo10PRV9RXNZMR_wJe_N6iDO7XQCO3b2wI&s=FQQ3uVMwYgcxdmCgOQzEGt7PT3POUwcxvbTMzt1RWU0&e= Commit Summary

change all args to system_clock to integer(kind=8) fix some timing things, now works also with restarts add timing to setplot.py in two examples File Changes

M examples/advection_2d_swirl/setplot.py https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_clawpack_amrclaw_pull_250_files-23diff-2D0&d=DwMCaQ&c=slrrB7dE8n7gBJbeO0g-IQ&r=6hyZ7UiH3y7i6PzBXqqlFnKtlUKcPQdVi676_hKDaTw&m=sgjZUbiMxMo10PRV9RXNZMR_wJe_N6iDO7XQCO3b2wI&s=kOtVfWEbeaUTJqg8my3RU1zY1wkmXJGYqdIGREYNcTI&e= (24) M examples/euler_2d_quadrants/setplot.py https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_clawpack_amrclaw_pull_250_files-23diff-2D1&d=DwMCaQ&c=slrrB7dE8n7gBJbeO0g-IQ&r=6hyZ7UiH3y7i6PzBXqqlFnKtlUKcPQdVi676_hKDaTw&m=sgjZUbiMxMo10PRV9RXNZMR_wJe_N6iDO7XQCO3b2wI&s=yuuyB1UwAgffQnxrB_Licvu3gt0MInztZl4zB_XniOI&e= (25) M src/2d/advanc.f https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_clawpack_amrclaw_pull_250_files-23diff-2D2&d=DwMCaQ&c=slrrB7dE8n7gBJbeO0g-IQ&r=6hyZ7UiH3y7i6PzBXqqlFnKtlUKcPQdVi676_hKDaTw&m=sgjZUbiMxMo10PRV9RXNZMR_wJe_N6iDO7XQCO3b2wI&s=XWo-bSJWBHtgsxHMS4YpTYGu4X9MoXKj5R2j5Y_Lg3M&e= (5) M src/2d/amr2.f90 https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_clawpack_amrclaw_pull_250_files-23diff-2D3&d=DwMCaQ&c=slrrB7dE8n7gBJbeO0g-IQ&r=6hyZ7UiH3y7i6PzBXqqlFnKtlUKcPQdVi676_hKDaTw&m=sgjZUbiMxMo10PRV9RXNZMR_wJe_N6iDO7XQCO3b2wI&s=WI4b1tmHkCBP-56JaWnmxcm-C1uG0dr9iI0PV_uRNWg&e= (38) M src/2d/amr_module.f90 https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_clawpack_amrclaw_pull_250_files-23diff-2D4&d=DwMCaQ&c=slrrB7dE8n7gBJbeO0g-IQ&r=6hyZ7UiH3y7i6PzBXqqlFnKtlUKcPQdVi676_hKDaTw&m=sgjZUbiMxMo10PRV9RXNZMR_wJe_N6iDO7XQCO3b2wI&s=5MSLRwmkCnkDqltCjOa8Nsm8MGnUNpX-uAjkxpjtwIo&e= (11) M src/2d/filpatch.f90 https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_clawpack_amrclaw_pull_250_files-23diff-2D5&d=DwMCaQ&c=slrrB7dE8n7gBJbeO0g-IQ&r=6hyZ7UiH3y7i6PzBXqqlFnKtlUKcPQdVi676_hKDaTw&m=sgjZUbiMxMo10PRV9RXNZMR_wJe_N6iDO7XQCO3b2wI&s=hbheklJMAhSLqhgk2VlfGJGfOue33OnSxrz89N_UPxY&e= (2) M src/2d/filval.f90 https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_clawpack_amrclaw_pull_250_files-23diff-2D6&d=DwMCaQ&c=slrrB7dE8n7gBJbeO0g-IQ&r=6hyZ7UiH3y7i6PzBXqqlFnKtlUKcPQdVi676_hKDaTw&m=sgjZUbiMxMo10PRV9RXNZMR_wJe_N6iDO7XQCO3b2wI&s=WNjwlWdtyckIGugdrBhR7Zzx7OXIVilwWUYKMLMfcw8&e= (2) M src/2d/flglvl2.f https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_clawpack_amrclaw_pull_250_files-23diff-2D7&d=DwMCaQ&c=slrrB7dE8n7gBJbeO0g-IQ&r=6hyZ7UiH3y7i6PzBXqqlFnKtlUKcPQdVi676_hKDaTw&m=sgjZUbiMxMo10PRV9RXNZMR_wJe_N6iDO7XQCO3b2wI&s=_uCh1LXaCgLVrj9s87hOVwm1GNAhi3FX4baPFayxEFM&e= (2) M src/2d/grdfit2.f https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_clawpack_amrclaw_pull_250_files-23diff-2D8&d=DwMCaQ&c=slrrB7dE8n7gBJbeO0g-IQ&r=6hyZ7UiH3y7i6PzBXqqlFnKtlUKcPQdVi676_hKDaTw&m=sgjZUbiMxMo10PRV9RXNZMR_wJe_N6iDO7XQCO3b2wI&s=yN9BNm6o6Cv7-BNwb2BSgMY1TtfSclmyNJJFo_0ds1s&e= (4) M src/2d/regrid.f https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_clawpack_amrclaw_pull_250_files-23diff-2D9&d=DwMCaQ&c=slrrB7dE8n7gBJbeO0g-IQ&r=6hyZ7UiH3y7i6PzBXqqlFnKtlUKcPQdVi676_hKDaTw&m=sgjZUbiMxMo10PRV9RXNZMR_wJe_N6iDO7XQCO3b2wI&s=lWpPIPGOhF7RJ-gEpNePvIQcyBSL1qFwDOOKolOl3Q4&e= (2) M src/2d/tick.f https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_clawpack_amrclaw_pull_250_files-23diff-2D10&d=DwMCaQ&c=slrrB7dE8n7gBJbeO0g-IQ&r=6hyZ7UiH3y7i6PzBXqqlFnKtlUKcPQdVi676_hKDaTw&m=sgjZUbiMxMo10PRV9RXNZMR_wJe_N6iDO7XQCO3b2wI&s=RkGe-AwrTAnfP_7FJjl3ntoV5lsCzw7UAvwjLsmfic0&e= (4) M src/2d/valout.f90 https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_clawpack_amrclaw_pull_250_files-23diff-2D11&d=DwMCaQ&c=slrrB7dE8n7gBJbeO0g-IQ&r=6hyZ7UiH3y7i6PzBXqqlFnKtlUKcPQdVi676_hKDaTw&m=sgjZUbiMxMo10PRV9RXNZMR_wJe_N6iDO7XQCO3b2wI&s=fX1uO6MatA_ygGArGzdcf-3e5455k3hkUJILCA6QkwY&e= (9) Patch Links:

https://github.com/clawpack/amrclaw/pull/250.patch https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_clawpack_amrclaw_pull_250.patch&d=DwMCaQ&c=slrrB7dE8n7gBJbeO0g-IQ&r=6hyZ7UiH3y7i6PzBXqqlFnKtlUKcPQdVi676_hKDaTw&m=sgjZUbiMxMo10PRV9RXNZMR_wJe_N6iDO7XQCO3b2wI&s=tSdmOK_Pg7_qoi_cJEnD4C5OcbbbeYn2AKTUk_R53e4&e= https://github.com/clawpack/amrclaw/pull/250.diff https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_clawpack_amrclaw_pull_250.diff&d=DwMCaQ&c=slrrB7dE8n7gBJbeO0g-IQ&r=6hyZ7UiH3y7i6PzBXqqlFnKtlUKcPQdVi676_hKDaTw&m=sgjZUbiMxMo10PRV9RXNZMR_wJe_N6iDO7XQCO3b2wI&s=xOlxwbv_yDhwaYCIIIFImNW1IiEsUrdTGQLtDORSino&e= — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_clawpack_amrclaw_pull_250-3Femail-5Fsource-3Dnotifications-26email-5Ftoken-3DAAGUGC7YFLOZZE6ZAEYGNMTQF3ZBVA5CNFSM4IOY6CSKYY3PNVWWK3TUL52HS4DFUVEXG43VMWVGG33NNVSW45C7NFSM4HG4H3XQ&d=DwMCaQ&c=slrrB7dE8n7gBJbeO0g-IQ&r=6hyZ7UiH3y7i6PzBXqqlFnKtlUKcPQdVi676_hKDaTw&m=sgjZUbiMxMo10PRV9RXNZMR_wJe_N6iDO7XQCO3b2wI&s=kyKz2pMzKhlMcB1fThtywafgxpdpNVrV2DN26romej8&e=, or mute the thread https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_notifications_unsubscribe-2Dauth_AAGUGC3FVOZAWTZATBOSIIDQF3ZBVANCNFSM4IOY6CSA&d=DwMCaQ&c=slrrB7dE8n7gBJbeO0g-IQ&r=6hyZ7UiH3y7i6PzBXqqlFnKtlUKcPQdVi676_hKDaTw&m=sgjZUbiMxMo10PRV9RXNZMR_wJe_N6iDO7XQCO3b2wI&s=0cjOKdP8L2GzZi_e_iGlPY-HOmmO8ztK2Al_c968YJo&e=.

rjleveque commented 5 years ago

@mjberger:

  1. The version now in master accumulates the wall time but uses an uninitialized variable for the first output of cpu time if you have it output at the restart time, and beyond that restarts from 0. So cpu and wall time are not reported consistently. The fix accumulates both wall time and cpu time. We wrote out the timers to the restart file in an earlier PR so we could do this, and I find it useful since the main thing I use restarts for is when I decide I should have run a tsunami simulation out to a later time. To keep track of the time since restart only would require a different set of fixes to master. Should that be added as a restart option?

  2. See https://gcc.gnu.org/onlinedocs/gfortran/SYSTEM_005fCLOCK.html, in particular Note that the millisecond resolution of the kind=4 version implies that the COUNT will wrap around in roughly 25 days. Millisecond resolution means COUNT_RATE = 1000, which it is on my laptop. But on an HPC system I am using, for some reason COUNT_RATE = 10000 with kind=4, meaning it wraps every 2.5 days, and many job runs are longer. Even with a 25 day wrap and a short run you sometimes get unlucky and it wraps mid-run. With kind=8 typically COUNT_RATE is 1e6 or 1e9 and even with the nanosecond resolution it only wraps every 292 years because COUNT_MAX is huge.

Before merging, in addition to more discussion on restarts, please hold off since I realized I built in the assumption that if units[simtime] == 'minutes' then the simulation time t should be divided by 60, for example, implicitly assuming that the simulation was in units of seconds. This is right for GeoClaw, but might not be right for an amrclaw application where the user uses minutes or some other unit as the internal units of t. I'll fix this by allowing the user to specify simtime_factor as well as simtime_units.

mandli commented 5 years ago

My 2¢:

rjleveque commented 5 years ago

@mandli: I found another bug in how the timing stats were initialized after a restart, but I think it's working now.

rjleveque commented 5 years ago

I'm going to merge this for now so that travis can test https://github.com/clawpack/geoclaw/pull/405.