Larofeticus closed this pull request 6 years ago
I'm taking a second look at this. Not ready yet.
The measured speedup from this change (30to10 mesh, 256 KNL) using 2 threads is ~8%.
Average before:
14.90136s -- (14.53823s, 14.72948s, 15.34476s, 15.23302s, 14.66132s)
Average after:
13.60519s -- (13.58743s, 13.52997s, 13.90475s, 13.51177s, 13.49203s)
This optimization only benefits runs with 2 or more threads; MPI-only runs will be unaffected.
I tested a local merge into ocean/develop and posted here: https://github.com/mark-petersen/MPAS/tree/ocean/reduce_thread_barriers_merge That passes the MPAS-Ocean nightly regression suite on grizzly (LANL) and is bit-for-bit with the head using both gnu/optimized and intel/debug. I tested with OPENMP=true and both one and two threads for gnu and intel, results all pass and remain bfb with head. Everything looks good so far. The testing includes bfb comparisons on different partitions, between thread counts, and across restarts, on several domains and configurations.
@ndkeen These framework changes are in the ACME branch: https://github.com/ACME-Climate/ACME/tree/mark-petersen/mpas/reduce_thread_barriers Please test them for improved performance.
@ndkeen and @Larofeticus The ACME branch with this MPAS PR update passes:
PET_Ln9.T62_oQU240.GMPAS-IAF.cori-knl_intel
PET_Ln9.T62_oQU240.GMPAS-IAF.cori-knl_gnu
ERS_Ld3.T62_oEC60to30v3.GMPAS-IAF.cori-knl_gnu
SMS_Ln9.T62_oQU120_ais20.MPAS_LISIO_TEST.cori-knl_gnu
SMS_Ln9.T62_oQU120_ais20.MPAS_LISIO_TEST.cori-knl_intel
@mgduda @climbfuji @matthewhoffman @akturner This PR removes thread barriers from sections of the framework, and is bit-for-bit with the previous commit on all MPAS-Ocean and ACME tests. It may speed up threaded runs on KNL by 8%. Could you test when you get a chance?
BTW, those ACME tests had this PR added to the framework in the ocean, sea ice, and land ice cores.
Using Mark's branch mark-petersen/mpas/reduce_thread_barriers, and including a quick fix for the timer bug here https://github.com/ACME-Climate/ACME/issues/1897#issuecomment-344348072, I tried a few G cases. I'm comparing results with master as of Nov 6. First I tried 60to30. I used 16 ICE and 16 OCN nodes for a total of 32 cori-knl nodes.
Deleting my results as I was using the wrong source. When changing branches, a submodule update is required to get the changes and I had not done that.
Now when I try, I get an error. This is from the 600-node case, which is the only one that has run. Smaller jobs should hopefully start soon.
38399: forrtl: severe (174): SIGSEGV, segmentation fault occurred
38399: Image PC Routine Line Source
38399: acme.exe 0000000001F42281 Unknown Unknown Unknown
38399: acme.exe 0000000001F403BB Unknown Unknown Unknown
38399: acme.exe 0000000001EE9BC4 Unknown Unknown Unknown
38399: acme.exe 0000000001EE99D6 Unknown Unknown Unknown
38399: acme.exe 0000000001E69499 Unknown Unknown Unknown
38399: acme.exe 0000000001E75306 Unknown Unknown Unknown
38399: acme.exe 000000000183C890 Unknown Unknown Unknown
38399: acme.exe 000000000054CCD4 mpas_dmpar_mp_mpa 8233 mpas_dmpar.f90
38399: acme.exe 0000000000556EFB mpas_dmpar_mp_mpa 6941 mpas_dmpar.f90
38399: acme.exe 0000000000554ABB mpas_dmpar_mp_mpa 7009 mpas_dmpar.f90
38399: acme.exe 00000000005EE7A5 mpas_stream_manag 4610 mpas_stream_manager.f90
38399: acme.exe 00000000005EB500 mpas_stream_manag 3907 mpas_stream_manager.f90
38399: acme.exe 00000000005EA8FE mpas_stream_manag 3462 mpas_stream_manager.f90
38399: acme.exe 0000000000FB8281 ocn_forward_mode_ 181 mpas_ocn_forward_mode.f90
38399: acme.exe 0000000000E5F9EC ocn_core_mp_ocn_c 89 mpas_ocn_core.f90
38399: acme.exe 0000000000B53CD0 ocn_comp_mct_mp_o 524 ocn_comp_mct.f90
38399: acme.exe 000000000042AECE component_mod_mp_ 231 component_mod.F90
38399: acme.exe 0000000000419CB1 cime_comp_mod_mp_ 1198 cime_comp_mod.F90
38399: acme.exe 0000000000427D6F MAIN__ 92 cime_driver.F90
@mark-petersen if it is not too late I will try this on the weekend, sorry for the delay.
@climbfuji Not too late at all. Yes, please test when you get a chance, and in particular, test with multiple threads and watch for any speed-up.
@Larofeticus In what timer did you see the 8% speed-up above: time integration, or a timer around a smaller region?
I'm removing @matthewhoffman and @akturner as reviewers, as the ACME testing exercises sea ice and land ice cores, and I told them about this at a Tuesday meeting.
Those are the numbers from time integration.
@mgduda or @climbfuji could you give this a try in the atmosphere core? In particular, it needs to be tested with threading on, and any speed-up in halo exchanges due to this PR would be very useful to know.
@mark-petersen Sorry I couldn't do this last weekend. Planning for this Sunday on SuperMUC using 2 threads per core.
@mark-petersen @mgduda @Larofeticus I finally did the tests with the atmosphere core that I promised to do. I ran an idealized test case (physics off, i.e. just dynamics with all its communication) for a setup with 150 cells per MPI task and 2 OpenMP threads per MPI task. The average time required for one step of the time integration went down from 0.01750s to 0.01629s, i.e. a reduction of 7% (1920 time steps taken). The sub-timers of the time integration that measure certain compute blocks but exclude the halo exchanges did not show any improvement. I therefore assume that the 7% performance gain is coming from the halo exchanges. I ran the tests without disk output to avoid any influence from this side. The results are bfb identical.
@mark-petersen Apologies for the delay! I'll test the changes in this PR this afternoon.
@climbfuji Thank you for testing. The 7% gain and the bfb match are very promising. Thanks to @Larofeticus for his probing to find these thread barriers.
@Larofeticus I just noticed that this PR is from your develop branch. In general, I think we'd like to have the PR be from its own branch. I'm not sure whether GitHub will allow you to change the source branch of a PR, so I think we have two options here: (1) close this PR and make a new PR from a new branch containing just these changes, or (2) just merge this PR as-is. My preference would be for option (1). @mark-petersen and @matthewhoffman, do either of you have a preference?
@Larofeticus Given that your develop branch has some other PRs merged onto it that aren't in the MPAS-Dev develop branch (see below), I think I'd really prefer to have the changes in this PR migrated into their own branch, to close this PR, and to open a new PR. Otherwise, these other, local PRs in your develop branch will show up in the MPAS-Dev develop branch.
I'm sorry for all of the trouble, and for the long delay in looking at this PR. Hopefully we (as collective MPAS developers) can put together an up-to-date document on standard MPAS development practices to help out new code contributors in future!
commit 1d24211e457b49f0412e35160a086ea3cb92c7aa
Merge: 705dd70 60e7681
Author: Bill Arndt <larofeticus@gmail.com>
Date: Tue Nov 14 12:37:12 2017 -0800
Merge pull request #2 from Larofeticus/revert-1-exchange_reuse
Revert "Add 3 routines to mpas_dmpar which support reuse of halo exchange dat…"
commit 705dd705f3b6a802b0c4ca26aac40eba883d195e
Merge: 8588df6 5a58c60
Author: Bill Arndt <larofeticus@gmail.com>
Date: Tue Nov 14 12:34:02 2017 -0800
Merge pull request #1 from Larofeticus/exchange_reuse
Add 3 routines to mpas_dmpar which support reuse of halo exchange dat…
Currently, none of the halo reuse code is in this PR. Those two commits are where I accidentally put it onto the same branch, saw that it also affected this PR, and then removed it. Git is super fastidious about preserving my mistake and its undo for posterity.
See #1459
Halo exchange buffer packing currently has an excessive number of implied barriers, because the !$omp do pragma, which ends with an implied barrier, occurs inside several nested loops that iterate over the linked lists commListPtr, fieldCursor, and exchListPtr.
This code moves the !$omp do out to the commList level (which tends to have 6-8 items) and thus reduces the number of thread barriers to 1 per pack_buffer call.
For exchanges moving large amounts of data with multi-dimensional fields, the impact is small; but for small and frequent exchanges, such as those during high-frequency subcycles, the cumulative effect of all those extra thread barriers is significant.
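For illustration only, here is a minimal, self-contained sketch of the barrier-reduction idea. This is not the code in this PR: the derived types (exch_list_t, comm_node_t), the missing setup/driver, and the round-robin distribution of commList nodes over threads are all invented stand-ins for the real mpas_dmpar structures and for whatever scheme the PR actually uses (an orphaned !$omp do cannot be applied directly to a pointer-chasing while loop, so the "after" version splits the short commList manually). The point is only to show how lifting the work-sharing from the innermost copy loop to the commList level trades one implied barrier per exchange-list node for a single barrier per pack call. Both routines assume they are called from inside an existing !$omp parallel region, which is how the MPAS halo exchange code is threaded.

```fortran
! Minimal sketch only -- NOT the mpas_dmpar implementation.  Types and
! names are invented stand-ins for MPAS's commList/exchList structures.
! Both routines assume an enclosing !$omp parallel region, with field
! shared and the receive/send buffers already allocated.
module halo_pack_sketch
   use omp_lib
   implicit none

   type :: exch_list_t                        ! one block of indices to copy
      integer :: nList = 0
      integer, allocatable :: srcList(:)
      type(exch_list_t), pointer :: next => null()
   end type exch_list_t

   type :: comm_node_t                        ! one destination rank's buffer
      real, allocatable :: rbuffer(:)
      type(exch_list_t), pointer :: exchList => null()
      type(comm_node_t), pointer :: next => null()
   end type comm_node_t

contains

   ! BEFORE: the !$omp do sits on the innermost copy loop, so every
   ! exchange-list node visited ends with an implied barrier.
   subroutine pack_buffers_before(commList, field)
      type(comm_node_t), pointer :: commList
      real, intent(in) :: field(:)
      type(comm_node_t), pointer :: comm
      type(exch_list_t), pointer :: exch
      integer :: i, offset

      comm => commList
      do while (associated(comm))
         offset = 0
         exch => comm % exchList
         do while (associated(exch))
            !$omp do schedule(static)
            do i = 1, exch % nList
               comm % rbuffer(offset + i) = field(exch % srcList(i))
            end do
            !$omp end do                      ! implied barrier, once per node
            offset = offset + exch % nList
            exch => exch % next
         end do
         comm => comm % next
      end do
   end subroutine pack_buffers_before

   ! AFTER: split the work at the commList level instead (here by a simple
   ! round-robin of destinations over threads), so the only synchronization
   ! left is one explicit barrier before the buffers are handed to MPI.
   subroutine pack_buffers_after(commList, field)
      type(comm_node_t), pointer :: commList
      real, intent(in) :: field(:)
      type(comm_node_t), pointer :: comm
      type(exch_list_t), pointer :: exch
      integer :: i, offset, k, myThread, nThreads

      myThread = omp_get_thread_num()
      nThreads = omp_get_num_threads()

      k = 0
      comm => commList
      do while (associated(comm))
         if (mod(k, nThreads) == myThread) then   ! this thread owns node k
            offset = 0
            exch => comm % exchList
            do while (associated(exch))
               do i = 1, exch % nList
                  comm % rbuffer(offset + i) = field(exch % srcList(i))
               end do
               offset = offset + exch % nList
               exch => exch % next
            end do
         end if
         k = k + 1
         comm => comm % next
      end do

      !$omp barrier                            ! the single barrier per call
   end subroutine pack_buffers_after

end module halo_pack_sketch
```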