Larofeticus closed this pull request 6 years ago
I'm taking a second look at this. Not ready yet.
The measured speedup from this change (30to10 mesh, 256 KNL) using 2 threads is ~8%.
Average before:
14.90136s -- (14.53823s, 14.72948s, 15.34476s, 15.23302s, 14.66132s)
Average after:
13.60519s -- (13.58743s, 13.52997s, 13.90475s, 13.51177s, 13.49203s)
This optimization only benefits runs with 2 or more threads; MPI-only runs will be unaffected.
I tested a local merge into ocean/develop and posted here: https://github.com/mark-petersen/MPAS/tree/ocean/reduce_thread_barriers_merge That passes the MPAS-Ocean nightly regression suite on grizzly (LANL) and is bit-for-bit with the head using both gnu/optimized and intel/debug. I tested with OPENMP=true and both one and two threads for gnu and intel, results all pass and remain bfb with head. Everything looks good so far. The testing includes bfb comparisons on different partitions, between thread counts, and across restarts, on several domains and configurations.
@ndkeen These framework changes are in the ACME branch: https://github.com/ACME-Climate/ACME/tree/mark-petersen/mpas/reduce_thread_barriers Please test them for improved performance.
@ndkeen and @Larofeticus The ACME branch with this MPAS PR update passes:
PET_Ln9.T62_oQU240.GMPAS-IAF.cori-knl_intel
PET_Ln9.T62_oQU240.GMPAS-IAF.cori-knl_gnu
ERS_Ld3.T62_oEC60to30v3.GMPAS-IAF.cori-knl_gnu
SMS_Ln9.T62_oQU120_ais20.MPAS_LISIO_TEST.cori-knl_gnu
SMS_Ln9.T62_oQU120_ais20.MPAS_LISIO_TEST.cori-knl_intel
@mgduda @climbfuji @matthewhoffman @akturner This PR removes thread barriers from sections of the framework, and is bit-for-bit with the previous commit on all MPAS-Ocean and ACME tests. It may speed up threaded runs on KNL by 8%. Could you test when you get a chance?
BTW, those ACME tests had this PR added to the framework in the ocean, sea ice, and land ice cores.
Using Mark's branch mark-petersen/mpas/reduce_thread_barriers, and including a quick fix for the timer bug here https://github.com/ACME-Climate/ACME/issues/1897#issuecomment-344348072, I tried a few G cases. I'm comparing results with master as of Nov 6. First I tried 60to30. I used 16 ICE and 16 OCN nodes for a total of 32 cori-knl nodes.
Deleting my results as I was using the wrong source. When changing branches, a submodule update is required to get the changes and I had not done that.
Now when I try, I get an error. This is from the 600-node case, which is the only one that has run. Smaller jobs should hopefully start soon.
38399: forrtl: severe (174): SIGSEGV, segmentation fault occurred
38399: Image PC Routine Line Source
38399: acme.exe 0000000001F42281 Unknown Unknown Unknown
38399: acme.exe 0000000001F403BB Unknown Unknown Unknown
38399: acme.exe 0000000001EE9BC4 Unknown Unknown Unknown
38399: acme.exe 0000000001EE99D6 Unknown Unknown Unknown
38399: acme.exe 0000000001E69499 Unknown Unknown Unknown
38399: acme.exe 0000000001E75306 Unknown Unknown Unknown
38399: acme.exe 000000000183C890 Unknown Unknown Unknown
38399: acme.exe 000000000054CCD4 mpas_dmpar_mp_mpa 8233 mpas_dmpar.f90
38399: acme.exe 0000000000556EFB mpas_dmpar_mp_mpa 6941 mpas_dmpar.f90
38399: acme.exe 0000000000554ABB mpas_dmpar_mp_mpa 7009 mpas_dmpar.f90
38399: acme.exe 00000000005EE7A5 mpas_stream_manag 4610 mpas_stream_manager.f90
38399: acme.exe 00000000005EB500 mpas_stream_manag 3907 mpas_stream_manager.f90
38399: acme.exe 00000000005EA8FE mpas_stream_manag 3462 mpas_stream_manager.f90
38399: acme.exe 0000000000FB8281 ocn_forward_mode_ 181 mpas_ocn_forward_mode.f90
38399: acme.exe 0000000000E5F9EC ocn_core_mp_ocn_c 89 mpas_ocn_core.f90
38399: acme.exe 0000000000B53CD0 ocn_comp_mct_mp_o 524 ocn_comp_mct.f90
38399: acme.exe 000000000042AECE component_mod_mp_ 231 component_mod.F90
38399: acme.exe 0000000000419CB1 cime_comp_mod_mp_ 1198 cime_comp_mod.F90
38399: acme.exe 0000000000427D6F MAIN__ 92 cime_driver.F90
@mark-petersen if it is not too late I will try this on the weekend, sorry for the delay.
@climbfuji Not too late at all. Yes, please test when you get a chance, and in particular, test with multiple threads and watch for any speed-up.
@Larofeticus In what timer did you see the 8% speed-up above: time integration, or a timer around a smaller region?
I'm removing @matthewhoffman and @akturner as reviewers, as the ACME testing exercises sea ice and land ice cores, and I told them about this at a Tuesday meeting.
Those are the numbers from time integration.
@mgduda or @climbfuji could you give this a try in the atmosphere core? In particular, it needs to be tested with threading on, and any speed-up in halo exchanges due to this PR would be very useful to know.
@mark-petersen Sorry I couldn't do this last weekend. Planning for this Sunday on SuperMUC using 2 threads per core.
@mark-petersen @mgduda @Larofeticus I finally did the tests with the atmosphere core that I promised to do. I ran an idealized test case (physics off, i.e. just dynamics with all its communication) for a setup with 150 cells per MPI task and 2 OpenMP threads per MPI task. The average time required for one step of the time integration went down from 0.01750s to 0.01629s, i.e. a reduction of 7% (1920 time steps taken). The sub-timers of the time integration that measure certain compute blocks but exclude the halo exchanges did not show any improvement. I therefore assume that the 7% performance gain is coming from the halo exchanges. I ran the tests without disk output to avoid any influence from this side. The results are bfb identical.
@mark-petersen Apologies for the delay! I'll test the changes in this PR this afternoon.
@climbfuji Thank you for testing. The 7% gain and the bfb match are very promising. Thanks to @Larofeticus for his probing to find these thread barriers.
@Larofeticus I just noticed that this PR is from your develop branch. In general, I think we'd like to have the PR be from its own branch. I'm not sure whether GitHub will allow you to change the source branch of a PR, so I think we have two options here: (1) close this PR and make a new PR from a new branch containing just these changes, or (2) just merge this PR as-is. My preference would be for option (1). @mark-petersen and @matthewhoffman, do either of you have a preference?
@Larofeticus Given that your develop branch has some other PRs merged onto it that aren't in the MPAS-Dev develop branch (see below), I think I'd really prefer to have the changes in this PR migrated into their own branch, to close this PR, and to open a new PR. Otherwise, these other, local PRs in your develop branch will show up in the MPAS-Dev develop branch.
I'm sorry for all of the trouble, and for the long delay in looking at this PR. Hopefully we (as collective MPAS developers) can put together an up-to-date document on standard MPAS development practices to help out new code contributors in future!
commit 1d24211e457b49f0412e35160a086ea3cb92c7aa
Merge: 705dd70 60e7681
Author: Bill Arndt <larofeticus@gmail.com>
Date: Tue Nov 14 12:37:12 2017 -0800
Merge pull request #2 from Larofeticus/revert-1-exchange_reuse
Revert "Add 3 routines to mpas_dmpar which support reuse of halo exchange dat…"
commit 705dd705f3b6a802b0c4ca26aac40eba883d195e
Merge: 8588df6 5a58c60
Author: Bill Arndt <larofeticus@gmail.com>
Date: Tue Nov 14 12:34:02 2017 -0800
Merge pull request #1 from Larofeticus/exchange_reuse
Add 3 routines to mpas_dmpar which support reuse of halo exchange dat…
Currently, none of the halo reuse code is in this PR. Those two commits are where I accidentally put it onto the same branch, saw that it also affected this PR, and then removed it. Git is super fastidious about preserving my mistake and its undo for posterity.
See #1459
Halo exchange buffer packing currently has an excessive number of implied barriers, because the !$omp do pragma, which ends with an implied barrier, occurs inside several nested loops that iterate over the linked lists commListPtr, fieldCursor, and exchListPtr.
This code moves the !$omp do out to the commList level (which tends to have 6-8 items) and thus reduces the number of thread barriers to 1 per pack_buffer call.
For exchanges moving large amounts of data with multi-dimensional fields, the impact is small; but for small and frequent exchanges, such as those during high-frequency subcycles, the cumulative effect of all those extra thread barriers is significant.
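For illustration only, here is a minimal, self-contained sketch of the barrier-reduction idea. This is not the code in this PR: the derived types (exch_list_t, comm_node_t), the missing setup/driver, and the round-robin distribution of commList nodes over threads are all invented stand-ins for the real mpas_dmpar structures and for whatever scheme the PR actually uses (an orphaned !$omp do cannot be applied directly to a pointer-chasing while loop, so the "after" version splits the short commList manually). The point is only to show how lifting the work-sharing from the innermost copy loop to the commList level trades one implied barrier per exchange-list node for a single barrier per pack call. Both routines assume they are called from inside an existing !$omp parallel region, which is how the MPAS halo exchange code is threaded.

```fortran
! Minimal sketch only -- NOT the mpas_dmpar implementation.  Types and
! names are invented stand-ins for MPAS's commList/exchList structures.
! Both routines assume an enclosing !$omp parallel region, with field
! shared and the receive/send buffers already allocated.
module halo_pack_sketch
   use omp_lib
   implicit none

   type :: exch_list_t                        ! one block of indices to copy
      integer :: nList = 0
      integer, allocatable :: srcList(:)
      type(exch_list_t), pointer :: next => null()
   end type exch_list_t

   type :: comm_node_t                        ! one destination rank's buffer
      real, allocatable :: rbuffer(:)
      type(exch_list_t), pointer :: exchList => null()
      type(comm_node_t), pointer :: next => null()
   end type comm_node_t

contains

   ! BEFORE: the !$omp do sits on the innermost copy loop, so every
   ! exchange-list node visited ends with an implied barrier.
   subroutine pack_buffers_before(commList, field)
      type(comm_node_t), pointer :: commList
      real, intent(in) :: field(:)
      type(comm_node_t), pointer :: comm
      type(exch_list_t), pointer :: exch
      integer :: i, offset

      comm => commList
      do while (associated(comm))
         offset = 0
         exch => comm % exchList
         do while (associated(exch))
            !$omp do schedule(static)
            do i = 1, exch % nList
               comm % rbuffer(offset + i) = field(exch % srcList(i))
            end do
            !$omp end do                      ! implied barrier, once per node
            offset = offset + exch % nList
            exch => exch % next
         end do
         comm => comm % next
      end do
   end subroutine pack_buffers_before

   ! AFTER: split the work at the commList level instead (here by a simple
   ! round-robin of destinations over threads), so the only synchronization
   ! left is one explicit barrier before the buffers are handed to MPI.
   subroutine pack_buffers_after(commList, field)
      type(comm_node_t), pointer :: commList
      real, intent(in) :: field(:)
      type(comm_node_t), pointer :: comm
      type(exch_list_t), pointer :: exch
      integer :: i, offset, k, myThread, nThreads

      myThread = omp_get_thread_num()
      nThreads = omp_get_num_threads()

      k = 0
      comm => commList
      do while (associated(comm))
         if (mod(k, nThreads) == myThread) then   ! this thread owns node k
            offset = 0
            exch => comm % exchList
            do while (associated(exch))
               do i = 1, exch % nList
                  comm % rbuffer(offset + i) = field(exch % srcList(i))
               end do
               offset = offset + exch % nList
               exch => exch % next
            end do
         end if
         k = k + 1
         comm => comm % next
      end do

      !$omp barrier                            ! the single barrier per call
   end subroutine pack_buffers_after

end module halo_pack_sketch
```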