MPAS-Dev / MPAS

Repository for private MPAS development prior to the MPAS v6.0 release.

reduce number of thread barriers used when packing data into halo exc… #1459

Closed · Larofeticus closed this 6 years ago

Larofeticus commented 6 years ago

Halo exchange buffer packing currently has an excessive number of implicit barriers because the !$omp do directive, which ends with an implicit barrier, occurs inside several nested loops that iterate over the commListPtr, fieldCursor, and exchListPtr linked lists.

This code moves the !$omp do out to the loop over commList (which tends to have 6-8 items) and thus reduces the number of thread barriers to one per pack_buffer call, as sketched below.
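
A minimal, self-contained sketch of the restructuring (a toy program, not the actual mpas_dmpar packing code, which walks linked lists rather than flat arrays; all names and loop bounds here are invented for illustration):

```fortran
! Toy illustration of the barrier reduction: the same packing work, with
! the !$omp do placed on the innermost loop (old) vs. the outermost
! "comm list" loop (new).
program barrier_sketch
   implicit none
   integer, parameter :: nComm = 8, nExch = 50, nItems = 100
   real :: buffer(nComm*nExch*nItems), field(nComm*nExch*nItems)
   integer :: iComm, iExch, i, off

   call random_number(field)

   !$omp parallel default(shared) private(iComm, iExch, i, off)

   ! Old layout: every thread walks the outer loops; the !$omp do on the
   ! innermost loop ends in an implicit barrier, so one pack costs
   ! nComm*nExch (= 400 here) barriers.
   do iComm = 1, nComm
      do iExch = 1, nExch
         off = ((iComm - 1)*nExch + (iExch - 1))*nItems
         !$omp do schedule(static)
         do i = 1, nItems
            buffer(off + i) = field(off + i)
         end do
         !$omp end do   ! implicit barrier, once per inner list
      end do
   end do

   ! New layout: the !$omp do sits on the short outer loop, threads pack
   ! whole comm-list entries independently, and there is a single
   ! implicit barrier per pack call.
   !$omp do schedule(dynamic)
   do iComm = 1, nComm
      do iExch = 1, nExch
         off = ((iComm - 1)*nExch + (iExch - 1))*nItems
         do i = 1, nItems
            buffer(off + i) = field(off + i)
         end do
      end do
   end do
   !$omp end do   ! the only barrier

   !$omp end parallel
   print *, 'buffer(1) =', buffer(1)
end program barrier_sketch
```

Both layouts produce the same buffer; the difference is only how many implicit barriers each thread passes through per pack.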

For exchanges moving large amounts of data with multi-dimensional fields, the impact is small, but for small and frequent exchanges, such as those during high-frequency subcycles, the cumulative cost of all those extra thread barriers is significant.

mark-petersen commented 6 years ago

This PR replaces #1449. See discussion there. @Larofeticus: I put your original description on the first comment above.

Larofeticus commented 6 years ago

@mgduda I'm now curious what atmosphere is doing differently. Can you point me at the file where that lives?

Also, I'm under the impression that in a coupled run, atmosphere runs in parallel with ocean + sea ice, and that of the two, ocean + sea ice is currently the bottleneck?

mgduda commented 6 years ago

@Larofeticus In MPAS-Atmosphere, we're still using the "classic" halo exchange routines: mpas_dmpar_exch_halo_field. The MPAS-Ocean core is using the routine mpas_dmpar_field_halo_exch, which internally builds a halo exchange group with one member.
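
Roughly, the two call styles look like this (a sketch from memory with simplified argument lists; mpas_dmpar.F is the authoritative source):

```fortran
! "Classic" routine, still used by MPAS-Atmosphere: exchange one field
! object directly (theta stands in for any field pointer).
call mpas_dmpar_exch_halo_field(theta)

! Group-based routine, used by MPAS-Ocean: look the field up by name in
! the domain; internally this builds a halo exchange group with a single
! member and then exchanges that group.
call mpas_dmpar_field_halo_exch(domain, 'theta')
```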

Larofeticus commented 6 years ago

Line wrap change made.

Also, it took me a few seconds of confusion to notice the difference between mpas_dmpar_exch_halo_field and mpas_dmpar_field_halo_exch.

I see those have no internal OpenMP at all, aside from work done on the master thread only? I'm going to guess that the thread-model differences between cores drive the choice to use one exchange or the other?
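
The pattern I mean is the usual master-only idiom, roughly like the following (a sketch of the idiom, not MPAS source; do_halo_exchange is a hypothetical placeholder, and this sits inside an existing !$omp parallel region):

```fortran
!$omp barrier             ! all threads finish producing data first
!$omp master
call do_halo_exchange()   ! hypothetical: MPI sends/recvs plus unpack
!$omp end master
!$omp barrier             ! master has no implied barrier of its own
```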

mark-petersen commented 6 years ago

@climbfuji and @mgduda can you help me understand the difference between the 7% speedup found by @climbfuji here: https://github.com/MPAS-Dev/MPAS/pull/1449#issuecomment-348863588 and the 'no change' result by @mgduda above? One must be incorrect.

climbfuji commented 6 years ago

@mark-petersen @mgduda In order to conclude that one report must be incorrect, the two experiments would have to be 100% identical, and there are too many other factors that influence the results:

1. The test that was run: I used an idealized test case without physics, which removes a large fraction of the time in each time-integration step that is not affected by this change.
2. The machines: we probably used different machines for the test, and OpenMP performance and behaviour differ quite a bit between machines. I used SuperMUC and made use of hyperthreading for the two OpenMP tasks on each core.
3. Run-to-run variability: both of us probably ran the test only once for the reference (develop) branch and for this PR, and we all know that every run on a supercomputer is different because of the nodes that get selected, other jobs running concurrently, and so on.

If one really were to do a proper assessment, a large number of runs for both versions of the code would be required. I can't explain why my test ran 6.5-7% faster, but at least that is the result of the test that I ran (see attached screenshot; left: develop; right: this PR, or rather its predecessor #1449).

[screenshot: timing comparison; left: develop, right: #1449]

Larofeticus commented 6 years ago

My change can only affect the halo exchange calls that use exchange groups. I've been led to believe that atmo uses the other type.

If atmo ONLY uses the non-grouped exchanges, then any difference in run time can only be attributed to configuration differences or machine noise. If it mixes exchange types, there is room for an effect.

As long as it's bfb, and atmo isn't the bottleneck in coupled runs, it doesn't matter much anyway?

mark-petersen commented 6 years ago

@climbfuji Thanks for taking the time to test it and explain your test. I'm not worried about the results; I was just trying to figure it out.

mark-petersen commented 6 years ago

I merged these exact commits into ocean/develop to make sure, and everything passes bfb against the previous code: 1 vs 2 threads, restart, different partitions, etc., on intel and gnu.

climbfuji commented 6 years ago

@mark-petersen I guess the least we can say is that we did not see any performance degradation, and that's a good thing!