MPAS-Dev / MPAS

Repository for private MPAS development prior to the MPAS v6.0 release.
Other
4 stars 0 forks source link

Halo exchange reuse in ocean core #1458

Closed Larofeticus closed 6 years ago

Larofeticus commented 6 years ago

Primary changes here are the use of mpas_dmpar reuse calls during the barotropic subcycle. This prevents the subcycle from creating and destroying the same data structure during every subcycle iteration.

There is also a lesser change where I tracked the subcycle data dependencies by hand, and modified to remove some "not sure why but an extra exchange here works" code. I originally did this to enable using larger halos which would not need to be exchanged during every subcycle, but the gains were not there so I moved on.

mark-petersen commented 6 years ago

Requires #1453 framework change before merging.

mark-petersen commented 6 years ago

I rebased this to the head of ocean/develop, then cherry-picked the commit from #1453 and squashed the original ocean changes. This is now a branch that compiles "out of the box". After #1453 is merged, we can remove that first commit on here. I tested this PR with gnu and nightly regression suite is bit for bit with head of ocean/develop.

@Larofeticus please test the new head of this PR. It should be identical to before, but now with the recent framework commits.

Larofeticus commented 6 years ago

I've seen your request for testing. Cori is down for maint today so It will need to wait a day or two.

Larofeticus commented 6 years ago

Ok, I was finally able to get a few runs with this repackaging of things and the times and output look consistent with before.

mark-petersen commented 6 years ago

@Larofeticus Now that the framework changes are merged in, I removed the framework commit here. I rebased on the head of ocean/develop and cherry-picked your previous commit. Please retest, you can use the same test case as you did on Dec 21.

Larofeticus commented 6 years ago

Ok, I got a test run through with this as it currently exists and the outcome looks fine.

mark-petersen commented 6 years ago

This PR remains in the history, but was reverted with commit 18e45cdd6, due to a memory leak when testing with ACME. See MPAS-Dev/MPAS#1490 and ACME PR https://github.com/ACME-Climate/ACME/pull/2041. @Larofeticus could you test this branch further and look for memory leaks? We will need a new PR to bring it back in.

Larofeticus commented 6 years ago

I am done with my paper and now paying attention to this.

How does a branch/PR to work on it need to be set up? I'd like to avoid all the rebasing and whatever other git nonsense traps I keep stepping on.

I've found some framework code that looks suspicious so far. Don't yet have a good plan for reproducing it locally as running 90 days of high res simulation to trigger an OOM is not a good use of resources. I haven't had luck with any performance/debugging tools that require messing with the build system or linking.

mgduda commented 6 years ago

I've generally had good experiences working with valgrind to find memory leaks, in particular, the "memcheck" tool with the --leak-check=full option. I think you'd only need to run the model long enough for the memory leak to occur once for valgrind to spot it.

mark-petersen commented 6 years ago

You can continue to test with this branch. I just restored it on your fork. When you find a problem, you will need to make a new pull request.

Larofeticus commented 6 years ago

update: ==253962== More than 10000000 total errors detected. I'm not reporting any more. ==253962== Final error counts will be inaccurate. Go fix your program!

Larofeticus commented 6 years ago

==70041== ==70041== Conditional jump or move depends on uninitialised value(s) ==70041== at 0x1775EFF: _int_free (malloc.c:3945) ==70041== by 0x174F41E: catopen (catgets.c:78) ==70041== by 0x1677380: for__issue_diagnostic (in /global/u1/w/warndt/build_MPAS-O/MPAS/ocean_model) ==70041== by 0x167F26D: for__signal_handler (in /global/u1/w/warndt/build_MPAS-O/MPAS/ocean_model) ==70041== by 0x101D3FF: ??? (in /global/u1/w/warndt/build_MPAS-O/MPAS/ocean_model) ==70041== by 0xD36B17: box_rearrange_mp_findioproc (box_rearrange.F90.in:821) ==70041==