Closed Larofeticus closed 6 years ago
Requires #1453 framework change before merging.
I rebased this to the head of ocean/develop
, then cherry-picked the commit from #1453 and squashed the original ocean changes. This is now a branch that compiles "out of the box". After #1453 is merged, we can remove that first commit on here. I tested this PR with gnu and nightly regression suite is bit for bit with head of ocean/develop.
@Larofeticus please test the new head of this PR. It should be identical to before, but now with the recent framework commits.
I've seen your request for testing. Cori is down for maint today so It will need to wait a day or two.
Ok, I was finally able to get a few runs with this repackaging of things and the times and output look consistent with before.
@Larofeticus Now that the framework changes are merged in, I removed the framework commit here. I rebased on the head of ocean/develop
and cherry-picked your previous commit. Please retest, you can use the same test case as you did on Dec 21.
Ok, I got a test run through with this as it currently exists and the outcome looks fine.
This PR remains in the history, but was reverted with commit 18e45cdd6, due to a memory leak when testing with ACME. See MPAS-Dev/MPAS#1490 and ACME PR https://github.com/ACME-Climate/ACME/pull/2041. @Larofeticus could you test this branch further and look for memory leaks? We will need a new PR to bring it back in.
I am done with my paper and now paying attention to this.
How does a branch/PR to work on it need to be set up? I'd like to avoid all the rebasing and whatever other git nonsense traps I keep stepping on.
I've found some framework code that looks suspicious so far. Don't yet have a good plan for reproducing it locally as running 90 days of high res simulation to trigger an OOM is not a good use of resources. I haven't had luck with any performance/debugging tools that require messing with the build system or linking.
I've generally had good experiences working with valgrind to find memory leaks, in particular, the "memcheck" tool with the --leak-check=full
option. I think you'd only need to run the model long enough for the memory leak to occur once for valgrind to spot it.
You can continue to test with this branch. I just restored it on your fork. When you find a problem, you will need to make a new pull request.
update: ==253962== More than 10000000 total errors detected. I'm not reporting any more. ==253962== Final error counts will be inaccurate. Go fix your program!
==70041== ==70041== Conditional jump or move depends on uninitialised value(s) ==70041== at 0x1775EFF: _int_free (malloc.c:3945) ==70041== by 0x174F41E: catopen (catgets.c:78) ==70041== by 0x1677380: for__issue_diagnostic (in /global/u1/w/warndt/build_MPAS-O/MPAS/ocean_model) ==70041== by 0x167F26D: for__signal_handler (in /global/u1/w/warndt/build_MPAS-O/MPAS/ocean_model) ==70041== by 0x101D3FF: ??? (in /global/u1/w/warndt/build_MPAS-O/MPAS/ocean_model) ==70041== by 0xD36B17: box_rearrange_mp_findioproc (box_rearrange.F90.in:821) ==70041==
Primary changes here are the use of mpas_dmpar reuse calls during the barotropic subcycle. This prevents the subcycle from creating and destroying the same data structure during every subcycle iteration.
There is also a lesser change where I tracked the subcycle data dependencies by hand, and modified to remove some "not sure why but an extra exchange here works" code. I originally did this to enable using larger halos which would not need to be exchanged during every subcycle, but the gains were not there so I moved on.