E3SM-Ocean-Discussion / E3SM

Ocean discussion repository, for ocean issues and longer-term pull requests for E3SM source code. Please make pull requests that are ready to merge into https://github.com/E3SM-Project/E3SM
https://e3sm.org

Create the `provis_state` subpool at RK4 initialization to avoid memory leak #87

Closed sbrus89 closed 8 months ago

sbrus89 commented 8 months ago

I have noticed a memory leak in the RK4 timestepping when running 125-day single-layer barotropic tide cases with the vr45to5 mesh (https://github.com/MPAS-Dev/compass/pull/802) on pm-cpu. I can typically only get through about 42 days of simulation before running out of memory.

This issue is related to creating and destroying the provis_state subpool at each timestep. We had a similar issue a few years back that required memory leak fixes in the mpas_pool_destroy_pool subroutine (https://github.com/MPAS-Dev/MPAS-Model/pull/367). However, I believe there is still a memory leak in mpas_pool_remove_subpool (which calls pool_remove_member), which is called after mpas_pool_destroy_pool. The TODO comment here: https://github.com/E3SM-Project/E3SM/blob/6b9ecaa67c81c65fe1f7063e5afe63ce9b2c66a9/components/mpas-framework/src/framework/mpas_pool_routines.F#L6036-L6038 suggests that things may not be completely cleaned up by this subroutine.

I'm not familiar enough with the details of the pools framework to track down the memory leak itself. In any case, though, I think it makes more sense to create the provis_state subpool once at initialization rather than creating and destroying it every timestep. This PR is meant to socialize that as a potential approach. The main consequence is that the mpas_pool_copy_pool subroutine needs an overrideTimeLevels option similar to the one mpas_pool_clone_pool used under the previous approach. I've tested these changes with the vr45to5 test case, and they allow me to run the full 125 days.
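
For context, here is a minimal sketch (not part of the PR diff) contrasting the two patterns. The pool routines named are real MPAS framework routines, but the argument lists, the wrapper subroutine, and the subpool handling are simplified assumptions:

! Sketch only: contrasts the old per-timestep clone/destroy pattern with the
! create-once pattern proposed in this PR. Routine names are real MPAS pool
! routines; argument lists and names here are simplified assumptions.
subroutine rk4_provis_sketch(domain)

   use mpas_derived_types, only : domain_type, mpas_pool_type
   use mpas_pool_routines

   type (domain_type), intent(inout) :: domain
   type (mpas_pool_type), pointer :: statePool, provisStatePool

   call mpas_pool_get_subpool(domain % blocklist % structs, 'state', statePool)

   ! Old pattern, executed every RK4 step: create, clone, use, destroy.
   ! The destroy/remove path is where the leak is suspected.
   !    call mpas_pool_create_pool(provisStatePool)
   !    call mpas_pool_clone_pool(statePool, provisStatePool, 1)
   !    ... RK4 stages use provisStatePool ...
   !    call mpas_pool_destroy_pool(provisStatePool)

   ! Proposed pattern: the subpool is cloned once at RK4 initialization and
   ! added as 'provis_state'; each step only refreshes its contents via
   ! mpas_pool_copy_pool with the new (assumed) overrideTimeLevels argument.
   call mpas_pool_get_subpool(domain % blocklist % structs, 'provis_state', provisStatePool)
   call mpas_pool_copy_pool(statePool, provisStatePool, 1)

end subroutine rk4_provis_sketch

The point of the change is that allocation happens once, so any incomplete cleanup in mpas_pool_destroy_pool or mpas_pool_remove_subpool is no longer hit on every timestep.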

mark-petersen commented 8 months ago

@sbrus89 thanks for your work on this. In recent simulations by @jeremy-lilly and @gcapodag we had a memory leak problem with RK4 on Perlmutter. We will retest, as this might take care of it! We were trying to run a 125 m single-layer hurricane Sandy test, very similar to your problem above, and it would run out of memory after a few days.

I spent time fixing RK4 memory leaks back in 2019. I obviously didn't fix them all, and greatly appreciate your effort on this now. For reference, here are the old issues and PRs: https://github.com/MPAS-Dev/MPAS-Model/issues/137 https://github.com/MPAS-Dev/MPAS-Model/pull/142 https://github.com/MPAS-Dev/MPAS-Model/pull/185

I will also conduct some tests. If they all work, we can move this to E3SM.

mark-petersen commented 8 months ago

This compiles with one small fix:

--- a/components/mpas-framework/src/framework/mpas_pool_routines.F
+++ b/components/mpas-framework/src/framework/mpas_pool_routines.F
@@ -991,7 +991,7 @@ module mpas_pool_routines
-                  if (present(overrideTimeLevel)) then
+                  if (present(overrideTimeLevels)) then
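
For readers less familiar with the Fortran idiom: present() only accepts the name of a declared optional dummy argument, so the singular/plural name mismatch would not compile. A minimal standalone illustration of the pattern the fix restores, with generic names not copied from mpas_pool_copy_pool:

! Minimal standalone illustration of the optional-argument pattern; the
! names here are generic assumptions, not the MPAS code.
subroutine copy_with_override(defaultTimeLevels, overrideTimeLevels)

   integer, intent(in) :: defaultTimeLevels
   integer, intent(in), optional :: overrideTimeLevels

   integer :: nTimeLevels

   nTimeLevels = defaultTimeLevels
   if (present(overrideTimeLevels)) then   ! must name the optional dummy exactly
      nTimeLevels = overrideTimeLevels
   end if

   write(*,*) 'copying ', nTimeLevels, ' time level(s)'

end subroutine copy_with_override
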
sbrus89 commented 8 months ago

Thanks @mark-petersen, hopefully this helps with @jeremy-lilly and @gcapodag's issue as well.

sbrus89 commented 8 months ago

Thanks @xylar, I tried to make overrideTimeLevels operate analogously to the existing functionality in mpas_pool_clone_pool. Thanks again for your suggestions!

gcapodag commented 8 months ago

Hi all, thanks @sbrus89 for bringing this up. LTS and FB-LTS, which are both merged into master now, also use the same process of cloning pools. If you are going to merge this into master, please also include the changes for those two time stepping methods so they are not left out.

mark-petersen commented 8 months ago

Passes the nightly test suite and compares BFB with the master branch point on chicoma with optimized gnu and on chrysalis with optimized intel. Also passes the nightly test suite with debug gnu on chicoma. Note this includes a series of RK4 tests:

00:34 PASS ocean_global_ocean_Icos240_WOA23_RK4_performance_test
01:40 PASS ocean_global_ocean_Icos240_WOA23_RK4_restart_test
01:20 PASS ocean_global_ocean_Icos240_WOA23_RK4_decomp_test
01:20 PASS ocean_global_ocean_Icos240_WOA23_RK4_threads_test

In E3SM, passes

./create_test SMS_Ln9.T62_oQU240.GMPAS-IAF.chrysalis_gnu
./create_test SMS_Ln9.T62_oQU240.GMPAS-IAF.chrysalis_intel
xylar commented 8 months ago

I agree with @gcapodag, let's include LTS and FB-LTS before this goes to E3SM.

gcapodag commented 8 months ago

Thanks everyone for your work on this. I have just submitted a job on Perlmutter to see if it fixes our problem. I'll keep you all posted.

sbrus89 commented 8 months ago

Sounds good @gcapodag, I'll add these changes for LTS and FB-LTS as well.

gcapodag commented 8 months ago

Great, thank you very much @sbrus89 !!

gcapodag commented 8 months ago

Hi all, just wanted to confirm that this fix to RK4 allowed us to run on Perlmutter for 25 days on a mesh with 4617372 cells and a highest resolution near the coast of 125 m. Before this fix, the run would crash with a segmentation fault around two days in, and we could only get past that point by using around 5% of the node capacity with these srun settings: srun -N 64 -n 128 --ntasks-per-node=2 --ntasks-per-core=1. Thanks a lot to you all for figuring out a solution!

sbrus89 commented 8 months ago

@gcapodag, I pushed the LTS/FB-LTS changes but haven't tested them yet.

gcapodag commented 8 months ago

Thanks @sbrus89, I just tested a 2-hour run on Perlmutter and the changes to LTS and FB-LTS are BFB.

sbrus89 commented 8 months ago

Thanks very much for testing, @gcapodag! I think I'll go ahead and move this over to E3SM in that case.