sbrus89 closed this pull request 8 months ago.
@sbrus89 thanks for your work on this. In recent simulations by @jeremy-lilly and @gcapodag we had a memory leak problem with RK4 on Perlmutter. We will retest, as this might take care of it! We were trying to run a 125 m single-layer hurricane Sandy test, so very similar to your problem above, and it would run out of memory after a few days.
I spent time fixing RK4 memory leaks back in 2019. I obviously didn't fix them all, and greatly appreciate your effort on this now. For reference, here are the old issues and PRs: https://github.com/MPAS-Dev/MPAS-Model/issues/137 https://github.com/MPAS-Dev/MPAS-Model/pull/142 https://github.com/MPAS-Dev/MPAS-Model/pull/185
I will also conduct some tests. If they all work, we can move this to E3SM.
It will compile with this small fix:

```diff
--- a/components/mpas-framework/src/framework/mpas_pool_routines.F
+++ b/components/mpas-framework/src/framework/mpas_pool_routines.F
@@ -991,7 +991,7 @@ module mpas_pool_routines
- if (present(overrideTimeLevel)) then
+ if (present(overrideTimeLevels)) then
```
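For reference, `present()` must be given the exact name of the optional dummy argument declared on the subroutine, so the mismatched name above is a compile-time error. A minimal standalone sketch of that pattern (hypothetical subroutine and argument names, not the actual MPAS routine):

```fortran
module copy_demo
   implicit none
contains

   ! Hypothetical analogue of a copy routine with an optional
   ! time-level override; the name passed to present() must match
   ! the declared optional dummy argument exactly.
   subroutine copy_with_override(nTimeLevels, overrideTimeLevels)
      integer, intent(in) :: nTimeLevels
      integer, intent(in), optional :: overrideTimeLevels
      integer :: levelsToCopy

      if (present(overrideTimeLevels)) then
         levelsToCopy = overrideTimeLevels
      else
         levelsToCopy = nTimeLevels
      end if

      write(*,*) 'copying ', levelsToCopy, ' time level(s)'
   end subroutine copy_with_override

end module copy_demo

program test_copy_demo
   use copy_demo
   implicit none
   call copy_with_override(2)                        ! copies all time levels
   call copy_with_override(2, overrideTimeLevels=1)  ! copies a single level
end program test_copy_demo
```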
Thanks @mark-petersen, hopefully this helps with @jeremy-lilly and @gcapodag's issue as well.
Thanks @xylar, I tried to make `overrideTimeLevels` operate analogously to the existing functionality in `mpas_pool_clone_pool`. Thanks again for your suggestions!
Hi All, thanks @sbrus89 for bringing this up. LTS and FB-LTS, which are both merged into master now, also use the same process of cloning pools. If you are going to merge this into master, please include the changes in those two time-stepping methods as well so that they are not left out.
Passes the nightly test suite and compares bfb with the master branch point on chicoma with optimized gnu and on chrysalis with optimized intel. Also passes the nightly test suite with debug gnu on chicoma. Note this includes a series of RK4 tests:

```
00:34 PASS ocean_global_ocean_Icos240_WOA23_RK4_performance_test
01:40 PASS ocean_global_ocean_Icos240_WOA23_RK4_restart_test
01:20 PASS ocean_global_ocean_Icos240_WOA23_RK4_decomp_test
01:20 PASS ocean_global_ocean_Icos240_WOA23_RK4_threads_test
```
In E3SM, passes:

```
./create_test SMS_Ln9.T62_oQU240.GMPAS-IAF.chrysalis_gnu
./create_test SMS_Ln9.T62_oQU240.GMPAS-IAF.chrysalis_intel
```
I agree with @gcapodag, let's include LTS and FB-LTS before this goes to E3SM.
Thanks everyone for your work on this. I have just submitted a job on Perlmutter to see if it fixes our problem. I'll keep you all posted.
Sounds good @gcapodag, I'll add these changes for LTS and FB-LTS as well
Great, thank you very much @sbrus89!!
Hi All, just wanted to confirm that this fix to RK4 allowed us to run on Perlmutter for 25 days on a mesh with 4617372 cells and a highest resolution of 125 m near the coast. Before this fix, the run would crash with a segmentation fault around two days in, and we were only able to make it proceed past that point by using around 5% of the node capacity with these specs for `srun`: `srun -N 64 -n 128 --ntasks-per-node=2 --ntasks-per-core=1`. Thanks a lot to you all for figuring out a solution!
@gcapodag, I pushed the LTS/FB-LTS changes but haven't tested them yet.
Thanks @sbrus89, I just tested a 2 hr run on Perlmutter and the changes to LTS and FB-LTS are BFB.
Thanks very much for testing, @gcapodag! I think I'll go ahead and move this over to E3SM in that case.
I have noticed a memory leak in the RK4 timestepping when running 125-day single-layer barotropic tides cases with the vr45to5 mesh (https://github.com/MPAS-Dev/compass/pull/802) on pm-cpu. I can typically only get through about 42 days of simulation before running out of memory.

This issue is related to creating/destroying the `provis_state` subpool at each timestep. We had a similar issue a few years back that required memory-leak fixes in the `mpas_pool_destroy_pool` subroutine (https://github.com/MPAS-Dev/MPAS-Model/pull/367). However, I believe there is still a memory leak in `mpas_pool_remove_subpool` (which calls `pool_remove_member`), which is called following `mpas_pool_destroy_pool`. The TODO comment here: https://github.com/E3SM-Project/E3SM/blob/6b9ecaa67c81c65fe1f7063e5afe63ce9b2c66a9/components/mpas-framework/src/framework/mpas_pool_routines.F#L6036-L6038 suggests it is possible that things aren't being completely cleaned up by this subroutine.

I'm not familiar enough with the details of the pools framework to track down the memory leak itself. However, in any case, I think it makes more sense to create the `provis_state` subpool once at initialization rather than creating and destroying it every timestep. This PR is meant to socialize this as a potential approach. The main consequence is that the `mpas_pool_copy_pool` subroutine needs an `overrideTimeLevels` option similar to the one used in `mpas_pool_clone_pool` under the previous approach. I've tested these changes with the vr45to5 test case, and they do allow me to run for the full 125 days.
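To make the proposed flow concrete, here is a rough sketch of the intended pattern, written against the MPAS framework modules and assuming the pool calls already used by the RK4 code (`mpas_pool_get_subpool`, `mpas_pool_create_pool`, `mpas_pool_clone_pool`, `mpas_pool_add_subpool`) plus the new optional `overrideTimeLevels` argument on `mpas_pool_copy_pool`. The subroutine names and argument handling below are illustrative, not the actual implementation:

```fortran
module provis_state_sketch

   ! Illustrative sketch only: create the provis_state subpool once at
   ! initialization, then refresh it by copying state each timestep
   ! instead of cloning/destroying the subpool every step.
   use mpas_derived_types, only : block_type, mpas_pool_type
   use mpas_pool_routines

   implicit none

contains

   subroutine sketch_provis_state_init(block)
      type (block_type), pointer :: block
      type (mpas_pool_type), pointer :: statePool, provisStatePool

      call mpas_pool_get_subpool(block % structs, 'state', statePool)
      call mpas_pool_create_pool(provisStatePool)
      ! Clone the state pool with a single time level, as in the existing setup
      call mpas_pool_clone_pool(statePool, provisStatePool, 1)
      call mpas_pool_add_subpool(block % structs, 'provis_state', provisStatePool)
   end subroutine sketch_provis_state_init

   subroutine sketch_provis_state_reset(block)
      type (block_type), pointer :: block
      type (mpas_pool_type), pointer :: statePool, provisStatePool

      call mpas_pool_get_subpool(block % structs, 'state', statePool)
      call mpas_pool_get_subpool(block % structs, 'provis_state', provisStatePool)
      ! Copy the current state into the persistent provis_state subpool;
      ! overrideTimeLevels is the new optional argument this PR adds to
      ! mpas_pool_copy_pool.
      call mpas_pool_copy_pool(statePool, provisStatePool, overrideTimeLevels=1)
   end subroutine sketch_provis_state_reset

end module provis_state_sketch
```

The point of the change is that `mpas_pool_destroy_pool` and `mpas_pool_remove_subpool` are no longer called every timestep, which sidesteps whatever is not being fully cleaned up there.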