E3SM-Project / E3SM

Energy Exascale Earth System Model source code. NOTE: use "maint" branches for your work. Head of master is not validated.
https://docs.e3sm.org/E3SM
Other
332 stars 334 forks source link

Create the `provis_state` subpool at RK4 initialization to avoid memory leak #6334

Closed sbrus89 closed 2 weeks ago

sbrus89 commented 1 month ago

This PR fixes a memory leak I noticed in the RK4 timestepping when running 125 day single-layer barotropic tides cases with the vr45to5 mesh (https://github.com/MPAS-Dev/compass/pull/802) on pm-cpu. Previously, I could only get through about 42 days of simulation before running out of memory.

This issue is related to creating/destroying the provis_state subpool at each timestep. We had a similar problem a few years back that required memory leaks fixes in the mpas_pool_destroy_pool (https://github.com/MPAS-Dev/MPAS-Model/pull/367) subroutine. However, I believe there is still a memory leak in the mpas_pool_remove_subpool routine (which calls pool_remove_member) that is called following mpas_pool_destroy_pool. It seems like the TODO comment here: https://github.com/E3SM-Project/E3SM/blob/6b9ecaa67c81c65fe1f7063e5afe63ce9b2c66a9/components/mpas-framework/src/framework/mpas_pool_routines.F#L6036-L6038 suggests it is possible things aren't being completely cleaned up by this subroutine.

I'm not familiar enough with the low-level details of the pools framework to track down the memory leak itself. However, in any case, I think it makes more sense to create the provis_state subpool once at initialization as opposed to creating and destroying it every timestep. The main consequence of this approach is that the mpas_pool_copy_pool subroutine needs to have a overrideTimeLevel option similar to that used in mpas_pool_clone_pool used previously.

I've tested these changes with the vr45to5 tides test case and they do allow me to run for a full 125 days and are B4B with the previously create/destroy approach. The LTS and FB-LTS timestepping schemes are also updated here in the same way to prevent the same memory leak.

Since RK4 is not used in E3SM, this PR is B4B for all E3SM tests. The mpas_pool_copy_pool routine modified here is not used in MPAS-Seaice or MALI. See the Ocean-Discussions PR for additional background and testing.

[BFB]

sbrus89 commented 1 month ago

@matthewhoffman and @akturner, I thought I'd tag you on this since it touches the framework, even though this is apparently the first usage of mpas_pool_copy_pool in E3SM.

gcapodag commented 1 month ago

great, thanks @sbrus89

sbrus89 commented 1 month ago

@erinethomas and @darincomeau, would you mind taking a look at the framework changes in this PR just in case you see any issues from the sea ice side?

mark-petersen commented 1 month ago

Repeating post from https://github.com/E3SM-Ocean-Discussion/E3SM/pull/87#issuecomment-2037271385

Passes nightly test suite and compares bfb with master branch point on chicoma with optimized gnu and chrysalis with optimized intel. Also passes nighty test suite with debug gnu on chicoma. Note this includes a series of RK4 tests:

00:34 PASS ocean_global_ocean_Icos240_WOA23_RK4_performance_test
01:40 PASS ocean_global_ocean_Icos240_WOA23_RK4_restart_test
01:20 PASS ocean_global_ocean_Icos240_WOA23_RK4_decomp_test
01:20 PASS ocean_global_ocean_Icos240_WOA23_RK4_threads_test

In E3SM, passes

./create_test SMS_Ln9.T62_oQU240.GMPAS-IAF.chrysalis_gnu
./create_test SMS_Ln9.T62_oQU240.GMPAS-IAF.chrysalis_intel
mark-petersen commented 1 month ago

Repeating post from https://github.com/E3SM-Ocean-Discussion/E3SM/pull/87#issuecomment-2037271385. Note these tests all do not use LTS.

Passes nightly test suite and compares bfb with master branch point on chicoma with optimized gnu and chrysalis with optimized intel. Also passes nighty test suite with debug gnu on chicoma. Note this includes a series of RK4 tests:

00:34 PASS ocean_global_ocean_Icos240_WOA23_RK4_performance_test
01:40 PASS ocean_global_ocean_Icos240_WOA23_RK4_restart_test
01:20 PASS ocean_global_ocean_Icos240_WOA23_RK4_decomp_test
01:20 PASS ocean_global_ocean_Icos240_WOA23_RK4_threads_test

In E3SM, passes

./create_test SMS_Ln9.T62_oQU240.GMPAS-IAF.chrysalis_gnu
./create_test SMS_Ln9.T62_oQU240.GMPAS-IAF.chrysalis_intel
mark-petersen commented 1 month ago

@xylar and @matthewhoffman these changes are only used by the ocean core, and only for RK4 and LTS timestepping. I think you are safe with a visual review, as I tried to be thorough in my testing of RK4 and LTS and saw BFB results.

git grep -in mpas_pool_clone_pool
mpas-framework/src/core_sw/mpas_sw_time_integration.F:120:        call mpas_pool_clone_pool(statePool, provisStatePool, 1)
mpas-framework/src/framework/mpas_pool_routines.F:434:!  routine mpas_pool_clone_pool
mpas-framework/src/framework/mpas_pool_routines.F:444:   recursive subroutine mpas_pool_clone_pool(srcPool, destPool, overrideTimeLevels)!{{{
mpas-framework/src/framework/mpas_pool_routines.F:466:            call pool_mesg('ERROR in mpas_pool_clone_pool: Input time levels cannot be less than 1.')
mpas-framework/src/framework/mpas_pool_routines.F:909:                call mpas_pool_clone_pool(ptr % data % p, newmem % data % p)
mpas-framework/src/framework/mpas_pool_routines.F:921:   end subroutine mpas_pool_clone_pool!}}}
mpas-framework/src/framework/mpas_stream_manager.F:411:           call mpas_pool_clone_pool(manager % defaultAtts, new_stream % att_pool)
mpas-ocean/src/mode_forward/mpas_ocn_time_integration_lts.F:259:    call mpas_pool_clone_pool(tendPool, tendSum1stPool, 1)
mpas-ocean/src/mode_forward/mpas_ocn_time_integration_lts.F:261:    call mpas_pool_clone_pool(tendPool, tendSum2ndPool, 1)
mpas-ocean/src/mode_forward/mpas_ocn_time_integration_lts.F:263:    call mpas_pool_clone_pool(tendPool, tendSum3rdPool, 1)
mpas-ocean/src/mode_forward/mpas_ocn_time_integration_lts.F:265:    call mpas_pool_clone_pool(tendPool, tendSlowPool, 1)
mpas-ocean/src/mode_forward/mpas_ocn_time_integration_rk4.F:233:         call mpas_pool_clone_pool(statePool, provisStatePool, 1)
erinethomas commented 1 month ago

in line with @mark-petersen's comment above - I was not able to see any place where this would impact MPAS-SI this morning ... so I think its ok on the sea-ice side...

sbrus89 commented 1 month ago

@mark-petersen, the mpas-ocean rk4, lts, and fblts time stepping routines are the only places that use the mpas_pool_copy_pool routine I modified:

git grep -in mpas_pool_copy_pool
components/mpas-framework/src/framework/mpas_pool_routines.F:925:!  routine mpas_pool_copy_pool
components/mpas-framework/src/framework/mpas_pool_routines.F:935:   recursive subroutine mpas_pool_copy_pool(srcPool, destPool, overrideTimeLevels)!{{{
components/mpas-framework/src/framework/mpas_pool_routines.F:958:            call pool_mesg('ERROR in mpas_pool_copy_pool: Input time levels cannot be less than 1.')
components/mpas-framework/src/framework/mpas_pool_routines.F:1171:                   call mpas_pool_copy_pool(ptr % data % p, mem % p)
components/mpas-framework/src/framework/mpas_pool_routines.F:1181:   end subroutine mpas_pool_copy_pool!}}}
components/mpas-ocean/src/mode_forward/mpas_ocn_time_integration_fblts.F:230:      call mpas_pool_copy_pool(tendPool, tendSum3rdPool, 1)
components/mpas-ocean/src/mode_forward/mpas_ocn_time_integration_fblts.F:231:      call mpas_pool_copy_pool(tendPool, tendSlowPool, 1)
components/mpas-ocean/src/mode_forward/mpas_ocn_time_integration_lts.F:264:    call mpas_pool_copy_pool(tendPool, tendSum1stPool, 1)
components/mpas-ocean/src/mode_forward/mpas_ocn_time_integration_lts.F:265:    call mpas_pool_copy_pool(tendPool, tendSum2ndPool, 1)
components/mpas-ocean/src/mode_forward/mpas_ocn_time_integration_lts.F:266:    call mpas_pool_copy_pool(tendPool, tendSum3rdPool, 1)
components/mpas-ocean/src/mode_forward/mpas_ocn_time_integration_lts.F:267:    call mpas_pool_copy_pool(tendPool, tendSlowPool, 1)
components/mpas-ocean/src/mode_forward/mpas_ocn_time_integration_rk4.F:233:         call mpas_pool_copy_pool(statePool, provisStatePool, 1)
jonbob commented 3 weeks ago

passes:

merged to next

jonbob commented 2 weeks ago

merged to master