ESCOMP / POP2-CESM

Parallel Ocean Program (POP2) in CESM
http://www.cesm.ucar.edu/models/cesm2/ocean/

Memory leak in NUOPC cap #55

Closed alperaltuntas closed 3 years ago

alperaltuntas commented 3 years ago

Description of the issue:

There is a memory leak in the NUOPC cap that crashes a C case after about 4 years of continuous simulation. The memory usage reported in the cesm log file is enough to confirm the issue, so there is no need to run for 4 years to see it. I haven't been able to pinpoint the source yet, but I am reporting it here in case @mvertens or others would like to tackle it as well.

Version:

Machine/Environment Description: cheyenne_intel

Any xml/namelist changes or SourceMods: none

mvertens commented 3 years ago

@alperaltuntas - I believe this is due to the processor layout and is a known problem when DATM is run concurrently to POP or MOM. Can you please try this again with all tasks on the same PEs?

mvertens commented 3 years ago

@alperaltuntas - this is an ESMF issue and @theurich is working on resolving it. CTSM had the same issue. DATM is sending messages and filling up buffers because there is no throttling and nothing is coming back to DATM.
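The buffer growth described above can be illustrated with a toy model. This is only a sketch in plain Python, not ESMF or DATM code: the `simulate` function, its parameters, and the send/receive rates are all hypothetical and chosen just to show how pending messages accumulate when the sender is never throttled.

```python
from collections import deque


def simulate(steps, sends_per_step, receives_per_step, max_pending=None):
    """Return the number of pending messages after each step.

    max_pending=None models a sender with no throttling (hypothetical);
    an integer caps the buffer, so the sender stops once it is full.
    """
    pending = deque()
    depths = []
    for step in range(steps):
        for _ in range(sends_per_step):
            if max_pending is not None and len(pending) >= max_pending:
                break  # throttled: the sender stops sending for this step
            pending.append(step)
        # The receiver drains at most receives_per_step messages.
        for _ in range(min(receives_per_step, len(pending))):
            pending.popleft()
        depths.append(len(pending))
    return depths


# Sender outpaces receiver: the unthrottled buffer keeps growing.
unthrottled = simulate(steps=5, sends_per_step=3, receives_per_step=1)
# With a cap of 4 pending messages the depth stays bounded.
throttled = simulate(steps=5, sends_per_step=3, receives_per_step=1, max_pending=4)

print("unthrottled:", unthrottled)
print("throttled:  ", throttled)
```

In the unthrottled case the depth rises by a fixed amount every step, which matches the kind of steady memory growth reported in this issue; the capped case levels off.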

alperaltuntas commented 3 years ago

Thanks, @mvertens. Placing all tasks on the same PEs alleviates the issue, although I'll note that there is still a steady increase in memory usage. (It increases from ~450 MB to ~700 MB in three model years.)
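For a rough sense of scale, the figures above work out to a little over 80 MB per model year. A minimal sketch of that arithmetic follows; the helper names and the 2000 MB limit are hypothetical and used only for illustration, and the growth is assumed to be linear.

```python
def growth_rate(start_mb, end_mb, model_years):
    """Average memory growth in MB per model year, assuming linear growth."""
    return (end_mb - start_mb) / model_years


def years_until(limit_mb, current_mb, rate_mb_per_year):
    """Model years until current_mb reaches limit_mb at a constant rate."""
    if rate_mb_per_year <= 0:
        return float("inf")  # no growth, so the limit is never reached
    return max(0.0, (limit_mb - current_mb) / rate_mb_per_year)


# Figures reported in this thread: ~450 MB to ~700 MB in three model years.
rate = growth_rate(450.0, 700.0, 3.0)

# Hypothetical memory limit, not a value taken from the issue.
remaining = years_until(2000.0, 700.0, rate)

print(f"growth: {rate:.1f} MB per model year")
print(f"model years until the hypothetical 2000 MB limit: {remaining:.1f}")
```

Swapping in memory readings from an actual log gives a quick check of whether a leak is slow enough to ignore for a given run length.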

mvertens commented 3 years ago

@alperaltuntas - that sounds like an additional problem. Can you please tell me how to reproduce your case?

alperaltuntas commented 3 years ago

Here is my caseroot: /glade/scratch/altuntas/c.e23.Cswav.T62_g17.long_npc.001

cesm2_3_alpha03a/cime/scripts/create_newcase --res T62_g17 --compset 2000_DATM%NYF_SLND_DICE%SSMI_POP2_DROF%NYF_SGLC_SWAV_SESP --run-unsupported --driver nuopc --case c.e23.Cswav.T62_g17.long_npc.001

mvertens commented 3 years ago

Thanks! I'll give this a try.

mvertens commented 3 years ago

@alperaltuntas @jedwards4b - I have traced this memory leak to the budget module med_diag_mod.F90 rather than to the POP NUOPC cap. I did this by running the following test:

SMS_Ly1_Vnuopc.T62_g17.2000_DATM%NYF_SLND_DICE%SSMI_DOCN%DOM_DROF%NYF_SGLC_SWAV_SESP

Out of the box there are no diag calls in the run sequence. However, when the following calls were added to the run sequence:

  MED med_phases_diag_atm
  MED med_phases_diag_rof
  MED med_phases_diag_accum
  MED med_phases_diag_print

the memory leak that @alperaltuntas observed appeared again in the med.log file. I will raise this issue in CMEPS, and the issue here can be closed.

alperaltuntas commented 3 years ago

Thanks, @mvertens!