Status: Closed (alperaltuntas closed this issue 3 years ago)
@alperaltuntas - I believe this is due to the processor layout and is a known problem when DATM is run concurrently with POP or MOM. Can you please try this again with all tasks on the same PEs?
@alperaltuntas - this is an ESMF issue and @theurich is working on resolving it. CTSM had the same issue. DATM is sending messages and filling up buffers because there is no throttling and nothing is coming back to DATM.
Thanks, @mvertens. Placing all tasks on the same PEs alleviates the issue, although I'll note that there is still a steady increase in memory usage. (It increases from ~450 MB to ~700 MB in three model years.)
@alperaltuntas - that sounds like an additional problem. Can you please tell me how to reproduce your case?
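For a rough sense of scale, the figures above work out to about 83 MB of growth per model year. A minimal sketch of that arithmetic, assuming the growth is linear (nothing in this thread establishes that) and using a hypothetical helper name:

```python
def growth_rate_mb_per_year(start_mb, end_mb, model_years):
    """Average memory growth per model year, assuming linear growth."""
    return (end_mb - start_mb) / model_years


# Figures quoted in this comment: ~450 MB rising to ~700 MB over three model years.
rate = growth_rate_mb_per_year(450.0, 700.0, 3.0)
print(f"{rate:.1f} MB per model year")  # prints "83.3 MB per model year"
```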
Here is my caseroot: /glade/scratch/altuntas/c.e23.Cswav.T62_g17.long_npc.001
cesm2_3_alpha03a/cime/scripts/create_newcase --res T62_g17 --compset 2000_DATM%NYF_SLND_DICE%SSMI_POP2_DROF%NYF_SGLC_SWAV_SESP --run-unsupported --driver nuopc --case c.e23.Cswav.T62_g17.long_npc.001
Thanks! I'll give this a try.
@alperaltuntas @jedwards4b - I have traced this to a memory leak that is not in the POP NUOPC cap but in the budget module med_diag_mod.F90. I did this by running the following test:
SMS_Ly1_Vnuopc.T62_g17.2000_DATM%NYF_SLND_DICE%SSMI_DOCN%DOM_DROF%NYF_SGLC_SWAV_SESP
Out of the box, there are no diag calls in the run sequence. However, when the following calls are added to the run sequence:
MED med_phases_diag_atm
MED med_phases_diag_rof
MED med_phases_diag_accum
MED med_phases_diag_print
the memory leak that @alperaltuntas observed appears again in the med.log file. I will raise this issue in CMEPS, and the issue here can be closed.
Thanks, @mvertens!
Description of the issue:
There is a memory leak in the NUOPC cap that crashes the run after about 4 years of a continuous C case run. The leak can be confirmed from the memory usage reported in the cesm log file, so there is no need to run for the full 4 years. I haven't been able to pinpoint the source yet, but I am reporting it here in case @mvertens or others would also like to tackle it.
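One way to check for the trend without reading the log by eye is a small script that pulls the memory figures out and tests whether they keep rising. The following is only a sketch: the line format, the regular expression, and the function names are assumptions made for illustration, not the actual cesm log format, so the pattern would need to be adapted to the real log lines.

```python
import re

# Hypothetical line format; adjust the pattern to match the actual log.
PATTERN = re.compile(r"model date\s*=\s*(\d+).*?memory\s*=\s*([\d.]+)\s*MB")


def memory_samples(lines, pattern=PATTERN):
    """Return (model_date, memory_mb) pairs for lines matching the pattern."""
    samples = []
    for line in lines:
        match = pattern.search(line)
        if match:
            samples.append((int(match.group(1)), float(match.group(2))))
    return samples


def is_steadily_increasing(samples, min_growth_mb=1.0):
    """True if memory never decreases and grows by at least min_growth_mb overall."""
    values = [mb for _, mb in samples]
    if len(values) < 2:
        return False
    non_decreasing = all(b >= a for a, b in zip(values, values[1:]))
    return non_decreasing and (values[-1] - values[0]) >= min_growth_mb


# Made-up lines for illustration only, not real model output.
example = [
    "model date = 00010101 memory = 450.0 MB",
    "some unrelated line",
    "model date = 00020101 memory = 530.0 MB",
    "model date = 00030101 memory = 615.0 MB",
]
samples = memory_samples(example)
print(is_steadily_increasing(samples))
```

In practice the list of lines would come from reading the log file, and a short run of a few model months may be enough to see whether the reported usage keeps climbing.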
Version:
Machine/Environment Description: cheyenne_intel
Any xml/namelist changes or SourceMods: none