E3SM-Project / E3SM

Energy Exascale Earth System Model source code. NOTE: use "maint" branches for your work. Head of master is not validated.
https://docs.e3sm.org/E3SM

memory leak for ERP_Ld3.ne4_oQU240.F2010 #4546

Closed. rljacob closed this issue 3 years ago.

rljacob commented 3 years ago

A memory leak is being reported on some machines for ERP_Ld3.ne4_oQU240.F2010.

The leak started after the test was changed in PR #4488. It went from ERP_Ln9.ne4_ne4.FC5AV1C-L longname: 2000_EAM%AV1C-L_ELM%SPBC_CICE%PRES_DOCN%DOM_SROF_SGLC_SWAV

to

ERP_Ld3.ne4_oQU240.F2010 longname: 2010_EAM%CMIP6_ELM%SPBC_MPASSI%PRES_DOCN%DOM_MOSART_SGLC_SWAV

The message in TestStatus is something like: 2021-09-18 04:18:16: memleak detected, memory went from 5680.520000 to 6473.240000 in 1 days

Machines with memory leak:

- mappy, gnu 8.1.0
- cori-knl, intel 19.0.3
- cori-haswell, intel 19.0.3
- theta, intel 19.1.0

Machines without leak:

- chrysalis, intel 20.0.4
- anvil, intel 20.0.4
- compy, pgi 19.10
- compy, intel 19.0.5
- ascent, gnu 8.1.1
- ascent, xlf 16.1.1

rljacob commented 3 years ago

@jonbob ran a 20-day test on anvil with mpassi on its own set of pes and INFO_MPROF=1 so we get memory output for every ROOTPE. The output shows this:

Resident memory size (MB):

| ROOTPE | comps | RSS day 1 | RSS day 2 | RSS day 20 |
| ------ | ----- | --------- | --------- | ---------- |
| 0 | cpl, atm | 744.578 | 779.293 | 858.012 |
| 108 | lnd, ocn, rof | 247.691 | 253.754 | 331.004 |
| 144 | ice | 322.328 | 323.141 | 342.570 |

High water memory (MB):

| ROOTPE | comps | VSZ day 1 | VSZ day 2 | VSZ day 20 |
| ------ | ----- | --------- | --------- | ---------- |
| 0 | cpl, atm | 6871.63 | 6871.63 | 6916.51 |
| 108 | lnd, ocn, rof | 5773.75 | 5777.21 | 5843.43 |
| 144 | ice | 5833.89 | 5851.89 | 5940.97 |
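To put those numbers side by side, here is the relative growth from day 1 to day 20 per ROOTPE (a quick Python sketch of my own, using the values from the resident memory table above):

    # Day-1 and day-20 RSS (MB), copied from the resident memory table above.
    rss = {
        "cpl, atm":      (744.578, 858.012),
        "lnd, ocn, rof": (247.691, 331.004),
        "ice":           (322.328, 342.570),
    }
    for comps, (day1, day20) in rss.items():
        print(f"{comps:15s} {(day20 - day1) / day1:6.1%}")
    # cpl, atm ~15.2%, lnd/ocn/rof ~33.6%, ice ~6.3%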
singhbalwinder commented 3 years ago

Maybe we should try SMS and ERS tests to see if we get a memory leak there as well. If it is just ERP, it might have something to do with threading.

rljacob commented 3 years ago

It's also possibly related to the compiler version, since intel 19 has it but intel 20 does not.

rljacob commented 3 years ago

More from @jonbob: "On cori, if I move mpassi off to its own processor set, the test still fails and complains about memory on the nodes with cpl/atm/lnd/ocn/rof, so I don't believe this is related to a memory leak in mpassi. I can put each component on its own pes and try to see which one is causing trouble?

OK, I moved lnd/rof/ocn off to another set of pes and the test still fails, pointing at a rootpe that only has cpl and atm. So if there is a memory leak, it would be with one of those two components?

For what it's worth, the failing test on cori is pointing at the memory highwater when it complains. It does grow on cori but not on anvil, so I'm confused… The versions of intel are different, but not that much: cori has intel/19.0.3.199 and anvil uses intel/20.0.4."

rljacob commented 3 years ago

Another test from @jonbob: "Could it be that EAM has a leak related to the prescribed seaice data? Going from seaice on ne4 to oQU240 means going from 866 horizontal points to 7153.

ERP_Ld3.ne4_oQU480.F2010.cori-knl_intel fails as well. Trying now with ne4pg2 – unless there’s a reason we want ne4 instead

ERP_Ld3.ne4pg2_oQU480.F2010.cori-knl_intel also fails, so there goes that idea"

jonbob commented 3 years ago

@singhbalwinder - I ran an ERS test on cori, but it had similar results. It also defaulted to running threaded, so I'll turn threading off and try again

jonbob commented 3 years ago

@rljacob - I'll try on compy, since it also has intel19 support. At least that may help resolve the question of compiler version

singhbalwinder commented 3 years ago

Thanks @jonbob . Is it reproducible with an SMS test? If yes, it may be easier to reproduce/debug.

jonbob commented 3 years ago

It should be, but I'll try a test to make sure

jonbob commented 3 years ago

Note that the ERS tests don't fail the memory test; they just report insufficient data. But the memory high-water changes are consistent with the ERP test results

rljacob commented 3 years ago

Another thing to try: go back to master before PR #4488 and see if ERP_Ld3.ne4_ne4.FC5AV1C-L has a memory leak on cori (running the old test for 3 days instead of 9 steps). The results are implying that it's in the atmosphere. If that test passes, we have to look at what code is being activated for 2000_EAM%AV1C-L vs. 2010_EAM%CMIP6

rljacob commented 3 years ago

Actually, the results more strongly indicate it's a compiler bug, and we should just upgrade the compilers.

jonbob commented 3 years ago

On compy using intel 19.0.5, the ERP memory leak test passes

jonbob commented 3 years ago

The SMS test doesn't show the memory leak, which was a bit confusing until I looked closer at the cpl logs. I think the issue isn't a memory leak per se but an I/O issue caused by how the test is constructed. Running ERP or ERS for 3 days means the model will drop restart files at the end of the second day, so the memory highwater is actually catching any memory used for restart output. Usually the memory leak test throws out the last day, probably to avoid measuring all the I/O finalization.

rljacob commented 3 years ago

I like that explanation, but then why don't all the 3-day ERP tests show a false memory leak?

jonbob commented 3 years ago

I like it too, but that is the obvious question. Unless some of the platforms and/or compilers are impacted differently by the restart output? But that's not something I have much knowledge of

jonbob commented 3 years ago

I'm trying a full-length ERS test on cori, just to see what that does

jonbob commented 3 years ago

Somehow the full-length ERS test passes the memleak, even though the memory highwater goes from 4743.46 to 5241.42 between days 5 and 6 (or 6 and 7, depending on how you count) when it writes the restart files. I would have expected that to cause it to fail, since the memory increase is more than 10%?

jonbob commented 3 years ago

And nothing in TestStatus.log about memory, so I'm a bit confused

rljacob commented 3 years ago

@amametjanov do you know if some tests have memory leak checking on and others don't?

jonbob commented 3 years ago

env_test.xml does have a value for the tolerance:

    <entry id="TEST_MEMLEAK_TOLERANCE" value="0.10">
      <type>real</type>
      <desc>Expected relative memory usage growth for test</desc>
    </entry>
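For context on what that tolerance means, the check boils down to comparing two memory samples and flagging growth above 10%. A minimal Python sketch (my own illustration, not CIME's actual implementation; the numbers are from the TestStatus message quoted at the top of this issue):

    def memleak_detected(first_mb, last_mb, tolerance=0.10):
        # Flag a leak when relative growth between two memory samples
        # exceeds the tolerance (default mirrors TEST_MEMLEAK_TOLERANCE).
        return (last_mb - first_mb) / first_mb > tolerance

    # Values from the failing test: 5680.52 MB -> 6473.24 MB is ~14% growth,
    # above the 0.10 tolerance, so the test reports a leak.
    print(memleak_detected(5680.52, 6473.24))  # True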
jonbob commented 3 years ago

Unless it does the memleak test on the second run, in which case it wouldn't find an issue. I guess for the ERP_Ld3 the second run wouldn't be long enough for a memory comparison, and it's possible it then uses the first run? But I'm not sure how much logic is built into that part of the testing system. Let me try one final run, a full-length ERP test

amametjanov commented 3 years ago

All tests have memleak checking: default tolerance is 10%. A machine-specific value for TEST_MEMLEAK_TOLERANCE can be set in config_machines.xml. I think the mem-highwater increase in the initial run of ERP is due to PIO_ROOT=0 and CPL_ROOTPE=0. A workaround is to add a testmod:

$ cat cime_config/testmods_dirs/allactive/pioroot1/shell_commands 
./xmlchange PIO_ROOT=1
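If I have CIME's testmod naming convention right, a testmod living under cime_config/testmods_dirs/allactive/pioroot1/ would then be selected by appending its dash-separated path to the test name, e.g.:

    ERP_Ld3.ne4_oQU240.F2010.cori-knl_intel.allactive-pioroot1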
jonbob commented 3 years ago

My longer ERP run also fails the memleak check, but I think I have an explanation. It's still looking at the cpl log from the first run, because the second run is done in a different subdirectory, so it does see the impact of the restart output. The ERS test does both runs in the same directory, so it makes (some) sense that the testing picks the most recent cpl log to check for a memory leak, and that log does not include the restart write.

jonbob commented 3 years ago

Thanks for the idea @amametjanov. The test does pass if I set PIO_ROOT to 1 for all components. Out of curiosity, I ran some tests with PIO_ROOT=1 for all components except one. So far it fails when ELM has PIO_ROOT=0, but I'll keep testing. @rljacob - I can try the older version as you suggested and just confirm that this issue is not something new.

jonbob commented 3 years ago

Also fails if PIO_ROOT=0 for EAM

jonbob commented 3 years ago

If I set PIO_ROOT=0 for ELM or EAM, the memory test fails (with all other components using PIO_ROOT=1). If cpl, ocn, or ice is the only component using PIO_ROOT=0, the test passes

jonbob commented 3 years ago

@rljacob - using master from 9/10/2021, ERP_Ld3.ne4_ne4.FC5AV1C-L.cori-knl_intel fails the memory test. So whatever problem this is, it was not introduced in the last few weeks

jedwards4b commented 3 years ago

If there is a mem leak, you should be able to see it in an SMS test of sufficient length, but instead of relying on the test you should inspect the coupler logs. Look for lines like:

memory_write: model date =   20191114       0 memory =   66044.02 MB (highwater)        542.60 MB (usage)  (pe=    0 comps= cpl ATM LND GLC ESP)

and watch for growth in the memusage field (542.60 in this example). You may need to change the PE layout to match that of the ERP test, for example if the problem is threading.
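For example, a small script can pull those fields out of a cpl log (a sketch assuming exactly the line format shown above; adjust the pattern if your log differs):

    import re

    # Matches the memory_write lines illustrated above.
    MEMLINE = re.compile(
        r"memory_write:\s+model date =\s+(\d+)\s+\d+\s+memory =\s+"
        r"([\d.]+)\s+MB \(highwater\)\s+([\d.]+)\s+MB \(usage\)"
    )

    def memory_series(cpl_log_path):
        # Yield (model_date, highwater_mb, usage_mb) for each memory_write line.
        with open(cpl_log_path) as log:
            for line in log:
                match = MEMLINE.search(line)
                if match:
                    yield match.group(1), float(match.group(2)), float(match.group(3))

    # Watch the usage column for steady growth across model days:
    # for date, highwater, usage in memory_series("cpl.log"):
    #     print(date, highwater, usage)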

jonbob commented 3 years ago

Thanks @jedwards4b . As far as I can tell, the memory jump just comes from writing restart files, so an SMS test probably wouldn't catch it. I don't know if the jump is indicative of a leak or simply more usage for I/O (or just I)

jedwards4b commented 3 years ago

That could be the issue - the memory leak test is a little faulty in that regard. You can modify the SMS test to turn on high-frequency output, then run for a month just to be sure. Once you've read in the data and written the first output, you should expect the memory usage to stabilize.

jonbob commented 3 years ago

Thanks again @jedwards4b. I'll check with Rob and see if he's that concerned. My sense is that since this is the only test that fails the memleak check, it's simply capturing the restart write usage inadvertently. But it's Rob's call. Thanks for weighing in

jonbob commented 3 years ago

@rljacob - what do you want to do with this issue? I think my tests show that the reported memory leak is really highwater memory growth during the writing of restart files, particularly by ELM or EAM. I don't know if this is due to something in PIO or those components, or something about the platforms/compilers?

jonbob commented 3 years ago

@rljacob -- my other suggestion is that we go back to a 9-step ERP test, since that would avoid this memory problem

rljacob commented 3 years ago

The 9-step test didn't work when the river was changed from ROF to MOSART.

I think you found the problem in this comment: https://github.com/E3SM-Project/E3SM/issues/4546#issuecomment-923480161 We should just change ERP to work like ERS.

It's still weird that this is compiler dependent. One would think the amount of memory allocated wouldn't depend on that.

jonbob commented 3 years ago

I agree, I can't make much sense of the memory allocation and why it's so compiler dependent. And I'm not sure if we should dig in to understand it or not? Thanks for the reminder about MOSART not working correctly on the 9-step test -- that makes sense as well

dqwu commented 3 years ago

@jonbob FYI, this test called MPI_Abort with a high error code during case2run on anlgce (ANL GCE node, Ubuntu 18, 4.15.0-147-generic, GCC 8.3.0):

...
[1]  Opened existing file /nfs/gce/projects/climate/inputdata/lnd/clm2/snicardata/snicar_drdt_bst_fit_60_c070416.nc          98
[1]  Opened existing file /nfs/gce/projects/climate/inputdata/lnd/clm2/paramdata/clm_params_c180301.nc          99
[1]  Opened existing file ERP_Ld3.ne4_oQU240.F2010.anlgce_gnu.20211018_114718_3s1shh.elm.r.0001-01-03-00000.nc         100
[1]  Opened existing file ERP_Ld3.ne4_oQU240.F2010.anlgce_gnu.20211018_114718_3s1shh.elm.rh0.0001-01-03-00000.nc         101
[1]  Opened existing file ERP_Ld3.ne4_oQU240.F2010.anlgce_gnu.20211018_114718_3s1shh.mosart.r.0001-01-03-00000.nc         102
[1]  Opened existing file /nfs/gce/projects/climate/inputdata/rof/mosart/MOSART_global_half_20180721a.nc         103
[1] MOSART decomp info proc =         1 begr =     32401 endr =     64800 numr =     32400
[3] MOSART decomp info proc =         3 begr =     97201 endr =    129600 numr =     32400
[6] MOSART decomp info proc =         6 begr =    194401 endr =    226800 numr =     32400
[7] MOSART decomp info proc =         7 begr =    226801 endr =    259200 numr =     32400
[1]  Opened existing file /nfs/gce/projects/climate/inputdata/rof/mosart/MOSART_global_half_20180721a.nc         104
[1]  Opened existing file ERP_Ld3.ne4_oQU240.F2010.anlgce_gnu.20211018_114718_3s1shh.mosart.r.0001-01-03-00000.nc         105
[1]  Opened existing file ERP_Ld3.ne4_oQU240.F2010.anlgce_gnu.20211018_114718_3s1shh.mosart.rh0.0001-01-03-00000.nc         106
[0]  Note: MPAS has requested an MPI threading level of MPI_THREAD_MULTIPLE, but
[0]        this is not supported by the MPI implementation; a threading level of
[0]        MPI_THREAD_SINGLE will be used instead.
[0] application called MPI_Abort(MPI_COMM_WORLD, 1734831948) - process 0

This issue is not reproducible on anlworkstation (legacy ANL workstations, Ubuntu 16, 4.4.0-210-generic, GCC 8.2.0)

rljacob commented 3 years ago

@dqwu that is unrelated to the memory leak issue and is a problem with the GCE MPI library: "MPAS has requested an MPI threading level of MPI_THREAD_MULTIPLE, but this is not supported by the MPI implementation;"

dqwu commented 3 years ago

@rljacob "MPAS has requested an MPI threading level of MPI_THREAD_MULTIPLE, but this is not supported by the MPI implementation;" seems to be a warning, which does not cause MPI_Abort.

On legacy anlworkstation, the warning is the same, but no MPI_Abort is called:

[1]  Opened existing file /home/climate1/acme/inputdata/rof/mosart/MOSART_global_half_20180721a.nc         104
[1]  Opened existing file ERP_Ld3.ne4_oQU240.F2010.anlworkstation_gnu.C.20211018_030120_87oktb.mosart.r.0001-01-03-00000.nc         105
[1]  Opened existing file ERP_Ld3.ne4_oQU240.F2010.anlworkstation_gnu.C.20211018_030120_87oktb.mosart.rh0.0001-01-03-00000.nc         106
[0]  Note: MPAS has requested an MPI threading level of MPI_THREAD_MULTIPLE, but
[0]        this is not supported by the MPI implementation; a threading level of
[0]        MPI_THREAD_SINGLE will be used instead.
[0] MCT::m_Router::initp_: GSMap indices not increasing...Will correct
[0] MCT::m_Router::initp_: RGSMap indices not increasing...Will correct
[0] MCT::m_Router::initp_: RGSMap indices not increasing...Will correct
[0] MCT::m_Router::initp_: GSMap indices not increasing...Will correct
[1]  Opened existing file ERP_Ld3.ne4_oQU240.F2010.anlworkstation_gnu.C.20211018_030120_87oktb.eam.rs.0001-01-03-00000.nc         119
[1]  Opened existing file /home/climate1/acme/inputdata/lnd/clm2/surfdata_map/surfdata_ne4np4_simyr2010_c210908.nc         138
dqwu commented 3 years ago

@jonbob Do you know if MPAS code ever calls MPI_Abort with error code 1734831948 in some cases? This error code is also mentioned in https://forum.mmm.ucar.edu/phpBB3/viewtopic.php?t=8347

dqwu commented 3 years ago

@rljacob @jonbob Additional information from log.seaice.0000.err

----------------------------------------------------------------------
Beginning MPAS-seaice Error Log File for task       0 of       8
    Opened at 2021/10/18 12:01:11
----------------------------------------------------------------------

ERROR: Could not open block decomposition file for 8 blocks.
CRITICAL ERROR: Filename: /nfs/gce/projects/climate/inputdata/ice/mpas-cice/oQU240/mpas-cice.graph.info.151209.part.8
Logging complete.  Closing file at 2021/10/18 12:01:11
jonbob commented 3 years ago

@dqwu - I was just writing to request the last lines from the ice log and any err output. There likely is no 8-processor decomposition file (mpas-cice.graph.info.151209.part.X) for that grid. I can create one, but you may need to run that test with a greater number of pes anyway

rljacob commented 3 years ago

@dqwu please take this discussion elsewhere. It has nothing to do with the topic of this issue.

jonbob commented 3 years ago

@rljacob - what about changing the test to ERP_Ln18.ne4_oQU240.F2010 until we figure out the compiler memory dependence? It seems like running 18 steps will let MOSART fit in, while running for a shorter amount of time will keep the memory checker from causing issues. I tested it on compy and got this:

PASS ERP_Ln18.ne4_oQU240.F2010.compy_intel CREATE_NEWCASE
PASS ERP_Ln18.ne4_oQU240.F2010.compy_intel XML
PASS ERP_Ln18.ne4_oQU240.F2010.compy_intel SETUP
PASS ERP_Ln18.ne4_oQU240.F2010.compy_intel SHAREDLIB_BUILD time=376
PASS ERP_Ln18.ne4_oQU240.F2010.compy_intel MODEL_BUILD time=1018
PASS ERP_Ln18.ne4_oQU240.F2010.compy_intel SUBMIT
PASS ERP_Ln18.ne4_oQU240.F2010.compy_intel RUN time=82
PASS ERP_Ln18.ne4_oQU240.F2010.compy_intel COMPARE_base_rest
PASS ERP_Ln18.ne4_oQU240.F2010.compy_intel MEMLEAK insuffiencient data for memleak test
PASS ERP_Ln18.ne4_oQU240.F2010.compy_intel SHORT_TERM_ARCHIVER
rljacob commented 3 years ago

Yes that's worth trying. Ask Wade to try it on mappy to make sure.