@jonbob ran a 20-day test on anvil with mpassi on its own set of pes and INFO_MPROF=1 so we get memory output for every ROOTPE. The output shows this:
Resident memory size:
ROOTPE | comps | RSS day1 | RSS day2 | RSS day20 |
---|---|---|---|---|
0 | cpl, atm | 744.578 | 779.293 | 858.012 |
108 | lnd, ocn, rof | 247.691 | 253.754 | 331.004 |
144 | ice | 322.328 | 323.141 | 342.570 |
High water memory:
ROOTPE | comps | VSZ day1 | VSZ day2 | VSZ day20 |
---|---|---|---|---|
0 | cpl, atm | 6871.63 | 6871.63 | 6916.51 |
108 | lnd, ocn, rof | 5773.75 | 5777.21 | 5843.43 |
144 | ice | 5833.89 | 5851.89 | 5940.97 |
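For reference, a rough sketch of the kind of case changes behind a run like that (the NTASKS value is a placeholder rather than the actual anvil layout, and INFO_MPROF is assumed to be exposed as an xmlchange-able variable on this checkout):
./xmlchange ROOTPE_LND=108,ROOTPE_OCN=108,ROOTPE_ROF=108
./xmlchange ROOTPE_ICE=144,NTASKS_ICE=36   # mpassi on its own set of pes (placeholder task count)
./xmlchange INFO_MPROF=1                   # per-ROOTPE memory reporting, as quoted above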
Maybe we should try SMS and ERS to see if we get a memory leak there as well. If it is just ERP, it might have something to do with threading.
It's also possibly related to the compiler version, since intel 19 shows it but intel 20 does not.
More from @jonbob: On cori, if I move mpassi off to its own processor set, the test still fails and complains about memory on the nodes with cpl/atm/lnd/ocn/rof – so I don't believe this is related to a memory leak in mpassi. I can put each component on its own pes and try to see which one is causing trouble?
OK, I moved lnd/rof/ocn off to another set of pes and the test still fails, pointing at a rootpe that only has cpl and atm. So if there is a memory leak, it would be with one of those two components?
For what it’s worth, the failing test on cori is pointing at the memory highwater when it complains. It does grow on cori but not on anvil, so I’m confused… The versions of intel are different, but not that much – cori has intel/19.0.3.199 and anvil uses intel/20.0.4
Another test from @jonbob: "Could it be that EAM has a leak related to the prescribed seaice data? Going from seaice on ne4 to oQU240 means going from 866 horizontal points to 7153.
ERP_Ld3.ne4_oQU480.F2010.cori-knl_intel fails as well. Trying now with ne4pg2 – unless there’s a reason we want ne4 instead
ERP_Ld3.ne4pg2_oQU480.F2010.cori-knl_intel also fails, so there goes that idea"
@singhbalwinder - I ran an ERS test on cori, but it had similar results. It also defaulted to running threaded, so I'll try turning threading off and running again
@rljacob - I'll try on compy, since it also has intel19 support. At least that may help resolve the question of compiler version
Thanks @jonbob . Is it reproducible with an SMS test? If yes, it may be easier to reproduce/debug.
It should be, but I'll try a test to make sure
Note that the ERS tests don't fail the memory test; they just report insufficient data. But the memory high-water changes are consistent with the ERP test results
Another thing to try: go back to master before PR #4488 and see if ERP_Ld3.ne4_ne4.FC5AV1C-L has a memory leak on cori (running the old test for 3 days instead of 9 steps). The results are implying that it's in the atmosphere. If that test passes, we have to look at what code is being activated for 2000_EAM%AV1C-L vs. 2010_EAM%CMIP6
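A sketch of that check, assuming a standard E3SM checkout where create_test lives under cime/scripts (the commit hash is a placeholder):
git checkout <commit-before-PR-4488>
cd cime/scripts
./create_test ERP_Ld3.ne4_ne4.FC5AV1C-L.cori-knl_intel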
Actually the results are more strongly indicating it's a compiler bug and we should just upgrade the compilers.
On compy using intel 19.0.5, the ERP memory leak test passes
The SMS test doesn't show the memory leak, which is a bit confusing until I looked closer at the cpl logs. I think the issue isn't a memory leak per se but an I/O issue because of how the test is constructed. Running ERP or ERS for 3 days means the model will drop restart files at the end of the second day -- so the memory highwater is actually catching any memory used for restart output. Usually the memory leak test throws out the last day, probably to avoid measuring all the I/O finalization.
I like that explanation but why don't all the 3 day ERP tests have a false memory leak?
I like it too, but that is the obvious question. Unless some of the platforms and/or compilers are impacted differently by the restart output? But that's not something I have much knowledge of
I'm trying a full-length ERS test on cori, just to see what that does
Somehow the full-length ERS test passes the memleak, even though the memory highwater goes from 4743.46 to 5241.42 between days 5 and 6 (or 6 and 7, depending on how you count) when it writes the restart files. I would have expected that to cause it to fail, since the memory increase is more than 10%?
And nothing in TestStatus.log about memory, so I'm a bit confused
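For reference, a quick way to see what the harness actually recorded for that phase, run from the test's case directory:
grep -i memleak TestStatus TestStatus.log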
@amametjanov do you know if some tests have memory leak checking on and others don't?
env_test.xml does have a value for the tolerance:
<entry id="TEST_MEMLEAK_TOLERANCE" value="0.10">
<type>real</type>
<desc>Expected relative memory usage growth for test</desc>
</entry>
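For anyone experimenting, the tolerance can be inspected or loosened per case with the standard CIME case tools (the 0.20 here is just an example value):
./xmlquery TEST_MEMLEAK_TOLERANCE
./xmlchange TEST_MEMLEAK_TOLERANCE=0.20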
Unless it does the memleak test on the second run, in which case it wouldn't find an issue. I guess for the ERP_Ld3 the second run wouldn't be long enough for a memory comparison and it's possible it then uses the first run? But I'm not sure how much logic is built into that part of the testing system. Let me try one final run, a full-length ERP test
All tests have memleak checking: the default tolerance is 10%. A machine-specific value for TEST_MEMLEAK_TOLERANCE can be set in config_machines.xml.
I think the mem-highwater increase in the initial run of ERP is due to PIO_ROOT=0 and CPL_ROOTPE=0. A workaround is to add a testmod:
$ cat cime_config/testmods_dirs/allactive/pioroot1/shell_commands
./xmlchange PIO_ROOT=1
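A hypothetical invocation that picks up that testmod (the allactive-pioroot1 suffix maps to the cime_config/testmods_dirs/allactive/pioroot1 directory above):
./create_test ERP_Ld3.ne4_oQU240.F2010.cori-knl_intel.allactive-pioroot1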
My longer ERP run also fails the memleak check, but I think I have an explanation. It's still looking at the cpl log from the first run because the second one is done in a different subdirectory -- so it does see the impact from the restart output. The ERS does both runs in the same directory, so it makes (some) sense that the testing picks the most recent cpl log to look at for a memory leak, which does not have the restart write.
Thanks for the idea @amametjanov. The test does pass if I make the PIO_ROOT 1 for all components. Out of curiosity I ran some tests with PIO_ROOT 1 for all components except one. So far it fails when ELM has PIO_ROOT=0, but I'll keep testing. @rljacob - I can try the older version as you suggested and just confirm that this issue is not something new.
Also fails if PIO_ROOT=0 for EAM
If I set PIO_ROOT=0 for ELM or EAM, the memory test fails (with all other components using PIO_ROOT=1). If cpl, ocn, or ice is the only component using PIO_ROOT=0, the test passes
@rljacob - using master from 9/10/2021, ERP_Ld3.ne4_ne4.FC5AV1C-L.cori-knl_intel fails the memory test. So whatever problem this is, it was not introduced in the last few weeks
If there is a mem leak, you should be able to see it in an SMS test of sufficient length, but instead of relying on the test you should inspect the coupler logs.
Look for lines like
memory_write: model date = 20191114 0 memory = 66044.02 MB (highwater) 542.60 MB (usage) (pe= 0 comps= cpl ATM LND GLC ESP)
and watch for growth in the memusage field (542.60 in this example). You may need to change the PE layout to match that of the ERP test, for example if the problem is threading.
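A minimal sketch for watching that field over a run, executed from the run directory; the awk field positions assume the exact line format shown above, so adjust them if the cpl log on your machine differs:
grep 'memory_write' cpl.log.* | awk '{printf "%s  highwater=%s MB  usage=%s MB\n", $5, $9, $12}'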
Thanks @jedwards4b . As far as I can tell, the memory jump just comes from writing restart files, so an SMS test probably wouldn't catch it. I don't know if the jump is indicative of a leak or simply more usage for I/O (or just I)
That could be the issue - the memory leak test is a little faulty in that regard. You can modify the SMS test to turn on high-frequency output and then run for a month just to be sure. Once you've read in the data and written the first output, you should expect the memory usage to stabilize.
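One way that could look, as a sketch; the month-long length and the EAM namelist values below are assumptions, not something taken from this thread. In the SMS case directory:
./xmlchange STOP_OPTION=ndays,STOP_N=31
echo " nhtfrq = -1" >> user_nl_eam    # hourly history files
echo " mfilt = 24" >> user_nl_eam     # 24 samples per file
./case.submit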
Thanks again @jedwards4b. I'll check with Rob and see if he's that concerned. My sense is that since this is the only test that fails the memleak check, it's simply capturing the restart write usage inadvertently. But it's Rob's call. Thanks for weighing in
@rljacob - what do you want to do with this issue? I think my tests show that the memory leak is caused by highwater memory during the writing of restart files, particularly by ELM or EAM. I don't know if this is due to something in PIO or those components, or something about the platforms/compilers?
@rljacob -- my other suggestion is that we go back to a 9-step ERP test, since that would sidestep this memory problem
The 9-step test didn't work once the river model was changed from the stub (SROF) to MOSART.
I think you found the problem in this comment: https://github.com/E3SM-Project/E3SM/issues/4546#issuecomment-923480161 We should just change ERP to work like ERS.
It's still weird that this is compiler dependent. One would think the amount of memory allocated wouldn't depend on that.
I agree, I can't make much sense of the memory allocation and why it's so compiler dependent. And I'm not sure if we should dig in to understand it or not? Thanks for the reminder about MOSART not working correctly on the 9-step test -- that makes sense as well
@jonbob FYI, this test called MPI_Abort with a high error code during case2run on anlgce (ANL GCE node, Ubuntu 18, 4.15.0-147-generic, GCC 8.3.0):
...
[1] Opened existing file /nfs/gce/projects/climate/inputdata/lnd/clm2/snicardata/snicar_drdt_bst_fit_60_c070416.nc 98
[1] Opened existing file /nfs/gce/projects/climate/inputdata/lnd/clm2/paramdata/clm_params_c180301.nc 99
[1] Opened existing file ERP_Ld3.ne4_oQU240.F2010.anlgce_gnu.20211018_114718_3s1shh.elm.r.0001-01-03-00000.nc 100
[1] Opened existing file ERP_Ld3.ne4_oQU240.F2010.anlgce_gnu.20211018_114718_3s1shh.elm.rh0.0001-01-03-00000.nc 101
[1] Opened existing file ERP_Ld3.ne4_oQU240.F2010.anlgce_gnu.20211018_114718_3s1shh.mosart.r.0001-01-03-00000.nc 102
[1] Opened existing file /nfs/gce/projects/climate/inputdata/rof/mosart/MOSART_global_half_20180721a.nc 103
[1] MOSART decomp info proc = 1 begr = 32401 endr = 64800 numr = 32400
[3] MOSART decomp info proc = 3 begr = 97201 endr = 129600 numr = 32400
[6] MOSART decomp info proc = 6 begr = 194401 endr = 226800 numr = 32400
[7] MOSART decomp info proc = 7 begr = 226801 endr = 259200 numr = 32400
[1] Opened existing file /nfs/gce/projects/climate/inputdata/rof/mosart/MOSART_global_half_20180721a.nc 104
[1] Opened existing file ERP_Ld3.ne4_oQU240.F2010.anlgce_gnu.20211018_114718_3s1shh.mosart.r.0001-01-03-00000.nc 105
[1] Opened existing file ERP_Ld3.ne4_oQU240.F2010.anlgce_gnu.20211018_114718_3s1shh.mosart.rh0.0001-01-03-00000.nc 106
[0] Note: MPAS has requested an MPI threading level of MPI_THREAD_MULTIPLE, but
[0] this is not supported by the MPI implementation; a threading level of
[0] MPI_THREAD_SINGLE will be used instead.
[0] application called MPI_Abort(MPI_COMM_WORLD, 1734831948) - process 0
This issue is not reproducible on anlworkstation (legacy ANL workstations, Ubuntu 16, 4.4.0-210-generic, GCC 8.2.0)
@dqwu that is unrelated to the memory leak issue and is a problem with the GCE MPI library: "MPAS has requested an MPI threading level of MPI_THREAD_MULTIPLE, but this is not supported by the MPI implementation;"
@rljacob "MPAS has requested an MPI threading level of MPI_THREAD_MULTIPLE, but this is not supported by the MPI implementation;" seems to be a warning, which does not cause MPI_Abort.
On the legacy anlworkstation, the warning is the same but there is no MPI_Abort called.
[1] Opened existing file /home/climate1/acme/inputdata/rof/mosart/MOSART_global_half_20180721a.nc 104
[1] Opened existing file ERP_Ld3.ne4_oQU240.F2010.anlworkstation_gnu.C.20211018_030120_87oktb.mosart.r.0001-01-03-00000.nc 105
[1] Opened existing file ERP_Ld3.ne4_oQU240.F2010.anlworkstation_gnu.C.20211018_030120_87oktb.mosart.rh0.0001-01-03-00000.nc 106
[0] Note: MPAS has requested an MPI threading level of MPI_THREAD_MULTIPLE, but
[0] this is not supported by the MPI implementation; a threading level of
[0] MPI_THREAD_SINGLE will be used instead.
[0] MCT::m_Router::initp_: GSMap indices not increasing...Will correct
[0] MCT::m_Router::initp_: RGSMap indices not increasing...Will correct
[0] MCT::m_Router::initp_: RGSMap indices not increasing...Will correct
[0] MCT::m_Router::initp_: GSMap indices not increasing...Will correct
[1] Opened existing file ERP_Ld3.ne4_oQU240.F2010.anlworkstation_gnu.C.20211018_030120_87oktb.eam.rs.0001-01-03-00000.nc 119
[1] Opened existing file /home/climate1/acme/inputdata/lnd/clm2/surfdata_map/surfdata_ne4np4_simyr2010_c210908.nc 138
@jonbob Do you know if MPAS code ever calls MPI_Abort with error code 1734831948 in some cases? This error code is also mentioned in https://forum.mmm.ucar.edu/phpBB3/viewtopic.php?t=8347
@rljacob @jonbob Additional information from log.seaice.0000.err
----------------------------------------------------------------------
Beginning MPAS-seaice Error Log File for task 0 of 8
Opened at 2021/10/18 12:01:11
----------------------------------------------------------------------
ERROR: Could not open block decomposition file for 8 blocks.
CRITICAL ERROR: Filename: /nfs/gce/projects/climate/inputdata/ice/mpas-cice/oQU240/mpas-cice.graph.info.151209.part.8
Logging complete. Closing file at 2021/10/18 12:01:11
@dqwu - I was just writing to request the last lines from the ice log and any err output. There likely is no 8-processor decomposition file (mpas-cice.graph.info.151209.part.X) for that grid. I can create one, but you may need to run that test with a greater number of pes anyway
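If it helps, a sketch of generating the missing 8-way partition with METIS (gpmetis is the usual tool for producing MPAS graph partition files; run it against the graph file named in the error above):
gpmetis mpas-cice.graph.info.151209 8   # writes mpas-cice.graph.info.151209.part.8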
@dqwu please take this discussion elsewhere. It has nothing to do with the topic of this issue.
@rljacob - what about changing the test to ERP_Ln18.ne4_oQU240.F2010 until we figure out the compiler memory dependence? It seems like running 18 steps will allow MOSART to fit in, while running for a shorter amount of time will keep the memory checker from causing issues. I tested it on compy and got this:
PASS ERP_Ln18.ne4_oQU240.F2010.compy_intel CREATE_NEWCASE
PASS ERP_Ln18.ne4_oQU240.F2010.compy_intel XML
PASS ERP_Ln18.ne4_oQU240.F2010.compy_intel SETUP
PASS ERP_Ln18.ne4_oQU240.F2010.compy_intel SHAREDLIB_BUILD time=376
PASS ERP_Ln18.ne4_oQU240.F2010.compy_intel MODEL_BUILD time=1018
PASS ERP_Ln18.ne4_oQU240.F2010.compy_intel SUBMIT
PASS ERP_Ln18.ne4_oQU240.F2010.compy_intel RUN time=82
PASS ERP_Ln18.ne4_oQU240.F2010.compy_intel COMPARE_base_rest
PASS ERP_Ln18.ne4_oQU240.F2010.compy_intel MEMLEAK insuffiencient data for memleak test
PASS ERP_Ln18.ne4_oQU240.F2010.compy_intel SHORT_TERM_ARCHIVER
Yes that's worth trying. Ask Wade to try it on mappy to make sure.
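Something like the following, assuming the gnu compiler listed for mappy in the issue summary below:
./create_test ERP_Ln18.ne4_oQU240.F2010.mappy_gnu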
A memory leak is being reported on some machines for ERP_Ld3.ne4_oQU240.F2010.
The leak started after the test was changed in PR #4488. It went from
ERP_Ln9.ne4_ne4.FC5AV1C-L (longname: 2000_EAM%AV1C-L_ELM%SPBC_CICE%PRES_DOCN%DOM_SROF_SGLC_SWAV)
to
ERP_Ld3.ne4_oQU240.F2010 (longname: 2010_EAM%CMIP6_ELM%SPBC_MPASSI%PRES_DOCN%DOM_MOSART_SGLC_SWAV)
The message in TestStatus is something like: 2021-09-18 04:18:16: memleak detected, memory went from 5680.520000 to 6473.240000 in 1 days
Machines with memory leak:
- mappy, gnu 8.1.0
- cori-knl, intel 19.0.3
- cori-haswell, intel 19.0.3
- theta, intel 19.1.0

Machines without leak:
- chrysalis, intel 20.0.4
- anvil, intel 20.0.4
- compy, pgi 19.10
- compy, intel 19.0.5
- ascent, gnu 8.1.1
- ascent, xlf 16.1.1