ESMCI / cime

Common Infrastructure for Modeling the Earth
http://esmci.github.io/cime

PR #3716 is breaking Memory Leak Test #3735

Closed fischer-ncar closed 3 years ago

fischer-ncar commented 3 years ago

PR #3716 adds a memory usage query before the run loop. This causes false MEMLEAK failures for mpi-serial tests such as

SMS_Ly3_Mmpi-serial.1x1_numaIA.I2000Clm50BgcCropQianGs.cheyenne_intel.clm-clm50dynroots

A possible fix would be to skip the first memory usage message in the logs during the memory leak test.

rljacob commented 3 years ago

Can this be reproduced with an A or X case?

fischer-ncar commented 3 years ago

I haven't been able to reproduce it with an A or X case.

amametjanov commented 3 years ago

@fischer-ncar can you please post an example from that case run? Here is what I see with current master in SMS.f19_g16_rx1.A.anvil_intel's cpl.log:

...
(seq_mct_drv) : Model initialization complete

 memory_write: model date =   00010101       0 memory =    1262.14 MB (highwater)        194.73 MB (usage)  (pe=    0 comps= cpl ATM LND ICE OCN GLC WAV IAC ESP)
(prep_ice_merge)  Summary:
...
 tStamp_write: model date =   00010102       0 wall clock = 2020-10-12 18:14:57 avg dt =     0.80 dt =     0.80
 memory_write: model date =   00010102       0 memory =    1262.14 MB (highwater)        194.91 MB (usage)  (pe=    0 comps= cpl ATM LND ICE OCN GLC WAV IAC ESP)
 tStamp_write: model date =   00010103       0 wall clock = 2020-10-12 18:14:58 avg dt =     0.81 dt =     0.81
 memory_write: model date =   00010103       0 memory =    1263.10 MB (highwater)        195.56 MB (usage)  (pe=    0 comps= cpl ATM LND ICE OCN GLC WAV IAC ESP)
...

The mem-leak check looks at the highwater numbers: if highwater[penultimate day] - highwater[first written day] is greater than the machine-specific tolerance (or 10%, if no TEST_MEMLEAK_TOLERANCE is specified), a MEMLEAK error is raised.
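To make the check concrete, here is a rough Python sketch of the comparison described above. It is not the actual CIME code: the names `MEM_RE`, `highwater_series` and `has_memleak` are made up for illustration, the regex is written only against the `memory_write` lines pasted in this thread, and dividing the difference by the first written value is an assumption about how the percentage tolerance is applied.

```python
import re

# Matches the "memory_write" lines as they appear in the cpl.log excerpts here.
MEM_RE = re.compile(
    r"memory_write: model date =\s*(\d+)\s+\d+\s+memory =\s*([\d.]+) MB \(highwater\)"
)

def highwater_series(log_text):
    """Return [(model_date, highwater_MB), ...] in the order written to the log."""
    return [(int(date), float(mb)) for date, mb in MEM_RE.findall(log_text)]

def has_memleak(series, tolerance=0.10):
    """Compare the penultimate highwater with the first written one.

    Returns None when there are too few samples to judge.
    Assumption: the tolerance is a fraction of the first written value.
    """
    if len(series) < 3:
        return None
    first = series[0][1]
    penultimate = series[-2][1]  # the last entry is skipped (heavier I/O)
    return (penultimate - first) / first > tolerance
```

With this shape, the first written value is the baseline for the whole check, so any one-off jump right after it counts as growth.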

@jgfouca, for mem-leak checking purposes, maybe the highwater at day 1 could be ignored, the same way the highwater at the last day is ignored because of heavier I/O?

fischer-ncar commented 3 years ago

Here's an example of that case run.

(seq_mct_drv) : Model initialization complete 

 memory_write: model date =   20000101       0 memory =     622.34 MB (highwater)        306.55 MB (usage)  (pe=    0 comps= cpl ATM LND ICE OCN GLC WAV IAC ESP)
(prep_lnd_merge)  Summary:
...
 tStamp_write: model date =   20000102       0 wall clock = 2020-10-09 22:11:04 avg dt =     0.33 dt =     0.33
 memory_write: model date =   20000102       0 memory =     691.14 MB (highwater)        311.25 MB (usage)  (pe=    0 comps= cpl ATM LND ICE OCN GLC WAV IAC ESP)
 tStamp_write: model date =   20000103       0 wall clock = 2020-10-09 22:11:05 avg dt =     0.29 dt =     0.25
 memory_write: model date =   20000103       0 memory =     691.14 MB (highwater)        311.25 MB (usage)  (pe=    0 comps= cpl ATM LND ICE OCN GLC WAV IAC ESP)
 tStamp_write: model date =   20000104       0 wall clock = 2020-10-09 22:11:05 avg dt =     0.28 dt =     0.25
 memory_write: model date =   20000104       0 memory =     691.14 MB (highwater)        311.25 MB (usage)  (pe=    0 comps= cpl ATM LND ICE OCN GLC WAV IAC ESP)
 tStamp_write: model date =   20000105       0 wall clock = 2020-10-09 22:11:05 avg dt =     0.27 dt =     0.25
 memory_write: model date =   20000105       0 memory =     691.14 MB (highwater)        311.25 MB (usage)  (pe=    0 comps= cpl ATM LND ICE OCN GLC WAV IAC ESP)
 tStamp_write: model date =   20000106       0 wall clock = 2020-10-09 22:11:05 avg dt =     0.27 dt =     0.26
 memory_write: model date =   20000106       0 memory =     691.14 MB (highwater)        311.25 MB (usage)  (pe=    0 comps= cpl ATM LND ICE OCN GLC WAV IAC ESP)
 tStamp_write: model date =   20000107       0 wall clock = 2020-10-09 22:11:06 avg dt =     0.26 dt =     0.25
 memory_write: model date =   20000107       0 memory =     691.14 MB (highwater)        311.25 MB (usage)  (pe=    0 comps= cpl ATM LND ICE OCN GLC WAV IAC ESP)
 tStamp_write: model date =   20000108       0 wall clock = 2020-10-09 22:11:06 avg dt =     0.26 dt =     0.25
 memory_write: model date =   20000108       0 memory =     691.14 MB (highwater)        311.25 MB (usage)  (pe=    0 comps= cpl ATM LND ICE OCN GLC WAV IAC ESP)
 tStamp_write: model date =   20000109       0 wall clock = 2020-10-09 22:11:06 avg dt =     0.26 dt =     0.25
 memory_write: model date =   20000109       0 memory =     691.14 MB (highwater)        311.25 MB (usage)  (pe=    0 comps= cpl ATM LND ICE OCN GLC WAV IAC ESP)
 tStamp_write: model date =   20000110       0 wall clock = 2020-10-09 22:11:06 avg dt =     0.26 dt =     0.29
 memory_write: model date =   20000110       0 memory =     691.14 MB (highwater)        319.25 MB (usage)  (pe=    0 comps= cpl ATM LND ICE OCN GLC WAV IAC ESP)
 tStamp_write: model date =   20000111       0 wall clock = 2020-10-09 22:11:07 avg dt =     0.26 dt =     0.25
 memory_write: model date =   20000111       0 memory =     691.14 MB (highwater)        319.25 MB (usage)  (pe=    0 comps= cpl ATM LND ICE OCN GLC WAV IAC ESP)
 tStamp_write: model date =   20000112       0 wall clock = 2020-10-09 22:11:07 avg dt =     0.28 dt =     0.43
 memory_write: model date =   20000112       0 memory =     691.14 MB (highwater)        319.25 MB (usage)  (pe=    0 comps= cpl ATM LND ICE OCN GLC WAV IAC ESP)
 tStamp_write: model date =   20000113       0 wall clock = 2020-10-09 22:11:07 avg dt =     0.28 dt =     0.26
 memory_write: model date =   20000113       0 memory =     691.14 MB (highwater)        319.25 MB (usage)  (pe=    0 comps= cpl ATM LND ICE OCN GLC WAV IAC ESP)
 tStamp_write: model date =   20000114       0 wall clock = 2020-10-09 22:11:08 avg dt =     0.28 dt =     0.30
 memory_write: model date =   20000114       0 memory =     691.14 MB (highwater)        319.25 MB (usage)  (pe=    0 comps= cpl ATM LND ICE OCN GLC WAV IAC ESP)
 tStamp_write: model date =   20000115       0 wall clock = 2020-10-09 22:11:08 avg dt =     0.28 dt =     0.28
 memory_write: model date =   20000115       0 memory =     691.14 MB (highwater)        319.25 MB (usage)  (pe=    0 comps= cpl ATM LND ICE OCN GLC WAV IAC ESP)
 tStamp_write: model date =   20000116       0 wall clock = 2020-10-09 22:11:08 avg dt =     0.28 dt =     0.34
 memory_write: model date =   20000116       0 memory =     691.14 MB (highwater)        327.25 MB (usage)  (pe=    0 comps= cpl ATM LND ICE OCN GLC WAV IAC ESP)
 tStamp_write: model date =   20000117       0 wall clock = 2020-10-09 22:11:09 avg dt =     0.28 dt =     0.30
 memory_write: model date =   20000117       0 memory =     702.97 MB (highwater)        339.08 MB (usage)  (pe=    0 comps= cpl ATM LND ICE OCN GLC WAV IAC ESP)
 tStamp_write: model date =   20000118       0 wall clock = 2020-10-09 22:11:09 avg dt =     0.28 dt =     0.27
 memory_write: model date =   20000118       0 memory =     702.97 MB (highwater)        339.08 MB (usage)  (pe=    0 comps= cpl ATM LND ICE OCN GLC WAV IAC ESP)
...
tStamp_write: model date =   20021231       0 wall clock = 2020-10-09 22:16:11 avg dt =     0.28 dt =     0.26
 memory_write: model date =   20021231       0 memory =     702.97 MB (highwater)        362.95 MB (usage)  (pe=    0 comps= cpl ATM LND ICE OCN GLC WAV IAC ESP)
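
For reference, the arithmetic on the highwater values in this log can be sketched as below. This is only an illustration: it assumes the tolerance is the 10% default applied relative to the starting value, and it uses the 20021231 value as the late-run sample because the lines in between are elided.

```python
# Highwater values (MB) copied from the log above.
first_written = 622.34   # model date 20000101, written before the run loop
second_written = 691.14  # model date 20000102
late_value = 702.97      # model date 20021231

TOLERANCE = 0.10  # default mentioned earlier in the thread (assumed relative)

def relative_growth(start, end):
    """Growth from start to end as a fraction of start."""
    return (end - start) / start

with_first = relative_growth(first_written, late_value)
without_first = relative_growth(second_written, late_value)

print(f"including first message: {with_first:.1%}")
print(f"skipping first message:  {without_first:.1%}")
```

Under those assumptions the growth is roughly 13% when the pre-run-loop message is the baseline and under 2% when it is skipped, so only the first case exceeds a 10% tolerance.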