ESCOMP / CTSM

Community Terrestrial Systems Model (includes the Community Land Model of CESM)
http://www.cesm.ucar.edu/models/cesm2.0/land/

Memory leak in 10-day tracer consistency test with hobart_nag #763

Closed billsacks closed 5 years ago

billsacks commented 5 years ago

Brief summary of bug

I just changed the one-timestep tracer consistency test to a 10-day test: SMS_D_Ld10.f10_f10_musgs.I2000Clm50BgcCropGs.hobart_nag.clm-tracer_consistency. It is now failing the MEMLEAK test - e.g.:

FAIL SMS_D_Ld10.f10_f10_musgs.I2000Clm50BgcCropGs.hobart_nag.clm-tracer_consistency MEMLEAK memleak detected, memory went from 1309.480000 to 1464.990000 in 8 days

General bug information

CTSM version you are using: ctsm1.0.dev050-20-gf96cbf94

Does this bug cause significantly incorrect results in the model's science? No

Configurations affected: Configurations with water tracers, but seemingly just with hobart_nag (I haven't checked izumi_nag).

Details of bug

Here is the memory growth over this 10-day test (SMS_D_Ld10.f10_f10_musgs.I2000Clm50BgcCropGs.hobart_nag.clm-tracer_consistency). This is from a different run than the one summarized in the MEMLEAK line above, hence the slightly different numbers:

 memory_write: model date =   20000102       0 memory =    1240.24 MB (highwater)        234.47 MB (usage)  (pe=    0 comps= cpl ATM LND ICE OCN GLC WAV IAC ESP)
 memory_write: model date =   20000103       0 memory =    1259.48 MB (highwater)        255.86 MB (usage)  (pe=    0 comps= cpl ATM LND ICE OCN GLC WAV IAC ESP)
 memory_write: model date =   20000104       0 memory =    1279.66 MB (highwater)        276.51 MB (usage)  (pe=    0 comps= cpl ATM LND ICE OCN GLC WAV IAC ESP)
 memory_write: model date =   20000105       0 memory =    1298.84 MB (highwater)        296.54 MB (usage)  (pe=    0 comps= cpl ATM LND ICE OCN GLC WAV IAC ESP)
 memory_write: model date =   20000106       0 memory =    1319.03 MB (highwater)        316.73 MB (usage)  (pe=    0 comps= cpl ATM LND ICE OCN GLC WAV IAC ESP)
 memory_write: model date =   20000107       0 memory =    1338.21 MB (highwater)        336.56 MB (usage)  (pe=    0 comps= cpl ATM LND ICE OCN GLC WAV IAC ESP)
 memory_write: model date =   20000108       0 memory =    1358.40 MB (highwater)        356.16 MB (usage)  (pe=    0 comps= cpl ATM LND ICE OCN GLC WAV IAC ESP)
 memory_write: model date =   20000109       0 memory =    1377.57 MB (highwater)        375.97 MB (usage)  (pe=    0 comps= cpl ATM LND ICE OCN GLC WAV IAC ESP)
 memory_write: model date =   20000110       0 memory =    1397.76 MB (highwater)        395.75 MB (usage)  (pe=    0 comps= cpl ATM LND ICE OCN GLC WAV IAC ESP)
 memory_write: model date =   20000111       0 memory =    1416.94 MB (highwater)        416.20 MB (usage)  (pe=    0 comps= cpl ATM LND ICE OCN GLC WAV IAC ESP)
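
For anyone who wants to quantify the growth rather than eyeball it, here is a minimal sketch that parses memory_write lines and computes the average growth per model day. The regular expression and the function names (parse_memory_lines, growth_per_day) are my own for illustration and are not part of CTSM or CIME; the sample input is just the first and last lines of the log above.

```python
import re
from datetime import datetime

# Hypothetical helper, not part of CTSM/CIME: pulls the model date and the
# two memory figures out of a "memory_write" log line.
MEMORY_LINE = re.compile(
    r"model date =\s*(\d{8})\s+\d+\s+memory =\s*([\d.]+) MB \(highwater\)"
    r"\s+([\d.]+) MB \(usage\)"
)

def parse_memory_lines(text):
    """Return a list of (date, highwater_mb, usage_mb) tuples."""
    return [
        (datetime.strptime(m.group(1), "%Y%m%d"),
         float(m.group(2)),
         float(m.group(3)))
        for m in MEMORY_LINE.finditer(text)
    ]

def growth_per_day(records, column):
    """Average growth in MB/day between the first and last records.

    column is 1 for highwater, 2 for usage.
    """
    first, last = records[0], records[-1]
    days = (last[0] - first[0]).days
    return (last[column] - first[column]) / days

# First and last lines of the tracer_consistency log above.
log = """
 memory_write: model date =   20000102       0 memory =    1240.24 MB (highwater)        234.47 MB (usage)  (pe=    0 comps= cpl ATM LND ICE OCN GLC WAV IAC ESP)
 memory_write: model date =   20000111       0 memory =    1416.94 MB (highwater)        416.20 MB (usage)  (pe=    0 comps= cpl ATM LND ICE OCN GLC WAV IAC ESP)
"""

records = parse_memory_lines(log)
highwater_rate = growth_per_day(records, 1)
usage_rate = growth_per_day(records, 2)
print(f"highwater: {highwater_rate:.2f} MB/day, usage: {usage_rate:.2f} MB/day")
```

With these numbers both highwater and usage grow by roughly 20 MB per model day, which matches the steady day-to-day increments in the log.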

I reran this test a few times and got about the same results each time. Without water tracers (SMS_D_Ld10.f10_f10_musgs.I2000Clm50BgcCropGs.hobart_nag.clm-default), there still appears to be a memory leak, but of only about 25% of the magnitude:

 memory_write: model date =   20000102       0 memory =    1208.82 MB (highwater)        230.67 MB (usage)  (pe=    0 comps= cpl ATM LND ICE OCN GLC WAV IAC ESP)
 memory_write: model date =   20000103       0 memory =    1212.76 MB (highwater)        235.74 MB (usage)  (pe=    0 comps= cpl ATM LND ICE OCN GLC WAV IAC ESP)
 memory_write: model date =   20000104       0 memory =    1217.68 MB (highwater)        241.64 MB (usage)  (pe=    0 comps= cpl ATM LND ICE OCN GLC WAV IAC ESP)
 memory_write: model date =   20000105       0 memory =    1222.59 MB (highwater)        247.27 MB (usage)  (pe=    0 comps= cpl ATM LND ICE OCN GLC WAV IAC ESP)
 memory_write: model date =   20000106       0 memory =    1227.52 MB (highwater)        252.43 MB (usage)  (pe=    0 comps= cpl ATM LND ICE OCN GLC WAV IAC ESP)
 memory_write: model date =   20000107       0 memory =    1232.44 MB (highwater)        257.46 MB (usage)  (pe=    0 comps= cpl ATM LND ICE OCN GLC WAV IAC ESP)
 memory_write: model date =   20000108       0 memory =    1237.36 MB (highwater)        262.29 MB (usage)  (pe=    0 comps= cpl ATM LND ICE OCN GLC WAV IAC ESP)
 memory_write: model date =   20000109       0 memory =    1242.28 MB (highwater)        267.27 MB (usage)  (pe=    0 comps= cpl ATM LND ICE OCN GLC WAV IAC ESP)
 memory_write: model date =   20000110       0 memory =    1247.20 MB (highwater)        272.17 MB (usage)  (pe=    0 comps= cpl ATM LND ICE OCN GLC WAV IAC ESP)
 memory_write: model date =   20000111       0 memory =    1252.13 MB (highwater)        277.67 MB (usage)  (pe=    0 comps= cpl ATM LND ICE OCN GLC WAV IAC ESP)
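
The "about 25%" figure can be checked directly from the first and last highwater values in the two logs above. This is just a worked arithmetic sketch; the variable names are mine.

```python
# Highwater memory (MB) at the first and last reported model dates,
# copied from the two hobart_nag logs above.
tracer_first, tracer_last = 1240.24, 1416.94    # clm-tracer_consistency
default_first, default_last = 1208.82, 1252.13  # clm-default

tracer_growth = tracer_last - tracer_first      # about 176.70 MB
default_growth = default_last - default_first   # about 43.31 MB

ratio = default_growth / tracer_growth
print(f"default/tracer highwater growth ratio: {ratio:.2f}")
```

The ratio comes out just under one quarter, so the leak without water tracers is real but much smaller.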

I tried turning off all output in the test with tracers (doing only monthly rather than daily output, and even setting hist_empty_htapes to true), and that didn't help much, if at all.

However, this memory leak does NOT show up on cheyenne_gnu, cheyenne_intel, hobart_gnu, hobart_intel or gnu on my laptop (bishorn), though hobart_intel shows a small increase in usage over time (highwater stays essentially flat):

 memory_write: model date =   20000102       0 memory =    1473.98 MB (highwater)        243.09 MB (usage)  (pe=    0 comps= cpl ATM LND ICE OCN GLC WAV IAC ESP)
 memory_write: model date =   20000103       0 memory =    1475.10 MB (highwater)        245.87 MB (usage)  (pe=    0 comps= cpl ATM LND ICE OCN GLC WAV IAC ESP)
 memory_write: model date =   20000104       0 memory =    1475.10 MB (highwater)        246.84 MB (usage)  (pe=    0 comps= cpl ATM LND ICE OCN GLC WAV IAC ESP)
 memory_write: model date =   20000105       0 memory =    1475.10 MB (highwater)        247.28 MB (usage)  (pe=    0 comps= cpl ATM LND ICE OCN GLC WAV IAC ESP)
 memory_write: model date =   20000106       0 memory =    1475.10 MB (highwater)        247.66 MB (usage)  (pe=    0 comps= cpl ATM LND ICE OCN GLC WAV IAC ESP)
 memory_write: model date =   20000107       0 memory =    1475.10 MB (highwater)        247.89 MB (usage)  (pe=    0 comps= cpl ATM LND ICE OCN GLC WAV IAC ESP)
 memory_write: model date =   20000108       0 memory =    1475.10 MB (highwater)        247.90 MB (usage)  (pe=    0 comps= cpl ATM LND ICE OCN GLC WAV IAC ESP)
 memory_write: model date =   20000109       0 memory =    1475.10 MB (highwater)        248.67 MB (usage)  (pe=    0 comps= cpl ATM LND ICE OCN GLC WAV IAC ESP)
 memory_write: model date =   20000110       0 memory =    1475.10 MB (highwater)        248.67 MB (usage)  (pe=    0 comps= cpl ATM LND ICE OCN GLC WAV IAC ESP)
 memory_write: model date =   20000111       0 memory =    1475.10 MB (highwater)        248.91 MB (usage)  (pe=    0 comps= cpl ATM LND ICE OCN GLC WAV IAC ESP)

billsacks commented 5 years ago

Because this problem seems limited to the nag compiler (which we don't use for production runs), and because there seems to be at least a small memory leak even without water tracers with nag, I'm going to tentatively chalk this up to a compiler-specific issue and close it as a wontfix.

billsacks commented 5 years ago

I got a slightly larger memory leak in this test in a recent run of the test suite (for ctsm1.0.dev056): FAIL SMS_D_Ld10.f10_f10_musgs.I2000Clm50BgcCropGs.hobart_nag.clm-tracer_consistency MEMLEAK memleak detected, memory went from 1250.830000 to 1518.500000 in 8 days. For now I'm just increasing the memleak tolerance on this test from 0.2 to 0.3.
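
To show why 0.3 is enough headroom for this run, here is a minimal sketch that assumes the MEMLEAK check compares the relative memory growth, (final - initial) / initial, against the tolerance. That formula is my assumption about how the test framework works, and the function names below are hypothetical, not the framework's API.

```python
def relative_growth(initial_mb, final_mb):
    """Fractional memory growth between the two reported values."""
    return (final_mb - initial_mb) / initial_mb

def memleak_detected(initial_mb, final_mb, tolerance):
    """Assumed rule: flag a leak when relative growth reaches the tolerance."""
    return relative_growth(initial_mb, final_mb) >= tolerance

# Numbers from the ctsm1.0.dev056 MEMLEAK message above.
initial_mb, final_mb = 1250.83, 1518.50

growth = relative_growth(initial_mb, final_mb)
fails_at_02 = memleak_detected(initial_mb, final_mb, tolerance=0.2)
fails_at_03 = memleak_detected(initial_mb, final_mb, tolerance=0.3)
print(f"relative growth: {growth:.3f}")
print(f"flagged at 0.2: {fails_at_02}, flagged at 0.3: {fails_at_03}")
```

Under that assumption the growth is about 21%, which exceeds a 0.2 tolerance but stays under 0.3.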