E3SM-Project / E3SM

Energy Exascale Earth System Model source code. NOTE: use "maint" branches for your work. Head of master is not validated.
https://docs.e3sm.org/E3SM

Radiation diagnostics out of memory crash #2575

Closed polunma closed 3 years ago

polunma commented 5 years ago

I have spent more than 2 weeks trying to figure out a model crash. Finally I was able to identify that the crash can be reproduced with current master without any code modification. I just checked out the latest master, made no change to the code, and enabled 10 radiation diagnostics for a run. The model ran for 1 month, wrote out h0 files, and ran a few more days and crashed with an error message stating “out of memory”. Could somebody please help?

polunma commented 5 years ago

I forgot to mention that if I restart every 1 month, the model can continue to run.

rljacob commented 5 years ago

Can you give some more info on how to reproduce? What is the create_newcase command? How does one enable 10 radiation diagnostics?

ndkeen commented 5 years ago

Yes please give more information. Ideally a create_test command (on a specific machine) and explain how to do what is different than the default. If I can repeat it, it's much more likely to make progress. Otherwise, I can only guess: If it is running out of memory, one thing we typically try is running with more nodes and/or with fewer MPI's per node. Try running without threads? If it continues for a month after restart, that does sound interesting -- it implies that there could be a memory leak which might only show after running long enough.

polunma commented 5 years ago

Thanks Rob and Noel so much for your help! Balwinder also suspected a memory leak. (He suggested trying to see if the model can continue to run with month-to-month restarts.) Here are some more details if you want to reproduce the crash:

  1. The runs are done on cori-knl. Default settings.
  2. ./create_newcase -case $CASEROOT -mach $MACH -res ne30_ne30 -compset FC5AV1C-04P2 -compiler intel
  3. To do 10 radiation diagnostics, modify the atmosphere model namelist:

cat <<EOF >! user_nl_cam
&camexp
rad_diag_1 = 'A:Q:H2O', 'N:O2:O2', 'N:CO2:CO2', 'A:O3:O3', 'N:N2O:N2O', 'N:CH4:CH4', 'N:CFC11:CFC11', 'N:CFC12:CFC12',
 'M:mam4_mode1:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode1_rrtmg_c130628.nc',
 'M:mam4_mode2:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode2_rrtmg_c130628.nc',
 'M:mam4_mode3:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode3_rrtmg_c130628.nc',
 'M:mam4_mode4:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode4_rrtmg_c130628.nc'
rad_diag_2 = 'A:Q:H2O', 'N:O2:O2', 'N:CO2:CO2', 'A:O3:O3', 'N:N2O:N2O', 'N:CH4:CH4', 'N:CFC11:CFC11', 'N:CFC12:CFC12',
 'M:mam4_mode1:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode1_rrtmg_c130628.nc',
 'M:mam4_mode2:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode2_rrtmg_c130628.nc',
 'M:mam4_mode3:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode3_rrtmg_c130628.nc',
 'M:mam4_mode4:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode4_rrtmg_c130628.nc'
rad_diag_3 = 'A:Q:H2O', 'N:O2:O2', 'N:CO2:CO2', 'A:O3:O3', 'N:N2O:N2O', 'N:CH4:CH4', 'N:CFC11:CFC11', 'N:CFC12:CFC12',
 'M:mam4_mode1:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode1_rrtmg_c130628.nc',
 'M:mam4_mode2:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode2_rrtmg_c130628.nc',
 'M:mam4_mode3:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode3_rrtmg_c130628.nc',
 'M:mam4_mode4:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode4_rrtmg_c130628.nc'
rad_diag_4 = 'A:Q:H2O', 'N:O2:O2', 'N:CO2:CO2', 'A:O3:O3', 'N:N2O:N2O', 'N:CH4:CH4', 'N:CFC11:CFC11', 'N:CFC12:CFC12',
 'M:mam4_mode1:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode1_rrtmg_c130628.nc',
 'M:mam4_mode2:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode2_rrtmg_c130628.nc',
 'M:mam4_mode3:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode3_rrtmg_c130628.nc',
 'M:mam4_mode4:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode4_rrtmg_c130628.nc'
rad_diag_5 = 'A:Q:H2O', 'N:O2:O2', 'N:CO2:CO2', 'A:O3:O3', 'N:N2O:N2O', 'N:CH4:CH4', 'N:CFC11:CFC11', 'N:CFC12:CFC12',
 'M:mam4_mode1:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode1_rrtmg_c130628.nc',
 'M:mam4_mode2:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode2_rrtmg_c130628.nc',
 'M:mam4_mode3:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode3_rrtmg_c130628.nc',
 'M:mam4_mode4:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode4_rrtmg_c130628.nc'
rad_diag_6 = 'A:Q:H2O', 'N:O2:O2', 'N:CO2:CO2', 'A:O3:O3', 'N:N2O:N2O', 'N:CH4:CH4', 'N:CFC11:CFC11', 'N:CFC12:CFC12',
 'M:mam4_mode1:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode1_rrtmg_c130628.nc',
 'M:mam4_mode2:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode2_rrtmg_c130628.nc',
 'M:mam4_mode3:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode3_rrtmg_c130628.nc',
 'M:mam4_mode4:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode4_rrtmg_c130628.nc'
rad_diag_7 = 'A:Q:H2O', 'N:O2:O2', 'N:CO2:CO2', 'A:O3:O3', 'N:N2O:N2O', 'N:CH4:CH4', 'N:CFC11:CFC11', 'N:CFC12:CFC12',
 'M:mam4_mode1:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode1_rrtmg_c130628.nc',
 'M:mam4_mode2:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode2_rrtmg_c130628.nc',
 'M:mam4_mode3:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode3_rrtmg_c130628.nc',
 'M:mam4_mode4:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode4_rrtmg_c130628.nc'
rad_diag_8 = 'A:Q:H2O', 'N:O2:O2', 'N:CO2:CO2', 'A:O3:O3', 'N:N2O:N2O', 'N:CH4:CH4', 'N:CFC11:CFC11', 'N:CFC12:CFC12',
 'M:mam4_mode1:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode1_rrtmg_c130628.nc',
 'M:mam4_mode2:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode2_rrtmg_c130628.nc',
 'M:mam4_mode3:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode3_rrtmg_c130628.nc',
 'M:mam4_mode4:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode4_rrtmg_c130628.nc'
rad_diag_9 = 'A:Q:H2O', 'N:O2:O2', 'N:CO2:CO2', 'A:O3:O3', 'N:N2O:N2O', 'N:CH4:CH4', 'N:CFC11:CFC11', 'N:CFC12:CFC12',
 'M:mam4_mode1:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode1_rrtmg_c130628.nc',
 'M:mam4_mode2:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode2_rrtmg_c130628.nc',
 'M:mam4_mode3:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode3_rrtmg_c130628.nc',
 'M:mam4_mode4:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode4_rrtmg_c130628.nc'
rad_diag_10 = 'A:Q:H2O', 'N:O2:O2', 'N:CO2:CO2', 'A:O3:O3', 'N:N2O:N2O', 'N:CH4:CH4', 'N:CFC11:CFC11', 'N:CFC12:CFC12'
fincl1 = 'FSNTC_d1','FLNTC_d1','FSNSC_d1','FLNSC_d1','FSNT_d1','FLNT_d1','FSNS_d1','FLNS_d1','QRS_d1','QRL_d1','SWCF_d1','LWCF_d1',
 'FSNTC_d2','FLNTC_d2','FSNSC_d2','FLNSC_d2','FSNT_d2','FLNT_d2','FSNS_d2','FLNS_d2','QRS_d2','QRL_d2','SWCF_d2','LWCF_d2',
 'FSNTC_d3','FLNTC_d3','FSNSC_d3','FLNSC_d3','FSNT_d3','FLNT_d3','FSNS_d3','FLNS_d3','QRS_d3','QRL_d3','SWCF_d3','LWCF_d3',
 'FSNTC_d4','FLNTC_d4','FSNSC_d4','FLNSC_d4','FSNT_d4','FLNT_d4','FSNS_d4','FLNS_d4','QRS_d4','QRL_d4','SWCF_d4','LWCF_d4',
 'FSNTC_d5','FLNTC_d5','FSNSC_d5','FLNSC_d5','FSNT_d5','FLNT_d5','FSNS_d5','FLNS_d5','QRS_d5','QRL_d5','SWCF_d5','LWCF_d5',
 'FSNTC_d6','FLNTC_d6','FSNSC_d6','FLNSC_d6','FSNT_d6','FLNT_d6','FSNS_d6','FLNS_d6','QRS_d6','QRL_d6','SWCF_d6','LWCF_d6',
 'FSNTC_d7','FLNTC_d7','FSNSC_d7','FLNSC_d7','FSNT_d7','FLNT_d7','FSNS_d7','FLNS_d7','QRS_d7','QRL_d7','SWCF_d7','LWCF_d7',
 'FSNTC_d8','FLNTC_d8','FSNSC_d8','FLNSC_d8','FSNT_d8','FLNT_d8','FSNS_d8','FLNS_d8','QRS_d8','QRL_d8','SWCF_d8','LWCF_d8',
 'FSNTC_d9','FLNTC_d9','FSNSC_d9','FLNSC_d9','FSNT_d9','FLNT_d9','FSNS_d9','FLNS_d9','QRS_d9','QRL_d9','SWCF_d9','LWCF_d9',
 'FSNTC_d10','FLNTC_d10','FSNSC_d10','FLNSC_d10','FSNT_d10','FLNT_d10','FSNS_d10','FLNS_d10','QRS_d10','QRL_d10','SWCF_d10','LWCF_d10'
/
EOF

singhbalwinder commented 5 years ago

@polunma : Do you think omitting fincl1 output will still result in a crash?

For those who are not familiar with the scripts that we use to run the model: set up the case using the create_newcase command @polunma mentioned, then add the text between the "&camexp" line and the closing "/" (just before "EOF") to the user_nl_cam file in the case directory. That is, add the following to user_nl_cam:

rad_diag_1 = 'A:Q:H2O', 'N:O2:O2', 'N:CO2:CO2',
'A:O3:O3', 'N:N2O:N2O', 'N:CH4:CH4',
'N:CFC11:CFC11', 'N:CFC12:CFC12', 'M:mam4_mode1:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode1_rrtmg_c130628.nc',
'M:mam4_mode2:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode2_rrtmg_c130628.nc', 'M:mam4_mode3:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode3_rrtmg_c130628.nc', 'M:mam4_mode4:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode4_rrtmg_c130628.nc'
rad_diag_2 = 'A:Q:H2O', 'N:O2:O2', 'N:CO2:CO2',
'A:O3:O3', 'N:N2O:N2O', 'N:CH4:CH4',
'N:CFC11:CFC11', 'N:CFC12:CFC12', 'M:mam4_mode1:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode1_rrtmg_c130628.nc',
'M:mam4_mode2:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode2_rrtmg_c130628.nc', 'M:mam4_mode3:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode3_rrtmg_c130628.nc', 'M:mam4_mode4:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode4_rrtmg_c130628.nc'
rad_diag_3 = 'A:Q:H2O', 'N:O2:O2', 'N:CO2:CO2',
'A:O3:O3', 'N:N2O:N2O', 'N:CH4:CH4',
'N:CFC11:CFC11', 'N:CFC12:CFC12', 'M:mam4_mode1:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode1_rrtmg_c130628.nc',
'M:mam4_mode2:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode2_rrtmg_c130628.nc', 'M:mam4_mode3:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode3_rrtmg_c130628.nc', 'M:mam4_mode4:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode4_rrtmg_c130628.nc'
rad_diag_4 = 'A:Q:H2O', 'N:O2:O2', 'N:CO2:CO2',
'A:O3:O3', 'N:N2O:N2O', 'N:CH4:CH4',
'N:CFC11:CFC11', 'N:CFC12:CFC12', 'M:mam4_mode1:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode1_rrtmg_c130628.nc',
'M:mam4_mode2:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode2_rrtmg_c130628.nc', 'M:mam4_mode3:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode3_rrtmg_c130628.nc', 'M:mam4_mode4:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode4_rrtmg_c130628.nc'
rad_diag_5 = 'A:Q:H2O', 'N:O2:O2', 'N:CO2:CO2',
'A:O3:O3', 'N:N2O:N2O', 'N:CH4:CH4',
'N:CFC11:CFC11', 'N:CFC12:CFC12', 'M:mam4_mode1:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode1_rrtmg_c130628.nc',
'M:mam4_mode2:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode2_rrtmg_c130628.nc', 'M:mam4_mode3:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode3_rrtmg_c130628.nc', 'M:mam4_mode4:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode4_rrtmg_c130628.nc'
rad_diag_6 = 'A:Q:H2O', 'N:O2:O2', 'N:CO2:CO2',
'A:O3:O3', 'N:N2O:N2O', 'N:CH4:CH4',
'N:CFC11:CFC11', 'N:CFC12:CFC12', 'M:mam4_mode1:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode1_rrtmg_c130628.nc',
'M:mam4_mode2:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode2_rrtmg_c130628.nc', 'M:mam4_mode3:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode3_rrtmg_c130628.nc', 'M:mam4_mode4:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode4_rrtmg_c130628.nc'
rad_diag_7 = 'A:Q:H2O', 'N:O2:O2', 'N:CO2:CO2',
'A:O3:O3', 'N:N2O:N2O', 'N:CH4:CH4',
'N:CFC11:CFC11', 'N:CFC12:CFC12', 'M:mam4_mode1:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode1_rrtmg_c130628.nc',
'M:mam4_mode2:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode2_rrtmg_c130628.nc', 'M:mam4_mode3:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode3_rrtmg_c130628.nc', 'M:mam4_mode4:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode4_rrtmg_c130628.nc'
rad_diag_8 = 'A:Q:H2O', 'N:O2:O2', 'N:CO2:CO2',
'A:O3:O3', 'N:N2O:N2O', 'N:CH4:CH4',
'N:CFC11:CFC11', 'N:CFC12:CFC12', 'M:mam4_mode1:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode1_rrtmg_c130628.nc',
'M:mam4_mode2:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode2_rrtmg_c130628.nc', 'M:mam4_mode3:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode3_rrtmg_c130628.nc', 'M:mam4_mode4:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode4_rrtmg_c130628.nc'
rad_diag_9 = 'A:Q:H2O', 'N:O2:O2', 'N:CO2:CO2',
'A:O3:O3', 'N:N2O:N2O', 'N:CH4:CH4',
'N:CFC11:CFC11', 'N:CFC12:CFC12', 'M:mam4_mode1:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode1_rrtmg_c130628.nc',
'M:mam4_mode2:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode2_rrtmg_c130628.nc', 'M:mam4_mode3:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode3_rrtmg_c130628.nc', 'M:mam4_mode4:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode4_rrtmg_c130628.nc'
rad_diag_10 = 'A:Q:H2O', 'N:O2:O2', 'N:CO2:CO2',
'A:O3:O3', 'N:N2O:N2O', 'N:CH4:CH4',
'N:CFC11:CFC11', 'N:CFC12:CFC12'
fincl1 ='FSNTC_d1','FLNTC_d1','FSNSC_d1','FLNSC_d1','FSNT_d1','FLNT_d1','FSNS_d1','FLNS_d1','QRS_d1','QRL_d1','SWCF_d1','LWCF_d1','FSNTC_d2','FLNTC_d2','FSNSC_d2','FLNSC_d2','FSNT_d2','FLNT_d2','FSNS_d2','FLNS_d2','QRS_d2','QRL_d2','SWCF_d2','LWCF_d2','FSNTC_d3','FLNTC_d3','FSNSC_d3','FLNSC_d3','FSNT_d3','FLNT_d3','FSNS_d3','FLNS_d3','QRS_d3','QRL_d3','SWCF_d3','LWCF_d3','FSNTC_d4','FLNTC_d4','FSNSC_d4','FLNSC_d4','FSNT_d4','FLNT_d4','FSNS_d4','FLNS_d4','QRS_d4','QRL_d4','SWCF_d4','LWCF_d4','FSNTC_d5','FLNTC_d5','FSNSC_d5','FLNSC_d5','FSNT_d5','FLNT_d5','FSNS_d5','FLNS_d5','QRS_d5','QRL_d5','SWCF_d5','LWCF_d5','FSNTC_d6','FLNTC_d6','FSNSC_d6','FLNSC_d6','FSNT_d6','FLNT_d6','FSNS_d6','FLNS_d6','QRS_d6','QRL_d6','SWCF_d6','LWCF_d6','FSNTC_d7','FLNTC_d7','FSNSC_d7','FLNSC_d7','FSNT_d7','FLNT_d7','FSNS_d7','FLNS_d7','QRS_d7','QRL_d7','SWCF_d7','LWCF_d7','FSNTC_d8','FLNTC_d8','FSNSC_d8','FLNSC_d8','FSNT_d8','FLNT_d8','FSNS_d8','FLNS_d8','QRS_d8','QRL_d8','SWCF_d8','LWCF_d8','FSNTC_d9','FLNTC_d9','FSNSC_d9','FLNSC_d9','FSNT_d9','FLNT_d9','FSNS_d9','FLNS_d9','QRS_d9','QRL_d9','SWCF_d9','LWCF_d9','FSNTC_d10','FLNTC_d10','FSNSC_d10','FLNSC_d10','FSNT_d10','FLNT_d10','FSNS_d10','FLNS_d10','QRS_d10','QRL_d10','SWCF_d10','LWCF_d10'

Please note that the path to the input data directory is hardwired here (/project/projectdirs/acme/inputdata/), so you will have to change it if you run on any machine other than the NERSC machines.
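
Because the input data path is hardwired and the entries are highly repetitive, one option is to generate the namelist text rather than edit it by hand. The following is a minimal sketch, not an E3SM tool: the function names and the `inputdata_root`, `n_diag`, and `include_fincl1` parameters are hypothetical, and the gas list, field names, and file names are copied from the entries above. Reducing `n_diag` or dropping `fincl1` gives the smaller variants discussed elsewhere in this thread.

```python
# Sketch: build the rad_diag_N / fincl1 entries shown above as text.
# All names here are illustrative assumptions, not part of E3SM or CIME.

GASES = ["A:Q:H2O", "N:O2:O2", "N:CO2:CO2", "A:O3:O3",
         "N:N2O:N2O", "N:CH4:CH4", "N:CFC11:CFC11", "N:CFC12:CFC12"]
FIELDS = ["FSNTC", "FLNTC", "FSNSC", "FLNSC", "FSNT", "FLNT",
          "FSNS", "FLNS", "QRS", "QRL", "SWCF", "LWCF"]


def rad_diag_entry(index, inputdata_root, with_aerosol_modes=True):
    """Return one 'rad_diag_<index> = ...' line."""
    specs = list(GASES)
    if with_aerosol_modes:
        for mode in range(1, 5):
            specs.append(
                f"M:mam4_mode{mode}:{inputdata_root}/atm/cam/physprops/"
                f"mam4_mode{mode}_rrtmg_c130628.nc"
            )
    return f"rad_diag_{index} = " + ", ".join(f"'{s}'" for s in specs)


def fincl1_entry(n_diag):
    """Return the fincl1 line listing every field for diagnostics 1..n_diag."""
    names = [f"'{field}_d{i}'"
             for i in range(1, n_diag + 1) for field in FIELDS]
    return "fincl1 = " + ",".join(names)


def build_namelist(n_diag=10,
                   inputdata_root="/project/projectdirs/acme/inputdata",
                   include_fincl1=True, gases_only=(10,)):
    """Join the entries; diagnostics listed in gases_only omit the MAM4 modes,
    matching rad_diag_10 in the thread."""
    lines = [rad_diag_entry(i, inputdata_root, i not in gases_only)
             for i in range(1, n_diag + 1)]
    if include_fincl1:
        lines.append(fincl1_entry(n_diag))
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_namelist(n_diag=2, inputdata_root="/path/to/inputdata"))
```

The output can be pasted into user_nl_cam; on a non-NERSC machine, pass that machine's input data directory as `inputdata_root`.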

yfenganl commented 5 years ago

I ran into a similar issue when I was doing ne120 runs (FC5AV1C-H01A) on Anvil with only one radiation diagnostics call. I checked the memory usage in the log file as Az suggested, and found that memory use was indeed accumulating as the run continued, until it exceeded the maximum memory per node. We didn't get a chance to track down the problem.

But restarting seemed to solve the problem. I was able to complete one-year runs by restarting before the running time got close to the crash point.
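
To pick a restart interval that stays ahead of the crash point, the growth rate can be estimated from a few memory samples. This is a minimal sketch under stated assumptions: the sample values and the node limit below are made up for illustration, growth is assumed to be roughly linear, and the function names are hypothetical.

```python
# Sketch: fit a straight line to (day, memory) samples and extrapolate
# to the per-node memory limit. The numbers used below are invented.

def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def days_until_limit(days, mem_gb, limit_gb):
    """Simulated day at which the fitted line reaches limit_gb,
    or None if memory is not growing."""
    slope, intercept = fit_line(days, mem_gb)
    if slope <= 0:
        return None
    return (limit_gb - intercept) / slope


if __name__ == "__main__":
    sample_days = [0, 1, 2, 3, 4]                    # hypothetical
    sample_mem_gb = [40.0, 40.5, 41.0, 41.5, 42.0]   # hypothetical
    print(days_until_limit(sample_days, sample_mem_gb, limit_gb=60.0))
```

A restart interval comfortably shorter than the estimate leaves headroom for growth that is not perfectly linear.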

I was using a version checked out in April 2018. I haven't tested it with the new master.

-Yan


    *   *   *              Yan   Feng, Ph.D.
  *   *   *   *           Atmospheric and climate scientist
*       *        *         Argonne National Laboratory


ndkeen commented 5 years ago

I ran an ne30 F case using only 3 nodes, 67 MPI's per node, 4 threads each on cori-knl (using a repo from Aug 21st that has some additional profiling mods). I used the above user_nl_cam (thanks for the clarification, Balwinder). I asked to run for 2 days. Sure enough, the memory use is increasing steadily over time. (Note: I updated the plot below after re-running for 2 complete days.)

Does it make sense to try reducing the number of entries in user_nl_cam to see if there is a specific one that causes issue?

[Plot: prss pernode 15663587 00000]

By contrast, here is the same plot for a run made without those radiation entries in user_nl_cam (running for 2 days):

[Plot: prss pernode 15479787 00000]

ndkeen commented 5 years ago

Using some even more experimental tools I've been working on, I see that the memory increases more in CAM_run2 than in CAM_run1. Every other call to CAM_run2 increases the peak memory by about 10MB, while every call to CAM_run1 increases it by about 1MB.

I can show a plot, but it's pretty messy.

This is the peak RSS (i.e. it will only increase) over time for each rank. The measurements are taken at certain places in the code. The observations above come from looking at the raw data, but it's still good to see the plot. It's nice that rank0 just uses more memory overall, so the blue dots (rank0) stand out. This is different from the above plots -- the memory data is not coming from top, but from a call within the code.

[Plot: rpeak_per_rank_via_timers dpi1800 j15663587]

Image is largish as I created with higher DPI to allow for better zooming in. Might need to download file first for better zooming.

singhbalwinder commented 5 years ago

Thanks @ndkeen! Those are really clear visualizations. As far as I remember, CAM_run1 calls radiation (tphysbc), which calls these radiation diagnostics. So this confirms what we already know: the diagnostics are causing this memory leak.

Thanks @yfenganl for reporting on ne120 grid. It might be faster to reproduce this using ne120 as it may already be using a lot of memory.

ndkeen commented 5 years ago

FWIW, if I remove the fincl1 line in the user_nl_cam, I still see the same memory behavior.

I also ran without fincl1 line and with only the first rad_diag_1 line in user_nl_cam. The memory use is substantially less, but it's not clear if there is still a "growth" or not. I can look more closely if this is a worthwhile avenue.

Also, I ran with DEBUG=TRUE. The job ran out of time, but after many steps there were no errors.

polunma commented 5 years ago

Thank you all very much for taking a look! Is there any hope of identifying/fixing the bug soon? BTW @ndkeen, the fincl1 line is essential because otherwise the results from the radiation diagnostics are not written out.

One of my month-to-month runs failed (error message below). However, the weirdest thing is that I simply resubmitted the same job and it completed successfully...

0: MCT::mRouter::initp: GSMap indices not increasing...Will correct
0: MCT::mRouter::initp: RGSMap indices not increasing...Will correct
0: MCT::mRouter::initp: RGSMap indices not increasing...Will correct
0: MCT::mRouter::initp: GSMap indices not increasing...Will correct
0: MCT::mRouter::initp: GSMap indices not increasing...Will correct
0: MCT::mRouter::initp: RGSMap indices not increasing...Will correct
0: MCT::mRouter::initp: RGSMap indices not increasing...Will correct
0: MCT::mRouter::initp: GSMap indices not increasing...Will correct
0: newchild: child "CPL:RUN_LOOP" can't be a parent of itself
0: newchild: child "a:radiation" can't be a parent of itself
0: newchild: child "a:radiation" can't be a parent of itself
0: newchild: child "a:radiation" can't be a parent of itself
0: newchild: child "CPL:RUN_LOOP" can't be a parent of itself
0: newchild: child "a:radiation" can't be a parent of itself
0: newchild: child "a:radiation" can't be a parent of itself
0: newchild: child "a:radiation" can't be a parent of itself
0: newchild: child "CPL:RUN_LOOP" can't be a parent of itself
0: newchild: child "a:radiation" can't be a parent of itself
0: newchild: child "a:radiation" can't be a parent of itself
0: newchild: child "a:radiation" can't be a parent of itself
0: newchild: child "CPL:RUN_LOOP" can't be a parent of itself
0: newchild: child "a:radiation" can't be a parent of itself
0: newchild: child "a:radiation" can't be a parent of itself
0: newchild: child "a:radiation" can't be a parent of itself
525: forrtl: severe (154): array index out of bounds
525: Image PC Routine Line Source
525: e3sm.exe 0000000004B1521E Unknown Unknown Unknown
525: e3sm.exe 00000000043B7F40 Unknown Unknown Unknown
525: e3sm.exe 0000000001874E22 aero_model_mp_mod 3009 aero_model.F90
525: e3sm.exe 00000000018748F8 aero_model_mp_aer 1642 aero_model.F90
525: e3sm.exe 00000000005FC982 physpkg_mp_tphysb 2621 physpkg.F90
525: e3sm.exe 00000000005F5709 physpkg_mp_phys_r 1029 physpkg.F90
525: e3sm.exe 0000000004013103 Unknown Unknown Unknown
525: e3sm.exe 0000000003FCA9E0 Unknown Unknown Unknown
525: e3sm.exe 0000000003FCBFD4 Unknown Unknown Unknown
525: e3sm.exe 0000000003F9D8F4 Unknown Unknown Unknown
525: e3sm.exe 00000000005F524B physpkg_mp_phys_r 1018 physpkg.F90
525: e3sm.exe 00000000004EEEF7 cam_comp_mp_cam_r 250 cam_comp.F90
525: e3sm.exe 00000000004DF4E1 atm_comp_mct_mp_a 522 atm_comp_mct.F90
525: e3sm.exe 00000000004285F4 component_modmp 728 component_mod.F90
525: e3sm.exe 000000000040E976 cime_comp_modmp 3370 cime_comp_mod.F90
525: e3sm.exe 00000000004282F3 MAIN__ 103 cime_driver.F90
525: e3sm.exe 000000000040A80E Unknown Unknown Unknown
525: e3sm.exe 0000000004C382D9 Unknown Unknown Unknown
525: e3sm.exe 000000000040A6F9 Unknown Unknown Unknown
srun: error: nid02921: task 525: Exited with exit code 154
srun: Terminating job step 15659352.0
495: slurmstepd: error: Failed to send MESSAGE_TASK_EXIT: Resource temporarily unavailable
0: slurmstepd: error: STEP 15659352.0 ON nid02906 CANCELLED AT 2018-10-11T05:01:09
505: forrtl: error (78): process killed (SIGTERM)

ndkeen commented 5 years ago

I spent more time trying to debug this. I don't have a fix, but I do have some more information. I tried several things. I did try valgrind but it has not yet been useful (valgrind details below).

Using my own attempts to measure memory by placing calls within the code, I can narrow down where the memory (RSS) is growing/shrinking. I write the current RSS as well as the Peak RSS. Originally, I was tracking the Peak, but this did not lead anywhere as the leak can be elsewhere -- increasing the base memory use, while some other code might allocate/deallocate and be the actual hiwater RSS.
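
The difference between current and peak RSS can be illustrated with a small standalone script. This is only a sketch of the idea in Python, not the instrumentation used in the model: it assumes a Unix system (the `resource` module), reads current RSS from /proc/self/status (Linux only, otherwise it returns None), and the helper names are hypothetical.

```python
# Sketch: sample current RSS and peak (high-water) RSS of this process.
# Peak never decreases; current can drop again after memory is released.
import resource
import sys


def peak_rss_kb():
    """Peak RSS in kB. ru_maxrss is kB on Linux and bytes on macOS."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return peak // 1024 if sys.platform == "darwin" else peak


def current_rss_kb():
    """Current RSS in kB from /proc/self/status, or None if unavailable."""
    try:
        with open("/proc/self/status") as status:
            for line in status:
                if line.startswith("VmRSS:"):
                    return int(line.split()[1])
    except OSError:
        return None
    return None


def sample(label):
    """Take one labelled measurement of both quantities."""
    return {"label": label,
            "current_kb": current_rss_kb(),
            "peak_kb": peak_rss_kb()}


if __name__ == "__main__":
    samples = [sample("start")]
    block = bytearray(b"x" * (64 * 1024 * 1024))  # allocate and touch ~64 MB
    samples.append(sample("allocated"))
    del block
    samples.append(sample("released"))
    for s in samples:
        print(s)
```

A steadily rising current RSS at the same measurement point is the signal to look for; a transient allocation elsewhere can raise the peak without being the leak.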

So looking at RSS and tracking when it increases but does not decrease, I see: every call to CAM_run2 shows an increase in memory with no corresponding decrease. Within there, it looks like the increase happens in the code within the timer "microp_tend". However, there's NOT an increase on every pass through microp_tend. There is certainly a pattern, but it's not obvious -- I can write a script and make a plot to show the pattern if it helps.
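
One way to summarize such measurements is to total the net RSS change per timer region, so that regions which grow and shrink cancel out while regions that only grow stand out. The sketch below assumes a hypothetical record format of (region name, RSS before, RSS after) in kB; it is not the format of the actual profiling output, and the sample values are invented.

```python
# Sketch: sum net RSS change per timer region from (region, before, after)
# records. The record layout and the sample values are assumptions.
from collections import defaultdict


def net_growth_by_region(records):
    """Return {region: total (after - before) in kB} over all records."""
    totals = defaultdict(int)
    for region, rss_before_kb, rss_after_kb in records:
        totals[region] += rss_after_kb - rss_before_kb
    return dict(totals)


def growing_regions(records, threshold_kb=0):
    """Regions whose net growth exceeds threshold_kb, largest first."""
    growth = net_growth_by_region(records)
    return sorted((r for r, kb in growth.items() if kb > threshold_kb),
                  key=lambda r: -growth[r])


if __name__ == "__main__":
    records = [
        ("CAM_run1", 1000, 1200),     # temporary allocation...
        ("CAM_run1", 1200, 1000),     # ...released on a later call
        ("microp_tend", 1000, 1010),  # grows
        ("microp_tend", 1010, 1010),  # no change on this pass
        ("microp_tend", 1010, 1020),  # grows again, never shrinks
    ]
    print(net_growth_by_region(records))
    print(growing_regions(records))
```

Printing the per-call deltas in order, rather than only the totals, would also expose the kind of every-few-calls pattern described above.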

Within the microp_tend code, I think the increase happens in subroutine micro_mg_cam_tend(). Again, not every call, but in a pattern. I have a little more detail inside of this routine, but it is quite large/complicated (well beyond the size of what good SW engineering would suggest, but ...) and I was hoping someone more familiar with it could weigh in.

Running the same case without the radiation diagnostics shows no increase in memory as described above (in fact, the memory is well-behaved across the day).

The reason I wanted to try without the fincl1 line is not because I was suggesting that this could be a solution, but rather to debug. If we remove that line and memory does not increase, it could help narrow down the issue. When I tried it (earlier on), it seemed like the memory behavior is the same. The same argument with trying fewer radiation diagnostics -- is it possible only one or a few of those diagnostics cause an issue? I haven't tried this yet.

I also tried the same test using use_hetfrz_classnuc = .false., as this is something we've been using with the coupled runs and I see that the code does something different when using this flag. I originally reported that this crashed, but I was mistaken -- something else I was doing had crashed. Running it again, I see no issues. I still need to verify that the memory behavior is similar, but there is no reason to suspect it will be any different.


Below are notes regarding using valgrind: I tried to use valgrind with this ne30 problem. I developed a valgrind suppressions file to limit the output. I noticed something odd, but it was likely an issue that only happens when using valgrind -- when I re-compiled with -heap-arrays (for the Intel compiler), that issue went away and the run continued.

Then the run crashes/hangs with errors like this:

118:  ccm kohlerc - no real(r8) solution found (quartic)
118:  roots = (NaN,NaN) (NaN,NaN) (NaN,NaN) (NaN,NaN)
118:  p0-p3 = -8.481208826751322E-009 -9.331906172134355E-005  0.000000000000000E+000
118:   9.360098364365669E-005
118:  rh=  2.497662239924677E-006
118:  setting radius to dry radius=  4.491510796752449E-002

Without valgrind, the run is fine. So I'm not sure what to make of this. I just haven't had time to figure this out or process the valgrind output (but nothing obvious there). case: /global/cscratch1/sd/ndk/acme_scratch/cori-haswell/mprofaug21/f.ne30_ne30.mprofaug21.intel.n004p0120t30x1.6s.rssp.heaparraysg.vg

I then tried with the GNU compiler, and it completes 2 steps with output before timing out. However, the output does not contain the source code info (I did re-compile with -g hoping it would include it):

  8: ==53511== Invalid read of size 16
  8: ==53511==    at 0x170E9D3: ??? (in /global/cscratch1/sd/ndk/acme_scratch/cori-haswell/mprofaug21/f.ne30_ne30.mprofaug21.gnu.n004p0128t32x1.6s.rssp.rad10.heaparraysg.vg/bld/e3sm.exe)
  8: ==53511==    by 0x16D6342: ??? (in /global/cscratch1/sd/ndk/acme_scratch/cori-haswell/mprofaug21/f.ne30_ne30.mprofaug21.gnu.n004p0128t32x1.6s.rssp.rad10.heaparraysg.vg/bld/e3sm.exe)
  8: ==53511==    by 0x16D70F7: ??? (in /global/cscratch1/sd/ndk/acme_scratch/cori-haswell/mprofaug21/f.ne30_ne30.mprofaug21.gnu.n004p0128t32x1.6s.rssp.rad10.heaparraysg.vg/bld/e3sm.exe)
  8: ==53511==    by 0x16CF367: ??? (in /global/cscratch1/sd/ndk/acme_scratch/cori-haswell/mprofaug21/f.ne30_ne30.mprofaug21.gnu.n004p0128t32x1.6s.rssp.rad10.heaparraysg.vg/bld/e3sm.exe)
  8: ==53511==    by 0x169768D: ??? (in /global/cscratch1/sd/ndk/acme_scratch/cori-haswell/mprofaug21/f.ne30_ne30.mprofaug21.gnu.n004p0128t32x1.6s.rssp.rad10.heaparraysg.vg/bld/e3sm.exe)

/global/cscratch1/sd/ndk/acme_scratch/cori-haswell/mprofaug21/f.ne30_ne30.mprofaug21.gnu.n004p0128t32x1.6s.rssp.rad10.heaparraysg.vg

I tried these tests on anvil as well as cori-haswell with both intel and GNU. The versions of compilers and valgrind on cori are more recent.

ndkeen commented 3 years ago

Noting that https://github.com/E3SM-Project/E3SM/pull/3866 might address this memory issue. I will test as soon as @singhbalwinder says it's ready. @ndkeen

ndkeen commented 3 years ago

When I try @singhbalwinder's branch in PR 3468 and use the same user_nl_cam as above (now renamed to user_nl_eam), the memory appears almost constant after 3 days, whereas before, even with a master as of Sept 24th, the memory was increasing by at least 160MB every day. So I think that PR will fix this issue.

singhbalwinder commented 3 years ago

Thanks @ndkeen for testing it so quickly. I will make note in my PR that it fixes this memory issue.