geoschem / geos-chem

GEOS-Chem "Science Codebase" repository. Contains GEOS-Chem science routines, run directory generation scripts, and interface code. This repository is used as a submodule within the GCClassic and GCHP wrappers, as well as in other modeling contexts (external ESMs).
http://geos-chem.org

[QUESTION] Complete steps to run a nested version? Is it possible to run a nested simulation with netCDF output? #31

Closed. FeiYao-Edinburgh closed this issue 5 years ago.

FeiYao-Edinburgh commented 5 years ago

I am trying to run a nested GEOS-Chem over China using GEOS-Chem 12.2.0 on AWS. I guess the main links I should follow are http://wiki.seas.harvard.edu/geos-chem/index.php/Setting_up_GEOS-Chem_nested_grid_simulations#How_to_run_the_0.25x0.3125_nested-grid_for_GEOS-FP and http://wiki.seas.harvard.edu/geos-chem/index.php/Creating_GEOS-Chem_run_directories#Tips_and_tricks_for_creating_run_directories. I have run a global 2x25 simulation to save BC files. I have also regridded the restart files using xESMF. I am sorry, but I could not work out the next steps to follow, because the links above point to each other when describing how to run nested GEOS-Chem. It is somewhat confusing for me. Therefore, I tried the following:

I have also checked the output files and found that they are at 2x25 resolution, so there must be some important steps that I missed, such as setting GRID=0.25x0.3125 or other configurations in input.geos?

msulprizio commented 5 years ago

When you regenerate a run directory from the UT, what option are you selecting in your CopyRunDirs.input script? I believe you want to select:

## ======= Nested model runs ==================================================
# merra2   05x0625     as     tropchem         2016070100   2016080100     -
# merra2   05x0625     na     tropchem         2016070100   2016080100     -
 geosfp   025x03125   ch     tropchem         2016070100   2016080100     -
# geosfp   025x03125   na     tropchem         2016070100   2016080100     -
## ======= HEMCO standalone ===================================================

Then change to your newly created run directory (geosfp_025x03125_tropchem_ch), do make realclean, and recompile. When running nested-grid simulations, BPCH_DIAG must always be set to y to read the BC files (which are still in BPCH format). This should be done automatically if the Makefile recognizes that you are running a nested-grid simulation. From the run directory Makefile:

ifndef NEST
 NEST        :=n
else
 # NOTE: If Nest is passed, then also set BPCH_DIAG=y.  We have to activate               
 # the bpch I/O code to read the nested boundary conditions.  We can remove               
 # this ELSE block once the bpch diagnostics have been removed from GC. (bmy, 1/22/18)
 BPCH_DIAG   :=y
endif

Once compilation is complete, check your lastbuild.mp file to make sure you have the following settings: GRID: 025x03125, NEST: ch.

FeiYao-Edinburgh commented 5 years ago

Sorry for the incomplete information. Frankly, I selected:

# merra2   4x5         -      complexSOA_SVPOA 2016070100   2016080100     -
  geosfp   2x25        -      complexSOA_SVPOA 2016070100   2016090100     -
# merra2   2x25        -      complexSOA_SVPOA 2016070100   2016080100     -

To run a nested version of this kind of simulation, I have run a global one to prepare BC files. When running the global simulation, I compiled the model with: make -j4 mpbuild NC_DIAG=y BPCH_DIAG=n TIMERS=1. I guess the values of NC_DIAG and BPCH_DIAG must be different for the nested run? Therefore, when running the nested version, should I compile the model with: make -j4 mpbuild NC_DIAG=n BPCH_DIAG=y TIMERS=1 NEST=CH GRID=025x03125? If so, the output files of the nested run could not be in netCDF format?

msulprizio commented 5 years ago

The selection you used should be fine for your global simulation. GEOS-Chem allows both NC_DIAG=y and BPCH_DIAG=y -- this is somewhat of a hidden feature. It lets you save diagnostic output to netCDF while also saving out, in BPCH format, the diagnostics that have not been converted to netCDF yet (including BC files). To save disk space, you may want to suppress any BPCH diagnostics that you do not need by setting their options to 0 in the diagnostic menu of input.geos. Turning netCDF diagnostics on/off is done in HISTORY.rc.

To summarize:

  1. For your global simulation, your compilation settings are fine. (We turn on CPP switch -DBPCH_TPBC in Makefile_header.mk for all simulations to allow for users to save out TPCORE BC files to BPCH format.)
  2. For your nested-grid simulation, please create a 0.25x0.3125 run directory via the Unit Tester instead of modifying your 2x2.5 run directory. There are settings in input.geos (really the comments in the first line) that tell the Makefile what GRID, MET, and simulation you are using.
  3. In your new nested grid run directory, do make realclean and make -j4 NC_DIAG=y BPCH_DIAG=y TIMERS=1 NEST=CH GRID=025x03125. You may also simply do make -j4 NC_DIAG=y TIMERS=1 -- the rest of the compile options should be properly selected for you if you created your run directory from the Unit Tester.

FeiYao-Edinburgh commented 5 years ago

please create a 0.25x0.3125 run directory via the Unit Tester instead of modifying your 2x2.5 run directory

Re: I would like to do it that way, following your wiki. However, I do not know how to create a 0.25x0.3125 run directory for complexSOA_SVPOA, because there is no such option in CopyRunDirs.input (I also did not find one in UnitTest.input). Can I write one myself, like the following?

# merra2   05x0625     na     tropchem         2016070100   2016080100     -
  geosfp   025x03125   ch     complexSOA_SVPOA 2016070100   2016080100     -
# geosfp   025x03125   na     tropchem         2016070100   2016080100     -

msulprizio commented 5 years ago

You are right. We do not have a nested grid run directory set up for the complexSOA_SVPOA simulation -- sorry for not catching that earlier. In that case, you can proceed with creating your nested grid run directory from your 2x2.5 run directory. Just make sure you also change the string in the first line of input.geos from geosfp_2x25_complexSOA_SVPOA to geosfp_025x03125_complexSOA_SVPOA_ch. Then try recompiling. The Makefile should now automatically select the correct compile options. This can be confirmed in the lastbuild.mp produced when your compilation is complete.

FeiYao-Edinburgh commented 5 years ago

Then try recompiling

Just some things I wanted to confirm.

In the global complexSOA_SVPOA simulation, I think the most important settings are:

Save TPCORE BC's        : T
Input BCs at 2x2.5?     : T
Over China?             : T
TPCORE CH BC directory  : BC_2x25_CH/

I then compiled the model with make -j4 mpbuild NC_DIAG=y BPCH_DIAG=n TIMERS=1. According to your comment, BPCH_DIAG=n will not affect BC writing because of Makefile_header.mk? So the command is equivalent to make -j4 mpbuild NC_DIAG=y TIMERS=1?

Next, I would like to move on to the nested complexSOA_SVPOA simulation over China. This is where I am really confused. According to your comment, the procedure is something like the following?

  1. Copy geosfp_2x25_complexSOA_SVPOA and rename the copied one to geosfp_025x03125_complexSOA_SVPOA_CH
  2. In geosfp_025x03125_complexSOA_SVPOA_CH, change the first line in input.geos to geosfp_025x03125_complexSOA_SVPOA_CH
  3. Turn off BC in input.geos?
    Save TPCORE BC's        : F
    Input BCs at 2x2.5?     : F
    Over China?             : F
    TPCORE CH BC directory  : BC_2x25_CH/
  4. No need to change Makefile? Re-compile the model with: make realclean; make -j4 mpbuild NC_DIAG=y TIMERS=1 NEST=CH GRID=025x03125 MET=geosfp?

Finally, I ran ./geos.mp and got the error that I posted at https://github.com/geoschem/geos-chem-cloud/issues/20. I guess I need to do some further configuration in input.geos.

Anyway, I would be grateful if you could help me confirm that the steps above are correct.

msulprizio commented 5 years ago

In the global complexSOA_SVPOA simulation, I think the most important settings are:

Save TPCORE BC's        : T
Input BCs at 2x2.5?     : T
Over China?             : T
TPCORE CH BC directory  : BC_2x25_CH/

I then compiled the model with make -j4 mpbuild NC_DIAG=y BPCH_DIAG=n TIMERS=1. According to your comment, BPCH_DIAG=n will not affect BC writing because of Makefile_header.mk? So the command is equivalent to make -j4 mpbuild NC_DIAG=y TIMERS=1?

Correct. You can confirm this by making sure BC files are saved out to the BC_2x25_CH directory after you start running your global simulation. The log file from your global simulation should also indicate that BC files have been written out at the specified interval. (NOTE: You may need to create the BC_2x25_CH subdirectory in your run directory before submitting your global simulation if it doesn't already exist.)

Next, I would like to move on to the nested complexSOA_SVPOA simulation over China. This is where I am really confused. According to your comment, the procedure is something like the following?

  1. Copy geosfp_2x25_complexSOA_SVPOA and rename the copied one to geosfp_025x03125_complexSOA_SVPOA_CH
  2. In geosfp_025x03125_complexSOA_SVPOA_CH, change the first line in input.geos to geosfp_025x03125_complexSOA_SVPOA_CH
  3. Turn off BC in input.geos?

Save TPCORE BC's        : F
Input BCs at 2x2.5?     : F
Over China?             : F
TPCORE CH BC directory  : BC_2x25_CH/

No need to change Makefile? Re-compile the model with: make realclean; make -j4 mpbuild NC_DIAG=y TIMERS=1 NEST=CH GRID=025x03125 MET=geosfp?

Steps 1 and 2 look good. For step 3, the following changes need to be made to the input.geos file:

Global offsets I0, J0   : 800 420

Tran/conv timestep [sec]: 300
Chem/emis timestep [sec]: 600

%%% NESTED GRID MENU %%%:
Save TPCORE BC's        : F
Input BCs at 2x2.5?     : T
Over China?             : T
TPCORE CH BC directory  : BC_2x25_CH/

NOTES:

Finally, I ran ./geos.mp and got the error that I posted in geoschem/geos-chem-cloud#20. I guess I need to do some further configuration in input.geos.

I believe the above steps should resolve your error. If you are still having issues, could you please attach the input.geos, lastbuild.mp, and log files from your global and nested-grid simulations? See this wiki post for instructions.

FeiYao-Edinburgh commented 5 years ago

Finally, I ran ./geos.mp and got the error that I posted in geoschem/geos-chem-cloud#20.

I have solved this issue thanks to Jiawei's hint.

Thanks for your guidance regarding step 3, but I am sorry that I still cannot get the model to run. I have solved some of the problems reported in HEMCO.log, including specifying METDIR: /home/ubuntu/ExtData/GEOS_0.25x0.3125_CH/GEOS_FP in HEMCO_Config.rc; otherwise the model will not use the 025x03125 met data. Besides, the 025x03125 met data for China under ExtData have different file names than the global ones, so $NEST should be added to the following entries:

103 LIGHTNOX_OTDLIS $ROOT/LIGHTNOX/v2017-09/OTD-LIS-Local-Redist.CTH.v5.$met.$RES.$NEST.v20170928.nc OTD $YYYY/1-12/1/0 C xy unitless NO - 1 1
# --- CN fields ---
* FRLAKE    $METDIR/$CNYR/01/$MET.$CNYR0101.CN.$RES.$NEST.$NC        FRLAKE   */1/1/0               C xy  1  * -  1 1

# --- A1 fields ---
* ALBEDO    $METDIR/$YYYY/$MM/$MET.$YYYY$MM$DD.A1.$RES.$NEST.$NC     ALBEDO   1980-2018/1-12/1-31/0-23/+30minute C xy  1  * -  1 1

# --- A3cld fields ---
* CLOUD     $METDIR/$YYYY/$MM/$MET.$YYYY$MM$DD.A3cld.$RES.$NEST.$NC  CLOUD    1980-2018/1-12/1-31/0-23/+90minute C xyz 1  * -  1 1

# --- A3dyn fields ---
* DTRAIN    $METDIR/$YYYY/$MM/$MET.$YYYY$MM$DD.A3dyn.$RES.$NEST.$NC  DTRAIN   1980-2018/1-12/1-31/0-23/+90minute C xyz 1  * -  1 1

# --- A3mstC fields ---
* DQRCU     $METDIR/$YYYY/$MM/$MET.$YYYY$MM$DD.A3mstC.$RES.$NEST.$NC DQRCU    1980-2018/1-12/1-31/0-23/+90minute C xyz 1  * -  1 1

# --- A3mstE fields ---
* CMFMC     $METDIR/$YYYY/$MM/$MET.$YYYY$MM$DD.A3mstE.$RES.$NEST.$NC CMFMC    1980-2018/1-12/1-31/0-23/+90minute C xyz 1  * -  1 1

# --- I3 fields ---
* PS1       $METDIR/$YYYY/$MM/$MET.$YYYY$MM$DD.I3.$RES.$NEST.$NC     PS       1980-2018/1-12/1-31/0-23           C xy  1  * -  1 1

I then compile the model with:

make realclean
make -j4 mpbuild NC_DIAG=y TIMERS=1 NEST=CH GRID=025x03125 MET=geosfp SIM=complexSOA_SVPOA CHEM=SOA_SVPOA

I found that I must add the last two options to enable the complexSOA_SVPOA simulation.

When I ran the model, I got the following error:

 Reading part 2 of HEMCO configuration file: HEMCO_Config.rc
   --> Isoprene to SOA-Precursor   1.4999999999999999E-002
   --> Isoprene direct to SOA (Simple)   1.4999999999999999E-002
   --> Monoterpene to SOA-Precursor   4.4091712946118980E-002
   --> Monoterpene direct to SOA (Simple)   4.4091712946118980E-002
   --> Othrterpene to SOA-Precursor   5.0000000000000003E-002
   --> Othrterpene direct to SOA (Simple)   5.0000000000000003E-002
 HEMCO ERROR: This is not a HEMCO species: OCPI
 ERROR LOCATION: DIAGN_BIOMASS (hcoi_gc_diagn_mod.F90)
 HEMCO ERROR: This is not a HEMCO species: OCPO
 ERROR LOCATION: DIAGN_BIOMASS (hcoi_gc_diagn_mod.F90)
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%               HEMCO: Harvard-NASA Emissions Component               %%%%%
%%%%%               You are using HEMCO version v2.1.011                  %%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
HEMCO: Opening /home/ubuntu/ExtData/HEMCO/BIOFUEL/v2014-07/biofuel.geos.4x5.nc
HEMCO: Opening /home/ubuntu/ExtData/HEMCO/BROMINE/v2015-02/Bromocarb_Liang2010.nc
HEMCO: Opening /home/ubuntu/ExtData/HEMCO/ACET/v2014-07/ACET_seawater.generic.1x1.nc
HEMCO: Opening /home/ubuntu/ExtData/HEMCO/SOILNOX/v2014-07/DepReservoirDefault.nc
HEMCO: Opening /home/ubuntu/ExtData/HEMCO/SOILNOX/v2014-07/soilNOx.landtype.generic.025x025.1L.nc
HEMCO: Opening /home/ubuntu/ExtData/HEMCO/SOILNOX/v2014-07/soilNOx.climate.generic.05x05.nc
HEMCO: Opening /home/ubuntu/ExtData/HEMCO/DUST_DEAD/v2014-07/dst_tibds.geos.4x5.nc
HEMCO: Opening /home/ubuntu/ExtData/HEMCO/DUST_DEAD/v2014-07/GOCART_src_fn.geos.4x5.nc
HEMCO: Opening /home/ubuntu/ExtData/HEMCO/MEGAN/v2018-05/MEGAN2.1_EF.geos.025x03125.nc
HEMCO: Opening /home/ubuntu/ExtData/HEMCO/MEGAN/v2017-07/CLM4_PFT.geos.025x03125.nc
HEMCO: Opening /home/ubuntu/ExtData/GEOS_0.25x0.3125_CH/GEOS_FP/2011/01/GEOSFP.20110101.CN.025x03125.CH.nc
HEMCO: Opening ./GEOSChem.Restart.20160701_0000z.nc4
Killed

It printed Killed, which really surprised me! I have checked HEMCO.log but found no clue. For your convenience, I attach my input.geos, HEMCO.log, HEMCO_Config.rc, and lastbuild.mp here.

I feel it is somewhat labour-intensive to adapt geosfp_2x25_complexSOA_SVPOA into geosfp_025x03125_complexSOA_SVPOA_ch. Would adapting geosfp_025x03125_tropchem_ch into geosfp_025x03125_complexSOA_SVPOA_ch require less work?

Configurations.zip

msulprizio commented 5 years ago

It appears your simulation is crashing while it is reading the restart file. Are you using a 2x2.5 restart file or a restart file regridded to the 0.25x0.3125 CH domain? Can you confirm that the file is uncorrupted (i.e. can you open it in another program and view the contents)? Can you also confirm that you are using enough memory for the simulation? You can try increasing your requested memory to see if you get past that point.
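
One quick way to check the first two points is to open the regridded restart file and inspect its grid dimensions before starting the run. A minimal sketch with xarray (the file path and species name below are only examples):

import xarray as xr

# Hypothetical path -- point this at the restart file your nested run directory reads.
ds = xr.open_dataset("GEOSChem.Restart.20160701_0000z.nc4")

# A 0.25x0.3125 CH restart should show the nested lat/lon sizes
# (roughly 161 x 225 for 15-55N, 70-140E), not the global 2x2.5 sizes (91 x 144).
print(ds.dims)

# Reading one species all the way into memory also confirms the file is not corrupted.
print(ds["SpeciesRst_RCOOH"].shape, float(ds["SpeciesRst_RCOOH"].max()))

ds.close()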

It may indeed be easier to create your nested grid run directory for the complexSOA_SVPOA simulation from a nested grid run directory for the tropchem simulation. In that case, you should just need to update the input.geos file to include the complexSOA_SVPOA species and turn on the proper switches in the Aerosol Menu. You would also need to update your restart file -- the 0.25x0.3125 CH restart file for the tropchem simulation won't include the complex SOA species so those species will be initialized to some background concentration (e.g. 1e-20).

FeiYao-Edinburgh commented 5 years ago

Are you using a 2x2.5 restart file or a restart file regridded to the 0.25x0.3125 CH domain? Can you confirm that the file is uncorrupted (i.e. can you open it in another program and view the contents)?

I followed this tutorial (https://github.com/geoschem/GEOSChem-python-tutorial/blob/master/Chapter03_regridding.ipynb) to regrid the 2x25 restart file to the 0.25x0.3125 CH domain. I have also kept the metadata of the restart file by adding some additional code (e.g., dr_temp.attrs = dr.attrs, https://github.com/geoschem/geos-chem-cloud/issues/21). The regridded restart file is correct and can be opened without problems in other software. Nevertheless, I noticed one difference between the regridded file and the initial 2x25 one, namely the _FillValue. In the initial 2x25 file it is SpeciesRst_RCOOH:_FillValue = -1.e+31f ;, whereas in the regridded one it is SpeciesRst_RCOOH:_FillValue = NaN ;. I am not sure if this is causing the problem. If it is, how can I solve it? @JiaweiZhuang might give some hints on the coding?
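
For reference, a minimal sketch of such an xESMF regridding step that also carries over the variable and global attributes (the file names, the bilinear method, and the target-grid coordinates are assumptions for illustration; check them against your own nested-grid definition):

import numpy as np
import xarray as xr
import xesmf as xe

ds_in = xr.open_dataset("initial_GEOSChem_rst.2x25_complexSOA_SVPOA.nc")

# Assumed target grid: 0.25 x 0.3125 cell centers spanning 15-55N, 70-140E.
ds_target = xr.Dataset({
    "lat": (["lat"], np.linspace(15.0, 55.0, 161)),
    "lon": (["lon"], np.linspace(70.0, 140.0, 225)),
})

regridder = xe.Regridder(ds_in, ds_target, "bilinear")

# Regrid every gridded variable and copy its attributes, since GEOS-Chem
# expects the restart metadata (units, long_name, etc.) to be present.
ds_out = xr.Dataset(attrs=ds_in.attrs)
for name, da in ds_in.data_vars.items():
    if "lat" in da.dims and "lon" in da.dims:
        da_new = regridder(da)      # extra dims such as time and lev are preserved
        da_new.attrs = da.attrs
        ds_out[name] = da_new
    else:
        ds_out[name] = da           # copy non-gridded variables unchanged

ds_out.to_netcdf("GEOSChem.Restart.20160701_0000z.nc4")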

Can you also confirm that you are using enough memory for the simulation?

ulimit -s unlimited
export OMP_STACKSIZE=500m

Should I increase 500m to a higher value in .bashrc?

It may indeed be easier to create your nested grid run directory for the complexSOA_SVPOA simulation from a nested grid run directory for the tropchem simulation.

It seems that a similar amount of work is needed either way. I believe I will study the configuration files and hopefully create run directories with my own scripts.

msulprizio commented 5 years ago

You may also consider trying the method for regridding and cropping restart files that is documented on the GEOS-Chem wiki. Please see this comment: https://github.com/geoschem/geos-chem-cloud/issues/21#issuecomment-469675459.

Your stacksize settings look fine. You may want to try compiling with debug flags on to see if you can get additional information about why your simulation is crashing. Please also see our debugging tips on the GEOS-Chem wiki.

JiaweiZhuang commented 5 years ago

Nevertheless, I noticed one difference between the regridded file and the initial 2x25 one, namely the _FillValue. In the initial 2x25 file it is SpeciesRst_RCOOH:_FillValue = -1.e+31f ;, whereas in the regridded one it is SpeciesRst_RCOOH:_FillValue = NaN ;. I am not sure if this is causing the problem.

This shouldn't affect anything. If it does, it is an easy fix with something like ds['SpeciesRst_RCOOH'].attrs['_FillValue'] = -1.e+31

Update: The NaN behavior seems to be by design; see pydata/xarray#1163
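
If you do want the regridded file to keep the GEOS-Chem-style fill value rather than NaN, one option is to set it when the file is written out. A minimal sketch using xarray's encoding argument (file names are illustrative):

import xarray as xr

ds = xr.open_dataset("GEOSChem.Restart.20160701_0000z.nc4")

# Re-write the file with an explicit fill value for every species variable,
# overriding xarray's default NaN _FillValue.
encoding = {name: {"_FillValue": -1.0e31} for name in ds.data_vars}
ds.to_netcdf("GEOSChem.Restart.20160701_0000z.fixed.nc4", encoding=encoding)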

FeiYao-Edinburgh commented 5 years ago

It may indeed be easier to create your nested grid run directory for the complexSOA_SVPOA simulation from a nested grid run directory for the tropchem simulation. In that case, you should just need to update the input.geos file to include the complexSOA_SVPOA species and turn on the proper switches in the Aerosol Menu.

Re: the diff command makes this approach much easier, so I prefer it now.

You may also consider trying the method for regridding and cropping restart files that is documented on the GEOS-Chem wiki. Please see this comment: geoschem/geos-chem-cloud#21 (comment).

Re: I have tried cdo. Starting from the same file (i.e. initial_GEOSChem_rst.2x25_complexSOA_SVPOA.nc), cdo returns a regridded file of ~1.7 GiB, whereas xESMF returns a regridded file of ~3.4 GiB. But both seem fine, since they can be opened correctly with the HDFView software as well as the ncdump command. Unfortunately, I keep encountering the "Killed" error when HEMCO tries to open the restart file. Here I provide my input.geos, HISTORY.rc, HEMCO_Config.rc, HEMCO.log, lastbuild.mp, and the regridded restart file from cdo. I would be grateful if you could spend some time helping me identify where I am going wrong. BTW, I run the model on AWS (AMI: ami-06f4d4afd350f6e4c, c5.4xlarge, 1000 GiB).

Configuration files and regridded restart files

https://uoe-my.sharepoint.com/:f:/g/personal/s1855106_ed_ac_uk/Em5WIWFxt61NuTsKYupFJt8BdhVSvzIiTghfm_gxQM2U8Q?e=u79Zkg

cdo commands that I ran

cdo remapbic,geos.025x03125.grid initial_GEOSChem_rst.2x25_complexSOA_SVPOA.nc GEOSChem.Restart.20160701_0000z_bicub.nc
cdo sellonlatbox,70,140,15,55 GEOSChem.Restart.20160701_0000z_bicub.nc GEOSChem.Restart.20160701_0000z_bicub_ch.nc

Killed error

 Reading part 2 of HEMCO configuration file: HEMCO_Config.rc
   --> Isoprene to SOA-Precursor   1.4999999999999999E-002
   --> Isoprene direct to SOA (Simple)   1.4999999999999999E-002
   --> Monoterpene to SOA-Precursor   4.4091712946118980E-002
   --> Monoterpene direct to SOA (Simple)   4.4091712946118980E-002
   --> Othrterpene to SOA-Precursor   5.0000000000000003E-002
   --> Othrterpene direct to SOA (Simple)   5.0000000000000003E-002
 HEMCO ERROR: This is not a HEMCO species: OCPI
 ERROR LOCATION: DIAGN_BIOMASS (hcoi_gc_diagn_mod.F90)
 HEMCO ERROR: This is not a HEMCO species: OCPO
 ERROR LOCATION: DIAGN_BIOMASS (hcoi_gc_diagn_mod.F90)
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%               HEMCO: Harvard-NASA Emissions Component               %%%%%
%%%%%               You are using HEMCO version v2.1.011                  %%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
HEMCO: Opening /home/ubuntu/ExtData/HEMCO/BIOFUEL/v2014-07/biofuel.geos.4x5.nc
HEMCO: Opening /home/ubuntu/ExtData/HEMCO/BROMINE/v2015-02/Bromocarb_Liang2010.nc
HEMCO: Opening /home/ubuntu/ExtData/HEMCO/ACET/v2014-07/ACET_seawater.generic.1x1.nc
HEMCO: Opening /home/ubuntu/ExtData/HEMCO/SOILNOX/v2014-07/DepReservoirDefault.nc
HEMCO: Opening /home/ubuntu/ExtData/HEMCO/SOILNOX/v2014-07/soilNOx.landtype.generic.025x025.1L.nc
HEMCO: Opening /home/ubuntu/ExtData/HEMCO/SOILNOX/v2014-07/soilNOx.climate.generic.05x05.nc
HEMCO: Opening /home/ubuntu/ExtData/HEMCO/DUST_DEAD/v2014-07/dst_tibds.geos.4x5.nc
HEMCO: Opening /home/ubuntu/ExtData/HEMCO/DUST_DEAD/v2014-07/GOCART_src_fn.geos.4x5.nc
HEMCO: Opening /home/ubuntu/ExtData/HEMCO/MEGAN/v2018-05/MEGAN2.1_EF.geos.025x03125.nc
HEMCO: Opening /home/ubuntu/ExtData/HEMCO/MEGAN/v2017-07/CLM4_PFT.geos.025x03125.nc
HEMCO: Opening /home/ubuntu/ExtData/GEOS_0.25x0.3125_CH/GEOS_FP/2011/01/GEOSFP.20110101.CN.025x03125.CH.nc
HEMCO: Opening ./GEOSChem.Restart.20160701_0000z.nc4
Killed

I have tried adding some additional debugging flags such as DEBUG=y, BOUNDS=y, FPE=y, but unfortunately no further information has been added to HEMCO.log.

Maybe there is an alternative way to help me identify the error?

I understand that it is somewhat difficult to reproduce what I have done, even with the configuration files listed above. Therefore, I have created an AMI from my EC2 instance and made it public. Its AMI ID is ami-00bd0b65443119dae, and it is in N. Virginia. I believe you can identify my problem more conveniently and efficiently by launching an EC2 instance from this AMI with c5.4xlarge and 1000 GiB. The run directory is ~/GC/geosfp_025x03125_complexSOA_SVPOA_ch/.

@msulprizio @JiaweiZhuang @yantosca Thanks in advance!

JiaweiZhuang commented 5 years ago

Therefore, I have created an AMI from my EC2 instance and made it public.

Thanks! This seems like a very convenient way to debug. One of us will take a look when time allows (things are getting absolutely crazy right before IGC9...).

yantosca commented 5 years ago

Hi Fei,

Thanks for reporting this issue. I logged into the cloud with your AMI and was able to reproduce it.

It seems that the GEOS-Chem complexSOA_SVPOA job is using up all of the memory on the node and is being killed by the kernel. See this forum post:

https://stackoverflow.com/questions/726690/what-killed-my-process-and-why

Here, they suggested using the command:

dmesg -T| grep -E -i -B100 'killed process'

to get information about, e.g., the last 100 lines (-B100) before the job was killed. I did that, and the last few lines printed out as follows:

[Wed Mar 27 20:28:39 2019] [ pid ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
[Wed Mar 27 20:28:39 2019] [  522]     0   522    23697      682   184320        0             0 systemd-journal
[Wed Mar 27 20:28:39 2019] [  540]     0   540    10880      887   118784        0         -1000 systemd-udevd
... etc not shown...
[Wed Mar 27 20:28:39 2019] [ 1891]  1000  1891    64772      575   270336        0             0 (sd-pam)
[Wed Mar 27 20:28:39 2019] [ 2023]  1000  2023    27050      757   253952        0             0 sshd
[Wed Mar 27 20:28:39 2019] [ 2027]  1000  2027     5822      903    86016        0             0 bash
[Wed Mar 27 20:28:39 2019] [ 4095]  1000  4095  9823931  7834400 63209472        0             0 geos.mp
[Wed Mar 27 20:28:39 2019] Out of memory: Kill process 4095 (geos.mp) score 984 or sacrifice child
[Wed Mar 27 20:28:39 2019] Killed process 4095 (geos.mp) total-vm:39295724kB, anon-rss:31335836kB, file-rss:1764kB, shmem-rss:0kB

So as you can see geos.mp was taking up all of the physical + virtual memory on the system.

The first thing I thought of was that the run had all of the BPCH diagnostics turned on in the DIAGNOSTIC MENU of input.geos, as well as the netCDF diagnostics for AerosolMass and Aerosols collections. So I simply turned off all the bpch diagnostics by setting everything to zero in the DIAGNOSTICS MENU of input.geos.

That got me past the point where the restart file was being read in, and the model ran down to the UCX initialization and stratospheric chemistry module initialization. But then it halted with this error:

=============================================================
GEOS-CHEM ERROR: Allocation error in array: MInit
STOP at alloc_err.f
+============================================================
 Timer forced to stop due to error: GEOS-Chem
 Timer forced to stop due to error: Initialization

So again we are running up against a memory limit.

The MInit array is in the stratospheric chemistry module (GeosCore/strat_chem_mod.F90). It was used for a global strat prod/loss diagnostic computed by the routine CALC_STE. But in the most recent versions of the code, the CALC_STE routine was commented out because it was giving misleading results. Therefore, the whole MInit array might no longer be necessary and probably can be removed from GEOS-Chem. I will have to check with the rest of the GCST to make sure. But you could try commenting out all places in strat_chem_mod.F90 where MInit is allocated, deallocated, and used, to see if you can get your run to go further.

The alternative is to pick another instance that has more memory. But I know that also incurs more costs.

Hope this helps!

Bob Y.

FeiYao-Edinburgh commented 5 years ago

So I simply turned off all the bpch diagnostics by setting everything to zero in the DIAGNOSTICS MENU of input.geos.

Re: what do you mean by everything? Like the following? I am sorry, but I have not found detailed information about these numbers either at http://wiki.seas.harvard.edu/geos-chem/index.php/GEOS-Chem_Input_Files#Diagnostic_Menu or at http://wiki.seas.harvard.edu/geos-chem/index.php/List_of_diagnostics_archived_to_bpch_format (I guess the numbers after the colon are not the NDxx numbers?). Do you have any further links I can refer to in order to understand what these numbers and words mean?

%%% DIAGNOSTIC MENU %%% :
Binary punch file name  : trac_avg.geosfp_025x03125_complexSOA_SVPOA_ch.YYYYMMDDhhmm
Diagnostic Entries ---> :  L   Tracers to print out for each diagnostic
ND01: Rn/Pb/Be source   :  0   all
ND02: Rn/Pb/Be decay    :  0   all
ND03: Hg emissions, P/L :  0   all
ND04: CO2 Sources       :  0   all
ND05: Sulfate prod/loss :  0   all
ND06: Dust aer source   :  0   all
ND07: Carbon aer source :  0   all
ND08: Seasalt aer source:  0   all
ND09: -                 :  0   all
ND10: -                 :  0   all
ND11: Acetone sources   :  0   all
ND12: BL fraction       :  0   all
ND13: Sulfur sources    :  0   all
ND14: Cld conv mass flx :  0   all
ND15: BL mix mass flx   :  0   all
ND16: LS/Conv prec frac :  0   all
ND17: Rainout fraction  :  0   all
ND18: Washout fraction  :  0   all
ND19: CH4 loss          :  0   all
ND21: Optical depths    :  0   all
ND22: J-Values          :  0   all
      => JV time range  :      11 13
ND24: E/W transpt flx   :  0   all
ND25: N/S transpt flx   :  0   all
ND26: U/D transpt flx   :  0   all
ND27: Strat NOx,Ox,HNO3 :  0   1 2 7
ND28: Biomass emissions :  0   all
ND29: CO sources        :  0   all
ND30: Land Map          :  0   all
ND31: Pressure edges    :  0   all
ND32: NOx sources       :  0   all
ND33: Column tracer     :  0   all
ND34: Biofuel emissions :  0   all
ND35: Tracers at 500 mb :  0   all
ND36: Anthro emissions  :  0   all
ND37: Updraft scav frac :  0   all
ND38: Cld Conv scav loss:  0   all
ND39: Wetdep scav loss  :  0   all
ND41: Afternoon PBL ht  :  0   all
ND42: SOA concentrations:  0   all
ND43: Chem prod OH, HO2 :  0   all
  ==> OH/HO2 time range :       0 24
ND44: Drydep flx/vel    :  0   all
ND45: Tracer Conc's     :  0   all
  ==> ND45 Time range   :       0 24
ND46: Biogenic emissions:  0   all
ND47: 24-h avg trc conc :  0   all
ND52: GAMMA values      :  0   all
ND53: POPs Emissions    :  0   all
ND54: Time in t'sphere  :  0   all
ND55: Tropopause height :  0   all
ND56: Lightning flashes :  0   all
ND57: Potential T       :  0   all
ND58: CH4 Emissions     :  0   all
ND59: TOMAS aerosol emis:  0   all
ND60: Wetland Frac      :  0   all
ND61: TOMAS 3D rate     :  0   all
ND62: Inst column maps  :  0   all
ND64: Radiative flux    :  0   all
ND66: DAO 3-D fields    :  0   all
ND67: DAO 2-D fields    :  0   all
ND68: Airmass/Boxheight :  0   all
ND69: Surface area      :  0   all
ND70: Debug output      :  0   all
ND71: Hourly max ppbv   :  0   2
ND72: Radiative output  :  0   all
ND73: ISORROPIA         :  0   all

The alternative is to pick another instance that has more memory. But I know that also incurs more costs.

Re: I have upgraded from c5.4xlarge to c5.18xlarge, which provides more CPUs and memory and let me overcome the Killed problem. I will follow your suggestion to modify GeosCore/strat_chem_mod.F90 to see if the model can run on a somewhat smaller instance.

Thanks for your work!

yantosca commented 5 years ago

About the DIAGNOSTIC MENU: this sends output to diagnostics in our outdated binary format, which is currently being phased out. To run on the cloud, it is better to use the diagnostics in netCDF format when possible.

When setting these binary diagnostics:

ND45: Tracer Conc's     :  47   all

the number "47" means "save the ND45 diagnostic with 47 levels of output". Specifying zero levels turns off the diagnostic completely.

As I said, right now we still maintain the binary diagnostic output, mostly for backwards compatibility. We are going to start removing binary diagnostics from the code, probably after the IGC9 meeting is finished.

FeiYao-Edinburgh commented 5 years ago

the number "47" means "save the ND45 diagnostic with 47 levels of output". Specifying zero levels turns off the diagnostic completely.

Re: That is consistent with what I imagined. Thanks for the confirmation. In that case, we only need to set the value in the first column after the colon to 0. What does the second column after the colon (e.g. all) mean? I understand you are very busy right before IGC9, so just reply when time allows. I am a relatively new GEOS-Chem user and hence have many questions...

We are going to start removing binary diagnostics from the code, probably after the IGC9 meeting is finished.

Re: Thanks for your work. Personally I prefer netCDF diagnostics. I look forward to seeing your updates.

yantosca commented 5 years ago

The "all" is shorthand for "all available tracers for this diagnostic". Otherwise you could list specific species such as:

ND45: Tracer Conc's     :  47   1 2 3 4 5

but again, as bpch output is going away, I wouldn't worry about it too much.

FeiYao-Edinburgh commented 4 years ago

We turn on CPP switch -DBPCH_TPBC in Makefile_header.mk for all simulations to allow for users to save out TPCORE BC files to BPCH format.

Where could I find the Makefile_header.mk file?

To save disk space, you may want to suppress saving out BPCH diagnostics that you do not need by setting the options to 0 in the diagnostic menu of input.geos.

Could this be done more simply by just setting BPCH_DIAG=n?

Can you also confirm that you are using enough memory for the simulation? You can try increasing your requested memory to see if you get past that point.

I just reviewed the whole conversation and found that you may have already pointed out the problem, which I did not catch at the time. Thanks!