geoschem / geos-chem-cloud

Run GEOS-Chem easily on AWS cloud
http://cloud.geos-chem.org
MIT License

[DISCUSSION] Multiple nested regions on AWS instances #46

Closed LukeAParsons closed 11 months ago

LukeAParsons commented 1 year ago

Overview

I am collaborating with a team of health researchers, and for our project they have requested daily near-surface PM2.5 data for 'all source' vs. 'no fires' scenarios. For reference, what we are doing is similar to Liu et al. (2017, Epidemiology, doi:10.1097/EDE.0000000000000556), who ran similar experiments for western North America.

I had a few questions about setting up production runs on AWS (GEOS-Chem Classic v14, the main branch as of December 2022) to generate the PM2.5 data.

Our eventual needs: ~0.5 degree spatial resolution over three regions (the continental US, tropical South America (Brazil), and a smaller area in Southeast Asia), daily PM2.5, for the years 2018-2021. We will need an 'all source' run as well as a 'no fires' run for each of these regions. (We are funding-limited, so we need to keep costs below ~$10k.)

Specific questions

1) What is the most sensible/efficient way to set this up? I did some testing on AWS with various c5 instance types/sizes, and it looks like running a global 4x5 degree simulation (c5.9xlarge) to save boundary conditions, then running the nested regions (c5.9xlarge or c5.12xlarge), will be the fastest/cheapest way to go, but I'd like some input, please.

2) Can we run GEOS-Chem using data on AWS S3 from January 2018 through January 1, 2022? The health data extend to the end of 2021, but if I set the final day of the dry run to January 1, 2022, I get errors (it looks like there is no HEMCO data for January 1, 2022?).

3) In GEOS-Chem Classic v14, can I select the lat/lon bounds of the output boundary conditions (we don't need the Southern Ocean/Antarctica, for example) to decrease the boundary condition file size? I see this was an option in v12 (someone asked a question about running nested regions and this issue came up), but now the HISTORY file has no lat/lon selection option for the BoundaryConditions collection:

#==============================================================================
# %%%%% THE BoundaryConditions COLLECTION %%%%%
#
# GEOS-Chem boundary conditions for use in nested grid simulations
#
# Available for all simulations
#==============================================================================
  BoundaryConditions.template:   '%y4%m2%d2_%h2%n2z.nc4',
  BoundaryConditions.frequency:  00000000 030000
  BoundaryConditions.duration:   00000001 000000
  BoundaryConditions.mode:       'instantaneous'
  BoundaryConditions.fields:     'SpeciesBC_?ADV?             ',
::

Other errors

1) Is there a reason that I keep getting errors when downloading data via the dry run (global 4x5 degree)? For example, I can get all of the data for 2018-2019, but starting in 2020 I get the error: fatal error: An error occurred (404) when calling the HeadObject operation: Key "GEOS_4x5/MERRA2/2020/01/MERRA2.20200101.A1.4x5.nc4" does not exist ...and the fatal errors continue for the 2020 data.

What is the latest month/year of complete meteorology and other input data available on AWS S3?
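One way I could check whether a given key exists is to list its prefix with the AWS CLI, something like the sketch below (bucket name s3://gcgrid from later in this thread; anonymous access assumed):

    # List the January 2020 4x5 MERRA2 files actually present in the bucket
    aws s3 ls --no-sign-request s3://gcgrid/GEOS_4x5/MERRA2/2020/01/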

2) At least twice in the last week of testing, my AWS spot instance was abruptly terminated without warning due to 'security' issues, with no further information available. Does anyone know if this is the error we see when a spot instance is terminated by AWS, or is this something else I should worry about?

Thanks for your help and support. I appreciate it.

yantosca commented 1 year ago

Thanks for writing @LukeAParsons. Right now the S3 bucket on Amazon is synced to the Harvard data server, which has much less data on it than the WashU server (http://geoschemdata.wustl.edu). We are working on syncing this S3 bucket directly to the WashU server, but for technical reasons this hasn't yet been completed. We hope to get that working eventually.

When you do a dry run, you have the option of downloading from the WashU server, instead of from AWS, directly to your EBS volume. That might be the best option. There should also be more recent data at WashU than on s3://gcgrid (certainly through almost the end of 2022). There is a couple-month lag in processing the MERRA2 data, so December might not yet be ready.
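For reference, the dry-run workflow from the run directory looks something like the sketch below (the --wu shorthand for the WashU server is the flag mentioned later in this thread; ./download_data.py --help lists the exact options for your version):

    # Run the model in dry-run mode to log every input file it would read
    ./gcclassic --dryrun > log.dryrun

    # Download the files listed in the log from the WashU server
    ./download_data.py log.dryrun --wu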

Also note: @msulprizio recently created some cropped nested met data fields for other regions of the globe (including South America, Russia, etc.). These should be available on the WashU server but are not on the S3 bucket yet. You can also download the global native-resolution data files and then use a command like cdo sellonlatbox to crop them to your region. Using cropped met data will result in faster simulation speeds (the input met data are smaller and take less time to regrid, etc.).
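As a sketch of that cropping step (illustrative file names and a hypothetical South America box; cdo's lon/lat cropping operator takes bounds ordered west,east,south,north):

    # Crop a global 0.25x0.3125 met file to lon [-85,-30], lat [-60,15]
    cdo sellonlatbox,-85,-30,-60,15 MERRA2.20180101.A1.025x03125.nc4 MERRA2.20180101.A1.025x03125.SA.nc4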

In answer to your questions:

I did some testing on AWS with various c5 instance types/sizes, and it looks like running a global 4x5 degree simulation (c5.9xlarge) to save boundary conditions, then running the nested regions (c5.9xlarge or c5.12xlarge), will be the fastest/cheapest way to go, but I'd like some input, please.

I think you might even be able to use a smaller instance for the global 4x5 run (maybe c5.4xlarge). It would take longer but should work. You would definitely need the larger instance sizes for the nested runs as those nodes have more memory.

Can we run GEOS-Chem using data on AWS S3 from January 2018 through January 1, 2022?

You should be able to get this data from the WashU server via a dry-run.

In GEOS-Chem Classic v14, can I select the lat/lon bounds of the output boundary conditions (we don't need the Southern Ocean/Antarctica, for example) to decrease the boundary condition file size? I see this was an option in v12 (someone asked a question about running nested regions and this issue came up), but now the HISTORY file has no lat/lon selection option for the BoundaryConditions collection:

You can manually add the LON_RANGE and LAT_RANGE settings to the BoundaryConditions collection to crop to the size that you need.

Also I think that the errors you mentioned will be resolved if you download data from WashU instead of AWS.

Tagging @msulprizio @Jourdan-He @SaptSinha @laestrada

msulprizio commented 1 year ago

Also note: @msulprizio recently created some cropped nested met data fields for other regions of the globe (including South America, Russia, etc.). These should be available on the WashU server but are not on the S3 bucket yet.

Just to clarify, I manually uploaded the new nested fields to S3. They are available for 2018-present.
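For reference, a listing like the one below can be produced with the AWS CLI (assuming S3 read access; --no-sign-request works if the bucket allows anonymous reads):

    # List the top-level prefixes in the GEOS-Chem input data bucket
    aws s3 ls --no-sign-request s3://gcgrid/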

                           PRE CHEM_INPUTS/
                           PRE GCHP/
                           PRE GEOSCHEM_RESTARTS/
                           PRE GEOS_0.25x0.3125/
                           PRE GEOS_0.25x0.3125_AF/  <--
                           PRE GEOS_0.25x0.3125_AS/
                           PRE GEOS_0.25x0.3125_CH/
                           PRE GEOS_0.25x0.3125_EU/
                           PRE GEOS_0.25x0.3125_ME/  <--
                           PRE GEOS_0.25x0.3125_NA/
                           PRE GEOS_0.25x0.3125_OC/  <--
                           PRE GEOS_0.25x0.3125_RU/  <--
                           PRE GEOS_0.25x0.3125_SA/  <--
                           PRE GEOS_0.5x0.625/
                           PRE GEOS_0.5x0.625_AS/
                           PRE GEOS_0.5x0.625_EU/
                           PRE GEOS_0.5x0.625_NA/
                           PRE GEOS_2x2.5/
                           PRE GEOS_4x5/
                           PRE GEOS_MEAN/
                           PRE GEOS_NATIVE/
                           PRE GEOS_c360/
                           PRE HEMCO/
                           PRE gcap/
LukeAParsons commented 1 year ago

Thanks to both @msulprizio and @yantosca for the help and replies.

Continuing on this thread: I am also trying to start another test run (saving global boundary conditions at 4x5) on our local node here at Duke (GCClassic v14.0.2), and I have been running into similar problems, even when trying to download data from WashU in the dry run (--wu):

1) For all of the http://geoschemdata.wustl.edu/ExtData/HEMCO/OFFLINE_LIGHTNING/v2020-03/MERRA2/2021/ files (for example, FLASH_CTH_MERRA2_0.5x0.625_2021_12.nc4), the dry-run download gives me this error: HTTP request sent, awaiting response... 404 Not Found. Do the OFFLINE_LIGHTNING files not exist for 2021? Or can I ignore this missing-file issue?
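A quick way to test for one of these files without downloading it is wget's spider mode, e.g.:

    # Check whether the December 2021 lightning file exists on the WashU server
    wget --spider http://geoschemdata.wustl.edu/ExtData/HEMCO/OFFLINE_LIGHTNING/v2020-03/MERRA2/2021/FLASH_CTH_MERRA2_0.5x0.625_2021_12.nc4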

2) I tried running the 4x5 model starting on July 1, 2017, and I got an error when the model tries to load the variables/fields from the restart file downloaded in the dry run: "cannot get field spc_acet". The restart file seems to be in the correct directory (and has the correct month, etc.), but the code seems to be trying to pull the 'old version' (pre-v13) of the variable names from the restart file. When I used ncdump -h to show the variables in the restart file, the closest variable name I see is SpeciesRst_ACET. I googled the error, and it looks like the code is looking for the older version of the variable name. Even more strangely, I thought I would try the Harvard restart files to test whether that fixed the problem, and it did, even though the variable names are the same in that restart file (e.g. SpeciesRst_ACET). So I was able to do a test run for this month, but I wanted to report that the --wu origin restart file I initially downloaded produced an error, whereas the Harvard 2017-07-01 restart file does not. Perhaps this was just me doing something wrong in how I was running the model?
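For reference, the check I ran looked something like this (illustrative restart file name following the GEOSChem.Restart.YYYYMMDD_HHMNz.nc4 pattern):

    # List the species variables present in the restart file
    ncdump -h GEOSChem.Restart.20170701_0000z.nc4 | grep -i acet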

3) About saving only part of the globe in the BoundaryConditions collection:

I went to the help page (http://wiki.seas.harvard.edu/geos-chem/index.php/Setting_up_GEOS-Chem_nested_grid_simulations#GEOS-Chem_12.4.0_and_later) and tried adding these lines based on what I saw there: BoundaryConditions.LON_RANGE: -180.0 180.0, BoundaryConditions.LAT_RANGE: -55.0 70.0,

but this didn't work, so I also tried this format, thinking it might follow the nested-region lat/lon definition format:

#==============================================================================
# %%%%% THE BoundaryConditions COLLECTION %%%%%
#
# GEOS-Chem boundary conditions for use in nested grid simulations
#
# Available for all simulations
#==============================================================================
  BoundaryConditions.template:   '%y4%m2%d2_%h2%n2z.nc4',
  BoundaryConditions.frequency:  00000000 030000
  BoundaryConditions.duration:   00000001 000000
  BoundaryConditions.mode:       'instantaneous'
  BoundaryConditions.LON_RANGE:  [-180.0, 180.0]
  BoundaryConditions.LAT_RANGE:  [-55.0, 70.0]
  BoundaryConditions.fields:     'SpeciesBC_?ADV?             ',
::

but this also didn't work. Any suggestions about how I should sub-select a lat/lon region when saving the boundary conditions?

Thanks again!

msulprizio commented 1 year ago

Do the OFFLINE_LIGHTNING files not exist for 2021? Or can I ignore this missing-file issue?

In ExtData/HEMCO/OFFLINE_LIGHTNING/v2020-03/MERRA2/ the data only go to 2020. I'm tagging @ltmurray who may have information about the availability of more recent years.

  1. I tried running the 4x5 model starting on July 1, 2017, and I got an error when the model tries to load the variables/fields from the restart file downloaded in the dry run: "cannot get field spc_acet"

The use of SPC_ACET here refers to the HEMCO container name as defined in HEMCO_Config.rc (see https://github.com/geoschem/geos-chem/blob/ee8d0eb04d7ad095fe07fb930f28036f256b6709/run/GCClassic/HEMCO_Config.rc.templates/HEMCO_Config.rc.fullchem#L3191-L3195). Note that SPC_ there gets expanded to SPC_[SpeciesName] in the source code. I suspect that the date within the restart file might not match the simulation date. You can confirm this by setting Verbose and Warnings to 3 (the max level) in HEMCO_Config.rc and checking your log files. The HEMCO time cycle flag (EFYO) tells GEOS-Chem to only use the "E"xact date and "F"orce the simulation to quit otherwise. You can get around mismatched timestamps (e.g. same month, different year) by changing the time cycle flag in the linked lines above to something like CYS, which tells HEMCO to use the "C"losest date available and "S"kip a species (use background values) if it is not found.
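For example, the restart-file entry would change along these lines (a sketch based on the linked template; the exact line in your HEMCO_Config.rc may differ slightly):

    # Default: EFYO reads the exact date and forces an exit on a mismatch
    * SPC_  ./Restarts/GEOSChem.Restart.$YYYY$MM$DD_$HH$MNz.nc4 SpeciesRst_?ALL? $YYYY/$MM/$DD/$HH EFYO xyz 1 * - 1 1

    # Relaxed: CYS reads the closest available date and skips missing species
    * SPC_  ./Restarts/GEOSChem.Restart.$YYYY$MM$DD_$HH$MNz.nc4 SpeciesRst_?ALL? $YYYY/$MM/$DD/$HH CYS  xyz 1 * - 1 1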

Any suggestions about how I should be trying to sub-select lat/lon regions for saving the boundary conditions?

We now recommend users save out global boundary conditions since they're at coarse resolution and don't take up much disk space. This also allows the same boundary condition files to be used for any nested grid region, which is especially useful if you plan to run for multiple nested regions. You can save global BCs by simply removing the lines for BoundaryConditions.LON_RANGE and BoundaryConditions.LAT_RANGE.

If you have further questions that are specific to GEOS-Chem and not necessarily specific to GEOS-Chem on AWS, we recommend you search past issues at https://github.com/geoschem/geos-chem/issues and if you can't find the answers you're welcome to open new issues there.

ltmurray commented 1 year ago

More recent lightning data are being prepared and should be available by early March. In the interim, we recommend using the lightning climatology.

LukeAParsons commented 1 year ago

Hi @ltmurray, thanks for your reply. Is there any update on the availability of the lightning-associated emissions data? In your comment you mentioned that early March was the target date for updating the lightning data. Thank you for your help!