geoschem / geos-chem-cloud

Run GEOS-Chem easily on AWS cloud
http://cloud.geos-chem.org

[FEATURE REQUEST] Need better scripts to download data from S3 #25

Closed: JiaweiZhuang closed this issue 4 years ago

JiaweiZhuang commented 5 years ago

Problem

So far I have been using custom scripts to download input data from S3. They are just a bunch of aws s3 cp commands with ad-hoc --include / --exclude filters. This works fine initially but becomes a maintenance burden in the long term: every version release adds or removes some datasets, so I always need to tweak the scripts a bit. This goes against the principle of "minimizing human intervention" in software deployment.

For example, 12.5.0 introduces offline grid-independent emissions with a total size of ~2TB. Downloading the entire dataset takes too much time & space, so I need to skip it by default (88f881fd61adde4541ec5fd0115331d7bfc3ea20).

Most of the complications come from the HEMCO directory, which contains many emission datasets and gets updated frequently. Metfields and other files are generally quite static.

Desired functionalities

The new script might look like the hemco_data_download tool made by @yantosca. Instead of Perl, it should probably be written in Python with boto3. It might use a config file similar to hemcoDataDownload.rc (possibly in YAML/JSON), or parse the HEMCO_Config.rc file in the model run directory, to determine which datasets to download.

An important feature is selecting a time window, as some datasets span a long time range and their total sizes are quite large. For example, OFFLINE_BIOVOC spans 4 years, while a typical model simulation only needs a few months:

$ aws s3 ls --request-payer=requester s3://gcgrid/HEMCO/OFFLINE_BIOVOC/v2019-01/0.25x0.3125/
                           PRE 2014/
                           PRE 2015/
                           PRE 2016/
                           PRE 2017/
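For illustration only, a minimal sketch of what a catalog entry with a time window might look like (the field names and helper below are hypothetical, not an existing format):

# Hypothetical catalog entry; field names are illustrative only
catalog_entry = {
    "name": "OFFLINE_BIOVOC",
    "s3_prefix": "HEMCO/OFFLINE_BIOVOC/v2019-01/0.25x0.3125/",
    "available_years": [2014, 2015, 2016, 2017],
    "download_by_default": False,  # too large to pull by default
}

def select_years(entry, start_year, end_year):
    """Keep only the years that overlap the requested simulation window."""
    return [y for y in entry["available_years"] if start_year <= y <= end_year]

print(select_years(catalog_entry, 2016, 2016))  # -> [2016]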

Where to start

The most straightforward way is probably to rewrite hemco_data_download in Python, replacing wget with s3.download_file() (see the boto3 docs). It could also have an option to use ftplib to download data from the Harvard FTP server, so the same script works with multiple data sources.
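As a rough sketch of that idea (not the actual tool; the function names are placeholders), a single-file download from the requester-pays gcgrid bucket, with an FTP fallback, could look like:

import os
import boto3
from ftplib import FTP

def fetch_from_s3(bucket, key, local_path):
    """Download one object from S3; gcgrid is a requester-pays bucket."""
    os.makedirs(os.path.dirname(local_path) or ".", exist_ok=True)
    s3 = boto3.client("s3")
    s3.download_file(bucket, key, local_path,
                     ExtraArgs={"RequestPayer": "requester"})

def fetch_from_ftp(host, remote_path, local_path):
    """Fallback: download the same file from an FTP server with ftplib."""
    os.makedirs(os.path.dirname(local_path) or ".", exist_ok=True)
    with FTP(host) as ftp, open(local_path, "wb") as f:
        ftp.login()  # anonymous login
        ftp.retrbinary("RETR " + remote_path, f.write)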

@yantosca and Judit (@xaxis3) should be most capable of doing this. This is not urgent right now but will save a lot of time in the long run...

JiaweiZhuang commented 5 years ago

Intake by Anaconda also seems marginally relevant. It helps build data catalogs, mostly for analysis in Python. Probably overkill just for downloading purposes? Some examples:

JiaweiZhuang commented 4 years ago

The major issue with boto3 is that it cannot recursively download a directory (boto/boto3#358). You can manually loop over a bucket, but even just getting the list of objects inside a bucket can be somewhat annoying. To robustly get the full list you might need Paginators:

Some AWS operations return results that are incomplete and require subsequent requests in order to attain the entire result set. The process of sending subsequent requests to continue where a previous request left off is called pagination. For example, the list_objects operation of Amazon S3 returns up to 1000 objects at a time, and you must send subsequent requests with the appropriate Marker in order to retrieve the next page of results.

Paginators are a feature of boto3 that act as an abstraction over the process of iterating over an entire result set of a truncated API operation.

Overall it seems a lot more cumbersome than aws s3 cp --recursive with AWSCLI.
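For reference, a paginator-based listing looks roughly like this (the prefix is just an example from above; this is a sketch, not production code):

import boto3

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")
pages = paginator.paginate(
    Bucket="gcgrid",
    Prefix="HEMCO/OFFLINE_BIOVOC/v2019-01/0.25x0.3125/2016/",
    RequestPayer="requester",
)

keys = []
for page in pages:
    # each page holds up to 1000 objects; 'Contents' is absent on empty pages
    for obj in page.get("Contents", []):
        keys.append(obj["Key"])
# every key then still needs its own download_file() call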

JiaweiZhuang commented 4 years ago

I also tried s3fs (https://github.com/dask/s3fs). It has a much more convenient & intuitive API than boto3, and also supports recursive downloads, but it seems to have performance issues.

For example, to pull a file from NASA-NEX dataset, the code would be

import s3fs
fs = s3fs.S3FileSystem(anon=False, requester_pays=True)
s3_path = (
    'nasanex/NEX-GDDP/BCSD/rcp85/day/atmos/tasmax/r1i1p1/v1.0/'
    'tasmax_day_BCSD_rcp85_r1i1p1_inmcm4_2100.nc'
)
fs.get(s3_path, 'local_file_s3fs.nc')

However, the download takes ~1 minute on EC2, much slower than aws s3 cp, which takes only ~6 seconds. Maybe I am not using the optimal configuration?

On the other hand, boto3 doesn't have this problem; the code below runs as fast as AWSCLI.

import boto3
s3 = boto3.client('s3')
s3.download_file(
    'nasanex',
    'NEX-GDDP/BCSD/rcp85/day/atmos/tasmax/r1i1p1/v1.0/tasmax_day_BCSD_rcp85_r1i1p1_inmcm4_2100.nc', 
    'local_file_boto.nc'
)
JiaweiZhuang commented 4 years ago

The simplest way is probably sticking to AWSCLI's aws s3 cp --recursive. Instead of writing bash scripts by hand, which can get very messy, we can have a Python script that prints the full aws s3 cp ... commands, and then run the generated bash script to download data. Python can easily read an input YAML/JSON file as our data catalog, perform additional string processing such as restricting the time window, and convert the data paths to --include filters for aws s3 cp.
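A hypothetical helper for the time-window part might look like this (the function and its arguments are a sketch, not an existing script):

def include_filters(dataset_prefix, years, months=None):
    """Turn a dataset prefix plus a time window into --include patterns."""
    patterns = []
    for year in years:
        if months is None:
            patterns.append('--include "{}/{}/*"'.format(dataset_prefix, year))
        else:
            for month in months:
                patterns.append('--include "{}/{}/{:02d}/*"'
                                .format(dataset_prefix, year, month))
    return " ".join(patterns)

cmd = ('aws s3 cp --request-payer=requester --recursive '
       's3://gcgrid/HEMCO/ $HOME/ExtData/HEMCO/ --exclude "*" '
       + include_filters("OFFLINE_BIOVOC/v2019-01/0.25x0.3125", [2016], months=[7]))
print(cmd)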

This can be done together with the HEMCO "dry run" script that is being developed by @jimmielin. The script can parse HEMCO_Config.rc and print all the required input data (in YAML/JSON or just plain text). Such a file list can be used to check the data availability on local disk, or to download missing data from S3.

A side product would be an automatically updated table summarizing the HEMCO datasets, including their file sizes, time ranges, and maybe resolutions. The file size information can be printed by aws s3 ls --recursive --human-readable --summarize. Currently the data size information is updated manually by @msulprizio at http://wiki.seas.harvard.edu/geos-chem/index.php/HEMCO_data_directories, but it takes a lot of human effort to keep track of all the data.

jimmielin commented 4 years ago

I am suggesting an implementation that takes advantage of the current HEMCO / GEOS-Chem infrastructure. HEMCO itself has many intricacies when handling emissions inventories, including which inventories override which, given masks and priorities - I would trust HEMCO to perform the parsing.

An accurate implementation that behaves exactly like HEMCO at runtime would be to utilize HEMCO itself. I suggest implementing a "dummy driver", Core/hcoio_read_dummy_mod.F90, which is a dummy in the sense that:

This would be coupled with a --dry-run option in the GEOS-Chem (Classic) model. When launched as ./geos --dry-run, GEOS-Chem would act in a dry-run mode, in which:

In this way, all data that HEMCO will read can be obtained in one "dry run", and the user's GEOS-Chem run is also automatically "validated" up front, so the user won't be surprised in the middle of a run when a met field is not found and the run catastrophically crashes.

This would be a good supplement to the current sanity checks in input_mod for input.geos.

JiaweiZhuang commented 4 years ago

Such a script is also very useful for specialty simulations. For example, the CH4 simulation doesn't need most of the inventories used in the standard simulation, but needs an additional HEMCO/CH4 inventory. Here are their config files for reference (version 12.3.2):

JiaweiZhuang commented 4 years ago

This would be coupled with a --dry-run option in the GEOS-Chem (Classic) model.

@jimmielin I think that's a great idea.

I guess we can just collect the HEMCO: Opening lines from the standard log file. Here's an excerpt from a methane simulation:

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%               HEMCO: Harvard-NASA Emissions Component               %%%%%
%%%%%               You are using HEMCO version v2.1.012                  %%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
HEMCO: Opening /home/ubuntu/ExtData/HEMCO/CH4/v2017-10/CMS/CMS_CH4_FLX_MX_2010.nc
HEMCO: Opening /home/ubuntu/ExtData/HEMCO/CH4/v2017-10/CMS/CMS_CH4_FLX_CA_2013.nc
HEMCO: Opening /home/ubuntu/ExtData/HEMCO/CH4/v2017-10/GEPA/GEPA_Annual.nc
HEMCO: Opening /home/ubuntu/ExtData/HEMCO/CH4/v2017-10/Seeps/Maasakkers_Geological_Seeps.01x01.nc
HEMCO: Opening /home/ubuntu/ExtData/HEMCO/CH4/v2014-09/4x5/termites.geos.4x5.nc
HEMCO: Opening /home/ubuntu/ExtData/GEOS_4x5/GEOS_FP/2011/01/GEOSFP.20110101.CN.4x5.nc
HEMCO: Opening ./GEOSChem.Restart.20160701_0000z.nc4
HEMCO: Opening /home/ubuntu/ExtData/HEMCO/TIMEZONES/v2015-02/timezones_voronoi_1x1.nc
HEMCO: Opening /home/ubuntu/ExtData/HEMCO/OLSON_MAP/v2019-02/Olson_2001_Land_Type_Masks.025x025.generic.nc
HEMCO: Opening /home/ubuntu/ExtData/HEMCO/MASKS/v2018-09/Mexico_Mask.001x001.nc
HEMCO: Opening /home/ubuntu/ExtData/HEMCO/MASKS/v2018-09/Canada_Mask.001x001.nc
HEMCO: Opening /home/ubuntu/ExtData/HEMCO/MASKS/v2018-09/CONUS_Mask.001x001.nc
HEMCO: Opening /home/ubuntu/ExtData/HEMCO/MASKS/v2018-09/CONUS_Mask_Mirror.001x001.nc
HEMCO: Opening /home/ubuntu/ExtData/HEMCO/MASKS/v2018-09/Mexico_Mask_Mirror.001x001.nc
HEMCO: Opening /home/ubuntu/ExtData/HEMCO/MASKS/v2018-09/Canada_Mask_Mirror.001x001.nc
...
JiaweiZhuang commented 4 years ago

As a quick workaround before such a "dry run" is available, we can parse the log from an actual model run. Such a real run requires the input data to be available (it can be done on Odyssey, as for the standard benchmark), but the file list can be reused for deploying on other platforms.

Below are my quick implementations of the parser and command generator. This CH4 simulation log can be used as input data. Other log files would also work.

Extract data path as Python list

def extract_data_path(filename, prefix_filter=''):
    """
    filename : str, GEOS-Chem standard log file
    prefix_filter: str, only select file paths starting with this prefix
        e.g. "/home/ubuntu/ExtData/HEMCO/"
    """
    prefix_len = len(prefix_filter)

    data_list = set()  # only keep unique files
    with open(filename, "r") as f:
        line = f.readline()
        while(line):
            if line.startswith('HEMCO: Opening'):
                data_path = line.split()[-1]         
                if data_path.startswith(prefix_filter):
                    trimmed_path = data_path[prefix_len:]  # remove common prefix
                    data_list.add(trimmed_path)
            line = f.readline()

    data_list = sorted(list(data_list))
    return data_list

The script takes a log containing

HEMCO: Opening /home/ubuntu/ExtData/HEMCO/CH4/v2017-10/CMS/CMS_CH4_FLX_MX_2010.nc
HEMCO: Opening /home/ubuntu/ExtData/HEMCO/CH4/v2017-10/CMS/CMS_CH4_FLX_CA_2013.nc
HEMCO: Opening /home/ubuntu/ExtData/HEMCO/CH4/v2017-10/GEPA/GEPA_Annual.nc
...

and returns a Python list

['CH4/v2014-09/4x5/gmi.ch4loss.geos5_47L.4x5.nc',
 'CH4/v2014-09/4x5/termites.geos.4x5.nc',
 'CH4/v2017-10/CMS/CMS_CH4_FLX_CA_2013.nc',
...]

Use it as

data_list = extract_data_path('./your_run.log', prefix_filter='/home/ubuntu/ExtData/HEMCO/')

Convert data path into AWSCLI commands

I tested a few implementations, which differ in performance.

1. Put many --include filters into a single command (slow)

def awscli_filter(data_list, output_script='s3_download_filter.sh', 
                  data_root='$HOME/ExtData/HEMCO/'):

    command_prefix = (
        'aws s3 cp --request-payer=requester --recursive '
        's3://gcgrid/HEMCO/ {} --exclude "*" \\\n'.format(data_root)
    )

    with open(output_script, "w") as f:
        f.write(command_prefix)
        for path in data_list:
            f.write('--include "{}" \\\n'.format(path))

it gives

aws s3 cp --request-payer=requester --recursive s3://gcgrid/HEMCO/ $HOME/ExtData/HEMCO/ --exclude "*" \
--include "CH4/v2014-09/4x5/gmi.ch4loss.geos5_47L.4x5.nc" \
--include "CH4/v2014-09/4x5/termites.geos.4x5.nc" \
--include "CH4/v2017-10/CMS/CMS_CH4_FLX_CA_2013.nc" \
--include "CH4/v2017-10/CMS/CMS_CH4_FLX_MX_2010.nc" \
...

Full output: s3_download_filter.sh.txt

However, such a command spends a long time on metadata checking before and after the download. The total time to download 4.3 GB of data is 6 minutes, of which only ~30s is actually spent downloading data.

2. Use one aws cp command for one file

def awscli_individual(data_list, output_script='s3_download_individual.sh', 
                      data_root='$HOME/ExtData/HEMCO/'):

    with open(output_script, "w") as f:
        for path in data_list:
            command = (
                'aws s3 cp --request-payer=requester '
                's3://gcgrid/HEMCO/{path} {root}{path}'
                .format(path=path, root=data_root)
            )
            f.write(command + '\n')

it gives

aws s3 cp --request-payer=requester s3://gcgrid/HEMCO/CH4/v2014-09/4x5/gmi.ch4loss.geos5_47L.4x5.nc $HOME/ExtData/HEMCO/CH4/v2014-09/4x5/gmi.ch4loss.geos5_47L.4x5.nc
aws s3 cp --request-payer=requester s3://gcgrid/HEMCO/CH4/v2014-09/4x5/termites.geos.4x5.nc $HOME/ExtData/HEMCO/CH4/v2014-09/4x5/termites.geos.4x5.nc
aws s3 cp --request-payer=requester s3://gcgrid/HEMCO/CH4/v2017-10/CMS/CMS_CH4_FLX_CA_2013.nc $HOME/ExtData/HEMCO/CH4/v2017-10/CMS/CMS_CH4_FLX_CA_2013.nc
...

Full output: s3_download_individual.sh.txt

This avoids the long metadata checking and takes ~40s to download the 4.3 GB of data.

A minor problem is that the download of small files is not parallelized or multi-threaded, as each aws s3 cp command can only run after the previous one has finished.
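One possible workaround (not tested here, just a sketch) is to run the generated per-file commands concurrently with a thread pool:

import subprocess
from concurrent.futures import ThreadPoolExecutor

def run_commands_parallel(script_path, max_workers=8):
    """Run each line of the generated script as its own aws s3 cp process."""
    with open(script_path) as f:
        commands = [line.strip() for line in f if line.strip()]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        list(pool.map(lambda cmd: subprocess.run(cmd, shell=True, check=True),
                      commands))

# run_commands_parallel('s3_download_individual.sh')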

3. Use one aws cp --recursive command for each top-level directory

import os
from pathlib import Path

def awscli_directory(data_list, output_script='s3_download_directory.sh', 
                     data_root='$HOME/ExtData/HEMCO/'):

    # dir_list = set(os.path.dirname(path) for path in data_list) # bottom-level
    dir_list = set(Path(path).parts[0] for path in data_list)  #  top-level

    with open(output_script, "w") as f:
        for dir_path in dir_list:
            command = (
                'aws s3 cp --request-payer=requester --recursive '
                's3://gcgrid/HEMCO/{dir_path} {root}{dir_path}'
                .format(dir_path=dir_path, root=data_root)
            )
            f.write(command + '\n')

It gives a short list of top-level directories:

aws s3 cp --request-payer=requester --recursive s3://gcgrid/HEMCO/OLSON_MAP $HOME/ExtData/HEMCO/OLSON_MAP
aws s3 cp --request-payer=requester --recursive s3://gcgrid/HEMCO/TIMEZONES $HOME/ExtData/HEMCO/TIMEZONES
aws s3 cp --request-payer=requester --recursive s3://gcgrid/HEMCO/MASKS $HOME/ExtData/HEMCO/MASKS
aws s3 cp --request-payer=requester --recursive s3://gcgrid/HEMCO/CH4 $HOME/ExtData/HEMCO/CH4
aws s3 cp --request-payer=requester --recursive s3://gcgrid/HEMCO/OH $HOME/ExtData/HEMCO/OH
aws s3 cp --request-payer=requester --recursive s3://gcgrid/HEMCO/MODIS_XLAI $HOME/ExtData/HEMCO/MODIS_XLAI

The download of small files is parallelized via --recursive, but this can download more data than needed (6.9 GB instead of 4.3 GB in this case), as not all files in a directory are actually used. Nonetheless, it still finishes faster than the previous per-file download (30s vs 40s).

4. Use one aws cp --recursive command for each low-level directory

Using os.path.dirname to get the bottom-level directory will lead to duplicated nested entries like ['CH4/v2017-10', 'CH4/v2017-10/CMS', ...]

To avoid duplication while still having low-level directories, we can concatenate the first several items of pathlib.Path.parts.

import os
from pathlib import Path

def awscli_directory_v2(data_list, output_script='s3_download_directory_v2.sh', 
                        data_root='$HOME/ExtData/HEMCO/', nested_level=2):

    dir_list = set(os.path.join(*Path(path).parts[:nested_level]) for path in data_list)

    with open(output_script, "w") as f:
        for dir_path in dir_list:
            command = (
                'aws s3 cp --request-payer=requester --recursive '
                's3://gcgrid/HEMCO/{dir_path} {root}{dir_path}'
                .format(dir_path=dir_path, root=data_root)
            )
            f.write(command + '\n')

which gives

aws s3 cp --request-payer=requester --recursive s3://gcgrid/HEMCO/OLSON_MAP/v2019-02 $HOME/ExtData/HEMCO/OLSON_MAP/v2019-02
aws s3 cp --request-payer=requester --recursive s3://gcgrid/HEMCO/CH4/v2014-09 $HOME/ExtData/HEMCO/CH4/v2014-09
aws s3 cp --request-payer=requester --recursive s3://gcgrid/HEMCO/CH4/v2017-10 $HOME/ExtData/HEMCO/CH4/v2017-10
aws s3 cp --request-payer=requester --recursive s3://gcgrid/HEMCO/TIMEZONES/v2015-02 $HOME/ExtData/HEMCO/TIMEZONES/v2015-02
aws s3 cp --request-payer=requester --recursive s3://gcgrid/HEMCO/MODIS_XLAI/v2017-07 $HOME/ExtData/HEMCO/MODIS_XLAI/v2017-07
aws s3 cp --request-payer=requester --recursive s3://gcgrid/HEMCO/MASKS/v2018-09 $HOME/ExtData/HEMCO/MASKS/v2018-09
aws s3 cp --request-payer=requester --recursive s3://gcgrid/HEMCO/OH/v2014-09 $HOME/ExtData/HEMCO/OH/v2014-09

Or use a higher nesting level, e.g. awscli_directory_v2(data_list, nested_level=3); nested_level=4 is basically the same as the per-file download.

However, nested_level=3 still downloads more data than necessary (6.1 GB vs 4.3 GB). It also finishes in ~30s.

Tentative conclusion

  1. To minimize data size, use the per-file download method awscli_individual (see the combined usage sketch below).
  2. To speed up the download of small files, the directory-based recursive download awscli_directory_v2 would help, but it might download more data than needed.
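For completeness, a combined usage sketch of the pieces above (paths are illustrative):

# Parse the log, generate the per-file download script, then run it with bash
data_list = extract_data_path('./your_run.log',
                              prefix_filter='/home/ubuntu/ExtData/HEMCO/')
awscli_individual(data_list)   # writes s3_download_individual.sh
# bash s3_download_individual.sh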
yantosca commented 4 years ago

Hi all -- FYI, I've added Jiawei's function to GCPy. It's included in core.py under the name extract_pathnames_from_log. This will be very useful!

For now it is in the feature/1yr_benchmark branch, but it will go into master very soon.

JiaweiZhuang commented 4 years ago

@yantosca For making the 12.6.0 AMI, one quick fix would be:

  1. Take the log file from a 1-month 4x5 simulation done on Odyssey.
  2. Use the log file parser to get the input data list.
  3. Use the AWSCLI command generation script to download only the necessary data from S3.

In the future, steps 1-2 will be replaced by a more clever dry-run capability, while step 3 will remain the same.

yantosca commented 4 years ago

Indeed. I'll work on this over the next few days. Stay tuned!

JiaweiZhuang commented 4 years ago

Had a discussion with @yantosca on what data to put into the tutorial AMI. The main complexity comes from the large size of the offline emissions: one month of OFFLINE_BIOVOC biogenic emissions is 6.6 GB, and one month of OFFLINE_LIGHTNING is 14 GB. Putting the whole year of data into an AMI is definitely too much.

Tentative solution:

  1. Only put 1 month, or even just 1 day, of offline emissions in the AMI.
  2. Have a modifiable script for users to pull more data for other simulation periods.

The docs below only show how to pull new metfields for a new simulation period. They need to be updated accordingly to also download new offline emissions:

Pulling new offline emissions on demand is best done with the "dry-run" approach, but a quick workaround would look like:

for month in 09 10
do
aws s3 cp --request-payer=requester --recursive \
  s3://gcgrid/HEMCO/OFFLINE_BIOVOC/v2019-01/0.25x0.3125/2014/$month \
  ~/ExtData/HEMCO/OFFLINE_BIOVOC/v2019-01/0.25x0.3125/2014/$month
done

which pulls two extra months of OFFLINE_BIOVOC data.

lizziel commented 4 years ago

I would support having 1 day of data in the AMI, and including a script that can pull the data needed for a 1-month benchmark run. This would let users practice pulling the data and also do a run that can be compared to our publicly available benchmark results.

JiaweiZhuang commented 4 years ago

On how to organize the scripts:

  1. The log parser and awscli command generator are already added to GCPy by @yantosca (https://github.com/geoschem/gcpy/commit/e1943bc1866353e0d61547d52ca4eed3176f61dc, https://github.com/geoschem/gcpy/commit/ca06675928987af82fbc00659e4a72629b2bdb76, https://github.com/geoschem/gcpy/commit/f160bd7a9125120da8b251635492ddbba8d530c0)
  2. The generated bash scripts can be further added to https://github.com/geoschem/geos-chem-cloud/tree/master/scripts/download_data/from_S3. This makes sense because running aws s3 cp does not require the entire GCPy dependency, and users don't need to run the Python parser if they just want to download the default input files. This is also the place to add manual fixes, such as fixing symbolic links (the case for HEMCO/GMI) and working around the Fortran string-length limit (very long paths aren't printed properly in logs).
yantosca commented 4 years ago

I will add scripts for the 1-day (AMI example) and 1-month to the geos-chem-cloud repo shortly.

Also note: I might need to fix an issue in the GEOS-Chem repo. Some of the path names (e.g. for the FJX_spec.dat file) are not printed completely, due to a character string that is only 80 chars long. That fix will go into dev/12.6.1. For now I'll manually edit the AWS download scripts.

msulprizio commented 4 years ago

It would also be nice if the dry-run option and/or script used to download the data could tell users how much disk space will be required to download the files they need.

yantosca commented 4 years ago

Also, we will need to modify some of the HEMCO extensions and other routines so that they print either "Reading ..." or "Opening ...", since this is what the Python script looks for when parsing the log file. This can go into 12.6.1.
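A hedged sketch of what matching both prefixes could look like (the exact log format and the GCPy implementation may differ):

import re

# assume lines look like "HEMCO: Opening <path>" or "... Reading <path>"
LINE_PATTERN = re.compile(r"(?:HEMCO: )?(?:Opening|Reading)\s+(\S+)")

def extract_paths(log_file):
    """Collect unique file paths from 'Opening'/'Reading' lines."""
    paths = set()
    with open(log_file) as f:
        for line in f:
            match = LINE_PATTERN.search(line)
            if match:
                paths.add(match.group(1))
    return sorted(paths)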

yantosca commented 4 years ago

I was able to create a bash script from the log file output of GEOS-Chem 12.6.0 using the gcpy.aws.s3_download_cmds_from_log() function. Some minor hand editing was needed, but this can be avoided once we modify GEOS-Chem to write out all file paths that are being read. See the script: https://github.com/geoschem/geos-chem-cloud/blob/master/scripts/download_data/from_S3/download_20160701_geosfp_4x5_standard.sh.

I will also investigate using this script for longer periods of time, as well as using aws s3 sync instead of aws s3 cp.
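A hypothetical sync-based variant of the directory generator might look like this (aws s3 sync skips files that are already present locally; this is only a sketch, not the planned implementation):

def awscli_sync_directory(dir_list, output_script='s3_sync_directory.sh',
                          data_root='$HOME/ExtData/HEMCO/'):
    """Like awscli_directory, but with 'aws s3 sync' instead of 'cp --recursive'."""
    with open(output_script, "w") as f:
        for dir_path in sorted(dir_list):
            command = (
                'aws s3 sync --request-payer=requester '
                's3://gcgrid/HEMCO/{dir_path} {root}{dir_path}'
                .format(dir_path=dir_path, root=data_root)
            )
            f.write(command + '\n')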

yantosca commented 4 years ago

I have pushed a commit to GCPy (https://github.com/geoschem/gcpy/commit/5adcf4861317cde96d96354a86bd0ce80aadddd5) that improves on the aws.s3_download_cmds_from_log function. This needs to be used in conjunction with GEOS-Chem 12.6.1, since prior to 12.6.1, not all file paths were written out to the log or HEMCO.log.

In any case, I was able to generate an S3 download script by parsing the files read by GEOS-Chem (as listed in the log file). I then used it to download data from s3://gcgrid to an EBS volume for both 1-day and 2-day runs. GEOS-Chem was able to run successfully reading only the data that was downloaded.

Still need to test this for, e.g., a month-long run, but it is a good stopgap measure until we get the GEOS-Chem and HEMCO dry-run options working.

JiaweiZhuang commented 4 years ago

For GCHP we can parse the lines starting with CFIO: Reading:

CFIO: Reading ./ChemDataDir/MODIS_LAI_201707/For_Olson_2001/XLAI_for_GCHP/2011/Condensed_MODIS_XLAI.025x025.201106.nc at 20110615 000000
CFIO: Reading ./ChemDataDir/MODIS_LAI_201707/For_Olson_2001/XLAI_for_GCHP/2011/Condensed_MODIS_XLAI.025x025.201107.nc at 20110715 000000
CFIO: Reading ./MainDataDir/NH3/v2014-07/NH3_geos.4x5.nc at 19900701 000000
CFIO: Reading ./MainDataDir/NH3/v2014-07/NH3_geos.4x5.nc at 19900801 000000
CFIO: Reading ./MainDataDir/NH3/v2014-07/NH3_geos.4x5.nc at 19900701 000000

It's unclear how a dry-run approach will work, though.

JiaweiZhuang commented 4 years ago

Here's how to extract data list from GCHP log file, and download those files from S3.

Data extraction script

def extract_data_path_gchp(filename, prefix_filter=''):
    """
    filename : str, GCHP standard log file
    prefix_filter: str, only select file paths starting with this prefix
        e.g. './MainDataDir/'
    """
    prefix_len = len(prefix_filter)

    data_list = set()  # only keep unique files
    with open(filename, "r") as f:
        line = f.readline()
        while(line):
            if line.startswith('CFIO: Reading '):
                data_path = line.split()[2]         
                if data_path.startswith(prefix_filter):
                    trimmed_path = data_path[prefix_len:]  # remove common prefix
                    data_list.add(trimmed_path)
            line = f.readline()

    data_list = sorted(list(data_list))
    return data_list

It is almost the same as https://github.com/geoschem/geos-chem-cloud/issues/25#issuecomment-541446027

Example usage

Here are two test log files, with emissions on or off:

HEMCO data:

data_list = extract_data_path_gchp(
    'GCHP_7days_0.25met_EmisOn.log',
    prefix_filter='./MainDataDir/')

gives

['ACET/v2014-07/ACET_seawater.generic.1x1.nc',
 'AEIC/v2015-01/AEIC.47L.gen.1x1.nc',
 'ALD2/v2017-03/ALD2_seawater.geos.2x25.nc',
 'ALD2/v2017-03/resp.geos.2x25.nc',
 'APEI/v2016-11/APEI.0.1x0.1.nc'
...
]

CHEM_INPUTS:

extract_data_path_gchp(
    './data/GCHP_7days_0.25met_EmisOn.log',
    prefix_filter='./ChemDataDir/')

gives

['MODIS_LAI_201707/For_Olson_2001/XLAI_for_GCHP/2011/Condensed_MODIS_XLAI.025x025.201106.nc',
 'MODIS_LAI_201707/For_Olson_2001/XLAI_for_GCHP/2011/Condensed_MODIS_XLAI.025x025.201107.nc',
 'Olson_Land_Map_201203/Olson_2001_Land_Map.025x025.generic.GCHP.nc']

Metfields:

extract_data_path_gchp(
    './data/GCHP_7days_0.25met_EmisOn.log',
    prefix_filter='./MetDir/')

gives

['2011/01/GEOSFP.20110101.CN.025x03125.nc',
 '2016/07/GEOSFP.20160701.A1.025x03125.nc',
 '2016/07/GEOSFP.20160701.A3cld.025x03125.nc',
 '2016/07/GEOSFP.20160701.A3dyn.025x03125.nc',
...
]

Generate AWSCLI commands

Same as awscli_directory_v2 above, but sorting the list for better readability:

import os
from pathlib import Path

def awscli_directory_v3(data_list, output_script='s3_download_directory_v3.sh',
                        data_root='$HOME/ExtData/HEMCO/', nested_level=2):

    dir_list = set(os.path.join(*Path(path).parts[:nested_level]) for path in data_list)
    dir_list = sorted(list(dir_list))

    with open(output_script, "w") as f:
        for dir_path in dir_list:
            command = (
                'aws s3 cp --request-payer=requester --recursive '
                's3://gcgrid/HEMCO/{dir_path} {root}{dir_path}'
                .format(dir_path=dir_path, root=data_root)
            )
            f.write(command + '\n')

It assumes HEMCO files. Metfields, restart files, and other directories are easy to track anyway and probably don't need such an automated script.

The generated script: s3_download_gchp.sh

Or use data_root='$DATA_ROOT/HEMCO/' to be consistent with currently used scripts, instead of the explicit $HOME/ExtData: s3_download_gchp_dataroot.sh

jimmielin commented 4 years ago

For GCHP we can parse the lines starting with CFIO: Reading:

It's unclear how a dry-run approach will work, though.

Thanks Jiawei. Note that CFIO might not work with the parallel I/O option, since, from what I heard, it might change the I/O module name.

As for a dry-run option, I am not sure, since the I/O extends outside of HEMCO and is beyond my control. In my opinion, though, just copying HEMCO_Config.rc to a standalone HEMCO install with "dry run" capabilities sounds fair enough to me. Care needs to be taken so that ExtData.rc matches the data requested by HEMCO_Config.rc.

yantosca commented 4 years ago

Also note: in GEOS-Chem "Classic" there are some files that don't get read by HEMCO:

  1. FAST-JX input files (lookup tables)
  2. UCX input files (initial conditions)
  3. Olson Drydep inputs (constants)

Worst case, we could probably just hardwire the Olson drydep inputs (3) into the code. They used to be read from ASCII, and now from netCDF, but they probably don't need to be read from a file at all.

For the FAST-JX files (1), those are all text files formatted in a special way, so there is no way for HEMCO to use them.

Maybe at some point the UCX boundary/initial conditions could be folded into HEMCO. They are mostly just 2-D (lat-alt) variables.

yantosca commented 4 years ago

@JiaweiZhuang @jimmielin Building off of earlier work, I have created a Python script that will download data from either AWS s3://gcgrid or Compute Canada, given a GEOS-Chem log file containing dry-run output.

See file geoschem/gcpy/examples/dry-run/download_data.py, which was introduced in commit https://github.com/geoschem/gcpy/commit/a4f523f80420de62865b4c14ca1d0ac376e7c8c8.

The script can be called with:

./download_data.py log.dryrun -aws     # Gets data from AWS s3://gcgrid
./download_data.py log.dryrun -cc      # Gets data from Compute Canada

I've saved this as a GCPy example. But because the script only uses core Python, we plan to include it in all run directories for GEOS-Chem 12.7.0.

Also note: unlike the former script parse_dryrun_output.py, the new script download_data.py not only creates a bash script to download data but also executes it.
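The generate-then-execute pattern itself is simple; a minimal sketch (not the actual download_data.py code) would be:

import subprocess

def write_and_run(commands, script_path="download.sh"):
    """Write the download commands to a bash script, then execute it."""
    with open(script_path, "w") as f:
        f.write("#!/bin/bash\n")
        f.write("\n".join(commands) + "\n")
    subprocess.run(["bash", script_path], check=True)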

yantosca commented 4 years ago

I just pushed a new commit: https://github.com/geoschem/gcpy/commit/73200579e2f515d0de50b14d3b05588593bb13fc. This allows you to skip re-downloading data and only print out the log of unique file paths.

This will be a useful option for GCST, as we will want to create a log of unique file paths for each benchmark simulation, but not re-download data that is already present on disk.
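A skip-if-present check can be as simple as the following sketch (hypothetical helper, not the actual GCPy code):

import os

def missing_files(data_list, data_root):
    """Return only the relative paths that are not yet present under data_root."""
    return [p for p in data_list
            if not os.path.isfile(os.path.join(data_root, p))]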

yantosca commented 4 years ago

I also pushed a commit to the unit tester (https://github.com/geoschem/geos-chem-unittest/commit/95204b369503d62537bf203bb7ebeaa5ad9d7df4) that will copy download_data.py into each run directory that is created with the gcCopyRunDirs script. This will ship with 12.7.0.

yantosca commented 4 years ago

This should now be resolved by the GEOS-Chem dry-run functionality in versions 12.7.0 and higher. We recommend using 12.9.3 if possible, as several dry-run issues have been corrected.