dtcenter / METplus

Python scripting infrastructure for MET tools.
https://metplus.readthedocs.io
Apache License 2.0

Create a new wrapper to handle the untarring of data as a pre-processing step. #879

Open JohnHalleyGotway opened 3 years ago

JohnHalleyGotway commented 3 years ago

Describe the New Feature

During a meeting with Cristina Stan from GMU on 4/13/21 about using METplus via an Amazon AMI, a good idea arose. Much of the NOAA data available in Amazon S3 buckets is stored in tar files. Currently, it is the user's responsibility to retrieve and untar these files before processing them with METplus. The same situation exists on NOAA's WCOSS machine, where the run history files are stored in very large tar files.

This task is to develop a pre-processing wrapper that runs tar commands (or some variant of them on WCOSS) to extract files from tar files. The wrapper should be able to either untar the whole file or extract a subset of files from it, and it should handle both tarred and tarred/compressed inputs.
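As a rough illustration of the two modes described above (untar everything, or extract only a subset), here is a minimal sketch using Python's standard tarfile module rather than the tar command itself. The function name extract_from_tar and its parameters are hypothetical and not part of METplus. Opening the archive with mode "r:*" lets tarfile detect compression on its own, and the subset is selected with fnmatch so the matching does not depend on the local tar implementation.

```python
import fnmatch
import tarfile


def extract_from_tar(tar_path, dest_dir, patterns=None):
    """Hypothetical helper (not part of METplus).

    Extract every member of tar_path into dest_dir, or only the members
    whose names match one of the shell-style patterns.
    Returns the list of extracted member names.
    """
    # "r:*" reads plain tar as well as gzip/bz2/xz-compressed archives
    with tarfile.open(tar_path, mode="r:*") as archive:
        members = archive.getmembers()
        if patterns is not None:
            # keep only the members matching at least one pattern
            members = [
                member for member in members
                if any(fnmatch.fnmatch(member.name, pattern) for pattern in patterns)
            ]
        # NOTE: only use on trusted archives; member paths are not sanitized here
        archive.extractall(path=dest_dir, members=members)
        return [member.name for member in members]
```

Calling extract_from_tar("foo.tar.gz", "out", patterns=["*.c"]) would extract only the .c files, while omitting patterns untars the whole file. Whether a real wrapper should use tarfile or shell out to tar (for example on HPSS systems) is an open design question for this issue.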

Note that this issue requires a clearer definition, with specific examples of the files and machines on which it should be demonstrated. Recommend making sure this wrapper works well on NCAR project machines (seneca), NCAR HPSS (cheyenne or derecho), NOAA HPSS (wcoss), and within the METplus AMI to extract data from S3 buckets.

We could also consider a broader pre-processing wrapper that pulls data across the internet, with untarring as just one part of that broader wrapper.

Acceptance Testing

List input data types and sources. Describe tests required for new functionality.

Time Estimate

Estimate the amount of work required here. Issues should represent approximately 1 to 3 days of work.

Sub-Issues

Consider breaking the new feature down into sub-issues.

Relevant Deadlines

List relevant project deadlines here or state NONE.

Funding Source

Define the source of funding and account keys here or state NONE.

Define the Metadata

Assignee

Labels

Projects and Milestone

Define Related Issue(s)

Consider the impact to the other METplus components.

New Feature Checklist

See the METplus Workflow for details.

georgemccabe commented 3 years ago

A use case already exists that runs a Python script as a "UserScript" to untar files before processing:

Use Case Config: https://github.com/dtcenter/METplus/blob/develop/parm/use_cases/model_applications/tc_and_extra_tc/UserScript_ASCII2NC_PointStat_fcstHAFS_obsFRD_NetCDF.conf

Python Script to Untar: https://github.com/dtcenter/METplus/blob/develop/parm/use_cases/model_applications/tc_and_extra_tc/UserScript_ASCII2NC_PointStat_fcstHAFS_obsFRD_NetCDF/hrd_frd_sonde_find_tar.py

This logic could be generalized/expanded and added as a simple wrapper to handle this behavior.

georgemccabe commented 3 years ago

According to the GNU tar documentation (https://www.gnu.org/software/tar/manual/html_node/controlling-pattern_002dmatching.html), you can use the --wildcards argument to extract only the files that match a wildcard expression. Note that this may not work on every OS. You can also extract a single file by name, so the user could supply a list of files to extract and the wrapper could loop over that list and extract each one.

The -z argument can be added to handle .tar.gz or .tgz files, so that could be a wrapper option as well (or could it be added automatically based on the extension parsed from the filename?)

Examples:

Extract files with .c extension from foo.tar

tar -xf foo.tar -v --wildcards '*.c'

Extract file called myfile.c from foo.tar

tar -xf foo.tar myfile.c

Extract file called myfile.c from foo.tar.gz (gzipped)

tar -xzf foo.tar.gz myfile.c
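
One way a wrapper might assemble the commands above is sketched below. This is only an illustration: build_tar_command, its parameters, and the GZIP_SUFFIXES tuple are hypothetical names, not existing METplus code. It assumes GNU tar, adds -z when the filename ends in .tar.gz or .tgz, and adds --wildcards only when the caller asks for pattern matching.

```python
# Hypothetical sketch, not part of METplus: build a GNU tar extract command.
GZIP_SUFFIXES = (".tar.gz", ".tgz")


def build_tar_command(tar_path, members=None, use_wildcards=False):
    """Return the argument list for extracting from tar_path.

    members: optional list of file names or wildcard expressions;
             when omitted, the whole archive is extracted.
    use_wildcards: add --wildcards (GNU tar; may not work on every OS).
    """
    # add -z only when the extension indicates a gzipped archive
    flags = "-xzf" if str(tar_path).endswith(GZIP_SUFFIXES) else "-xf"
    command = ["tar", flags, str(tar_path)]
    if members:
        if use_wildcards:
            command.append("--wildcards")
        command.extend(members)
    return command
```

The returned list could be passed to subprocess.run(command, check=True). Passing the arguments as a list avoids shell quoting, so a pattern such as *.c reaches tar unexpanded.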