Open JohnHalleyGotway opened 3 years ago
An existing use case runs a Python script as a "UserScript" to untar files before processing:
Python Script to Untar: https://github.com/dtcenter/METplus/blob/develop/parm/use_cases/model_applications/tc_and_extra_tc/UserScript_ASCII2NC_PointStat_fcstHAFS_obsFRD_NetCDF/hrd_frd_sonde_find_tar.py
This logic could be generalized/expanded and added as a simple wrapper to handle this behavior.
According to the GNU tar documentation (https://www.gnu.org/software/tar/manual/html_node/controlling-pattern_002dmatching.html), you can use the --wildcards argument to extract only the files that match a wildcard expression. Note that --wildcards is a GNU tar option, so it may not work on every OS. You can also extract a single file by name, so the user could supply a list of files they want to extract and the wrapper could loop over those files and extract each one.
The -z argument can be added to handle .tar.gz or .tgz files, so that could be an option as well (or it could be added automatically if the extension parsed from the filename indicates gzip compression?)
tar -xf foo.tar -v --wildcards '*.c'
tar -xf foo.tar myfile.c
tar -xzf foo.tar.gz myfile.c
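As a rough sketch of how a wrapper might assemble the commands above, the following builds the tar argument list and adds -z when the filename ends in .tar.gz or .tgz. The function name `build_tar_command` and its parameters are hypothetical and are not part of METplus.

```python
from pathlib import Path

# Hypothetical helper for illustration only; not an existing METplus function.
GZIP_SUFFIXES = (".tar.gz", ".tgz")


def build_tar_command(tar_path, members=None, wildcard=None):
    """Return the argument list for a tar extraction command.

    members:  optional list of member names to extract
    wildcard: optional pattern passed with --wildcards (GNU tar only)
    """
    name = Path(tar_path).name.lower()
    # Add -z only when the extension indicates gzip compression
    flags = "-xzf" if name.endswith(GZIP_SUFFIXES) else "-xf"
    cmd = ["tar", flags, str(tar_path)]
    if wildcard is not None:
        # --wildcards is a GNU tar option and may not exist elsewhere
        cmd += ["--wildcards", wildcard]
    if members:
        cmd += list(members)
    return cmd


print(build_tar_command("foo.tar", wildcard="*.c"))
print(build_tar_command("foo.tar.gz", members=["myfile.c"]))
```

The returned list could be passed to `subprocess.run(cmd, check=True)`. Passing it as a list without a shell keeps the wildcard from being expanded before tar sees it.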
Describe the New Feature
During a meeting with Cristina Stan from GMU on 4/13/21 about using METplus via an Amazon AMI, a good idea arose. A lot of the NOAA data available in Amazon S3 buckets is stored in tar files. Currently, it is the user's responsibility to retrieve and untar these files prior to processing them using METplus. A similar situation exists on NOAA's WCOSS machine, where the run history files are stored in very large tar files.
This task is to develop a pre-processing wrapper to run tar commands (or some variant of them on WCOSS) to extract files from tar files. Note that the wrapper should be able to either untar the whole file or extract a subset of files from that tar file. The wrapper should handle both tarred and tarred/compressed inputs.
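One possible shape for the "whole file or subset" behavior is sketched below using Python's standard tarfile module instead of the tar command, which avoids depending on GNU-specific options. The function name `extract_from_tar` and its arguments are assumptions made for illustration, and this sketch does not cover HPSS-specific variants of tar.

```python
import fnmatch
import os
import tarfile


def extract_from_tar(tar_path, dest_dir, patterns=None):
    """Extract every member, or only members whose names match a glob.

    Hypothetical sketch; not an existing METplus wrapper.
    Returns the list of extracted member names.
    """
    os.makedirs(dest_dir, exist_ok=True)
    # Mode "r:*" lets tarfile detect the compression (none, gzip, ...)
    with tarfile.open(tar_path, "r:*") as archive:
        members = archive.getmembers()
        if patterns:
            # Keep only members matching at least one pattern
            members = [
                m for m in members
                if any(fnmatch.fnmatch(m.name, p) for p in patterns)
            ]
        # Archives from untrusted sources need extra care before extraction
        archive.extractall(path=dest_dir, members=members)
        return [m.name for m in members]
```

Calling `extract_from_tar("foo.tar.gz", "out")` would untar everything, while `extract_from_tar("foo.tar.gz", "out", ["*.c"])` would extract only the matching files.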
Note that this issue requires a clearer definition, with specific examples of files and machines on which it should be demonstrated. Recommend making sure this wrapper works well on NCAR project machines (seneca), NCAR HPSS (cheyenne or derecho), NOAA HPSS (wcoss), and within the METplus AMI to extract data from S3 buckets.
Could also consider a broader pre-processing wrapper to pull data across the internet, with untarring as just one part of that broader wrapper.
Acceptance Testing
List input data types and sources. Describe tests required for new functionality.
Time Estimate
Estimate the amount of work required here. Issues should represent approximately 1 to 3 days of work.
Sub-Issues
Consider breaking the new feature down into sub-issues.
Relevant Deadlines
List relevant project deadlines here or state NONE.
Funding Source
Define the source of funding and account keys here or state NONE.
Define the Metadata
Assignee
Labels
Projects and Milestone
Define Related Issue(s)
Consider the impact to the other METplus components.
New Feature Checklist
See the METplus Workflow for details.
feature_<Issue Number>_<Description>
feature <Issue Number> <Description>