esturdivant-usgs / science-base-automation

Automating large USGS ScienceBase data releases

science-base-automation

Automatically create and populate ScienceBase pages with metadata and data files. Given a ScienceBase (SB) landing page and a directory tree with data and metadata, this script creates SB pages mimicking the directory structure, updates the XML files with new SB links, and populates the SB pages from the data.

Overall process

  1. Set up a local directory structure for your data release.
  2. Set up a ScienceBase landing page.
  3. Modify script parameters in config_autoSB.py.
  4. Run (first install necessary python modules).
  5. Check ScienceBase pages and make manual modifications.

Limitations

(besides soon-to-be-discovered bugs)

How to execute, from the top:

1. Set up a local directory structure for your data release.

See below for a detailed explanation of how SB pages are set up to mimic the directory structure. The parent directory should correspond to the landing page. The landing page will be populated with one child page for each first-level directory that contains XML files anywhere in its directory tree. Directories without XML files will be ignored. The contents of each directory that contains an XML file will be uploaded to the corresponding child page. If a directory contains a single XML file, the title of the corresponding page is taken from the dataset title in that XML file. All XML files should pass error checking with mp (the USGS metadata parser).
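The selection rule above (first-level directories are kept only if an XML file exists somewhere beneath them) can be checked locally before running the script. The following is a minimal sketch using only the standard library; `dirs_with_xml` is a hypothetical helper written for illustration, not a function from science-base-automation, and it assumes metadata files use a lowercase `.xml` extension.

```python
from pathlib import Path


def dirs_with_xml(parentdir):
    """Return the names of first-level subdirectories of parentdir
    whose directory trees contain at least one .xml file.

    Hypothetical helper for illustration; not part of science-base-automation.
    """
    parent = Path(parentdir)
    selected = []
    for child in sorted(parent.iterdir()):
        # rglob searches the child directory and all of its subdirectories.
        if child.is_dir() and any(p.is_file() for p in child.rglob("*.xml")):
            selected.append(child.name)
    return selected
```

Calling `dirs_with_xml(parentdir)` on your release directory lists the directories that would be expected to get child pages; anything missing from the list has no XML file in its tree.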

2. Set up a ScienceBase landing page.

Create the data release landing page before running the script. Begin either by uploading an XML file to the File section, which SB will use to automatically populate fields, or by editing the page manually. Make manual revisions as needed, for example to the citation, the body, and the purpose. If desired, upload an image to the File section; SB will automatically use it as the preview image. You can choose any of these fields (including the preview image) to be copied over to child pages.

3. Modify parameters.

Open config_autoSB.py in your Python/text editor and revise the value of each input variable as indicated in the comments.

4. Run script sb_automation.py!

INSTALL

Install the additional required Python modules: lxml, sciencebasepy (formerly named pysb), and science-base-automation itself. science-base-automation is compatible with Python 3 on OSX and Windows.

Download/fork/clone science-base-automation.

Install lxml and sciencebasepy using Conda (recommended):

conda create -n sb_py3 python=3 lxml
conda activate sb_py3 # Older conda versions: source activate sb_py3 (OSX) or activate sb_py3 (Windows)
pip install sciencebasepy

Alternative to conda: Use pip in your base python environment:

python -m ensurepip --upgrade # only needed if pip is not already installed
pip install lxml
pip install sciencebasepy

RUN

In your bash console (Terminal on OSX):

If using Conda, first activate your sb_py3 environment:

OSX: conda activate sb_py3
Windows: activate sb_py3

Then:

cd [path]/[to]/science-base-automation
python sb_automation.py

From Finder: Right click sb_automation.py and run with your python launcher of choice.

In your Python IDE of choice: Open the script (sb_automation.py) and run it line by line or however you choose.

5. Check ScienceBase pages and make manual modifications.

If you want to start fresh, an easy way to delete all items under the parent page is to set parentdir to an empty directory and set the variable replace_subpages to True.

What the script does:

Background

Terms

Directory structure

Each directory will become a ScienceBase page within your data release, and the directories will maintain their hierarchy. Each (error-free) XML file will populate a ScienceBase page. If a directory contains a single XML file, the page for that directory will be populated from that XML file. If a directory contains multiple XML files, each XML file will become a child page linked from the page for its parent directory. Each ScienceBase page will be titled with the name of its source directory unless that directory contains exactly one XML file. In that case, the page will be renamed to match the title in the XML file (Identification Information > Citation > Citation Information > Title). Here is an example of how a local file structure will become a ScienceBase page structure:

Variation 1

Input directories and files: DATA_RELEASE_1 - top directory
ScienceBase page: Shorelines of U.S. Atlantic - landing page

Variation 2

Input directories and files: DATA_RELEASE_2 - top directory
ScienceBase page: Shorelines of U.S. Atlantic - landing page
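The titling rule described in this section (use the XML title when a directory holds exactly one XML file, otherwise use the directory name) can be sketched as follows. This is an illustration, not the script's own code: `page_title` is a hypothetical name, the sketch uses the standard-library `xml.etree.ElementTree` rather than lxml to stay dependency-free, and it assumes FGDC CSDGM metadata where the title sits at `idinfo/citation/citeinfo/title` under the root element.

```python
from pathlib import Path
import xml.etree.ElementTree as ET


def page_title(directory):
    """Return the expected page title for a directory.

    Hypothetical helper for illustration. Assumes FGDC CSDGM XML, where
    the dataset title is at idinfo/citation/citeinfo/title.
    """
    directory = Path(directory)
    xml_files = sorted(p for p in directory.glob("*.xml") if p.is_file())
    if len(xml_files) == 1:
        root = ET.parse(xml_files[0]).getroot()
        title = root.findtext("idinfo/citation/citeinfo/title")
        if title and title.strip():
            return title.strip()
    # Zero or several XML files (or no title found): fall back to the directory name.
    return directory.name
```

Running this over each directory before the upload gives a quick preview of the titles to expect and flags XML files whose title element is missing or empty.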

ScienceBase features

Reference for ScienceBase item services: https://my.usgs.gov/confluence/display/sciencebase/ScienceBase+Item+Services

sciencebasepy, the ScienceBase python module: https://github.com/usgs/sciencebasepy

Intelligent content from uploaded files

ScienceBase automatically detects the file type, and in some cases the contents, of uploaded files and makes intelligent decisions about how to use them. For instance, an image file uploaded to a page will be used as the preview image. ScienceBase pulls information from an uploaded XML file to populate fields, and it detects the components of a shapefile or raster file and presents them as a shapefile or raster “facet”, which can be downloaded as a package. Even if an XML file is later removed from the Files section, the fields populated from it will remain.

Direct download

SB has a URL for direct download of all files from a page: https://www.sciencebase.gov/catalog/file/get/[item ID]. There is also the option for direct download of a single file, which adds a query onto the get-file URL: https://www.sciencebase.gov/catalog/file/get/[item ID]/?name=[file name]. However, the single-file URL should only be used when the data has been zipped before upload, to ensure that a user retrieves all necessary files (including metadata). If a facet was created, a URL for direct download of all the files in the facet can be retrieved from the JSON item.
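The two URL patterns above can be assembled with a few lines of Python. This sketch only builds strings from the patterns quoted in this section; the function names are hypothetical, the item ID in the usage note is a placeholder, and percent-encoding the file name is an assumption made so that names containing spaces yield a valid URL.

```python
from urllib.parse import quote

# Base of the direct-download URL quoted in this section.
FILE_GET_BASE = "https://www.sciencebase.gov/catalog/file/get/"


def download_all_url(item_id):
    """URL for downloading all files attached to an item."""
    return f"{FILE_GET_BASE}{item_id}"


def download_file_url(item_id, file_name):
    """URL for downloading a single named file from an item.

    The file name is percent-encoded (an assumption of this sketch).
    """
    return f"{FILE_GET_BASE}{item_id}/?name={quote(file_name, safe='')}"
```

For example, `download_file_url("[item ID]", "data.zip")` with a real item ID substituted gives the single-file link to place in a metadata record's network resource field.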

Tips

Functions

Assorted functions for certain tasks:

Propagate fields from parent to all child pages

# Propagate fields from parent to all child pages
sb = log_in(useremail, password)
landing_id = 'sb_id'
subparent_inherits = ['citation', 'contacts', 'body', 'webLinks', 'relatedItems']
data_inherits = ['citation', 'contacts', 'body', 'webLinks', 'relatedItems']

inherit_topdown(sb, landing_id, subparent_inherits, data_inherits)
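To make the idea of field inheritance concrete, the sketch below copies a chosen list of fields from a parent item to a child item, treating both as plain dictionaries (ScienceBase items are exchanged as JSON). It is not the implementation of inherit_topdown; `inherit_fields` is a hypothetical helper that performs no network calls and leaves the inputs unmodified.

```python
import copy


def inherit_fields(parent_item, child_item, fields):
    """Return a copy of child_item with the listed fields taken from parent_item.

    Hypothetical helper for illustration; it works on plain dicts and
    does not contact ScienceBase.
    """
    updated = copy.deepcopy(child_item)
    for field in fields:
        # Only overwrite fields that the parent actually defines.
        if field in parent_item:
            updated[field] = copy.deepcopy(parent_item[field])
    return updated
```

Fields not named in the list, such as the child's own title, are left as they were.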

Delete original XMLs

# Delete original XMLs
parentdir = r'path/to/parent'

remove_files(parentdir, pattern='**/*.xml_orig')
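Because deletion is not reversible, it can help to preview which files a pattern such as '**/*.xml_orig' matches before removing anything. The sketch below lists matches with the standard-library `pathlib`; `list_matches` is a hypothetical helper, and it assumes the pattern follows `pathlib` glob semantics, which may differ from how remove_files interprets it.

```python
from pathlib import Path


def list_matches(parentdir, pattern="**/*.xml_orig"):
    """Return a sorted list of files under parentdir that match the glob pattern.

    Hypothetical dry-run helper for illustration; it deletes nothing.
    """
    return sorted(p for p in Path(parentdir).glob(pattern) if p.is_file())
```

Printing the result of `list_matches(parentdir)` gives a dry-run view of the files the pattern would select.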

Check for and upload XMLs that have been modified since last upload.

sb = log_in(useremail, password)
parentdir = r'path/to/parent'

# Check for and upload XMLs that have been modified since last upload.
upload_all_updated_xmls(sb, parentdir)

Update all browse graphics

sb = log_in(useremail, password)
parentdir = r'path/to/parent'
landing_id = 'sb_id'

# Update SB preview image from the uploaded files and update filename and type in XML.
update_all_browse_graphics(sb, parentdir, landing_id)

Change all folder names to match XML titles

parentdir = r'path/to/parent'

# Change all folder names to match XML titles
rename_dirs_from_xmls(parentdir)