Automatically create and populate ScienceBase pages with metadata and data files. Given a ScienceBase (SB) landing page and a directory tree with data and metadata, this script creates SB pages mimicking the directory structure, updates the XML files with new SB links, and populates the SB pages from the data.
(besides soon-to-be-discovered bugs)
See below for a detailed explanation of how SB pages are set up to mimic the directory structure. The parent directory should correspond to the landing page. It will be populated with a child page for each first-level directory that contains XML files anywhere in its directory tree. Directories without XML files will be ignored. The contents of each directory that contains an XML file will be uploaded to the corresponding child page. If a directory contains a single XML file, the title of the corresponding page is taken from the dataset title in that XML file. All XML files should pass MP error checking.
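The directory-selection rule above can be illustrated with a short sketch. This is not the script's own code; the function name `dirs_with_xml` is hypothetical, and it only shows which first-level directories would qualify for a child page under the rule as described.

```python
from pathlib import Path


def dirs_with_xml(parentdir):
    """Return names of first-level subdirectories whose trees contain an .xml file.

    Hypothetical helper for illustration; not part of science-base-automation.
    """
    selected = []
    for child in sorted(Path(parentdir).iterdir()):
        if not child.is_dir():
            continue
        # rglob searches the directory itself and all of its descendants.
        if next(child.rglob("*.xml"), None) is not None:
            selected.append(child.name)
    return selected
```

Running this on the parent directory before the main script may help confirm that no expected directory is being skipped for lack of metadata.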
Create the data release landing page before running the script. Begin either by uploading an XML file to the File section, which SB will use to automatically populate fields, or by working manually with the page. Make manual revisions, such as to the citation, the body, the purpose, etc. If desired, create a preview image by uploading an image to the File section; this will automatically be used as the preview image. You can choose any of these fields to be copied over to child pages (including the preview image).
Open config_autoSB.py in your Python/text editor and revise the value of each input variable as indicated in the comments.
Input variables that must be updated before running:
Specify which fields will be “inherited” between pages in the following optional lists:
Choose which processes to conduct. The default values will suit most purposes, but these fields allow you to tune the processes to save time.
Optional...
Install the additional required Python modules: lxml, sciencebasepy (formerly pysb), science-base-automation. science-base-automation is compatible with Python 3 on OSX and Windows.
Download/fork/clone science-base-automation.
Install lxml and sciencebasepy using Conda (recommended):
conda create -n sb_py3 python=3 lxml
source activate sb_py3 # OSX. Windows would be activate sb_py3
pip install sciencebasepy
Alternative to conda: Use pip in your base python environment:
python -m ensurepip --upgrade
pip install lxml
pip install sciencebasepy
In your bash console (Terminal on OSX):
If using Conda, first activate your sb_py3 environment (OSX: conda activate sb_py3; Windows: activate sb_py3). Then:
cd [path]/[to]/science-base-automation
python sb_automation.py
From Finder: Right click sb_automation.py and run with your python launcher of choice.
In your Python IDE of choice: Open the script (sb_automation.py) and run it line by line or however you choose.
If you want to start fresh, an easy way to delete all items under the parent page is to set parentdir to an empty directory and set the variable replace_subpages to True.
Starts a ScienceBase session.
Works in the landing page and top directory as specified by the input parameters.
Optionally removes all child pages (option replace_subpages).
Loops through the sub-directories to create or find a matching SB page.
Loops through the XML files to create or find a data page. For each XML file (excluding the landing page XML), it:
Sets bounding box coordinates for parents based on the spatial extent of the data in their child pages.
During processing it stores values in two dictionaries, which are then saved in the top directory as a time-saving measure for future processing.
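The bounding-box step in the list above amounts to taking the union of the child extents. A minimal sketch follows, assuming each extent is a (min_x, min_y, max_x, max_y) tuple in longitude and latitude and that no extent crosses the antimeridian. The function name `union_bbox` and the tuple layout are assumptions for illustration, not the script's own representation.

```python
def union_bbox(child_boxes):
    """Return the smallest (min_x, min_y, max_x, max_y) box enclosing all inputs.

    Hypothetical helper; assumes boxes do not cross the antimeridian.
    Returns None when there are no child boxes.
    """
    boxes = list(child_boxes)
    if not boxes:
        return None
    return (
        min(b[0] for b in boxes),  # westernmost edge
        min(b[1] for b in boxes),  # southernmost edge
        max(b[2] for b in boxes),  # easternmost edge
        max(b[3] for b in boxes),  # northernmost edge
    )
```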
Each directory will become a ScienceBase page within your data release. The directories will maintain their hierarchy. Each (error-free) XML file will populate a ScienceBase page. If a directory contains a single XML file, the corresponding ScienceBase page will be populated with that XML file. If the directory contains multiple XML files, each XML will become a child page linked on the page corresponding to its parent directory. Each ScienceBase page will be titled with the name of the source directory unless there is only one XML file in that directory. In that case, the ScienceBase page will be renamed to match the title in the XML file (Identity Information > Citation > Citation Information > Title). Here is an example of how a local file structure will become a ScienceBase page structure:
North Carolina - sub-directory
<idinfo><citation><citeinfo><title>Coastal baseline for North Carolina…</title></citeinfo></citation></idinfo> - excerpt of title element from within metadata file
<idinfo><citation><citeinfo><title>Shorelines of North Carolina…</title></citeinfo></citation></idinfo> - excerpt of title element from within metadata file
North Carolina - sub-directory
<idinfo><citation><citeinfo><title>Shorelines of North Carolina with baseline file…</title></citeinfo></citation></idinfo> - excerpt of title element from within metadata file
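To make the title lookup concrete, here is a minimal sketch that reads the title from the idinfo > citation > citeinfo > title path shown in the excerpts. It uses the standard library's ElementTree rather than lxml so that it runs without extra installs, and it assumes the metadata root element directly contains idinfo. The function name `dataset_title` is hypothetical and is not the script's own function.

```python
import xml.etree.ElementTree as ET


def dataset_title(xml_text):
    """Return the dataset title from a metadata XML string, or None if absent.

    Hypothetical helper; assumes <idinfo> is a direct child of the root element.
    """
    root = ET.fromstring(xml_text)
    title = root.findtext("idinfo/citation/citeinfo/title")
    return title.strip() if title else None


sample = (
    "<metadata><idinfo><citation><citeinfo>"
    "<title>Shorelines of North Carolina</title>"
    "</citeinfo></citation></idinfo></metadata>"
)
print(dataset_title(sample))
```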
Reference for ScienceBase item services: https://my.usgs.gov/confluence/display/sciencebase/ScienceBase+Item+Services
sciencebasepy, the ScienceBase python module: https://github.com/usgs/sciencebasepy
ScienceBase automatically detects the file type and in some cases the contents of uploaded files and makes intelligent decisions about how to use them. For instance, an image file uploaded to a page will be used as the preview image. It will pull information from an XML file to populate fields, and it will detect components of a shapefile or raster file and present them as a shapefile or raster “facet”, which can be downloaded as a package. Even if an XML file is later removed from the Files, the fields populated from it will remain.
SB has a URL for direct download of all files from a page: https://www.sciencebase.gov/catalog/file/get/[item ID]. There is also the option for direct download of a single file, which adds a query onto the get-file URL: https://www.sciencebase.gov/catalog/file/get/[item ID]/?name=[file name]. However, this should only be used when the data has been zipped before upload, so that a user retrieves all necessary files (including metadata). If a facet was created, a URL for direct download of all files in the facet can be retrieved from the JSON item.
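The two URL patterns above can be assembled programmatically. The sketch below is illustrative only: the function name `download_url` is hypothetical, and percent-encoding the file name with `urllib.parse.quote` is an assumption about what the service expects for names containing spaces or other special characters.

```python
from urllib.parse import quote

BASE_URL = "https://www.sciencebase.gov/catalog/file/get/"


def download_url(item_id, file_name=None):
    """Build the direct-download URL for a whole page or for a single file.

    Hypothetical helper; follows the URL patterns described in the text.
    """
    url = BASE_URL + item_id
    if file_name is not None:
        # Percent-encode the name so spaces and similar characters are URL-safe.
        url += "/?name=" + quote(file_name)
    return url
```

For example, `download_url('[item ID]')` would give the all-files URL once the placeholder is replaced by a real item ID, and passing a zipped file's name as the second argument gives the single-file URL.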
Accessing files on a server: In OSX, use a path of the form r'/Volumes/[server directory name]'. Replace [server directory name] with the name of the directory on the server, not the name of the server itself. The server must first be mounted and visible in /Volumes; you can find the directory name by viewing the volumes mounted on your computer. Example:
parentdir = r'/Volumes/myserverfolder/data_release'
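Because an unmounted server simply looks like a missing directory, it may be worth checking the path before starting a run. The following is a small sketch of such a check; the function name `require_mounted_dir` is hypothetical and is not part of the script.

```python
import os


def require_mounted_dir(parentdir):
    """Return parentdir if it is an existing directory, else raise an error.

    Hypothetical pre-flight check; a missing path often means the server
    volume has not been mounted yet.
    """
    if not os.path.isdir(parentdir):
        raise FileNotFoundError(
            f"{parentdir!r} was not found; is the server volume mounted?"
        )
    return parentdir
```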
Optionally, you can use the find_and_replace variable in config_autoSB.py to replace text in the XML based on placeholder values. The default configuration will search for the strings https://doi.org/XXXXX and DOI:XXXXX and replace the X's with the input DOI value. Note that those are five capital X's.
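The placeholder substitution described above can be sketched as a simple mapping from search strings to replacements. The actual structure of find_and_replace in config_autoSB.py is not shown here, so the dictionary layout, the function names, and the DOI value below are all assumptions made for illustration.

```python
def doi_replacements(doi):
    """Map the two default placeholder strings to their filled-in forms.

    Hypothetical helper; the real find_and_replace variable may be structured
    differently.
    """
    return {
        "https://doi.org/XXXXX": "https://doi.org/" + doi,
        "DOI:XXXXX": "DOI:" + doi,
    }


def apply_replacements(text, replacements):
    """Apply each search/replace pair to the text in turn."""
    for old, new in replacements.items():
        text = text.replace(old, new)
    return text


# '10.5066/EXAMPLE' is a made-up DOI used only for demonstration.
xml_snippet = "<onlink>https://doi.org/XXXXX</onlink> (DOI:XXXXX)"
print(apply_replacements(xml_snippet, doi_replacements("10.5066/EXAMPLE")))
```

Note that the match is exact: a placeholder with four X's or lowercase x's would be left untouched.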
Don't include parentheses in titles. Although the upload works, parentheses seem to interfere with matching a title to its SB page ID in cases where the page has already been created and the XML must be matched to it (e.g., in upload_all_updated_xmls).
Assorted functions for certain tasks:
Propagate fields from parent to all child pages
# Propagate fields from parent to all child pages
sb = log_in(useremail, password)
landing_id = 'sb_id'
subparent_inherits = ['citation', 'contacts', 'body', 'webLinks', 'relatedItems']
data_inherits = ['citation', 'contacts', 'body', 'webLinks', 'relatedItems']
inherit_topdown(sb, landing_id, subparent_inherits, data_inherits)
Delete original XMLs
# Delete original XMLs
parentdir = r'path/to/parent'
remove_files(parentdir, pattern='**/*.xml_orig')
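Since deletion is hard to undo, it can help to preview which files a glob pattern matches before removing anything. The sketch below lists matches for the same pattern using pathlib; `preview_matches` is a hypothetical helper, and it does not reproduce how remove_files itself is implemented.

```python
from pathlib import Path


def preview_matches(parentdir, pattern="**/*.xml_orig"):
    """Return sorted relative paths under parentdir that match the glob pattern.

    Hypothetical dry-run helper; it lists files and deletes nothing.
    """
    parent = Path(parentdir)
    return sorted(p.relative_to(parent).as_posix() for p in parent.glob(pattern))
```

The leading `**/` in the pattern makes the search recursive, so files in parentdir itself and in any nested directory are included.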
Check for and upload XMLs that have been modified since last upload.
sb = log_in(useremail, password)
parentdir = r'path/to/parent'
# Check for and upload XMLs that have been modified since last upload.
upload_all_updated_xmls(sb, parentdir)
Update all browse graphics
sb = log_in(useremail, password)
parentdir = r'path/to/parent'
landing_id = 'sb_id'
# Update SB preview image from the uploaded files and update filename and type in XML.
update_all_browse_graphics(sb, parentdir, landing_id)
Change all folder names to match XML titles
parentdir = r'path/to/parent'
# Change all folder names to match XML titles
rename_dirs_from_xmls(parentdir)