cmharlow / metadataQA

Metadata Quality Analysis Scripts

Metadata Quality Analysis

Harvest Tested & Working on: Python 2.7.10, 2.7.13, 3.5
Analysis Tested & Working on: Python 2.7.10, 2.7.13

A set of metadata harvesting and analysis scripts, largely built off the model/skeleton of Mark Phillips' wonderful work:

So please give any gratitude for these scripts to Mark Phillips, and any complaints to Christina.

Warning(s)

A huge project and presentation prompted me to finally collect and share these. However, these scripts were originally written in a hurry, often at midnight, with huge metadata projects/migrations looming and without the tools I needed to properly review metadata sets.

As such, these scripts are rather haphazard and very, very alpha. The DPLA work in particular, which was built in the context of the Digital Library of Tennessee service hub as a way to review our metadata once it was in the DPLA, is the most likely to break in unexpected and beautiful ways.

I am currently working on refactoring these to be more stable, add test coverage, and perhaps turn them into a Python library. Check out the Issues on this repo to see the work going on (or see what branches are currently active).

Why not fork from Mark Phillips' original work?

Because these scripts use new libraries*, and because they change the original intent of Phillips' work (working with nested XML and other data publication methods and representations beyond OAI-PMH and XML), they live in a new repository.

*Some, not all, of the changes from the originating work include:

Install

This was all built/tested on Python 2.7.10 and needs tweaking for Python 3. I'm working on it: the analysis files, except the MARC one, work on Python 3. The harvester doesn't work on Python 3 yet; I'm considering moving to the requests library instead of urllib/urlopen, which requires more 2-to-3 conversion work. Or, if you find something that works for 3, you can add it and submit a pull request. Please.

So, working with python 2.7:

  1. Get this repository on your computer somehow. You can:
    1. change to the directory where you want these scripts, then clone this git repository to your computer:
      $ git clone https://github.com/cmh2166/metadataQA.git
    2. download this repository to your computer from the GitHub page - use the 'Download Zip' button in bottom right corner. Move the zip file to the place you wish to have these scripts, then unzip.
  2. once you've got the scripts on your computer, change to inside the metadataQA directory, and install the requirements:
$ pip install -r requirements.txt

Now you should be ready to use the scripts.

Examples

This all works at present by using the harvest scripts to get a data file to your computer, then running an analysis script on that file. I'm looking into ways to have the analysis applied directly to the data streams instead of a local file.

Harvesting

Harvest OAI feed

Note: This script currently defaults to pulling MODS from the UTK Islandora OAI feed and saving to an 'output.xml' file.

usage: python oaiharvest.py [options, see below] -l <link to OAI feed> -o <file to save to>

optional arguments:

This downloads all the MODS/XML data from the OAI feed at Florida State University, and saves it to the file 'fsuoai.mods.xml'.

$ python oaiharvest.py -m mods -o fsuoai.mods.xml -l https://fsu.digital.flvc.org/oai2
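Under the hood, any OAI-PMH harvester has to page through the feed: the server returns records in batches, and a resumptionToken in each response signals whether more batches remain. The sketch below is illustrative of that loop's two core pieces, not oaiharvest.py's actual code; the helper names are hypothetical.

```python
# Sketch of the core OAI-PMH paging logic an oaiharvest.py-style script
# needs (hypothetical helpers, not the repo's actual code).
from xml.etree import ElementTree

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"

def next_token(response_xml):
    """Return the resumptionToken from a ListRecords response, or None
    when the server has no more batches to send."""
    root = ElementTree.fromstring(response_xml)
    token = root.find(".//%sresumptionToken" % OAI_NS)
    if token is not None and token.text:
        return token.text.strip()
    return None

def build_url(base_url, metadata_prefix="mods", token=None):
    """Build the next ListRecords request URL. Per the OAI-PMH spec,
    resumed requests carry only the verb and the token."""
    if token:
        return "%s?verb=ListRecords&resumptionToken=%s" % (base_url, token)
    return "%s?verb=ListRecords&metadataPrefix=%s" % (base_url, metadata_prefix)
```

A harvester then just alternates: fetch `build_url(...)`, append the records to the output file, call `next_token(...)` on the response, and stop when it returns None.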

Harvest DPLA feed

You can pass your DPLA API key to the script either using the -k flag or by setting it as the environment variable DPLA_APIKEY.

usage: python dplaharvest.py [options, see below] -o <file to save data to>

optional arguments:

This downloads all the DPLA data that has a creation date after 2020.

$ python dplaharvest.py -k YourLongAPIKey -a 2020 -o FileToSaveDataTo.json
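As a sketch of what that command implies, the snippet below shows one way a dplaharvest.py-style script might resolve the API key (the -k flag first, then the DPLA_APIKEY environment variable) and build a DPLA v2 items query. The parameter names follow the public DPLA API; the helpers themselves are hypothetical, not the repo's actual code.

```python
# Hypothetical sketch of key resolution and query building for a
# dplaharvest.py-style script; not the repo's actual code.
import os
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode       # Python 2

DPLA_ITEMS = "https://api.dp.la/v2/items"

def resolve_api_key(flag_value=None):
    """Prefer the -k flag value; fall back to the DPLA_APIKEY env var."""
    key = flag_value or os.environ.get("DPLA_APIKEY")
    if not key:
        raise SystemExit("No DPLA API key given (-k flag or DPLA_APIKEY).")
    return key

def items_url(api_key, after=None, page=1, page_size=500):
    """Build one page of a DPLA items query, optionally date-filtered."""
    params = {"api_key": api_key, "page": page, "page_size": page_size}
    if after:
        params["sourceResource.date.after"] = after
    return DPLA_ITEMS + "?" + urlencode(sorted(params.items()))
```

The DPLA API caps results per request, so a real harvester would loop over `page` until the reported result count is exhausted.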

Analysis

All of the analysis scripts run similarly to what is described by Mark Phillips here for his own work: Metadata Analysis at the Command Line

oai dc analysis

Works most similarly to the original script created by Mark Phillips.

usage: oaidc_analysis.py data_filename.xml [options, see below]

positional arguments:

optional arguments:

To get a field report:

$ python oaidc_analysis.py test/output.dc.xml

To get all the values for the dc:creator field:

$ python oaidc_analysis.py test/output.xml -e creator  

To get all the unique values for the dc:creator field, sorted by count:

$ python oaidc_analysis.py test/output.xml -e creator | sort | uniq -c  
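To make the "field report" idea concrete, the sketch below counts how many records in an OAI-DC file contain each dc: element. It is an illustrative simplification of what oaidc_analysis.py produces, not the script's actual code.

```python
# Illustrative field report over oai_dc records: for each dc: element,
# count the number of records that contain it at least once.
# (Hypothetical helper, not oaidc_analysis.py's actual code.)
from collections import Counter
from xml.etree import ElementTree

DC = "{http://purl.org/dc/elements/1.1/}"
OAI_DC = "{http://www.openarchives.org/OAI/2.0/oai_dc/}"

def field_report(oai_xml):
    """Return {field_name: number of records containing that field}."""
    root = ElementTree.fromstring(oai_xml)
    counts = Counter()
    for record in root.iter(OAI_DC + "dc"):
        # A set, so a record with three dc:subject values counts once.
        seen = {child.tag.replace(DC, "") for child in record
                if child.tag.startswith(DC)}
        counts.update(seen)
    return dict(counts)
```

Printing one `field count` line per entry gives the same shape of report as the command above, ready for the usual `sort | uniq -c` style of command-line analysis.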

oai mods analysis

This has added support for reviewing nested MODS elements, as well as for performing XPath queries.

usage: oaimods_analysis.py data_filename.xml [options, see below]

positional arguments:

optional arguments:

To print a field report:

$ python oaimods_analysis.py test/DLTNphase1.mods.xml

To get all the values for mods:title (this does not mean just top level mods:titleInfo/mods:title - but any mods:title element wherever it appears in the record):

$ python oaimods_analysis.py test/DLTNphase1.mods.xml -e title

To get all the unique values for mods:form (again, wherever it appears) sorted by count:

$ python oaimods_analysis.py test/DLTNphase1.mods.xml -e form | sort | uniq -c

To get all the values that fit the 'mods:mods/mods:originInfo/mods:dateCreated[@encoding="edtf"]' XPath query (i.e., all dateCreated values for the object that have edtf encoding):

$ python oaimods_analysis.py test/DLTNphase1.mods.xml -x 'mods:originInfo/mods:dateCreated[@encoding="edtf"]'
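As an illustration of how such a namespaced query can be evaluated (the script's internals may differ), the sketch below applies a mods:-prefixed path to each record using the stdlib ElementTree's limited XPath support and a namespace map:

```python
# Evaluate a mods:-prefixed XPath-style query against every mods:mods
# record in a file, using stdlib ElementTree's limited XPath support.
# (Illustrative sketch, not oaimods_analysis.py's actual code.)
from xml.etree import ElementTree

MODS = "http://www.loc.gov/mods/v3"
NSMAP = {"mods": MODS}

def xpath_values(mods_xml, path):
    """Return the text of every element matched by `path`, evaluated
    relative to each mods:mods record in the document."""
    root = ElementTree.fromstring(mods_xml)
    return [el.text
            for record in root.iter("{%s}mods" % MODS)
            for el in record.findall(path, NSMAP)]
```

Note that ElementTree only supports a subset of XPath (child paths plus simple predicates like `[@encoding='edtf']`); anything fancier would need lxml.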

dpla analysis

To be written up.

marc analysis

To be written up.

To Be Enhanced

To be written up.