Harvest: tested & working on Python 2.7.10, 2.7.13, and 3.5.
Analysis: tested & working on Python 2.7.10 and 2.7.13.
A set of metadata harvesting and analysis scripts, largely built off the model/skeleton of Mark Phillips' wonderful work:
So please give any gratitude for these scripts to Mark Phillips, and any complaints to Christina.
A huge project and presentation prompted me to finally collect and share these. However, these scripts were originally built, usually at midnight, with huge metadata projects/migrations looming and without the tools I needed to properly review metadata sets.
As such, these scripts are rather haphazard and very, very alpha. In particular the DPLA work, which was built in the context of the Digital Library of Tennessee Service hub, as a way to review our metadata once in the DPLA, is the most likely to break in unexpected and beautiful ways.
I am currently working on refactoring these to be more stable, have testing coverage, and perhaps make into a Python library. Check out the Issues on this repo to see the work going on (or see what branches are currently active).
Because these scripts use new libraries*, and because they change the original intent of Phillips' work (working with nested XML and other data publication methods and representations beyond OAI-PMH and XML), they live in a new repository.
*Some, not all, of the changes from the originating work include:
This was all built/tested on Python 2.7.10 and needs tweaking for Python 3. I'm working on it: the analysis scripts, except the MARC one, work on Python 3, but the harvester doesn't yet. I'm considering moving to the requests library instead of urllib/urlopen, which requires more 2-to-3 conversion work. Or, if you find something that works on Python 3, you can add it and submit a pull request. Please.
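For what it's worth, building the OAI-PMH request URL itself can already be done in a 2/3-compatible way with the standard library. This is a hedged sketch with an illustrative function name, not code from this repository:

```python
# Hypothetical sketch: building an OAI-PMH ListRecords request URL in a
# way that runs on both Python 2 and 3 (function name is illustrative).
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode        # Python 2

def build_listrecords_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Return the ListRecords request URL for an OAI-PMH endpoint."""
    if resumption_token:
        # Per the OAI-PMH spec, resumptionToken is an exclusive argument.
        params = [("verb", "ListRecords"), ("resumptionToken", resumption_token)]
    else:
        params = [("verb", "ListRecords"), ("metadataPrefix", metadata_prefix)]
    return base_url + "?" + urlencode(params)

print(build_listrecords_url("https://fsu.digital.flvc.org/oai2", "mods"))
```

The actual fetching (urlopen vs. requests) is the part that still needs the 2-to-3 work; URL construction is the easy half.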
So, working with Python 2.7:
$ git clone https://github.com/cmh2166/metadataQA.git
$ cd metadataQA
$ pip install -r requirements.txt
Now you should be ready to use the scripts.
This all works at present by using the harvest scripts to get a data file to your computer, then running an analysis script on that file. I'm looking into ways to have the analysis applied directly to the data streams instead of a local file.
Note: This script at present defaults to pulling MODS from the UTK Islandora OAI feed and saving it to an 'output.xml' file.
usage: python oaiharvest.py [options, see below] -l <link to OAI feed> -o <file to save to>
optional arguments:
This downloads all the MODS/XML data from the OAI feed at Florida State University, and saves it to the file 'fsuoai.mods.xml'.
$ python oaiharvest.py -m mods -o fsuoai.mods.xml -l https://fsu.digital.flvc.org/oai2
You can pass your DPLA API key to the script either by using the -k flag or by setting it as the environment variable DPLA_APIKEY.
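The key-resolution logic described above can be sketched roughly like this (illustrative function name and error message, not the script's actual code):

```python
# Hypothetical sketch: prefer an explicit -k/--key value, then fall back
# to the DPLA_APIKEY environment variable.
import argparse
import os

def resolve_api_key(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("-k", "--key", help="DPLA API key")
    args, _ = parser.parse_known_args(argv)
    key = args.key or os.environ.get("DPLA_APIKEY")
    if not key:
        raise SystemExit("No DPLA API key given (use -k or set DPLA_APIKEY).")
    return key

# e.g. resolve_api_key(["-k", "abc123"]) returns "abc123"
```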
usage: python dplaharvest.py [options, see below] -o <file to save data to>
optional arguments:
This downloads all the DPLA data that has a creation date after 2020.
$ python dplaharvest.py -k YourLongAPIKey -a 2020 -o FileToSaveDataTo.json
All of the analysis scripts run similarly to what is described by Mark Phillips here for his own work: Metadata Analysis at the Command Line
Works most similarly to the original script created by Mark Phillips.
usage: oaidc_analysis.py data_filename.xml [options, see below]
positional arguments:
optional arguments:
To get a field report:
$ python oaidc_analysis.py test/output.dc.xml
To get all the values for the dc:creator field:
$ python oaidc_analysis.py test/output.xml -e creator
To get all the unique values for the dc:creator field, sorted by count:
$ python oaidc_analysis.py test/output.xml -e creator | sort | uniq -c
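If you'd rather stay in Python than pipe through sort and uniq, the same count-unique-values idiom looks roughly like this (sample creator values made up for illustration):

```python
# Generic sketch of the `sort | uniq -c` idiom in Python: count unique
# field values and print them most-frequent first.
from collections import Counter

values = ["Smith, Jane", "Doe, John", "Smith, Jane", "Smith, Jane"]
for value, count in Counter(values).most_common():
    print(count, value)
```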
This script adds support for reviewing nested MODS elements, as well as for performing XPath queries.
usage: oaimods_analysis.py data_filename.xml [options, see below]
positional arguments:
optional arguments:
To print a field report:
$ python oaimods_analysis.py test/DLTNphase1.mods.xml
To get all the values for mods:title (this does not mean just the top-level mods:titleInfo/mods:title, but any mods:title element wherever it appears in the record):
$ python oaimods_analysis.py test/DLTNphase1.mods.xml -e title
To get all the unique values for mods:form (again, wherever it appears) sorted by count:
$ python oaimods_analysis.py test/DLTNphase1.mods.xml -e form | sort | uniq -c
To get all the values matching the 'mods:mods/mods:originInfo/mods:dateCreated[@encoding="edtf"]' XPath query (i.e., all dateCreated values for the object that have edtf encoding):
$ python oaimods_analysis.py test/DLTNphase1.mods.xml -x 'mods:originInfo/mods:dateCreated[@encoding="edtf"]'
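As a rough illustration of what such queries match, here is a standard-library sketch run against a made-up MODS record (not the analysis script's actual code):

```python
# Illustrative sketch: namespace-aware MODS queries using the standard
# library's limited XPath support. The sample record is invented.
import xml.etree.ElementTree as ET

MODS_NS = {"mods": "http://www.loc.gov/mods/v3"}
record = ET.fromstring("""
<mods xmlns="http://www.loc.gov/mods/v3">
  <titleInfo><title>A Sample Title</title></titleInfo>
  <originInfo>
    <dateCreated encoding="edtf">1923-05</dateCreated>
    <dateCreated>May 1923</dateCreated>
  </originInfo>
</mods>""")

# Any mods:title, wherever it appears in the record:
titles = [t.text for t in record.findall(".//mods:title", MODS_NS)]

# Only the dateCreated values with edtf encoding:
edtf_dates = [d.text for d in record.findall(
    ".//mods:originInfo/mods:dateCreated[@encoding='edtf']", MODS_NS)]

print(titles)      # ['A Sample Title']
print(edtf_dates)  # ['1923-05']
```

The analysis script wraps this kind of query behind the -e and -x flags so you don't write the namespace boilerplate yourself.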
To be written up.
To be written up.
To be written up.