Center-for-Research-Libraries / crl-serials-validator

Validate bibliographic and holdings data for shared print.
GNU General Public License v3.0

The CRL Serials Validator

Validate serials bibliographic and holdings data according to a set of user-defined rules.

About

The CRL Serials Validator takes serials bibliographic and holdings data, checks it against data downloaded from the WorldCat Search API, validates it based on rules set by the user, and declares each title valid (accurate and in scope) or invalid based on what it finds.

The CRL Serials Validator was originally built at the Center for Research Libraries for use with the Print Archives Preservation Registry (PAPR), to aid in checking serials bibliographic data for accuracy and appropriate scope. It can be used for similar work with shared print data, or anywhere that you need to check a large amount of serials data for accuracy and relevance.

Generally, the CRL Serials Validator attempts to answer these questions:

Specific criteria can be set by the user, to make checking more or less strict as required.

The program can process input data in a variety of formats: MARC, tab-separated and comma-separated text files, and Excel (xlsx) spreadsheets. It produces an output spreadsheet with information about every title in the input set, any relevant errors found, and a list of titles that passed all of the selected checks.

If the user has a subscription to data downloads from the ISSN Centre, the user can additionally validate their data against this ISSN data.

Basic Requirements

To install and use the CRL Serials Validator you will need:

The CRL Serials Validator has been tested and used on Windows 10 and Linux (Ubuntu 20.04). It should also work on macOS, but has not been tested there.

Quick Start

  1. Install Python 3. Add Python to your PATH.
  2. Install the needed Python dependencies by typing pip install -r requirements.txt.
  3. Put your input files in the input folder.
  4. Add the ISSN database (optional, if you have a subscription to it).
  5. Run the CRL Serials Validator with python crl_serials_validator.py.
  6. Set up your API keys.
  7. Tell the scripts what input fields your input files have.
  8. Validate the input data.

Running the CRL Serials Validator

Put your input files in the input folder. They should all go in the top level folder, not in any subfolders. Input files can be MARC text (.mrk), Excel, csv, or tsv. A tsv file can have a ".tsv" or ".txt" extension.
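The extension rules above can be sketched in a few lines of Python. This is a minimal illustration, not the Validator's own code: the function names and the format labels ("marc", "excel", and so on) are hypothetical, and treating extensions case-insensitively is an assumption.

```python
from pathlib import Path

# Extension-to-format mapping taken from the description above.
# The format labels on the right are hypothetical names for this sketch.
INPUT_FORMATS = {
    ".mrk": "marc",
    ".xlsx": "excel",
    ".csv": "csv",
    ".tsv": "tsv",
    ".txt": "tsv",  # a tsv file may also use a ".txt" extension
}


def classify_input_file(filename):
    """Return a format label for a filename, or None if it is unsupported."""
    # Assumption: extensions are matched case-insensitively.
    return INPUT_FORMATS.get(Path(filename).suffix.lower())


def list_input_files(input_dir):
    """Map each supported file in the top level of input_dir to its format."""
    found = {}
    for path in sorted(Path(input_dir).iterdir()):
        # Only top-level files are considered; subfolders are not searched.
        if path.is_file():
            fmt = classify_input_file(path.name)
            if fmt is not None:
                found[path.name] = fmt
    return found
```

Calling list_input_files("input") would then return a dictionary of the files the sketch recognizes, leaving out anything in subfolders.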

The CRL Serials Validator can be run by typing python crl_serials_validator.py in a command window. Note that macOS and Linux users might have to use python3 instead of python.

From here, the program presents a list of options. Each one is described below.

Set up your WorldCat API keys

If you have used the CRL MARC Machine on this computer you might be able to skip this step. Otherwise, choose this option and enter a valid API key and a name for it. A secret is necessary only for the Discovery API.

Choosing this option will cause a separate window to open. It might open under the command line window, so look for it on your taskbar if you don't see it.

Do a quick scan of any MARC input files

This runs through all of the MARC files in the input folder to check for common fields and print the results to the screen and the log file. This is intended to help you figure out what fields contain holdings, bib IDs, and so forth, and can be skipped if you are already familiar with your MARC input files.

This option won't appear if there are no MARC files in the input folder.

Specify fields in input files

Before you can analyze a file you have to tell the system which fields (for MARC records) or columns (for xlsx, csv, and tsv) contain the relevant data.

Choosing this option will cause a separate window to open. It might open under the command line window, so look for it on your taskbar if you don't see it.

MARC fields and subfields should be entered like "035a". Spreadsheet columns should be entered as numbers, with 1 as the first column.
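As a rough illustration of these two entry formats, the sketch below splits a value like "035a" into a tag and a subfield code, and converts a 1-based column number to the 0-based index most Python code uses. Both function names are hypothetical, and the handling of a tag with no subfield (for example a control field such as "001") is an assumption, not documented behavior.

```python
def parse_marc_spec(spec):
    """Split a field spec such as "035a" into (tag, subfield).

    Assumption: a three-digit tag on its own (e.g. "001") is accepted
    and returned with a subfield of None.
    """
    spec = spec.strip()
    tag, subfield = spec[:3], spec[3:]
    if len(tag) != 3 or not (tag.isascii() and tag.isdigit()):
        raise ValueError(f"Expected a three-digit MARC tag in {spec!r}")
    if len(subfield) > 1:
        raise ValueError(f"Expected at most one subfield code in {spec!r}")
    return tag, (subfield or None)


def column_to_index(column):
    """Convert a 1-based spreadsheet column number to a 0-based index."""
    number = int(column)
    if number < 1:
        raise ValueError("Column numbers start at 1")
    return number - 1
```

With these helpers, parse_marc_spec("035a") gives ("035", "a") and column_to_index(1) gives 0.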

Skip any fields that aren't in the input file.

This must be done before moving on to the next steps.

Specify disqualifying issues

Choose this option to determine what issues will cause the Validator to fail a specific title. The Validator will always check for every issue and report when it finds them, but will only fail titles on the issues you specify.
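The distinction between reporting an issue and failing on it can be sketched as a set intersection. This is only an illustration of the idea: the function name and the issue names used in the example are hypothetical and are not taken from the Validator's glossary.

```python
def evaluate_title(found_issues, disqualifying_issues):
    """Return (is_valid, failing_issues) for one title.

    found_issues: every issue detected for the title (always reported).
    disqualifying_issues: the subset of issues the user chose to fail on.
    """
    # A title fails only on issues that are both found and disqualifying.
    failing = sorted(set(found_issues) & set(disqualifying_issues))
    return (not failing, failing)


# Hypothetical issue names, for illustration only.
found = ["issn_mismatch", "title_mismatch"]
strict_result = evaluate_title(found, {"issn_mismatch", "title_mismatch"})
lenient_result = evaluate_title(found, set())
```

In this sketch the same two issues are found in both cases, but only the strict configuration marks the title as invalid.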

There is a glossary of disqualifying issues in the documentation.

Process input and WorldCat MARC to create outputs

This runs the Validator proper. The script will ask if you want to delete any files that are in your output folder. If you don't do so, the Validator will add a number to the name of new files it creates. So if you already have CRL checklist.xlsx, it will create CRL checklist(1).xlsx.

The Validator will run and produce two spreadsheets for every institution, which will go in the output directory. In addition, if a MARC file has any validation errors, the Validator will produce a text file detailing them.
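The numbering scheme can be approximated as follows. This is a sketch under the assumption that the counter starts at 1 and increases until an unused name is found; the function name is hypothetical.

```python
from pathlib import Path


def next_available_name(output_dir, filename):
    """Return a path in output_dir that does not collide with existing files.

    Assumption: collisions are resolved by appending (1), (2), ... to the
    file stem, as in "CRL checklist(1).xlsx".
    """
    output_dir = Path(output_dir)
    candidate = output_dir / filename
    stem, suffix = candidate.stem, candidate.suffix
    counter = 1
    while candidate.exists():
        candidate = output_dir / f"{stem}({counter}){suffix}"
        counter += 1
    return candidate
```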

While the Validator is running, the process will download MARC records from the WorldCat API for any OCLC number that it has not updated in the last year. This means that initial runs of a set of data can be much longer than later runs. The speed of the API work will depend heavily on network conditions, both locally and at OCLC.
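The one-year refresh rule amounts to a simple age check on each cached record. The sketch below assumes that "a year" means 365 days and that the last update time for each OCLC number is available in a dictionary; the names used here are hypothetical, and the real program keeps its records in a database.

```python
from datetime import datetime, timedelta

# Assumption: "the last year" is treated as 365 days.
MAX_AGE = timedelta(days=365)


def needs_download(last_updated, now=None):
    """Return True if a record is missing or older than MAX_AGE."""
    if last_updated is None:
        return True  # never downloaded
    now = now or datetime.now()
    return now - last_updated > MAX_AGE


def oclc_numbers_to_fetch(oclc_numbers, last_updated_by_number, now=None):
    """Filter a list of OCLC numbers down to those that need a fresh download."""
    return [
        number
        for number in oclc_numbers
        if needs_download(last_updated_by_number.get(number), now)
    ]
```

On a first run every record is missing, so every OCLC number is fetched. On later runs only stale or new numbers are, which is why later runs are faster.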

For more on the outputs, see the glossary of terms for the "Originally from" output worksheet, the "For review" output worksheet, and the "Checklist" output worksheet.

More information

Output files

Output files will be in the output directory, with separate output files for each institution in the input set. The main output file of the CRL Serials Validator is a spreadsheet called something like INSTITUTION_NAME for review.xlsx. Additional outputs currently include separate tsv files with the validated (INSTITUTION_NAME for loading.txt) and failed (INSTITUTION_NAME failed.txt) records as printed in the spreadsheet file; separate files of validated and failed MARC records when a MARC input file is given; and a file of specialized output for making Local Holdings Records for the validated titles.

Data files

Besides its regular output files, the Validator will create a file called marc_database.db (for storing downloaded MARC records), a file called api_keys.yaml (to store WorldCat API keys), and a general configuration file called validator_config.ini. The validator_config.ini file will be put in the data folder in the CRL Serials Validator's main folder. The other two files will go in the user's data directory in a folder called CRL. On Windows this will usually be found at C:\Users\USERNAME\AppData\Local\CRL\CRL. On most Linux systems it will be at /home/USERNAME/.local/share/CRL.

To create a "portable" version of the application, move the marc_database.db and api_keys.yaml files to the data folder in the main application folder. After that the application will default to using those files and won't try to create files in other places on the user's drive.
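The lookup order described here can be sketched as "use the copy in the application's data folder if it exists, otherwise fall back to the user's data directory". This is an illustration of that rule only; the function name and its parameters are hypothetical, and the Validator's actual lookup logic may differ.

```python
from pathlib import Path


def resolve_data_file(filename, app_data_dir, user_data_dir):
    """Pick the location to use for a data file such as marc_database.db.

    app_data_dir: the data folder inside the main application folder.
    user_data_dir: the per-user CRL folder (for example
        /home/USERNAME/.local/share/CRL on most Linux systems).
    """
    portable_copy = Path(app_data_dir) / filename
    if portable_copy.exists():
        # A file in the application's data folder takes priority ("portable" mode).
        return portable_copy
    return Path(user_data_dir) / filename
```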

If you have a copy of the ISSN database, it should be put in the same directory that contains marc_database.db.

JSTOR

If you want the system to check whether or not a title is in JSTOR, add a file of JSTOR ISSNs to the data folder. The file should be a plain text list containing only JSTOR ISSNs, and its name should start with "jstor" and end with ".txt". It doesn't have to be called jstor.txt, but that would be an obvious option.
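A minimal sketch of how such a file could be located and read is shown below. It assumes one ISSN per line and case-insensitive filename matching, neither of which is stated above, and the function names are hypothetical.

```python
from pathlib import Path


def find_jstor_file(data_dir):
    """Return the first file named jstor*.txt in data_dir, or None."""
    for path in sorted(Path(data_dir).iterdir()):
        name = path.name.lower()  # assumption: matching ignores case
        if path.is_file() and name.startswith("jstor") and name.endswith(".txt"):
            return path
    return None


def load_jstor_issns(path):
    """Read ISSNs from a text file, assuming one ISSN per line."""
    with open(path, encoding="utf-8") as handle:
        # Blank lines are skipped; surrounding whitespace is stripped.
        return {line.strip() for line in handle if line.strip()}
```

A title's ISSN can then be tested with a simple membership check against the returned set.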

ISSN Database

The project can use, but does not require, data from the ISSN International Centre stored in an SQLite database. Users outside of CRL will need a separate license from the ISSN International Centre to use ISSN data. Tools for creating a database from raw ISSN MARC data will be included in a future version of the CRL Serials Validator.

Test data

There are test input files in the test_inputs folder. Copy some or all of them into the input folder to use them.

Other documentation