KBNLresearch / omSipCreator

Create ingest-ready SIPs from batches of optical media images
Apache License 2.0

About

OmSipCreator is a tool for converting batches of disk images (e.g. ISO 9660 CD-ROM images, raw floppy disk images, but also ripped audio files) into SIPs that are ready for ingest in an archival system. This includes automatic generation of METS metadata files with structural and bibliographic metadata. Bibliographic metadata are extracted from the KB general catalogue, and converted to MODS format. OmSipCreator also performs various quality checks on the input batches. Finally, it can be used to remove erroneous entries from a batch.

Notes and warnings

At the moment this software is still a somewhat experimental proof-of-concept that hasn't had much testing. Neither the current batch input format nor the SIP output format (including METS metadata) has been finalised, and both may be subject to further changes.

Also, the (bibliographic) metadata component is specific to the situation and infrastructure at the KB, although it could easily be adapted to other infrastructures. To do this you would need to customize the createMODS function.

Dependencies

OmSipCreator was developed and tested under Python 3.6. It may (but is not guaranteed to) work under Python 2.7 as well. If you run it under Linux, you need to install (a recent version of) MediaInfo. Installation instructions can be found here. OmSipCreator expects that the mediainfo binary is located under /usr/bin (which is the default installation location when installing from a Debian package). A Windows version of MediaInfo is already included with OmSipCreator.

Installation

The recommended way to install omSipCreator is to use pip. The following command will install omSipCreator and its dependencies:

pip install omSipCreator

Usage

OmSipCreator has three sub-commands:

Verify a batch without writing any SIPs

omSipCreator [--nochecksums] verify batchIn

Here batchIn is the batch directory. Optionally you may use the --nochecksums / -n flag, which will bypass checksum verification (which can be useful to speed up the verification process for large files). Note that the prune and write commands (explained below) will always do a checksum verification.

Create a sanitised version of a batch

omSipCreator prune batchIn batchErr

Here batchErr is the name of the batch that will contain all PPNs that have problems. If batchErr is an existing directory, all of its contents will be overwritten! OmSipCreator will prompt you for confirmation if this happens:

This will overwrite existing directory 'failed' and remove its contents!
Do you really want to proceed (Y/N)? >

Verify a batch and write SIPs

omSipCreator write batchIn dirOut

Here dirOut is the directory where the SIPs will be created. If dirOut is an existing directory, all of its contents will be overwritten! OmSipCreator will prompt you for confirmation if this happens:

This will overwrite existing directory 'sipsOut' and remove its contents!
Do you really want to proceed (Y/N)? > 

How to use the verify, prune and write commands

The important thing is that any errors in the input batch are likely to result in SIP output that is either unexpected or just plain wrong. So always verify each batch first, and fix any errors if necessary. The recommended workflow is:

  1. Always first run omSipCreator in verify mode.
  2. If this results in any reported errors, fix them by running in prune mode.
  3. Double-check the sanitised batch by running in verify mode once more.
  4. Once no errors are reported, create the SIPs by running in write mode.
  5. Finally, fix any 'error' batches that were generated by the prune command (this may involve manual processing/editing), verify them and then create the SIPs by running in write mode.
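
The sequence above can be summarised as a short Python sketch that only builds and prints the command lines, without running them. The batch name myBatch is hypothetical; failed and sipsOut are taken from the confirmation prompts shown earlier. How omSipCreator signals errors to a calling script is not described here, so the decision whether step 2 is needed is left to the operator. Note also that prune and write may ask for confirmation interactively.

```python
def workflow_commands(batch_in, batch_err, dir_out, skip_checksums=False):
    """Return the omSipCreator command lines for steps 1 to 4, in order."""
    verify = ["omSipCreator"]
    if skip_checksums:
        # Optional flag that bypasses checksum verification in verify mode
        verify.append("--nochecksums")
    verify += ["verify", batch_in]
    return [
        list(verify),                                    # step 1: verify
        ["omSipCreator", "prune", batch_in, batch_err],  # step 2: only if errors were reported
        list(verify),                                    # step 3: verify again
        ["omSipCreator", "write", batch_in, dir_out],    # step 4: write the SIPs
    ]


# Hypothetical batch name; the other two names come from the prompts above
for command in workflow_commands("myBatch", "failed", "sipsOut"):
    print(" ".join(command))
```

Step 5 then repeats the same sequence with the 'error' batch as input.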

Structure of input batch

The input batch is simply a directory that contains a number of subdirectories, each of which represents exactly one data carrier. Furthermore it contains a batch manifest, which is a comma-delimited text file with basic metadata about each carrier, and a log file with details about the imaging and ripping procedure. The diagram below shows an example of a batch that contains 3 carriers (one audio CD and two CD-ROMs):

├── 1c2d6edc-34a7-11e7-8332-7446a0b42b9a
│   ├── 01.flac
│   ├── 02.flac
│   ├── cd-info.log
│   ├── checksums.sha512
│   └── dbpoweramp.log
├── 3cba3e5e-34a7-11e7-8bd1-7446a0b42b9a
│   ├── cd-info.log
│   ├── checksums.sha512
│   ├── isobuster.log
│   ├── isobuster-report.xml
│   └── NEW.iso
├── 61c3e58a-34a6-11e7-98d9-7446a0b42b9a
│   ├── cd-info.log
│   ├── checksums.sha512
│   ├── isobuster.log
│   ├── isobuster-report.xml
│   └── SPELEN_MET_KIKKER.iso
├── batch.log
└── manifest.csv

Carrier directory structure

Each carrier directory contains:

  1. One or more files that represent the data carrier. This is typically an ISO 9660 (or HFS+ or UDF) image, but for an audio CD with multiple tracks this can also be multiple audio (e.g. WAV or FLAC) files. In the latter case, it is important that the original playing order can be inferred from the file names. In other words, sorting the file names in ascending order should reproduce the original playing order. Note that (nearly?) all audio CD ripping software applications do this by default.
  2. A file cd-info.log with output of the cd-info tool.
  3. A file isobuster.log with an Isobuster error code (only for carriers that contain a data session).
  4. A file isobuster-report.xml which is a report file in Digital Forensics XML format (only for carriers that contain a data session).
  5. A file dbpoweramp.log which is the dbpoweramp log file (only for carriers that contain audio).
  6. A file checksums.sha512 which contains the SHA-512 checksums of all files in the directory. Each line in the file has the following format:

    checksum filename

    Both fields are separated by 1 or more spaces. The filename field must not include any file path information. Here's an example:

    6bc4f0a53e9d866b751beff5d465f5b86a8a160d388032c079527a9cb7cabef430617f156abec03ff5a6897474ac2d31c573845d1bb99e2d02ca951da8eb2d01 01.flac
    ae6d9b5d47ecc34345bdbf5a0c45893e88b5ae4bb2927a8f053debdcd15d035827f8b81a97d3ee4c4ace5257c4cc0cde13b37ac816186e84c17b94c9a04a1608 02.flac
    ::
    ::
    49b0a0d2f40d9ca1d7201cb544e09d69f1162dd8a846c2c3d257e71bc28643c015d7bc458ca693ee69d5db528fb2406021ed0142f26a423c6fb4f115d3fa58e7 20.flac
    d9fa0b5df358a1ad035a9c5dbb3a882f1286f204ee1f405e9d819862c00590b1d11985c5e80d0004b412901a5068792cd48e341ebb4fe35e360c3eeec33a1f23 cd-info.log
    fa8898fc1c8fe047c1b45975fd55ef6301cfdfe28d59a1e3f785aa3052795cad7a9eff5ce6658207764c52fa9d5cf16808b0fc1cfe91f8c866586e37f0b47d08 dbpoweramp.log
    783ae6ac53eba33b8ab04363e1159a71a38d2db2f8004716a1dc6c4e11581b4311145f07834181cd7ec77cd7199377286ceb5c3506f0630939112ae1d55e3d47 ELL2.iso
    31bca02094eb78126a517b206a88c73cfa9ec6f704c7030d18212cace820f025f00bf0ea68dbf3f3a5436ca63b53bf7bf80ad8d5de7d8359d0b7fed9dbc3ab99 isobuster.log
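
The checksum file format described in item 6 can be checked with a few lines of Python. This is a minimal sketch, not omSipCreator's own verification code. The function names are made up for illustration, and it assumes the file is UTF-8 encoded text with one "checksum filename" pair per line.

```python
import hashlib
import os


def read_checksum_file(path):
    """Parse lines of the form '<checksum> <filename>' into a dict."""
    entries = {}
    with open(path, "r", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            # Split on the first run of whitespace only, so that file
            # names containing spaces stay intact
            parts = line.split(None, 1)
            if len(parts) != 2:
                raise ValueError("malformed checksum line: %r" % line)
            checksum, filename = parts
            entries[filename] = checksum.lower()
    return entries


def sha512_of(path, block_size=1024 * 1024):
    """Compute the SHA-512 hex digest of a file, reading it in blocks."""
    digest = hashlib.sha512()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def verify_carrier(carrier_dir, checksum_name="checksums.sha512"):
    """Return a list of (filename, problem) tuples; empty means all OK."""
    problems = []
    expected = read_checksum_file(os.path.join(carrier_dir, checksum_name))
    for filename, checksum in expected.items():
        path = os.path.join(carrier_dir, filename)
        if not os.path.isfile(path):
            problems.append((filename, "missing"))
        elif sha512_of(path) != checksum:
            problems.append((filename, "checksum mismatch"))
    return problems
```

Reading the files in blocks keeps memory use flat, which matters for large ISO images.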

Batch manifest format

The batch manifest is a comma-delimited text file with the name manifest.csv. The first line is a header line:

jobID,PPN,volumeNo,carrierType,title,volumeID,success,containsAudio,containsData,cdExtra

Each of the remaining lines represents one carrier, for which it contains the following fields:

  1. jobID - internal carrier-level identifier (in our case this is generated by our iromlab software). The image file(s) of this carrier are stored in an eponymous directory within the batch.
  2. PPN - identifier of the physical item in the KB Collection to which this carrier belongs. For the KB case this is the PPN identifier in the KB catalogue.
  3. volumeNo - for PPNs that span multiple carriers, this defines the volume number (1 for single-volume items). Values must be unique within each carrierType (see below)
  4. carrierType - code that specifies the carrier type. Currently the following values are permitted:
    • cd-rom
    • dvd-rom
    • cd-audio
    • dvd-video
  5. title - text string with the title of the carrier (or the publication it is part of). Not used by omSipCreator.
  6. volumeID - text string, extracted from the Primary Volume Descriptor, empty if cd-audio. Not used by omSipCreator.
  7. success - True/False flag that indicates status of iromlab's imaging process.
  8. containsAudio - True/False flag that indicates the carrier contains audio tracks (detected by cd-info)
  9. containsData - True/False flag that indicates the carrier contains data tracks (detected by cd-info)
  10. cdExtra - True/False flag that indicates the carrier is an 'enhanced' CD with both audio and data tracks that are located in separate sessions (detected by cd-info)

Below is a simple example of a manifest file:

jobID,PPN,volumeNo,carrierType,title,volumeID,success,containsAudio,containsData,cdExtra
383c78fa-34a6-11e7-926c-7446a0b42b9a,18594664X,1,cd-rom,Marjan Berk,ELL3,True,True,True,True
61c3e58a-34a6-11e7-98d9-7446a0b42b9a,230370241,1,cd-rom,Kikker is verliefd,SPELEN_MET_KIKKER,True,False,True,False
06e80cb6-34a7-11e7-8466-7446a0b42b9a,378374036,1,cd-audio,Na klar!. Luister- en kijkboxen. 6 vwo,,True,True,False,False
3cba3e5e-34a7-11e7-8bd1-7446a0b42b9a,378374036,1,dvd-video,Na klar!. Luister- en kijkboxen. 6 vwo,NEW,True,False,True,False

In the above example the third and fourth carriers are both part of a 2-volume item. Consequently the PPN values of both carriers are identical.
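
The field rules above lend themselves to a simple consistency check. The sketch below is an illustration only and does not reproduce omSipCreator's own checks. The function name is hypothetical, and it reads "unique within each carrierType" as: no two carriers of the same PPN and carrierType may share a volumeNo.

```python
import csv
import io

PERMITTED_CARRIER_TYPES = {"cd-rom", "dvd-rom", "cd-audio", "dvd-video"}
FLAG_FIELDS = ("success", "containsAudio", "containsData", "cdExtra")


def check_manifest(handle):
    """Return a list of problem descriptions for an open manifest file."""
    problems = []
    seen = set()  # (PPN, carrierType, volumeNo) combinations seen so far
    # skipinitialspace tolerates a stray space after a comma in the header
    reader = csv.DictReader(handle, skipinitialspace=True)
    for row in reader:
        line_no = reader.line_num
        if row.get("carrierType") not in PERMITTED_CARRIER_TYPES:
            problems.append("line %d: carrierType not permitted" % line_no)
        for field in FLAG_FIELDS:
            if row.get(field) not in ("True", "False"):
                problems.append("line %d: %s is not True/False" % (line_no, field))
        # Assumed reading of the uniqueness rule for volumeNo
        key = (row.get("PPN"), row.get("carrierType"), row.get("volumeNo"))
        if key in seen:
            problems.append("line %d: duplicate volumeNo" % line_no)
        seen.add(key)
    return problems


# Header line plus the last two rows of the example manifest
sample = io.StringIO(
    "jobID,PPN,volumeNo,carrierType,title,volumeID,success,containsAudio,containsData,cdExtra\n"
    "06e80cb6-34a7-11e7-8466-7446a0b42b9a,378374036,1,cd-audio,Na klar!. Luister- en kijkboxen. 6 vwo,,True,True,False,False\n"
    "3cba3e5e-34a7-11e7-8bd1-7446a0b42b9a,378374036,1,dvd-video,Na klar!. Luister- en kijkboxen. 6 vwo,NEW,True,False,True,False\n"
)
print(check_manifest(sample))  # prints [] for these two rows
```

Both rows share a PPN and a volumeNo of 1, but because their carrierType values differ they do not clash under this reading.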

SIP structure

Each SIP is represented as a directory. Each carrier that is part of the SIP is represented as a subdirectory within that directory. The SIP's root directory contains a METS file with technical, structural and bibliographic metadata. Bibliographic metadata is stored in MODS format (3.4) which is embedded in a METS mdWrap element. Here's a simple example of a SIP that is made up of 2 carriers (which are represented as ISO 9660 images):

269448861
├── cd-audio
│   ├── 1
│   │   └── nuvoorstraks1.iso
│   └── 2
│       └── nuvoorstraks2.iso
└── mets.xml

And here's an example of a SIP that contains 1 "enhanced" audio CD, with separate audio tracks represented as FLAC files, and the data track as an ISO image:

18594650X/
├── cd-rom
│   └── 1
│       ├── 01.flac
│       ├── 02.flac
│       ├── 03.flac
│       ├── 04.flac
│       ├── 05.flac
│       ├── 06.flac
│       ├── 07.flac
│       └── ELL2.iso
└── mets.xml
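
Judging from the two examples above, a SIP directory is named after the PPN, with one subdirectory per carrierType and, below that, one per volumeNo. The following sketch maps manifest rows onto that layout. It is inferred from the examples only and is not taken from omSipCreator's code. The function name is hypothetical, and sipsOut is just the output directory name used in the prompt shown earlier.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def plan_sip_layout(rows, dir_out):
    """Group carriers by PPN and return the directories each SIP would hold."""
    layout = defaultdict(list)
    for row in rows:
        # Assumed layout: <dirOut>/<PPN>/<carrierType>/<volumeNo>
        sip_dir = PurePosixPath(dir_out) / row["PPN"]
        carrier_dir = sip_dir / row["carrierType"] / row["volumeNo"]
        layout[str(sip_dir)].append(str(carrier_dir))
    return dict(layout)


# The two carriers of the 2-volume item from the manifest example
rows = [
    {"PPN": "378374036", "carrierType": "cd-audio", "volumeNo": "1"},
    {"PPN": "378374036", "carrierType": "dvd-video", "volumeNo": "1"},
]
for sip_dir, carrier_dirs in plan_sip_layout(rows, "sipsOut").items():
    print(sip_dir)
    for carrier_dir in carrier_dirs:
        print("  " + carrier_dir)
```

Carriers that share a PPN end up in the same SIP, and mets.xml would sit next to the carrierType directories in the SIP root.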

A detailed description of the SIP structure and its associated metadata can be found here.

Quality checks

When run in either verify or write mode, omSipCreator performs a number of checks on the input batch. Each of the following checks will result in an error in case of failure:

In write mode omSipCreator performs the following additional checks:

Finally, omSipCreator will report a warning in the following situations:

Both situations may indicate a data entry error, but they may also reflect that the physical carriers are simply missing.

Developer documentation

See Documentation of modules and processing flow

Contributors

Written by Johan van der Knijff, except sru.py which was adapted from the KB Python API which is written by WillemJan Faber. The KB Python API is released under the GNU GENERAL PUBLIC LICENSE.

License

OmSipCreator is released under the Apache License 2.0. The KB Python API is released under the GNU GENERAL PUBLIC LICENSE. MediaInfo is released under the BSD 2-Clause License; Copyright (c) 2002-2017, MediaArea.net SARL. All rights reserved. See the tools/mediainfo directory for the license statement of MediaInfo.