OmSipCreator is a tool for converting batches of disk images (e.g. ISO 9660 CD-ROM images, raw floppy disk images, but also ripped audio files) into SIPs that are ready for ingest in an archival system. This includes automatic generation of METS metadata files with structural and bibliographic metadata. Bibliographic metadata are extracted from the KB general catalogue, and converted to MODS format. OmSipCreator also performs various quality checks on the input batches. Finally, it can be used to remove erroneous entries from a batch.
At the moment this software is an experimental proof-of-concept that hasn't had much testing. Neither the current batch input format nor the SIP output format (including METS metadata) has been finalised, and both may be subject to further changes.
Also, the (bibliographic) metadata component is specific to the situation and infrastructure at the KB, although it could easily be adapted to other infrastructures. To do this you would need to customize the createMODS function.
OmSipCreator was developed and tested under Python 3.6. It may (but is not guaranteed to) work under Python 2.7 as well. If you run it under Linux, you need to install (a recent version of) MediaInfo. Installation instructions can be found here. OmSipCreator expects that the mediainfo binary is located under /usr/bin (which is the default installation location when installing from a Debian package). A Windows version of MediaInfo is already included with OmSipCreator.
The recommended way to install omSipCreator is to use pip. The following command will install omSipCreator and its dependencies:
pip install omSipCreator
OmSipCreator has three sub-commands:
omSipCreator [--nochecksums] verify batchIn
Here batchIn is the batch directory. Optionally you may use the --nochecksums / -n flag, which will bypass checksum verification (which can be useful to speed up the verification process for large files). Note that the prune and write commands (explained below) will always do a checksum verification.
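If you want to call the verify sub-command from a script, the sketch below shows one way to build the command line given above. It assumes that the omSipCreator executable is on the PATH; the function names verify_command and run_verify are hypothetical and not part of omSipCreator. The meaning of the exit status is not documented here, so the sketch only passes it back to the caller.

```python
import subprocess


def verify_command(batch_in, no_checksums=False):
    """Build the argument list for 'omSipCreator [--nochecksums] verify batchIn'."""
    command = ["omSipCreator"]
    if no_checksums:
        # Optional flag that bypasses checksum verification
        command.append("--nochecksums")
    command += ["verify", batch_in]
    return command


def run_verify(batch_in, no_checksums=False):
    """Run the verify sub-command and return its exit status.

    Assumption: the omSipCreator executable can be found on the PATH.
    """
    result = subprocess.run(verify_command(batch_in, no_checksums))
    return result.returncode
```

For example, run_verify("myBatch", no_checksums=True) would verify a (hypothetical) batch directory myBatch without checksum verification.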
omSipCreator prune batchIn batchErr
Here batchErr is the name of the batch that will contain all PPNs that have problems. If batchErr is an existing directory, all of its contents will be overwritten! OmSipCreator will prompt you for confirmation if this happens:
This will overwrite existing directory 'failed' and remove its contents!
Do you really want to proceed (Y/N)? >
omSipCreator write batchIn dirOut
Here dirOut is the directory where the SIPs will be created. If dirOut is an existing directory, all of its contents will be overwritten! OmSipCreator will prompt you for confirmation if this happens:
This will overwrite existing directory 'sipsOut' and remove its contents!
Do you really want to proceed (Y/N)? >
The important thing is that any errors in the input batch are likely to result in SIP output that is either unexpected or just plain wrong. So always verify each batch first, and fix any errors if necessary.
The input batch is simply a directory that contains a number of subdirectories, each of which represents exactly one data carrier. Furthermore it contains a batch manifest, which is a comma-delimited text file with basic metadata about each carrier, and a log file with details about the imaging and ripping procedure. The diagram below shows an example of a batch that contains 3 carriers (one audio CD and two CD-ROMs):
├── 1c2d6edc-34a7-11e7-8332-7446a0b42b9a
│ ├── 01.flac
│ ├── 02.flac
│ ├── cd-info.log
│ ├── checksums.sha512
│ └── dbpoweramp.log
├── 3cba3e5e-34a7-11e7-8bd1-7446a0b42b9a
│ ├── cd-info.log
│ ├── checksums.sha512
│ ├── isobuster.log
│ ├── isobuster-report.xml
│ └── NEW.iso
├── 61c3e58a-34a6-11e7-98d9-7446a0b42b9a
│ ├── cd-info.log
│ ├── checksums.sha512
│ ├── isobuster.log
│ ├── isobuster-report.xml
│ └── SPELEN_MET_KIKKER.iso
├── batch.log
└── manifest.csv
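As a rough illustration of this layout, the following sketch lists the carrier directories of a batch and reports whether manifest.csv and batch.log are present. It is not omSipCreator's own verification code; the function name describe_batch is hypothetical.

```python
from pathlib import Path


def describe_batch(batch_dir):
    """Return (carrier directory names, names of missing batch-level files)."""
    batch_dir = Path(batch_dir)
    # Every subdirectory of the batch is taken to represent one carrier
    carriers = sorted(p.name for p in batch_dir.iterdir() if p.is_dir())
    # The batch manifest and the log file live directly in the batch directory
    missing = [name for name in ("manifest.csv", "batch.log")
               if not (batch_dir / name).is_file()]
    return carriers, missing
```

For the example batch above this would return the three carrier directory names and an empty list of missing files.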
Each carrier directory contains:
A file checksums.sha512 which contains the SHA-512 checksums of all files in the directory. Each line in the file has the following format:
checksum filename
Both fields are separated by 1 or more spaces. The filename field must not include any file path information. Here's an example:
6bc4f0a53e9d866b751beff5d465f5b86a8a160d388032c079527a9cb7cabef430617f156abec03ff5a6897474ac2d31c573845d1bb99e2d02ca951da8eb2d01 01.flac
ae6d9b5d47ecc34345bdbf5a0c45893e88b5ae4bb2927a8f053debdcd15d035827f8b81a97d3ee4c4ace5257c4cc0cde13b37ac816186e84c17b94c9a04a1608 02.flac
::
::
49b0a0d2f40d9ca1d7201cb544e09d69f1162dd8a846c2c3d257e71bc28643c015d7bc458ca693ee69d5db528fb2406021ed0142f26a423c6fb4f115d3fa58e7 20.flac
d9fa0b5df358a1ad035a9c5dbb3a882f1286f204ee1f405e9d819862c00590b1d11985c5e80d0004b412901a5068792cd48e341ebb4fe35e360c3eeec33a1f23 cd-info.log
fa8898fc1c8fe047c1b45975fd55ef6301cfdfe28d59a1e3f785aa3052795cad7a9eff5ce6658207764c52fa9d5cf16808b0fc1cfe91f8c866586e37f0b47d08 dbpoweramp.log
783ae6ac53eba33b8ab04363e1159a71a38d2db2f8004716a1dc6c4e11581b4311145f07834181cd7ec77cd7199377286ceb5c3506f0630939112ae1d55e3d47 ELL2.iso
31bca02094eb78126a517b206a88c73cfa9ec6f704c7030d18212cace820f025f00bf0ea68dbf3f3a5436ca63b53bf7bf80ad8d5de7d8359d0b7fed9dbc3ab99 isobuster.log
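The sketch below shows how a checksum file in this format could be read and checked against the files in a carrier directory. It is a minimal illustration using Python's standard hashlib module, not the code omSipCreator itself uses; the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def parse_checksum_line(line):
    """Split one line into (checksum, filename); fields are separated by 1 or more spaces."""
    checksum, filename = line.strip().split(None, 1)
    return checksum.lower(), filename


def sha512_of(path, block_size=1024 * 1024):
    """Compute the SHA-512 hex digest of a file, reading it block by block."""
    digest = hashlib.sha512()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def verify_carrier(carrier_dir, checksum_file="checksums.sha512"):
    """Return the names of files that are missing or whose checksum does not match."""
    carrier_dir = Path(carrier_dir)
    failures = []
    with open(carrier_dir / checksum_file, encoding="utf-8") as handle:
        for line in handle:
            if not line.strip():
                continue  # skip blank lines
            expected, filename = parse_checksum_line(line)
            target = carrier_dir / filename
            if not target.is_file() or sha512_of(target) != expected:
                failures.append(filename)
    return failures
```

Reading each file in blocks keeps memory use low for large ISO images. An empty result from verify_carrier means that every listed file exists and matches its checksum.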
The batch manifest is a comma-delimited text file with the name manifest.csv. The first line is a header line:
jobID,PPN,volumeNo,carrierType,title,volumeID,success,containsAudio,containsData,cdExtra
Each of the remaining lines represents one carrier, for which it contains the following fields:
Below is a simple example of a manifest file:
jobID,PPN,volumeNo,carrierType,title,volumeID,success,containsAudio,containsData,cdExtra
383c78fa-34a6-11e7-926c-7446a0b42b9a,18594664X,1,cd-rom,Marjan Berk,ELL3,True,True,True,True
61c3e58a-34a6-11e7-98d9-7446a0b42b9a,230370241,1,cd-rom,Kikker is verliefd,SPELEN_MET_KIKKER,True,False,True,False
06e80cb6-34a7-11e7-8466-7446a0b42b9a,378374036,1,cd-audio,Na klar!. Luister- en kijkboxen. 6 vwo,,True,True,False,False
3cba3e5e-34a7-11e7-8bd1-7446a0b42b9a,378374036,1,dvd-video,Na klar!. Luister- en kijkboxen. 6 vwo,NEW,True,False,True,False
In the above example the third and fourth carriers are both part of a 2-volume item. Consequently the PPN values of both carriers are identical.
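To illustrate how carriers that share a PPN belong together, the following sketch reads a manifest with Python's csv module and groups its rows by PPN. It is only an illustration of the file format, not omSipCreator's own parsing code; the function name carriers_by_ppn is hypothetical.

```python
import csv
import io
from collections import defaultdict


def carriers_by_ppn(manifest_text):
    """Group manifest rows by PPN; each row is a dict keyed by the header names."""
    # skipinitialspace tolerates a stray space after a comma in the header line
    reader = csv.DictReader(io.StringIO(manifest_text), skipinitialspace=True)
    groups = defaultdict(list)
    for row in reader:
        groups[row["PPN"]].append(row)
    return dict(groups)
```

Applied to the example manifest above, the group for PPN 378374036 contains two rows (the cd-audio and dvd-video carriers), while the other two PPNs each have one row. Note that all values, including True and False, are returned as strings.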
Each SIP is represented as a directory. Each carrier that is part of the SIP is represented as a subdirectory within that directory. The SIP's root directory contains a METS file with technical, structural and bibliographic metadata. Bibliographic metadata is stored in MODS format (3.4) which is embedded in a METS mdWrap element. Here's a simple example of a SIP that is made up of 2 carriers (which are represented as ISO 9660 images):
269448861
├── cd-audio
│ ├── 1
│ │ └── nuvoorstraks1.iso
│ └── 2
│ └── nuvoorstraks2.iso
└── mets.xml
And here's an example of a SIP that contains 1 "enhanced" audio CD, with separate audio tracks represented as FLAC files, and the data track as an ISO image:
18594650X/
├── cd-rom
│ └── 1
│ ├── 01.flac
│ ├── 02.flac
│ ├── 03.flac
│ ├── 04.flac
│ ├── 05.flac
│ ├── 06.flac
│ ├── 07.flac
│ └── ELL2.iso
└── mets.xml
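Judging from the two examples above, a file ends up under a path made up of the PPN, the carrier type and the volume number. The sketch below expresses that layout in code. The layout is inferred from the examples only, and the function name carrier_path_in_sip is hypothetical; see the detailed SIP description for the authoritative structure.

```python
from pathlib import PurePosixPath


def carrier_path_in_sip(ppn, carrier_type, volume_no, filename):
    """Build the relative path of a carrier file inside a SIP directory.

    Assumed layout (inferred from the examples): PPN/carrierType/volumeNo/filename
    """
    return PurePosixPath(ppn) / carrier_type / str(volume_no) / filename
```

For instance, carrier_path_in_sip("269448861", "cd-audio", 2, "nuvoorstraks2.iso") corresponds to the second carrier in the first example.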
A detailed description of the SIP structure and its associated metadata can be found here.
When run in either verify or write mode, omSipCreator performs a number of checks on the input batch. Each of the following checks will result in an error in case of failure:
In write mode omSipCreator performs the following additional checks:
Finally, omSipCreator will report a warning in the following situations:
Both situations may indicate a data entry error, but they may also reflect that the physical carriers are simply missing.
See Documentation of modules and processing flow
Written by Johan van der Knijff, except sru.py which was adapted from the KB Python API which is written by WillemJan Faber. The KB Python API is released under the GNU GENERAL PUBLIC LICENSE.
OmSipCreator is released under the Apache License 2.0. The KB Python API is released under the GNU GENERAL PUBLIC LICENSE. MediaInfo is released under the BSD 2-Clause License; Copyright (c) 2002-2017, MediaArea.net SARL. All rights reserved. See the tools/mediainfo directory for the license statement of MediaInfo.