Bulkrax testing v.1: basic filesets and required metadata

emory-libraries / dlp-curate

Digital curation and preservation workbench for the Emory Preservation Repository.

This ticket relates to the larger goals of #1837. For a first version of testing Bulkrax, we want to achieve a minimal functional installation and configuration of Bulkrax in Curate. See their documentation for CSV Ingest and Configuration options.

Install Bulkrax and configure a CSV Importer that populates simple filesets (1 file per fileset) and Curate required metadata fields for works/filesets:

- title [used for work titles and fileset labels]
- holding_repository [works only]
- date_created [works only]
- content_type [works only]
- emory_rights_statements [works only]
- rights_statement [works only]
- data_classifications [works only]
- visibility [works only? assume that work visibility dictates fileset visibility]
- deduplication_key [not used by Bulkrax but required for Zizia importer]
- source_collection_id [works only]
- pcdm_use [if needed, applicable to filesets only]
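As a rough pre-flight check, the required work fields listed above could be verified per CSV row before handing the file to an importer. This is a minimal sketch, not part of Bulkrax or Curate: the constant and function names are hypothetical, and the column list simply mirrors the fields marked "works only" plus title. deduplication_key and pcdm_use are left out here because the list describes them as not used by Bulkrax and optional, respectively.

```python
# Hypothetical helper: the column names come from the list above,
# everything else (names, structure) is an assumption for illustration.
REQUIRED_WORK_COLUMNS = [
    "title",
    "holding_repository",
    "date_created",
    "content_type",
    "emory_rights_statements",
    "rights_statement",
    "data_classifications",
    "visibility",
    "source_collection_id",
]


def missing_work_values(row):
    """Return the required work columns that are absent or blank in `row`.

    `row` is a dict mapping CSV header names to cell values, such as one
    produced by csv.DictReader.
    """
    return [
        column
        for column in REQUIRED_WORK_COLUMNS
        if not (row.get(column) or "").strip()
    ]
```

Each row read with `csv.DictReader` can be passed to `missing_work_values`; an empty list means every required work field has a value.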

Determine how to populate Bulkrax's required source_identifier field

For a first pass, using a Bulkrax-generated id may be easiest

Ingest a representative basic image collection e.g. Oxford Asian Artifacts Collection

See the starter CSV Importer template mapped to real metadata for the Oxford Asian Artifacts Collection.
Column names in red are required for Bulkrax. source_identifier and parents are currently TBD until we figure out the best strategy to populate a shared Solr field across works, Collections, and filesets.
Our primary Curate id is generated at the time of ingest, so we can't predict it in advance. Works have deduplication_key, but this is not currently populated for Collections or Filesets.
We want all filesets to be related to their parent works (currently noted in deduplication_key) and all works related to the parent collection (currently noted in source_collection_id). The value in source_collection_id will vary depending on the environment (local, arch, test, prod), so this may need to be revised in the sample CSV template.
File paths are noted relative to our EFS share, where Zizia imports files from, but may need to be revised depending on where the files need to be accessed from for import.

PR made: https://github.com/emory-libraries/dlp-curate/pull/1851

NOTE: Besides the headers listed above, model, file, parent, and file_types are also necessary for successful imports with Bulkrax.

model: Collection, CurateGenericWork, or FileSet; defaults to FileSet, but I don't recommend leaving this blank.
file: The import method that has been shown to work uses a ZIP file. The CSV sits at the root of the ZIP, while the images/other items are contained inside a files folder. Unlike Zizia, the files to be attached to each FileSet should all be listed in this field on the same line, and multiples (according to the documentation) should be separated by a semicolon. See this for an example: https://app.zenhub.com/files/158455630/d0128792-c3c2-481c-8391-b0d4eb0b29d2/download The files should be listed only by name plus extension, with no path.
parent: A reference to the container above this item. If importing a FileSet, this should be the containing CurateGenericWork. This should contain one of two strings:
- The value stored in the level above's deduplication_key.
- Its parent's id.
file_types: This field was needed to accommodate our filetype customization. It should contain one string in which each filename is coupled with its filetype by a colon. Multiples should be joined by a pipe. For example: "AmericanTail.jpg:preservation_master_file|BackToTheFuture.png:service_file". This field defaults to :preservation_master_file if a filetype isn't found correctly.
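The file and file_types conventions and the ZIP layout described above can be sketched as follows. This is only an illustration under stated assumptions: the function names and the `import.csv` filename are hypothetical, the fallback in `parse_file_types` approximates the "defaults to :preservation_master_file" behavior rather than reproducing Curate's actual code, and the `files/` folder name is taken from the description of the ZIP layout.

```python
import io
import zipfile

# Fallback named in the file_types description above.
DEFAULT_FILE_TYPE = "preservation_master_file"


def format_file_cells(files):
    """Build the `file` and `file_types` cell values for one FileSet row.

    `files` is a list of (filename, file_type) pairs. Filenames are bare
    names with extension, no path.
    """
    file_cell = ";".join(name for name, _ in files)
    file_types_cell = "|".join(f"{name}:{ftype}" for name, ftype in files)
    return file_cell, file_types_cell


def parse_file_types(cell):
    """Map each filename in a `file_types` cell to its file type.

    Entries without a type fall back to DEFAULT_FILE_TYPE (an assumption
    about how the default is applied).
    """
    mapping = {}
    for pair in (part.strip() for part in cell.split("|")):
        if not pair:
            continue
        name, _, ftype = pair.partition(":")
        mapping[name.strip()] = ftype.strip() or DEFAULT_FILE_TYPE
    return mapping


def build_import_zip(csv_text, files, csv_name="import.csv"):
    """Return ZIP bytes with the CSV at the root and binaries under files/.

    `files` maps bare filenames to their byte content. `csv_name` is a
    hypothetical default, not a name Bulkrax requires.
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr(csv_name, csv_text)
        for name, data in files.items():
            archive.writestr(f"files/{name}", data)
    return buffer.getvalue()
```

Keeping the cell formatting and parsing next to each other makes it easy to confirm that the two separators are not mixed up: semicolons between filenames in file, and colon plus pipe in file_types.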

One more note about deduplication_key: the practice of providing a value here can continue, but if it is left blank, Bulkrax will create a dynamic id for the field. Every individual Work or Fileset imported needs its own unique deduplication_key when importing with Bulkrax. The parent field can use deduplication_key to link a FileSet to its Work. Also bear in mind that Bulkrax actively uses source_collection_id to tie Works to Collections.
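The uniqueness and linking rules in this note could be checked before import with something like the sketch below. It is an assumption-laden illustration: the function name and message wording are hypothetical, blank keys are skipped because Bulkrax is described as generating an id for them, and an unmatched parent is only reported as unresolved within the CSV, since parent may also hold the id of an object that already exists in the repository.

```python
from collections import Counter


def check_links(rows):
    """Report duplicate deduplication_key values and unresolved parents.

    `rows` is a list of dicts keyed by CSV header name. Returns a list of
    human-readable messages; an empty list means nothing was flagged.
    """
    problems = []

    # Blank keys are ignored: the importer is expected to generate one.
    keys = [(row.get("deduplication_key") or "").strip() for row in rows]
    counts = Counter(key for key in keys if key)

    for key, count in counts.items():
        if count > 1:
            problems.append(f"duplicate deduplication_key: {key}")

    for row in rows:
        parent = (row.get("parent") or "").strip()
        # A parent missing from this CSV may still be a valid existing id,
        # so treat this as a warning rather than a hard error.
        if row.get("model") == "FileSet" and parent and parent not in counts:
            problems.append(f"parent not found in this CSV: {parent}")

    return problems
```

For example, two FileSet rows sharing one deduplication_key would be flagged, as would a FileSet whose parent matches no Work key in the same file.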

Bulkrax testing v.1: basic filesets and required metadata #1839