MarcusBarnes / mik

The Move to Islandora Kit is an extensible PHP command-line tool for converting source content and metadata into packages suitable for importing into Islandora (or other digital repository and preservation systems).
GNU General Public License v3.0

Add OAI to CSV toolchain to support migrations from Islandora 7.x to CLAW #463

Open mjordan opened 6 years ago

mjordan commented 6 years ago

https://github.com/Islandora-CLAW/CLAW/issues/452 asks whether we can use Drupal 8's migration API to batch ingest content into CLAW. I've got an MIK toolchain that harvests content from 7.x using OAI-PMH and writes out input for a Migrate Plus ingest. Still working on it while travelling but will have something substantially complete within a couple days.
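
For anyone unfamiliar with OAI-PMH, a harvest like this comes down to `ListRecords` requests against the 7.x site's OAI endpoint, limited to one set (collection). The sketch below only builds the request URL. It is not MIK's fetcher code, the helper name `list_records_url` is made up for illustration, and the endpoint and set name are placeholders for a local test site.

```python
from urllib.parse import urlencode


def list_records_url(endpoint, set_spec, metadata_prefix="oai_dc"):
    """Build an OAI-PMH ListRecords URL for a single set.

    verb, metadataPrefix and set are standard OAI-PMH request arguments;
    the function name and default values are illustrative assumptions.
    """
    query = urlencode(
        {
            "verb": "ListRecords",
            "metadataPrefix": metadata_prefix,
            "set": set_spec,
        }
    )
    return f"{endpoint}?{query}"


# Placeholder values for a local test site (assumed, not required by MIK).
url = list_records_url("http://localhost:8000/oai2", "doitest_collection")
print(url)
```

Opening the printed URL in a browser is a quick way to check that the endpoint exposes the set before running a full harvest.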

mjordan commented 6 years ago

BTW, doing this work is also a good test of MIK's developer documentation. I'll probably be opening a couple issues resulting from this work.

mjordan commented 6 years ago

Related issue: #378.

mjordan commented 6 years ago

Got this to the point where you can harvest a collection via OAI-PMH and end up with a CSV file similar to the one prepared by @seth-shaw-unlv at the CLAW issue linked above. Sample .ini file is:

; MIK configuration file for migrating content from an Islandora
; instance to the format required by the Migrate+ module, for ingesting
; into Islandora CLAW.

[SYSTEM]

[CONFIG]
config_id = MIK OAI to CSV toolchain
last_updated_on = "2018-04-16"
last_update_by = "Mark Jordan"

[FETCHER]
class = Oaipmh
oai_endpoint = "http://localhost:8000/oai2"
set_spec = doitest_collection
temp_directory = "/tmp/oai_to_csv_temp"

[METADATA_PARSER]
class = csv\DcToCsv
; The column named in record_key is added to the output CSV and holds each item's unique ID.
record_key = ID
; DC element names are used as CSV column headings.
dc_elements[] = title
dc_elements[] = identifier
dc_elements[] = description
dc_elements[] = format

[FILE_GETTER]
class = OaipmhIslandoraObj
temp_directory = "/tmp/oai_to_csv_temp"
datastream_ids[] = OBJ

[WRITER]
class = OaipmhCsv
output_file = "/tmp/oai_to_csv_output/metadata.csv"
output_directory = "/tmp/oai_to_csv_output"
; metadata_only = true

[MANIPULATORS]

[LOGGING]
path_to_log = "/tmp/oai_to_csv_output/mik.log"
path_to_manipulator_log = "/tmp/oai_to_csv_output/manipulator.log"

Here's the resulting CSV file:

ID,title,identifier,description,format
oai%3Adrupal-site.org%3Adoitest_16,"autogen 6 - blurg",doitest:16,"This record was harvested on a Thursday.","nonprojected graphic"
oai%3Adrupal-site.org%3Adoitest_4,"Church Holy Rosary, Vancouver B.C.",doitest:4,"Holy Rosary Church in Vancouver, B.C."
oai%3Adrupal-site.org%3Adoitest_5,"Second test object.",doitest:3,"This record was harvested on a Thursday."
oai%3Adrupal-site.org%3Adoitest_6,"Has DOI?",doitest:6,"This record was harvested on a Thursday.",globe
oai%3Adrupal-site.org%3Adoitest_12,"autogen 6",doitest:12,"This record was harvested on a Thursday.","nonprojected graphic"
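
Two things to watch for when consuming this file: the ID column holds the URL-encoded OAI identifier, and records without a `format` value come out as short rows with no trailing column. A minimal Python sketch of a tolerant reader follows. It is only an illustration, not part of MIK or Migrate Plus, and `read_rows` and the added `oai_identifier` key are hypothetical names.

```python
import csv
import io
from urllib.parse import unquote

# Header plus two rows copied from the sample output above.
SAMPLE = '''ID,title,identifier,description,format
oai%3Adrupal-site.org%3Adoitest_16,"autogen 6 - blurg",doitest:16,"This record was harvested on a Thursday.","nonprojected graphic"
oai%3Adrupal-site.org%3Adoitest_4,"Church Holy Rosary, Vancouver B.C.",doitest:4,"Holy Rosary Church in Vancouver, B.C."
'''


def read_rows(text):
    """Parse the CSV text, padding short rows and decoding the ID column."""
    # restval="" fills columns missing from short rows with an empty string.
    reader = csv.DictReader(io.StringIO(text), restval="")
    rows = []
    for row in reader:
        # Hypothetical extra key holding the decoded OAI identifier.
        row["oai_identifier"] = unquote(row["ID"])
        rows.append(row)
    return rows


for row in read_rows(SAMPLE):
    print(row["oai_identifier"], "|", row["title"], "|", repr(row["format"]))
```

The quoted title containing a comma stays in one column, and the second row gets an empty `format` instead of a missing key.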

mjordan commented 6 years ago

Based on discussion at the April 18 CLAW Technical call, I've added an option to output an XML file containing the harvested DC or MODS instead of a CSV file. This output file is not generated by the OAI to CSV toolchain, but by a shutdown hook script used with the existing OAI Islandora toolchain:

[SYSTEM]

[CONFIG]
config_id = MIK OAI toolchain
last_updated_on = "2018-04-18"
last_update_by = "Mark Jordan"

[FETCHER]
class = Oaipmh
oai_endpoint = "http://localhost:8000/oai2"
set_spec = clawcall_collection
metadata_prefix = mods
temp_directory = /tmp/claw_call_tmp

[METADATA_PARSER]
; We don't use the new csv\DcToCsv parser here.
class = mods\OaiToMods

[FILE_GETTER]
class = OaipmhIslandoraObj
temp_directory = /tmp/claw_call_tmp
datastream_ids[] = OBJ

[WRITER]
; We don't use the new OaipmhCsv writer here.
class = Oaipmh
output_directory = "/tmp/claw_call"
; This is the new shutdown hook script.
shutdownhooks[] = "php extras/scripts/shutdownhooks/concatentate_xml_files.php"

[MANIPULATORS]

[LOGGING]
path_to_log = "/tmp/claw_call/mik.log"
path_to_manipulator_log = "/tmp/claw_call/manipulator.log"
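
To give a rough idea of what a concatenation step like this involves, here is a minimal Python sketch that wraps every XML file in a directory in a single root element. It is not the shutdown hook script itself (that one is PHP), and the function name, the `records` wrapper element and the `*.xml` file pattern are all assumptions made for illustration.

```python
import glob
import os
import xml.etree.ElementTree as ET


def concatenate_xml_files(input_dir, output_path, root_tag="records"):
    """Combine every *.xml file in input_dir under one wrapper element.

    Returns the number of documents combined. The wrapper tag name is a
    hypothetical choice, not something taken from MIK.
    """
    root = ET.Element(root_tag)
    for path in sorted(glob.glob(os.path.join(input_dir, "*.xml"))):
        # Skip the output file if it lives in the same directory.
        if os.path.abspath(path) == os.path.abspath(output_path):
            continue
        root.append(ET.parse(path).getroot())
    ET.ElementTree(root).write(
        output_path, encoding="utf-8", xml_declaration=True
    )
    return len(root)
```

Calling `concatenate_xml_files("/tmp/claw_call", "/tmp/claw_call_combined.xml")` would combine whatever XML files are in the output directory from the config above. The combined file path is a made-up example.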