cwrc / islandora-etl

Islandora ETL (Extract / Transform / Load)
GNU General Public License v3.0

Islandora Legacy (Drupal 7) Exporter

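A set of tools to extract content from Islandora Legacy (Drupal 7), transform it, and load it into Islandora (Drupal 8+) via Islandora Workbench.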

Features

Installing

Git clone the repository

Install Python 3+ (haven't tried with other versions)

Add Python libraries -- local user (not systemwide)

python3 setup.py install --user

Add Python libraries -- systemwide

sudo python3 setup.py install

Extract from Islandora Legacy

python3 islandora7_export.py --id_list test_data/z --server ${ISLANDORA_LEGACY:-https://example.com} --export_dir /tmp/z/
<metadata pid="" label="" owner="" created="" modified="">
  <media_exports>
    <media filepath="" ds_id=""/>
    <!-- a list of Islandora Legacy extracted datastreams with their path and datastream id -->
  </media_exports>
  <resource_metadata>
    <!-- a list of extracted metadata datastreams including MODS, RELS-EXT, etc. -->
  </resource_metadata>
</metadata>
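As a rough illustration of how the exported metadata document could be consumed, the sketch below lists the media entries with Python's standard library. It assumes the layout shown above (a `metadata` root with `media` elements under `media_exports`); the sample PID, file paths, and the `list_media` helper are hypothetical and not part of this repository.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample following the layout shown above; values are made up.
SAMPLE = """<metadata pid="demo:1" label="Demo" owner="admin" created="" modified="">
  <media_exports>
    <media filepath="/tmp/z/demo_1_OBJ.tif" ds_id="OBJ"/>
    <media filepath="/tmp/z/demo_1_TN.jpg" ds_id="TN"/>
  </media_exports>
  <resource_metadata/>
</metadata>"""


def list_media(xml_text):
    """Return (pid, [(ds_id, filepath), ...]) for one exported metadata document."""
    root = ET.fromstring(xml_text)
    media = [
        (m.get("ds_id"), m.get("filepath"))
        for m in root.iterfind("media_exports/media")
    ]
    return root.get("pid"), media


pid, media = list_media(SAMPLE)
for ds_id, filepath in media:
    print(pid, ds_id, filepath)
```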

Transformation and metadata inquiry tools

This script compares the Islandora Legacy content with the content imported into the new Islandora site via Islandora Workbench, in order to verify/audit the export, transformation, and loading phases. The comparison is made between the Islandora Legacy MODS metadata and the Islandora JSON-LD output.

Reference for the metadata conversion: Islandora MIG and Islandora MIG (Metadata Interest Group) MODS-RDF Simplified Mapping

How to find the Islandora fields available?

A list of available fields can be discovered via the --get_csv_template option within Islandora Workbench. The available fields depend on the Drupal config, created either via the Islandora defaults profile or added after the initial Drupal setup.

version alignment with configured Drupal fields

How to specify the parent collection?

Care needs to be taken with collections; otherwise resources can be added without a collection.

Collections need to appear before their children/members in the Workbench CSV (see "Creating collections and members together" in the Islandora Workbench documentation).

2021-10-22: added some logic that attempts to order the items in the CSV by collection hierarchy. This only works if the items in the collection hierarchy are present and not already in Islandora. Note: the url_alias should trigger a warning if one tries to add a collection that already exists.
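The ordering rule (a collection before its members) can be sketched as a simple parent-first traversal. This is only an illustration, not the logic used in this repository; the `order_by_hierarchy` helper, the `id` and `parent_id` keys, and the sample rows are assumptions.

```python
def order_by_hierarchy(rows):
    """Return rows ordered so that each parent precedes its children.

    rows: list of dicts with a unique "id" key and an optional "parent_id" key.
    Rows whose parent_id is empty or not present in the list are treated as roots.
    """
    ids = {row["id"] for row in rows}
    children = {}
    roots = []
    for row in rows:
        parent = row.get("parent_id") or ""
        if parent in ids and parent != row["id"]:
            children.setdefault(parent, []).append(row)
        else:
            roots.append(row)

    ordered = []
    stack = list(reversed(roots))  # reversed so pop() yields roots in input order
    while stack:
        row = stack.pop()
        ordered.append(row)
        stack.extend(reversed(children.get(row["id"], [])))

    # Rows caught in a parent cycle are never reached from a root;
    # append them at the end in their input order instead of dropping them.
    emitted = {row["id"] for row in ordered}
    ordered.extend(row for row in rows if row["id"] not in emitted)
    return ordered


rows = [
    {"id": "item:1", "parent_id": "coll:b"},
    {"id": "coll:b", "parent_id": "coll:a"},
    {"id": "coll:a", "parent_id": ""},
    {"id": "item:2", "parent_id": "coll:a"},
]
print([row["id"] for row in order_by_hierarchy(rows)])
# ['coll:a', 'coll:b', 'item:1', 'item:2']
```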

Each item should have either a parent_id (if the parent collection is referenced in the Workbench CSV) or a field_member_of (if the parent collection already exists in Drupal); otherwise resources will float without a parent (see "Creating collections and members together" in the Islandora Workbench documentation).
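A quick pre-load check for this rule might look like the following sketch, which reports rows that have neither a parent_id nor a field_member_of value. The `rows_without_parent` helper, the `id` column name, and the sample CSV are assumptions for illustration only.

```python
import csv
import io

# Hypothetical Workbench-style CSV; only the column names matter here.
SAMPLE_CSV = """id,title,parent_id,field_member_of
coll:a,Collection A,,
item:1,Item 1,coll:a,
item:2,Item 2,,42
item:3,Item 3,,
"""


def rows_without_parent(csv_text):
    """Return the ids of rows with neither parent_id nor field_member_of set."""
    reader = csv.DictReader(io.StringIO(csv_text))
    orphans = []
    for row in reader:
        parent_id = (row.get("parent_id") or "").strip()
        member_of = (row.get("field_member_of") or "").strip()
        if not parent_id and not member_of:
            orphans.append(row["id"])
    return orphans


print(rows_without_parent(SAMPLE_CSV))
# ['coll:a', 'item:3']
```

A top-level collection legitimately has no parent, so the output is a list to review rather than a list of errors.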

If items are added without a collection, the output_csv Islandora Workbench config option provides a way to update the existing items (don't lose the file), assuming they have not been changed via the UI. See the Islandora Workbench documentation for details.

todo: flesh out potential problem areas around the collection hierarchy and loading

note: Islandora Workbench subdelimiter - using non-default

Archival records can contain the | character, which is the Islandora Workbench default subdelimiter, so the subdelimiter is set to a custom value. This requires updating (2022 version is ^|.|^)
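The effect of the custom subdelimiter can be shown with plain string handling. The sketch below is not Workbench code; `join_values`, `split_values`, and the sample values are hypothetical, and the only value taken from this README is the ^|.|^ subdelimiter.

```python
DEFAULT_SUBDELIMITER = "|"
CUSTOM_SUBDELIMITER = "^|.|^"  # the 2022 value mentioned above


def join_values(values, subdelimiter=CUSTOM_SUBDELIMITER):
    """Join the values of a multi-valued field into one CSV cell."""
    for value in values:
        if subdelimiter in value:
            raise ValueError(f"value contains the subdelimiter: {value!r}")
    return subdelimiter.join(values)


def split_values(cell, subdelimiter=CUSTOM_SUBDELIMITER):
    """Split one CSV cell back into the individual field values."""
    return cell.split(subdelimiter) if cell else []


values = ["Fonds 12 | Series 3", "Letters"]
cell = join_values(values)

# With the custom subdelimiter the values round-trip unchanged.
print(split_values(cell) == values)  # True

# With the default subdelimiter a single value containing | is split in two.
print(split_values("Fonds 12 | Series 3", DEFAULT_SUBDELIMITER))
# ['Fonds 12 ', ' Series 3']
```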

Loading to Islandora

python3 workbench --config ../workbench_config/workbench_config_test_02.yaml --check
python3 workbench --config ../workbench_config/workbench_config_test_02.yaml

More information:

Note: to verify EDTF dates (faster than Islandora Workbench --check)

verify_edtf_date_in_csv.py
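For a sense of what a standalone date check involves, here is a minimal sketch that validates only a small EDTF subset: YYYY, YYYY-MM, YYYY-MM-DD, and intervals of those joined by "/". It is not the logic of verify_edtf_date_in_csv.py, it does not cover qualifiers such as ?, ~, or X, and the column name field_edtf_date_issued is an assumption.

```python
import csv
import io
import re
from datetime import date

_DATE_RE = re.compile(r"(\d{4})(?:-(\d{2})(?:-(\d{2}))?)?")


def is_simple_edtf_date(value):
    """True for YYYY, YYYY-MM, or YYYY-MM-DD naming a real calendar date."""
    match = _DATE_RE.fullmatch(value)
    if not match:
        return False
    year, month, day = match.groups()
    try:
        # Limitation: Python's date type rejects year 0000.
        date(int(year), int(month or 1), int(day or 1))
    except ValueError:
        return False
    return True


def is_simple_edtf(value):
    """True for a single simple date or a start/end interval of simple dates."""
    parts = value.strip().split("/")
    return len(parts) in (1, 2) and all(is_simple_edtf_date(p) for p in parts)


def invalid_dates(csv_text, column):
    """Return (row_number, value) pairs for non-empty values that fail the check."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # start=2 because row 1 of the CSV is the header.
    return [
        (number, row[column])
        for number, row in enumerate(reader, start=2)
        if row[column] and not is_simple_edtf(row[column])
    ]


SAMPLE_CSV = """id,field_edtf_date_issued
1,1999
2,1999-02-30
3,2001-05/2001-06
4,May 2001
"""
print(invalid_dates(SAMPLE_CSV, "field_edtf_date_issued"))
# [(3, '1999-02-30'), (5, 'May 2001')]
```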

Auditing: running the verification script after an Islandora Workbench import

Attempts to compare the Islandora Legacy XML to the JSON-LD output of the Islandora (Drupal 8+) node, using the mappings defined by the Islandora MIG in the document: Islandora MIG (Metadata Interest Group) MODS-RDF Simplified Mapping

python3 islandora_audit.py --id_list test_data/z --islandora_legacy https://example.com/ --islandora https://example_9.com/ --comparison_config test_data/comparison_config.sample.json
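The kind of comparison involved can be sketched for a single field. The example below checks that the MODS title appears among the dcterms:title values of a JSON-LD document. It is not the code of islandora_audit.py; the JSON-LD shape (an @graph list with expanded predicate keys), the choice of dcterms:title for the title, the helper names, and the sample documents are all assumptions.

```python
import json
import xml.etree.ElementTree as ET

MODS_NS = {"mods": "http://www.loc.gov/mods/v3"}
DCTERMS_TITLE = "http://purl.org/dc/terms/title"

# Hypothetical sample documents, not taken from a real site.
SAMPLE_MODS = """<mods xmlns="http://www.loc.gov/mods/v3">
  <titleInfo><title>A Letter</title></titleInfo>
</mods>"""

SAMPLE_JSONLD = json.dumps({
    "@graph": [{
        "@id": "https://example.com/node/1",
        DCTERMS_TITLE: [{"@value": "A Letter", "@language": "en"}],
    }]
})


def mods_title(mods_xml):
    """Return the first mods:titleInfo/mods:title text, or None if absent."""
    root = ET.fromstring(mods_xml)
    node = root.find("mods:titleInfo/mods:title", MODS_NS)
    return (node.text or "").strip() if node is not None else None


def jsonld_values(jsonld_text, predicate):
    """Collect the @value entries for one predicate across the @graph nodes."""
    doc = json.loads(jsonld_text)
    return [
        value.get("@value")
        for node in doc.get("@graph", [])
        for value in node.get(predicate, [])
    ]


def title_matches(mods_xml, jsonld_text):
    """True when the MODS title is among the JSON-LD title values."""
    title = mods_title(mods_xml)
    return title is not None and title in jsonld_values(jsonld_text, DCTERMS_TITLE)


print(title_matches(SAMPLE_MODS, SAMPLE_JSONLD))  # True
```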

ToDo

How to gather a list of PIDs from an Islandora Legacy (aka Islandora07) collection

Purpose: to return a list of all the direct members of a specified collection. As of 2022-04-19, it doesn't traverse the descendant collections of the specified collection.

See the islandora7_search.py script

python3 islandora7_search.py --input_file input_file_listing_collection_PIDs --server https://cwrc.ca --output_file output_file_to_store_results
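Since the script returns direct members only, walking the descendant collections would have to be layered on top. The sketch below shows one way to do that with a breadth-first traversal over any "direct members" lookup. The `all_members` function, its two callback parameters, and the in-memory MEMBERS table are hypothetical stand-ins, not part of islandora7_search.py.

```python
from collections import deque


def all_members(collection_pid, get_direct_members, is_collection):
    """Return every PID below collection_pid, descending into sub-collections.

    get_direct_members(pid) -> iterable of member PIDs (direct members only)
    is_collection(pid)      -> True if the PID is itself a collection
    """
    seen = {collection_pid}  # guards against membership cycles
    found = []
    queue = deque([collection_pid])
    while queue:
        current = queue.popleft()
        for pid in get_direct_members(current):
            if pid in seen:
                continue
            seen.add(pid)
            found.append(pid)
            if is_collection(pid):
                queue.append(pid)
    return found


# In-memory stand-in for the server lookup, including a cycle back to the root.
MEMBERS = {
    "demo:root": ["demo:sub", "demo:1"],
    "demo:sub": ["demo:2", "demo:root"],
}

result = all_members(
    "demo:root",
    get_direct_members=lambda pid: MEMBERS.get(pid, []),
    is_collection=lambda pid: pid in MEMBERS,
)
print(result)
# ['demo:sub', 'demo:1', 'demo:2']
```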

Testing

To run tests:

python3 tests/export_unit_tests.py

Style

pycodestyle --show-source --show-pep8 --ignore=E402,W504 --max-line-length=200 .

FAQ

Media files fail to load via Islandora Workbench (or via the Drupal UI)

How to gather a set of PIDs from Islandora Legacy (Islandora 7)?

Todo

Useful queries

List all models

for $i in /metadata/@models
group by $i
return $i

Lookup by PID

let $pid = "digitalpage:881e0ee6-52ed-4f05-9e8d-c5e51c5c1a31"
for $i in /metadata[@pid=$pid]
return $i
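Where no XQuery engine is at hand, roughly the same two lookups can be approximated in Python over already-loaded metadata documents. This is only a sketch: the `distinct_models` and `find_by_pid` helpers and the sample documents are hypothetical, and the `models` attribute is assumed from the query above.

```python
import xml.etree.ElementTree as ET

# Hypothetical exported metadata documents; PIDs and model names are made up.
DOCS = [
    '<metadata pid="demo:1" models="info:fedora/demo:pageCModel"/>',
    '<metadata pid="demo:2" models="info:fedora/demo:bookCModel"/>',
    '<metadata pid="demo:3" models="info:fedora/demo:pageCModel"/>',
]


def distinct_models(roots):
    """Roughly 'List all models': the distinct values of the models attribute."""
    return sorted({root.get("models") for root in roots if root.get("models")})


def find_by_pid(roots, pid):
    """Roughly 'Lookup by PID': the metadata elements whose pid attribute matches."""
    return [root for root in roots if root.get("pid") == pid]


roots = [ET.fromstring(doc) for doc in DOCS]
print(distinct_models(roots))
print([root.get("pid") for root in find_by_pid(roots, "demo:2")])
```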