MarcusBarnes / mik

The Move to Islandora Kit is an extensible PHP command-line tool for converting source content and metadata into packages suitable for importing into Islandora (or other digital repository and preservation systems).
GNU General Public License v3.0

Add an Islandora toolchain #256

Open mjordan opened 7 years ago

mjordan commented 7 years ago

It should be fairly simple to migrate from one Islandora instance to another, particularly with the load-all-datastreams functionality of the newspaper and book batch modules, and https://github.com/mjordan/islandora_batch_with_derivs.

Two use cases are:

We could use OAI-PMH to "discover" objects, then use Islandora's REST interface to get their contents.
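
A rough sketch of that approach (assuming the source exposes the usual islandora_oai endpoint at /oai2 and has the islandora_rest module enabled; the paths, the identifier-to-PID mapping, and the JSON keys below are assumptions that would need to be checked against the actual source site):

<?php

// Hypothetical sketch: discover objects via OAI-PMH, then fetch their
// properties through the islandora_rest module. Paths, the identifier-to-PID
// mapping, and the JSON response shape are assumptions.
$site_base_url = 'http://example.com';
$oai_url = $site_base_url . '/oai2?verb=ListIdentifiers&metadataPrefix=oai_dc';

// Note: resumptionToken paging is omitted for brevity.
$oai_xml = simplexml_load_file($oai_url);
$oai_xml->registerXPathNamespace('oai', 'http://www.openarchives.org/OAI/2.0/');

foreach ($oai_xml->xpath('//oai:identifier') as $identifier) {
    // islandora_oai identifiers usually look like oai:repo-id:namespace_123;
    // the mapping back to a PID is site-specific.
    $parts = explode(':', (string) $identifier);
    $pid = str_replace('_', ':', end($parts));

    // Fetch the object's properties (assumed to include a datastream list)
    // via islandora_rest.
    $object_info_url = $site_base_url . '/islandora/rest/v1/object/' . $pid;
    $object_info = json_decode(file_get_contents($object_info_url), true);
    print $pid . ': ' . count($object_info['datastreams']) . " datastreams\n";
}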

mjordan commented 7 years ago

@MarcusBarnes I also think that the Compound Batch module can ingest correctly named datastreams. Is that correct?

Edit, 2016-11-06: If we can preserve relationships expressed in RELS-EXT (see below), we do not need to treat compound objects differently from any other object.

mjordan commented 7 years ago

Just a note - we'll need a way of updating the PIDs in RELS-EXT to preserve relationships between the migrated objects once they get into the destination Islandora.
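
If we do have to reassign PIDs, the rewrite itself could be as simple as replacing the info:fedora/ URIs in each object's RELS-EXT before ingest. A minimal sketch, assuming we already have a source-to-destination PID map (the $pid_map values and file path below are made up):

<?php

// Sketch: rewrite source PIDs to destination PIDs in a RELS-EXT file before
// ingest. $pid_map and the file path are hypothetical.
$pid_map = array(
    'alping:658' => 'newsite:1001',
    'alping:652' => 'newsite:1002',
);

$rels_ext = file_get_contents('RELS-EXT.rdf');
foreach ($pid_map as $source_pid => $destination_pid) {
    // PIDs appear in RELS-EXT as rdf:about and rdf:resource URIs,
    // e.g. info:fedora/alping:658.
    $rels_ext = str_replace(
        'info:fedora/' . $source_pid,
        'info:fedora/' . $destination_pid,
        $rels_ext
    );
}
file_put_contents('RELS-EXT.rdf', $rels_ext);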

mjordan commented 7 years ago

We may be able to avoid updating PIDs altogether and reuse the source objects' PIDs. See https://github.com/mjordan/islandora_batch_with_derivs/issues/1 for more info.

mjordan commented 7 years ago

Not only can we retain the source objects' PIDs, we can also batch ingest their RELS-EXT datastreams (although there is an issue with duplicate RDF statements). See https://github.com/mjordan/islandora_batch_with_derivs/issues/1#issuecomment-258711053 for an example.
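
For reference, the kind of per-object input that implies (one directory per object, content files named by DSID, as in islandora_batch_with_derivs) would look roughly like this; the directory names, filenames, and extensions are illustrative only:

alping_658/
  MODS.xml
  OBJ.jpg
  TN.jpg
  RELS-EXT.rdf
alping_659/
  MODS.xml
  OBJ.jpg
  TN.jpg
  RELS-EXT.rdf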

mjordan commented 7 years ago

To summarize the relevance of https://github.com/mjordan/islandora_batch_with_derivs/issues/1 to an Islandora toolchain:

mjordan commented 7 years ago

In the initial issue I said:

We could use OAI-PMH to "discover" objects, then use Islandora's REST interface to get their contents.

That assumes that the Islandora OAI-PMH and REST modules are running on the source Islandora, and that it is possible to install them if they are not already installed. It probably also assumes that the source is running Islandora 7.x and not 6.x. The "fetch" components of this toolchain should not make those assumptions.

If we can avoid having to recreate the structure of complex objects (compound objects, newspaper pages, etc.) and instead rely on the relationships expressed in RELS-EXT, we can probably get by with fetching content from the source Islandora using a list of PIDs. We would then iterate over the list and fetch all datastreams for each object.
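
As an illustration of the per-object fetch, here's a sketch that lists and downloads every datastream for a PID using Fedora's API-A REST calls (listDatastreams and getDatastreamDissemination); it assumes we have direct access to the source Fedora on port 8080, which we may not always have, and the base URL and PID are placeholders:

<?php

// Rough sketch, assuming direct access to the source Fedora's API-A REST
// interface. The base URL and PID are placeholders.
$fedora_base_url = 'http://example.com:8080/fedora';
$pid = 'alping:658';

$output_dir = '/tmp/' . str_replace(':', '_', $pid);
if (!is_dir($output_dir)) {
    mkdir($output_dir, 0755, true);
}

// listDatastreams returns XML with one <datastream> element per DSID.
$list_url = $fedora_base_url . '/objects/' . $pid . '/datastreams?format=xml';
$list_xml = simplexml_load_file($list_url);
foreach ($list_xml->datastream as $datastream) {
    $dsid = (string) $datastream['dsid'];
    // getDatastreamDissemination returns the datastream content itself.
    $content_url = $fedora_base_url . '/objects/' . $pid . '/datastreams/' . $dsid . '/content';
    file_put_contents($output_dir . '/' . $dsid, file_get_contents($content_url));
    print "Saved $dsid for $pid\n";
}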

The PID list could be obtained in a variety of ways, depending on how much access we have to the source Islandora - a locally run drush script, a remote script that queries the resource index or Solr, etc.
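
For the resource index option, something like this could produce the PID list, assuming the source Fedora's risearch endpoint is reachable and accepts SPARQL (the base URL and collection PID are placeholders, and risearch may require authentication):

<?php

// Sketch: query the source Fedora's resource index (risearch) for all members
// of a collection. Compound children and paged content use other predicates
// (e.g. isMemberOf, isConstituentOf), so additional queries may be needed.
$fedora_base_url = 'http://example.com:8080/fedora';
$collection_pid = 'alping:collection';

$sparql = "
PREFIX fedora-rels-ext: <info:fedora/fedora-system:def/relations-external#>
SELECT ?object FROM <#ri> WHERE {
  ?object fedora-rels-ext:isMemberOfCollection <info:fedora/$collection_pid> .
}
";

$query_url = $fedora_base_url . '/risearch?' . http_build_query(array(
    'type' => 'tuples',
    'lang' => 'sparql',
    'format' => 'CSV',
    'query' => $sparql,
));

$csv = file_get_contents($query_url);
$rows = array_filter(array_map('trim', explode("\n", $csv)));
array_shift($rows); // Drop the CSV header row.
foreach ($rows as $row) {
    // Rows look like info:fedora/alping:658; strip the prefix to get the PID.
    print preg_replace('#^info:fedora/#', '', $row) . "\n";
}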

mjordan commented 7 years ago

In cases where an OAI-PMH provider is not available, perhaps we can use a simple scraper to get all the members of a collection from the collection browse pages. Here's an example using the Goutte scraper library:

<?php

use Goutte\Client;
require_once __DIR__ . '/vendor/autoload.php';

$client = new Client();

$browse_url = 'http://digital.lib.sfu.ca/alping-collection/richard-harwood-dick-chambers-alpine-photograph-collection';
$pages = range(2, 68);
$object_urls = array();

print "Scraping object URLs from pages starting at $browse_url...\n";
foreach ($pages as $page) {
    $crawler = $client->request('GET', $browse_url . '?page=' . $page);
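    // Object links in the browse list are inside dd.islandora-object-caption elements.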
    $crawler->filter('dd.islandora-object-caption > a')->each(function ($node) {
        print $node->attr('href') . "\n";
    });
}

This scraper produces output like this:

/alping-658/view-looking-out-over-numerous-mountains-two-people-foreground
/alping-652/view-looking-towards-burrard-inlet-and-vancouver-bc
/alping-659/view-looking-out-over-numerous-mountains
/alping-653/mountaineering-group-standing-mountain
/alping-654/mountaineering-group-standing-mountain
/alping-656/five-people-viewing-mountains
/alping-655/group-people-mountaineering
/alping-787/man-preparing-food-over-stove
/alping-789/icicles-hanging-snow-covered-tree
/alping-785/numerous-trees-covered-snow
/alping-786/numerous-trees-covered-snow
/alping-781/woman-mountaineering
/alping-788/person-mountaineering

An advantage of this approach is that every Islandora instance will expose members of a collection for scraping, regardless of whether it's running on Drupal 6 or 7.

mjordan commented 7 years ago

Here's a better version:

<?php

use Goutte\Client;
require_once __DIR__ . '/vendor/autoload.php';

$client = new Client();

$browse_url = 'http://digital.lib.sfu.ca/alping-collection/richard-harwood-dick-chambers-alpine-photograph-collection';
$site_base_url = 'http://digital.lib.sfu.ca';
// This range corresponds to the number of pages in the collection's browse list, the second
// number being the "?page" value of the last page.
$pages = range(0, 68); 
$object_urls = array();

print "Scraping object URLs from pages starting at $browse_url...\n";

// Scrape each of the paginated browse pages defined in $pages.
foreach ($pages as $page) {
    $crawler = $client->request('GET', $browse_url . '?page=' . $page);
    $hrefs = $crawler->filter('dd.islandora-object-caption > a')->extract(array('href'));
    $object_urls = array_merge($object_urls, $hrefs);
}

// Extract the PID from each object URL. This will be specific to the URLs on the site,
// e.g., specific to the site's Pathauto URL patterns.
foreach ($object_urls as $url) {
    $url = ltrim($url, '/');
    $pid = preg_replace('#/.*$#', '', $url);
    $pid = preg_replace('#\-#', ':', $pid);
    $rels_ext_url = $site_base_url . '/islandora/object/' . $pid . '/datastream/RELS-EXT/download';
    print $rels_ext_url . "\n";
}

$count = count($object_urls);
print "Processed $count URLs\n";

Once we have the URL to each object's RELS-EXT datastream, we can get its content model, etc. and start harvesting content.
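
For example, pulling the content model out of a fetched RELS-EXT might look something like this (a sketch using DOMXPath and the standard Fedora namespace URIs):

<?php

// Sketch: given one of the RELS-EXT URLs printed above, extract the object's
// content model(s) so we know how to handle its datastreams.
$rels_ext_url = 'http://digital.lib.sfu.ca/islandora/object/alping:658/datastream/RELS-EXT/download';

$dom = new DOMDocument();
$dom->loadXML(file_get_contents($rels_ext_url));

$xpath = new DOMXPath($dom);
$xpath->registerNamespace('rdf', 'http://www.w3.org/1999/02/22-rdf-syntax-ns#');
$xpath->registerNamespace('fedora-model', 'info:fedora/fedora-system:def/model#');

foreach ($xpath->query('//fedora-model:hasModel/@rdf:resource') as $attribute) {
    // Values look like info:fedora/islandora:sp_basic_image.
    print preg_replace('#^info:fedora/#', '', $attribute->value) . "\n";
}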