MarcusBarnes / mik

The Move to Islandora Kit is an extensible PHP command-line tool for converting source content and metadata into packages suitable for importing into Islandora (or other digital repository and preservation systems).
GNU General Public License v3.0

Add an Islandora toolchain #256

Open mjordan opened 7 years ago

mjordan commented 7 years ago

It should be fairly simple to migrate from one Islandora instance to another, particularly with the load-all-datastreams functionality of the newspaper and book batch modules, and

Two use cases are:

We could use OAI-PMH to "discover" objects, then use Islandora's REST interface to get their contents.
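The "discover" step could be sketched roughly as follows. This is only an illustration, not existing MIK code: the function names and the `oai_dc` default are assumptions, and how an OAI identifier maps back to a PID depends on how the source site is configured. The only protocol details relied on are the `ListIdentifiers` verb and the `resumptionToken` paging mechanism from the OAI-PMH specification.

```php
<?php
// Sketch only: build ListIdentifiers request URLs and parse one response.
// Function names are hypothetical; nothing here is part of MIK.

/**
 * Build the URL for the first or a follow-up ListIdentifiers request.
 * Per OAI-PMH, resumptionToken is an exclusive argument, so metadataPrefix
 * is only sent on the first request.
 */
function list_identifiers_url(string $endpoint, ?string $token, string $prefix = 'oai_dc'): string
{
    $params = ['verb' => 'ListIdentifiers'];
    if ($token === null) {
        $params['metadataPrefix'] = $prefix;
    } else {
        $params['resumptionToken'] = $token;
    }
    return $endpoint . '?' . http_build_query($params, '', '&');
}

/**
 * Extract record identifiers and the resumption token (null when the
 * list is complete) from the XML of one ListIdentifiers response.
 */
function parse_list_identifiers(string $xml): array
{
    $doc = new DOMDocument();
    if (!$doc->loadXML($xml)) {
        throw new RuntimeException('Could not parse OAI-PMH response');
    }
    $xpath = new DOMXPath($doc);
    $xpath->registerNamespace('oai', 'http://www.openarchives.org/OAI/2.0/');

    $identifiers = [];
    foreach ($xpath->query('//oai:ListIdentifiers/oai:header/oai:identifier') as $node) {
        $identifiers[] = trim($node->textContent);
    }

    $token = null;
    $token_nodes = $xpath->query('//oai:ListIdentifiers/oai:resumptionToken');
    if ($token_nodes->length > 0) {
        $value = trim($token_nodes->item(0)->textContent);
        // An empty resumptionToken element marks the last page.
        $token = ($value === '') ? null : $value;
    }

    return ['identifiers' => $identifiers, 'resumptionToken' => $token];
}
```

A harvester would request `list_identifiers_url($endpoint, null)`, parse the response, and keep requesting with the returned token until it comes back as `null`.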

mjordan commented 7 years ago

@MarcusBarnes I also think that the Compound Batch module can ingest correctly named datastreams. Is that correct?

Edit, 2016-11-06: If we can preserve relationships expressed in RELS-EXT (see below), we do not need to treat compound objects differently than any other object.

mjordan commented 7 years ago

Just a note: we'll need a way of updating the PIDs in RELS-EXT to preserve relationships between the migrated objects once they are in the destination Islandora.
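One possible way to do that rewriting, assuming we already have a map of source PIDs to destination PIDs, is sketched below. The function name and the map are hypothetical. The sketch only touches `rdf:about` and `rdf:resource` values that begin with `info:fedora/`, and it matches whole PIDs so that `old:1` is not rewritten inside `old:12`.

```php
<?php
// Sketch only: rewrite object references in a RELS-EXT document using a
// hypothetical source-PID => destination-PID map.

function remap_rels_ext_pids(string $rels_ext_xml, array $pid_map): string
{
    $doc = new DOMDocument();
    if (!$doc->loadXML($rels_ext_xml)) {
        throw new RuntimeException('Could not parse RELS-EXT');
    }
    $xpath = new DOMXPath($doc);
    $xpath->registerNamespace('rdf', 'http://www.w3.org/1999/02/22-rdf-syntax-ns#');

    $prefix = 'info:fedora/';
    foreach ($xpath->query('//@rdf:about | //@rdf:resource') as $attr) {
        // Leave anything that is not a Fedora object URI untouched.
        if (strpos($attr->value, $prefix) !== 0) {
            continue;
        }
        $pid = substr($attr->value, strlen($prefix));
        // Only rewrite PIDs that appear in the map (exact match).
        if (isset($pid_map[$pid])) {
            $attr->value = $prefix . $pid_map[$pid];
        }
    }

    return $doc->saveXML();
}
```

References to content models such as `islandora:sp_basic_image` are left alone as long as they are not in the map.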

mjordan commented 7 years ago

We may be able to avoid updating PIDs and reuse the source objects' PIDs. See for more info.

mjordan commented 7 years ago

Not only can we retain the source objects' PIDs, we can also batch ingest their RELS-EXT datastreams (although there is an issue with duplicate RDF statements). See for an example.

mjordan commented 7 years ago

To summarize the relevance of to an Islandora toolchain:

mjordan commented 7 years ago

In the initial issue I said

We could use OAI-PMH to "discover" objects, then use Islandora's REST interface to get their contents.

That assumes that the Islandora OAI-PMH and REST modules are running on the source Islandora, or that it is possible to install them if they are not already installed. It probably also assumes that the source is running Islandora 7.x and not 6.x. The "fetch" components of this toolchain should not make those assumptions.

If we can avoid retaining the structure of complex objects (compound objects, newspaper pages, etc.) and rely on the relationships expressed in RELS-EXT, we can probably get by with fetching content from the source Islandora using a list of PIDs. We then iterate over the list and fetch all datastreams for each object.

The PID list could be obtained in a variety of ways, depending on how much access we have to the source Islandora - a locally run drush script, a remote script that queries the resource index or Solr, etc.
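As a rough sketch of that iteration, assuming the PID list is a plain-text file with one PID per line and that we already know which datastream IDs to ask for (both are assumptions, as are the function names), the download URLs could be assembled like this. The URL pattern is the standard Islandora 7.x datastream download path.

```php
<?php
// Sketch only: turn a plain-text PID list into datastream download URLs.
// Function names and the fixed DSID list are hypothetical.

/** Parse one PID per line, skipping blank lines and '#' comments. */
function read_pid_list(string $text): array
{
    $pids = [];
    foreach (preg_split('/\R/', $text) as $line) {
        $line = trim($line);
        if ($line === '' || $line[0] === '#') {
            continue;
        }
        $pids[] = $line;
    }
    // Drop duplicates while keeping the original order.
    return array_values(array_unique($pids));
}

/** Build the download URL for one datastream of one object. */
function datastream_url(string $site_base_url, string $pid, string $dsid): string
{
    return rtrim($site_base_url, '/') . '/islandora/object/' . $pid
        . '/datastream/' . $dsid . '/download';
}

/** Map each PID to an array of DSID => download URL. */
function build_fetch_plan(string $site_base_url, array $pids, array $dsids): array
{
    $plan = [];
    foreach ($pids as $pid) {
        foreach ($dsids as $dsid) {
            $plan[$pid][$dsid] = datastream_url($site_base_url, $pid, $dsid);
        }
    }
    return $plan;
}
```

In practice the set of datastreams differs per object, so a real fetcher would need to discover each object's datastream IDs rather than rely on a fixed list.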

mjordan commented 7 years ago

In cases where an OAI-PMH provider is not available, perhaps we can use a simple scraper to get all the members of a collection from the collection browse pages. Here's an example using the Goutte scraper library:


<?php

use Goutte\Client;
require_once __DIR__ . '/vendor/autoload.php';

$client = new Client();

// URL of the collection's browse page.
$browse_url = '';
// "?page" values of the browse pages to scrape.
$pages = range(2, 68);

print "Scraping object URLs from pages starting at $browse_url...\n";
foreach ($pages as $page) {
    $crawler = $client->request('GET', $browse_url . '?page=' . $page);
    // Each object's link sits inside a dd.islandora-object-caption element.
    $crawler->filter('dd.islandora-object-caption > a')->each(function ($node) {
        print $node->attr('href') . "\n";
    });
}

This scraper produces output like this:


An advantage of this approach is that every Islandora instance will expose members of a collection for scraping, regardless of whether it's running on Drupal 6 or 7.

mjordan commented 7 years ago

Here's a better version:


<?php

use Goutte\Client;
require_once __DIR__ . '/vendor/autoload.php';

$client = new Client();

$browse_url = '';
$site_base_url = '';
// This range corresponds to the number of pages in the collection's browse list, the second
// number being the "?page" value of the last page.
$pages = range(0, 68);
$object_urls = array();

print "Scraping object URLs from pages starting at $browse_url...\n";

// Scrape each of the parameterized browse pages defined in $pages.
foreach ($pages as $page) {
    $crawler = $client->request('GET', $browse_url . '?page=' . $page);
    $hrefs = $crawler->filter('dd.islandora-object-caption > a')->extract(array('href'));
    $object_urls = array_merge($object_urls, $hrefs);
}

// Extract the PID from each object URL. This will be specific to the URLs on the site,
// e.g., specific to Pathauto URL patterns, etc.
foreach ($object_urls as $url) {
    $url = ltrim($url, '/');
    // Keep only the first path segment, then turn its hyphens back into colons.
    $pid = preg_replace('#/.*$#', '', $url);
    $pid = preg_replace('#\-#', ':', $pid);
    $rels_ext_url = $site_base_url . '/islandora/object/' . $pid . '/datastream/RELS-EXT/download';
    print $rels_ext_url . "\n";
}

$count = count($object_urls);
print "Processed $count URLs\n";

Once we have the URL to each object's RELS-EXT datastream, we can get its content model, etc. and start harvesting content.
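A minimal sketch of reading the content model out of a downloaded RELS-EXT document might look like the following. The function name is hypothetical. It relies on the standard `fedora-model:hasModel` statement and returns every model listed, since an object can carry more than one.

```php
<?php
// Sketch only: list the content models declared in a RELS-EXT document.
// The function name is hypothetical.

function content_models_from_rels_ext(string $rels_ext_xml): array
{
    $doc = new DOMDocument();
    if (!$doc->loadXML($rels_ext_xml)) {
        throw new RuntimeException('Could not parse RELS-EXT');
    }
    $xpath = new DOMXPath($doc);
    $xpath->registerNamespace('rdf', 'http://www.w3.org/1999/02/22-rdf-syntax-ns#');
    $xpath->registerNamespace('fm', 'info:fedora/fedora-system:def/model#');

    $models = [];
    foreach ($xpath->query('//fm:hasModel/@rdf:resource') as $attr) {
        // Strip the "info:fedora/" URI prefix to leave the bare PID.
        $models[] = preg_replace('#^info:fedora/#', '', $attr->value);
    }
    return $models;
}
```

The returned PIDs (for example `islandora:sp_basic_image`) could then decide which datastreams to harvest for each object.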