MarcusBarnes / mik

The Move to Islandora Kit is an extensible PHP command-line tool for converting source content and metadata into packages suitable for importing into Islandora (or other digital repository and preservations systems).
GNU General Public License v3.0
34 stars 11 forks source link

add logic for DIRECTORY_SEPARATOR to account for "/" or "\" in reposi… #432

Open flummingbird opened 7 years ago

flummingbird commented 7 years ago

…tory unique identifier

What does this Pull Request do?

We encountered a problem harvesting OAI, where someone's repository unique identifier had problematic characters IE: '/'

What's new?

We added logic to the src/writers/Oaipmh.php normalize_filename function, that would account for system directory separators. we chose to use hyphens to replace.

How should this be tested?

Set your repository unique identifier in islandora_oai config to http://example.com

Then try to harvest oai-pmh from there.

any sample oai-pmh sample config should work.

Interested parties @mjordan

Tag interested parties using @ mentions

mjordan commented 7 years ago

It's awesome you're using OAI toolchains! I'd love to know more.

mjordan commented 7 years ago

@flummingbird WRT you comments about testing, I'll update the docs in response. Thanks for the feedback.

mjordan commented 7 years ago

Also, we'd prefer if you open an issue describing the problem you've found before opening a PR. No sweat for this PR though, this is a good catch, thanks for finding it.

flummingbird commented 7 years ago

not sure what the WRT stands for... Most of the credit should go to @jpeak5

mjordan commented 7 years ago

Sorry, 'with regard to.'

mjordan commented 7 years ago

@flummingbird @jpeak5 looking at this a bit more closely, I wonder if we should define a list of potentially problematic characters to replace with _. For example, : and DIRECTORY_SEPARATOR. There may be others as well. We could implement like

$pattern = preg_quote('#(' . DIRECTORY_SEPARATOR . '|:)#');
$string = preg_replace($pattern, '_', $string);

Would this address the situation you encountered?

jpeak5 commented 7 years ago

indeed, we will give it a whirl in the am; thx Mark

Sent from my iPhone

On Jul 19, 2017, at 6:24 PM, Mark Jordan notifications@github.com<mailto:notifications@github.com> wrote:

@mjordan commented on this pull request.


In src/writers/Oaipmh.phphttps://github.com/MarcusBarnes/mik/pull/432#discussion_r128392952:

@@ -133,6 +133,8 @@ public function writeMetadataFile($metadata, $path, $overwrite = true) public function normalizeFilename($string) { $string = urldecode($string);

  • $pattern = '#'.DIRECTORY_SEPARATOR.'#';

A bit more end-of-the-day hacking around has produced:

/**
 * Convert %3A (:) and other characters in filenames into underscores (_).
 */
public function normalizeFilename($string)
{
    $string = urldecode($string);
    $bad_chars = [
        ':' => '_',
        DIRECTORY_SEPARATOR => '_',
    ];
    $string = str_replace(array_keys($bad_chars), $bad_chars, $string);
    return $string;
}

This replaces : in my local tests. Could you test to see if it catches the directory separator? The benefit of str_replace() is that you don't need to worry about nasty cross-platform issues (I hope).

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://github.com/MarcusBarnes/mik/pull/432#discussion_r128392952, or mute the threadhttps://github.com/notifications/unsubscribe-auth/ABEMrF3EQr7-yzWfH6SnbUDKwVnqvo-6ks5sPpA5gaJpZM4OdH6P.

jpeak5 commented 7 years ago

Mark,

Your end-of-day hack works just fine in our scenario. I'm curious though about the cross-platform issues you mention with preg_replace vs str_replace: I'd always thought that str_replace was just lighter, simpler, faster and that the preg_* family was for when you needed full regex.


public function normalizeFilename($string) {
    $bad_chars = [
      '#:#', 
      '#'.DIRECTORY_SEPARATOR.'#',
      ];
    return preg_replace($bad_chars, '_', urldecode($string));
  }

Any rate- I'd like to find out more about this OAI workflow, specifically, how, once you have the records, you ingest such that the Islandora items present thumbnails and links back to the original. I'm finally able next week to put some time into looking closely at this part of mik...

We plan to harvest collections from only a single institution to start with, but I'm pretty interested in knowing more about how you or others are leveraging OAI.

Alright- thx as always!

jrp

mjordan commented 7 years ago

@jpeak5 Thanks for testing. The cross-platform issue we encountered with preg_replace() was that by using DIRECTORY_SEPARATOR within patterns, on Windows this converted to \, which then acted as an escape character within the pattern, making many matches fail. According to this SO post, you do have to escape some needle characters within str_replace(). So my comment that using it instead of preg_replace() is not completely valid.

On to OAI... MIK's OAI toolchains spit out ordinary Islandora ingest packages, although usually with DC instead of MODS XML files. The OAI toolchains don't directly influence the behavior of the resulting Islandora objects in any special way. If I am understanding what you are describing, if you want the thumbnails of the objects (in collection browse lists or search result sets) to lead users offsite to the originating URLs, I think you'd need to do some stuff within Islandora itself to get that to work. That said, you could probably harvest thumbnails from the source objects to reuse in Islandora if you wanted, using a post-write hook script or a custom OAI filegetter. I'd have to know more about what you'd expect the Islandora objects corresponding to the harvested remote objects would be like to speculate on other solutions.

mjordan commented 6 years ago

@jpeak5 sorry to leave this PR open so long, but now that it sounds like you guys are fully back in operation, and that the OAI angle was mentioned in Mike Waugh's email announcing the launch of http://louisianadigitallibrary.org/, we thought we'd ping you. We'd be interested in knowing more about how you integrated the data you retrieved via OAI into the site.

mjordan commented 6 years ago

Also, mind if we link to a few of the collections on http://louisianadigitallibrary.org/ under the "Islandora content that has been prepared using MIK" section of the MIK README?

jpeak5 commented 6 years ago

Hi Mark,

We've definitely been busy - I'm sorry to say that I'd forgotten all about this!

Definitely no problem to link to the collections.

We are still working on problems with harvesting the last 4 of about 80 Tulane collections, so this story is by no means done (and we have much code cleanup and at least one more PR), but here's a brief sketch of what we've done:

We have used mik/oai to pull MODS + TN for all of the collections and items in Tulane University's standalone Islandora (at their request, of course).

We pulled MODS because it was expedient - we needed a field (containing the source URL) that seemed to be getting stripped out of the DC.

We injected a string into MODS.abstract containing a link back to the item in Tulane's site: https://github.com/lsulibraries/mik/commit/6613f5688d49f68e6dfadc6104d70a13b2869040#diff-3cb45e06ac756c68d4fc7249c5c6b6e4R41

Then, with some spaghetti, sorely in need of refactor, we alter the display of harvested items in our solr search results: https://github.com/lsulibraries/islandora_namespace_homepage/blob/master/islandora_namespace_homepage.module#L346

When we finish with this round of OAI, I'm hoping that we will fix up our code, PR, and then do some better documentation of what we've done (and how we intend to keep our Tulane items in sync with them...)...

jrp

mjordan commented 6 years ago

Then, with some spaghetti, sorely in need of refactor, we alter the display of harvested items in our solr search results: https://github.com/lsulibraries/islandora_namespace_homepage/blob/master/islandora_namespace_homepage.module#L346

Cool - so the thumbnail links back to the object at Tulane. What content model are those objects in your Islandora?

mjordan commented 6 years ago

When we finish with this round of OAI, I'm hoping that we will fix up our code, PR, and then do some better documentation of what we've done (and how we intend to keep our Tulane items in sync with them...)...

How you've used MIK to harvest the OAI, and integrated the resulting objects into Islandora, would make a super-mega-awesome Migration Guide for the MIK wiki... 😀, don't you think so @MarcusBarnes?

MarcusBarnes commented 6 years ago

Agreed: super-mega-awesome. :)

flummingbird commented 6 years ago

Hey Mark! a few things: All our collections were done with MIK!! Every dang one. Oai-pmh or not. I elected to use the sp_basic_image cmodel for tulane's items We'd love to work on the MIK-wiki

mjordan commented 6 years ago

@flummingbird neat. I just opened a PR to add links to sample collections.

Nice work guys!

mjordan commented 6 years ago

@flummingbird @jpeak5 inspired by your use of OAI and redirecting to a remote object, I've developed a solution pack that does a lot of the same things but treats the local objects as a separate content model. It's in our local gitlab at https://git.lib.sfu.ca/mjordan/islandora_solution_pack_remote_resource. Sorry I didn't let you know back when I was working on it, I got distracted..... anyway, I'd welcome any comments you have before publicizing it more widely. There may be a use case for it here in British Columbia, maybe even within my own library since we still have a couple bespoke repository platforms that we'll probably never fully migrate into our Islandora.

flummingbird commented 6 years ago

Hey Marcus,

We're currently auditing our OAI-PMH resources. and we were looking at trying out this content type.

I'm a bit flummoxed. I've been trying to use the '--url_list' flag, in the remote_resources_bach with very little luck. It clams that it has created an ingest set, but the set is always 0 items.

If I can get a proof of concept working for this, I think we may be interested in using the --url_list to replace our whole oai-pmh workflow.

It looks like you and Mark are working on it today... I'll keep trying the module out. but if you have some time I I'd love some pointers. I tried finding the project while logged into git lab, I guess it's hidden...

Let me know

-Will

On Wed, Oct 11, 2017 at 12:29 PM, Mark Jordan notifications@github.com wrote:

@flummingbird https://github.com/flummingbird @jpeak5 https://github.com/jpeak5 inspired by your use of OAI and redirecting to a remote object, I've developed a solution pack that does a lot of the same things but treats the local objects as a separate content model. It's in our local gitlab at https://git.lib.sfu.ca/mjordan/islandora_solution_ pack_remote_resource. Sorry I didn't let you know back when I was working on it, I got distracted..... anyway, I'd welcome any comments you have. There may be a use case for it here in British Columbia, maybe even within my own library since we still have a couple bespoke repository platforms that we'll probably never fully migrate into our Islandora.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/MarcusBarnes/mik/pull/432#issuecomment-335886137, or mute the thread https://github.com/notifications/unsubscribe-auth/ADe_3xvTsyHMjpXYwFvkf8g7nPIfxL9Hks5srPsDgaJpZM4OdH6P .

mjordan commented 6 years ago

@flummingbird yes, I've been working on this lately to get ready for our local code4lib meetup, where I'm going to do a lightning talk on it.

I've ripped out the --url_list stuff and removed that functionality because the batch code was becoming too complex. To replace the ability to load from a CSV file, I've built a separate script that uses a proper CSV parser. I've combined it and the MIK postwrite hook script into this repo: https://git.lib.sfu.ca/mjordan/islandora_remote_resource_batch_tools. This tool lets you use a Twig template to crate DC or MODS files as well.

In this new workflow, you'd run php convert_csv.php --csv sample_data/test.csv --output_dir /tmp/output using your CSV file, and the output would be the same as the directory scan method the batch module uses. I figured that standardizing the input data used by the batch module would let people build their own tools to produce the data. I hope that this change is consistent with your workflow. I'd be very interested to hear more.

mjordan commented 6 years ago

@flummingbird I just spent some time fixing up the CSV tool so that you can give it a spin:

https://git.lib.sfu.ca/mjordan/islandora_remote_resource_batch_tools/tree/master/csv

As I mentioned above, this tool is intended to be a more robust replacement for the --url_list option in the batch module.

mjordan commented 6 years ago

Latest...

I've also added a GUI for loading a zip containing the data files, and resolved a few issues. I'll be porting over the remaining issues from gitlab to the GitHub repo soon. I'll be applying all updates to the GitHub versions and deprecating the gitlab repos from now on.

mjordan commented 6 years ago

OK, all ported to GitHub other than my notes for the code4lib LT.

flummingbird commented 6 years ago

awesome! I'm going to try it out !!! Thanks a ton Mark!

-Will

On Wed, Nov 8, 2017 at 12:12 AM, Mark Jordan notifications@github.com wrote:

OK, all ported to GitHub other than my notes for the code4lib LT.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/MarcusBarnes/mik/pull/432#issuecomment-342720864, or mute the thread https://github.com/notifications/unsubscribe-auth/ADe_308QAnskudMMzla18a7wWBn-wNVDks5s0UYygaJpZM4OdH6P .

mjordan commented 6 years ago

@flummingbird I'm going to demo this solution pack at our local code4lib on Friday. If you've had a chance to test it out I'd love to get some feedback. No worries if you haven't though.

jpeak5 commented 6 years ago

@mjordan I replied via email, but since it hasn't shown up here after an hour, I'm gonna paste my reply below...


Mark,

I'm sorry I haven't written sooner to say that this is fantastic! - exactly the content type we need. We looked at this in the week before the Thanksgiving break, and we have added some classes and some drush functionality that will let us harvest directly from the command line.

https://github.com/mjordan/islandora_solution_pack_remote_resource/compare/7.x...lsulibraries:FR-oai

In the short-term, I expect that that is how we will run the ingest, but I have begun planning for a GUI sync framework that will allow management and reporting over collections that use the Remote Resource in addition to automated (likely cron) updates.

Thank you so very much for this, again! jrp

mjordan commented 6 years ago

@jpeak5 awesome. I agree that OAI is a very important source of data for this solution pack. I put together some helper scripts, https://github.com/mjordan/islandora_remote_resource_batch_tools, which include an MIK post-write hook script to convert that output of its OAI crawl to a format that the batch loader could use as input. But, including OAI functionality directly in the batch loader makes a lot of sense and removes the dependency on MIK.