Islandora / documentation

Contains Islandora's documentation and main issue queue.
MIT License

Investigate using OpenRefine as part of the migration process #898

Open dannylamb opened 6 years ago

dannylamb commented 6 years ago

Many interested parties have mentioned using http://openrefine.org/ to clean up metadata before migrating to CLAW. We need users to investigate how it can be utilized to find external URIs for existing authorities and how it can clean up our MODS for us.

We also need to find out how to interact with it and where it belongs in the migration process. If we can call out to it over HTTP via an API, we may be able to integrate it using Drupal's migration framework. If not, it will have to be a step done before migration, while the data is still in 7.x or being worked through an export.
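If it does turn out to be callable over HTTP, the interaction might look something like this sketch (OpenRefine exposes a `/command/core/*` style HTTP API on port 3333 by default; the specific command and project ID below are illustrative and would need to be checked against OpenRefine's docs before relying on them):

```python
# Sketch of addressing a local OpenRefine instance over its HTTP command
# API, as one possible way to slot it into an automated workflow.
# The command name and project ID are illustrative assumptions.
import urllib.parse

OPENREFINE_URL = "http://127.0.0.1:3333"  # OpenRefine's default port

def command_url(command: str, **params: str) -> str:
    """Build a URL for an OpenRefine command endpoint."""
    query = urllib.parse.urlencode(params)
    return f"{OPENREFINE_URL}/command/core/{command}" + (f"?{query}" if query else "")

# e.g. fetch rows from an existing project (hypothetical project id):
url = command_url("get-rows", project="1234567890", limit="10")
```

A Drupal migrate plugin could compose URLs like this and hand the HTTP calls off to Guzzle on the PHP side.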

Download and install OpenRefine, then try it out and let us know what you think! More than one person can tackle this. No need for one person to hog all the metadata glory.

ajs6f commented 6 years ago

@dhlamb I don't see how it could be fully automated into a migration workflow. If nothing else, any actual mapping from "strings to things" is going to have a decent number of mistakes, and those will require human intervention. Or are you thinking of breaking up the automation to insert OpenRefine? In that case maybe a PHP-side client would be the bridge.

dannylamb commented 6 years ago

I'm like 99% sure this'll have to be something done before kicking off Drupal migrate, and something that you'd manually take to the extent you feel comfortable with, because you can do a lot with it. I don't think it's applicable to every migration, but enough folks have mentioned it that we should find a way to squeeze it into the workflow for those who want to use it. Even if that means "Just run OpenRefine first" and we can provide some guidance. If we're lucky, maybe we make a plugin that uses https://github.com/keboola/openrefine-php-client that people can turn on in their yml if they want it. But full automation is doubtful because everybody's repository is different.

ajs6f commented 6 years ago

Ok, cool, that's a much less ambitious / more practical approach than I thought was intended.

exsilica commented 6 years ago

Assigning myself @exsilica - lacking permissions for this repo

carakey commented 6 years ago

Assigning myself as well - @carakey

DigitLib commented 6 years ago

Maybe this tool? https://github.com/LibreCat/Catmandu I had a problem to export MODS to RDF..

carakey commented 6 years ago

Under development at LSU for converting from XML to CSV: https://github.com/lsulibraries/xml2csv
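For anyone curious what that kind of flattening involves, here is a minimal sketch (not the LSU tool itself, just an illustration) that walks a MODS-like document and emits one CSV row per leaf value, keyed by its element path:

```python
# Minimal sketch of flattening XML into path/value CSV rows, in the
# spirit of an xml2csv conversion. This is NOT the LSU tool.
import csv
import io
import xml.etree.ElementTree as ET

def flatten(elem, path=""):
    """Yield (path, text) pairs for every element with non-empty text."""
    here = f"{path}/{elem.tag}"
    text = (elem.text or "").strip()
    if text:
        yield (here, text)
    for child in elem:
        yield from flatten(child, here)

mods = ET.fromstring(
    "<mods><titleInfo><title>Example Record</title></titleInfo>"
    "<name><namePart>Doe, Jane</namePart></name></mods>"
)

out = io.StringIO()
writer = csv.writer(out)
writer.writerow(["xpath", "value"])
for row in flatten(mods):
    writer.writerow(row)
```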

mbolam commented 6 years ago

I'm happy to jump in on this one, too -- but don't seem to have the permissions -- @mbolam

rtilla1 commented 6 years ago

@rtilla1

amcshane commented 6 years ago

Happy to jump on this @amcshane --- also happy to help compile the notes once folks have a chance to explore.

dannylamb commented 6 years ago

@carakey @exsilica I've sent you invites to the Islandora-CLAW organization and can assign you this ticket once you accept.

Thanks for signing on everyone!

mbolam commented 6 years ago

Regarding reconciliation using "Conciliator" -- https://github.com/codeforkjeff/conciliator. My troubles turned out to be related to Java versions: my Mac was pointing at an outdated one. Tested with the latest version of Java and it is working on my desktop. No need for developer support, assuming people can get Java 1.8 working on their devices.

amcshane commented 6 years ago

Are there any particular authority files folks want to make sure work? I've got a bunch of MARC that I can offer if anybody needs a bit of a mess to play with. It will not migrate prettily, I promise.

rtilla1 commented 6 years ago

@carakey and I were able to get xml2csv running on 15 of the sample MODS files Islandora 7.x users have provided to MIG. Here's the branch with the resulting files: https://github.com/rtilla1/xml2csv. Point of interest: the 15 well-formed MODS files together have 285 unique "fields" (every xpath that points to content and has a distinct combination of elements or attributes). There are holes in how these are being counted, but it's an interesting starting point.
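One way that "unique fields" count could be produced is to treat each distinct element path plus its attribute names as one field. A rough sketch of that counting (my reading of the approach, not the actual xml2csv code):

```python
# Sketch of counting unique "fields": each distinct element path plus
# its sorted attribute names counts as one field. Illustrative only.
import xml.etree.ElementTree as ET

def field_signatures(xml_string):
    """Return the set of path[@attrs] signatures for text-bearing elements."""
    sigs = set()
    def walk(elem, path=""):
        attrs = ",".join(sorted(elem.attrib))
        here = f"{path}/{elem.tag}" + (f"[@{attrs}]" if attrs else "")
        if (elem.text or "").strip():
            sigs.add(here)
        for child in elem:
            walk(child, here)
    walk(ET.fromstring(xml_string))
    return sigs

docs = [
    "<mods><titleInfo><title>A</title></titleInfo></mods>",
    "<mods><titleInfo type='alternative'><title>B</title></titleInfo></mods>",
]
# The same element under differently-attributed parents counts twice:
unique_fields = set().union(*(field_signatures(d) for d in docs))
```

This illustrates why the number balloons: any new attribute combination anywhere on the path creates a new "field".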

amcshane commented 6 years ago

@rtilla1 -- once you import the xml as a csv and complete the clean-up work, do you export as a csv and use another tool to recreate the MODS or are you using the Templating export feature in OpenRefine?

i.e. https://gist.github.com/sallain/7604ffb0c155294fcfaf

carakey commented 6 years ago

The xml2csv project at https://github.com/lsulibraries/xml2csv has been updated to use the current mapping spreadsheet from the MIG - i.e., only the mapped xpaths are included in the csv output. The latest incorporates most of the action items from the 8/28 call.

Example of output: 15 MODS files as CSV in Google sheets

If anyone takes it for a spin, I'd love feedback.

amcshane commented 6 years ago

@carakey I'll give it a shot tomorrow morning!

rtilla1 commented 6 years ago

Using OpenRefine as part of the migration process requires a number of steps, some of which are complete and some of which still have work remaining at this point.

  1. Export data from 7.x with CRUD as MODS
  2. Transform MODS with the xml2csv tool into very specific columns, with well-documented delimiters between compound or complex contents
  3. Transform resulting tsv into reconcilable data with OpenRefine
  4. Reconcile each column as appropriate (subjects against LCSH, MeSH, AAT, names against LCSH, VIAF, and WikiData, etc.), sometimes grabbing additional data about the term (such as personal or corporate name)
  5. Transform reconciled data into columns so that a string is available for each taxonomy term, along with optional namespace:code or namespace:term data, and other information about each term
  6. Export each type of vocabulary into its own CSV
  7. Re-transform the data (possibly removing the reconciliation data) so that the bibliographic records can be exported, making sure to carry over the PID from the CRUD export in Step 1.
  8. Import the vocabulary/taxonomy terms.
  9. Import the bibliographic records, re-matching them with the appropriate object.

Steps 2 (#913), 3 (#914), 4, 5, and 7 need work. Steps 1, 6, 8, and 9 are theoretically ready to go and have been tested with other applications or data.
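For step 4, tools like conciliator speak the OpenRefine Reconciliation Service API, which batches lookups as a JSON `queries` object. A sketch of building one such payload (the endpoint and the `/lcsh` type identifier are illustrative, not real conciliator values):

```python
# Sketch of a batch query for an OpenRefine-compatible reconciliation
# service (e.g. conciliator). The type id "/lcsh" is an illustrative
# placeholder, not a guaranteed conciliator identifier.
import json

def build_queries(terms, type_id=None, limit=3):
    """Return the JSON 'queries' payload the reconciliation API expects."""
    queries = {}
    for i, term in enumerate(terms):
        q = {"query": term, "limit": limit}
        if type_id:
            q["type"] = type_id
        queries[f"q{i}"] = q
    return json.dumps(queries)

payload = build_queries(["Baton Rouge (La.)", "Doe, Jane"], type_id="/lcsh")
```

The service's response maps each `qN` key to a list of candidate matches with `id`, `name`, and `score`, which is what step 5 would fold back into the columns.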

seth-shaw-unlv commented 6 years ago

So, this might be crazy talk, but could we get OpenRefine to export these records back out as MODS records in a single modsCollection and Agents as MADSXML? Then we won't have to deal with the nested name delimiters we've been talking about in Zoom meetings. I can migrate XML documents just as easily as (and sometimes more easily than) CSV.
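That round-trip could also be prototyped outside OpenRefine. A pared-down sketch that wraps flat records back into a single modsCollection (real MODS needs the full schema, version attributes, etc., which are omitted here):

```python
# Sketch of re-serializing flat records as one modsCollection, along
# the lines suggested above. Real MODS output needs the full schema;
# this only shows the wrapping structure.
import xml.etree.ElementTree as ET

MODS_NS = "http://www.loc.gov/mods/v3"
ET.register_namespace("mods", MODS_NS)

def to_mods_collection(records):
    """records: list of dicts like {'title': ..., 'name': ...}."""
    coll = ET.Element(f"{{{MODS_NS}}}modsCollection")
    for rec in records:
        mods = ET.SubElement(coll, f"{{{MODS_NS}}}mods")
        title_info = ET.SubElement(mods, f"{{{MODS_NS}}}titleInfo")
        ET.SubElement(title_info, f"{{{MODS_NS}}}title").text = rec["title"]
        if rec.get("name"):
            name = ET.SubElement(mods, f"{{{MODS_NS}}}name")
            ET.SubElement(name, f"{{{MODS_NS}}}namePart").text = rec["name"]
    return ET.tostring(coll, encoding="unicode")

xml = to_mods_collection([{"title": "Example", "name": "Doe, Jane"}])
```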

mbolam commented 6 years ago

@seth-shaw-unlv -- One could potentially do templating in OpenRefine to export as MODS and/or MADS.

http://digitalscholarship.utsc.utoronto.ca/content/blogs/converting-spreadsheets-modsxml-using-open-refine

I've played around with templating, but not used it extensively. It probably wouldn't be too tough to come up with a "basic version" that at least handles the core elements we've been considering for the sprint.

carakey commented 6 years ago

Templates do exist; this is the top hit for MODS templating and I’m pretty sure some of our consortium partners have built off of this version: https://gist.github.com/sallain/7604ffb0c155294fcfaf



amcshane commented 6 years ago

Were folks actually able to perform reconciliation with the provided test MODS? The version I'm seeing retains data about MARC subfields in the text, making reconciliation against LOC (for example) impossible.

My assumption was that each subfield needed its own column, as well -- not unlike creating MARC records from delimited text files.
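A first pass at that split can be automated. A hedged sketch that breaks a subfield-coded string into one column per subfield code (the `$a`/`$z` delimiter syntax is an assumption; real exports vary in how they render subfield markers):

```python
# Sketch of splitting a MARC-style subfield string into per-subfield
# columns, a prerequisite for clean reconciliation against LOC.
# The "$a...$z..." delimiter convention is assumed here.
import re

def split_subfields(value):
    """'$aBaton Rouge$zLouisiana' -> {'a': 'Baton Rouge', 'z': 'Louisiana'}."""
    parts = re.findall(r"\$(\w)([^$]*)", value)
    return {code: text.strip() for code, text in parts}

cols = split_subfields("$aLouisiana$zBaton Rouge")
```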