esmero / ami

Archipelago Multi Importer. A module of mass ingest made for the masses
GNU Affero General Public License v3.0

Add a FCREPO/Solr import plugin to help with Islandora 7 migrations #21

Closed DiegoPino closed 1 week ago

DiegoPino commented 3 years ago

What is this?

We got a lot of requests for this, so it's about time. This issue is to explain the design and how I plan on doing this; happy to get feedback, feature requests, questions, etc.

How?

The idea here is to add another AMI Source Plugin that can deal with data coming directly from Solr (but not limited to it, just as a start). The plugin will integrate with the current setup form the same way Google Sheets and CSV work right now, and will provide the following options:

  1. Server URL to your core
  2. A predefined Islandora Profile with an advanced override
     2.1. Collection PID / Top Object PID (e.g. book) to import from (one at a time)
     2.2. Filter by CMODELs
     2.3. Binary Datastream(s) per CMODEL to fetch files from
     2.3.1 (NEW). HOCR and other derived datastreams could be marked as SBFlavors and go into Solr directly. We can start with HOCR first and then think about how that fits others. I wonder if we could pass the responsibility (a 'does it apply' method) to each Strawberry Runners plugin, or have one special plugin that automatically takes ingested files matching certain criteria and does the work without AMI-level settings. Thanks @noahwsmith
     2.4. Automatically build the remote URL for datastream fetching
     2.5. Offset and number of objects to fetch (this will be per top object, because why would you want to limit the number of pages per book?)
  3. Advanced Override will include:
     3.1. Select membership relations (the default profile provides the most common fields already)
     3.2. Do a shallow import (the default profile is a deep import)
     3.3. Custom filter (as a comma-separated list of fields/values)
     3.4. What fields to return (the default profile is fgs*, datastream data based on selected DSIDs, mods*)
     3.5. Make Compounds/Books a single ADO or multiple ADOs. Default is (guess what) a single ADO.
     3.6. Use the PID (if UUID based) as the UUID for the new ADO.
     3.7. Use the PID (if UUID based or numeric) as the UUID seed for a UUIDv5 for the new ADO (this hashes the PID, so every time we ingest the same set we get the same ADO UUID. Cool, right?). See the sketch after this list.
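As a rough illustration of what 2.1, 2.2, 2.5, and 3.7 could look like against an Islandora 7 Solr core, here is a minimal Python sketch. The Solr field names (`RELS_EXT_isMemberOfCollection_uri_ms`, `RELS_EXT_hasModel_uri_ms`, `fgs_*`, `mods_*`), the core URL, and the UUIDv5 namespace are assumptions for illustration only, not AMI's actual implementation.

```python
# Hypothetical sketch, not AMI code. Field names, core URL, and the UUIDv5
# namespace are assumptions; a typical Islandora 7 / GSearch schema is implied.
import uuid
import requests

SOLR = "http://localhost:8080/solr/collection1"   # 1. Server URL to your core (assumed)

def fetch_objects(collection_pid, cmodels, offset=0, rows=50):
    """Query the core for members of a collection (2.1), filtered by CMODEL (2.2)."""
    params = {
        "q": f'RELS_EXT_isMemberOfCollection_uri_ms:"info:fedora/{collection_pid}"',
        "fl": "PID,fgs_*,mods_*",   # 3.4: default field list (assumed names)
        "start": offset,            # 2.5: offset
        "rows": rows,               # 2.5: number of objects to fetch
        "wt": "json",
    }
    if cmodels:
        params["fq"] = " OR ".join(
            f'RELS_EXT_hasModel_uri_ms:"info:fedora/{c}"' for c in cmodels
        )
    resp = requests.get(f"{SOLR}/select", params=params, timeout=30)
    resp.raise_for_status()
    return resp.json()["response"]["docs"]

def ado_uuid_from_pid(pid):
    """3.7: derive a stable UUIDv5 from the PID so re-ingesting the same set
    yields the same ADO UUID. The namespace choice here is arbitrary."""
    return str(uuid.uuid5(uuid.NAMESPACE_URL, pid))

for doc in fetch_objects("islandora:bookCollection", ["islandora:bookCModel"]):
    print(doc["PID"], ado_uuid_from_pid(doc["PID"]))
```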

Any other feature you feel is missing here? Concerns?

Note: Mapping/etc will be the same as in any other AMI setup

The output CSV will already contain columns for documents, images, label, UUID, and parentship relations, computed for you by the plugin.
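Purely as a loose illustration of that output shape (the column names below are assumptions, not AMI's actual schema), the plugin-computed CSV could look something like this:

```python
# Hypothetical sketch of writing the plugin-computed CSV header and one row.
# Column names and values are placeholders for illustration only.
import csv

with open("ami_solr_export.csv", "w", newline="") as fh:
    writer = csv.writer(fh)
    writer.writerow(["label", "documents", "images", "uuid", "ismemberof"])
    writer.writerow([
        "My migrated book",
        "",                                       # document file URLs, if any
        "https://repo.example.edu/islandora/object/islandora:1234/datastream/OBJ/download",
        "9c1d1d9a-0000-5000-8000-000000000000",   # e.g. a UUIDv5 derived from the PID
        "b2b8f1c2-0000-5000-8000-000000000000",   # parent ADO UUID (membership relation)
    ])
```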

Future work: make file URIs/URLs/paths computable via a template. That gives @giancarlobi full control to override the paths for each file and use already-existing local paths.

@dmer @patdunlavey @giancarlobi @alliomeria please add your suggestions/comments here. Thanks

noahwsmith commented 3 years ago

This would be magic. Should I interpret 2.3 as being able to move OCR/HOCR without the need to regenerate?

DiegoPino commented 3 years ago

Oh, sure! HOCR (good catch) should become Strawberry Flavor Data Sources too => Solr Docs. I will work on designing that part too. Added a comment as 2.3.1 for that case.

patdunlavey commented 3 years ago

@DiegoPino If this could be delivered quickly, it could change our whole migration ballgame for CAR. In particular, the ability to avoid the export step from their old system would make it hugely more efficient, both in the bulk migration of already existing content and in the final period prior to cutover, when we need to update the new site with the most recent additions and changes from the old one. It could cut our content freeze down to literally nothing. So, yes, huge!

Am I right that, for binary file fetching, it would be essentially repository-platform agnostic (e.g. Fedora)? Where would the logic reside for mapping Solr field value(s) (e.g. PID plus datastream ID) to a file URI? In Twig?

DiegoPino commented 3 years ago

@patdunlavey in the 'default profile' or simple mapping we will assume a normal /datastream/DSID/download endpoint and build it for the user, but the advanced option will provide you with a for-this-purpose-only Twig template to mangle/transform your endpoints the way you need/want. It will require docs.
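For orientation, a minimal Python sketch of that default URL building, assuming the common Islandora 7 endpoint pattern (your site may differ, which is exactly what the Twig override is for):

```python
# Rough sketch of the "default profile" URL building described above.
# The /islandora/object/{PID}/datastream/{DSID}/download pattern is assumed.
def datastream_url(base_url, pid, dsid):
    return f"{base_url.rstrip('/')}/islandora/object/{pid}/datastream/{dsid}/download"

print(datastream_url("https://repo.example.edu", "islandora:1234", "OBJ"))
# https://repo.example.edu/islandora/object/islandora:1234/datastream/OBJ/download
```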

How fast can I do this? Send coffee!! I already started; this goes well with the amount of time I'm already devoting to AMI, so it should not be slow (expect 1-2 weeks for a fully working version at least).

patdunlavey commented 3 years ago

I spoke quickly, but perhaps not too quickly, about the applicability of this to the CAR migration use case. The Islandora Solr instance does not have a lot of the data that we need to migrate (workflow metadata, mostly). That data lives on the Drupal side. However, it seems like, if we add search_api_solr to the Drupal site (either on the same core as Islandora, or on a separate core or Solr container) and index all the Drupal fields, we could use the AMI Solr source. Can you think of any reason why not @DiegoPino ?

DiegoPino commented 3 years ago

@patdunlavey I do not see why not, but this would be a separate plugin profile/work or Advanced Setup, I guess, since Solr fields/data for Drupal will differ vastly from what I can expect from a Fedora GSearch-driven config. If you manage to index all your D7 fields in Solr we can do some testing; I would need that to be something you drive and document using the features we provide here. Hope that makes sense. Let's start with Solr indexing from D7 and go from there.

DiegoPino commented 3 years ago

Note: @patdunlavey I would also go with a different core. Your Islandora core will confuse the hell out of D7's Search API.

DiegoPino commented 3 years ago

Also, does this imply you also want a D8 migrator? For Islandora 8? 👀 Because if this works for your D7, it will also work for the other one.

patdunlavey commented 3 years ago

Thanks @DiegoPino. Yes, understood that this would be on us to drive and document. In principle it seems like it should not be difficult.

Here's my thinking/understanding. I would expect that the new Solr plugin would simply be feeding each object/entity's Solr field data into an array, keyed by Solr field name. Each Solr field's data could be a number, string, or array. Then it would send that array to the Twig template, where the user would be responsible for mapping it to JSON. For filtering of the source data, we would need to provide a place to enter a Solr query string. It would also make sense for the user to be able to provide a field list parameter. Perhaps these would be in addition to what you were thinking would be needed in the UI for the Islandora migration?
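A minimal Python sketch of the flow described in that paragraph, strictly illustrative: the function names, the Drupal/search_api_solr field names, and the stand-in for the Twig template are all assumptions, not AMI's actual API.

```python
# Illustrative sketch only: feed each Solr document into a dict keyed by field
# name, filter with a user-supplied query string and optional field list, then
# hand the dict to a template that maps it to JSON (Twig in AMI, stubbed here).
import json
import requests

def fetch_docs(solr_core_url, user_query, field_list=None, rows=100):
    """Filter source data with a Solr query string and an optional field list."""
    params = {"q": user_query, "rows": rows, "wt": "json"}
    if field_list:
        params["fl"] = ",".join(field_list)
    resp = requests.get(f"{solr_core_url}/select", params=params, timeout=30)
    resp.raise_for_status()
    return resp.json()["response"]["docs"]   # each doc: field name -> number/string/list

def example_template(doc):
    """Stand-in for the user's Twig template; field names here are made up."""
    return json.dumps({
        "label": doc.get("label"),
        "workflow_state": doc.get("sm_field_workflow_state"),
    })

# Example usage against a hypothetical D7 search_api_solr core:
# for doc in fetch_docs("http://localhost:8983/solr/drupal_d7", "ss_type:node",
#                       ["label", "sm_field_*"]):
#     print(example_template(doc))
```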

I'm not thinking that there needs to be a "D7 migrator", or a D8 one. Documentation, definitely, which we will do. But unless I'm really misunderstanding something (very possible!), I don't see where bundling a special profile would make sense.

Thanks for the recommendation to use a separate Solr core. I had assumed that would make the most sense, but without knowing why!

DiegoPino commented 1 week ago

Resolved