digirati-co-uk / iiif-manifest-editor

Create new IIIF Manifests. Modify existing manifests. Tell stories with IIIF.
https://manifest-editor.digirati.services/
MIT License

Import metadata #202

Closed: tomcrane closed 1 year ago

tomcrane commented 2 years ago

Rather than manually adding labels and metadata, load descriptive metadata from external sources (like a MARC record) and populate fields automatically.

tomcrane commented 2 years ago

This feature depends on #149

tomcrane commented 2 years ago

Canadiana Comments:

1. An app to allow for harvesting metadata from OAI-PMH servers

Available only for manifests that define a seeAlso link whose destination is one of a set of XML types (OAI-PMH-wrapped MARC or Dublin Core, or URLs to static files containing those formats).

In the Digirati OAI-PMH harvesting app, you can pull for the currently opened manifest.
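The availability rule above could be checked roughly as follows. This is only a sketch: it assumes seeAlso is an array of entries as in IIIF Presentation 3, and the list of media types treated as XML (XML_FORMATS) and the function names are assumptions made for illustration.

```typescript
// Minimal shape of a seeAlso entry (only the fields used here).
interface SeeAlso {
  id: string;
  type: string;
  format?: string;
  profile?: string;
}

// Assumption: these media types count as "XML" for harvesting purposes.
const XML_FORMATS: string[] = ["text/xml", "application/xml"];

// Return the seeAlso entries the harvesting app could pull from.
function harvestableSeeAlso(manifest: { seeAlso?: SeeAlso[] }): SeeAlso[] {
  return (manifest.seeAlso ?? []).filter(
    (entry) =>
      entry.format !== undefined && XML_FORMATS.indexOf(entry.format) !== -1
  );
}

// The app would only be offered when at least one entry qualifies.
function isHarvestAvailable(manifest: { seeAlso?: SeeAlso[] }): boolean {
  return harvestableSeeAlso(manifest).length > 0;
}
```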

Workflow

Initial Creation

Making changes to metadata:

2. Bulk harvesting of OAI-PMH metadata for many manifests

We think being able to paste/select a list of manifest identifiers and pull the metadata for that list would also be useful for our staff. The editor would be able to loop through a set of identifiers, load the document, make the same change to a set of documents and re-save. This relates to that type of bulk operation. It all comes down to how to specify the list of identifiers (IIIF collections may be an obvious choice, given that an unordered collection is simply a list of identifiers of collections and/or manifests).
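Both ways of specifying the list could reduce to the same thing, an array of manifest identifiers. A minimal sketch, assuming a Presentation 3 style collection whose items carry id and type; the function names are hypothetical.

```typescript
interface CollectionItem {
  id: string;
  type: string; // "Manifest" or "Collection"
}

interface Collection {
  id: string;
  type: string;
  items?: CollectionItem[];
}

// Separate the manifests from the nested collections, so the caller can
// decide whether to load and walk the nested collections as well.
function splitCollection(collection: Collection): { manifests: string[]; collections: string[] } {
  const manifests: string[] = [];
  const collections: string[] = [];
  for (const item of collection.items ?? []) {
    if (item.type === "Manifest") manifests.push(item.id);
    else if (item.type === "Collection") collections.push(item.id);
  }
  return { manifests, collections };
}

// Turn a pasted block of text (one identifier per line) into a list,
// dropping blank lines and duplicates while keeping the original order.
function parsePastedIdentifiers(text: string): string[] {
  const seen: Record<string, boolean> = {};
  const ids: string[] = [];
  for (const line of text.split(/\r?\n/)) {
    const id = line.trim();
    if (id !== "" && !seen[id]) {
      seen[id] = true;
      ids.push(id);
    }
  }
  return ids;
}
```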

3. Develop best practices for common data formats (Dublin Core, MARC)

Step 1

Create a Metadata Transformation API based on our existing MARC and DC schemes and parser: https://github.com/crkn-rcdr/cihm-metadatabus/tree/main/CIHM-Meta/lib/CIHM/Meta/dmd

It takes our format, then spits out IIIF metadata fields.
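To make "takes our format, spits out IIIF metadata fields" concrete, here is a small sketch of the output side. It assumes the source record has already been parsed into Dublin Core element names and values; the DC_LABELS table and the choice of "none" as the value language are illustrative assumptions, not the CRKN mapping.

```typescript
type LanguageMap = Record<string, string[]>;

interface MetadataPair {
  label: LanguageMap;
  value: LanguageMap;
}

// Hypothetical mapping from Dublin Core element names to display labels.
const DC_LABELS: Record<string, string> = {
  title: "Title",
  creator: "Creator",
  date: "Date",
  subject: "Subject",
};

// Convert parsed Dublin Core fields into IIIF metadata label/value pairs.
// Elements without a mapping, or without values, are skipped.
function dublinCoreToIiifMetadata(
  record: Record<string, string[]>,
  language: string = "en"
): MetadataPair[] {
  const pairs: MetadataPair[] = [];
  for (const element of Object.keys(record)) {
    if (!Object.prototype.hasOwnProperty.call(DC_LABELS, element)) continue;
    const values = record[element];
    if (values.length === 0) continue;
    pairs.push({
      label: { [language]: [DC_LABELS[element]] },
      // Assumption: source values carry no language information.
      value: { none: values },
    });
  }
  return pairs;
}
```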

Step 2

Hook up the OAI-PMH harvester app in the editor to our Metadata Transformation API. The harvester app will take the response and save it to the manifest.

Step 3

Demo this functionality to IIIF community

Step 4

Work with the community to develop transformations from MARCXML and DC to appropriate IIIF fields.

tomcrane commented 2 years ago

Comments on 1. An app to allow for harvesting metadata from OAI-PMH servers

Note 1

Assumption: This step is you (humans) using the Sorting Room UI (#43) to split the "Reel" manifest into parts. During this process, you might give each manifest a label (for your own convenience) but little or no other metadata. But you will give it a seeAlso pointing at some metadata, even if you haven't yet created that metadata. The seeAlso might be a 404 at this stage (the Manifest Editor can warn about it but won't treat it as invalid).

The end result is that each created Manifest ends up with a seeAlso property something like:

    {
      "id": "https://crkn.ca/library/catalog/book1.xml",
      "type": "Dataset",
      "label": { "en": [ "Bibliographic Description in XML" ] },
      "format": "text/xml",
      "profile": "https://crkn.ca/profiles/bibliographic"
    }

Note 2

This is where the targets of the seeAlso are created. There is now metadata at the other end of https://crkn.ca/library/catalog/book1.xml, not a 404.

Note 3

We're assuming that this is a manual process. In order for a Manifest to acquire new metadata, you have to open it in the editor. Then you flip to the "OAI-PMH" app.

This app has very little UI, because it's going to pull from the Manifest's seeAlso. What it could do is offer a text box prepopulated with the URL of that seeAlso, and a button Import Metadata.

If there is more than one seeAlso, this app could:

Either way, when the user presses the button, the app is going to try to fetch from the displayed URL.

This app can be generic, and through configuration offer options something like:

    {
      "label": { "en": ["OAI-PMH importer"] },
      "useSeeAlso": true,
      "permittedProfiles": [ "https://crkn.ca/profiles/bibliographic" ],
      "callBack": "myOaiPmhMarcMetadataHandler"
    }

... where if useSeeAlso is false you just get an empty box to provide a URL; if permittedProfiles is empty, any seeAlso is allowed; and callBack is a function name that will be in scope.
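Those three rules could be read as the following sketch, which works out which URLs the app would offer in its text box. The interface mirrors the straw-person configuration; candidateUrls is a hypothetical name.

```typescript
interface SeeAlso {
  id: string;
  profile?: string;
}

interface ImporterConfig {
  label: Record<string, string[]>;
  useSeeAlso: boolean;
  permittedProfiles: string[];
  callBack: string;
}

// URLs to offer to the user. An empty result means the box starts empty
// and the user has to provide a URL.
function candidateUrls(config: ImporterConfig, seeAlso: SeeAlso[]): string[] {
  if (!config.useSeeAlso) return [];
  return seeAlso
    .filter(
      (entry) =>
        // An empty permittedProfiles list allows any seeAlso.
        config.permittedProfiles.length === 0 ||
        (entry.profile !== undefined &&
          config.permittedProfiles.indexOf(entry.profile) !== -1)
    )
    .map((entry) => entry.id);
}
```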

TODO: how does it come into scope... drop it into a directory? What are the conventions? You should be able to write a plain JavaScript implementation of the callback. But we still might insist on ES6+ and modules, not just a function in a file.
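One possible answer to the TODO, purely as a straw person: a module registers its handler under the name used in the callBack setting, and the generic plugin looks it up by that name. registerHandler and resolveHandler are hypothetical names; nothing like this exists in the editor yet.

```typescript
// Signature follows the straw-person callback; vault is left as unknown here.
type MetadataHandler = (
  source: string,
  body: string,
  status: number,
  id: string,
  vault: unknown
) => unknown;

const handlers: Record<string, MetadataHandler> = {};

// Called by the module that implements a handler.
function registerHandler(name: string, handler: MetadataHandler): void {
  if (Object.prototype.hasOwnProperty.call(handlers, name)) {
    throw new Error(`Handler already registered: ${name}`);
  }
  handlers[name] = handler;
}

// Called by the generic plugin with the callBack value from its config.
function resolveHandler(name: string): MetadataHandler {
  if (!Object.prototype.hasOwnProperty.call(handlers, name)) {
    throw new Error(`No handler registered under: ${name}`);
  }
  return handlers[name];
}
```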

Note 4

This is where the function specified in callBack is called:

(this is all tbc/straw person - probably don't want to tie this to a fetch Response object, might be used in other scenarios)

    // This is what the generic plugin will call:
    myMetadataHandler(source: string, body: string, status: number, id: string, vault: Vault)
    // source: the URL that was called to obtain...
    // body: the response from that URL as a string (if there is one)
    // status: the HTTP status code of that response
    // id: the id of the manifest being edited, which the callback can use to obtain the resource from...
    // vault: the Vault instance that the callback needs to modify with the data it pulls out of the body.

The generic plugin has obtained the content from the external URL. This is where CRKN writes a chunk of JavaScript or TypeScript that understands the format of body and can assign that data to the Manifest.
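A toy version of such a callback might look like the sketch below. Several things here are assumptions: VaultLike is a stand-in interface with a made-up setLabel method (not the real Vault API), the handler name is hypothetical, the body is assumed to contain a Dublin Core title element, and a regular expression is used in place of a real XML parser (so entities are not decoded).

```typescript
// Stand-in for the part of the Vault this sketch needs; NOT the real Vault API.
interface VaultLike {
  setLabel(manifestId: string, label: Record<string, string[]>): void;
}

// Hypothetical handler: pull the first dc:title out of the body and use it
// as the manifest label. Does nothing unless the fetch returned 200.
function myDublinCoreTitleHandler(
  source: string,
  body: string,
  status: number,
  id: string,
  vault: VaultLike
): void {
  if (status !== 200) return;
  const match = /<dc:title[^>]*>([^<]*)<\/dc:title>/.exec(body);
  if (match === null) return;
  const title = match[1].trim();
  if (title !== "") {
    // Assumption: the source title carries no language information.
    vault.setLabel(id, { none: [title] });
  }
}
```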

Note 5

The plugin itself doesn't distinguish between create and update for fields. If you want that logic, implement it in the callback.

But that implies that you might want to disallow the attempt to populate. The callback can return something that indicates failure and a reason for the failure...

    {
      "success": false,
      "error": { "en": ["This manifest already has a label"] }
    }

(That's an unrealistic example as you most likely would want to update label from source metadata, but it shows that you could reject the attempt for whatever reason.)
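A create-only policy inside a callback could then be as small as this sketch. HandlerResult follows the straw-person return shape above; applyLabelIfMissing is a hypothetical helper, and the manifest is modelled as a plain object rather than something held in the Vault.

```typescript
type LanguageMap = Record<string, string[]>;

interface HandlerResult {
  success: boolean;
  error?: LanguageMap;
}

// Create-only policy: set the label if there is none, otherwise refuse
// and tell the plugin why.
function applyLabelIfMissing(
  manifest: { label?: LanguageMap },
  incoming: LanguageMap
): HandlerResult {
  const hasLabel =
    manifest.label !== undefined && Object.keys(manifest.label).length > 0;
  if (hasLabel) {
    return {
      success: false,
      error: { en: ["This manifest already has a label"] },
    };
  }
  manifest.label = incoming;
  return { success: true };
}
```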

For Digirati to do

For crkn to do

More discussion

Other ways this can work...

tomcrane commented 2 years ago

Comments on 2. Bulk harvesting of OAI-PMH metadata for many manifests

This seems like a task just as suited to a non-visual batch processing tool as the Manifest Editor.

If it is to be a custom plugin/app for the Manifest Editor, it's not the same plugin as mentioned above. But it might use the same callback! It's also doing something different from most apps that you would have in the Manifest Editor, whose job is to contribute something to the resource being maintained by the Shell (and using the shell's application services).

This app doesn't really care about the application services. All it needs to do is:

If there is one unambiguous seeAlso for each of those manifests then it doesn't need any user interaction as it processes its list. But if there isn't then... does it fail? stop and ask the user to pick/provide one?
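Whatever the answer, the bulk tool has to tell three cases apart before it decides to proceed, fail, or ask. A minimal sketch, with hypothetical names, that classifies the (already filtered) seeAlso URLs of one manifest:

```typescript
type SeeAlsoSelection =
  | { kind: "selected"; url: string }
  | { kind: "none" }
  | { kind: "ambiguous"; urls: string[] };

// Exactly one candidate can be processed without user interaction;
// zero or several candidates need a policy decision (fail, skip, or ask).
function selectSeeAlso(candidateUrls: string[]): SeeAlsoSelection {
  if (candidateUrls.length === 0) return { kind: "none" };
  if (candidateUrls.length === 1) return { kind: "selected", url: candidateUrls[0] };
  return { kind: "ambiguous", urls: candidateUrls };
}
```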

In the previous comment, we didn't talk about how the user would save the updated manifest, because it's out of the scope of updating the manifest with data from an external source. It's part of the scope of #184.

But this tool is different: it needs to publish each manifest as it churns through that list. It needs to call up to the Shell and invoke the Publish command, making use of whatever storage/publish targets the current Manifest Editor instance has configured (using the default if there is more than one). This implies it must use the Shell's vault: it needs to do what a human would do when loading one Manifest after another into the editor and publishing each before moving on to the next. That in turn implies the application services allow contained apps to invoke Publish for the resource-currently-being-edited, and to change the resource-currently-being-edited, rather than reserving those operations solely for the user interacting with the Shell.
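In other words, the bulk app would drive the Shell the way a person would. The sketch below assumes a hypothetical ShellServices interface with open, importMetadata and publish operations; no such application services are defined yet, and real ones would almost certainly be asynchronous (they are synchronous here only to keep the example short).

```typescript
// Hypothetical application services exposed by the Shell to a contained app.
interface ShellServices {
  open(manifestId: string): void; // change the resource-currently-being-edited
  importMetadata(): boolean; // run the import for the open resource; false on failure
  publish(): void; // publish the open resource to the default target
}

interface BulkReport {
  published: string[];
  failed: string[];
}

// Open, import and publish each manifest in turn. A failure on one
// manifest is recorded and the loop carries on with the next.
function bulkHarvest(manifestIds: string[], shell: ShellServices): BulkReport {
  const report: BulkReport = { published: [], failed: [] };
  for (const id of manifestIds) {
    try {
      shell.open(id);
      if (!shell.importMetadata()) {
        report.failed.push(id);
        continue;
      }
      shell.publish();
      report.published.push(id);
    } catch (e) {
      report.failed.push(id);
    }
  }
  return report;
}
```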

tomcrane commented 2 years ago

Notes on 3. Develop best practices for common data formats (Dublin Core, MARC)

These are all good activities to do but we think the approach in 1) means that they can be independent - we don't need to do anything specific in the core Manifest editor.

A callback implementation can demonstrate mapping MARC to IIIF.

(i.e., there's no specific development task to estimate for this one).

RussellMcOrmond commented 1 year ago

Internally at CRKN we are talking about having this be a PUSH from our metadata software, rather than a PULL from the metadata editor. This is partly because of the workflows we are using, and the desire to allow metadata to be updated by people using only metadata-related software. For this we would only need to come up with a common API for managing presentation documents (#184).

There may be a desire for different staff members to initiate the PUSH while looking at a specific record. In this case your suggestion of "we provide an impl of the callback that doesn't do the processing itself but uses iFrame and postMessage to hand off the processing to something else" would be sufficient.

We would need to implement something in our metadata service that would return JSON for a given ID, which would then patch the current document being edited. Otherwise the editor would need to save, the storage URL would need to be sent as part of the request, and the document would then need to be reloaded, which seems messy.
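As a rough illustration of "return JSON, then patch the current document", the sketch below merges a small patch into a manifest held as a plain object. The patch shape (label and metadata only) and the function name are assumptions; the actual JSON returned by the metadata service has not been defined.

```typescript
type LanguageMap = Record<string, string[]>;

interface MetadataPair {
  label: LanguageMap;
  value: LanguageMap;
}

// Assumed shape of the JSON returned by the metadata service for one ID.
interface MetadataPatch {
  label?: LanguageMap;
  metadata?: MetadataPair[];
}

interface EditableManifest {
  id: string;
  type: string;
  label?: LanguageMap;
  metadata?: MetadataPair[];
  [key: string]: unknown;
}

// Return a copy of the manifest with the patched fields replaced.
// Fields absent from the patch, and everything else, are left as they were.
function applyMetadataPatch(
  manifest: EditableManifest,
  patch: MetadataPatch
): EditableManifest {
  const result: EditableManifest = { ...manifest };
  if (patch.label !== undefined) result.label = patch.label;
  if (patch.metadata !== undefined) result.metadata = patch.metadata;
  return result;
}
```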

brittnylapierre commented 1 year ago

Our Preferred Solution for Metadata Pull: We provide an impl of the callback that doesn't do the processing itself but uses iFrame and postMessage to hand off the processing to something else.
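A sketch of what the hand-off messages might look like. The message type names and fields are invented for illustration, and the transport is left out so the logic stands alone; in a browser the request would be sent with iframe.contentWindow.postMessage(request, trustedOrigin) and the reply would arrive in a window "message" event, whose data and origin are what isResponseFor checks.

```typescript
type LanguageMap = Record<string, string[]>;

// Hypothetical message sent from the callback to the embedded iframe.
interface HandoffRequest {
  type: "metadata-request";
  manifestId: string;
  source: string;
}

// Hypothetical reply posted back by the iframe once processing is done.
interface HandoffResponse {
  type: "metadata-response";
  manifestId: string;
  label?: LanguageMap;
}

function buildRequest(manifestId: string, source: string): HandoffRequest {
  return { type: "metadata-request", manifestId, source };
}

// Accept a message only if it comes from the trusted origin, has the
// expected type, and refers to the manifest the request was made for.
function isResponseFor(
  request: HandoffRequest,
  data: unknown,
  origin: string,
  trustedOrigin: string
): data is HandoffResponse {
  if (origin !== trustedOrigin) return false;
  if (typeof data !== "object" || data === null) return false;
  const candidate = data as Record<string, unknown>;
  return (
    candidate.type === "metadata-response" &&
    candidate.manifestId === request.manifestId
  );
}
```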

Notes on 3. Develop best practices for common data formats (Dublin Core, MARC) ... Agreed - we can share our #2 processing solution widely to promote the metadata format we use

tomcrane commented 1 year ago

(discovery work for project)

If metadata is always pushed into Manifests by automated processes, then the Manifest Editor doesn't need any additional components, and the following statement (from me earlier) is false:

"In order for a Manifest to acquire new metadata, you have to open it in the editor."

But https://github.com/digirati-co-uk/iiif-manifest-editor/issues/202#issuecomment-1241288190 suggests that sometimes you'll want to cause that missing metadata to appear in the manifest right now. Perhaps you're in the middle of a complex edit.

MVP: You never open the manifest in the editor to fill in this metadata. The manifest is always "edited" out of sight of the Manifest Editor, by external processes.

Simple PULL scenario: You open the manifest, and realise that this metadata block is outdated/missing. So you close the editor without doing anything, and go to a different service to manually force the update. You then open the manifest again (from its storage URL) and observe that the metadata has been modified in your absence (or you see it's still unchanged, so you close again and come back later). (This is technically the same as the MVP, just that the user initiated the change while editing).

More complex PULL scenarios, where you need to pull the data in: to be elaborated, custom component.

tomcrane commented 1 year ago

(not in scope)