avalonmediasystem / avalon

Avalon Media System – Samvera Application
Apache License 2.0

Explore use of IIIF Content Search for transcript search within a media item #5734

Closed elynema closed 5 months ago

elynema commented 6 months ago


IIIF Content Search API 2.0 allows remote search within a single IIIF resource. The response is an item list containing a series of painting annotations that reference each annotation where a match occurred. Each annotation in the response appears to point back to a canvas (probably the same target as the originating annotation). The painting annotations do not appear to be aggregated by canvas, so we might have to do that ourselves.


IIIF Content Search also allows for paging of results, which should hopefully be easy to do by paging through Solr? This could be relevant for a media item with long transcripts and lots of hits.
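
A rough sketch of how result paging might map onto Solr, under several assumptions: the `page` query parameter, the page size of 10, and both function names are made up for illustration and are not taken from Avalon. `start` and `rows` are Solr's standard paging parameters, and `partOf`, `next`, and `startIndex` are the paging properties as I read them in Content Search 2.0.

```python
from urllib.parse import urlencode

PAGE_SIZE = 10  # assumed page size, not taken from the spec or from Avalon


def solr_paging(page, page_size=PAGE_SIZE):
    """Translate a 0-based result page number into Solr start/rows."""
    return {"start": page * page_size, "rows": page_size}


def paged_response(base_url, query, page, total, items, page_size=PAGE_SIZE):
    """Wrap one page of hits in Content Search 2.0 paging properties."""

    def page_id(n):
        # Hypothetical URL scheme: ?q=<term>&page=<n>
        return f"{base_url}?{urlencode({'q': query, 'page': n})}"

    last_page = max((total - 1) // page_size, 0)
    response = {
        "@context": "http://iiif.io/api/search/2/context.json",
        "id": page_id(page),
        "type": "AnnotationPage",
        "partOf": {
            "id": f"{base_url}?{urlencode({'q': query})}",
            "type": "AnnotationCollection",
            "total": total,
            "first": {"id": page_id(0), "type": "AnnotationPage"},
            "last": {"id": page_id(last_page), "type": "AnnotationPage"},
        },
        "startIndex": page * page_size,
        "items": items,
    }
    # Only advertise a next page when one exists.
    if page < last_page:
        response["next"] = {"id": page_id(page + 1), "type": "AnnotationPage"}
    return response
```

The `total` value would come from Solr's hit count for the query, and `items` from the documents returned for that `start`/`rows` window.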

IIIF Content Search also allows a series of annotations following the item list that provide hints on how to highlight matching text within the annotations where matches occur. This seems a bit cumbersome to me; it might be easier to handle with JavaScript in the page itself.
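
To give a sense of how small the do-it-ourselves highlighting could be, here is a minimal sketch of the logic. It is written in Python only to keep the idea compact; in practice this would live in the page's JavaScript. The function name and the whole-word matching rule are assumptions, not anything Ramp does today.

```python
import html
import re


def highlight(text, term):
    """Wrap case-insensitive whole-word matches of term in <em> tags.

    The surrounding text is HTML-escaped so the result is safe to insert
    into a page as markup.
    """
    if not term:
        return html.escape(text)
    pattern = re.compile(rf"\b{re.escape(term)}\b", re.IGNORECASE)
    pieces = []
    last = 0
    for match in pattern.finditer(text):
        pieces.append(html.escape(text[last:match.start()]))
        pieces.append(f"<em>{html.escape(match.group(0))}</em>")
        last = match.end()
    pieces.append(html.escape(text[last:]))
    return "".join(pieces)
```

The trade-off is that the client decides what counts as a match, which may not agree with how the server-side index tokenized or stemmed the text.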

Simple response example from IIIF site:

{
  "@context": "http://iiif.io/api/search/2/context.json",
  "id": "https://example.org/service/manifest/search?q=bird&motivation=painting",
  "type": "AnnotationPage",

  "items": [
    {
      "id": "https://example.org/identifier/annotation/anno-line",
      "type": "Annotation",
      "motivation": "painting",
      "body": {
        "type": "TextualBody",
        "value": "A bird in the hand is worth two in the bush",
        "format": "text/plain"
      },
      "target": "https://example.org/identifier/canvas1#xywh=100,100,250,20"
    }
    // Further matching annotations here ...
  ]
}
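
Since the response is a flat list, aggregating by canvas would be up to us. A minimal sketch, assuming every `target` is a plain URI string with an optional fragment, as in the example above (the function name and the output shape are made up for illustration):

```python
from collections import defaultdict
from urllib.parse import urldefrag


def group_by_canvas(annotation_page):
    """Group search-result annotations by the canvas their target points at.

    Assumes every target is a plain URI string, as in the example response.
    """
    grouped = defaultdict(list)
    for annotation in annotation_page.get("items", []):
        # Strip the fragment (#xywh=... or #t=...) to get the canvas URI.
        canvas, fragment = urldefrag(annotation["target"])
        grouped[canvas].append(
            {
                "annotation": annotation["id"],
                "fragment": fragment,
                "text": annotation["body"]["value"],
            }
        )
    return dict(grouped)
```

The length of each list then doubles as a per-canvas hit count.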

Done Looks Like

joncameron commented 6 months ago

This could be a Swarm topic to discuss as a team as well!

elynema commented 6 months ago

Currently, Ramp only loads transcript text when the user views the transcript. So if we implement search within via Ramp for all transcripts in a section, we'll have to have Ramp pre-load all the transcripts.

IIIF content search would likely provide an advantage over Javascript search in-browser if we want to search across all transcripts / annotations across all sections. We would not necessarily want to have to pre-load all that content to allow for a client-side search that searched across it all.

IIIF content search also provides reusable functionality in Ramp that could be used by other implementers. However, it may make sense to implement a first pass using a simple Javascript solution and then re-work later when the use case arises.

elynema commented 5 months ago

Our transcript annotations point out to an external file. What does the granularity of results look like? If we just return the entire transcript annotation as a result, we don't know anything about how many hits are within the transcript. All of our supplementing annotations are external files. Do the annotations returned as search results have to be symmetrical with annotations in the original manifest, or can we return annotations that are snippets within the original transcript?

Paging is optional, even if you have long lists of search results.

Could return search results as IIIF collection, and could utilize search API to search within them, possibly even to add the hit counts.

The key question is the granularity of the search results. Do we return the entire transcript annotation, or do we break it out by cue times? Some transcripts are plain text and do not have cues. Could TextQuoteSelector be used to point to character offsets, rather than just the text before / after the match? It looks like TextPositionSelector allows this; is it allowed by IIIF?
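
For the cue-less plain-text case, this sketch shows what the two Web Annotation selector types would carry for each match. The selector fields (`start`/`end`, `exact`/`prefix`/`suffix`) follow the W3C Web Annotation Data Model; the function name and the 20-character context window are assumptions, and whether Content Search clients would accept these selectors is still the open question.

```python
import re


def text_selectors(transcript, term, context=20):
    """Build a TextPositionSelector and TextQuoteSelector for each match."""
    selectors = []
    if not term:
        return selectors
    for match in re.finditer(re.escape(term), transcript, re.IGNORECASE):
        start, end = match.span()  # start is inclusive, end is exclusive
        selectors.append(
            {
                "position": {
                    "type": "TextPositionSelector",
                    "start": start,
                    "end": end,
                },
                "quote": {
                    "type": "TextQuoteSelector",
                    "exact": match.group(0),
                    "prefix": transcript[max(start - context, 0):start],
                    "suffix": transcript[end:end + context],
                },
            }
        )
    return selectors
```

The position selector gives exact character offsets but breaks if the transcript file is edited; the quote selector is more robust to edits but can be ambiguous when the same phrase repeats.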

Response can include annotations that reference different canvases as targets, and point to the original annotation, allowing referencing to different originating annotations.

IIIF content search explicitly searches annotations, but not metadata.

elynema commented 5 months ago

If we are searching all transcripts across all canvases, then it probably makes more sense to do it as IIIF content search rather than loading all those transcripts into memory.

For our first pass, we intended to implement searching across all transcripts for a single canvas, which might be easier to do client-side in memory rather than implementing IIIF content search.
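
A minimal sketch of what that in-memory first pass involves for WebVTT transcripts: parse each file into cues, then do a case-insensitive substring match per cue. It is written in Python to keep it short; the real thing would be JavaScript in Ramp. The function names and the result shape are hypothetical, and the parser only handles simple cues (no styling blocks or cue settings beyond what the timing regex tolerates).

```python
import re

# Matches "hh:mm:ss.ttt --> hh:mm:ss.ttt" (the hours part is optional in WebVTT).
CUE_TIMING = re.compile(
    r"(?P<start>(?:\d+:)?\d{2}:\d{2}\.\d{3})\s+-->\s+"
    r"(?P<end>(?:\d+:)?\d{2}:\d{2}\.\d{3})"
)


def to_seconds(timestamp):
    """Convert a WebVTT timestamp (hh:mm:ss.ttt or mm:ss.ttt) to seconds."""
    seconds = 0.0
    for part in timestamp.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


def parse_cues(vtt_text):
    """Return (start, end, text) tuples for each cue in a WebVTT document."""
    cues = []
    for block in re.split(r"\n\s*\n", vtt_text.strip()):
        lines = block.splitlines()
        for index, line in enumerate(lines):
            timing = CUE_TIMING.search(line)
            if timing:
                text = " ".join(lines[index + 1:]).strip()
                cues.append(
                    (to_seconds(timing["start"]), to_seconds(timing["end"]), text)
                )
                break
    return cues


def search_cues(transcripts, term):
    """Search every transcript (label -> VTT text) and return matching cues."""
    needle = term.casefold()
    hits = []
    for label, vtt_text in transcripts.items():
        for start, end, text in parse_cues(vtt_text):
            if needle in text.casefold():
                hits.append(
                    {"transcript": label, "start": start, "end": end, "text": text}
                )
    return hits
```

This requires every transcript for the canvas to be loaded up front, which is the pre-loading cost mentioned earlier in the thread.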

For Ramp users without an IIIF content search service already in place, a JS-based solution would be faster/easier. Users who want to plug Ramp into existing IIIF infrastructure, and who already have IIIF content search implemented, would prefer that solution.

elynema commented 5 months ago

IIIF content search examples:

  1. Example IIIF Content Search in Digital Collections (digitalcollections.iu.edu): search for "Broderick", facet on Paged Resource, then go to the last result, "Notebook, January 23, 1955-May 9, 1955". Note that it doesn't appear to have the search term in any metadata, but it does automatically set the search in the UV to "Broderick" and it does find a hit. If you search multiple terms, the hit count seems to be the sum of hits on each individual term. The search passes through from Blacklight to UV on the item. The search service appears to be version 0; here's a URL for this specific search: https://digitalcollections.iu.edu/catalog/rr171z590/iiif_search?q=Broderick

  2. Another example IIIF search result: https://miiify.rocks/iiif/content/search?q=london and manifest: https://miiify.rocks/manifest/diamond_jubilee_of_the_metro. This is search v.2. The example search looks like it is searching across multiple manifests, based on the target of the annotations. The granularity of text returned in the annotation varies: sometimes a single term and sometimes a paragraph.

  3. Here’s an Archipelago object that has content search enabled in Mirador. See the search button when you toggle the sidebar menu open on the left. https://studio.esmero.io/do/c33deb5f-97b2-4529-b823-55e5ed04ccdc. Here’s a v2 search URL (looks like they’re using v1 on the site but also support v2): https://studio.esmero.io/iiifcontentsearch/v2/do/c33deb5f-97b2-4529-b823-55e5ed04ccdc/metadatadisplayexposed/iiifmanifest/mode/advanced/page/0?q=kitchen+family. It seems to be returning annotations that match either term or both terms. Their content is OCR, and it looks like the annotation responses are chunked into paragraph level and the target is an x,y location on the canvas where the word actually appears.

  4. Here is a BL example: https://bl.iro.bl.uk/uv/uv.html#?manifest=https://bl.iro.bl.uk/concern/reports/400db5b7-ed6d-4bfb-898d-1f0bcd8a9d6e/manifest&config=https://bl.iro.bl.uk/uv/uv-config.json. Very slow.

  5. Princeton example: https://dpul.princeton.edu/sae/catalog/12263a4c-27ab-421c-b35f-b93ec0271a62. Princeton manifest with search v.0 service: https://figgy.princeton.edu/concern/ephemera_folders/12263a4c-27ab-421c-b35f-b93ec0271a62/manifest?manifest=https://figgy.princeton.edu/concern/ephemera_folders/12263a4c-27ab-421c-b35f-b93ec0271a62/manifest

  6. Yale example: https://collections.library.yale.edu/catalog/16807673. Looks like Yale is using search v.1 in their manifest: https://collections.library.yale.edu/manifests/16807673?manifest=https://collections.library.yale.edu/manifests/16807673. Here's an example of search results: https://collections.library.yale.edu/catalog/16807673/iiif_search?q=law+library. Looks like they may be using OCR divided up into about 1 line of text for annotation granularity. They are linking to each page with results, but not highlighting the search term on the page, which makes sense as the results do not provide an x,y location to target.

elynema commented 5 months ago

IIIF content search pros

IIIF content search cons

Javascript search within pros

Javascript search within cons

Third Wave reported their search service could use local search or call out to a search API. Could we enable this?
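
To make the question concrete, here is a minimal sketch of a search entry point that calls a remote Content Search service when one is configured and falls back to local in-memory matching otherwise. This is not Third Wave's implementation; every name here is hypothetical, and the only assumption about the remote service is that it accepts a `q` parameter and returns an AnnotationPage with `items`.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def fetch_json(url):
    """Fetch a URL and decode the JSON body."""
    with urlopen(url) as response:
        return json.load(response)


def remote_search(service_url, term, fetch=fetch_json):
    """Query a Content Search service and return its annotation items."""
    page = fetch(f"{service_url}?{urlencode({'q': term})}")
    return page.get("items", [])


def local_search(annotations, term):
    """Case-insensitive substring match over already-loaded annotations."""
    needle = term.casefold()
    return [a for a in annotations if needle in a["body"]["value"].casefold()]


def search(term, annotations, service_url=None, fetch=fetch_json):
    """Use the remote service when one is configured, else search locally."""
    if service_url:
        return remote_search(service_url, term, fetch)
    return local_search(annotations, term)
```

Because both paths return a list of annotations, the display code would not need to know which one ran. The `fetch` argument exists so the remote path can be exercised without a network call.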

cjcolvar commented 5 months ago

Examples from discussion:

{
  "@context": "http://iiif.io/api/search/2/context.json",
  "id": "http://localhost:3000/media_objects/hd76s004z/manifest/canvas/ft848q60n/search?q=April",
  "type": "AnnotationPage",

  "items": [
    // This item highlights the transcript by returning the whole cue
    {
      "id": "http://localhost:3000/media_objects/hd76s004z/manifest/canvas/ft848q60n/lfdasjf-lkfdalk-12klf-389hflkdjs",
      "type": "Annotation",
      "motivation": "highlighting",
      "body": {
        "type": "TextualBody",
        "value": "CASSIDY CLOUSE:  All right, well my name is Cassidy Clouse. We’re here virtually speaking. It is <em>April</em> 17, 2017. We have with us today Abby Clapp. So, Abby, where are you from?",
        "format": "text/plain"
      },
      "target": "http://localhost:3000/master_files/ft848q60n/supplemental_files/2/transcripts#t=00:00:00,00:00:16"
    },
    // This item highlights the canvas like a marker in a playlist item
    {
      "id": "http://localhost:3000/media_objects/hd76s004z/manifest/canvas/ft848q60n/lfdasjf-lkfdalk-12klf-389hflkdjs",
      "type": "Annotation",
      "motivation": "highlighting",
      "body": {
        "type": "TextualBody",
        "value": "CASSIDY CLOUSE:  All right, well my name is Cassidy Clouse. We’re here virtually speaking. It is <em>April</em> 17, 2017. We have with us today Abby Clapp. So, Abby, where are you from?",
        "format": "text/plain"
      },
      "target": "http://localhost:3000/media_objects/hd76s004z/manifest/canvas/ft848q60n#t=00:00:00,00:00:16"
    },
    // This item highlights in a txt transcript by returning the whole paragraph
    {
      "id": "http://localhost:3000/media_objects/hd76s004z/manifest/canvas/ft848q60n/lfdasjf-lkfdalk-12klf-389hflkdjs",
      "type": "Annotation",
      "motivation": "highlighting",
      "body": {
        "type": "TextualBody",
        "value": "CASSIDY CLOUSE:  All right, well my name is Cassidy Clouse. We’re here virtually speaking. It is <em>April</em> 17, 2017. We have with us today Abby Clapp. So, Abby, where are you from?",
        "format": "text/plain"
      },
      "target": "http://localhost:3000/master_files/ft848q60n/supplemental_files/2/transcripts"
    }
    // Further 'April' annotations here ...
  ]
}

elynema commented 5 months ago

Here is an example of IIIF content search of vtt annotations for a media file in Archipelago. They have a similar question about targeting the VTT vs. the canvas, and have experimented with both. They are using Mirador for their front-end.

I set up a small Video object demo for you here: https://studio.esmero.io/do/99161a75-43d8-42ee-8f18-e8d1855640b6. It is my very own media, so there are no copyright issues. There are two VTTs, which can be enabled in the viewer or downloaded directly from the Digital Object page (see the download tab).

For this object we are using Mirador V4 alpha 2, so you can use the interface to search (Mirador will only hit V1), but the results won't interact with the media at all (my open question). The IIIF manifest V3 is dynamic, like all of ours, and can be seen at https://studio.esmero.io/do/99161a75-43d8-42ee-8f18-e8d1855640b6/metadata/iiifmanifest/Train%20Departure_manifest.jsonld so you can see the source of what is searchable.

The direct endpoints for the IIIF Content Search are below (try "train", "dark", etc.; the VTTs can be downloaded, so that should be OK):

V1: https://studio.esmero.io/iiifcontentsearch/v1/do/99161a75-43d8-42ee-8f18-e8d1855640b6[…]datadisplayexposed/iiifmanifest/mode/advanced/page/0?q=train

V2: https://studio.esmero.io/iiifcontentsearch/v2/do/99161a75-43d8-42ee-8f18-e8d1855640b6[…]datadisplayexposed/iiifmanifest/mode/advanced/page/0?q=train

Note: I disabled a few parts of the "standards/specs" in these content search responses to make them simpler to consume. You are not getting "number of results", etc. Also, we are not using the extra "annotations" that could supplement the results (with a before/after text snippet), but you can ask for it if you need it and I will enable that.

Note 2: the output of the API is targeting the VTTs themselves. If you want to see a target against the canvas, let me know and I will turn the switch so you can compare outputs (basically a different target and a different motivation on the response).

{
  "@context": "http://iiif.io/api/search/2/context.json",
  "id": "https://studio.esmero.io/iiifcontentsearch/v2/do/99161a75-43d8-42ee-8f18-e8d1855640b6/metadatadisplayexposed/iiifmanifest/mode/advanced/page/0",
  "type": "AnnotationPage",
  "items": [
    {
      "id": "https://studio.esmero.io/iiifcontentsearch/v2/do/99161a75-43d8-42ee-8f18-e8d1855640b6/metadatadisplayexposed/iiifmanifest/mode/advanced/page/0/annotation/anno-result/1",
      "type": "Annotation",
      "motivation": "supplementing",
      "body": {
        "type": "TextualBody",
        "value": " - [Sounds of train over tracks.]",
        "format": "text/plain"
      },
      "target": "https://studio.esmero.io/do/99161a75-43d8-42ee-8f18-e8d1855640b6/iiif/subtitles/p1/3eff2938-151c-4bc2-be05-53267c0ec31b#t=0,8"
    },
    {
      "id": "https://studio.esmero.io/iiifcontentsearch/v2/do/99161a75-43d8-42ee-8f18-e8d1855640b6/metadatadisplayexposed/iiifmanifest/mode/advanced/page/0/annotation/anno-result/2",
      "type": "Annotation",
      "motivation": "supplementing",
      "body": {
        "type": "TextualBody",
        "value": " - [Sounds of train passing and sounds of train over tracks.]",
        "format": "text/plain"
      },
      "target": "https://studio.esmero.io/do/99161a75-43d8-42ee-8f18-e8d1855640b6/iiif/subtitles/p1/3eff2938-151c-4bc2-be05-53267c0ec31b#t=8,13"
    }
  ]
}
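
Whichever resource the target names (the VTT or the canvas), a client that wants the results to interact with the media has to turn the `#t=` fragment into seek times. A minimal sketch, assuming targets are plain URI strings; it accepts both the plain-seconds form used above (`t=0,8`) and the `hh:mm:ss` form from the earlier localhost examples. The function names are hypothetical.

```python
from urllib.parse import urldefrag


def _seconds(value):
    """Convert "8", "8.5", "00:16" or "00:00:16" to seconds (None if empty)."""
    if not value:
        return None
    total = 0.0
    for part in value.split(":"):
        total = total * 60 + float(part)
    return total


def parse_time_fragment(target):
    """Split a target URI into (resource, start_seconds, end_seconds)."""
    resource, fragment = urldefrag(target)
    if not fragment.startswith("t="):
        return resource, None, None
    value = fragment[2:]
    # Media fragments may carry an optional "npt:" prefix.
    if value.startswith("npt:"):
        value = value[4:]
    start, _, end = value.partition(",")
    return resource, _seconds(start), _seconds(end)
```

The returned `resource` tells the client whether the hit points at a transcript file or at a canvas, which it could use to decide between scrolling the transcript and seeking the player.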
