IIIF / iiif-stories

Community repository for documenting stories and use cases related to uses of the International Image Interoperability Framework.

I would like to get access to the illustrations recognized by OCR #79

Open altomator opened 7 years ago

altomator commented 7 years ago

Description

OCR describes the text components of a page, but also the illustrations. --> I would like to get access to the illustrations that have been recognized by the OCR.

Example: http://gallica.bnf.fr/ark:/12148/bpt6k96128443/f26 --> http://gallica.bnf.fr/iiif/ark:/12148/bpt6k96128443/f26/744,707,819,3569/full/0/native.jpg
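The second URL above is simply an Image API region request built from the illustration's block coordinates. A minimal sketch (the helper name is illustrative; the endpoint and coordinates are from the Gallica example):

```python
def region_url(image_service, x, y, w, h,
               size="full", rotation=0, quality="native", fmt="jpg"):
    """Compose a IIIF Image API URL for a rectangular pixel region."""
    return f"{image_service}/{x},{y},{w},{h}/{size}/{rotation}/{quality}.{fmt}"

# The Gallica example: the illustration block at (744, 707), 819 x 3569 px.
print(region_url("http://gallica.bnf.fr/iiif/ark:/12148/bpt6k96128443/f26",
                 744, 707, 819, 3569))
# -> http://gallica.bnf.fr/iiif/ark:/12148/bpt6k96128443/f26/744,707,819,3569/full/0/native.jpg
```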


Additional Background

IIIF and ALTO on-going work

azaroth42 commented 7 years ago

Not sure what the use case is here. You can get access to the regions via the Image API?

tomcrane commented 7 years ago

...and to let consumers know what Image API requests to make to get the illustrations, publish an annotation list for the illustrations and attach it to the canvas as otherContent. It should be straightforward to transform the ALTO into an annotation list for the illustrations, just as many implementations already transform ALTO into text-transcription annotations. The end user can then decide what size they want the illustration at (assuming it's not a level 0 service).

The question is what motivation to use to indicate that the target of the annotation is an illustration. oa:identifying?
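The ALTO-to-annotation-list transformation is mechanical: each ALTO `<Illustration>` block carries `HPOS`/`VPOS`/`WIDTH`/`HEIGHT` attributes that map directly onto an `#xywh=` media fragment. A sketch under that assumption (coordinates and canvas URI are illustrative; the annotation shape follows the Presentation 2 example later in this thread):

```python
import xml.etree.ElementTree as ET

# Minimal ALTO v2 fragment; coordinates are illustrative.
ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace>
    <Illustration ID="IL1" HPOS="744" VPOS="707" WIDTH="819" HEIGHT="3569"/>
  </PrintSpace></Page></Layout>
</alto>"""

NS = {"alto": "http://www.loc.gov/standards/alto/ns-v2#"}

def illustrations_to_annotations(alto_xml, canvas_id):
    """Turn ALTO <Illustration> blocks into IIIF Presentation 2 annotations."""
    root = ET.fromstring(alto_xml)
    annos = []
    for ill in root.iterfind(".//alto:Illustration", NS):
        x, y = ill.get("HPOS"), ill.get("VPOS")
        w, h = ill.get("WIDTH"), ill.get("HEIGHT")
        annos.append({
            "@type": "oa:Annotation",
            # The motivation settled on in this thread:
            "motivation": "oa:classifying",
            "resource": {"@id": "dctypes:Image", "label": "Picture"},
            "on": f"{canvas_id}#xywh={x},{y},{w},{h}",
        })
    return annos

print(illustrations_to_annotations(ALTO, "http://example.org/iiif/book1/canvas/c26"))
```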

azaroth42 commented 7 years ago

That seems good to me, though as discussed offline, oa:classifying would be the right motivation. Identifying would say what it's an image of. (Whatever that thing is!)

tomcrane commented 7 years ago

as above:

    {
      "@id": "http://wellcomelibrary.org/iiif/b28047345/annos/contentAsText/a31i0",
      "@type": "oa:Annotation",
      "motivation": "oa:classifying",
      "resource": {
        "@id": "dctypes:Image",
        "label": "Picture"
      },
      "on": "http://wellcomelibrary.org/iiif/b28047345/canvas/c31#xywh=201,1768,2081,725"
    }

altomator commented 7 years ago

Thanks, that sounds good!

adamfarquhar commented 5 years ago

@altomator @tomcrane Do you have the full worked example for this? As you likely know, we've been providing a set of around 1m extracted images like this since 2013. It would be interesting to shift to IIIF for that work, as it would make the images easier to serve in this way.

Does anyone have experience doing this with a large set of images, though? That is, if I have a million annotations and someone sensibly retrieves the annotated image regions (perhaps resizing on the way), what is the associated computational/retrieval burden?

I note that the Gallica link still works, but the Wellcome ones no longer do (as of October 2018).

tomcrane commented 5 years ago

@adamfarquhar a couple of years ago I made this app to explore the OCR-identified images in Wellcome content:

http://tomcrane.github.io/wellcome-today/annodump.html?manifest=https://wellcomelibrary.org/iiif/b28047345/manifest&page=11

Across the corpus the number of images identified this way, by annotations on printed books, probably exceeds 10m. The "lines of text" checkbox toggles the textual annotations too.

The Wellcome one should work, but the client needs to relate the canvas target to the image API to make a request for the pixels.

https://github.com/tomcrane/wellcome-today/blob/gh-pages/script/annodump.js#L172 for the code that does this.
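The essence of that lookup (function and variable names here are hypothetical; the real wiring lives in annodump.js) is: split the `#xywh=` fragment off the annotation's `on` target, map the canvas back to its image service via the manifest, and compose a region request:

```python
from urllib.parse import urldefrag

def annotation_to_image_url(on_value, canvas_to_service, size="full"):
    """Resolve an annotation target like
    'http://.../canvas/c31#xywh=201,1768,2081,725' into an Image API request."""
    canvas_id, frag = urldefrag(on_value)
    xywh = frag.split("=", 1)[1]            # e.g. "201,1768,2081,725"
    service = canvas_to_service[canvas_id]  # built by walking the manifest
    return f"{service}/{xywh}/{size}/0/default.jpg"

# Hypothetical canvas -> image-service map a client would build from the manifest:
services = {"http://wellcomelibrary.org/iiif/b28047345/canvas/c31":
            "http://wellcomelibrary.org/iiif-img/b28047345-31"}
print(annotation_to_image_url(
    "http://wellcomelibrary.org/iiif/b28047345/canvas/c31#xywh=201,1768,2081,725",
    services))
```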

tomcrane commented 5 years ago

@adamfarquhar addendum - the question of computational load depends on implementation decisions. In the Wellcome case the burden of retrieving the annotations themselves is small, as the annotation lists are direct transformations of METS-ALTO files. But if someone then asked for the millions of image regions identified by those annotations, that would put a lot of load on the image servers, as the identified image regions are very unlikely to be cached responses. Here you would hope that consumers of the API behave sensibly and considerately.

altomator commented 5 years ago

We have not yet started to identify the illustrations in the IIIF manifests, but it's on our roadmap. I recently developed a PoC to showcase the value of doing this:

http://demo14-18.bnf.fr:8984/rest?run=findIllustrations-app.xq&filter=0&start=1&action=first&module=0.5&locale=fr&similarity=&corpus=1418-v2&keyword=&kwTarget=&kwMode=&typeP=P&typeR=R&typeM=M&title=&fromDate=&toDate=&iptc=00&persType=person&classif=&CBIR=*&operator=and&colName=00&size=31&density=26

GallicaPix leverages the OCR-identified images, Gallica's IIIF compliance, and a bunch of other BnF APIs. Performance is also a concern for us: a single GallicaPix request can ask for thousands of thumbnails. Unfortunately, there is no roadmap on this topic!