beehind / beehind.github.io

Beehind: pilot workflows to capture prominent bee specimen and their historic and ecological associates
https://beehind.org
Creative Commons Zero v1.0 Universal

method to link image (segments) to digital content #8

Open jhpoelen opened 1 year ago

jhpoelen commented 1 year ago

:warning: crazy idea :warning:

In https://beehind.org , we have an illustration in which parts of an image are connected to text boxes. The illustration points to parts of an image and associates each of them with something else. To generate this illustration automatically, a method is needed to:

  1. point to an area in an image
  2. relate this pointer to some (other) digital content
  3. record the provenance of this relation
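
A minimal sketch of what one such record could look like, assuming a rectangular pixel region and a URL-addressable target. The class names, field names and example URLs are all hypothetical; nothing here is an existing Beehind format.

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
import json


@dataclass(frozen=True)
class ImageRegion:
    """Step 1: point to a rectangular area in an image (pixel units)."""
    image_url: str
    x: int
    y: int
    width: int
    height: int


@dataclass(frozen=True)
class Provenance:
    """Step 3: who asserted the relation, when, and how."""
    asserted_by: str
    asserted_at: str  # ISO 8601 timestamp
    method: str       # e.g. "manual" or the name of a tool


@dataclass(frozen=True)
class RegionLink:
    """Step 2: relate the region to some other digital content."""
    region: ImageRegion
    target: str
    provenance: Provenance


def link_region(region, target, asserted_by, method):
    # Timestamp the assertion at creation time, in UTC.
    stamp = datetime.now(timezone.utc).isoformat()
    return RegionLink(region, target, Provenance(asserted_by, stamp, method))


# Hypothetical example values.
link = link_region(
    ImageRegion("https://example.org/bee.jpg", 120, 80, 200, 150),
    target="https://example.org/specimen-label.txt",
    asserted_by="example-annotator",
    method="manual",
)
record = json.dumps(asdict(link), indent=2)
print(record)
```

The JSON output is only one possible serialization; the same three parts (region, target, provenance) would need to be present in whatever format is chosen.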

At first glance, relating an image area to some (textual) content is similar to, or perhaps a general case of, OCR (optical character recognition). OCR relates an image area to a character, and is used to extract text from images.

The Internet Archive folks, prolific users of OCR, published an article, https://archive.org/developers/ocr.html , suggesting that three main OCR formats exist: hOCR (an HTML derivative) and two XML formats (ALTO and PAGE XML). hOCR and ALTO are supported by Tesseract, a commonly used OCR library.

Suggest building a prototype that takes a single part from the https://beehind.org illustration and encodes it in hOCR, ALTO or PAGE XML.
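
As a rough idea of what the hOCR variant might look like: hOCR stores a bounding box as `bbox x0 y0 x1 y1` in an element's `title` attribute. The sketch below writes one content area (`ocr_carea`) inside one page (`ocr_page`). The helper names, file name, coordinates and label text are made up for illustration.

```python
from html import escape


def hocr_area(area_id, x0, y0, x1, y1, text):
    """Return an hOCR content area linking a bounding box to some text."""
    if x1 <= x0 or y1 <= y0:
        raise ValueError("bbox must satisfy x0 < x1 and y0 < y1")
    return (
        f"<div class='ocr_carea' id='{escape(area_id)}' "
        f"title='bbox {x0} {y0} {x1} {y1}'>{escape(text)}</div>"
    )


def hocr_page(image_name, width, height, areas):
    """Wrap content areas in an hOCR page element that names the image."""
    # escape() also encodes the double quotes around the image name,
    # so the value is safe inside the single-quoted title attribute.
    title = escape(f'image "{image_name}"; bbox 0 0 {width} {height}')
    body = "\n  ".join(areas)
    return f"<div class='ocr_page' title='{title}'>\n  {body}\n</div>"


# Hypothetical part of the illustration: a box around a specimen label.
area = hocr_area("area_1", 120, 80, 320, 230, "specimen label")
page = hocr_page("bee.jpg", 800, 600, [area])
print(page)
```

Note that hOCR elements carry text; linking a region to non-textual content would need an extra convention (for instance a link inside the element), which is an open design question rather than something hOCR prescribes.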

jhpoelen commented 1 year ago

Alternatively, see IIIF example https://courses.edx.org/courses/course-v1:HarvardX+MCB64.1x+2T2016/d16e07a5cec442eeb7cd9dfcb695dce0/ via https://iiif.io/demos/
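
For context, the IIIF Image API addresses an image region directly in the URL, following the pattern `{base}/{identifier}/{region}/{size}/{rotation}/{quality}.{format}`. A small sketch, assuming an Image API 3.0 server (where `max` is the full-size keyword); the server URL and identifier below are hypothetical.

```python
from urllib.parse import quote


def iiif_region_url(base_url, identifier, x, y, w, h,
                    size="max", rotation=0, quality="default", fmt="jpg"):
    """Build an IIIF Image API URL for a pixel region x,y,w,h."""
    if w <= 0 or h <= 0:
        raise ValueError("region width and height must be positive")
    region = f"{x},{y},{w},{h}"
    # Identifiers must be percent-encoded, including any slashes.
    ident = quote(identifier, safe="")
    return f"{base_url.rstrip('/')}/{ident}/{region}/{size}/{rotation}/{quality}.{fmt}"


url = iiif_region_url("https://example.org/iiif", "bee/001", 120, 80, 200, 150)
print(url)
```

Such a URL covers step 1 (pointing to an area); relating it to other content and recording provenance would still need an annotation layer on top.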

jhpoelen commented 1 year ago

Note that the Internet Archive appears to have chosen a Tesseract -> hOCR based workflow:

The Internet Archive settled on using hOCR. At the time of writing, Tesseract does support outputting ALTO XML, but PAGE XML was not yet supported. hOCR was deemed sufficiently simple and flexible, with the added advantage that it is XHTML, which allows for viewing the documents in a browser. Various hOCR tools and libraries exist, as do hOCR viewers, such as hocrviewer-miradoc and hocrjs.
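
Because hOCR is XHTML, reading the region-to-text links back out needs nothing beyond an XML parser. A sketch using only the Python standard library; the embedded document is a hand-written stand-in for Tesseract output, with invented words and coordinates.

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in for an hOCR file (not real Tesseract output).
HOCR = """<html xmlns="http://www.w3.org/1999/xhtml">
 <body>
  <div class="ocr_page" title='image "bee.jpg"; bbox 0 0 800 600'>
   <span class="ocrx_word" id="word_1" title="bbox 120 80 320 130; x_wconf 93">Apis</span>
   <span class="ocrx_word" id="word_2" title="bbox 330 80 520 130; x_wconf 90">mellifera</span>
  </div>
 </body>
</html>"""


def parse_bbox(title):
    """Extract (x0, y0, x1, y1) from an hOCR title attribute, or None."""
    for prop in title.split(";"):
        parts = prop.split()
        if parts and parts[0] == "bbox":
            return tuple(int(v) for v in parts[1:5])
    return None


def words_with_boxes(hocr_text):
    """Return (text, bbox) pairs for every ocrx_word element."""
    root = ET.fromstring(hocr_text)
    pairs = []
    for el in root.iter():
        # Match on the class attribute so the XHTML namespace does not matter.
        if el.get("class") == "ocrx_word":
            pairs.append(((el.text or "").strip(), parse_bbox(el.get("title", ""))))
    return pairs


words = words_with_boxes(HOCR)
for text, bbox in words:
    print(text, bbox)
```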

jhpoelen commented 1 year ago

@Daniel-Mietchen suggested looking into https://en.wikipedia.org/wiki/Hierarchical_Data_Format as well as layer/annotation features in map technologies (e.g., OpenStreetMap).

jhpoelen commented 1 year ago

ImageJ has a way to take measurements by drawing two lines (or some other shape): one to capture the scale bar shown in the picture, the other to capture the measurement taken. Manual work is needed to read the scale bar and translate the pixel distances to actual distances.
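
The arithmetic behind this is simple: the scale-bar line gives a units-per-pixel factor, which is then applied to the measurement line. A sketch with made-up pixel coordinates and a made-up scale-bar reading of 5 mm; this is not ImageJ's own code or file format.

```python
import math


def pixel_length(line):
    """Euclidean length in pixels of a line given as ((x0, y0), (x1, y1))."""
    (x0, y0), (x1, y1) = line
    return math.hypot(x1 - x0, y1 - y0)


def calibrated_length(scale_line, scale_value, measure_line):
    """Convert a measured line to real-world units using a scale-bar line.

    scale_value is the length the scale bar represents, read off manually.
    """
    scale_px = pixel_length(scale_line)
    if scale_px == 0:
        raise ValueError("scale line has zero length")
    return pixel_length(measure_line) * scale_value / scale_px


# Hypothetical values: a 200 px scale bar labelled "5 mm",
# and a 500 px line drawn along the feature of interest.
scale_line = ((100, 500), (300, 500))
measure_line = ((150, 200), (450, 600))
length_mm = calibrated_length(scale_line, 5.0, measure_line)
print(f"{length_mm:.2f} mm")
```

Both lines, the scale value and its unit would all need to be stored to make the measurement reproducible.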

Same for Notes from Nature (Zooniverse).

Suggest finding out how ImageJ and Notes from Nature capture this information digitally and which file formats they use.

[Three screenshots attached: 2023-04-21 16-53-07, 16-53-01, 16-52-43]

Daniel-Mietchen commented 1 year ago

Alternatively, see IIIF example https://courses.edx.org/courses/course-v1:HarvardX+MCB64.1x+2T2016/d16e07a5cec442eeb7cd9dfcb695dce0/ via https://iiif.io/demos/

I dug around a bit for IIIF-related documentation on Wikimedia projects, and what I found was a mostly stale collection of outdated pointers to defunct demos and bouts of enthusiasm modulated by lack of support, with https://commons.wikimedia.org/wiki/Commons:International_Image_Interoperability_Framework being the most useful resource.

One of the things it points to is https://github.com/IIIF/awesome-iiif, which has a section https://github.com/IIIF/awesome-iiif#image-servers with multiple IIIF server tools.

jhpoelen commented 1 year ago

@Daniel-Mietchen thanks for having a look at IIIF - it does appear that the framework is getting some traction in the natural history collections community... hmm, I wonder what is going on.