Islandora / documentation


Highlight OCR #1580

Open wgilling opened 4 years ago

wgilling commented 4 years ago

Tesseract can be used to generate HOCR again, but displaying it raises several challenges.

This likely does not apply to objects that would be viewed using the PDF.js viewer, because that viewer handles search highlighting itself.

  1. Editing OCR would be much trickier, since the HOCR file might need to be updated as well.
  2. Displaying a rectangle per search term was a much easier concept in HTML, via the CSS classes in the HOCR file. The challenge is how to turn these rectangles into overlays in the OpenSeadragon viewer.
  3. Create a corresponding action that can be triggered to generate HOCR for objects that have already been ingested (similar to the "Index node in Fedora" action).

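
For point 2, here is a rough sketch of how HOCR word boxes could be turned into OpenSeadragon overlay rectangles. It assumes Tesseract-style `ocrx_word` spans with a `bbox x0 y0 x1 y1` entry in the `title` attribute, and a single image at the viewer's default position, where OpenSeadragon viewport coordinates map the image width to 1.0. The names `HocrWordParser` and `overlay_rects` are made up for illustration and are not part of Islandora or OpenSeadragon.

```python
import re
from html.parser import HTMLParser

BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrWordParser(HTMLParser):
    """Collect (text, bbox) pairs for every ocrx_word span in an HOCR page."""

    def __init__(self):
        super().__init__()
        self.words = []    # (text, (x0, y0, x1, y1)) in image pixels
        self._bbox = None  # bbox of the word span currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            match = BBOX_RE.search(attrs.get("title") or "")
            if match:
                self._bbox = tuple(int(n) for n in match.groups())
                self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._bbox is not None:
            self.words.append(("".join(self._text).strip(), self._bbox))
            self._bbox = None


def overlay_rects(hocr, term, image_width):
    """Return viewport-coordinate rectangles for words containing `term`.

    Every value is divided by the image width, including y and height,
    because the viewport scale is defined by the image width (assumption:
    one image, default placement).
    """
    parser = HocrWordParser()
    parser.feed(hocr)
    rects = []
    for text, (x0, y0, x1, y1) in parser.words:
        if term.lower() in text.lower():
            rects.append({
                "x": x0 / image_width,
                "y": y0 / image_width,
                "width": (x1 - x0) / image_width,
                "height": (y1 - y0) / image_width,
            })
    return rects


# Hypothetical two-word page, 1000 px wide.
SAMPLE = """<div class='ocr_page' title='bbox 0 0 1000 1500'>
<span class='ocr_line' title='bbox 100 200 600 250'>
<span class='ocrx_word' title='bbox 100 200 300 250; x_wconf 95'>Island</span>
<span class='ocrx_word' title='bbox 320 200 600 250; x_wconf 91'>Heritage</span>
</span></div>"""

rects = overlay_rects(SAMPLE, "heritage", image_width=1000)
```

Each resulting dictionary could then be serialized to JSON and handed to the viewer's overlay mechanism on the client side. How that hand-off would work in Islandora is left open here.
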
jasonhildebrand commented 2 years ago

I understand that Islandora 8 does not support highlighting search results when using OpenSeadragon. I'm contributing our use case in the hope that this feature will be prioritized soon.

In our case, we are digitizing PDF files using ABBYY FineReader, which supports OCR of German Gothic script (Fraktur). It produces PDF files containing the scanned image, as well as the OCR'd text in a separate layer. You can open one of these files in a PDF reader and search it, and it will correctly highlight the location of the matching text.

When we import into Islandora 8, the PDF is converted to a service image, and this is displayed using OpenSeadragon. To support our use case, I suppose that Islandora would need to determine the location of matched text using the uploaded PDF (since this information is not contained in the JPG service file), then produce overlay information for OpenSeadragon.
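
To illustrate the last step, here is a minimal sketch of mapping a text box taken from the PDF onto the service image. It assumes the box is given in default PDF user space (origin at the bottom-left, units of 1/72 inch), that the page is not rotated or cropped, and that the service image shows the whole page. The function name `pdf_bbox_to_pixels` and the sample numbers are hypothetical. Extracting the box from the PDF text layer is not shown.

```python
def pdf_bbox_to_pixels(bbox, page_size, image_size):
    """Convert a PDF text box to a top-left-origin pixel rectangle.

    bbox:       (x0, y0, x1, y1) in PDF points, origin at bottom-left
    page_size:  (width, height) of the PDF page in points
    image_size: (width, height) of the rendered service image in pixels
    Returns (left, top, width, height) in image pixels.
    """
    x0, y0, x1, y1 = bbox
    page_w, page_h = page_size
    image_w, image_h = image_size

    # Scale factors from points to pixels along each axis.
    sx = image_w / page_w
    sy = image_h / page_h

    # Flip the y axis: the top edge of the box (y1) is measured from the
    # bottom of the page in PDF space, but from the top in image space.
    left = x0 * sx
    top = (page_h - y1) * sy
    return (left, top, (x1 - x0) * sx, (y1 - y0) * sy)


# Hypothetical example: a US Letter page (612 x 792 pt) rendered at
# 1224 x 1584 px, with a matched word near the top of the page.
rect = pdf_bbox_to_pixels((72, 700, 144, 720), (612, 792), (1224, 1584))
```

The pixel rectangle would still need to be normalized to the viewer's coordinate system before it can be drawn as an overlay.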

seth-shaw-asu commented 2 years ago

The Islandora-Lite folks @ the University of Toronto Scarborough (tagging @kstapelfeldt and @Natkeeran) did a demonstration of their setup during IslandoraCon 2022 which included improvements in viewer-supported OCR. I believe they were using annotations served via IIIF, but I don't recall details. I look forward to watching their presentation again when it gets posted.
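
For readers unfamiliar with IIIF-style annotations, here is a small sketch of what one word-level annotation might look like, built as a W3C Web Annotation whose target is a canvas URI with an `#xywh=` fragment. This is not taken from the UTSC setup; the function name, the URIs, and the choice of the `highlighting` motivation are assumptions for illustration only.

```python
import json


def word_annotation(canvas_id, annotation_id, text, bbox):
    """Build a Web Annotation targeting a rectangular region of a canvas.

    bbox is (x0, y0, x1, y1) in canvas pixel coordinates; the xywh
    fragment expects x, y, width, height.
    """
    x0, y0, x1, y1 = bbox
    return {
        "@context": "http://www.w3.org/ns/anno.jsonld",
        "id": annotation_id,
        "type": "Annotation",
        "motivation": "highlighting",  # assumption: one of the W3C motivations
        "body": {"type": "TextualBody", "value": text, "format": "text/plain"},
        "target": f"{canvas_id}#xywh={x0},{y0},{x1 - x0},{y1 - y0}",
    }


# Hypothetical identifiers; a real deployment would use its own URIs.
annotation = word_annotation(
    "https://example.org/canvas/1",
    "https://example.org/anno/1",
    "Heritage",
    (320, 200, 600, 250),
)
print(json.dumps(annotation, indent=2))
```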

Natkeeran commented 2 years ago

To clarify, it is an early prototype. Please see additional info here: https://github.com/digitalutsc/islandora_lite_docs/wiki/Mirador-Search-and-Annotations-(Prototype)

@alxp (UPEI) is also looking into this feature.

wgilling commented 2 years ago

I'd love to first explore the Mirador Search and Annotations (Prototype) and work with @alxp on this solution, since it seems that anybody who is already using Mirador would be able to use it.

Jordan Dukart referenced the mirador-textoverlay code here https://github.com/dbmdz/mirador-textoverlay and said that this was what UTSC and CMU were using, but Mirador likely does not take an HOCR file per page but rather an intermediate format.

Also, Don Richards mentioned this https://dbmdz.github.io/solr-ocrhighlighting/0.8.1/ while he was researching the topic.

jasonhildebrand commented 2 years ago

FYI, we have implemented a solution to our use case which I noted earlier. Here is our approach at a high level:

This approach was driven largely by the format of our source PDFs (and the need to complete our project on budget). I don't know whether it is of interest to the Islandora community, but I thought I would post it here in case anyone is interested.