Open wgilling opened 4 years ago
I understand that Islandora 8 does not support highlighting search results when using OpenSeadragon. I'm contributing our use case in the hope that this feature will be prioritized soon.
In our case, we are digitizing to PDF using ABBYY FineReader, which supports OCR of German Gothic script (Fraktur). It produces PDF files containing the scanned image, with the OCR'd text in a separate layer. You can open one of these files in a PDF reader and search it, and it will correctly highlight the location of the matching text.
When we import into Islandora 8, the PDF is converted to a service image, which is displayed using OpenSeadragon. To support our use case, I suppose that Islandora would need to determine the location of matched text using the uploaded PDF (since this information is not contained in the JPG service file), then produce overlay information for OpenSeadragon.
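To make the "produce overlay information" step concrete, here is a minimal sketch rather than anything Islandora does today. It assumes the word positions have already been extracted from the PDF text layer (by whatever tool) as boxes with a top-left origin, in the same units as the page width. `WordBox` and `overlays_for_term` are hypothetical names. The output uses OpenSeadragon's default viewport convention, where the image width is 1.0 and all coordinates are divided by that width.

```python
from dataclasses import dataclass


@dataclass
class WordBox:
    """A word and its bounding box (hypothetical structure).

    Assumes a top-left origin and the same units as the page width.
    """
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def overlays_for_term(words, term, page_width):
    """Return overlay dicts for every word that contains `term`.

    Coordinates are divided by the page width, which matches
    OpenSeadragon's default viewport coordinates (image width == 1.0).
    """
    needle = term.casefold()
    overlays = []
    for w in words:
        if needle in w.text.casefold():
            overlays.append({
                "x": w.x0 / page_width,
                "y": w.y0 / page_width,
                "width": (w.x1 - w.x0) / page_width,
                "height": (w.y1 - w.y0) / page_width,
                "className": "search-highlight",  # assumed CSS class name
            })
    return overlays


if __name__ == "__main__":
    page_words = [
        WordBox("Zeitung", 100, 200, 300, 250),
        WordBox("Berlin", 400, 200, 500, 250),
    ]
    print(overlays_for_term(page_words, "zeit", page_width=1000))
```

Each resulting dict could be passed to the viewer as an overlay. A real implementation would also need to handle multi-word phrases and PDFs whose coordinates use a bottom-left origin.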
The Islandora-Lite folks at the University of Toronto Scarborough (tagging @kstapelfeldt and @Natkeeran) gave a demonstration of their setup during IslandoraCon 2022, which included improvements in viewer-supported OCR. I believe they were using annotations served via IIIF, but I don't recall the details. I look forward to watching their presentation again when it gets posted.
To clarify, it is an early prototype. Please see additional info here: https://github.com/digitalutsc/islandora_lite_docs/wiki/Mirador-Search-and-Annotations-(Prototype)
@alxp (UPEI) is also looking into this feature.
I'd love to first explore the Mirador Search and Annotations (Prototype) and work with @alxp on this solution, since it seems that anybody who is already using Mirador would be able to use it.
Jordan Dukart referenced the mirador-textoverlay code here https://github.com/dbmdz/mirador-textoverlay and said that this was what UTSC and CMU were using, but Mirador likely does not take an hOCR file per page but rather an intermediate format.
Also, Don Richards mentioned this https://dbmdz.github.io/solr-ocrhighlighting/0.8.1/ while he was researching the topic.
FYI, we have implemented a solution to the use case I described earlier. Here is our approach at a high level:
This approach was driven largely by the format of our source PDFs (and the need to complete our project on budget). I don't know whether it is of interest to the Islandora community, but I thought I would post it here in case it is.
Tesseract can be used to generate hOCR again, but displaying it poses many challenges.
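As a rough illustration of the first step in displaying hOCR, the sketch below pulls word text and pixel bounding boxes out of an hOCR page using only the Python standard library. It assumes the usual hOCR layout in which each word is a `span` with class `ocrx_word` and a `title` attribute containing `bbox x0 y0 x1 y1`. `HocrWordParser` and `parse_hocr_words` are hypothetical names, and the sketch does not handle character-level spans nested inside a word.

```python
import re
from html.parser import HTMLParser

# Matches the "bbox x0 y0 x1 y1" property inside an hOCR title attribute.
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrWordParser(HTMLParser):
    """Collects (text, (x0, y0, x1, y1)) tuples for ocrx_word spans."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None   # bbox of the word currently being read, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            match = BBOX_RE.search(attr_map.get("title") or "")
            if match:
                self._bbox = tuple(int(g) for g in match.groups())
                self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        # Assumes no span is nested inside a word span.
        if tag == "span" and self._bbox is not None:
            self.words.append(("".join(self._text).strip(), self._bbox))
            self._bbox = None


def parse_hocr_words(hocr):
    """Return a list of (word, bbox) tuples from an hOCR string."""
    parser = HocrWordParser()
    parser.feed(hocr)
    parser.close()
    return parser.words


if __name__ == "__main__":
    sample = (
        "<span class='ocr_line' title='bbox 10 10 500 40'>"
        "<span class='ocrx_word' title='bbox 36 92 96 116; x_wconf 95'>Zeitung</span> "
        "<span class='ocrx_word' title='bbox 110 92 180 116; x_wconf 90'>Berlin</span>"
        "</span>"
    )
    for word, bbox in parse_hocr_words(sample):
        print(word, bbox)
```

The pixel boxes still have to be scaled to whatever coordinate system the viewer uses, which is where much of the display difficulty lies.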
This likely does not apply to objects that would be viewed using the PDF.js viewer, because that viewer handles search highlighting itself.