Store extracted publication text as markup indicating page breaks, instead of one blob of text

caseyg commented 4 years ago

There are a variety of modules which do something very similar: extract text from PDF documents ingested by Omeka.

A: https://github.com/omeka-s-modules/ExtractText
B: https://github.com/Daniel-KM/Omeka-S-module-PdfText
C: https://github.com/symac/Omeka-S-module-ExtractOcr

They all appear to rely on the same poppler-utils tooling (e.g. pdftotext) but have some differences.

We are currently using module A: ExtractText. Its approach is to create an extracttext metadata field and drop the entire PDF content in there as one gigantic blob of text.

For future search and design reasons, it would be better to have the extracted text marked up with page breaks. A future search interface could highlight the page numbers on which string X appears, and an anchor element could be used to deep-link to the text of a specific page.
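As a rough illustration of the search idea, here is a minimal Python sketch. It assumes the extracted text still contains the form feed characters that pdftotext writes at page breaks by default; the function names `split_pages` and `pages_containing` are hypothetical and not part of any of the modules above.

```python
FORM_FEED = "\f"  # pdftotext separates pages with this character by default


def split_pages(raw_text: str) -> list[str]:
    """Split extracted text into one string per page."""
    pages = raw_text.split(FORM_FEED)
    # A form feed after the last page leaves an empty trailing entry; drop it.
    if pages and pages[-1].strip() == "":
        pages.pop()
    return pages


def pages_containing(raw_text: str, needle: str) -> list[int]:
    """Return the 1-based page numbers whose text contains `needle`."""
    needle = needle.casefold()
    return [
        number
        for number, page in enumerate(split_pages(raw_text), start=1)
        if needle in page.casefold()
    ]
```

For example, `pages_containing(text, "string X")` would give the page numbers a search interface could highlight. This only works if page boundaries survive in the stored text, which is the point of this issue.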

For these reasons, among others, it would be preferable to have extracted text stored as XML or HTML with tags indicating page breaks.
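To make the proposal concrete, here is a minimal Python sketch of one possible markup. It assumes the raw text comes from pdftotext with its default form feed page separators; the function name `pages_to_html` and the `page-N` id scheme are illustrative assumptions, not what any existing module produces.

```python
from html import escape


def pages_to_html(raw_text: str) -> str:
    """Wrap each page of extracted text in a <section> with a page anchor."""
    pages = raw_text.split("\f")  # form feed = page break in pdftotext output
    # Drop the empty entry left by a form feed after the final page.
    if pages and not pages[-1].strip():
        pages.pop()

    sections = []
    for number, page in enumerate(pages, start=1):
        sections.append(
            f'<section id="page-{number}" data-page="{number}">\n'
            f"<pre>{escape(page)}</pre>\n"  # escape so PDF text cannot inject markup
            "</section>"
        )
    return "\n".join(sections)
```

With output like this, a URL ending in `#page-3` would deep-link to the text of page 3, and the `data-page` attribute gives a search interface something to report.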

(I believe this is the approach taken by module C, ExtractOcr, which "extracts OCR text in XML from PDF files", but I haven't tried it yet.)

This thread from the Omeka Forum also indicates that the HTML media type is the "blessed" option for text-type content, as markup is not allowed in Omeka S metadata fields:

The HTML media type is basically our “blessed” option for text-type content. In Omeka Classic people describing text often resorted to putting large blocks of text in metadata fields, when that text was really the “data” the metadata was describing. Since Omeka S doesn’t allow HTML in metadata fields, we added the HTML media type to have an option for direct entry of rich text to allow for that kind of content in the system.

However, it still isn’t really plugged into search. “Sitewide” search that works across content types is still something we’re investigating how best to accomplish. It’s also something some outside developers have looked at with modules using things like external search engines such as Solr. The MySQL fulltext search support has been our method in the past with Omeka Classic, but it has many well-known shortcomings that often frustrate users.

Recommended next step: find out whether module C (ExtractOcr) works well for our purposes, and whether it can reprocess already-uploaded media.

elazar commented 4 years ago

@caseyg Still need more detail on this. Please add what you know to the issue description, then re-assign to me. Thanks.

caseyg commented 4 years ago

@elazar updated the issue description. I'm probably going to be offline for the rest of today, but I think the best next step is to try out a different module before we reinvent the wheel. I will leave this unassigned for now, so feel free to poke around if you're interested! I can also experiment with the modules at some point to get more info, and I'll let you know if I hit a sticking point.

disorientations / disorientations.org

Store extracted publication text as markup indicating page breaks, instead of one blob of text #25